Details
Document type: Journal article
Chinese title: 基于改进TF-IDF和ABLCNN的中文文本分类模型
English title: Chinese Text Classification Model Based on Improved TF-IDF and ABLCNN
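Authors: Jing Li (景丽)[1]; He Tingting (何婷婷)[1]
First author: Jing Li (景丽)
Affiliation: [1] School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou 450000
First affiliation: School of Computer and Information Engineering, Henan University of Economics and Law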
Year: 2021
Volume: 48
Issue: S02
Pages: 170-175
Journal (Chinese): 计算机科学
Journal (English): Computer Science
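Indexed in: CSTPCD; Peking University Core Journals (北大核心): [北大核心2020]; CSCD: [CSCD_E2021_2022]
Funding: National Natural Science Foundation of China (61806073).
Language: Chinese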
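Chinese keywords: 文本分类 (text classification); TF-IDF; 卷积神经网络 (convolutional neural network); 注意力机制 (attention mechanism); 长短期记忆网络 (long short-term memory network)
English keywords: Text classification; Term frequency-inverse document frequency; Convolutional neural network; Attention; Long short-term memory network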
Abstract: Text classification is an important task in natural language processing and is widely used in information retrieval, sentiment analysis, and other fields. To address the incomplete text feature extraction and weak semantic representation of traditional text classification models, this paper proposes a text classification model that combines an improved TF-IDF algorithm with a long short-term memory convolutional network with an attention mechanism (Attention base on Bi-LSTM and CNN, ABLCNN). First, the TF-IDF algorithm is improved using the distribution of feature items within and between classes together with their position information, which highlights the importance of feature items; the text is then represented by combining the improved TF-IDF weights with word vectors trained with the Word2vec tool. Next, ABLCNN extracts the text features. ABLCNN combines the advantages of the attention mechanism, the long short-term memory network, and the convolutional neural network, so it extracts the contextual semantic features of the text with selective emphasis while also capturing local semantic features. Finally, the feature vector is classified by a softmax function. The model is evaluated on the THUCNews dataset and the online_shopping_10_cats dataset. The results show accuracies of 97.38% and 91.33% on the two datasets, respectively, which are higher than those of the other text classification models.
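The text-representation step in the abstract can be illustrated with a minimal sketch. This record does not give the paper's improved TF-IDF formula, so the code below uses plain TF-IDF as a stand-in and assumes that "combining" means scaling each token's word vector by its weight. The function names, the toy documents, and the 2-dimensional vectors are all hypothetical; in practice the vectors would come from a trained Word2vec model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Standard TF-IDF only: tf = count / doc length, idf = log(N / df).
    The paper's class-distribution and position terms are NOT modeled here.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_sequence(doc, weights, vectors):
    """Scale each token's word vector by that token's TF-IDF weight (assumed scheme)."""
    return [[weights[tok] * x for x in vectors[tok]] for tok in doc]


# Toy, already-segmented documents and made-up 2-d "word vectors".
docs = [["股票", "上涨", "市场"], ["球队", "比赛", "市场"]]
toy_vectors = {
    "股票": [1.0, 0.0],
    "上涨": [0.5, 0.5],
    "市场": [0.2, 0.8],
    "球队": [0.0, 1.0],
    "比赛": [0.3, 0.7],
}

weights = tfidf_weights(docs)
sequence = weighted_sequence(docs[0], weights[0], toy_vectors)
```

The resulting `sequence` is a list of weighted vectors, one per token, which is the kind of input a Bi-LSTM or CNN layer would consume. A term that appears in every document (here "市场") gets weight 0 under plain TF-IDF, which is one reason class-aware variants are of interest.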