Computer Science and Application

Vol.4 No.9 (September 2014)

An Approach for Algorithm of Tobacco Enterprise Archives Text Automatic Classification Based on KNN

 

Authors:

黄世反, 沈勇, 康洪炜, 郑见琳, 郎波, 王冬, 贾丛丛: School of Software, Yunnan University, Kunming

王道红: Science and Technology Settlement Center, Yunnan Rural Credit Cooperatives, Kunming

 

Keywords:

TFIDF; KNN; Archives of Tobacco; Automatic Text Categorization; Storage Life

 

Abstract:

Based on an analysis of historical archive text data from a cigarette factory in Yunnan Province, and taking practical conditions into account, this paper presents a detailed design of algorithms for subject-term extraction and automatic classification of archive texts. The TFIDF algorithm is introduced into subject-term extraction, which solves the problem that subject terms cannot be obtained automatically when an archive text lacks the title, document number, and responsible-party fields. The K-nearest-neighbor (KNN) algorithm is introduced into automatic classification, which solves the problem that archive texts cannot be classified automatically by title and document number. Classification of archive texts by retention period (storage life) is also considered. Experimental results show that the algorithm markedly improves the efficiency of archive text classification in tobacco enterprises.
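The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, since the page gives no algorithmic detail. It assumes documents are already tokenized (Chinese word segmentation is out of scope here), uses a common smoothed TF-IDF weighting and cosine similarity, and treats retention periods as class labels. The function names, the sample documents, and the labels `"permanent"` and `"10-year"` are all hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency for every term in a tokenized corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # "+ 1.0" keeps terms that occur in every document from getting weight 0.
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen terms are skipped."""
    if not doc:
        return {}
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}


def top_terms(doc, idf, k=3):
    """The k highest-weighted terms, used here as candidate subject terms."""
    weights = tfidf(doc, idf)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(doc, train_docs, train_labels, idf, k=3):
    """Majority vote among the k training documents most similar to doc."""
    query = tfidf(doc, idf)
    scored = sorted(
        ((cosine(query, tfidf(d, idf)), label)
         for d, label in zip(train_docs, train_labels)),
        key=lambda pair: -pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, pre-tokenized training documents labeled by retention period.
train_docs = [
    ["annual", "financial", "audit", "report"],
    ["financial", "budget", "report"],
    ["equipment", "maintenance", "notice"],
    ["meeting", "notice", "equipment", "inspection"],
]
train_labels = ["permanent", "permanent", "10-year", "10-year"]

idf = idf_table(train_docs)
subject_terms = top_terms(train_docs[0], idf, k=2)
predicted = knn_classify(["financial", "audit", "summary"],
                         train_docs, train_labels, idf, k=3)
```

In this sketch the same TF-IDF weights serve both purposes: the top-weighted terms stand in for subject terms when title or document-number fields are missing, and the full weight vectors feed the nearest-neighbor vote. A real system would likely need a segmentation step and a tuned value of `k`.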

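Cite this article:

黄世反, 沈勇, 康洪炜, 王道红, 郑见琳, 郎波, 王冬, 贾丛丛 (2014) An Approach for Algorithm of Tobacco Enterprise Archives Text Automatic Classification Based on KNN. Computer Science and Application, 4, 204-216. doi: 10.12677/CSA.2014.49029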

 
