A Simple Implementation of an Efficient Naive Bayes Model for Web News Text Classification
Authors: Wu Zhihui, Liu Hongwei, Chen Li: School of Management, Guangdong University of Technology, Guangzhou
Keywords: Text Classification; Feature Selection; Naive Bayes; TF-IDF Standard
Abstract: When Naive Bayes is used as a text classification algorithm, choosing an effective feature selection method is especially important, because the algorithm assumes that features occur independently of one another and are equally important. In this paper, the TF-IDF standard of the jieba Chinese segmentation module is used to select features from the training news texts, and a Naive Bayes text classification model with high performance is implemented. Before the classification model is tested, the TF-IDF standard must likewise be used to select keywords from the test news texts. The experimental results show that this method classifies news texts efficiently.
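The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes each news text has already been reduced to a keyword list (in practice this could be done with jieba's `jieba.analyse.extract_tags`, which ranks words by TF-IDF), and it uses a multinomial Naive Bayes with add-one smoothing as one plausible reading of the model. The function names `train` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Fit a multinomial Naive Bayes model.

    docs is a list of (keywords, label) pairs, where keywords is the list of
    TF-IDF-selected words for one news text (assumed to be precomputed).
    """
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class keyword frequencies
    vocab = set()                       # all keywords seen in training
    for keywords, label in docs:
        class_docs[label] += 1
        word_counts[label].update(keywords)
        vocab.update(keywords)
    return class_docs, word_counts, vocab


def classify(model, keywords):
    """Return the class with the highest log posterior for the keywords."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior: share of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in keywords:
            if word not in vocab:
                continue  # ignore keywords never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set; the keyword lists stand in for TF-IDF output.
train_docs = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
    (["球队", "比赛", "进球"], "sports"),
    (["比赛", "冠军", "球员"], "sports"),
]
model = train(train_docs)
print(classify(model, ["市场", "利率"]))
print(classify(model, ["比赛", "球员"]))
```

Because the same keyword selection step is applied to training and test texts, the classifier only ever sees a short list of high-weight words per document, which keeps both the vocabulary and the per-document scoring loop small.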
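Cite this article: Wu, Z., Liu, H. and Chen, L. (2014) A Simple Implementation of an Efficient Naive Bayes Model for Web News Text Classification. Statistics and Application (统计学与应用), 3, 30-35. doi: 10.12677/SA.2014.31005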