Research of Chinese News Classification Based on Titles

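Authors: Haitao Wang (王海涛)*, Pang Yue (岳磅): College of Computer Science and Software Engineering, Shenzhen University, Shenzhen; Yanqiong Zhao (赵艳琼): Network Department, Anhui Mobile, Hefei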

Keywords: Text Classification; Title Classification; News Classification; Semantic Similarity

Abstract: Retrieving Internet information quickly, accurately, and comprehensively is an important problem in the Internet era. Online news is faster, richer in content, and more flexible in form than traditional print news, and it is gradually replacing traditional news media as the main way many people obtain news. However, given the large volume of rapidly updated news, traditional manual classification cannot meet users' needs. Because the main content of news is generally presented as text, applying automatic text classification to online news is an effective alternative to manual classification, and many IT companies are developing automatic news classification systems. Online news takes many forms, however, and many items consist entirely of pictures or video and contain no body text, so classic content-based text classification cannot handle them. This paper proposes classifying online news by title. Compared with content-based classification, this approach is faster and more adaptable, since it can also classify news without body text (such as picture news and headline-only news). We create a title-based text classification model, collect a news corpus from the Web to verify how the model works, and compare it with content-based text classification to assess its strengths and weaknesses. We construct a two-step title-based classification system, and the proposed category-unique feature achieves high classification accuracy on classifiable samples.
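
The abstract does not spell out how the two-step system or the category-unique feature is computed, so the following Python sketch is only one plausible reading, under stated assumptions. It treats a word as category-unique when it occurs in the training titles of exactly one category (step one), and falls back to a simple relative-frequency overlap score when step one is silent or tied (step two). The function names `train` and `classify`, the toy English titles, and the fallback scoring are all hypothetical illustrations, not the authors' method; real Chinese titles would first need word segmentation.

```python
from collections import Counter, defaultdict

def train(samples):
    """samples: iterable of (title_tokens, category) pairs (hypothetical format)."""
    word_counts = defaultdict(Counter)  # category -> word frequencies in its titles
    for tokens, category in samples:
        word_counts[category].update(tokens)
    # Record which categories each word appears in.
    owners = defaultdict(set)
    for category, counts in word_counts.items():
        for word in counts:
            owners[word].add(category)
    # Assumption: a "category-unique" word occurs in exactly one category.
    unique = {w: next(iter(cats)) for w, cats in owners.items() if len(cats) == 1}
    return unique, dict(word_counts)

def classify(tokens, unique, word_counts):
    # Step 1: vote with category-unique words; accept only a clear winner.
    votes = Counter(unique[t] for t in tokens if t in unique)
    ranked = votes.most_common(2)
    if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
        return ranked[0][0], "step1"
    # Step 2 (assumed fallback): sum of relative word frequencies per category.
    scores = {
        category: sum(counts[t] / sum(counts.values()) for t in tokens)
        for category, counts in word_counts.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, "unclassified"  # no known word in the title
    return best, "step2"

# Toy corpus, invented for illustration only.
SAMPLES = [
    (["stocks", "rise", "market"], "finance"),
    (["bank", "cuts", "rates"], "finance"),
    (["team", "wins", "final"], "sports"),
    (["market", "for", "players", "opens"], "sports"),
]
UNIQUE, COUNTS = train(SAMPLES)
print(classify(["stocks", "fall"], UNIQUE, COUNTS))
print(classify(["market"], UNIQUE, COUNTS))
```

In this sketch a title containing an unambiguous category-unique word is settled in step one, which is what makes title-only classification cheap; titles with only shared or unseen words are deferred to the slower fallback or left unclassified.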

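Cite this article: Wang, H., Zhao, Y. and Yue, P. (2013) Research of Chinese News Classification Based on Titles. Hans Journal of Data Mining (数据挖掘), 3, 33-39. doi: 10.12677/HJDM.2013.33007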

