Computer Science and Application

Vol.5 No.9 (September 2015)

An Improved CHI Text Feature Selection Method Based on the Location and Word Frequency Information

 

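Authors:

Song Aling (宋阿羚), Liu Haifeng (刘海峰), Liu Shousheng (刘守生): College of Science, PLA University of Science and Technology, Nanjing, Jiangsu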

 

Keywords:

Feature Selection; Chi-Square (χ²) Statistic; Relevance; Location; Distribution; Class Skew

 

Abstract:

Feature selection is a core technique in automatic text categorization. To address the shortcomings of the classical CHI model, we first prune the feature set according to the positive or negative correlation between each feature and the categories. Second, we adjust the feature weights for classification under class skew. Third, taking term frequency as the basis, we refine the model step by step using the specific position of a feature within a text and its intra-class and inter-class distribution, which yields an optimized CHI feature selection method. Text classification experiments confirm the effectiveness of the method.
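For orientation, the classical CHI (chi-square) score that the abstract takes as its starting point can be sketched as below. This is a minimal illustration of the standard textbook statistic, not the authors' optimized model. The function names and the 2x2 contingency counts `a`, `b`, `c`, `d` are assumptions introduced here. The abstract does not give the paper's actual pruning rule, so `correlation_sign` only shows one common way to tell positive from negative correlation.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classical chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def correlation_sign(a: int, b: int, c: int, d: int) -> int:
    """Return +1, -1, or 0 for positive, negative, or no correlation.

    Hypothetical helper: the sign of (a*d - c*b) is lost when the
    classical statistic squares it.
    """
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


# A term concentrated inside the category and one concentrated outside it
# receive the same classical score; only the sign tells them apart.
positive = (40, 10, 10, 40)
negative = (10, 40, 40, 10)
print(chi_square(*positive), correlation_sign(*positive))
print(chi_square(*negative), correlation_sign(*negative))
```

Because the difference is squared, the classical score alone cannot tell a term that indicates a category from one that indicates its absence, which is why a separate sign check of this kind is often paired with it.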

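Cite this article:

Song, A., Liu, H. and Liu, S. (2015) An Improved CHI Text Feature Selection Method Based on the Location and Word Frequency Information. Computer Science and Application, 5, 322-330. doi: 10.12677/CSA.2015.59040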

 
