# An Attribute Weighting Method for Classification Problems

Abstract: Attribute weighting is used in machine learning models to improve performance. In this paper, we propose a novel attribute weighting method based on mutual information and apply it to two classical machine learning models for classification. We study the performance of our weighting method through experiments on the Wisconsin Breast Cancer database. For both machine learning models, the weighted variants tend to outperform their conventional counterparts in classification accuracy, which confirms that our weighting method is reasonable and applicable.

1. Introduction

2. Method

2.1. Weight Acquisition

${\sum }_{x\in X}f\left(x\right)\omega \left(x\right)$ (1)

${\prod }_{x\in X}f{\left(x\right)}^{\omega \left(x\right)}$ (2)

$\omega \left({x}_{i}\right)=NMI\left(X,C\right)=\frac{MI\left(X,C\right)}{mean\left(H\left(X\right),H\left(C\right)\right)}$ (3)

$MI\left(X,C\right)={\sum }_{i=1}^{n}{\sum }_{j=1}^{K}\frac{|{x}_{i}\cap {c}_{j}|}{N}\mathrm{log}\left(\frac{N|{x}_{i}\cap {c}_{j}|}{|{x}_{i}||{c}_{j}|}\right)$ (4)

$H\left(X\right)=-{\sum }_{i=1}^{n}\frac{|{x}_{i}|}{N}\mathrm{log}\left(\frac{|{x}_{i}|}{N}\right)$ (5)

$H\left(C\right)=-{\sum }_{i=1}^{K}\frac{|{c}_{i}|}{N}\mathrm{log}\left(\frac{|{c}_{i}|}{N}\right)$ (6)
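As a concrete illustration, the weight of a single discrete attribute can be computed directly from Equations (3)-(6). The Python sketch below is ours (the function name `nmi_weight` is hypothetical): it counts co-occurrences of attribute values and class labels to estimate MI(X, C) and the two entropies, then normalizes by their mean.

```python
import math
from collections import Counter

def nmi_weight(feature_values, labels):
    """Normalized mutual information between one discrete attribute X
    and the class variable C:  NMI = MI(X, C) / mean(H(X), H(C))."""
    N = len(feature_values)
    px = Counter(feature_values)                 # |x_i| counts
    pc = Counter(labels)                         # |c_j| counts
    pxc = Counter(zip(feature_values, labels))   # |x_i ∩ c_j| counts

    # MI(X, C) as in Eq. (4)
    mi = 0.0
    for (x, c), nxc in pxc.items():
        mi += (nxc / N) * math.log(N * nxc / (px[x] * pc[c]))

    # Entropies as in Eqs. (5) and (6)
    hx = -sum((n / N) * math.log(n / N) for n in px.values())
    hc = -sum((n / N) * math.log(n / N) for n in pc.values())
    return mi / ((hx + hc) / 2)

# A perfectly informative attribute gets weight ≈ 1,
# an attribute independent of the class gets weight ≈ 0.
w = nmi_weight([0, 0, 1, 1], ['a', 'a', 'b', 'b'])
```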

2.2. Weighted Naive Bayes Classifier

1) Compute the weight of each attribute from the dataset and assign the corresponding weight to each attribute value of the instance to be classified:

$\omega \left({x}^{\left(i\right)}\right)=\omega \left({X}^{\left(i\right)}\right)=NMI\left({X}^{\left(i\right)},\mathcal{Y}\right)$ (7)

2) Compute the weighted conditional probability of each attribute value given a particular class:

${P}_{\omega }\left({x}^{\left(i\right)}|{C}_{k}\right)=P{\left({x}^{\left(i\right)}|{C}_{k}\right)}^{\omega \left({x}^{\left(i\right)}\right)}$ (8)

3) Predict the class label of the instance to be classified according to the Bayes decision rule:

$y=\mathrm{arg}{\mathrm{max}}_{{C}_{k}}P\left(Y={C}_{k}\right){\prod }_{i=1}^{n}{P}_{\omega }\left({x}^{\left(i\right)}|{C}_{k}\right)$ (9)
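The three steps above can be sketched in Python as follows. This is our own minimal sketch, not the paper's implementation: it assumes the weighted conditional probability raises each class-conditional probability to the power of its attribute weight (the usual attribute-weighted naive Bayes form, cf. [3]), and it adds Laplace smoothing, which the text does not mention. The function name and `alpha` parameter are ours.

```python
import math
from collections import Counter

def weighted_nb_predict(X_train, y_train, x, weights, alpha=1.0):
    """Predict a label for instance x with an attribute-weighted naive Bayes:
    each Laplace-smoothed P(x^(i)|C_k) is raised to the power of its weight
    before entering the product of Eq. (9) (done here in log space)."""
    N = len(y_train)
    classes = Counter(y_train)
    best_class, best_score = None, -math.inf
    for c, nc in classes.items():
        score = math.log(nc / N)                     # log prior  log P(Y = C_k)
        rows = [xx for xx, yy in zip(X_train, y_train) if yy == c]
        for i, v in enumerate(x):
            vals = {r[i] for r in X_train}           # domain of attribute i
            count = sum(1 for r in rows if r[i] == v)
            p = (count + alpha) / (nc + alpha * len(vals))  # Laplace smoothing
            score += weights[i] * math.log(p)        # weight acts as exponent
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```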

2.3. Weighted k-Nearest Neighbor Method

The k-nearest neighbor (k-NN) method is a widely used supervised learning method. As a lazy learning algorithm, it performs no computation until an instance to be classified arrives, so its training cost is zero. The algorithm computes pairwise distances between instances with a distance function to determine the k training points nearest to the query instance, and then assigns a class label according to a decision rule. Different choices of k can change the classification result significantly; likewise, different distance measures may yield substantially different neighbor sets and hence substantially different classifications [7]. In this paper, we use the Euclidean distance as the distance measure between instances and the majority-voting rule to determine the class label of the instance to be classified.

${d}_{\omega }\left({x}_{i},{x}_{j}\right)={\left({\sum }_{l=1}^{n}\omega \left({x}_{i}^{\left(l\right)}\right){|{x}_{i}^{\left(l\right)}-{x}_{j}^{\left(l\right)}|}^{2}\right)}^{\frac{1}{2}}$ (10)

$\omega \left({x}_{i}^{\left(l\right)}\right)=NMI\left({X}^{\left(l\right)},C\right)$ (11)

1) Using the weighted Euclidean distance ${d}_{\omega }\left(\cdot \right)$, find the k training samples in T nearest to the instance x; these k points form the neighborhood ${N}_{\omega k}\left(x\right)$.

2) Determine the class label y of the instance x by majority vote over ${N}_{\omega k}\left(x\right)$:

$y=\mathrm{arg}{\mathrm{max}}_{{C}_{j}}{\sum }_{{x}_{i}\in {N}_{\omega k}\left(x\right)}I\left({y}_{i}={C}_{j}\right),\text{ }i=1,2,\cdots ,N;\text{ }j=1,2,\cdots ,K$ (12)
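The two steps can be sketched compactly in Python. This is our own illustration (the function name is hypothetical): the weighted Euclidean distance of Eq. (10) selects the k nearest neighbors, and the majority vote of Eq. (12) picks the label.

```python
import math
from collections import Counter

def weighted_knn_predict(X_train, y_train, x, weights, k=3):
    """Classify x by majority vote among its k nearest training points,
    measured with the weighted Euclidean distance of Eq. (10)."""
    def dist(a, b):
        return math.sqrt(sum(w * (ai - bi) ** 2
                             for w, ai, bi in zip(weights, a, b)))
    # The k nearest neighbors N_ωk(x) under the weighted distance
    neighbours = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))[:k]
    # Majority vote, Eq. (12)
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Two of the three nearest points belong to class 0, so x is labeled 0.
label = weighted_knn_predict([[0, 0], [0, 1], [5, 5], [5, 6]],
                             [0, 0, 1, 1], [0, 0.5], [1.0, 1.0])  # → 0
```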

3. Experiments and Results

3.1. Dataset and Experimental Procedure

1) First, classify the dataset with the conventional machine learning methods, the naive Bayes classifier and the k-nearest neighbor method, and record the classification accuracy of each.

2) Next, classify the dataset again with the weighted machine learning methods proposed in this paper, the weighted naive Bayes classifier and the weighted k-nearest neighbor method, again using classification accuracy as the evaluation metric.

3) Finally, evaluate the proposed weighted methods by comparing the classification accuracy of each conventional method against that of its weighted counterpart, thereby verifying the practicality of the proposed weighting method.
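The three steps above reduce to computing one accuracy per model and comparing. A minimal helper (ours; `nb_predict`, `weighted_nb_predict`, and the variable names in the comment are hypothetical placeholders for the models described in Section 2):

```python
def accuracy(predict, X_test, y_test):
    """Fraction of test instances whose predicted label matches the truth."""
    correct = sum(predict(x) == y for x, y in zip(X_test, y_test))
    return correct / len(y_test)

# With a predict function per model, the comparison in step 3) becomes, e.g.:
#   acc_nb  = accuracy(lambda x: nb_predict(X_tr, y_tr, x), X_te, y_te)
#   acc_wnb = accuracy(lambda x: weighted_nb_predict(X_tr, y_tr, x, w), X_te, y_te)
```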

Table 1. Feature descriptions for the Wisconsin Breast Cancer database

3.2. Experimental Results

Table 2. Classification accuracies (%) for Naïve Bayes

Table 3. Classification accuracies (%) for k-NN

4. Conclusion

[1] Li, H. (2012) Statistical Learning Methods. Tsinghua University Press, Beijing, 122. (In Chinese)

[2] Karabatak, M. (2015) A New Classifier for Breast Cancer Detection Based on Naïve Bayesian. Measurement, 72, 32-36.
https://doi.org/10.1016/j.measurement.2015.04.028

[3] Zaidi, N.A., Cerquides, J., Carman, M.J., et al. (2013) Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting. Journal of Machine Learning Research, 14, 1947-1988.

[4] Wu, J., Pan, S., Cai, Z., et al. (2014) Dual Instance and Attribute Weighting for Naive Bayes Classification. International Joint Conference on Neural Networks (IJCNN), Beijing, 1675-1679.
https://doi.org/10.1109/IJCNN.2014.6889572

[5] Wettschereck, D., Aha, D.W. and Mohri, T. (1997) A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review, 11, 273-314.
https://doi.org/10.1023/A:1006593614256

[6] Gupta, M. (2012) Dynamic k-NN with Attribute Weighting for Automatic Web Page Classification (Dk-NNwAW). International Journal of Computer Applications, 58, 34-40.
https://doi.org/10.5120/9321-3554

[7] Zhou, Z.-H. (2016) Machine Learning. Tsinghua University Press, Beijing, 121-128. (In Chinese)

[8] Mangasarian, O.L. and Wolberg, W.H. (1990) Cancer Diagnosis via Linear Programming. SIAM News, 23, 1-18.

[9] Bagui, S.C., Bagui, S., Pal, K. and Pal, N.R. (2003) Breast Cancer Detection Using Rank Nearest Neighbor Classification Rules. Pattern Recognition, 36, 25-34.
https://doi.org/10.1016/S0031-3203(02)00044-4
