# Research on Improved Fine-Grained Image Classification Based on Saliency Fusion

Abstract: To address the large intra-class variation, small inter-class variation, and heavy reliance on data annotation that characterize fine-grained images, this paper proposes an algorithm that improves fine-grained image classification through saliency fusion. A two-input deep neural network is introduced that integrates two components in a single framework: a salient-feature fusion structure and a feature extractor. First, the SALICON saliency detection algorithm is used to generate a saliency map, and the original RGB image is fused with the saliency map according to the fusion network structure. Second, a max-pooling operation is used to reduce the dimensionality of the data space so that the modulation potential of the higher-resolution salient features can be fully exploited. Finally, with the help of transfer learning, the Inception_V3 deep neural network pretrained on the ImageNet dataset serves as the base feature-extraction model to extract high-level semantic features. Comparative experiments on the public CUB-200-2011 and Stanford Dogs datasets show that the algorithm achieves classification accuracies of 84.36% and 84.94%, respectively; compared with Part R-CNN, LRBP, and other mainstream fine-grained classification algorithms, the proposed method achieves better classification results.

1. Introduction

2. Related Fundamentals

2.1. Principle of Saliency Detection

Figure 1. Saliency maps generated by different saliency detection algorithms

2.2. Transfer Learning

Figure 3. Transfer learning

3. Two-Input Deep Neural Network Model Based on Saliency Fusion

Figure 4. Overall structure of the two-input deep neural network based on saliency fusion

${L}_{\phi }\in \left[1,2,\cdots n\right]$. The SALICON saliency detection algorithm is applied to generate the saliency map ${Y}_{i}$ corresponding to ${x}_{i}$. Since the image sizes in the dataset may

a) Forward propagation. The input of each neuron is computed under a stimulus-response mechanism, and forward propagation through the base feature-extraction convolutional neural network proceeds with each neuron's output given by Eq. (1):

${x}_{n}^{i}=f\left({y}_{n}^{i}\right)=f\left(\sum_{j=0}^{C_{n-1}}{w}_{n}^{ji}{x}_{n-1}^{j}\right)$ (1)

b) Backward propagation. For a convolutional neural network formed by interconnected neurons, backward propagation is used to learn the network weights:

$\Delta {w}_{n}^{ji}=-\alpha \frac{\partial L{\left(y,t\right)}_{n}}{\partial {w}_{n}^{ji}}$ (2)

$\frac{\partial L{\left(y,t\right)}_{n-1}}{\partial {x}_{n-1}^{j}}=\sum_{i}{w}_{n}^{ji}\frac{\partial L{\left(y,t\right)}_{n}}{\partial {x}_{n}^{i}}$ (3)

${w}_{n}^{now}={w}_{n}^{pre}-\alpha \frac{\partial L{\left(y,t\right)}_{n}}{\partial {w}_{n}}$ (4)
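Assuming a fully connected layer and a ReLU activation (the text does not fix the form of $f$), Eqs. (1)–(4) can be sketched in NumPy as follows; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def relu(y):
    # activation f in Eq. (1); ReLU is assumed here for illustration
    return np.maximum(0.0, y)

def forward_layer(x_prev, W):
    """Forward propagation, Eq. (1): x_n^i = f(sum_j w_n^{ji} x_{n-1}^j).
    x_prev: layer n-1 activations, shape (C_{n-1},)
    W: weights, shape (C_{n-1}, C_n), where W[j, i] = w_n^{ji}."""
    return relu(W.T @ x_prev)

def backward_step(x_prev, W, grad_x, alpha=0.1):
    """Backward propagation and weight update, Eqs. (2)-(4), for the
    linear part of the layer (the activation derivative is omitted).
    grad_x: dL/dx_n, shape (C_n,). Returns (W_new, dL/dx_{n-1})."""
    grad_W = np.outer(x_prev, grad_x)   # dL/dw_n^{ji} = x_{n-1}^j * dL/dx_n^i
    grad_x_prev = W @ grad_x            # Eq. (3): sum_i w_n^{ji} * dL/dx_n^i
    W_new = W - alpha * grad_W          # Eq. (4) with Delta w from Eq. (2)
    return W_new, grad_x_prev
```

In practice the gradients are computed automatically by the training framework; the sketch only makes the index conventions of Eqs. (1)–(4) concrete.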

Figure 5. Flow chart of the network model training algorithm

3.1. Fusion Layer Network Structure

${h}_{i}=f\left({h}_{i}\left(w,h,c\right)*\left[S\left(x,y\right)+1\right]\right)$ (5)
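The fusion of Eq. (5) can be sketched as an elementwise modulation of a feature map by the saliency map, assuming an (H, W, C) feature map, a saliency map S normalized to [0, 1], and ReLU for $f$ (all names illustrative):

```python
import numpy as np

def saliency_fuse(feature_map, saliency, f=lambda z: np.maximum(0.0, z)):
    """Eq. (5): h_i = f(h_i(w, h, c) * [S(x, y) + 1]).
    feature_map: shape (H, W, C); saliency: shape (H, W), values in [0, 1].
    Adding 1 to S leaves non-salient regions unchanged while
    amplifying responses in salient regions."""
    return f(feature_map * (saliency + 1.0)[..., None])
```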

3.2. Objective Function Optimization

$p\left(i\right)=\frac{{e}^{{W}^{T}{x}_{i}+b}}{\sum_{j=1}^{n}{e}^{{W}^{T}{x}_{j}+b}}$ (6)

$\begin{array}{ll}L\left(y,t\right)&={L}_{c}+{L}_{s}\\ &=-r\sum_{i=1}^{n}I\left(i=t\right)\mathrm{log}\left({p}_{i}\right)+\mu \frac{1}{h\times w}\sum_{i=1}^{h}\sum_{j=1}^{w}{\left({Y}_{i,j}-{T}_{i,j}\right)}^{2}\end{array}$ (7)
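A minimal sketch of the softmax of Eq. (6) and the joint objective of Eq. (7), with illustrative names; the real loss would be computed by the training framework over mini-batches:

```python
import numpy as np

def softmax(logits):
    # Eq. (6), with the usual max-shift for numerical stability
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def joint_loss(logits, t, Y, T, r=1.0, mu=1.0):
    """Eq. (7): L = L_c + L_s.
    L_c: cross-entropy of the softmax output at the true class t.
    L_s: mean squared error between the predicted saliency map Y
    and the target map T, averaged over the h x w pixels."""
    p = softmax(logits)
    L_c = -r * np.log(p[t])
    L_s = mu * np.mean((Y - T) ** 2)
    return L_c + L_s
```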

4. Experiments and Results Analysis

4.1. Datasets and Performance Evaluation

Figure 6. Dataset of CUB-200-2011

Figure 7. Dataset of Stanford Dogs

$accuracy=\frac{{n}_{t}}{n}$ (8)

4.2. Data Preprocessing

1) Cropping to a uniform scale. Different fine-grained datasets have different image sizes. Since this work uses an Inception_V3 network pretrained on ImageNet, whose input size is fixed at $299\times 299\times 3$, all images fed into the network model are uniformly cropped to $299\times 299$.

2) Data normalization. Because the value range differs across dimensions of the data, the raw image sequences are processed beforehand to reduce the model's classification error: all pixel values are divided by 255 so that each channel is scaled to the range 0–1, and each dimension is then standardized to zero mean and unit variance. This not only accelerates the convergence of the neural network but also helps prevent vanishing gradients.

3) Data augmentation. Given the huge number of network parameters, overfitting may occur, and the limited amount of image data makes it necessary to augment the existing data. In this experiment, several methods are applied to the fine-grained image datasets, including random flipping and distortion, random cropping, random noise injection, and random changes to image contrast and saturation, so that the number of training samples per class remains relatively balanced.
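The three preprocessing steps above can be sketched as follows; this is a simplified illustration (center crop, [0, 1] scaling, and a single random flip as the augmentation example), not the paper's exact pipeline:

```python
import numpy as np

def preprocess(image, size=299, rng=None):
    """Sketch of the Section 4.2 preprocessing:
    1) crop to size x size (center crop here; the real pipeline
       also uses random crops for augmentation),
    2) scale pixel values to [0, 1],
    3) random horizontal flip as one example of augmentation."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w, _ = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    crop = image[top:top + size, left:left + size]
    crop = crop.astype(np.float32) / 255.0   # scale each channel to [0, 1]
    if rng.random() < 0.5:                   # random flip (augmentation)
        crop = crop[:, ::-1]
    return crop
```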

4.3. Experimental Results and Analysis

4.3.1. Comparison of Different Classification Algorithms

Table 1. Classification accuracy on the Stanford Dogs dataset

Table 2. Classification accuracy on the CUB-200-2011 dataset

4.3.2. Recognition Performance of Different Base CNN Models

Table 3. Classification accuracy of different CNN models on different datasets

4.3.3. Effect of Different Saliency Detection Algorithms on Image Classification

Figure 8. Comparison of different saliency detection algorithms

4.3.4. Selection of Fusion Network Structure Variants

Figure 9. The influence of different fusion structures on classification

5. Conclusion and Future Work

[1] Wah, C., Branson, S., Welinder, P., et al. (2011) The Caltech-UCSD Birds-200-2011 Dataset.

[2] Khosla, A., Jayadevaprakash, N., Yao, B. and Li, F.-F. (2011) Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs. Proceedings of CVPR Workshop on Fine-Grained Visual Categorization, 1-2.

[3] Maji, S., Rahtu, E., Kannala, J., Blaschko, M. and Vedaldi, A. (2013) Fine-Grained Visual Classification of Aircraft. ArXiv Preprint ArXiv: 1306.5151.

[4] Nilsback, M.E. and Zisserman, A. (2008) Automated Flower Classification over a Large Number of Classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16-19 December 2008, 722-729.
https://doi.org/10.1109/ICVGIP.2008.47

[5] Krause, J., Stark, M., Deng, J. and Li, F.-F. (2013) 3D Object Representations for Fine-Grained Categorization. 2013 IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2-8 December 2013, 554-561.
https://doi.org/10.1109/ICCVW.2013.77

[6] Luo, J.-H. and Wu, J.-X. (2017) A Survey on Fine-Grained Image Categorization Using Deep Convolutional Features. Acta Automatica Sinica, 43(8), 1306-1318. (in Chinese)

[7] Zhang, L.-B., Wang, C.-H., Xiao, B.-H., et al. (2012) Image Representation Based on Bag-of-Phrases. Acta Automatica Sinica, 38(1), 46-54. (in Chinese)

[8] Berg, T. and Belhumeur, P.N. (2013) POOF: Part-Based One-vs-One Features for Fine-Grained Categorization, Face Verification, and Attribute Estimation. 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 23-28 June 2013, 955-962.
https://doi.org/10.1109/CVPR.2013.128

[9] Perronnin, F., Sánchez, J. and Mensink, T. (2010) Improving the Fisher Kernel for Large-Scale Image Classification. Proceedings of the 11th European Conference on Computer Vision (ECCV), Crete, Greece, 5-11 September 2010, 143-156.
https://doi.org/10.1007/978-3-642-15561-1

[10] Wang, P., et al. (2013) Supervised Kernel Descriptors for Visual Recognition. 2013 IEEE Conference on Computer Vision and Pattern Recognition, 23-28 June 2013, Portland, OR, 1828-1830.
https://doi.org/10.1109/CVPR.2013.368

[11] Zhang, N., Donahue, J., Girshick, R. and Darrell, T. (2014) Part-Based R-CNNs for Fine-Grained Category Detection. In: Fleet, D., Pajdla, T., Schiele, B. and Tuytelaars, T., Eds., Computer Vision-ECCV 2014. Lecture Notes in Computer Science, Volume 8689, Springer, Cham, 834-849.
https://doi.org/10.1007/978-3-319-10590-1_54

[12] Branson, S., Belongie, S., Van Horn, G. and Perona, P. (2014) Bird Species Categorization Using Pose Normalized Deep Convolutional Nets. Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 594-605.
https://doi.org/10.5244/C.28.87

[13] Wei, X.-S., Xie, C.-W., Wu, J.X. and Shen, C. (2018) Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Bird Species Categorization. Pattern Recognition, 76, 704-714.
https://doi.org/10.1016/j.patcog.2017.10.002

[14] Lam, M., Todorovic, S. and Mahasseni, B. (2017) Fine-Grained Recognition as HSnet Search for Informative Image Parts. 2017 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 6497-6506.
https://doi.org/10.1109/CVPR.2017.688

[15] Xiao, T.J., et al. (2015) The Application of Two-Level Attention Models in Deep Convolutional Neural Network for Fine-Grained Image Classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 842-850.

[16] Simon, M. and Rodner, E. (2015) Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks. Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7-13 December 2015, 1143-1151.
https://doi.org/10.1109/ICCV.2015.136

[17] Lin, T.Y., Roychowdhury, A. and Maji, S. (2015) Bilinear CNN Models for Fine-Grained Visual Recognition. Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7-13 December 2015, 1449-1457.
https://doi.org/10.1109/ICCV.2015.170

[18] Fu, J.L., Zheng, H.L. and Mei, T. (2017) Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. 2017 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 4438-4446.
https://doi.org/10.1109/CVPR.2017.476

[19] Zhang, X.P., Xiong, H., Zhou, W., Lin, W. and Tian, Q. (2016) Picking Deep Filter Responses for Fine-Grained Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 27-30 June 2016, 1134-1142.
https://doi.org/10.1109/CVPR.2016.128

[20] Zhao, B., Wu, X., Feng, J.S., Peng, Q. and Yan, S. (2017) Diversified Visual Attention Networks for Fine-Grained Object Classification. IEEE Transactions on Multimedia, 19, 1245-1256.
https://doi.org/10.1109/TMM.2017.2648498

[21] Liu, X., Xia, T., Wang, J., et al. (2016) Fully Con-volutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition.
https://arxiv.org/pdf/1603.06765.pdf

[22] Kong, S. and Fowlkes, C. (2017) Low-Rank Bilinear Pooling for Fine-Grained Classification. 2017 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 365-374.
https://doi.org/10.1109/CVPR.2017.743
