# A Mountain-Shaped Network for Semantic Segmentation of Book Spines on Shelves

Abstract: Identifying book spines on shelves in images enables more convenient book inventory and makes possible a better reader experience, such as take-and-go services. Segmenting the spine regions is an important prerequisite for such identification. Unlike ordinary target segmentation, the difficulty of this problem lies in the fact that the spines are densely packed and visually repetitive. In this paper, a mountain-shaped deep neural network structure is proposed, consisting of one encoder and two decoders. One decoder is the main channel for segmenting the spines, while the other incorporates spine-gap information to capture more spine-edge detail. In addition, this research establishes a spine image dataset comprising 661 images with 15,454 manually labeled polygons. The experimental results show that the proposed network model achieves high accuracy for semantic segmentation of dense targets such as book spine images, reaching a mean intersection over union of 90% and a mean pixel accuracy of 95% on the established dataset. This performance exceeds that of the classical segmentation models, verifying the effectiveness of the proposed model.

1. Introduction

(a) Original image (b) Ideal segmentation result

Figure 1. A segmentation example of book spine image

2. Mountain-Shaped Network

2.1. Loss Function for Book Spine Segmentation

$L_c = -\sum_{i=0}^{1} y_i \log p_i$ (1)
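As a minimal sketch of Equation (1) for a single pixel in the binary (background/spine) setting — function and variable names here are illustrative, not from the paper — the cross-entropy can be computed as:

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy L_c = -sum_i y_i * log(p_i), i in {0, 1}.

    p: predicted probabilities for the two classes (background, spine),
       expected to sum to 1.
    y: one-hot ground-truth label for the pixel.
    """
    eps = 1e-12  # guard against log(0)
    return -sum(y[i] * math.log(p[i] + eps) for i in range(2))

# Example: true class is "spine" (index 1), predicted with probability 0.9
loss = cross_entropy(p=[0.1, 0.9], y=[0, 1])
```

In practice the loss is averaged over all pixels of the segmentation map; since the label is one-hot, only the term for the true class contributes.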

Figure 2. Mountain-shaped network structure

2.2. Book Spine Gap Segmentation Task

Table 1. Steps of obtaining the masks between the spines of books

$L_f = -\sum_{i=0}^{1} \alpha_i \left(1 - p_i\right)^{\gamma} y_i \log\left(p_i\right)$ (2)
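Equation (2) is the focal loss. A minimal per-pixel sketch follows; the $\alpha$ and $\gamma$ values are common illustrative defaults, not necessarily those used in the paper:

```python
import math

def focal_loss(p, y, alpha=(0.25, 0.75), gamma=2.0):
    """Focal loss L_f = -sum_i alpha_i * (1 - p_i)^gamma * y_i * log(p_i).

    The modulating factor (1 - p_i)^gamma down-weights well-classified
    pixels so training focuses on hard ones (e.g. thin spine gaps),
    while alpha_i balances the two classes.
    """
    eps = 1e-12  # guard against log(0)
    return -sum(alpha[i] * (1 - p[i]) ** gamma * y[i] * math.log(p[i] + eps)
                for i in range(2))

# A confidently correct prediction is strongly damped relative to a hard one
easy = focal_loss(p=[0.05, 0.95], y=[0, 1])
hard = focal_loss(p=[0.7, 0.3], y=[0, 1])
```

With $\gamma = 0$ and $\alpha_i = 1$ this reduces to the cross-entropy of Equation (1), which is why the focal loss suits the narrow, class-imbalanced gap regions.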

3. Training and Test Result Analysis

3.1. Sample Acquisition and Model Training

$lr = lr_0 \cdot \left(\frac{\mathrm{max\_iter} - iter}{\mathrm{max\_iter}}\right)^{0.9}$ (3)
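Equation (3) is the common "poly" learning-rate decay schedule. A small sketch (function name hypothetical):

```python
def poly_lr(lr0, iteration, max_iter, power=0.9):
    """Polynomial ("poly") decay: lr = lr0 * ((max_iter - iter) / max_iter)^power.

    Starts at the base rate lr0 and decays smoothly to 0 at max_iter.
    """
    return lr0 * ((max_iter - iteration) / max_iter) ** power

# Typical use: recompute the rate at every training iteration
schedule = [poly_lr(0.01, it, 1000) for it in (0, 500, 1000)]
```

The exponent 0.9 makes the decay slightly slower than linear early in training and faster near the end.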

3.2. Segmentation Results and Discussion

$mIoU = \frac{1}{2}\sum_{i=0}^{1} \frac{TP_i}{FP_i + FN_i + TP_i}$ (4)

$mPA = \frac{1}{2}\sum_{i=0}^{1} \frac{TP_i}{FP_i + TP_i}$ (5)
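The two metrics can be sketched directly from Equations (4) and (5) as written (note that (5) uses $FP_i$ in the denominator); the helper names below are illustrative:

```python
def confusion_counts(pred, truth, num_classes=2):
    """Per-class TP/FP/FN counts from flat label sequences."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            tp[p] += 1      # correctly labeled pixel of class p
        else:
            fp[p] += 1      # predicted p, but truth was t
            fn[t] += 1      # missed a pixel of class t
    return tp, fp, fn

def miou(pred, truth, num_classes=2):
    """Mean intersection over union, Equation (4)."""
    tp, fp, fn = confusion_counts(pred, truth, num_classes)
    return sum(tp[i] / (fp[i] + fn[i] + tp[i])
               for i in range(num_classes)) / num_classes

def mpa(pred, truth, num_classes=2):
    """Mean pixel accuracy as defined in Equation (5)."""
    tp, fp, fn = confusion_counts(pred, truth, num_classes)
    return sum(tp[i] / (fp[i] + tp[i])
               for i in range(num_classes)) / num_classes
```

For example, with predicted labels `[1, 1, 0, 0]` against ground truth `[1, 0, 0, 0]`, mIoU is $\frac{1}{2}(\frac{2}{3} + \frac{1}{2}) = \frac{7}{12}$ and mPA is $\frac{1}{2}(1 + \frac{1}{2}) = 0.75$.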

Segnet and U-net use an encoder-decoder structure, applying multiple layers of small-kernel deconvolution in the decoder, which increases computation but greatly reduces the parameter count. This structure attends to segmentation detail, so book corners are correctly segmented, but the spine gaps remain difficult to segment, and the spine is not treated as a whole: parts of the distant background and of incomplete spines are mistaken for targets.

Deeplab adopts atrous (dilated) convolution, which greatly enlarges the receptive field, so the spine is treated as a whole and both the distant background and incomplete spine regions are correctly classified. Deeplab also adopts a residual network structure, bringing the total depth to over 100 layers, so its parameter count surges relative to the other networks, as shown in Table 2. As Figure 3 shows, the proposed model improves considerably over Deeplab both at the spine gaps and along the spine edges, while its total parameter count is also far lower than Deeplab's.

Table 2. Performance comparison of algorithms

(a) FCN16s (b) FCN32s (c) Segnet (d) U-net (e) Deeplab v3 (f) Proposed model

Figure 3. Sample results of common algorithms

4. Conclusion

NOTES

*Corresponding author.

1The annotated dataset is openly available for download at http://doi.org/10.4121/uuid:33f2a166-de13-4505-b359-2b202c491fd8.

[1] Tian, X., Wang, L. and Ding, Q. (2019) Review of Image Semantic Segmentation Methods Based on Deep Learning. Journal of Software, 30, 440-468. (In Chinese)

[2] Zhang, S., Gong, Y.-H. and Wang, J.-J. (2019) The Development of Deep Convolutional Neural Networks and Their Applications in Computer Vision. Chinese Journal of Computers, 42, 453-482. (In Chinese)

[3] Long, J., Shelhamer, E. and Darrell, T. (2015) Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 7-12 June 2015, 3431-3440.
https://doi.org/10.1109/CVPR.2015.7298965

[4] Badrinarayanan, V., Kendall, A. and Cipolla, R. (2017) Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 2481-2495.
https://doi.org/10.1109/TPAMI.2016.2644615

[5] Ronneberger, O., Fischer, P. and Brox, T. (2015) U-net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, 5-9 October 2015, 234-241.
https://doi.org/10.1007/978-3-319-24574-4_28

[6] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A.L. (2017) Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFS. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 834-848.
https://doi.org/10.1109/TPAMI.2017.2699184

[7] Ruder, S. (2017) An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint arXiv:1706.05098.

[8] Zhou, X.Y., Zhuo, J.C. and Krahenbuhl, P. (2019) Bottom-Up Object Detection by Grouping Extreme and Center Points. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 15-20 June 2019, 850-859.
https://doi.org/10.1109/CVPR.2019.00094

[9] Lin, T.-Y., Goyal, P., Girshick, R., He, K.M. and Dollár, P. (2017) Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 2980-2988.
