The Research and Implementation of Extraction and Integration of Web Data Based on Domain Pattern

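Authors: Gui Li, Zhengyu Li: School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning; Chuanjie Geng, Ziyang Han: School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning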

Keywords: Web Data Model and Pattern; Domain Data Model and Pattern; Domain Data Extraction and Integration; Domain Value-Added Service

Abstract: Providing domain-oriented value-added information services is one of the goals of Web data mining. Domain-oriented Web data extraction and integration is the foundation of such services and a major research direction in Web data mining. Driven by domain requirements, this paper proposes a domain-oriented architecture for Web data extraction and integration. After defining the Web data model and Web data pattern, and the domain data model and domain data pattern, it presents a method for mapping Web data patterns to domain data patterns and a data-level integration method, which together resolve pattern-level and data-level conflicts during integration. It also discusses how Web data extraction and domain value-added services are implemented. A real estate information platform and an integrated application system were developed to meet practical requirements, and they verify the effectiveness of the model and algorithms.
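The two integration levels named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the synonym table, the attribute names (`estate_name`, `total_price`, `floor_area`), and the "first value wins" conflict rule are all hypothetical choices made for illustration, assuming a real estate domain as in the paper's application.

```python
from collections import defaultdict

# Hypothetical synonym table: source attribute label -> domain attribute name.
# A real system would derive this mapping; here it is hard-coded.
ATTRIBUTE_SYNONYMS = {
    "price": "total_price",
    "total price": "total_price",
    "area": "floor_area",
    "floor area": "floor_area",
    "community": "estate_name",
    "estate": "estate_name",
}


def map_to_domain_pattern(web_record):
    """Pattern-level step: rename source attributes to domain attributes."""
    mapped = {}
    for label, value in web_record.items():
        domain_attr = ATTRIBUTE_SYNONYMS.get(label.strip().lower())
        if domain_attr is not None:  # attributes with no mapping are dropped
            mapped[domain_attr] = value
    return mapped


def integrate(records, key="estate_name"):
    """Data-level step: merge records that share the same key value.

    Assumed conflict rule: the first value seen for an attribute is kept.
    """
    merged = defaultdict(dict)
    for record in records:
        if key not in record:  # cannot identify the entity, so skip it
            continue
        entity = merged[record[key]]
        for attr, value in record.items():
            entity.setdefault(attr, value)
    return dict(merged)


# Two hypothetical sources describing the same estate with different labels.
site_a = {"Community": "Green Garden", "Price": "1200000"}
site_b = {"Estate": "Green Garden", "Floor Area": "90", "Total Price": "1250000"}
result = integrate([map_to_domain_pattern(site_a), map_to_domain_pattern(site_b)])
```

Here the label mismatch ("Community" vs. "Estate") is a pattern-level conflict resolved by the mapping, while the two differing prices are a data-level conflict resolved by the assumed first-value rule.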

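Cite this article: Li, G., Geng, C.J., Han, Z.Y. and Li, Z.Y. (2016) The Research and Implementation of Extraction and Integration of Web Data Based on Domain Pattern. Computer Science and Application, 6, 203-215. doi: 10.12677/CSA.2016.64026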

