# 基于MapReduce的语义网空间数据关联A Map-Reduce-Based Parallel Approach for Geospatial Data Interlinking in a Semantic Web

Abstract: The Web of Data represents an intermediate step towards the Semantic Web. Constructing links among different Resource Description Framework (RDF) datasets is a key issue in the Web of Data. An identity link aims to match entities from different datasets and is an important type of RDF link. There are many approaches to constructing identity links between geospatial entities. This paper adopts the Hausdorff distance to compute the location and shape similarity between two entities. Because the computation of the Hausdorff distance is complex and geospatial data intrinsically large, the entire matching process is very time consuming. This paper proposes a Map-Reduce-based framework to parallelize the similarity computation, significantly reducing the runtime. This approach was verified to be effective in an experiment using data from Nomenclature of Territorial Units for Statistics (NUTS) and Database of Global Administrative Areas (GADM). The matching precision was high, and with the utilization of the proposed parallel framework, the runtime was reduced to only approximately 3 h on 8 nodes; in contrast, when run on 1 node, the runtime exceeded one day.

1. 引言

2. 地理空间实体的同质关联

Figure 1. Finding matches between geospatial datasets A and B

3. 方法

3.1. Hausdorff相似性度量

Hausdorff距离隐含地计算了实体之间的位置和形状相似度，非常适合用于计算地理空间实体的空间相似度。

$H\left(A,B\right)=\mathrm{max}\left(h\left(A,B\right),h\left(B,A\right)\right)$ (1)

$h\left(A,B\right)={\mathrm{max}}_{a\in A}{\mathrm{min}}_{b\in B}‖a-b‖$ (2)

 (3)

$sim\left(A,B\right)=1-\frac{H\left(A,B\right)}{{L}_{D}}$ (4)

3.2. 基于MapReduce的并行匹配框架

MapReduce是大型数据集并行处理的编程模型 [19] 。模型的所有输入和输出都是键/值对的形式。该模型有两个主要函数：Map和Reduce。Map接收一个输入键值对，以用户自定义的方式处理输入数据并生成中间键/值对。然后，MapReduce库将这些中间键/值对分成许多组，每组含有相同的键，并将它们传递给Reduce。每个Reduce都会接收一个键和该键的一组值，以用户自定义的方式处理它们并生成新的键/值对作为输出。

Table 1. Hausdorff similarity-matching algorithm

Figure 2. Map-Reduce workflow. Dataset B is split using Map-reduce, whereas dataset A is shared

4. 实验

Figure 3. Experiment workflow

4.1. 数据预处理

(a)(b)

Figure 4. The data for Austria. (a) Austrian data shown in the RDF graph. (b) Austrian data shown in N-TRIPLES format

Figure 5. A portion of the Austrian data in NUTS

4.2. 实体匹配

4.3. 匹配结果

Table 2. Matching result

Figure 6. Illustration of the data for Italy: Data from NUTS (left) and data from GADM (right)

Figure 7. Published triples of Italy

4.4. 运行时间和可扩展性

Table 3. Speedup as the number of nodes increases

Figure 8. Runtime as a function of the number of nodes

5. 讨论

6. 结论

[1] Auer, S., et al. (2007) DBpedia: A Nucleus for a Web of Open Data. Proceedings of 6th International Semantic Web Conference and 2nd Asian Semantic WEB Conference, Busan, 11-15 November 2007, 722-735.
https://doi.org/10.1007/978-3-540-76298-0_52

[2] Auer, S., Lehmann, J. and Hellmann, S. (2009) Linked Geo Data: Adding a Spatial Dimension to the Web of Data. Proceedings of International Semantic Web Conference, Chantilly, 25-29 October 2009, 731-746.

[3] Mika, P. and Tummarello, G. (2008) Web Semantics in the Clouds. IEEE Intelligent Systems, 23, 82-87.
https://doi.org/10.1109/MIS.2008.94

[4] Hoffart, J., et al. (2013) YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194, 28-61.
https://doi.org/10.1016/j.artint.2012.06.001

[5] Berners-Lee, T. (2006) Linked Data.

[6] Heath, T. and Bizer, C. (2011) Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, San Rafael.

[7] Winkler, W.E. (1990) String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, 354-359.

[8] Rodriguez, M.A. and Egenhofer, M.J. (2003) Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, 15, 442-456.
https://doi.org/10.1109/TKDE.2003.1185844

[9] Varelas, G., et al. (2005) Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web. In: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, ACM, New York, 10-16.
https://doi.org/10.1145/1097047.1097051

[10] Nguyen, H.A. and Al-Mubaid, H. (2006) A Combination-Based Semantic Similarity Measure Using Multiple Information Sources. IEEE International Conference on Information Reuse and Integration, 16-18 September 2006, 617-621.

[11] Ge, J. and Qiu, Y. (2008) Concept Similarity Matching Based on Semantic Distance. 4th International Conference on Semantics, Knowledge and Grid, 3-5 December 2008, 380-383.
https://doi.org/10.1109/SKG.2008.24

[12] Tejada, S., Knoblock, C.A. and Minton, S. (2001) Learning Object Identification Rules for Information Integration. Information Systems, 26, 607-633.
https://doi.org/10.1016/S0306-4379(01)00042-4

[13] Cohen, W.W., Ravikumar, P. and Fienberg, S.E. (2003) A Comparison of String Metrics for Matching Names and Records. KDD Workshop on DATA Cleaning & Object Con-solidation, Washington, DC, Vol. 3, 73-78.

[14] Zhang, M., et al. (2013) An Interlinking Approach for Linked Geo-spatial Data. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 40, 283-287.
https://doi.org/10.5194/isprsarchives-XL-7-W2-283-2013

[15] Tversky, A. (1977) Features of Similarity. Psychological Review, 84, 327-352.
https://doi.org/10.1037/0033-295X.84.4.327

[16] Pschorr, J., et al. (2010) Sensor Discovery on Linked Data. Proceedings of the 7th Extended Semantic Web Conference, Heraklion.

[17] Volz, J., et al. (2010) Silk—A Link Discovery Framework for the Web of Data. LDOW, 538.

[18] Bizer, C., Cyganiak, R. and Heath, T. (2007) How to Publish Linked Data on the Web.