Data Linking for the Semantic Web
International Journal on Semantic Web and Information Systems (IJSWIS)
By specifying that published datasets must link to other existing datasets, the 4th linked data principle ensures a Web of data and not just a set of unconnected data islands. In this paper, we propose the term data linking to name the problem of finding equivalent resources on the Web of linked data. Many techniques have been developed to perform data linking, with roots in statistics, databases, natural language processing and graph theory. We begin this paper by providing
... nd information and terminological clarifications related to data linking. We then provide a comprehensive survey of the various techniques available for data linking. We classify these techniques along three criteria: granularity, type of evidence, and source of the evidence. Finally, we survey eleven recent tools that perform data linking and classify them according to the surveyed techniques.

Pre-processing & optimization (optional). Pre-processing of data is an optional step that can be executed for two main purposes. The first is to transform the original representation of the data into a reference format used for the comparison. The second is to minimize the number of comparisons that have to be executed in order to produce the final mapping set. To this end, several kinds of blocking techniques can be adopted so that each object description is compared only against those descriptions that have a high probability of being considered similar to it.

Matching. In the instance matching step, the object descriptions are compared according to the metrics chosen in the configuration step. In many cases, more than one type of matching technique is combined, for example string/value matching, learning-based matching, and similarity propagation. If required, external resources are used in this step to optimize the matching with respect to a pre-defined mapping, or to determine the similarity between property values according to existing lexical resources or ontologies (e.g., WordNet, SKOS, OpenCyc). In a semi-automatic process, user interaction is needed in the matching step in order to select the correct mappings or to validate the system result.
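The pre-processing and matching steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than a method from the surveyed tools: the function names (`normalize`, `blocking_key`, `build_blocks`, `link`), the `ex:` identifiers, the three-character prefix used as a blocking key, the 0.85 threshold, and the use of `difflib.SequenceMatcher` from the Python standard library as a stand-in for a string/value matching metric.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def normalize(value):
    """Pre-processing: map a raw label to a reference format
    (lowercase, runs of whitespace collapsed to one space)."""
    return " ".join(value.lower().split())


def blocking_key(label):
    """Hypothetical blocking key: first three characters of the normalized label."""
    return normalize(label)[:3]


def build_blocks(dataset):
    """Group resource identifiers by blocking key."""
    blocks = defaultdict(list)
    for uri, label in dataset.items():
        blocks[blocking_key(label)].append(uri)
    return blocks


def link(source, target, threshold=0.85):
    """Compare only descriptions that share a block and keep the pairs
    whose string similarity reaches the (assumed) threshold."""
    source_blocks = build_blocks(source)
    target_blocks = build_blocks(target)
    mappings = []
    for key, source_uris in source_blocks.items():
        # Descriptions in other blocks are never compared with this block.
        for s, t in product(source_uris, target_blocks.get(key, [])):
            score = SequenceMatcher(
                None, normalize(source[s]), normalize(target[t])
            ).ratio()
            if score >= threshold:
                mappings.append((s, t, score))
    return mappings


# Toy datasets: identifier -> label (illustrative values only).
source = {
    "ex:a1": "Tim Berners-Lee",
    "ex:a2": "Semantic  Web",
}
target = {
    "ex:b1": "tim berners lee",
    "ex:b2": "Semantic Web Journal",
    "ex:b3": "Linked Data",
}

links = link(source, target)
for s, t, score in links:
    print(s, t, round(score, 3))
```

With these toy inputs, blocking reduces the six possible pairs to two comparisons, and only the pair `ex:a1` / `ex:b1` passes the threshold. A prefix key is a deliberately crude choice: it misses equivalent resources whose labels differ in their first characters, which is one reason why practical systems tend to use more robust blocking keys and combine several matching techniques.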