ON TECHNIQUES FOR PAY-AS-YOU-GO DATA INTEGRATION OF LINKED DATA

UoM administered thesis: PhD

  • Author: Klitos Christodoulou

Abstract

Users today interact with large amounts of data that exist in disparate forms and are stored under different settings. Moreover, the amount of structured and unstructured data held outside a single, well-organised data management system is expanding rapidly. To address the challenges of managing large amounts of potentially distributed data, the vision of a dataspace was introduced. This data management paradigm aims to reduce the complexity of integrating heterogeneous data sources.
Recently, efforts by the Linked Data (LD) community have given rise to a Web of Data (WoD) that interweaves with the current Web of documents in a way that is useful for data consumption by both humans and computational agents. On the WoD, datasets are structured under a common data model and published as Web resources following a simple set of guidelines that enables them to be linked with other pieces of data and to be annotated with metadata that helps determine their semantics. The WoD is an evolving, open ecosystem that includes specialist publishers as well as community efforts aimed at re-publishing isolated databases as LD on the WoD and annotating them with metadata.
The WoD raises new opportunities and challenges. However, integrating the large number of heterogeneous data sources on the WoD currently relies mostly on manual effort. This dissertation makes the case that several techniques from the dataspaces research area (which aim at on-demand integration of data sources in a pay-as-you-go fashion) can support the integration of heterogeneous WoD sources. In so doing, it explores the opportunities and identifies the challenges of adapting existing pay-as-you-go data integration techniques to the context of LD.
More specifically, this dissertation makes the following contributions: (1) a case study that identifies the challenges that arise when existing pay-as-you-go data integration techniques are applied in a setting where the data sources are LD; (2) a methodology that deals with the "schema-less" nature of LD sources by automatically inferring a conceptual structure from a given RDF graph, thus enabling downstream tasks such as the identification of matches and the derivation of mappings, both of which are essential for the automatic bootstrapping of a dataspace; and (3) a well-defined, principled methodology that builds on a Bayesian inference technique for reasoning under uncertainty to improve pay-as-you-go integration. Although the methodology is generic, in that it can reason over different hypotheses, its effectiveness has so far been explored only in reducing the uncertainty of the decisions made by string-based matchers during the matching stage of a dataspace system.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2015