A PROBABILISTIC APPROACH TO UNCERTAINTY QUANTIFICATION IN PAY-AS-YOU-GO DATA INTEGRATION

UoM administered thesis: PhD

  • Authors:
  • Fernando Rene Sanchez Serrano

Abstract

The use of Web standards, compact publication guidelines, and open data initiatives has motivated many public and private organisations to publish data on the Web, giving rise to a global data space. Consuming data from heterogeneous data sources published on the Web requires integration at scale. The pay-as-you-go approach to data integration (PAYG) addresses integration at scale, relying on automatic techniques to provide candidate integrations. This heavy reliance on automatic techniques gives rise to uncertainty. Uncertainty may arise in, and propagate through, every task in the life cycle of a PAYG approach, and its effect may show in the quality of an automatically generated integration. Quantifying the uncertainty in the outcomes of a bootstrapped integration is therefore a crucial task: it helps in understanding the decisions made by the automatic algorithms and in reducing that uncertainty, which can ultimately improve the quality of the integration.
In this thesis, we address the issue of quantifying the uncertainty that arises during the bootstrapping phase of PAYG in the context of Dataspaces. In particular, two approaches are proposed: (i) an approach to quantify the uncertainty in mapping generation using internal evidence; (ii) an approach to quantify the uncertainty on the quality of an entire integration using user feedback in a pay-as-you-go manner.
More specifically, this thesis makes the following contributions: (i) a principled methodology to derive degrees of belief on mappings that builds on Bayesian inference to assimilate evidence in the form of fitness scores associated with mappings during mapping generation; (ii) a novel methodology to quantify the uncertainty on the quality of an entire integration by assimilating user feedback on tuple results; (iii) an experimental evaluation of the proposed techniques on a real-world integration scenario.
The experimental evaluation of the contributed techniques presented in this dissertation provides empirical evidence of their cost-effectiveness, when applied in synthetic and real-world scenarios, in quantifying the quality of a pay-as-you-go data integration.
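
As a rough illustration of the two kinds of evidence assimilation named in the contributions, the sketch below applies Bayes' rule to a single mapping's fitness score and a Beta-Bernoulli update to user feedback on result tuples. It is not the thesis's method: the function names, the Gaussian likelihoods, their parameters, and the uniform priors are all assumptions chosen only to make the idea concrete.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def posterior_correct(score, prior=0.5, correct=(0.8, 0.1), incorrect=(0.4, 0.2)):
    """Degree of belief that a mapping is correct given its fitness score.

    Assumption: fitness scores of correct and incorrect mappings follow
    normal distributions with the (mean, std) pairs given above. These
    values are hypothetical, not taken from the thesis.
    """
    like_correct = normal_pdf(score, *correct)
    like_incorrect = normal_pdf(score, *incorrect)
    evidence = like_correct * prior + like_incorrect * (1 - prior)
    return like_correct * prior / evidence


def beta_update(alpha, beta, feedback):
    """Update a Beta(alpha, beta) belief about the fraction of correct tuples.

    Each feedback item is True (the user marked the tuple correct) or False.
    """
    for is_correct in feedback:
        if is_correct:
            alpha += 1
        else:
            beta += 1
    return alpha, beta


if __name__ == "__main__":
    # A high fitness score raises the belief that the mapping is correct.
    print(posterior_correct(0.8))
    # Start from a uniform Beta(1, 1) prior and assimilate three feedback items.
    a, b = beta_update(1, 1, [True, True, False])
    print(a / (a + b))  # posterior mean of the fraction of correct tuples
```

In this sketch each additional feedback item narrows the Beta posterior, which is one simple way to read "pay-as-you-go": the estimate of integration quality is usable from the start and sharpens as feedback accumulates.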

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2019