CROWDSOURCING IN PAY-AS-YOU-GO DATA INTEGRATION

UoM administered thesis: PhD

Authors: Fernando Osorno Gutierrez

Abstract

In pay-as-you-go data integration, feedback can inform the regeneration of different aspects of a data integration system and, as a result, help to improve the system's quality. However, feedback can be expensive: the amount needed to annotate all possible integration artefacts is potentially large, while the available budget may be limited. Feedback can also be used in different ways. Feedback of different types, collected in different orders, can have different effects on the quality of the integration, and some feedback types may yield more benefit than others. There is therefore a need for techniques that collect feedback effectively. Previous efforts have explored the benefit of feedback for one aspect of the integration, but they have not considered the benefit of different feedback types within a single integration task.
We have investigated the annotation of mapping results using crowdsourcing, and implemented techniques for reliability. The results indicate that precision estimates derived from crowdsourcing improve rapidly, suggesting that crowdsourcing can be used as a cost-effective source of feedback.
We propose an approach to maximize the improvement of data integration systems given a budget for feedback. Our approach takes into account the annotation of schema matchings, mapping results and pairs of candidate record duplicates. We define a feedback plan, which indicates the types of feedback to collect, the amount of each type and the order in which the different types are collected. We define a fitness function and a genetic algorithm to search for the most cost-effective feedback plans. We implemented a framework to test the application of feedback plans and to measure the improvement of different data integration systems. In the framework, we use a greedy algorithm for the selection of mappings.
We designed quality measures to estimate the quality of a dataspace after the application of a feedback plan. For the evaluation of our approach, we propose a method to generate synthetic data scenarios. We evaluate our approach in scenarios with different characteristics. The results show that the generated feedback plans achieved higher quality values than randomly generated feedback plans in several scenarios.
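
As an informal illustration of the feedback-plan idea described in the abstract, the following Python sketch represents a plan as an ordered list of (feedback type, amount) steps and searches for a good plan under a budget with a small genetic algorithm. It is a minimal sketch and not the implementation from the thesis: the unit costs, the gain rates, the function `toy_fitness` and all parameter values are hypothetical and chosen only to make the example runnable.

```python
import random

# Feedback types named in the abstract. The unit costs are illustrative assumptions.
FEEDBACK_TYPES = ("matching", "mapping_result", "duplicate_pair")
UNIT_COST = {"matching": 2, "mapping_result": 1, "duplicate_pair": 1}


def plan_cost(plan):
    """Total cost of a plan, where a plan is an ordered list of (type, amount) steps."""
    return sum(UNIT_COST[t] * n for t, n in plan)


def toy_fitness(plan, budget):
    """Hypothetical stand-in for a fitness function (not the one from the thesis).

    Each feedback type gives diminishing returns, and earlier steps weigh
    slightly more, so both amounts and order affect the score.
    Plans that exceed the budget are penalised with a negative score.
    """
    if plan_cost(plan) > budget:
        return -1.0
    gain_rate = {"matching": 0.05, "mapping_result": 0.03, "duplicate_pair": 0.02}
    seen = {t: 0 for t in FEEDBACK_TYPES}
    score = 0.0
    for position, (t, n) in enumerate(plan):
        before = 1 - (1 - gain_rate[t]) ** seen[t]
        seen[t] += n
        after = 1 - (1 - gain_rate[t]) ** seen[t]
        score += (after - before) / (1 + 0.1 * position)
    return score


def random_plan(rng, budget, steps=3):
    """Draw a random plan that never exceeds the budget."""
    plan, remaining = [], budget
    for _ in range(steps):
        t = rng.choice(FEEDBACK_TYPES)
        n = rng.randint(0, remaining // UNIT_COST[t])
        plan.append((t, n))
        remaining -= UNIT_COST[t] * n
    return plan


def search_plan(budget, generations=40, pop_size=30, seed=0):
    """Small genetic algorithm: truncation selection, one-point crossover, mutation."""
    rng = random.Random(seed)
    population = [random_plan(rng, budget) for _ in range(pop_size)]

    def fitness(plan):
        return toy_fitness(plan, budget)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]  # one-point crossover
            if rng.random() < 0.3:  # mutation: replace one step of the plan
                i = rng.randrange(len(child))
                t = rng.choice(FEEDBACK_TYPES)
                child[i] = (t, rng.randint(0, budget // UNIT_COST[t]))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = search_plan(budget=100)
    print(best, plan_cost(best), round(toy_fitness(best, 100), 3))
```

Because the best half of each generation is carried over unchanged, the best plan found never scores lower than the best of the initial random plans. That property is specific to this sketch and is only loosely analogous to the comparison with randomly generated plans mentioned above.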

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2016