PhD Proposal: Techniques for Selecting and Integrating Dynamic Data Sources

Talk
Theodoros Rekatsinas
Time: 07.31.2014 10:00 to 11:30
Location: AVW 3450

Data is becoming a commodity in many public and enterprise application domains, and integrating data from multiple sources has tremendous value. At the same time, the number of data sources has risen rapidly due to recent developments in data publishing and availability over the web. The proliferation of services such as cloud-based data markets has facilitated the collection, publishing, and trading of data. Furthermore, the adoption of open data policies in both science and government has increased the amount of open-access data available without restrictions or fees, promoting the idea that data should be universally available. However, our ability to reason about a large number of data sources in a systematic way falls well short of what is needed to benefit from this abundance of data. Data sources are typically heterogeneous in their focus and content, often provide duplicate and conflicting information, and vary significantly in the accuracy and timeliness of the data they provide. When the number of data sources is large, humans have limited ability to extract accurate estimates of source authoritativeness and quality.
In this proposal, I study the problem of automated, principled, and efficient management of dynamic data sources. I introduce a framework that enables the discovery and selection of beneficial sources for diverse integration tasks. I show how historical snapshots of available data sources, providing both structured and unstructured data, can be collectively analyzed to assess and profile the content quality of sources and to extract the evolution patterns of the entire data domain. I then demonstrate how the quality profiles of individual sources can be used to estimate the integration quality for arbitrary sets of data sources without performing the actual integration. Furthermore, I introduce efficient algorithms with rigorous theoretical guarantees that use these estimates to select a near-optimal set of sources to be integrated. Finally, I present highly efficient algorithms for computing the integration coverage of dependent sources. In my proposed research, I plan to extend this framework into a fully functional data source management system for arbitrary data domains. I also intend to explore how crowdsourcing techniques can be used to extend the set of available sources and derive more accurate source quality estimates. Finally, I will study the implications of source interdependencies (e.g., content overlaps and copying patterns) in dynamic sources, and I aim to develop techniques for detecting how the benefit of integration changes in highly dynamic domains where new sources may appear or existing sources may disappear.
Examining Committee:
Committee Chair: Dr. Amol Deshpande
Dept's Representative: Dr. Mihai Pop
Committee Member(s): Dr. Lise Getoor