Data integration on complex data

Resource type
Thesis type
(Thesis) M.Sc.
Date created
More often than not, a data source can be modeled as a relational table. For various reasons, the schema information about those data sources may not be accurate, complete, or even available. A primary/foreign-key constraint explicitly defines how two tables should be joined. However, when no such constraint is available, there may be multiple ways to join the data, and different users may expect different join results. In this thesis, we first tackle this data integration challenge by investigating how to join tables guided by user preferences. To further improve the quality of data integration, data value similarity needs to be considered as well. Threshold-driven similarity join has been extensively studied in the past. However, the process of tuning a similarity threshold is tedious and error-prone. In this thesis, when performing a similarity join, we instead offer the user a few preferences to select from, rather than similarity thresholds. Once a particular preference is chosen, we automatically tune the threshold and return the corresponding similarity join result. Compared with state-of-the-art baselines, our work provides significantly better effectiveness while having comparable efficiency and scalability.
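To make the notion of a threshold-driven similarity join concrete, the following is a minimal illustrative sketch and not the method developed in the thesis. It assumes word-level Jaccard similarity as the similarity measure; the function names `jaccard` and `similarity_join` and the sample strings are hypothetical. It shows how the join result changes with the threshold, which is why tuning it by hand can be tedious.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x and not y:
        return 1.0  # two empty strings are treated as identical
    return len(x & y) / len(x | y)


def similarity_join(left, right, threshold):
    """Return every (l, r) pair whose similarity is at least `threshold`.

    Naive nested-loop join; hypothetical helper for illustration only.
    """
    return [(l, r) for l in left for r in right if jaccard(l, r) >= threshold]


# Hypothetical sample values from two tables with no shared key.
left = ["Simon Fraser University", "University of British Columbia"]
right = ["Simon Fraser Univ", "British Columbia University"]

strict = similarity_join(left, right, 0.9)    # high threshold
moderate = similarity_join(left, right, 0.6)  # medium threshold
loose = similarity_join(left, right, 0.5)     # low threshold

for name, result in [("0.9", strict), ("0.6", moderate), ("0.5", loose)]:
    print(f"threshold {name}: {len(result)} matching pair(s)")
```

With these sample values, the number of joined pairs grows as the threshold is lowered, so a user who only knows the kind of result they want would have to search over thresholds by trial and error.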
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Pei, Jian
Member of collection
Attachment Size
etd20446.pdf 1.35 MB