Gao, Chuancong

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2019-08-22

Authors/Contributors

Author: Gao, Chuancong

Abstract

More often than not, a data source can be modeled as a relational table. For various reasons, the schema information about such data sources may not be accurate, complete, or even available. A primary/foreign-key constraint explicitly defines how two tables should be joined. However, when such a constraint is not applicable, there may be multiple ways to join the data, and different users may expect different join results. In this thesis, we first tackle this data integration challenge by investigating how to join tables guided by user preferences. To further improve the quality of data integration, the similarity of data values needs to be considered as well. Threshold-driven similarity join has been extensively studied in the past. However, the process of tuning a similarity threshold is tedious and error-prone. In this thesis, when performing a similarity join, we further seek to offer the user a few preferences to select from, instead of similarity thresholds. Once a particular preference is chosen, we automatically tune the threshold and return the corresponding similarity join result. Compared to state-of-the-art baselines, our work provides significantly better effectiveness while having comparable efficiency and scalability.

Keywords

Identifier

etd20446

Copyright statement

Copyright is held by the author.

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor: Pei, Jian

Member of collection

Computing Science Theses

Download file	Size
etd20446.pdf	1.35 MB

Data integration on complex data
