Wang, Pei

Resource type

Thesis

Thesis type

(Thesis) Ph.D.

Date created

2021-03-18

Authors/Contributors

Author: Wang, Pei

Abstract

Data preparation is the process of transforming raw data into a clean and consumable format. It is widely known as the bottleneck to extract value and insights from data, due to the number of possible tasks in the pipeline and factors that can largely affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners devised a great variety of techniques and tools over the decades, while many of them still place a significant burden on human's side to configure the suitable input rules and parameters. In this thesis, with the goal of reducing human manual effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends from data and deducing properties of an underlying distribution of probability from a sample, for example, testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically figures out the queries for data enrichment via web API data, by estimating the potential benefit of issuing a certain query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end-users get results with no efforts. Finally, we introduce AutoML-EM, aiming to automate the entity matching model development process. Entity matching is to find the identical entities in real-world. Our work provides powerful angles to automate the process of various data preparation steps, and we conclude this thesis by discussing future directions.

Keywords

Identifier

etd21305

Copyright statement

Copyright is held by the author(s).

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Supervisor or Senior Supervisor

Thesis advisor: Wang, Jiannan

Language

English

Member of collection

Computing Science Theses

Download file	Size
input_data\21387\etd21305.pdf	6.01 MB

Automating data preparation with statistical analysis

Keywords

Views & downloads - as of June 2023