Zheng, Weiling

Thesis type

(Thesis) M.Sc.

Date created

2021-08-09

Authors/Contributors

Author: Zheng, Weiling

Abstract

Feature discovery from a relational table is a critical but challenging problem in machine learning model development. Discovering a good feature can significantly boost model performance. A typical feature discovery procedure consists of four steps: 1) come up with a SQL query; 2) execute the query on the table to generate a feature; 3) add the feature to the training data; 4) retrain a model to check the improvement in model performance. The first step is the most challenging one, since there can be millions or even billions of candidate queries to consider. Unfortunately, this scalability issue has not been studied in prior work. In this paper, we propose SQLGEN, an automated SQL query generation framework to solve this problem. SQLGEN allows a data scientist to specify a large pool of queries using a query template. Instead of worrying about which query to pick from the large query pool, she can leverage SQLGEN to automatically search for the best query. The key insight is to model the query search as a hyperparameter tuning problem. However, the differences between the two problems make SQLGEN ineffective when a hyperparameter tuning algorithm (TPE) is applied directly. To improve SQLGEN, we propose and implement three optimization techniques: i) cheaper evaluation; ii) two rounds of TPE; and iii) a learned mapping function. We conduct extensive experiments to evaluate SQLGEN on real ML datasets. The results show that SQLGEN outperforms baselines by a large margin. The case studies demonstrate that SQLGEN can automatically find highly effective features that are missed by FeatureTools.
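
The four-step procedure and the idea of a query template can be illustrated with a small sketch. This is not SQLGEN's implementation. The `train` and `orders` tables, their columns, the template, and the search space are all hypothetical. Two simplifications are assumed: the pool is searched exhaustively rather than with TPE, and the absolute Pearson correlation between the feature and the label stands in for retraining a model.

```python
import itertools
import sqlite3

# Hypothetical toy data: one row per user in the training table,
# several rows per user in the relational table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train (user_id INTEGER PRIMARY KEY, label REAL);
    CREATE TABLE orders (user_id INTEGER, amount REAL, n_items INTEGER);
    INSERT INTO train VALUES (1, 1.0), (2, 0.0), (3, 1.0), (4, 0.0);
    INSERT INTO orders VALUES
        (1, 90.0, 1), (1, 80.0, 2), (2, 10.0, 5), (2, 15.0, 1),
        (3, 70.0, 3), (3, 95.0, 1), (4, 20.0, 2), (4, 5.0, 4);
""")

# Assumed query template: each placeholder plays the role of a hyperparameter.
TEMPLATE = (
    "SELECT t.user_id, {agg}(o.{col}) AS feature "
    "FROM train t LEFT JOIN orders o "
    "ON o.user_id = t.user_id AND o.amount >= {min_amount} "
    "GROUP BY t.user_id ORDER BY t.user_id"
)
SEARCH_SPACE = {
    "agg": ["AVG", "MAX", "COUNT"],
    "col": ["amount", "n_items"],
    "min_amount": [0, 10, 50],
}


def query_pool():
    """Step 1: enumerate every query the template can produce."""
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*SEARCH_SPACE.values()):
        yield TEMPLATE.format(**dict(zip(keys, values)))


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def score(sql):
    """Steps 2-4: run the query, align the feature with the labels, score it.

    The correlation is only a cheap proxy for retraining a model.
    """
    labels = [row[0] for row in
              conn.execute("SELECT label FROM train ORDER BY user_id")]
    # Users with no matching rows get NULL from AVG/MAX; treat that as 0.0.
    feature = [row[1] if row[1] is not None else 0.0
               for row in conn.execute(sql)]
    return abs(pearson(feature, labels))


pool = list(query_pool())
best_sql = max(pool, key=score)
best_score = score(best_sql)
print(len(pool), "candidate queries; best proxy score:", round(best_score, 3))
print(best_sql)
```

Even this tiny template yields 18 queries, and the pool grows multiplicatively with every placeholder added, which is why exhaustive evaluation stops being practical and a guided search becomes attractive.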

Keywords

Identifier

etd21504

Copyright statement

Copyright is held by the author(s).

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Supervisor or Senior Supervisor

Thesis advisor: Wang, Jiannan

Language

English

Member of collection

Computing Science Theses

An automated SQL query generation framework for scalable feature discovery
