An automated SQL query generation framework for scalable feature discovery

Thesis type
(Thesis) M.Sc.
Date created
2021-08-09
Authors/Contributors
Abstract
Feature discovery from a relational table is a critical but challenging problem in machine learning model development. Discovering a good feature can significantly boost model performance. A typical feature discovery procedure consists of four steps: 1) come up with a SQL query; 2) execute the query on the table to generate a feature; 3) add the feature to the training data; 4) retrain a model to check whether model performance improves. The first step is the most challenging one, since there can be millions or even billions of candidate queries to consider. Unfortunately, this scalability issue has not been studied in prior work. In this paper, we propose SQLGEN, an automated SQL query generation framework to solve this problem. SQLGEN allows a data scientist to specify a large pool of queries using a query template. Instead of worrying about which query to pick from the large query pool, she can leverage SQLGEN to automatically search for the best query. The key insight is to model query search as a hyperparameter tuning problem. However, the differences between the two problems make SQLGEN ineffective when a hyperparameter tuning algorithm (TPE) is applied directly. We propose and implement three optimization techniques to improve SQLGEN: i) cheaper evaluation; ii) two rounds of TPE; and iii) a learned mapping function. We conduct extensive experiments to evaluate SQLGEN on real ML datasets. The results show that SQLGEN outperforms baselines by a large margin. The case studies demonstrate that SQLGEN can automatically find highly effective features that FeatureTools misses.
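The four-step procedure and the idea of treating a query template as a search space can be illustrated with a small sketch. This is not SQLGEN itself: the table, the template, the slot names, and all function names below are hypothetical, plain random search stands in for TPE, and the absolute Pearson correlation between the feature and the label stands in for retraining a model. It only shows how each choice of template slots yields one candidate query that can be scored.

```python
import random
import sqlite3
import statistics

# Hypothetical query template: each combination of slot values is one candidate query.
TEMPLATE = (
    "SELECT user_id, {agg}(amount) AS feature "
    "FROM purchases WHERE day >= {min_day} GROUP BY user_id"
)
# The slots play the role of hyperparameters (4 * 30 = 120 candidate queries here).
SEARCH_SPACE = {
    "agg": ["SUM", "AVG", "MAX", "COUNT"],
    "min_day": list(range(30)),
}


def build_demo_db(seed=0):
    """Create a synthetic table whose label depends on spending from day 20 onward."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user_id INTEGER, day INTEGER, amount REAL)")
    labels = {}
    for user_id in range(200):
        recent_total = 0.0
        for day in range(30):
            amount = rng.uniform(0, 10)
            conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", (user_id, day, amount))
            if day >= 20:
                recent_total += amount
        labels[user_id] = recent_total + rng.gauss(0, 1)
    return conn, labels


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def evaluate(conn, labels, params):
    # Step 1: instantiate a query (slot values come only from SEARCH_SPACE).
    query = TEMPLATE.format(**params)
    # Step 2: execute the query to generate a feature per user.
    feature = dict(conn.execute(query).fetchall())
    # Step 3: align the feature with the training rows.
    users = sorted(labels)
    xs = [feature.get(u, 0.0) for u in users]
    ys = [labels[u] for u in users]
    # Step 4: score the feature (cheap proxy, an assumption, instead of retraining).
    return abs(pearson(xs, ys))


def random_search(conn, labels, n_trials=40, seed=0):
    """Sample slot values uniformly and keep the best-scoring query parameters."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = evaluate(conn, labels, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    conn, labels = build_demo_db()
    params, score = random_search(conn, labels)
    print(TEMPLATE.format(**params), round(score, 3))
```

In this toy setting the search space is small enough to enumerate. The abstract's point is that realistic templates produce pools far too large for that, which is why a guided search such as TPE, together with cheaper evaluation, becomes necessary.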
Document
Identifier
etd21504
Copyright statement
Copyright is held by the author(s).
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Wang, Jiannan
Language
English
Member of collection
Download file Size
etd21504.pdf 2.16 MB

Views & downloads - as of June 2023

Views: 19
Downloads: 0