Feature discovery from a relational table is a critical but challenging problem in machine learning model development, since a single good feature can significantly boost model performance. A typical feature discovery procedure consists of four steps: 1) come up with a SQL query; 2) execute the query on the table to generate a feature; 3) add the feature to the training data; 4) retrain a model to check whether model performance improves. The first step is the most challenging one because there can be a huge number of candidate queries (millions or even billions) to consider. Unfortunately, this scalability issue has not been studied in prior work. In this paper, we propose SQLGEN, an automated SQL query generation framework that addresses this problem. SQLGEN allows a data scientist to specify a large pool of queries using a query template. Instead of worrying about which query to pick from the pool, she can leverage SQLGEN to automatically search for the best query. The key insight is to model query search as a hyperparameter tuning problem. However, the two problems differ, and these differences make SQLGEN ineffective when a hyperparameter tuning algorithm (TPE, the Tree-structured Parzen Estimator) is applied directly. We therefore propose and implement three optimization techniques to improve SQLGEN: i) cheaper evaluation; ii) two rounds of TPE; iii) a learned mapping function. We conduct extensive experiments to evaluate SQLGEN on real ML datasets. The results show that SQLGEN outperforms the baselines by a large margin. The case studies demonstrate that SQLGEN can automatically find highly effective features that FeatureTools misses.
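To make the four-step procedure and the idea of a query template concrete, the following minimal sketch enumerates a small query pool from a template, executes each query to generate a feature, and scores it. It is an illustration under stated assumptions, not SQLGEN itself: the `users` and `orders` tables, their columns, the template, and all function names are hypothetical; exhaustive enumeration stands in for the TPE-based search; and absolute Pearson correlation with the label stands in for retraining a model.

```python
import itertools
import math
import sqlite3

# Hypothetical query template: each (agg, window) pair instantiates one query.
TEMPLATE = (
    "SELECT u.user_id, u.label, COALESCE({agg}(o.amount), 0) AS feature "
    "FROM users u LEFT JOIN orders o "
    "ON o.user_id = u.user_id AND o.days_ago <= {window} "
    "GROUP BY u.user_id, u.label ORDER BY u.user_id"
)
AGGS = ["SUM", "AVG", "MAX", "COUNT"]
WINDOWS = [7, 30, 365]


def build_demo_db():
    """Create a tiny in-memory database with made-up data (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, label REAL)")
    conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL, days_ago INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, 1.0), (2, 0.0), (3, 1.0), (4, 0.0)],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 50.0, 2), (1, 70.0, 40), (2, 5.0, 3), (2, 8.0, 60),
         (3, 90.0, 5), (3, 20.0, 10), (4, 4.0, 1), (4, 100.0, 90)],
    )
    return conn


def enumerate_queries():
    """Step 1: yield every (params, sql) pair the template can produce."""
    for agg, window in itertools.product(AGGS, WINDOWS):
        yield (agg, window), TEMPLATE.format(agg=agg, window=window)


def abs_pearson(xs, ys):
    """Stand-in for steps 3-4: score a feature without retraining a model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:  # a constant feature carries no signal
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def search_best_query(conn):
    """Exhaustively evaluate the pool; returns (score, params, sql)."""
    results = []
    for params, sql in enumerate_queries():
        rows = conn.execute(sql).fetchall()  # Step 2: generate the feature
        labels = [row[1] for row in rows]
        feature = [float(row[2]) for row in rows]
        results.append((abs_pearson(feature, labels), params, sql))
    return max(results, key=lambda r: r[0])


if __name__ == "__main__":
    score, params, sql = search_best_query(build_demo_db())
    print(params, round(score, 3))
    print(sql)
```

The pool here has only 12 queries, so evaluating all of them is trivial. With richer templates the pool grows multiplicatively with each template slot, which is why exhaustive evaluation stops being feasible and a guided search becomes necessary.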
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Wang, Jiannan