Supporting SQL-ML queries in database systems has recently attracted great attention in industry. A SQL-ML query treats ML models as user-defined functions (UDFs) and embeds them into a SQL query. Since ML models do not always produce perfect predictions, a user may find that the answer to a SQL-ML query differs from what she expects and ask the system to provide an explanation. Although SQL-only and ML-only explanations have been well studied in the literature, to the best of our knowledge, we are the first to study the SQL-ML explanation problem. This thesis makes two major contributions. First, we propose a formal definition of the SQL-ML explanation problem. Intuitively, our definition aims to trace the query answer back to the training data and to identify a small number of training examples that have the biggest impact on the query answer. Second, we study how to extend existing explanation frameworks to solve our problem and discuss their limitations. To overcome these limitations, we propose InfComp, a novel influence-function-based approach for SQL-ML explanation. We find that InfComp is a powerful tool for debugging training data (i.e., detecting corrupted features and mislabeled instances). We conduct extensive experiments on three real applications (Entity Resolution, Image Classification, and Spam Detection) and compare against the state-of-the-art approaches. The results show that InfComp identifies erroneous training examples more accurately than the baselines, and does so efficiently.
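To make the notion of a SQL-ML query concrete, the following is a minimal, hypothetical sketch (not from the thesis) of registering a model as a UDF in SQLite and calling it from SQL; the "model" here is a toy stand-in classifier, and all names are illustrative.

```python
import sqlite3

def predict_spam(body):
    # Toy stand-in for a trained ML model: flags long messages as spam.
    # A real SQL-ML system would invoke an actual learned classifier here.
    return 1 if len(body) > 20 else 0

conn = sqlite3.connect(":memory:")
# Register the model as a user-defined function named PREDICT_SPAM.
conn.create_function("PREDICT_SPAM", 1, predict_spam)

conn.execute("CREATE TABLE emails (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO emails VALUES (?, ?)",
                 [(1, "hi there"), (2, "x" * 30)])

# A SQL-ML query: the model is called like any other SQL function,
# so the query answer depends on the model's (possibly imperfect) predictions.
rows = conn.execute(
    "SELECT id FROM emails WHERE PREDICT_SPAM(body) = 1"
).fetchall()
print(rows)
```

If the model mispredicts, the returned ids change, which is exactly the situation where a user would ask for an explanation tracing the answer back to the training data.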
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Wang, Jiannan