Error detection is key for data quality management. Leveraging domain knowledge in the form of user-specified constraints is one of the major approaches to error detection. A recent trend in error detection is the use of approximate constraints (ACs), which a relation is expected to satisfy only to a certain degree rather than completely. One example is the recently introduced statistical constraints, which let the user specify which correlations among attributes they expect to be present or absent in the data. Statistical constraints allow the user to express a broad range of statistical and causal domain knowledge. Extensive empirical investigations indicate that even traditional integrity constraints such as functional dependencies hold only approximately in real-world datasets, and approximate functional dependencies (AFDs) have been a data cleaning tool for some time. This thesis introduces a new technique for enhancing error detection with approximate constraints. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may consist of the records from before a certain year, or from a certain location. We describe an efficient algorithm, based on a recursive tree partitioning scheme, for identifying distinct data regions that violate given ACs to different degrees. The learned trees describe the error regions in terms of data attributes that are easily interpreted by users (e.g. all records before 2003), which helps to explain to the user why some records were identified as likely errors. After identifying error regions, we can apply error detection methods to each error region separately, rather than to the dataset as a whole.
Our empirical evaluation, conducted on four datasets containing both real-world and synthetic errors, shows that identifying error regions increases both the precision and the recall of error detection based on ACs. Error regions can be combined not only with constraint-based error detection but also with other approaches, such as those based on machine learning. Our experiments provide evidence that error regions also boost the performance of machine learning methods.
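To make the idea of context-sensitive violation degrees concrete, the following is a minimal sketch, not the algorithm developed in the thesis. It assumes the AC is an approximate functional dependency, measures violation with a g3-style score (the fraction of records that would have to be removed for the dependency to hold exactly), and recursively splits on numeric attributes wherever the two sides differ most in violation degree. The function names, the gap-based split criterion, the `min_size` and `min_gap` parameters, and the toy records are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def afd_violation(rows, lhs, rhs):
    """g3-style degree: fraction of rows to drop so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[lhs]][r[rhs]] += 1
    # For each lhs value, keep only the rows carrying its most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1.0 - kept / len(rows)

def best_split(rows, lhs, rhs, split_attrs, min_size):
    """Return (gap, attr, threshold, left, right) with the largest violation gap."""
    best = None
    for attr in split_attrs:
        for t in sorted({r[attr] for r in rows})[1:]:
            left = [r for r in rows if r[attr] < t]
            right = [r for r in rows if r[attr] >= t]
            if min(len(left), len(right)) < min_size:
                continue
            gap = abs(afd_violation(left, lhs, rhs) - afd_violation(right, lhs, rhs))
            if best is None or gap > best[0]:
                best = (gap, attr, t, left, right)
    return best

def error_regions(rows, lhs, rhs, split_attrs, min_size=2, min_gap=0.1, path=()):
    """Recursively partition rows; return leaves as (path, violation, size)."""
    split = best_split(rows, lhs, rhs, split_attrs, min_size)
    if split is None or split[0] < min_gap:
        return [(path, afd_violation(rows, lhs, rhs), len(rows))]
    _, attr, t, left, right = split
    return (error_regions(left, lhs, rhs, split_attrs, min_size, min_gap,
                          path + ((attr, "<", t),))
            + error_regions(right, lhs, rhs, split_attrs, min_size, min_gap,
                            path + ((attr, ">=", t),)))

# Hypothetical toy data: zip -> city is violated only in records before 2003.
rows = [
    {"year": 2000, "zip": "A", "city": "X"},
    {"year": 2001, "zip": "A", "city": "Y"},
    {"year": 2002, "zip": "B", "city": "Z"},
    {"year": 2002, "zip": "B", "city": "W"},
    {"year": 2003, "zip": "A", "city": "X"},
    {"year": 2004, "zip": "A", "city": "X"},
    {"year": 2005, "zip": "B", "city": "Z"},
    {"year": 2006, "zip": "B", "city": "Z"},
]
overall = afd_violation(rows, "zip", "city")
regions = error_regions(rows, "zip", "city", ["year"])
for path, degree, size in regions:
    print(path, degree, size)
```

On this toy data, the region described by `year < 2003` violates the dependency to a higher degree than the dataset as a whole, so a detector applied to that region alone would concentrate on the records most likely to be erroneous.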
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Schulte, Oliver