Sensitive disclosures under differential privacy guarantees

Resource type
Thesis type
(Thesis) Ph.D.
Date created
Author: Han, Chao
Most syntactic methods treat non-independent reasoning (NIR) as a privacy violation and smooth the distribution of published data to avoid sensitive NIR, where NIR means that information about one record in the data can be learned from information about other records in the data. The drawback of this approach is that it limits the utility of learning statistical relationships. The differential privacy criterion does not treat NIR as a privacy violation and therefore enables learning statistical relationships, but at the cost of potential disclosures through NIR. In this thesis, we investigate the extent to which private information about an individual may be disclosed through NIR by query answers that satisfy differential privacy. We first define what a disclosure through NIR by randomized query answers means, then present a formal analysis of such disclosures by differentially private query answers. Our analysis on real-life data sets demonstrates that while disclosures through NIR can be eliminated by adopting a more restrictive setting of differential privacy, such settings adversely affect the utility of query answers for data analysis, and this conflict cannot be easily resolved because both disclosures and utility depend on the accuracy of noisy query answers. This study suggests that, under the assumption that disclosure through NIR is a privacy concern, differential privacy is not suitable because it does not provide both privacy and utility.
The question is whether it is possible to (1) allow learning statistical relationships, yet (2) prevent sensitive NIR about an individual. In the second part of the thesis, we present a data perturbation and sampling method that achieves both (1) and (2). The enabling mechanism is a new privacy criterion that distinguishes the two types of NIR in (1) and (2) with the help of the law of large numbers. In particular, record sampling effectively prevents the sensitive disclosure in (2) while having less effect on the statistical learning in (1). The data perturbation and sampling method is evaluated on real-life data sets in terms of both sensitive disclosures and utility. Empirical results confirm that disclosures can be prevented with minor loss of utility.
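As a rough illustration of how NIR can survive differentially private query answers, the sketch below adds Laplace noise to two count queries and estimates how common a sensitive value is within a group. It is not the analysis or the method from the thesis: the function names, the group size, the counts, and the choice of the standard Laplace mechanism are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    # A count query has sensitivity 1, so the Laplace mechanism
    # uses noise with scale 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def estimated_rate(group_size, sensitive_count, epsilon, rng):
    # Two noisy answers are combined; under sequential composition the
    # total privacy cost of this estimate is 2 * epsilon.
    n = noisy_count(group_size, epsilon, rng)
    s = noisy_count(sensitive_count, epsilon, rng)
    return s / n if n > 0 else float("nan")


# Hypothetical data: 1000 records share the same public attributes,
# and 900 of them carry the sensitive value.
rng = random.Random(0)
rate = estimated_rate(1000, 900, epsilon=0.5, rng=rng)
print(f"estimated rate of the sensitive value: {rate:.3f}")
```

With these assumed numbers the noise (scale 2) is small relative to the counts, so the estimate stays close to the true rate of 0.9, and an observer can infer that any individual in the group very likely has the sensitive value. Choosing a much smaller epsilon makes the estimate unreliable, but it degrades the same statistic for legitimate analysis, which is the conflict the abstract describes.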
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Wang, Ke
Member of collection
Attachment Size
etd9475_CHan.pdf 675.84 KB