On the utility of randomization approaches for privacy preserving data publishing

Resource type
Thesis type
(Thesis) Ph.D.
Date created
2010-06-08
Authors/Contributors
Abstract
In today's electronic society, collecting and selling information is a big business. Everywhere you turn, someone is asking for your personal information. In this thesis, we discuss the Privacy Preserving Data Publishing problem, which involves protecting individual privacy, while at the same time, extracting useful knowledge that may benefit society as a whole. Recent work shows that traditional partition-based approaches to this problem are susceptible to background knowledge attacks and therefore cannot adequately protect individual privacy. To overcome this limitation, several randomization-based approaches have been proposed. With stronger privacy guarantees, not to mention faster runtimes, this promising new field of research is worthwhile studying, especially since there are many open questions. In particular, there is a lack of work on randomization approaches that maximize utility. In fact, partition-based advocates often criticize randomization-based approaches, rightly arguing that there is no point in publishing data in the first place if it is not useful for extracting knowledge. In this thesis, we aim to elevate the utility of randomization approaches for Privacy Preserving Data Publishing. Specifically, our goal is to increase the probability that a record in the dataset retains its sensitive value, thereby decreasing distortion, and increasing utility. We propose two different algorithms that achieve this goal. Perturbation Partitioning is the first algorithm to increase retention probability through independent random perturbation on sub-tables of the original table. Intuitively, a sub-table will have a smaller domain, which decreases the choices for changing a sensitive value to some other value, and therefore increases the retention probability. Empirically, we show a significant decrease in the reconstruction error for count queries for Perturbation Partitioning compared to conventional randomization- and partition-based algorithms. Fine-Grain Perturbation is the first algorithm to find an optimal perturbation operator satisfying privacy constraints at a fine granularity. Our key observation is that not all sensitive values are equally sensitive. Conventional perturbation operators adhere to a uniform privacy specification and therefore can overprotect less-sensitive values. Intuitively, if retaining more less-sensitive values is allowed, more data may be retained overall. Empirically, we show that Fine-Grain Perturbation always retains more data than Uniform Perturbation.
Document
Identifier
etd6053
Copyright statement
Copyright is held by the author.
Permissions
The author granted permission for the file to be printed and for the text to be copied and pasted.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Wang, Ke
Member of collection
Attachment Size
etd6053_RChaytor.pdf 1.86 MB