The Complicated Provenance of American Community Survey Data: How Far Will PROV and DDI Take Us?

Peer reviewed: 
No, item is not peer reviewed.
Date created: 

In a series of three papers in 2013, researchers at the Cornell Node of the NSF Census Research Network investigated and proposed solutions for two fundamental yet distinct issues in the curation of quantitative social science data: confidentiality and provenance. We argued that the W3C PROV model, a foundation for semantically rich, interoperable, and web-compatible provenance metadata, is especially important in a web environment in which data from distributed sources and of varying integrity can be combined and derived. In this paper we combine and expand upon these two separate threads, confidentiality and provenance, and experiment with the use of PROV and DDI in documenting the complex provenance chain between the highly confidential environment of the U.S. Census Bureau and restricted and public versions of internal census demographic files. In particular, our presentation will report on our efforts to 1) test PROV's ability to describe meaningful relationships between confidential, restricted, and public data at the variable level, and 2) develop a user interface for researchers attempting to understand the relationships between distinct versions of confidential, restricted, and public census files. In the longer term, our work should produce a useful metadata resource for users of public and restricted American Community Survey data.


William Block (Cornell University), Warren Brown, Jeremy Williams, Lars Vilhuber, Carl Lagoze

Document type: 
Conference presentation
Copyright remains with the author.