The Crumbling Wall: Data Archiving and Reproducibility in Published Science

Peer reviewed: 
No, item is not peer reviewed.
Date created: 

Data are the foundation of empirical research, yet all too often the datasets underlying published papers are lost or poorly curated. This is a serious issue, because future researchers are then unable to validate published results, and the data cannot be used to explore new ideas and hypotheses. As part of a study on how the availability of research data is affected by article age, we emailed authors to request the raw data from 516 published articles. These 516 studies were all published between 1991 and 2011, and each included a Discriminant Function Analysis (DFA) on morphometric data from animals or plants. We found that broken email addresses and outdated storage media were the main obstacles to obtaining the data, such that we received only 101 datasets. However, even when we did receive a data file, there was no guarantee that it matched the exact dataset used in the study itself. To assess how often problems with metadata or data curation affect reproducibility, we tried to recreate the DFA results reported in each paper. Nine papers did not present common types of quantitative results from their DFA and were excluded. For an additional 15 papers we were unable to relate the dataset we received to the one used in the original DFA. The reasons included incomprehensible or absent variable labels, the DFA being performed on an unspecified random subset of the data, and incomplete datasets. For another 20 papers, the dataset seemed to correspond to the one in the paper but we could not come close to recreating the authors’ results, which (of course) may stem from an error on either our or the authors’ part. We were able to exactly repeat the DFA results from 29 papers, and came very close with an additional 17.
Our results illustrate the disconnect between the carefully documented and repeatable science we learned about in school and the grim reality of the current situation: many datasets are lost within a few years, and a substantial proportion of the remainder are rendered useless by poor data curation.


Tim Vines, Arianne Y.K. Albert, Rose L. Andrew, Dan G. Bock, Sébastien Renaut, Diana J. Rennison (University of British Columbia)

Document type: 
Conference presentation
Copyright remains with the author.