73 to 93 percent of published research cannot produce its underlying data on request
This is the empirical baseline against which the rest of the paper is read. Four direct-contact studies spanning two decades, four disciplines, and four funder regimes converge on the same finding: for a substantial majority of published research, the underlying data cannot be produced when an outside party requests it.
- Vines 2014 (S-0001): ~81% non-delivery (19% delivered), 516 ecology and evolutionary biology papers.
- Wicherts 2006 (S-0003): 73% non-compliance, 141 APA psychology papers.
- Acciai 2023 (S-0004): 86% non-sharing, 1,634 PNAS and Nature-portfolio papers from 2017–2021.
- Gabelica 2022 (S-0002): 93% non-compliance, 1,792 biomedical papers whose authors had explicitly committed to share.
Non-delivery ranges from 73% to 93%. The studies cover ecology and evolutionary biology, psychology, biomedicine, and high-impact general-science journals, and were published between 2006 and 2023. All four use direct-contact methodology: the cited measure is what authors actually delivered when asked, not what data-availability statements promised.
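The headline range can be reproduced from the four figures listed above. The following is a minimal sketch, not part of the cited studies: the `Study` record and the `non_delivery_range` helper are hypothetical names introduced here for illustration. The only transformation applied is converting the Vines figure, reported as 19% delivered, to its complement so that all four rates measure the same thing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """One direct-contact study, as listed in this section (hypothetical record type)."""
    label: str
    source_id: str
    papers: int
    non_delivery_pct: int  # percent of papers whose data was not delivered on request


# Figures are copied from the list above; nothing is estimated or pooled.
STUDIES = [
    Study("Vines 2014", "S-0001", 516, 100 - 19),  # reported as 19% delivered
    Study("Wicherts 2006", "S-0003", 141, 73),
    Study("Acciai 2023", "S-0004", 1634, 86),
    Study("Gabelica 2022", "S-0002", 1792, 93),
]


def non_delivery_range(studies):
    """Return (lowest, highest) non-delivery percentage across the studies."""
    rates = [s.non_delivery_pct for s in studies]
    return min(rates), max(rates)


low, high = non_delivery_range(STUDIES)
print(f"Non-delivery range: {low}-{high}%")  # prints "Non-delivery range: 73-93%"
```

The sketch deliberately reports only the minimum and maximum. It does not compute a sample-weighted average, because the four studies differ in discipline, period, and compliance definition.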
This is the steady state, not a probabilistic forecast. The events that produce it — drives lost, laptops stolen, repositories shut down, personnel departed, formats obsoleted, backups overwritten — have already happened across most of the institution's published output. The question subsequent sections answer is not how often loss occurs, but what the accumulated loss costs when something forces it to surface.