Resilient Data Futures
Method · M-0003 · draft

The Four-Term Liability Formula (A + B + C + D)

§5.2 · 2026-05-03 · 4 out · 4 in

For each dataset in the unretrievable fraction (the 73-93% baseline), the institution carries a liability equal to the sum of four terms.

A — Sunk grant value. The dollars that originally produced the data, available directly from the award record. For a multi-year PI grant at a research university, A typically ranges from several hundred thousand to several million dollars. Stern et al. 2014 (S-0050) measured the median NIH direct cost per retracted paper at $239,381 and the mean at $392,582 — a defensible per-paper attribution when grant-specific data is not available.

B — Replacement cost. When reconstruction is feasible, B can run significantly higher than A because field conditions, cohorts, and experimental alignments rarely reassemble; re-collection carries full new-project overhead rather than incremental cost. When reconstruction is not feasible — long-term ecological time series, decommissioned cohorts, datasets tied to one-time events — B is maximal rather than zero, because the data cannot be purchased back at any price and the institution has destroyed an irreplaceable capital asset.

C — Downstream value lost. The citation stream, follow-on grants, regulatory products, and reuse the dataset would have generated across its useful life. Piwowar & Vision 2013 (S-0055) documented a 9% citation advantage for papers with open data and estimated 150 reuse papers generated per 100 deposited datasets within 5 years. C typically exceeds A on its own for long-horizon research.

D — False Claims Act exposure. Settlement risk under the FCA, which attaches to false or fraudulent claims connected to federal grant funds. The "reckless disregard" standard, combined with the implied-certification theory established in Universal Health Services v. Escobar (2016), extends liability beyond deliberate fraud to certifications of compliance the institution has no infrastructure to verify.

A + C + D is a lower bound on per-dataset liability; A + B + C + D is the maximum exposure under full surfacing. Expected annual loss is the four-term sum weighted by the probability of surfacing, which is currently low but rising along three vectors (funder verification, FCA precedent, regulatory convergence).
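The probability weighting can be sketched in Python. This is a minimal illustration rather than part of the method itself: the class and function names, the single portfolio-wide surfacing probability, and every dollar figure are assumptions made for the example. It treats the liability as the full four-term sum and leaves B as a caller-supplied estimate, since the text assigns no numeric value to the irreplaceable case.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DatasetLiability:
    """Four-term liability for one unretrievable dataset, in dollars.

    All field names are hypothetical; the values must come from the
    institution's own records and estimates.
    """
    a_sunk_grant: float      # A: dollars that originally produced the data
    b_replacement: float     # B: caller-supplied replacement-cost estimate
    c_downstream: float      # C: estimated downstream value lost
    d_fca_exposure: float    # D: estimated False Claims Act settlement risk

    def total(self) -> float:
        """Return A + B + C + D for this dataset."""
        return (self.a_sunk_grant + self.b_replacement
                + self.c_downstream + self.d_fca_exposure)


def expected_annual_loss(datasets: Iterable[DatasetLiability],
                         p_surfacing: float) -> float:
    """Weight the summed liability by an annual probability of surfacing.

    Assumes one probability applies to every dataset, which is a
    simplification made for this sketch.
    """
    if not 0.0 <= p_surfacing <= 1.0:
        raise ValueError("p_surfacing must be between 0 and 1")
    return p_surfacing * sum(d.total() for d in datasets)


# Illustrative figures only; none of these numbers come from the source.
example = DatasetLiability(
    a_sunk_grant=400_000,
    b_replacement=600_000,
    c_downstream=500_000,
    d_fca_exposure=100_000,
)
print(expected_annual_loss([example], p_surfacing=0.02))
```

With the liability terms held fixed, `p_surfacing` is the only input that moves the result, which mirrors the point that the probability, not the liability, is what varies from year to year.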

The formula does not describe a cost incurred when something breaks. It describes a liability the institution already carries on 73-93% of its published output. What varies year to year is not the liability but the probability that some fraction of it surfaces.