Resilient Data Futures

Composed narrative · anchored at Q-0003

What does research data loss cost institutions and science?

Discourse graphs aren't designed to be read linearly. Once a graph exists — whether decomposed from a paper, accumulated through contributions, or both — narratives are composed by anchoring at a node and traversing its neighborhood. Each is a dated view of the argument at the moment it was rendered; regenerated next year against a graph that has accumulated new supporting and opposing evidence, the same anchor produces a different telling.

This page is one such narrative. The bundle around Question Q-0003 was assembled by a semantic walk: a type-aware traversal driven by argumentative role rather than hop count. From a Question anchor, the walk pulls in the Claims that address it, each Claim's supporting and opposing Evidence with their Sources, the Methods used, and counter-Claims expanded recursively along opposes-chains; related Questions join when at least one in-bundle Claim addresses them (wide breadth). The walk stops on topology, not depth. The bundle is then handed to an LLM to generate the narrative, a step whose output should improve as models improve.
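The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind this page: the node types come from the description, but the edge labels (`addresses`, `supports`, `opposes`, `cites`, `uses`), the `(source, relation, target)` edge layout, and the `min_shared_claims` parameter are all assumptions made for the example.

```python
from collections import defaultdict


def semantic_walk(nodes, edges, anchor, min_shared_claims=1):
    """Collect the bundle around a Question anchor.

    nodes: dict mapping node id -> type ("Question", "Claim", "Evidence",
           "Source", "Method"). edges: iterable of (src, relation, dst).
    The schema is hypothetical; only the traversal idea follows the text.
    """
    incoming = defaultdict(list)  # dst -> [(relation, src)]
    outgoing = defaultdict(list)  # src -> [(relation, dst)]
    for src, rel, dst in edges:
        incoming[dst].append((rel, src))
        outgoing[src].append((rel, dst))

    bundle = {anchor}
    # Start from the Claims that address the anchor Question.
    frontier = [src for rel, src in incoming[anchor]
                if rel == "addresses" and nodes[src] == "Claim"]

    # No depth limit: the loop ends when no unvisited Claim remains,
    # so the walk stops on topology rather than hop count.
    while frontier:
        claim = frontier.pop()
        if claim in bundle:
            continue
        bundle.add(claim)
        for rel, src in incoming[claim]:
            if nodes[src] == "Evidence" and rel in ("supports", "opposes"):
                bundle.add(src)
                # Pull the Evidence's Sources and Methods along with it.
                for _, dst in outgoing[src]:
                    if nodes[dst] in ("Source", "Method"):
                        bundle.add(dst)
            elif nodes[src] == "Claim" and rel == "opposes":
                frontier.append(src)  # expand the opposes-chain

    # Related Questions join when enough in-bundle Claims address them.
    shared = defaultdict(int)
    for node in bundle:
        if nodes[node] == "Claim":
            for rel, dst in outgoing[node]:
                if rel == "addresses" and nodes[dst] == "Question" and dst != anchor:
                    shared[dst] += 1
    bundle.update(q for q, n in shared.items() if n >= min_shared_claims)
    return bundle
```

Because each Claim is visited at most once, a cycle of mutually opposing Claims does not cause the walk to loop, and Claims that only address a related Question are not pulled in unless they are reachable from the anchor.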

Want to see the bundle this narrative drew from? Inspect the bundle visually.

Looking for the source paper this graph was decomposed from? Read the original whitepaper.

The empirical record on research data preservation converges on a structural outcome: for between 73 and 93 percent of published research, the underlying data cannot be produced on request [C-0002], a baseline that has persisted across two decades, multiple disciplines, and successive funder regimes. This steady state is not an operational failure (insufficient training, inadequate data management plans, or underfunded libraries) but an architectural one. Research data typically exists in a single copy, held by a single organization, funded by a single grant, and maintained by a single individual, each of which is a single point of failure [C-0001]. The documented loss mechanisms (personnel turnover, hardware failure, grant termination, and platform discontinuation) are absorbed without permanent loss only when independent copies exist across independent failure domains. The architectural tier at which a dataset effectively exists is determined by the deployment, not the underlying software: a Tier 3 protocol like Git delivers only Tier 1 resilience when it is used through a single centralized platform, as demonstrated by the 2019 GitHub sanctions episode, where developers who had relied exclusively on hosted access lost their work to a single jurisdictional decision [E-0009].
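The idea of independent failure domains can be made concrete with a small check. The sketch below is illustrative only: the `DatasetCopy` record and its four dimensions (organization, grant, maintainer, platform) are a hypothetical modeling choice that mirrors the single points of failure and loss mechanisms listed above, and the example values are invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatasetCopy:
    """One copy of a dataset and the failure domains it depends on (hypothetical model)."""
    organization: str
    grant: str
    maintainer: str
    platform: str


def single_points_of_failure(copies):
    """Return the dimensions on which every copy shares the same value.

    A shared value means one event (a departure, a grant ending, a platform
    shutting down) can take out all copies at once.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    return [
        f.name
        for f in fields(DatasetCopy)
        if len({getattr(copy, f.name) for copy in copies}) == 1
    ]


# Invented example data: two copies held by different organizations,
# but both reachable only through the same hosted platform.
copy_a = DatasetCopy("University A", "Grant 1", "alice", "HostedGit")
copy_b = DatasetCopy("University B", "Grant 2", "bob", "HostedGit")
print(single_points_of_failure([copy_a]))          # all four dimensions
print(single_points_of_failure([copy_a, copy_b]))  # only the platform
```

The second call shows the deployment point made above in miniature: adding copies does not help on a dimension where every copy still depends on the same platform.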

The institutional cost of this architectural condition is substantial. Applying the Four-Term Liability Formula to a representative R1 university with $200 million in annual research expenditure and approximately 3,000 peer-reviewed publications, of which 80 percent carry unverifiable data, yields a latent liability of roughly $1.1 billion per year [C-0005]. This figure is not a realized loss but a carrying cost made up of four terms: the sunk grant value (Term A), the replacement cost of irreplaceable datasets (Term B), the downstream value lost to foregone reuse (Term C), and the False Claims Act (FCA) exposure (Term D) that attaches to compliance certifications the institution cannot independently verify. The probability of surfacing is rising across three independent vectors (funder verification policy, FCA precedent, and regulatory convergence), each of which raises the likelihood that the latent liability converts to realized cost [C-0025]. The prevention cost, by contrast, is effectively zero: Tier 3 protocol nodes run on existing institutional infrastructure at idle capacity, with marginal costs measured in tens of dollars per year for standalone deployments and near-zero for institutions already operating the substrate [C-0006]. The asymmetry between a billion-dollar tail exposure and a rounding-error fix means the recommendation holds regardless of the timing assumed for any specific surfacing trajectory. The same investment that hedges the liability also captures the upside of the AI-era data substrate: institutions that deploy Tier 3 preservation in 2026 hold three positions simultaneously (the preservation posture, the verification posture, and the AI-readiness posture), while those that do not cede each of them on infrastructure they already operate at substantial idle capacity [C-0039].
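The headline numbers above can be put in proportion with some simple arithmetic. The sketch below does not reproduce the Four-Term Liability Formula, since the values of Terms A through D are not given here; it only derives quantities that follow from the stated inputs. The function name and the per-publication average are illustrative, and that average is a derived figure, not one reported in the source.

```python
# Inputs as stated in the text for a representative R1 university.
ANNUAL_RESEARCH_EXPENDITURE = 200_000_000  # dollars per year
PUBLICATIONS_PER_YEAR = 3_000              # approximate
UNVERIFIABLE_SHARE = 0.80                  # share carrying unverifiable data
LATENT_LIABILITY = 1_100_000_000           # "roughly $1.1 billion per year"


def liability_summary(expenditure, publications, unverifiable_share, latent_liability):
    """Derive scale indicators from the stated inputs (illustrative helper)."""
    unverifiable = round(publications * unverifiable_share)
    return {
        "unverifiable_publications": unverifiable,
        # Average only: the source does not break the total down per paper.
        "implied_liability_per_publication": latent_liability / unverifiable,
        "liability_to_expenditure_ratio": latent_liability / expenditure,
    }


summary = liability_summary(
    ANNUAL_RESEARCH_EXPENDITURE,
    PUBLICATIONS_PER_YEAR,
    UNVERIFIABLE_SHARE,
    LATENT_LIABILITY,
)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

On these inputs, about 2,400 publications per year carry unverifiable data, and the stated latent liability is 5.5 times the annual research expenditure.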