Resilient Data Futures

Narratives · original whitepaper

This is the original whitepaper that seeded the graph. The first pass of 312 nodes was decomposed from these sections — every Claim, Evidence, and Source carries a source_section: field pointing back here.

That paper-first origin is unusual. Discourse graphs are normally built incrementally: contributors add Questions, Claims, Evidence, and counter-evidence over time, and the graph emerges. From any such graph, narratives can be composed for any audience — academic paper, executive brief, blog post, position statement — by selecting nodes and rendering them in a chosen voice. The composed narratives below are early demonstrations of that direction.

Resilient Data Futures #

A Case for Resilient Research Data Infrastructure #


Executive Summary #

Seventy-three to ninety-three percent of published research sits on underlying data that cannot be produced on request. Four direct-contact studies across two decades — 516 ecology papers (Vines 2014), 1,792 biomedical papers whose authors had explicitly committed to share (Gabelica 2022), 141 psychology papers in APA journals whose authors had signed the APA data-sharing compliance certification (Wicherts 2006), and 1,634 PNAS and Nature-portfolio papers from 2017–2021 whose data-availability statements promised data on request (Acciai 2023) — converge on the same finding: most of the scientific record cannot be independently verified from its sources [1][2][3][4]. At a representative $200 million R1 research university, that non-verifiability represents approximately $1.1 billion per year in unverifiable research output carried as latent liability on the institutional balance sheet. The probability that any of it is surfaced is rising: funder verification is actively shifting from self-reported plans to programmatic compliance checks, and the False Claims Act's implied-certification doctrine sits doctrinally available against institutions that certify compliance they cannot independently verify, though no FCA case has yet been brought on architectural data loss specifically. The conditions for surfacing are loading. The same architecture that closes this exposure produces, as a structural byproduct, the AI-ready data substrate the next decade of institutional research will run on, converting the deployment from a one-sided hedge into a two-sided position on the same infrastructure investment.

The loss occurs through ordinary operations. A principal investigator leaves the institution and the operational knowledge of where the data lives leaves with them. A laptop is stolen and the dataset dies with it. A grant ends and the server maintenance budget ends with it. A repository closes — 191 have shut down since 2012, at a median operational age of twelve years, and 47% of those closures gave no indication of data migration or continued limited access [5] — and the datasets it held close with it. A platform changes its access terms, a funding agency reprioritizes, a vendor is acquired. Each mechanism is routine; each happens at every research institution every year; and each destroys the data it touches only when that data exists in a single copy within a single failure domain.

This paper argues that research data loss is an architectural failure rather than a failure of operations. The single-copy concentration that produces the losses also forecloses every procedural fix layered on top of it. No amount of policy reform, better data management plans, or stronger institutional guidance will change an outcome that is determined by the underlying storage architecture.

The paper develops a four-tier taxonomy of that architecture. Tier 0 is local storage on a single system. Tier 1 is hosted storage with a single provider. Tier 2 is coordinated preservation across institutional agreements, exemplified by CLOCKSS, the International Nucleotide Sequence Database Collaboration, and the Worldwide Protein Data Bank. Tier 3 is protocol-level distribution, in which redundancy arises as a structural byproduct of use rather than as the output of any organization's ongoing commitment. Tier 3 is the architecture of the systems that have survived longest on the Internet — DNS, email, BitTorrent, Git — and it is the only architecture that delivers preservation, compliance verification, and audit evidence as structural byproducts of operation. It runs at near-zero marginal cost on infrastructure most institutions already operate at a fraction of its capacity.

The 2026 tightening of funder mandates converts this architectural question from an aspiration into a fiduciary obligation. The NIH Data Management and Sharing Policy, the OSTP Nelson Memo, Horizon Europe's FAIR data requirements, the Wellcome Trust's compliance framework, and the Gates Foundation's transition to automated compliance monitoring all shift the regulatory posture from "did you write a plan?" to "did you actually do it, and can you prove it?" Institutions operating at Tier 0, Tier 1, or Tier 2 cannot answer the second question by inspection. Institutions operating at Tier 3 can.

Headline findings #

  1. Research data loss is structurally inevitable under single-copy architecture and structurally avoidable under distributed architecture.
  2. Every documented repository closure, platform shutdown, funding termination, physical disaster, and personnel departure produces permanent loss on Tier 0 or Tier 1 infrastructure and no loss on infrastructure with independent copies across independent failure domains.
  3. The best scientific preservation systems ever built operate at Tier 2, and their resilience depends on the continued coordination of three to four institutions. When governance or funding fails, the consortium's preservation contract fails with it.
  4. The cost of data loss operates across three compounding dimensions: the direct institutional liability carried on each dataset the institution cannot produce on request, the compliance and verification exposure created by the gap between mandate and delivery, and the loss of scientific output that the destroyed data would have enabled across its useful life.
  5. A representative R1 research university carries approximately $1.1 billion per year in unverifiable research output — a latent liability whose probability of surfacing is rising under the shift from self-reported funder plans to programmatic verification, with the False Claims Act's implied-certification doctrine doctrinally available behind it. Section 5 develops the formula, the R1 application, and the trajectory under which the loaded conditions could begin to fire.
  6. The marginal cost of adding Tier 3 protocol participation to existing institutional IT infrastructure is effectively zero. More than half of institutional server capacity already sits idle [6]; networks run at roughly 26% average utilization [7]; bandwidth contracts are flat-rate regardless of traffic volume [8].
  7. Tier 3 is the only architecture that generates verifiable compliance evidence as a byproduct of operation. The integrity of content-addressed data is mathematically verifiable. The number and location of independent copies is observable by inspection. The 2026 NIH standardized DMSP format will require evidence that Tier 0, Tier 1, and Tier 2 architecture cannot produce by inspection.
  8. The same Tier 3 architecture that hedges the unverifiable-data liability produces, as a structural byproduct, the provenance-verified, content-addressed, federated data substrate that institutional artificial intelligence strategy now depends on, making the deployment a two-sided position rather than a one-sided hedge.

Recommendations #

This paper concludes with seven recommendations, developed in detail in Section 11.

  1. Conduct an architectural audit of existing data infrastructure against the four-tier framework, classifying every research dataset the institution holds by its current tier and its exposure to the failure modes documented in Section 3.
  2. Deploy at least one protocol-level preservation node on existing institutional IT infrastructure within twelve months, using the reference configurations documented in Appendix D.
  3. Integrate compliance evidence generation into the data deposit workflow, so that every dataset produces, at the point of deposit, the verification artifacts required by funder mandates.
  4. Require verifiable evidence of data preservation, not self-reported plans, in all institutional grant submission and progress reporting processes.
  5. Fund preservation through facilities and administrative cost recovery rather than through project budgets, aligning the duration of funding with the duration of preservation need.
  6. Maintain local clones and content-addressed copies of all research data, as a standard practice at the principal investigator and laboratory level.
  7. Publish reference deployments, audit templates, and cost models through coordinated working-group activity, lowering the adoption cost for institutions that follow.

The institution holds an unhedged billion-dollar position on unverifiable research output against a rounding-error cost to close it. The infrastructure exists, the economics favor deployment by more than an order of magnitude, and the compliance trajectory is closing the window in which the decision remains voluntary. The same architectural decision positions the institution for the next decade of federal artificial intelligence funding, faculty recruiting, and data-governance compliance, on the same infrastructure investment. The remaining variable is institutional decision.


1. Introduction #

1.1 Scope and audience #

This paper is a working document of the Resilient Data Futures working group, first convened by SciOS at US-RSE 2025 and meeting monthly since. It circulates publicly to collect feedback, corrections, and domain-specific evidence from the communities whose workflows and obligations converge on research data infrastructure: university administrators, principal investigators and their research groups, institutional IT staff, research librarians and data stewards, funders, compliance officers, and research-focused policy analysts. The argument developed here assumes no specialized background in distributed systems, digital preservation, or research data policy. Technical detail sufficient to evaluate the architectural claims is included where necessary.

1.2 The state of research data preservation #

The empirical record on research data preservation is unambiguous. Direct-contact studies across two decades have found 73 to 93 percent of published research carries underlying data that cannot be produced on request: Vines and colleagues received only 19 percent of requested datasets across 516 ecology and evolutionary biology papers, Gabelica and colleagues reached 93 percent non-compliance across 1,792 biomedical papers whose authors had explicitly committed to share, Wicherts and colleagues found 73 percent non-compliance across 141 APA psychology papers in 2005, and Acciai and colleagues found 86 percent non-sharing across 1,634 PNAS and Nature-portfolio papers published 2017–2021 and contacted in 2022 [1][2][3][4]. Conditional on authors responding with the status of their data, the odds of a dataset still existing fell by roughly 17 percent per year after publication [1]. Preclinical research in the United States alone is estimated to consume approximately $28 billion per year on work that cannot be reproduced [9]. An analysis of the re3data directory identified 191 research data repositories that have shut down since 2012 at a median operational age of twelve years, with 47% giving no indication of data migration or continued limited access [5].
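The compounding implied by that decline can be made concrete. A minimal Python sketch of the decay curve (the even starting odds are an illustrative assumption, not a figure reported by Vines and colleagues):

```python
def odds_after(initial_odds: float, years: int, annual_decline: float = 0.17) -> float:
    """Odds that a dataset still exists after `years`, given a compounding
    annual decline in the odds (roughly 17% per year per Vines 2014)."""
    return initial_odds * (1 - annual_decline) ** years

def probability(odds: float) -> float:
    """Convert odds to a probability."""
    return odds / (1 + odds)

# Illustrative assumption: a dataset with even odds (50% probability)
# of being producible at publication.
for years in (0, 5, 10, 20):
    p = probability(odds_after(1.0, years))
    print(f"{years:>2} years: {p:.0%}")
```

Under the assumed even starting odds, the survival probability falls from 50% at publication to roughly 13% at ten years and about 2% at twenty: the decline is not a slow leak but a half-life.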

These are the continuous operating conditions of research data infrastructure, measured across multiple independent studies, disciplines, and funder regimes. The losses are occurring at institutions that have data management policies, at universities that fund repositories, and on datasets that were deposited in recognized platforms. The standard tools of research data management are in place, and the data is disappearing anyway.

1.3 The central argument #

The central argument of this paper is that the losses documented above are architectural rather than operational. Operational explanations — insufficient training, inadequate data management plans, uneven researcher discipline, underfunded libraries — identify real problems, and addressing them produces real but bounded improvements. What operational explanations do not address is the underlying property that makes the losses possible in the first place: research data typically exists in a single copy, held by a single organization, funded by a single grant, maintained by a single person. Each of those is a single point of failure, and in most research environments they coincide.

1.4 Structure of the report #

Section 2 presents the architectural framework: four tiers of preservation architecture and the properties that distinguish them. Section 3 shows how documented failure mechanisms operate on single-copy architecture. Section 4 documents the consequences for scientific progress when data is lost. Section 5 quantifies the institutional liability that unverifiable data creates and the mechanisms now converting that liability into realized cost. Section 6 documents how the same failure mechanisms produce survival rather than loss when distributed architecture is present. Section 7 analyzes the economics of each architectural tier. Section 8 examines the architectural properties that allow an institution to produce verification evidence as a byproduct of operation. Section 9 addresses the broader effects of a Tier 3 research data substrate. Section 10 develops the case for Tier 3 as the data infrastructure of artificial-intelligence-era research. Section 11 synthesizes the argument into seven recommendations for institutional, funder, and researcher action; Section 12 closes with a brief statement of implications.


2. A Framework for Data Resilience #

This section develops the analytical framework that organizes the rest of the paper. It begins with the architectural principles that determine whether a given dataset survives a triggering event, formalizes those principles as four tiers of preservation architecture, addresses the most common technical objection to applying distributed protocols to scientific data, and closes with the distinction between the infrastructure available to a researcher and the practice of using it.

2.1 Architectural principles #

Three properties determine whether an information system survives triggering events over long time horizons: the distribution of independent copies across independent failure domains, the capacity to verify integrity and location without trusting the holder, and the independence of the system's persistence from any single organization's governance, funding, or operational continuity.

Distribution across independent failure domains. A failure domain is the scope within which a single event can cause total loss. A server is a failure domain. An organization is a failure domain, because the same budget cut, acquisition, or governance failure affects every server the organization operates. A funding source is a failure domain, because the end of funding ends every activity dependent on it. A jurisdiction is a failure domain, because a single regulatory or sanctions action can reach every asset within it. For a dataset to survive a failure event, independent copies must exist in failure domains that the event does not reach.

Verifiable integrity. In a system where data integrity depends on the assertion of the holder, verification is procedural and depends on the continued cooperation of the holder. In a system where data is identified by a cryptographic hash of its contents, verification is mathematical. Any recipient can independently confirm that the bytes received are identical to the bytes originally published, without trusting any intermediate party. The RFC 6920 specification formalizes this approach, and systems including Git, IPFS, and Software Heritage are built on it [10][11].
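The verification step is concrete enough to sketch. A minimal Python illustration of content addressing (a simplified flavor of the RFC 6920 approach; the identifier format used here is invented for the example):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an identifier from the bytes themselves: the name is a
    hash of the content, here SHA-256 (cf. RFC 6920)."""
    return "sha-256;" + hashlib.sha256(data).hexdigest()

def verify(data: bytes, claimed_address: str) -> bool:
    """Any recipient can recompute the hash and compare. No trust in
    the holder, and no cooperation from the holder, is required."""
    return content_address(data) == claimed_address

published = b"station,temp_c\nA,12.4\nB,11.9\n"
address = content_address(published)

assert verify(published, address)             # an intact copy verifies
assert not verify(published + b"x", address)  # any alteration fails
```

This is the whole mechanism: identity and integrity collapse into one check, which is why a content-addressed copy retrieved from an untrusted node is as trustworthy as one retrieved from the original publisher.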

Organizational independence. Any preservation system that requires an organization's continued commitment inherits the organization's failure modes: governance change, budget cut, acquisition, operational collapse, strategic reprioritization, or jurisdictional action. When the organization fails, the preservation fails with it. Protocols that produce additional copies as a byproduct of use — Git's clone operation, BitTorrent's seeding behavior, DNS caching — deliver resilience that does not depend on any single organization continuing to maintain it. The act of use is the act of contributing to redundancy, and the redundancy accrues structurally.

2.2 Four architectural tiers #

These three properties are present in combinations that form four distinct architectural tiers, each defined by how many copies exist, across how many independent failure domains, under what coordination model, and with what verification capability. The tiers describe the landscape of preservation architecture available to research institutions.

Tier 0: Local storage #

Tier 0 is a single copy on a single system in a single location. A laboratory server, a departmental drive, a principal investigator's laptop, an external hard drive on a shelf — these are all Tier 0. There is no replication, no geographic redundancy, no verification layer beyond the trust placed in the storage medium, and no preservation plan beyond the continued operation of the single system. When the grant ends, the principal investigator moves, or the hardware fails, the data migrates through manual effort or it does not migrate at all. Tier 0 is the default architecture of most research data, and the default outcome is the year-over-year decline in dataset survival documented in Section 1 [1].

Tier 1: Hosted storage #

Tier 1 is a single copy held by an external hosting provider — an institutional repository, a domain repository such as Zenodo or Dryad, a cloud storage bucket on Amazon Web Services or Google Cloud. The data receives professional management, redundancy within the provider's infrastructure, discoverability through metadata, and some degree of persistence planning. This is the architecture that most data management plans describe and that most compliance mandates are designed to produce.

Tier 1 is a genuine improvement over Tier 0, but it introduces three structural vulnerabilities. The first is platform opacity: the data exists in one organizational context, subject to one provider's business decisions, funding continuity, terms of service, and infrastructure choices that the customer has no visibility into. The second is funding-model dependency: most research-facing hosted storage survives on some combination of grants, institutional subsidies, membership fees, and philanthropic funding, rather than on revenue from the researchers who use it, and none of these arrangements is contractually protected against reprioritization, budget cuts, or acquisition. The third is jurisdictional exposure: data stored on a provider's servers is governed by the laws of the country where those servers sit and of the country where the provider is incorporated, neither of which the researcher selects.

Tier 1 solves the problem of local hardware failure. It does not solve the problem of platform failure, and the platform can fail through bankruptcy, acquisition, defunding, jurisdictional change, or internal reorganization as readily as through technical failure. For the majority of researchers who follow current best practice, Tier 1 is the point at which the journey terminates, and it is one organizational decision away from the same outcome as Tier 0.

Tier 2: Coordinated institutional preservation #

Tier 2 maintains multiple copies of data across multiple locations, coordinated by institutional agreements and funded through a combination of membership fees, research libraries, and philanthropic support. The most sophisticated coordinated preservation systems on Earth — INSDC, the Worldwide Protein Data Bank, CLOCKSS, the LHC Computing Grid — operate at this tier. Section 6.1 develops their successes; Section 6.2 develops the structural limits that bound them.

Tier 2 works, and it works well within its scope. The architectural limitation is economic rather than technical. A Tier 2 network depends on the continued coordination of a small number of organizations, and that coordination requires continuous operational funding. When funding or coordination fails, the redundancy members paid for is no longer the redundancy they have.

Tier 3: Protocol-level distribution #

Tier 3 distributes data as a structural byproduct of use, across a protocol that requires no organization to operate and produces redundancy at every point of participation. The architectural pattern is visible in the longest-running information systems on the Internet. The Domain Name System has resolved domain names for 43 years, handling trillions of queries per day across 350 million registered domains, and has never gone down globally [16]. Email has operated for 44 years, serves 4.73 billion users, and carries 392.5 billion messages per day across an infrastructure no single entity owns [17][18]. BitTorrent has operated for 25 years with over 2 billion cumulative installations, and the Internet Archive offers it as a supplementary distribution channel for over a million items, describing it as the fastest method to retrieve Archive content [19][20]. Git has operated for 21 years, is used by 93.87% of professional developers, and maintains complete repository history in every clone, such that when kernel.org was compromised in 2011 the source code of the Linux kernel was never at risk because thousands of developers held independently verifiable copies [21][22][23].

These systems share a set of structural properties that differ from Tier 2 in kind rather than in degree. Redundancy is not maintained by organizational agreement; it is produced by the act of participation. Integrity is not asserted by the host; it is verifiable by cryptographic inspection. Governance failures do not terminate the copies; the copies persist on every node that participated. Scale is not limited by the coordination overhead of the participating organizations; it grows with each additional user. The architecture is anti-fragile: the same activity that gives it scale is the activity that gives it redundancy.

2.3 The tiers in comparison #

The four tiers differ along the three properties of Section 2.1:

| Property | Tier 0: Local | Tier 1: Hosted | Tier 2: Coordinated | Tier 3: Protocol |
| --- | --- | --- | --- | --- |
| Independent copies | One system | One provider | Several institutions | Every participating node |
| Integrity verification | Trust in the medium | Asserted by the host | Asserted under member agreement | Cryptographic, by any recipient |
| Organizational independence | None | None | Bounded by consortium coordination and funding | Persists without any single organization |

2.4 Why protocol properties transfer to scientific data #

Two objections to applying Tier 3 architecture to scientific data recur. The first is that the protocols transport opaque byte streams, while scientific data carries provenance, metadata, and governance requirements those protocols were not designed for. The second is that much scientific data is sensitive and cannot be distributed on the kind of network that carries movies and music. Both objections name real concerns, and both rest on architectural assumptions that do not survive inspection. The byte-stream objection conflates the storage substrate with the governance layers every preservation system has always built on top of it; the privacy objection conflates Tier 3 architecture with public distribution.

Content addressing operates on any byte sequence regardless of its semantics. The hash of a FITS cube from a radio telescope, an fMRI volume, a CSV of field measurements, or a Git commit is computed the same way; identity, integrity, and deduplication behave identically across content types. Signed repositories — the mechanism behind Git commit signing and Git tag signing — establish provenance without requiring trust in the host, and the signature travels with the object. Every property that makes Tier 3 resilient for general-purpose data applies to scientific data at the byte-stream layer.

The features scientific data requires above the byte-stream layer — curated metadata, versioning policy, dispute resolution, schema governance — are the same features every Tier 2 system already layers on top of Tier 1 storage. The Protein Data Bank's weekly synchronization, GenBank's Feature Table format, and the astronomical archives' International Virtual Observatory Alliance standards are each a governance and metadata layer built above a storage substrate. The architectural question is not whether protocols can preserve scientific context. The question is whether the storage substrate beneath those governance layers is architecturally fragile or structurally redundant. Tier 3 is a foundation layer, not a complete solution, and the layering principle that has made Tier 2 successful on top of Tier 1 operates equally on top of Tier 3.

The privacy objection rests on a similar conflation. Tier 3 is not synonymous with public distribution. Permissioned variants of every major protocol exist and are in production use: private BitTorrent trackers, federated Matrix homeservers, and permissioned IPFS clusters all restrict which nodes can hold copies while preserving the protocol's redundancy and integrity properties. Sensitive scientific data — clinical records under HIPAA, embargoed pre-publication results, indigenous data sovereignty obligations, classified observations — does not require Tier 3 to be abandoned; it requires the permissioned configuration of the same architecture. Access control and structural redundancy are independent properties.

2.5 Infrastructure enables practice; practice determines resilience #

The tier at which a dataset effectively exists is determined not only by the infrastructure available to the researcher but also by the way that infrastructure is used. A Tier 3 protocol deployed in a centralized configuration delivers Tier 1 resilience, because the distributed properties of the protocol remain latent when only one copy is maintained.

GitHub, which hosts 630 million repositories and serves 180 million developers, illustrates the pattern [24]. Git is a distributed protocol; GitHub is a single commercial platform operated by a single company subject to a single country's legal regime. In July 2019, GitHub blocked developers in Iran, Syria, Crimea, Cuba, and North Korea from accessing their own repositories, citing United States export controls, and when affected developers requested copies of their disabled repositories, the platform responded that it was "not legally able to send an export of the disabled repository content" [25][25a]. Developers who had maintained local clones retained every commit, branch, and line of code. Developers who had relied exclusively on GitHub lost access to their own work [25]. The underlying protocol supported Tier 3 resilience; the usage pattern delivered Tier 1 exposure.

A researcher who deposits a dataset in a single repository and moves on is operating under the same pattern. The infrastructure may support distribution, but a single copy in a single organizational context is Tier 1 regardless of what the underlying technology makes possible. The architectural tier is a property of the deployment, not of the software.
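At the practice level, the fix for the GitHub pattern is inexpensive. A sketch of maintaining an independent, locally verifiable copy with standard Git commands (the "origin" repository here is a stand-in created for the example; in practice it would be the hosted repository's URL):

```shell
set -e
# Create a small repository standing in for hosted research data.
git init -q origin-repo
git -C origin-repo -c user.email=demo@example.org -c user.name=demo \
    commit -q --allow-empty -m "initial dataset deposit"

# A mirror clone carries every ref and object, not just a working tree,
# so the copy is complete in the sense Section 2.5 requires.
git clone -q --mirror origin-repo dataset-backup.git

# Every object is content-addressed: integrity is checkable locally,
# without trusting (or contacting) the original host.
git -C dataset-backup.git fsck --full

# Later, on a schedule (cron, CI): refresh the independent copy.
git -C dataset-backup.git remote update
```

Run against a hosted platform, the same three commands convert latent Tier 3 protocol capability into an actual second copy in an independent failure domain; this is recommendation 6 in miniature.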


3. Modes of Data Loss #

Section 2 established the architectural framework. This section examines the specific mechanisms through which research data is lost and shows that each of them reduces to a single structural property: single-copy dependency. The mechanisms differ in their immediate triggers and in the institutions that experience them, but they share the property that concentration produces loss and distribution produces survival. The section organizes documented failure mechanisms into four causal categories, then closes with the synthesis that motivates Sections 4 through 9.

3.1 Personnel turnover and institutional memory #

Research is performed predominantly by temporary workers who leave by design. The median total time from starting graduate school to receiving a doctorate is 7.3 years, and approximately 43% of candidates have not completed within ten years; the average postdoctoral position lasts approximately 4.5 years, and roughly 15–23% of postdocs eventually secure tenure-track positions, depending on field [26]. The person who understands what a dataset is, how it was generated, and where it lives is always within a few years of leaving the institution that holds it.

The structural exposure this creates is measurable at the project level. In a study of 133 popular GitHub projects, 65% had a bus factor of two or less, meaning two departures would leave the project effectively unmaintained [27]. The same pattern produces orphaned research data at the institutional level. At the High-Performance Computing Center Stuttgart, 57 of 262 user accounts on the tape archive were de-registered and left approximately 619 terabytes of potential dark data behind without active stewardship [28].

Institutional infrastructure treats every departing researcher as a preservation event it is not equipped to handle, and then responds to the accumulated consequence as though it were unexpected.

3.2 Physical and technical loss #

Buildings burn. Servers fail. Laptops are stolen. When research data exists in only one location, each of these ordinary events becomes a preservation event.

The individual-researcher variant is routine enough to go unreported in most years. The Agh et al. 2009 Artemia morphometric dataset (developed as a case study in Section 5.3.1) — the only record of six Iranian salt-lake populations, including the bisexual Artemia urmiana — was lost when a laptop containing it was stolen; the populations cannot be re-sampled because Urmia Lake has since lost approximately 88 percent of its surface area. Every research institution has a stack of comparable incidents that did not generate case studies because the losses terminated at the individual researcher.

Institutional-scale technical failure produces the same outcome through the same architectural property. In December 2021, a routine software update at Kyoto University's supercomputing center interacted badly with the center's backup scripts and deleted 77 terabytes of research data belonging to fourteen research groups; for four of the fourteen groups, the deleted files were the only copies the center held, and the loss is permanent [29]. A backup that shares a failure domain with the data it is meant to protect is not a backup; it is a second copy in the same system, and a single event that reaches the system destroys the primary data and the safety net in the same operation. At Kyoto, the backup scripts and the data they were meant to protect executed in the same software context, and a single administrative action destroyed both. The same architectural principle holds for physical failure domains — a backup stored in the same building as production is destroyed by the same fire, the same flood, or the same power failure that destroys production — but Kyoto shows that the failure domain need not be physical to collapse the distinction between primary and backup.

Physical destruction is the extreme variant. The 2018 fire at Brazil's National Museum destroyed roughly 18.4 million of 20 million items, including 200 years of scientific archives, field records, expedition logs, and irreplaceable catalog data accumulated across decades of Brazilian and international research, on an annual maintenance budget that had collapsed to roughly $13,000 in 2018 — a fraction of the $128,000 the museum required and had not received in any year since 2014 [30]. The destroyed records included the research documentation underlying substantial portions of Brazilian natural history, linguistics, and anthropology; there were no off-site copies of most of the archival material, and the loss is permanent.

None of these events was unprecedented. An ordinary stolen laptop, an ordinary software update, and an ordinary building fire were each catastrophic because the affected research data existed in one place.

3.3 Funding termination #

Grants keep data alive, and when grants end, maintenance ends with them. Between February and August 2025, the National Institutes of Health terminated 2,291 active grants, withdrawing $2.45 billion in committed funding and disrupting 383 clinical trials with more than 74,000 enrolled participants [31]. The National Science Foundation terminated 1,752 grants totaling roughly $1.4 billion in the same window, with the Science, Technology, Engineering, and Mathematics Education directorate alone losing 839 grants worth $888 million [31]. Proposed fiscal year 2026 reductions would cut the National Science Foundation by approximately 56%, the National Oceanic and Atmospheric Administration by 24%, and the Advanced Research Projects Agency–Energy by 57% [31].

Specific data infrastructure disappeared alongside the grants. The National Oceanic and Atmospheric Administration's Billion-Dollar Disasters database, which had tracked $2.9 trillion in climate disaster costs since 1980, ceased updating [32]. Mauna Loa Observatory, which maintains the 68-year continuous carbon dioxide record that is the longest such record on Earth, was proposed for complete defunding [32]. Nearly 3,400 datasets were removed from Data.gov by late February 2025 alongside more than 8,000 web pages and 14 decommissioned National Oceanic and Atmospheric Administration datasets covering earthquakes, marine science, and coastal systems [33]. A study of 411 long-term mammal studies found that 191 had been terminated, including a 63-year yellow-bellied marmot time series rejected for future funding on the explicit grounds that it had "too much data" [34].

Every long-term dataset is a cumulative capital asset. When the funding stops, the asset is abandoned, and re-collection is rarely possible because the ecological, cohort, or instrumental conditions under which the original data was collected cannot be reassembled.

3.4 Platform discontinuation and access restriction #

Platforms that host research data can cease to exist, cease to provide access, or cease to be affordable, on timelines that the platform controls and the researcher does not.

Discontinuation is the most visible mechanism. Since 2012, 191 research data repositories have shut down, with a median operational age of twelve years at closure; 47% of those closures gave no indication of data migration or continued limited access [5].

The aggregate statistic is distributed across disciplines and failure modes. The NASA Astronomical Data Center, a federally funded archive of stellar, galactic, and extragalactic catalogs operated by NASA Goddard Space Flight Center for 25 years, was terminated in October 2002 after NASA determined that its services "sufficiently overlap those provided by [other services] to allow termination"; users were redirected informally to the Centre de Données astronomiques de Strasbourg, with no formally designated successor and curation responsibility fragmented across multiple independent services [35].

The Arts and Humanities Data Service — a United Kingdom national service operating five domain centers in archaeology, history, literature/languages/linguistics, performing arts, and visual arts — closed on March 31, 2008, exactly at the twelve-year median documented above, after the Arts and Humanities Research Council voted to discontinue co-funding despite community opposition. The archaeology and history collections were absorbed by successor services at the University of York and the University of Essex, and the visual arts collections continued as VADS at the University for the Creative Arts, while the performing arts collections had no direct successor and their long-term accessibility has been uneven [36].

The Banco de Información para la Investigación Aplicada en Ciencias Sociales, a social sciences repository hosted by the Centro de Investigación y Docencia Económicas in Mexico City across political science, economics, jurisprudence, and geography, obtained Data Seal of Approval certification in 2013 with an explicit pledge of "perpetuity of the data," and nonetheless went dark on December 15, 2023; its persistent identifiers no longer resolve, and no successor repository has been named [37].

Federal funding, national-service designation, and formal certification with an express durability pledge each proved insufficient to prevent closure under the organizational conditions that produced the aggregate statistic.

Access restriction produces equivalent research impact without the platform closing. When Twitter eliminated its free academic research application programming interface in February 2023, a subsequent analysis identified 33,306 studies across 8,914 venues and 610,738 citations that had been built on Twitter data, with over 100 active research projects canceled, halted, or pivoted [38]. A single platform's access decision reshaped a subfield of computational social science within months.

The same mechanism operates inside research infrastructure itself, where platforms that researchers built to serve research have restricted access on timelines the research community did not control. The Global Initiative on Sharing All Influenza Data, the primary platform for COVID-19 genomic surveillance with submissions from over 200 countries and territories, suspended the accounts of individual researchers in 2023 after they published work critical of the platform's origin narrative; those suspended included the Scripps group that flagged a discrepancy in the original SARS-CoV-2 submission and the international team that published the Wuhan market origin analysis. The research data repository registry re3data subsequently reclassified the platform from open-access to restricted-access [39]; the governance conditions that produced this outcome are developed in Section 6.2.

State action routes access restrictions through research infrastructure by a different path and reaches the same endpoint. In April 2023, the China National Knowledge Infrastructure — the dominant aggregator of Chinese-language scholarship, serving approximately 1,600 institutional subscribers outside mainland China — cut off foreign subscribers from its dissertations, master's theses, conference proceedings, statistical yearbooks, and population census databases under the Data Security Law's cross-border data transfer review, leaving researchers at named institutions including Georgetown and the University of Notre Dame without forward access to primary material in China studies, demography, economics, and law [40]. In November 2024, CERN terminated its cooperation with Russian and Belarusian institutions under European Union and Swiss sanctions following the 2022 invasion of Ukraine, expelling approximately 500 scientists affiliated with Russian institutions from LHC experiments, alongside a smaller cohort affiliated with Belarusian institutions whose contracts ended earlier in 2024, ending decades of direct Russian-state participation in the world's largest particle physics collaboration [41].

Not every restriction is politically motivated. UK Biobank, a United Kingdom research charity serving over 30,000 approved researchers worldwide, transitioned during 2023 and 2024 from bulk data download to a cloud-only research platform metered per-analysis on top of an existing £9,000 three-year access fee; members of the neuroscience community publicly reported that their research project costs would approximately double under the new model [42].

Commercial capture is the slower-moving variant of the same mechanism. When Elsevier acquired Bepress in 2017, more than 500 universities discovered that their institutional repository infrastructure — which many had built specifically to circumvent commercial publishers — was now owned by a commercial publisher [43]. The acquisition did not require a shutdown. Ownership alone is sufficient to control the terms of preservation for every dataset and paper the platform holds. Elsevier's earlier acquisition of Mendeley produced the end-of-life of Mendeley Desktop in September 2022, four years after a 2018 update caused users to lose PDFs and annotations curated inside the application [44]. Academia.edu, launched as a free platform in 2008, added a $99/year premium tier in 2016 and has raised prices annually since; reports in 2026 place the rate at $498/year, with 40% of its users in developing nations where the paywall is most exclusionary [45].

Commercial control extends beyond ownership to the terms of use applied to content institutions have already paid to access. Elsevier and Springer Nature route text-and-data-mining access to their combined 5,500-plus journals through publisher-gated application programming interfaces under click-through licenses that cap mining rate, restrict republishing, and assert publisher rights over derivative outputs, and Elsevier has reported institutional contract violations to universities when researchers attempted bulk download through standard subscribed access [46].

Each of these mechanisms is a different trigger for the same architectural failure. The researcher's data, code, or access depended on one organization's continued decision to provide it, and the organization changed its decision.

3.5 A shared structural property #

The four categories above describe triggers that differ substantially in their immediate cause. A graduate student's departure, a data center fire, a grant termination, and a platform acquisition occupy different positions in the operational life of a research institution, require different administrative responses, and are managed by different people. They share one property: each of them destroys a specific dataset only when that dataset exists in a single copy within a single failure domain. Each of them is absorbed without permanent loss when independent copies exist across independent failure domains.

This is the architectural claim of the paper reduced to its operational form. The mechanisms documented in this section are the normal operating conditions of centralized research data infrastructure. Every research institution experiences some of them every year. Whether a given dataset survives is determined by how many independent copies exist across independent failure domains when the triggering event occurs.

Section 4 documents the consequences for science when data is lost through these mechanisms. Section 5 quantifies the institutional liability that unverifiable data creates and the mechanisms now converting it into realized cost. Section 6 documents the outcome when the same classes of event meet distributed architecture.


4. The Cost to Scientific Progress #

Section 3 documented the mechanisms by which research data is lost. Section 5 quantifies what that loss costs each institution on its own balance sheet. This section documents the cost no institution pays directly and every institution inherits: the aggregate harm to science as a field of inquiry, measured in reproducibility decay and the structural decay of the scholarly record itself.

4.1 The reproducibility crisis #

The downstream signature of research data loss is measured in reproducibility failure. A Nature survey of 1,576 researchers found that more than 70% had failed to reproduce another scientist's experiments, and more than 50% had failed to reproduce their own [47]. Amgen's hematology and oncology team attempted to replicate 53 landmark preclinical cancer studies and succeeded on six, an 11% success rate on foundational work that was guiding drug development [48]. When the editor of Molecular Brain asked 41 authors to produce the raw data behind their submitted manuscripts, 97% could not, and 21 of the papers were withdrawn [49]; retracted NIH-funded papers carry a mean attributable cost of approximately $392,582 each, with a documented aggregate of roughly $46.9 million in unadjusted NIH funding across 1992–2012 [50]. A large-scale analysis of published research code found that 74% of R files fail to complete without error, and 56% still fail after automated cleaning [51]. The research record, when inspected for reproducibility, does not reproduce.

Reproducibility failure has many proximate causes — methodological variation, biological noise, undocumented procedures, analytical flexibility. What connects the specific figures above to the architectural argument of this paper is that every one of them depends on the continued existence of the underlying data. When the data survives in a form a third party can verify, reproducibility failures become diagnosable: the independent investigator can examine the source, trace the divergence, and identify whether the issue lies in the protocol, the analysis, or the measurement. When the data does not survive, reproducibility failure becomes the terminal state of the investigation because the evidence required to diagnose anything else is already gone. The reproducibility crisis is the accumulated consequence of single-copy architecture operating across the research enterprise for decades.

4.2 The structural decay of the scholarly record #

Beyond individual reproducibility failures, the aggregate loss is visible in the structural decay of the scholarly record itself. A Pew Research Center analysis found that 25% of all webpages from 2013 to 2023 are already gone, rising to 38% for pages a decade old [52]. An analysis of 3.5 million scientific articles and approximately one million Uniform Resource Identifiers found that one in five scientific articles suffers reference rot, and among articles that cite web content, seven in ten have compromised scholarly context [53]. In legal scholarship, more than 70% of Uniform Resource Locators cited across a sample drawn from the Harvard Law Review and two other Harvard journals between 1996 and 2012 no longer resolve to the originally cited content [54]. Every dead reference is a broken link between a published claim and the evidence that supported it; at the scale of the scholarly record, the aggregate is a measurable decay in the degree to which research can be built upon.


5. The Cost of Unverifiable Research Data #

Research data loss imposes cost on the institution through a single mechanism: the inability to produce, on request, the data underlying the institution's published output. The current baseline is that 73 to 93 percent of published research cannot meet such a request. This section establishes that baseline from two decades of direct-contact studies, formalizes the four-term liability it creates on the institutional balance sheet, applies the formula to a specific documented case and to a representative R1 institution, documents the mechanisms now converting latent liability into realized cost, catalogs the tail events those mechanisms produce across the research sector, and closes with the continuous erosion institutions pay every year the infrastructure remains unchanged.

5.1 The baseline: 73 to 93 percent of published research cannot produce its underlying data #

The empirical base rate is established by direct-contact studies spanning two decades. Vines and colleagues attempted to retrieve raw data underlying 516 ecology and evolutionary biology papers and received only 19 percent of the requested datasets, with 77 percent of authors failing to produce data on request [1]. Gabelica, Bojčić and Puljak repeated the exercise on 1,792 biomedical papers in which the authors had published explicit commitments to share data on request; 93 percent failed to deliver [2]. Wicherts and colleagues found 73 percent non-compliance in psychology in 2005 [3]; Acciai, Schneider and Nielsen returned to the pattern seventeen years later with an audit experiment on 1,634 PNAS and Nature-portfolio papers from 2017–2021 whose data-availability statements promised data on request, and found 86 percent non-sharing [4]. Four cohorts, four fields, nearly two decades apart: 73 to 93 percent of published research sits on underlying data that cannot be produced on request.

This is the steady state, not a probabilistic forecast. The events that produce it — drives lost, laptops stolen, repositories shut down, personnel departed, formats obsoleted, backups overwritten — have already happened across most of the institution's published output. The question the rest of this section answers is not how often loss occurs, but what the accumulated loss costs the institution when something forces it to surface.

5.2 The institutional liability #

The liability an institution carries, for each dataset in the unretrievable fraction, is the sum of four terms.

A: Sunk grant value. The dollars that originally produced the data, available directly from the award record. For a multi-year principal-investigator grant at a research university, A ranges from several hundred thousand dollars to several million.

B: Replacement cost. When reconstruction is feasible, B can range significantly higher than A, because field conditions, cohorts, and experimental alignments rarely reassemble and re-collection carries full new-project overhead rather than the incremental cost of the original effort. When reconstruction is not feasible — as with long-term ecological time series, decommissioned cohorts, or datasets tied to one-time events — B is maximal rather than zero, because the data cannot be purchased back at any price and the institution has destroyed an irreplaceable capital asset.

C: Downstream value lost. The citation stream, follow-on grants, and regulatory products the dataset would have generated across its useful life. For long-horizon research, C typically exceeds A on its own; Piwowar and Vision documented a 9% citation advantage for papers with open data, controlling for journal, author, and institutional history, and estimated that every 100 deposited datasets generate over 150 reuse papers within five years [55].

D: False Claims Act exposure. Settlement risk under the False Claims Act, which attaches to false or fraudulent claims submitted in connection with federal grant funds. The Act's "reckless disregard" standard, combined with the "implied certification" theory established by the Supreme Court in Universal Health Services v. United States ex rel. Escobar (2016), extends liability beyond deliberate fraud to certifications of compliance the institution has no infrastructure to verify. Per-retraction direct-cost exposure, separately, runs approximately $392,582 in National Institutes of Health costs per retracted paper [50]. Section 5.4 develops the precedent and the trajectory.

A + C + D serves as a lower bound on the per-dataset liability; the true figure exceeds it by the unquantifiable but real value of the irreplaceable asset captured in B. The sum A + B + C represents maximum exposure under a scenario in which every unverifiable dataset is surfaced within the measurement year. Expected annual loss is the sum weighted by the probability of surfacing, which is rising across the three vectors documented in §5.4 but remains well below 1.0 in any current year.

These four terms do not describe a cost the institution incurs when something breaks. They describe a liability the institution already carries on 73 to 93 percent of its published output. What varies across years is not the liability itself but the probability that some fraction of it is surfaced, whether by an audit, a retraction request, a False Claims Act action, a funder verification check, or a reviewer inquiry. Section 5.4 shows that probability rising sharply.
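
The bounds and the expected-loss weighting above can be sketched directly. This is an illustrative sketch only: the function names, the example dollar figures, and the use of `None` to mark an unbounded replacement cost are our assumptions, not notation from the paper.

```python
def liability_bounds(A, C, D, B=None):
    """Per-dataset liability bounds from the four terms in Section 5.2.

    A: sunk grant value; B: replacement cost, or None when reconstruction
    is infeasible (effectively unbounded above); C: downstream value lost;
    D: False Claims Act exposure.
    """
    lower = A + C + D                          # floor: excludes hard-to-quantify B
    upper = None if B is None else A + B + C   # max exposure if surfaced this year
    return lower, upper


def expected_annual_loss(lower, p_surface):
    """Latent liability weighted by the probability it is surfaced this year."""
    return lower * p_surface


# Illustrative figures only: a dataset with $500k sunk grant value,
# $400k feasible replacement, $600k downstream value, $100k exposure,
# surfaced with 5% probability in a given year.
low, high = liability_bounds(A=500_000, C=600_000, D=100_000, B=400_000)
# low = 1_200_000; high = 1_500_000
print(round(expected_annual_loss(low, 0.05)))  # prints 60000
```

The `B=None` convention mirrors the text's treatment of irreplaceable datasets: the upper bound simply does not exist, while the A + C + D floor remains computable.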

5.3 Application #

5.3.1 A specific case: Agh et al. 2009 #

In 2026, Tim Vines identified a paper from the 2009 publication year of his cohort whose authors had confirmed the underlying data was lost (T. Vines, pers. comm., 2026):

  • Agh, N., Bossier, P., Abatzopoulos, T.J., Beardmore, J.A., Van Stappen, G., Mohammadyari, A., Rahimian, H. & Sorgeloos, P. "Morphometric and Preliminary Genetic Characteristics of Artemia Populations from Iran." International Review of Hydrobiology 94(2):194–207, 2009 [56]

The Agh et al. paper characterized six populations of the brine shrimp Artemia from Iranian salt lakes across 19 morphometric variables, demonstrating that 85.9 percent of individuals could be correctly classified to source population on morphology alone, and that the bisexual Artemia urmiana — endemic to Iran's Urmia Lake — showed 100 percent separation from the parthenogenetic populations [56]. The work was carried out as a collaboration between Ghent University's Laboratory of Aquaculture and Artemia Reference Center and Urmia University's Artemia and Aquatic Animals Research Institute, as part of a 14-partner European Commission INCO-DEV Concerted Action on Artemia biodiversity, ICA4-CT-2001-10020, running from January 2002 through December 2004 with total EU funding of €800,000 verified in the CORDIS award record [57].

The raw data — 19 morphometric variables across individuals from six Iranian populations — was lost when a laptop containing the dataset was stolen (T. Vines, pers. comm., 2026). No backup survived.

Applying the four-term formula to this single paper:

A: Sunk grant value. The Urmia University partner share of the EU ICA4-CT-2001-10020 consortium award is approximately €57,000 (€800,000 total EU funding ÷ 14 partners across the three-year award period), verified in the CORDIS award record [57]. This is a conservative floor on term A for two reasons. First, the EU instrument was a Coordination Action — an FP5 mechanism that funds networking, methodology harmonization, and joint protocols rather than primary research — so €57,000 represents only the EU coordination share that flowed to Urmia, not the cost of generating the Iranian Artemia dataset itself. Second, the dataset was produced by a multi-funder stack in which the EU share is a minority component. Triangulating the full investment from comparable early-2000s sources: the Iranian Ministry of Science, Research and Technology provided additional national funding for the Urmia component (amount not publicly disclosed, but MSRT environmental-biology awards of the period typically ran €15,000 to €40,000 per 2–3 year project); N. Agh's doctoral stipend at Ghent over the 2001–2008 PhD track, at standard Belgian FWO/VLIR bursary rates of the period, contributes approximately €25,000 to €35,000 attributable to this chapter; Ghent Artemia Reference Center lab operations for the RFLP genetic characterization of six populations contribute approximately €15,000 to €25,000; and the cross-Iran field campaign itself — six sites across five provinces spanning Urmia in West Azerbaijan to Varmal in Sistan-Baluchestan, roughly 1,900 kilometers of domestic logistics — contributes approximately €20,000 to €40,000, largely absorbed by Urmia University institutional support. Reasoned all-in total: approximately €140,000 to €225,000 in 2004 euros, or approximately $210,000 to $340,000 in 2026 dollars after EU HICP inflation and EUR/USD conversion.
The €57,000 CORDIS figure remains the hard-verifiable floor; the wider range is what was actually spent to generate the data the laptop held.

B: Replacement cost. A feasible resample of six Iranian Artemia populations at 2026 rates — international field expedition to the salt-lake sites with Iranian collaborator, specimen collection, and logistics across six provinces ($40,000 to $70,000); specimen preparation and morphometric measurement of 19 variables on 30 to 50 individuals per population ($20,000 to $40,000); modern genetic characterization (ddRAD or microsatellites replacing the original RFLP protocol, $25,000 to $50,000); and analysis, writing, and publication ($10,000 to $20,000) — would cost approximately $95,000 to $180,000. That figure is moot. Urmia Lake had lost approximately 88 percent of its surface area by the mid-2010s, driven by upstream agricultural water diversion and drought [58]; the pre-collapse salinity, temperature, and hydrological conditions under which the six populations were sampled cannot be reconstructed. The bisexual Artemia urmiana population has collapsed with the lake and is no longer abundant in its type locality. Re-sampling today would produce measurements from a transformed ecosystem, not a replacement dataset. B is effectively infinite against any fixed budget: the dataset could be re-created at any price only if Urmia Lake were first restored.

C: Downstream value lost. The paper has accumulated 13 citations as indexed by OpenAlex and Semantic Scholar as of April 2026 and is among the principal references in the Iranian Artemia biodiversity literature. Applying the citation-advantage range from the open-data literature — 9 percent at the conservative bound (Piwowar & Vision 2013, gene-expression microarray corpus) rising to 25 percent when data is deposited in a repository rather than offered on request (Colavizza et al. 2020, PLOS+BMC corpus) [55][59] — the paper would have received approximately 1 to 3 additional citations had the data been accessible. Applying Piwowar and Vision's estimate of 150 reuse papers generated per 100 deposited datasets within five years [55] as a directional baseline (the original measurement was specific to gene-expression microarray data), the Agh data would have supported approximately 1 to 2 reuse papers at average reuse intensity. Two features of this specific dataset push the realistic reuse estimate above that baseline. First, it is pre-collapse baseline data for an ecosystem that has since collapsed, exactly the class of data that becomes more valuable, not less, as time passes: the post-collapse genetic-erosion and conservation-genetics literature on A. urmiana (Asem and colleagues' 2019 PeerJ analysis documenting more than 90 percent population loss between 1994 and 2004 and the associated genetic changes [60]) is precisely the downstream work that would have drawn on the Agh 2009 morphometric and RFLP measurements as a pre-collapse reference had the underlying data been available. Second, the six-population cross-Iran comparison is one of the few of its scale for the era, giving it above-average comparator value in the biogeography, conservation-genetics, and meta-analysis literature. Adjusted reuse estimate: 2 to 4 reuse papers over useful life.
At the per-paper grant attribution established in term A above ($210,000 to $340,000 in 2026 dollars), the foregone reuse value is approximately $420,000 to $1,360,000 in additional research productivity not realized, concentrated in the pre-collapse-baseline and biogeographic-comparator literature that this dataset would have uniquely anchored.

D: Regulatory and institutional exposure. The Agh work predates the NSF Data Management Plan requirement (2011) and the NIH Data Management and Sharing Policy (2023), and it was carried out outside United States federal jurisdiction. An analogous incident at a US institution holding NIH or NSF funding today would sit within doctrinal reach of the False Claims Act's implied-certification theory developed below, against the $10 million to $112.5 million settlement range adjacent fraud cases have established — though no architectural FCA case has yet been brought.

Quantifiable cost: approximately $725,000 to $1.88 million, composed of ~$210,000 to $340,000 sunk investment (A), ~$95,000 to $180,000 feasible resample floor (B) — moot against an effectively infinite cap because Urmia Lake has lost approximately 88 percent of its surface area by the mid-2010s and Artemia urmiana has collapsed in its type locality — and ~$420,000 to $1.36 million in foregone reuse value (C), concentrated in the pre-collapse-baseline and biogeographic-comparator literature that this dataset would have uniquely anchored.
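
As a check on the arithmetic, the per-term ranges above sum exactly to the stated quantifiable-cost range. This is a verification sketch only; the figures are the ones given in the text, in 2026 dollars, and the dictionary keys are our labels.

```python
# Per-term ranges for Agh et al. 2009 (Section 5.3.1), in 2026 USD.
# B is the feasible-resample floor; the text treats its true cap as
# effectively unbounded because the pre-collapse ecosystem is gone.
terms = {
    "A_sunk_grant":     (210_000, 340_000),
    "B_resample_floor": (95_000, 180_000),
    "C_foregone_reuse": (420_000, 1_360_000),
}

low = sum(lo for lo, _ in terms.values())
high = sum(hi for _, hi in terms.values())
print(low, high)  # prints 725000 1880000, i.e. ~$725,000 to $1.88 million
```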

This is one paper. The representative-R1 analysis below applies the same pattern across an institution's annual publication output.

5.3.2 The representative R1 #

Apply the four-term formula to a Carnegie R1 research university. Per NCSES HERD FY2023, total United States academic research and development expenditure reached $108.8 billion across 914 reporting institutions, with the top 30 institutions alone accounting for 42 percent of that total [61]; of these, the 2025 Carnegie classification designates 187 as Research 1 ("Very High Spending and Doctorate Production," minimum thresholds of $50 million annual R&D and 70 research doctorates, determined using FY2021–FY2023 HERD data). The institution modeled here is a mid-sized R1 running approximately $200 million in annual research expenditure with roughly 1,000 to 1,500 tenure-track faculty of whom 250 to 500 are actively extramurally funded [62], producing approximately 3,000 peer-reviewed publications per year.

Latent liability from one year of institutional output. Of approximately 3,000 papers published per year, approximately 2,400 (80 percent) carry underlying data that cannot be produced on request. Applying the four-term formula:

A: Sunk grant value. The peer-reviewed median NIH direct cost per retracted paper, from Stern, Casadevall, Steen and Fang 2014, is $239,381 — the conservative measure of a right-skewed distribution whose mean is $392,582 [50]. Applied to 2,400 unretrievable papers per year at the median, the institution carries approximately $574 million per year in federally-reported research output whose underlying data cannot be defended against an audit request (rising to approximately $942 million per year at the mean). This is a cumulative grant-dollar attribution — the papers were produced from grants that ran over three to five prior years — but it accrues every year the architecture remains unchanged.

B: Replacement cost. Approximately 30 to 40 percent of the institution's biomedical output is human-subjects data tied to specific IRB protocols, consent documents, and recruitment windows — a range consistent with NIDDK's extramural-portfolio human-subjects share of approximately 40 percent, sustained from FY2014 through FY2023 [63]; for this fraction, replacement is not feasible at any price. For the bench, computational, imaging, and omics remainder, the reconstruction floor is approximately $250,000 per paper, anchored on Stern and colleagues' peer-reviewed median of $239,381 in NIH direct cost per paper [50] — essentially the cost of redoing the research that produced the paper. Across the reconstructible fraction of the 2,400 unretrievable papers per year, the feasible-subset replacement floor is approximately $360 million to $420 million per year. The irreplaceable remainder is unbounded above.

C: Downstream value lost. Data-available papers attract approximately 9 percent more citations than papers without publicly available data (Piwowar & Vision 2013, gene-expression microarray corpus, 95% CI 5–13%), with the differential rising to 25 percent when data is deposited in a repository rather than offered on request (Colavizza et al. 2020, PLOS+BMC corpus) [55][59]. Applying Piwowar and Vision's estimate of 150 reuse papers per 100 deposited datasets within five years [55] as a directional baseline (the original measurement was specific to gene-expression microarray data) to the 2,400 unretrievable papers per year, approximately 3,600 reuse papers are foregone over each five-year window — roughly 720 papers per year of downstream research productivity not generated. At the Stern per-paper grant attribution of $239,381 used in term A [50], the foregone reuse value is approximately $172 million per year. This is the same methodology applied at single-paper scale to the Agh 2009 case in Section 5.3.1. The citation-advantage differential above this baseline contributes an additional directional multiplier that this analysis does not attempt to monetize.

D: Regulatory and institutional exposure. False Claims Act settlements for research-misconduct fact patterns at NIH-funded institutions establish the adjacent settlement range: $10 million (Harvard-Partners/Anversa, April 2017 settlement, with 31 papers subsequently recommended for retraction in October 2018) [64] to $112.5 million (Duke/Potts-Kant, March 2019) [65] per institutional settlement, with the most recent precedent set by the December 2025 Dana-Farber settlement at $15 million on six NIH grants under the implied-certification theory [66]. These cases price scientific fraud rather than architectural data unavailability; no FCA case has been brought on a purely architectural fact pattern. The implied-certification theory sits doctrinally available for that extension (developed in §5.4), and the NIH Data Management and Sharing Policy (NOT-OD-21-013, effective 25 January 2023) weights compliance in future funding decisions [67]. Term D is therefore not proportional to paper volume: it is an institutional tail risk of $10 million to $112.5 million per major surfaced event under a scenario in which the doctrinal extension is brought, with expected frequency rising under the DMS Policy compliance regime.

Maximum institutional exposure under a full-enforcement scenario: approximately $1.1 billion per year (~$574M A + ~$360–420M B + ~$172M C). This figure represents the tail — the loss the institution would realize if every unverifiable paper were surfaced by audit, enforcement, or verification demand within a given year. Expected annual loss is substantially lower and is a function of the probability that any given dataset is surfaced; Section 5.4 documents the three vectors pushing that probability upward.
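The tail figure follows from summing the three volume-proportional terms, carrying term B through as a low/high pair. This is a sketch of the stated arithmetic, nothing more:

```python
# Full-enforcement tail: terms A, B (range), and C, all in $M/year.
term_a = 574
term_b_low, term_b_high = 360, 420
term_c = 172

tail_low = term_a + term_b_low + term_c    # 1,106
tail_high = term_a + term_b_high + term_c  # 1,166

print(f"tail ~ ${tail_low / 1000:.2f}B to ${tail_high / 1000:.2f}B per year")
```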

5.4 The trajectory toward realized cost #

The institutional liability documented in Section 5.2 is not, as of 2026, being surfaced through enforcement at the architectural scale Section 5.3 quantifies. The probability that it will be is rising across three independent vectors — funder verification policy, False Claims Act precedent, and regulatory convergence with every other industry that handles consequential data — each of which is loading the conditions for surfacing rather than producing enforcement at architectural scale today. The variable that determines when those loaded conditions begin to fire is the maturity of the audit infrastructure required to surface architectural failure as a documentable claim, and the same audit infrastructure is the architecture this paper develops in Sections 6 through 8.

The mandate regime is moving from self-reported plans to programmatic verification. Every major research funder now requires data management and sharing as a condition of funding. The National Institutes of Health Data Management and Sharing Policy took effect in January 2023 as a term and condition of award, enforceable under the same mechanism as any other grant condition [67]. As of October 2024, annual progress reports must document what data has actually been shared, what repository was used, and by what unique identifier [67]. A simpler Data Management and Sharing Plan format scheduled to take effect for applications due on or after May 25, 2026 is intended to aid compliance monitoring [67]. The Gates Foundation, as of January 2025, funds OA.Works to perform programmatic compliance review — operationalizing ongoing independent verification at funder scale [68]. Horizon Europe makes FAIR data mandatory with no opt-out provision, enforceable through Grant Agreement Article 17, which ties compliance to payments [69]. The Wellcome Trust may decline to accept new grant applications from non-compliant researchers and may suspend funding to the institution in extreme cases [70]. The National Science Foundation returns proposals submitted without a data management plan without review [71]. The Office of Science and Technology Policy Nelson Memo, issued in August 2022, directed all federal agencies to require immediate open access to federally funded research and data, with no embargo period, covering more than $90 billion in annual federal research funding [72]. The question on which the regime is turning is no longer "did you write a plan?" but "did you do it, and can you prove it?" Most institutions cannot currently answer the second question by inspection. 

The mandate regime has run for three years since the January 2023 effective date of the National Institutes of Health Data Management and Sharing Policy without producing a major enforcement action grounded in architectural data unavailability, and that gap is the empirical baseline against which the trajectory must be read. The simpler Data Management and Sharing Plan format taking effect in May 2026, the Gates Foundation's January 2025 transition to programmatic compliance verification through OA.Works, and the operational maturity of the underlying compliance-monitoring stack are converging on the technical conditions under which programmatic enforcement becomes mechanically possible for the first time. The shift is not from "no mandate" to "mandate"; the mandates have been in place for three years. The shift is from a mandate regime that cannot programmatically detect non-compliance to one that can.

False Claims Act precedent is already establishing the settlement range. To date, False Claims Act settlements at federally funded institutions have covered fraud and fabrication rather than systematic data failure: Duke University paid $112.5 million in 2019 on grant applications and progress reports submitted to the National Institutes of Health and the Environmental Protection Agency, the largest False Claims Act payment by a university [65]; Harvard and Brigham and Women's Hospital paid $10 million in 2017 on claims related to the Anversa cardiac stem cell research, which produced 31 retractions [64]; Dana-Farber settled for $15 million in December 2025 on six NIH grants under the implied-certification theory [66]. The settlements above price scientific fraud, not architectural data unavailability; no False Claims Act case has been brought, let alone settled, on a purely architectural fact pattern. The implied-certification theory established in Universal Health Services v. United States ex rel. Escobar (2016) extends liability beyond deliberate fraud to certifications of compliance the institution has no infrastructure to verify, and the theory is doctrinally available to a relator or government attorney pursuing such a case. An institution that certifies compliance it cannot independently surface is operating in the territory the theory reaches. What has prevented the architectural extension from being brought is not the legal theory but the absence of audit infrastructure on the institutional side capable of surfacing architectural failure as a documentable claim — exactly the absence a reviewer of an institution's data infrastructure encounters when she asks for the evidence and finds the question unanswerable. The settlement range above is the precedent stack that an architectural case, if and when it is brought, would land against; it is not a current liability schedule for architectural failure as such.

False Claims Act enforcement is not the only mechanism by which architectural data unavailability surfaces as an institutional cost. Freedom of Information Act requests, available against federal agencies and against federally funded research outputs in a range of jurisdictions and contexts, surface data unavailability without requiring an enforcement action: an institution whose grantee cannot produce data in response to a properly directed FOIA request inherits the reputational, political, and downstream-grant-competitiveness cost of the inability, and the surfacing event creates the documentary record that subsequent enforcement, if it comes, references. The operational center of current research-integrity enforcement under the National Security Presidential Memorandum 33 implementation guidance and successor frameworks is conflict-of-interest disclosure rather than data retention [117], and that operational center is the procedural template under which verifiable-evidence requirements have already moved from self-attestation to documented disclosure. The same institutional response that handles a conflict-of-interest disclosure today will be expected to handle an architectural-evidence request tomorrow; the institutions that have already built the infrastructure to handle the latter will absorb the transition without operational disruption, and the institutions that have not will absorb it through whatever combination of rushed remediation, foregone funding, and reputational cost the moment of demand imposes.

Every other industry that handles consequential data has already crossed the verification threshold. Securities and Exchange Commission Rule 17a-4 mandates six years of retention in tamper-proof format with audited disaster recovery, backed by over $3.5 billion in fines for records-related failures since 2021 across SEC, CFTC, and FINRA combined [73][74]. The Health Insurance Portability and Accountability Act Security Rule mandates encrypted, redundant backups with tested restoration, with maximum penalties of $2.19 million per violation in the willful-neglect tier [75]. Title 21 of the Code of Federal Regulations Part 11 requires complete audit trails for any electronic record submitted to the Food and Drug Administration; when Applied Therapeutics submitted a new drug application and the Food and Drug Administration discovered that a vendor had deleted audit trails two days after FDA preannounced its inspection, the application was rejected — unverifiable data was inadmissible regardless of what it showed [76]. Financial institutions typically allocate 10 to 20 percent of their information technology budgets to cybersecurity and recovery planning combined [77]. In each of these sectors, the audit infrastructure was built before the enforcement matured, and the enforcement matured rapidly once the infrastructure was in place to support it. The Securities and Exchange Commission's Rule 17a-4 Write-Once, Read-Many requirement preceded the off-channel-communications enforcement wave by more than two decades; the Health Insurance Portability and Accountability Act Security Rule preceded the systematic Office for Civil Rights audit program by approximately a decade; the 21 Code of Federal Regulations Part 11 audit-trail requirement was published in 1997 and reached the Applied Therapeutics-style enforcement posture only after Food and Drug Administration inspection capacity caught up with it. 

Research data is at the equivalent point in its own arc: the mandate regime is in place, the legal theory is settled, the precedent stack on adjacent fact patterns has accumulated, and the variable that determines when enforcement scales to the architectural fact pattern is the audit infrastructure on the institutional side. Research is the last major sector handling consequential data without mandatory verification of the infrastructure that holds it, and that outlier status is the gap the mandate regime is now closing.

The convergence of these vectors does not, in 2026, convert the liability in Section 5.2 from a latent figure to a realized loss. It loads the conditions under which conversion becomes mechanically possible across the next funding and audit cycles, on a trajectory in which the rate-limiting input — institutional audit infrastructure capable of surfacing architectural failure on inspection — is also the input the rest of this paper proposes that institutions build for reasons that hold whether or not the enforcement trajectory materializes on the timeline the loaded conditions suggest. The Dana-Farber December 2025 settlement is a preview of what fraud-pattern cases settled under the maturing mandate regime look like, not a forecast of architectural cases at scale; the institutional position the paper recommends is the position that holds whether or not the architectural extension of the precedent stack ever arrives. Section 8 develops the architectural properties that produce verification evidence as a byproduct of operation, which is the property the loaded conditions of this section depend on for any mechanical-enforcement scenario to fire.

5.5 Continuous erosion #

The liability quantified in Section 5.3 coexists with a continuous erosion that institutions pay every year the infrastructure remains unchanged, independent of whether any audit or enforcement action occurs. Two terms capture it.

E: Faculty flight. A 2025 Nature reader poll drew more than 1,600 respondents (the majority of whom were scientists); 75% said they were considering leaving the country, rising to 79% among postgraduate researchers, citing funding cuts, firings, and cancelled programs as drivers [78]. Each departure compounds across subsequent grant cycles.

F: Failed recruiting. Positions that remain open, or offers that are declined, because the infrastructure a candidate requires does not exist. The European Commission's "Choose Europe for Science" program, launched in May 2025 at €500 million and since expanded to approximately €900 million across more than 100 national and regional initiatives, is actively recruiting global research talent during the US disruptions through ERC super-grants, ERA Chairs, and MSCA fellowships [78]. Only 44% of United States faculty report that their institution provides adequate technology support for grant-funded projects [79].

None of these costs appear in the fiscal year when the underinvestment decision was made. All of them compound across the years during which the underinvestment persists, and all of them accrue regardless of whether the latent liability in Section 5.3 is surfaced by an external event.

5.6 The asymmetry #

The institution's position is straightforward to summarize. On the liability side sits approximately $1.1 billion per year in unverifiable research output at a $200M R1, rising with annual publication volume, against a probability of surfacing that three independent vectors — funder verification, False Claims Act precedent, and regulatory convergence with every other consequential-data industry — are pushing up simultaneously. The probability of surfacing remains low in absolute terms in any single year, and the architectural-extension scenario remains untested in litigation as of 2026. But the conditions for surfacing are loading on a trajectory the institution does not control, and the asymmetry between the carrying cost of the position and the cost of closing it is large enough that the trajectory does not need to materialize on any specific timeline for the recommendation to follow. On the prevention side sits protocol-level preservation that runs on existing institutional server infrastructure at effectively zero marginal cost [6][7], and at $42 to $360 per node per year for a standalone deployment from nothing [80].
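One way to see the asymmetry concretely is a break-even calculation. The 100-node deployment size below is purely hypothetical and chosen to be generous; the per-node cost is the top of the quoted standalone range:

```python
# Break-even surfacing probability for the hedge (illustrative deployment size).
tail_exposure = 1.1e9  # $/year, Section 5.3 full-enforcement tail
nodes = 100            # hypothetical standalone deployment size (our assumption)
cost_per_node = 360    # $/node/year, top of the quoted range [80]

hedge_cost = nodes * cost_per_node         # $36,000/year
break_even_p = hedge_cost / tail_exposure  # ~3.3e-5

print(f"hedge ${hedge_cost:,}/year; rational if annual surfacing probability > {break_even_p:.1e}")
```

Even at this deliberately inflated hedge cost, the break-even annual surfacing probability is on the order of a few in a hundred thousand.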

The institution is carrying an unhedged billion-dollar tail exposure to avoid a rounding-error expenditure. Section 6 documents what prevention looks like in practice; Section 7 prices it; Section 8 addresses the verification architecture that makes the prevention demonstrable to funders and auditors.


6. Resilience in Practice #

Section 5 established that 73 to 93 percent of published research sits on data that cannot be produced on request, and documented the verification regime now loading the conditions under which that latent non-verifiability converts into realized enforcement cost. This section documents what coordinated preservation has achieved at the ceiling, and the structural reasons that ceiling cannot extend to cover the 73 to 93 percent.

6.1 The ceiling of coordinated preservation #

The scientific community has built the most sophisticated coordinated preservation systems on Earth. Each operates at Tier 2, and each has survived multi-decade operational horizons through three- or four-institution coordination.

Nucleotide sequence data. The International Nucleotide Sequence Database Collaboration maintains three mirrored databases on three continents — the National Center for Biotechnology Information in the United States, the European Molecular Biology Laboratory – European Bioinformatics Institute in the United Kingdom, and the DNA Data Bank of Japan — synchronized daily through a shared Feature Table format [12]. The system holds 53.9 trillion bases across 6.27 billion records and has been in continuous operation since the 1980s. Any single node can go down without data loss because the other two hold complete copies. Resilience depends on the three institutions continuing to coordinate and to fund their operations.

Macromolecular structure data. The Protein Data Bank has operated since 1971 [13]. Since 2003, four wwPDB partner sites — RCSB PDB in the United States, PDBe at the European Molecular Biology Laboratory – European Bioinformatics Institute, PDBj in Japan, and BMRB in Wisconsin — have maintained synchronized copies of the complete archive of over 250,000 experimentally determined 3D structures. The estimated replacement cost is $23 billion [13]. One hundred percent of 34 new United States Food and Drug Administration–approved low-molecular-weight, protein-targeted cancer drugs between 2019 and 2023 relied on Protein Data Bank data [81]. The Protein Data Bank has been continuously preserved for over 50 years; four-site weekly synchronization has secured the present $23 billion archive since 2003.

Particle physics data. CERN's Worldwide LHC Computing Grid operates 1.5 exabytes across more than 170 sites in 42 countries and processes over 2 million tasks per day [15]. The grid is a managed hierarchy with a central coordination layer at CERN, representing the most sophisticated Tier 2 system ever deployed.

Environmental data. The National Oceanic and Atmospheric Administration's National Centers for Environmental Information manage over 60 petabytes of environmental data across four United States locations [82]. When Hurricane Helene struck the centers' Asheville headquarters in September 2024, all archived data holdings were confirmed safe [82].

Astronomical archives. The Space Telescope Science Institute, the Canadian Astronomy Data Centre, and the European Space Astronomy Centre have maintained a 30-year international data sharing partnership for astronomical archives, interoperating through International Virtual Observatory Alliance standards [83].

These systems represent the demonstrated ceiling of coordinated preservation. They have survived floods, hurricanes, and decades of operation, and they generate extraordinary returns on the investments that sustain them. They are also all organizationally dependent, discipline-specific, and limited to the communities that could assemble and fund the required institutional coordination.

The ceiling is lower than it looks. INSDC holds three copies. wwPDB holds four. NOAA NCEI holds four, all within a single agency. The astronomical consortium holds three. Three or four independently maintained copies is the state of the art in coordinated scientific preservation, and each system is sustained by one or two funding streams. The copies are physically independent; the organizational, political, and budgetary domains governing them are not. A single budget decision can affect all four NOAA storage sites simultaneously because all four report to the same agency — the proposed ~27% cut to NOAA in fiscal year 2026, and the proposed complete defunding of Mauna Loa's 68-year carbon dioxide record, illustrate the failure mode [31][32]. INSDC, wwPDB, and WLCG carry the same structural exposure at different scales.

6.2 The gap from Tier 2 to Tier 3 #

The systems documented in Section 6.1 represent the demonstrated ceiling of coordinated preservation, and that ceiling exists only for a small set of well-funded disciplines that could organize and fund the required coordination. GenBank operates for nucleotide sequences. The Protein Data Bank operates for macromolecular structures. The Worldwide LHC Computing Grid operates for particle physics. Research that does not match the schema of one of these systems is not covered by any of them; cross-disciplinary work, small-team studies, underfunded projects, and data types without a community standard have no Tier 2 infrastructure at all.

Institutional repositories exist, but most operate as Tier 1, holding a single copy at a single institution with no replication across independent failure domains. The scale of what sits outside Tier 2 coverage is the baseline established in Section 5.1: underlying data delivered on request for only 19 percent of 516 ecology papers [1], 93 percent of 1,792 biomedical papers failing to deliver on their own published sharing commitments [2], 73 percent of psychology papers non-compliant in 2005 [3], 86 percent of 1,634 PNAS and Nature-portfolio papers from 2017–2021 not shared when contacted in 2022 [4], and an 8-percent-declared-versus-2-percent-delivered compliance gap across 2.1 million articles [84]. Tier 2 covers a narrow slice. The 73 to 93 percent does not.

Domain-specific Tier 2 infrastructure also carries hidden fragility. The more bespoke a system, the smaller the community maintaining it, the harder it is to replace, and the more likely it becomes a single point of failure wrapped in the appearance of resilience. The Global Initiative on Sharing All Influenza Data became the world's primary platform for COVID-19 genomic surveillance, with over 16.5 million SARS-CoV-2 sequences from over 195 countries and territories flowing through it as of 2024 [39]. On January 11, 2020, the SARS-CoV-2 genome was first publicly shared, with GISAID hosting submissions from the Chinese CDC; within two weeks, BioNTech launched Project Lightspeed (January 27, 2020) using the publicly available genome [39]. The technology was extraordinary. The governance, centralized under a single founder, led to suspended access for scientists investigating COVID origins in 2023, terminated data feeds to critical surveillance tools including Nextstrain, Outbreak.info, and CoV-Spectrum in 2025, and a revoked "Open Access" designation from re3data [39]. The comparison with the International Nucleotide Sequence Database Collaboration, whose coordination across three independent institutions has held for nearly 40 years [12], illustrates the pattern: the technology worked in both cases, and the governance determined whether the technology continued to work.

The Digital Preservation Network spent $7 million over its run as a coordinator among five federated Replicating Nodes (APTrust, Chronopolis, HathiTrust, Stanford Digital Repository, and Texas Digital Library), latterly as a single-member LLC of Internet2 (2017–2018), and announced wind-down in December 2018; only 26 of its 64 charter members had ever deposited content, and membership stood at 31 at dissolution [95]. Because DPN was a coordinator rather than an operator, with actual storage living at the federated nodes (each a free-standing preservation service), the dissolution did not destroy any copies. The five nodes continued operating, and depositors transitioned individually by their ingest node. But the cross-node integrity layer DPN had been building — fixity audits across nodes, succession guarantees, and the consortium-level provenance layer that members had paid for — dissolved with the coordinator, and depositors had to renegotiate preservation node by node on whatever terms each node offered in DPN's absence.

MetaArchive Cooperative operated for twenty years across institutions in three countries and nearly a dozen U.S. states, successfully preserved HBCU collections, born-digital objects, and digitized audio/visual recordings, and won the American Library Association's George Cunha and Susan Swartzburg Preservation Award in 2017. The cooperative dissolved on March 31, 2025 after Educopia shifted its fiscal-sponsorship requirement to one full-time-equivalent staff member per community in January 2025, two members departed, and the operational reserve fell below policy threshold [96]. Unlike DPN, MetaArchive operated the replication protocol on members' behalf — running the LOCKSS automated polling that produced the distributed copies — and members had outsourced replication-state verification to the cooperative. When the eleven-month wind-down forced a comprehensive audit, the cooperative discovered that automated polling had not been replicating content the way it advertised: Educopia's announcement notes "issues with insufficient replications and problems with the automated LOCKSS polling process" [96]. Recovery required collapsing the distributed architecture entirely — manually consolidating every member's content onto a new audit node at Stanford (where the LOCKSS Program is hosted) so it could be audited and rebuilt to a known-good baseline before being redistributed to the Dandelion Archive bridge network or returned to members. By Educopia's own statement, "it was not possible to secure a permanent archival home for all of MetaArchive's materials within the sunset time frame" [96]. Dandelion Archive itself is structured as a temporary bridge rather than a permanent home, and members who could not find a matching LOCKSS successor were given the option to pause preservation activities until they selected a new solution [96] — a path that converts to effective loss for any member that does not resume. 
No specific dataset has been documented as permanently lost in the public record, but the cooperative's own wind-down language acknowledges that not all materials reached permanent preservation, and the silent under-replication discovered during the sunset audit means pre-2025 loss within the cooperative cannot be ruled out either.

These cases — GISAID, DPN, MetaArchive — bracket Tier 2's structural fragility. GISAID shows that when governance turns adversarial, the platform can be wielded against parts of its own user community while the technology keeps running. DPN shows that when the coordinator dissolves, the integrity contract evaporates even though the underlying copies survive. MetaArchive shows that when the operator has been running the replication on members' behalf and has silent operational failures, the distributed architecture itself has to be undone to recover. Tier 2 redundancy is contingent on the consortium operating correctly, with no independent verification path available to the institutions that depend on it.

Closing the gap requires an architecture that operates below and across domain-specific governance: a protocol-level, domain-agnostic substrate on which the same resilience properties apply to genomic sequences, sociology datasets, climate observations, and educational interventions, because the resilience lives in the protocol rather than in the community that happens to maintain a particular instance of it.

The coverage gap has an economic origin. Tier 2 participation is priced for well-funded disciplinary consortia — thousands to tens of thousands of dollars per year, with dedicated staffing and infrastructure expectations — which is why coverage has accumulated in nucleotide sequences, macromolecular structures, and particle physics, not in the long tail of research that produces the 73 to 93 percent. Tier 3 participation scales down to existing institutional infrastructure at effectively zero marginal cost. Section 7 develops the economics that make universal coverage architecturally possible.


7. The Economics of Preservation #

The cost of data infrastructure is evaluated in this section against a widely used heuristic in quality and reliability engineering: Labovitz and Chang's 1:10:100 rule, documented in 1992, which holds that one dollar spent on prevention at the source costs ten dollars to detect and correct after bad data propagates, and one hundred dollars to handle once bad data has driven decisions [53]. Section 5 quantified the ten-dollar and one-hundred-dollar columns at research institutions — the approximately $1.1 billion per year in latent institutional liability that accrues on unverifiable research output. This section prices the one-dollar column and analyzes the comparative economics of the four architectural tiers.
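Scaling the paper's failure column by the 1:10:100 ratios gives a directional sense of the other two columns. The mapping of institutional figures onto Labovitz and Chang's ratios is our illustration, not theirs:

```python
# Directional scaling of the 1:10:100 rule to the Section 5 failure column.
RATIO = (1, 10, 100)    # prevention : correction : failure [53]
failure_column = 1.1e9  # $/year latent liability quantified in Section 5

prevention_column = failure_column * RATIO[0] / RATIO[2]  # ~$11M/year
correction_column = failure_column * RATIO[1] / RATIO[2]  # ~$110M/year

print(f"prevention ~${prevention_column / 1e6:.0f}M, correction ~${correction_column / 1e6:.0f}M per year")
```

Under the heuristic, $11M per year is the ceiling of what prevention should cost; Section 7.3 prices the actual architecture well below it.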

7.1 The cost of local and hosted storage #

The median U.S. higher-education institution reporting to the EDUCAUSE Core Data Service spends $10.6 million per year on central information technology [85]. Compensation accounts for approximately half of that, $5.2 million, and it is nearly entirely fixed; staff turnover runs at 8% annually, and during budget reductions 49% of institutions implement hiring freezes — alongside 31% implementing layoffs and 32% offering retirement or layoff incentives [85]. Physical plant costs amortize over 10- to 15-year cycles for mechanical and electrical infrastructure, longer for building shells [86]. Cabling accounts for 61% of campus networking infrastructure costs and is embedded in buildings regardless of what traffic the network carries [86]. Bandwidth is contractual, with Internet2 membership fees based on institutional scale rather than traffic volume [8]. The marginal cost of additional network traffic is effectively zero.

Most of that infrastructure sits idle. Only 40% of data centers measure server utilization at all, and approximately one-quarter of physical servers are entirely comatose — running and drawing power while performing no work [6]. Where utilization is measured, typical on-premises enterprise servers run at 12% to 18% of capacity [6]. Networks run at 26% average utilization globally [7]; Internet2 flags concern when its backbone reaches the 30% threshold and maintains over 50% headroom by policy [8]. Millions of dollars of infrastructure carries zero resilience for the research it holds, because none of that idle capacity is structurally connected to the preservation of the data sitting on it.

Hosted storage purchased externally is affordable on an annual budget line. LYRASIS DSpaceDirect, a turnkey hosted institutional repository, costs $4,000 to $9,000 per year; Digital Commons runs $10,000 to $12,000 [87]. Self-hosted repositories are more expensive and illustrate where the cost actually falls. The Massachusetts Institute of Technology's DSpace instance runs approximately $260,000 per year, with $76,000 in infrastructure and $184,000 in staffing for 2.75 full-time-equivalent staff [88]. The University of Southampton's ePrints costs approximately 116,000 pounds sterling annually, 96% of which goes to staff [88]. Across the repository budgets studied, staff costs account for approximately 58% to 96% of the total [88]. The technology is inexpensive; the expertise to curate, maintain, and govern a repository is the substantial expense. Raw cloud storage is cheaper still: Amazon Web Services S3 Glacier Instant Retrieval costs approximately $48 per terabyte per year (Glacier Deep Archive runs roughly $12), and Google Cloud Archive runs approximately $14 [89].
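The quoted per-terabyte rates compose into a simple annual cost model. The rates are the paper's quoted figures; the 100 TB archive size is a hypothetical input:

```python
# Annual raw-storage cost for a hypothetical archive, at the quoted $/TB/year rates [89].
rates_per_tb_year = {
    "AWS S3 Glacier Instant Retrieval": 48,
    "AWS S3 Glacier Deep Archive": 12,
    "Google Cloud Archive": 14,
}
archive_tb = 100  # hypothetical institutional research archive size

costs = {tier: rate * archive_tb for tier, rate in rates_per_tb_year.items()}
for tier, cost in costs.items():
    print(f"{tier}: ${cost:,}/year")
```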

Hosted storage is affordable on the annual line item, but as Section 3 documents, providers fail routinely and for many different reasons, and as Section 5 documents, a single surfacing event — an audit, a False Claims Act action, a retraction cascade — can exceed a century of repository fees. The economics of Tier 1 are favorable only until the liability is realized.

7.2 The cost of coordinated preservation #

Coordinated preservation adds redundancy through institutional agreements. Multiple organizations maintain copies across multiple locations, and the cost reflects coordination overhead.

CLOCKSS charges $550 to $18,350 per year, scaled to library budget [90]. LOCKSS Global Network membership runs $2,642 to $13,222 annually, plus approximately $700 per year for node hardware [91]. HathiTrust costs $6,600 to $13,000 [92]. APTrust charges $20,000 per year plus $420 per terabyte [93]. Portico ranges from $1,500 to $25,462 [94].

These are modest sums for research universities — CLOCKSS at the low end costs less than a departmental software license, and LOCKSS at the high end is a rounding error on an R1 institution's operating budget. The economic risk at Tier 2 lies in the organizational model the fees fund rather than in the fees themselves. The Digital Preservation Network charged $20,000 per year flat, regardless of institutional size or usage, and shut down in 2018 with 47% of its $7 million operating budget consumed by overhead and another 11% by marketing, and only 26 of its 64 charter members ever depositing content [95]. MetaArchive Cooperative operated on a participatory model for two decades before increased costs from fiscal-host Educopia and insufficient operating reserves converged to end it in 2025 [96].

A pattern is legible across the cases. Proportional fees, easy ingestion, and automated contribution produce consortia that survive — CLOCKSS has operated for 20 years, LOCKSS for 27, HathiTrust for 18 [90][91][92]. Flat fees, difficult ingestion, and voluntary contribution produce consortia that do not. The consortia that last are the ones in which the economics make participation rational at the institutional level.

7.3 The cost of protocol-level participation #

For an institution starting from nothing, a Hetzner CX23 instance — two virtual central processing units, 4 gigabytes of random-access memory, 40 gigabytes of disk, and 20 terabytes of monthly transfer — costs approximately $46 per year [80]. It can run any protocol node in the list below, and several simultaneously. A BitTorrent seedbox runs $36 to $84 per year [80]. An AT Protocol personal data server costs $42 to $72 per year [80]. A Matrix homeserver runs $60 to $240 per year [80]. A self-hosted Git forge costs $60 to $120 per year [80]. IPFS pinning runs $60 to $360 per year depending on volume [80].

Most institutions are not starting from nothing. They already operate servers where more than half the capacity sits idle [6], connected to networks running at 26% average utilization [7] with bandwidth contracted at flat rates regardless of traffic volume [8]. The marginal cost of adding a protocol node to this existing infrastructure is not $46 per year; it is closer to zero.

A BitTorrent seeding daemon averages 9 to 14 megabytes of resident memory on Linux [97]. On a server drawing 50% to 60% of peak power while operating at 15% utilization, adding that daemon is invisible in the noise floor. A Tor relay requires 512 megabytes of memory and 10 to 16 megabits per second of bandwidth [98], 0.016% of a 100-gigabit-per-second campus Internet2 connection [8]. Forgejo, a self-hosted Git forge, runs as a single binary at roughly 100 to 150 megabytes of resident memory and serves small institutional deployments comfortably on one to two central processing unit cores [80]. BitTorrent's WebSeed specification allows any existing web server to function as a seed with no software modification [99]. If a university already hosts datasets on a web server, making them available via BitTorrent adds no hardware, no software, and no cost.
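The WebSeed mechanism is concrete enough to sketch. The BEP 19 extension adds a `url-list` key to a torrent's metadata, pointing clients at an existing HTTP server; the sketch below builds such a metadata file with a minimal bencoder. The file name and mirror URL are illustrative, and a production torrent generator would handle multi-file datasets and tracker announcement, which this sketch omits.

```python
import hashlib

def bencode(obj) -> bytes:
    """Minimal bencoder covering the types a .torrent file uses."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Bencoded dict keys must be byte strings in sorted order.
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"unsupported type: {type(obj)}")

def make_webseed_torrent(name: str, data: bytes, mirror_url: str,
                         piece_len: int = 262144) -> bytes:
    """Build single-file torrent metadata whose url-list (BEP 19) lets an
    existing web server act as a seed with no software modification."""
    pieces = b"".join(hashlib.sha1(data[i:i + piece_len]).digest()
                      for i in range(0, len(data), piece_len))
    meta = {
        "info": {"name": name, "length": len(data),
                 "piece length": piece_len, "pieces": pieces},
        "url-list": [mirror_url],  # the dataset's existing HTTP location
    }
    return bencode(meta)
```

A university already serving `dataset.bin` over HTTPS would pass that URL as `mirror_url` and publish the resulting file; clients then fetch from the web server and from peers interchangeably.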

Universities are already running protocol nodes as routine operations. TU Dortmund's Fachschaft Informatik operates a Matrix homeserver (matrix.fachschaften.org) for university-wide messaging in the university's own data center, run by student volunteers [100]. TU Dresden runs Matrix for 18,000 users on existing information technology staff and student assistants [101]. The Massachusetts Institute of Technology runs both a Mastodon instance and a Forgejo Git forge through its Student Information Processing Board, on existing server hardware, at effectively zero cost to the university [102]. Over 45 universities run Tor relays as background processes requiring near-zero maintenance [98]. Academic Torrents distributes over 298 terabytes of research data across volunteer seeders at zero central infrastructure cost [97]. No published study directly measures the incremental cost of adding protocol nodes to existing institutional infrastructure, because the costs have apparently not been significant enough to track.

7.4 The comparative cost structure #

The comparison across the three tiers is direct. Tier 1 hosted storage carries the lowest annual line item but the full latent liability documented in Section 5; Tier 2 fees are modest but fund organizational models that can fail outright, as the Digital Preservation Network and MetaArchive cases show; Tier 3 participation approaches zero marginal cost on infrastructure institutions already operate. The strongest preservation guarantees and the lowest marginal cost sit at the same tier.

7.5 The return on investment in research data infrastructure #

The return on well-maintained research data infrastructure is positive in every documented study, across disciplines and geographies. Representative measurements run from 7× operating cost (Australia's National Collaborative Research Infrastructure Strategy) to 20× or higher (EMBL-EBI at 20–26×, XSEDE at 18–88×, Purdue research computing at 49×), with the Protein Data Bank as a documented outlier at 800×.

The European Molecular Biology Laboratory – European Bioinformatics Institute operates on approximately 50 million pounds sterling per year and generates an estimated 1 to 1.3 billion pounds sterling annually in user value, a return of roughly 20 to 1 [104]. The United Kingdom's Archaeology Data Service produces 13 million pounds sterling in annual efficiency gains against operating costs, a return of roughly five to one [105]. Australia's National Collaborative Research Infrastructure Strategy program returns $7 for every $1 invested [106]. The Extreme Science and Engineering Discovery Environment cyberinfrastructure program generated $4.7 billion to $22.7 billion in total value on a $257.5 million investment [107]. Apon and colleagues found that every $100,000 in research-computing salaries is associated with a $14.3 million increase in higher education research and development expenditure, and every 100 TeraFLOPs of added capacity with a $1.3 million increase [108]. The Protein Data Bank, operating on approximately $6.1 million per year in federal funding, generates an estimated $5.5 billion in annual economic impact, an 800 to 1 return that represents the documented outlier [109].

The implications extend beyond direct value multiples. Infrastructure investment is inseparable from R1 status, where the 2025 Carnegie threshold requires $50 million in annual research spending and 70 research doctorates awarded [110]. Papers with open datasets receive 9% more citations after controlling for journal impact factor, author history, and institutional citation history [55], and every 100 deposited datasets generate over 150 reuse papers within five years [55].

7.6 The open-data multiplier #

Distribution is an architecture for redundancy, not a policy on access. Three techniques allow protocol-level distribution to handle sensitive data without exposing it. Client-side encryption keeps the data unreadable on every replica; institutional keys never leave the institution, and a compromised remote node exposes only ciphertext. Permissioned networks — private BitTorrent trackers, federated Matrix servers, permissioned IPFS clusters — constrain which partners hold copies, and jurisdiction can be bounded to a set of cooperating countries or institutions. Content addressing separates integrity from access: any node can verify that bytes have not been tampered with by recomputing a hash, without being able to read the underlying data. These techniques are the same ones institutions already apply to other sensitive data through encrypted offsite backups, virtual private network–tunneled disaster recovery, and federated clinical research consortia including the Electronic Medical Records and Genomics Network and the All of Us Research Program. Health records covered by the Health Insurance Portability and Accountability Act, student data covered by the Family Educational Rights and Privacy Act, export-controlled research, and embargoed datasets all participate in the same architecture as open data, with encryption and permission layers added on top.
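The separation of integrity from access can be shown in a few lines. In the sketch below the stream cipher is a deliberately simplified stand-in (SHA-256 in counter mode) for the authenticated encryption a real deployment would use; the point is architectural: the remote replica verifies the hash of the ciphertext without ever holding the key, and only the key holder recovers the plaintext.

```python
import hashlib, secrets

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Illustrative stream cipher only — a real deployment would use an
    authenticated cipher such as AES-GCM. Symmetric: applying it twice
    with the same key and nonce recovers the input."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

# Institution side: encrypt, then content-address the ciphertext.
key, nonce = secrets.token_bytes(32), secrets.token_bytes(16)
plaintext = b"patient-level measurements, not for remote eyes"
ciphertext = keystream_xor(key, nonce, plaintext)
cid = hashlib.sha256(ciphertext).hexdigest()  # identifier shipped with each replica

# Remote replica side: the integrity check needs no key and reveals no plaintext.
assert hashlib.sha256(ciphertext).hexdigest() == cid

# Institution side: only the key holder can recover the data.
assert keystream_xor(key, nonce, ciphertext) == plaintext
```

A compromised replica in this arrangement yields ciphertext and a hash, nothing more, which is exactly the property that lets sensitive data ride the same distribution network as open data.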

When the preserved data can also be shared openly, the return documented in Section 7.5 compounds further, because openness converts preserved data into reuse at planet scale. Federal investment in the Human Genome Project and subsequent genomics research totaled $14.5 billion from 1988 through 2012 and generated $965 billion in economic impact [111]. The Landsat satellite program distributed Earth observation imagery for decades with limited public value: before the 2008 open-access policy, a maximum of 53 scenes were downloaded per day; after the policy change, downloads reached 5,700 scenes per day and the program's estimated economic value reached $25.6 billion per year by 2023 [112]. The data and the satellites were identical; opening access unlocked the value. Eighty-eight percent of 210 new United States Food and Drug Administration–approved drugs between 2010 and 2016 were facilitated by open Protein Data Bank structures [13a]; 100% of 34 new cancer drugs approved between 2019 and 2023 relied on Protein Data Bank data [81]. On January 10–11, 2020, the SARS-CoV-2 genome was shared publicly via virological.org and GISAID; BioNTech's Project Lightspeed launched on January 27, 2020 — seventeen days later, with eight vaccine candidates designed within 48 hours of the founders' decision — and Pfizer joined the partnership on March 17, 2020; the resulting Pfizer-BioNTech vaccines generated an estimated $1.9 trillion in global economic value, part of $5.2 trillion across all COVID-19 vaccines [113]. The European Commission estimated the cost of not having FAIR research data at a minimum of 10.2 billion euros per year across the European Union [114].

Resilient infrastructure delivers the preservation that allows data to survive long enough for compounding use to materialize. Openness delivers the reuse that realizes the value. The investment in infrastructure pays off in both cases; the open-data multiplier compounds the return where the data can be opened.


8. Verification as an Architectural Property #

Section 5.4 documented the regulatory trajectory: funders are shifting from self-reported data management plans to programmatic verification, False Claims Act precedent is establishing settlement ranges for institutions that cannot defend their compliance claims, and research is converging with every other industry in which consequential data must be producible on demand. This section addresses the structural question that trajectory raises: what kind of infrastructure can generate the evidence funders and auditors are beginning to require, and what kind cannot.

8.1 The structural disconnect between plans and infrastructure #

The compliance gap between what institutions promise funders and what they deliver is structural rather than behavioral. A researcher writes a data management plan because the funder requires one. The plan describes depositing data in a repository, maintaining metadata, and providing access. The researcher receives the grant, spends three to five years generating data, and stores it on a laboratory server or a personal drive in whatever format is convenient at the time. When the grant ends, the plan sits in a filing cabinet and the data sits on a hard drive. Neither is connected to the other. The plan was a compliance artifact, and the institution provided no infrastructure to make it anything else. This is a Tier 0 problem dressed in Tier 1 language: the plan promises Tier 1 behavior, while the infrastructure supporting most researchers remains Tier 0.

8.2 Verification as a byproduct of architecture #

The architectural tier at which an institution operates its data infrastructure determines whether it can produce verification evidence on inspection. Tier 1 infrastructure is opaque by design: the provider manages the backups, and the customer trusts the service level agreement. An institution operating at Tier 1 can tell a funder "we deposited the data in a repository," but it cannot prove that the repository's backups are geographically separated or that restoration has been tested. Tier 2 adds real redundancy through institutional agreements, but the verification is what the consortium asserts about its own protocols rather than what an outside party can independently re-run; MetaArchive's 2025 sunset audit showed those protocols can silently fail with no external party positioned to catch it, and when the consortium itself dissolves the verification dissolves with it.

Tier 3 infrastructure generates verifiable evidence as a structural byproduct of operation. When data is content-addressed, its integrity is mathematically verifiable [10]: altering one byte alters the hash, and any node can detect the discrepancy without trusting the source. When data is distributed across independent nodes, the number of copies and their locations are observable by inspection. The question "how many independent copies of this dataset exist, where are they, and are they intact?" becomes answerable because the protocol provides that information as a consequence of how it operates.
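The detection property is mechanical: flipping a single byte of a content-addressed dataset changes its identifier, and any node holding the published identifier detects the change without trusting the source. SHA-256 stands in here for whatever hash the protocol in question uses.

```python
import hashlib

# A content-addressed dataset: its published identifier is its hash.
dataset = bytearray(b"temperature,2015-06-01,21.4\n" * 1000)
original_cid = hashlib.sha256(dataset).hexdigest()

# Alter one byte anywhere in the 28,000-byte dataset.
dataset[17] ^= 0x01

# The recomputed hash no longer matches the published identifier.
assert hashlib.sha256(dataset).hexdigest() != original_cid
```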

The May 2026 National Institutes of Health standardized Data Management and Sharing Plan format replaces narrative descriptions with three structured yes/no questions — whether scientific data will be shared, whether sharing will occur by the time of publication or end of performance period, and whether shared data will remain available for as long as repository or journal policies require — plus a data-type-and-repository table and a privacy attestation for human-subjects data [67]. The format is forward-looking, designed to make compliance review machine-actionable; it does not yet require independent verification that the data described in the plan exists, that it resides at the location the plan specified, that it has not been altered since deposit, or that access controls match the plan's stated terms. But the architectural direction is clear: structured, queryable plans demand structured, queryable infrastructure to back them. At Tier 1, each of those four properties opens a different audit burden if and when funders require evidence: existence can be checked by a manual query to a repository portal; location is a claim from the provider that the institution cannot independently verify; integrity is an assertion about backups the institution cannot inspect; and access control is a screenshot. At Tier 3, all four answers derive from a single cryptographic query across the distribution network. The hash confirms integrity; the node list confirms location and copy count; the access layer confirms permission state. The institution can produce a signed attestation that the auditor can independently re-verify.
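What such a query returns can be sketched. The attestation format below is illustrative rather than any published schema, and the HMAC is a stand-in for the asymmetric signature a real deployment would use (an auditor cannot verify an HMAC without the institution's key, so a production system would sign with a private key and publish the public key).

```python
import hashlib, hmac, json

INSTITUTION_KEY = b"demo-key"  # stand-in for an asymmetric signing key

def sign_attestation(cid: str, nodes: list[str], access: str) -> dict:
    """Produce a signed record of content address, replica locations,
    and access-control state."""
    body = {"cid": cid, "nodes": nodes, "access": access}
    mac = hmac.new(INSTITUTION_KEY,
                   json.dumps(body, sort_keys=True).encode(), hashlib.sha256)
    return {**body, "sig": mac.hexdigest()}

def audit(attestation: dict, fetched_bytes: bytes) -> dict:
    """Answer the four audit questions from a single query result."""
    body = {k: attestation[k] for k in ("cid", "nodes", "access")}
    mac = hmac.new(INSTITUTION_KEY,
                   json.dumps(body, sort_keys=True).encode(), hashlib.sha256)
    return {
        "exists_and_intact": hashlib.sha256(fetched_bytes).hexdigest()
                             == attestation["cid"],
        "copy_count": len(attestation["nodes"]),
        "locations": attestation["nodes"],
        "access_state": attestation["access"],
        "signature_valid": hmac.compare_digest(mac.hexdigest(),
                                               attestation["sig"]),
    }

data = b"deposited dataset bytes"
att = sign_attestation(hashlib.sha256(data).hexdigest(),
                       ["node-a.example.edu", "node-b.example.org",
                        "node-c.example.net"],
                       "restricted: dua-2026-017")
report = audit(att, data)
assert report["exists_and_intact"] and report["signature_valid"]
assert report["copy_count"] == 3
```

Existence and integrity fall out of the hash comparison, location and copy count out of the node list, and access state out of the signed permission field; tampering with either the bytes or the attestation fails the corresponding check.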

The architectural shift from trust-based to verification-based compliance also eliminates a category of unfalsifiable defense against misconduct allegations. In 2012, Erasmus University concluded it had no confidence in the scientific integrity of social psychologist Dirk Smeesters' published work, and its 2014 final report formally found misconduct across seven papers; when asked to produce raw data supporting his published results, Smeesters responded that his home computer had crashed and that selectively discarding data was nothing out of the ordinary in his field and his department [115]. The "my data is lost" defense is credible only in an architecture that cannot distinguish lost data from data that never existed in the form reported. Under content-addressed deposit at the point of collection, the hashes and signatures either resolve against the original attestation or they do not; data loss becomes a testable claim rather than an unfalsifiable one. The reckless-disregard theory developed in Section 5.4 applies with particular force to institutions in which this category of defense is still structurally available, because the absence of verification infrastructure is precisely what makes the defense possible.


9. What Tier 3 Makes Possible #

Sections 2 through 8 developed the architectural case in diagnostic terms: what is failing, what it costs, and what architecture would close the gap. This section describes what becomes possible when Tier 3 preservation is the operating standard rather than the architectural aspiration. Each implication derives from a specific structural property established in the preceding sections.

Universal coverage. Tier 2 extended coordinated preservation to a handful of well-funded disciplines — nucleotide sequences, macromolecular structures, particle physics. The 73 to 93 percent of research output that sits outside that coverage — cross-disciplinary work, small-team studies, underfunded projects, data types without a community standard — becomes preservable for the first time, at near-zero marginal cost on infrastructure institutions already operate. The universe of research data with durable, verifiable preservation expands from the covered disciplines to the entire research enterprise.

Preservation horizon decoupled from project budget. Under the current regime, research data has the useful life of its originating grant — three to five years, after which maintenance ends and the data enters the failure modes cataloged in Section 3. Under Tier 3, preservation is a byproduct of participation rather than a line item on a time-bounded award. Longitudinal cohorts, ecological time series, and cross-decade measurement studies — the class of research for which a 63-year yellow-bellied marmot record is the exception rather than the norm — become structurally supportable instead of heroically sustained. The time horizon of preservation aligns with the time horizon of scientific inquiry rather than with the fiscal year of a grant cycle.

Reanalysis becomes a first-class research activity. When data persists across decades with cryptographically verifiable integrity, applying new analytical methods to older data stops being archaeological and becomes routine. Methods developed in 2035 can be applied to data collected in 2015 without negotiating access to an emeritus PI's personal drive, reconstructing experimental context from fragmentary documentation, or accepting that the comparison cannot be run. The scientific record compounds into a queryable substrate rather than an accumulating list of unverifiable claims.

Verification as a byproduct of operation. Section 8 established that Tier 3 infrastructure generates verification evidence structurally. The institutional posture this produces is qualitatively different from the current state. Audit requests become inspection queries rather than forensic reconstructions. Data management plan compliance is answerable on demand rather than defended after the fact. The regulatory convergence documented in Section 5.4 stops being an escalating threat and becomes a compliance posture the institution has already met. Institutions competing for faculty, partnerships, and federal funding against this baseline will either match it or absorb the differential cost of operating below it.

Compounding downstream reuse. Every additional year of preservation extends the window during which downstream researchers can build on the data, producing the compounding pattern documented in Section 7.5 — 150 reuse papers per 100 deposited datasets within five years, with citation and reuse returns continuing across the useful life of the data. Under the current regime, that useful life is truncated at the point of loss; under Tier 3, it extends to the horizon of continued scientific interest. The returns documented in Section 7.5 are measured against preservation regimes that already truncate at the grant cycle; the returns measurable against a regime that does not truncate have not yet been priced because the regime does not yet exist.

The architecture that produces these outcomes is already deployed at scale in adjacent domains: DNS, Git, and BitTorrent have operated for decades on the principles developed in Section 2. The remaining requirement is deliberate research-sector adoption. Sections 10 and 11 set out what that adoption looks like.


10. Tier 3 as AI-Era Research Infrastructure #

Section 9 described what Tier 3 makes possible for research data preservation. This section addresses the parallel question that the 2026 institutional environment makes unavoidable: what Tier 3 makes possible for the artificial intelligence work that is now the dominant strategic priority across United States research universities. The architectural properties that produce preservation as a byproduct of operation are the same properties that produce the data substrate artificial intelligence development requires.

10.1 Artificial intelligence is the institutional priority of 2026 #

Artificial intelligence and data are the top-ranked institutional issue in EDUCAUSE's 2025 Top 10 IT Issues report, listed as "The Data-Empowered Institution" — using data, analytics, and AI to drive student success, enrollment, research funding, and operational efficiency [118]. They are the dominant theme in 2024–2026 federal research funding announcements and the principal axis of competitive differentiation in faculty recruiting at every R1 institution.

The National Science Foundation's National Artificial Intelligence Research Resource Pilot launched on January 24, 2024, with NSF, ten partner federal agencies, and twenty-five private-sector, nonprofit, and philanthropic partners; by 2026 the pilot had expanded to fourteen federal agencies and twenty-eight nongovernmental partners [119][120]. The pilot operates under the National Artificial Intelligence Initiative Act of 2020, incorporated into the William M. (Mac) Thornberry National Defense Authorization Act for Fiscal Year 2021 [121]; production-scale codification has been pursued through the CREATE AI Act, currently pending in Congress [122]. Twenty-nine NSF National AI Research Institutes have been funded across United States universities at approximately twenty million dollars each over five years [123]. Further federal programs compound the redirection: the Department of Energy's Office of Science Artificial Intelligence Initiative and the Genesis Mission, launched by executive order on November 24, 2025 and backed by over $320 million in DOE investments announced in December 2025 across the American Science Cloud, the Transformational AI Models Consortium, automated-laboratory projects, and foundational AI research awards [124]; the Department of Defense's Chief Digital and Artificial Intelligence Office [125]; and DARPA's AI Next campaign and successor AI Forward initiative, together representing more than two billion dollars of DARPA AI investment since 2018 [126]. Collectively these programs redirect approximately $3.3 billion per year in nondefense federal AI research and development, plus several billion more in defense applications, toward artificial-intelligence-enabling capacity [127]. Institutions without a credible artificial intelligence strategy in 2026 are competing for a shrinking residual share of the federal research pool.

10.2 The data requirements of artificial intelligence are the architectural properties of Tier 3 #

The data properties required for defensible artificial intelligence development map directly onto the architectural properties developed across Sections 2 through 8.

Provenance — the ability to establish what data trained a model and where that data came from — is the structural product of content addressing and signed deposit (Section 2.1, Section 8.2).

Reproducibility — the ability to re-run a training pipeline against the original corpus — requires that the corpus persist across the lifetime of any model trained on it, which is exactly the preservation property Tier 3 produces (Section 6.2, Section 9).

Federation — the ability to train across institutional boundaries without consolidating sensitive data into a single trust domain — is the operational pattern Section 7.6 documented for permissioned BitTorrent, federated Matrix, and permissioned IPFS clusters, and the same architecture HIPAA-covered, FERPA-covered, and export-controlled research already requires.

Verification — the ability to demonstrate to a regulator, a court, or a peer reviewer that the training data was what the model card claims it was — is the architectural property developed in Section 8: a single cryptographic query across the distribution network produces evidence that any third party can independently re-verify.

The institutions that operate Tier 3 preservation nodes for the reasons developed in Sections 2 through 9 end up holding the artificial-intelligence-ready data substrate as a structural byproduct. The provenance-verified, content-addressed scientific data the architecture produces for compliance reasons is the same data the artificial intelligence development pipeline cannot operate without. The infrastructure investment serves both purposes from the same deployment.

10.3 The competitive position #

The institutional consequence is direct. Federal artificial intelligence grant programs increasingly emphasize data-management capacity, reproducibility, and access governance in solicitation language. The NSF National Artificial Intelligence Research Institutes solicitation (NSF 23-610) requires institutes to develop shared community infrastructure for data and software supporting reproducibility [128], and the NAIRR Pilot's NAIRR-Open focus area provides access to AI resources for open research while NAIRR-Secure, co-led by NIH and DOE, provides privacy- and security-preserving resources that require documented data-governance practices [120]. The artificial intelligence faculty market is the most competitive faculty market in academia, driven by a two-to-three-times compensation differential between academic and industry AI roles that creates sustained retention pressure on institutions; offer-acceptance rates depend on institutional data and compute infrastructure that determines whether candidates can run the research their industry counterparts could fund directly [129]. International competitors are closing the gap on the same infrastructure axis: the European Open Science Cloud, with its EU Node operational since October 2024 and the EOSC Federation in build-up phase under a 2025–2026 governance handbook [130]; the European High Performance Computing Joint Undertaking, whose EuroHPC Federation Platform development began in January 2025 with first release in April 2026 alongside its AI Factories program [131]; and the Chinese Academy of Sciences' coordinated scientific data infrastructure, operated through the China Science and Technology Cloud and managing eleven of twenty national scientific data centers under the National Science and Technology Infrastructure framework [132] — together offer data-governance capabilities that put competing United States institutions on the wrong side of a measurable and widening gap.

The institutions that deploy Tier 3 infrastructure in 2026 hold three positions simultaneously: the preservation posture documented in Sections 2–9, the verification posture documented in Section 8, and the artificial-intelligence-readiness posture documented in this section. The institutions that do not deploy it cede each of those positions in the same operational year, on infrastructure they already operate at substantial idle capacity (Section 7).

10.4 The compounding asymmetry #

Section 5 quantified the institutional liability of single-copy architecture at approximately $1.1 billion per year at a representative R1, against a probability of surfacing that three independent vectors are pushing upward. Section 7 priced the prevention at near-zero marginal cost on existing institutional infrastructure. The artificial intelligence dimension converts that asymmetry from one-sided hedging into two-sided positioning: the same investment that hedges the Section 5 liability captures the upside this section describes. The institution that deploys Tier 3 simultaneously closes exposure to the failure modes documented in Section 3 and acquires the data infrastructure its artificial intelligence strategy requires. The institution that does not deploy it carries the full liability and forgoes the full upside, on the same infrastructure base.

The architecture that preserves the scientific record is the architecture that powers what the scientific record will become.


11. Recommendations #

The analysis developed in Sections 2 through 8 supports seven recommendations addressed to three audiences: research institutions, funders, and the working group coordinating the development of reference infrastructure. Each recommendation identifies the audience, specifies the action, and sets out the rationale connecting the action to the architectural case.

Recommendation 1: Conduct an architectural audit of existing data infrastructure #

Audience: Research institutions (offices of research, information technology leadership, research libraries).

Action: Classify every research dataset the institution holds by its current architectural tier, using the framework developed in Section 2. For each dataset, record the number of independent copies, the failure domains those copies occupy, the verification capability available, and the exposure to each failure mechanism documented in Section 3. Complete the audit within twelve months.

Rationale: An institution cannot remediate architectural exposure it has not measured. The audit creates the baseline against which subsequent recommendations are scoped and priced, and it surfaces the datasets most immediately exposed to the failure mechanisms the institution has already experienced at peer institutions. The audit template produced by the working group (Appendix D) is designed to complete in approximately two staff-months at a research university of median size.

Recommendation 2: Deploy at least one protocol-level preservation node on existing institutional infrastructure #

Audience: Research institutions (research information technology, institutional repositories, research libraries).

Action: Deploy at least one Tier 3 preservation node — a BitTorrent seeder, a Tor relay, a Forgejo instance, an IPFS pinning node, or an AT Protocol personal data server — on existing institutional infrastructure within twelve months. Document the deployment as a reference configuration for subsequent institutions.

Rationale: Tier 3 is the only architecture that generates preservation redundancy and compliance verification as structural byproducts of operation. The marginal cost on existing infrastructure is effectively zero, as documented in Section 7. The deployment establishes the institution's capacity to participate in protocol-level preservation before the mandate regime requires it, and produces the operational experience necessary to scale subsequent deployments. Reference configurations from existing deployments at TU Dortmund, TU Dresden, and the Massachusetts Institute of Technology [100][101][102] demonstrate that the operational overhead is within the capacity of existing information technology staff or student volunteers.

Recommendation 3: Integrate compliance evidence generation into the data deposit workflow #

Audience: Research institutions (offices of research compliance, research libraries, research information technology).

Action: Modify the data deposit workflow so that every dataset produces, at the point of deposit, the verification artifacts required by funder mandates: content-addressed identifiers, cryptographic hashes of all deposited objects, signed attestations of deposit location and access control state, and machine-readable metadata conforming to the forthcoming National Institutes of Health standardized Data Management and Sharing Plan format.

Rationale: Compliance evidence generated as a byproduct of the deposit workflow is the only configuration under which compliance checks become answerable by inspection rather than by retrospective investigation. The May 2026 National Institutes of Health format transition and the Gates Foundation's OA.Works monitoring program are both converging on machine-readable verification, and institutions whose deposit workflows do not produce the required artifacts will fail programmatic compliance checks they cannot currently see coming. This recommendation is the operational expression of the architectural property described in Section 8.2.
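One way to make the artifacts a byproduct of deposit is a manifest generator invoked by the deposit workflow itself. The sketch below is a minimal version; the field names are illustrative rather than any published schema, and a production hook would also emit the signed attestation and repository receipt.

```python
import hashlib, json, time
from pathlib import Path

def deposit_manifest(dataset_dir: str, repository: str, access: str) -> str:
    """Emit machine-readable deposit artifacts at the moment of deposit:
    per-file hashes, a dataset-level content address, the deposit location,
    and the access-control state."""
    files = sorted(p for p in Path(dataset_dir).rglob("*") if p.is_file())
    entries = [{"path": str(p.relative_to(dataset_dir)),
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest()}
               for p in files]
    # Dataset-level identifier: hash of the ordered per-file hashes.
    dataset_cid = hashlib.sha256(
        "".join(e["sha256"] for e in entries).encode()).hexdigest()
    return json.dumps({
        "dataset_cid": dataset_cid,
        "files": entries,
        "repository": repository,
        "access": access,
        "deposited_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }, indent=2)
```

Because the manifest is deterministic over the file contents, re-running the generator at audit time either reproduces `dataset_cid` exactly or demonstrates that the deposited data has changed.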

Recommendation 4: Require verifiable evidence of data preservation, not self-reported plans #

Audience: Funders (federal agencies, private foundations, international funders).

Action: Transition grant submission and progress reporting requirements from self-reported data management plans to verifiable evidence of deposit, distribution, and access. Specify the technical form of the evidence — content-addressed identifiers, cryptographic hashes, signed attestations from independent nodes — rather than the human-readable form of the plan.

Rationale: The compliance gap — 8% declared availability, 2% actual availability across 2.1 million articles [84] — is a direct consequence of a regime that measures plan existence rather than plan execution. Transitioning the reporting requirement to verifiable evidence aligns the compliance check with the architectural property that produces compliance in the first place. The Gates Foundation's 2025 transition to automated compliance monitoring through OA.Works [68] is the reference implementation of this recommendation at the funder level.

Recommendation 5: Fund preservation through facilities and administrative cost recovery #

Audience: Funders and research institutions (jointly, through facilities and administrative rate negotiation).

Action: Include preservation infrastructure as a recognized category in facilities and administrative cost rate calculations, and negotiate rates that reflect the long-term institutional cost of data stewardship. Treat preservation as a continuing facilities cost rather than a project-scoped expense.

Rationale: As the National Academies of Sciences, Engineering, and Medicine documented in 2020, "the current system for funding research is not conducive to data life-cycle cost forecasting" [116]. Grants run three to five years; preservation needs run decades. A funding mechanism that scopes preservation to grants forces researchers to deprecate preservation obligations at grant close, which is the operational origin of the failure mode described in Section 3.1. Facilities and administrative cost mechanisms already exist for exactly this purpose; the recommendation is to apply them.

Recommendation 6: Maintain local clones and content-addressed copies of all research data #

Audience: Principal investigators, laboratory directors, and individual researchers.

Action: As a standard laboratory practice, maintain at least one local clone and one content-addressed copy of every research dataset. Use Git for code; use BitTorrent, IPFS, or Git Large File Storage for data; use Signal-style end-to-end encryption where the data is sensitive. Treat the clone and the content-addressed copy as non-optional components of the research workflow.

Rationale: The GitHub-Iran episode documented in Section 2.5 demonstrates the difference between using Tier 3 infrastructure and capturing its resilience properties [25]. A single local clone of a Git repository contains the complete repository history with cryptographic integrity, and its existence determines whether a Tier 1 access restriction produces permanent loss or temporary inconvenience. The recommendation is operationally trivial at the laboratory level and architecturally decisive at the institutional level.
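
For the content-addressed copy, a few lines suffice at the laboratory level. The sketch below builds a minimal local content-addressed store in the spirit of Git's object database: each file is copied to a path derived from its SHA-256 hash, so a later re-hash of any stored object proves the copy is intact. The two-character fan-out directory layout mirrors `.git/objects`; the layout and function name are illustrative, not a standard tool.

```python
import hashlib
import pathlib
import shutil


def content_addressed_copy(src: pathlib.Path, store: pathlib.Path) -> dict:
    """Copy every file under `src` into `store` at a hash-derived path.

    Returns an index mapping each relative path to its SHA-256 hash. Identical
    files deduplicate automatically, since they share a content address.
    """
    store.mkdir(parents=True, exist_ok=True)
    index = {}
    for f in sorted(src.rglob("*")):
        if f.is_file():
            h = hashlib.sha256(f.read_bytes()).hexdigest()
            dest = store / h[:2] / h[2:]  # fan out like .git/objects
            dest.parent.mkdir(exist_ok=True)
            if not dest.exists():
                shutil.copy2(f, dest)
            index[str(f.relative_to(src))] = h
    return index
```

Pointing `store` at a second disk or a departmental file server yields the independent failure domain the recommendation calls for; the index doubles as a fixity record.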

Recommendation 7: Publish reference deployments, audit templates, and cost models through coordinated working-group activity #

Audience: The Resilient Data Futures working group, coordinated with partner organizations.

Action: Publish, under open licensing, reference deployment configurations for each protocol node class identified in Section 7.3, audit templates for Recommendation 1, cost models calibrated to the formula in Section 5.2, and case studies documenting institutional deployments. Update the material as the working group collects additional evidence.

Rationale: The adoption cost of each of the preceding recommendations declines as reference implementations accumulate. The Software Heritage project, the CLOCKSS documentation, the Forgejo and IPFS reference deployments, and the existing university Matrix and Tor deployments demonstrate that well-documented reference material substantially accelerates subsequent adoption. The working group convened by SciOS is the natural home for this material; publication under open licensing ensures that the material is not itself exposed to the failure modes documented in Section 3.


12. Conclusion #

The empirical record on research data preservation documents an outcome: 73 to 93 percent of published research carries underlying data that cannot be produced on request; the odds of a dataset still existing fall by roughly 17% per year after publication, conditional on author response; 8% of published research declares its data available and 2% actually is; 191 repositories have closed since 2012, nearly half giving no indication of data migration or continued limited access; and the resulting institutional liability runs approximately $1.1 billion per year at a representative $200 million R1 research university, with tens of billions of dollars in sector-wide exposure across the research enterprise.

This paper has argued that the outcome is architectural rather than operational. The mechanisms that produce loss — platform discontinuation, access restriction, reference decay, funding termination, physical disaster, personnel turnover — are the normal operating conditions of research, and each of them produces permanent loss only on single-copy architecture. The same mechanisms are absorbed without permanent loss when independent copies exist across independent failure domains. The architectural tier at which an institution operates its data infrastructure — Tier 0, Tier 1, Tier 2, or Tier 3 — determines the outcome, and the determination is structural rather than procedural.

The cost of single-copy architecture is a compound cost. It includes the direct institutional liability carried on every dataset the institution cannot produce on request, the compliance and verification exposure that accumulates as the mandate regime matures, and the compounding loss of scientific output that the destroyed data would have produced across its useful life. Any single dimension understates the total; the three together describe what the research enterprise has already been paying, quietly and continuously, for the architecture it has.

The scientific community has built the most sophisticated coordinated preservation systems on Earth at Tier 2, and those systems demonstrate that multi-site, multi-copy preservation works at scale for well-funded disciplines. They do not extend to the long tail of research that generates most of the literature and most of the data, and their resilience remains contingent on the continued coordination of a small number of organizations. Tier 3 — protocol-level distribution on the same architectural pattern that has sustained the Domain Name System, email, BitTorrent, and Git for decades — extends the guarantees of Tier 2 to every dataset, at near-zero marginal cost on infrastructure institutions already operate. The economics favor the investment by more than an order of magnitude. The mandate regime is converging on verification that Tier 3 generates as a byproduct and that Tier 1 cannot produce by inspection. The architectural case, the economic case, and the fiduciary case all point toward the same conclusion.

The infrastructure required is infrastructure that already exists on every research campus in a state of substantial idle capacity. The tools required have operated at planetary scale for two decades or more. The remaining requirement is the institutional decision to deploy them. The recommendations in Section 11 specify the first steps; the working group convened by SciOS coordinates the reference material that lowers the cost of subsequent steps; the argument developed across Sections 2 through 8 establishes why the decision, once made, produces outcomes measurable in preserved data, satisfied compliance obligations, and avoided False Claims Act exposure.

The most expensive data infrastructure is the infrastructure built after the disaster. The analysis supports the opposite investment.


References #

[1] Vines, T.H. et al. "The Availability of Research Data Declines Rapidly with Article Age." Current Biology 24(1), 2014.

[2] Gabelica, M., Bojčić, R. & Puljak, L. "Many researchers were not compliant with their published data sharing statement: A mixed-methods study." Journal of Clinical Epidemiology 150:33–41, 2022. doi:10.1016/j.jclinepi.2022.05.019. PubMed: 35654271.

[3] Wicherts, J.M., Borsboom, D., Kats, J. & Molenaar, D. "The poor availability of psychological research data for reanalysis." American Psychologist 61(7):726–728, 2006. doi:10.1037/0003-066X.61.7.726.

[4] Acciai, C., Schneider, J.W. & Nielsen, M.W. "Estimating social bias in data sharing behaviours: an open science experiment." Scientific Data 10:233, 2023. doi:10.1038/s41597-023-02129-8.

[5] Strecker, D., Pampel, H., Schabinger, R. & Weisweiler, N.L. "Disappearing repositories: Taking an infrastructure perspective on the long-term availability of research data." Quantitative Science Studies 4(4):839–856, 2023. doi:10.1162/qss_a_00276.

[6] Uptime Institute. Global Data Center Survey 2023, 2024, and 2025 (in 2023, 40% of data center operators measured server utilization; 41% of organizations tracked server utilization as a sustainability metric in 2024); Taylor, J. & Koomey, J. "Zombie/Comatose Servers Redux," Anthesis Group / Stanford University, 2015 and 2017 update (30% of enterprise servers comatose, 2015; 25% of physical servers and ~30% of virtual machines comatose, 2017); Whitney, J. & Delforge, P. "Data Center Efficiency Assessment: Scaling Up Energy Efficiency Across the Data Center Industry," Natural Resources Defense Council Issue Paper IP:14-08-A, August 2014, nrdc.org/sites/default/files/data-center-efficiency-assessment-IP.pdf (typical on-premises enterprise server runs at 12%–18% of capacity); Lawrence Berkeley National Laboratory, 2024 United States Data Center Energy Usage Report, December 2024, eta-publications.lbl.gov/sites/default/files/2024-12/lbnl-2024-united-states-data-center-energy-usage-report_1.pdf (industry-wide modeling assumes 70% average server utilization with substantial uncertainty, dominated by hyperscale).

[7] TeleGeography, State of the Network, 2023–2025 editions (global average IP bandwidth utilization ~26%, peak ~44%); TeleGeography WAN Manager Survey (SD-WAN / DIA / MPLS configuration trends); Cisco Annual Internet Report (2018–2023) networking best-practice guidance.

[8] Internet2, 2025 Higher Education Fee Model Details, internet2.edu/wp-content/uploads/2024/11/2025-higher-education-fee-model-details.pdf (scale-based Sustaining Contributions determined by institutional research and total expenditures, not traffic volume); Internet2, "Internet2 IP Backbone Capacity Augment Practice," internet2.edu/community/about-us/policies/internet2-ip-backbone-capacity-augment-practice/ (backbone circuits flagged for discussion when 95th-percentile utilization reaches 30% over a week; 50% headroom maintained on connections between regionals, Internet2, and campuses by policy; 40% sustained utilization triggers a backbone augment).

[9] Freedman, L.P., Cockburn, I.M. & Simcoe, T.S. "The Economics of Reproducibility in Preclinical Research." PLOS Biology 13(6):e1002165, 2015. doi:10.1371/journal.pbio.1002165.

[10] Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B., Keranen, A. & Hallam-Baker, P. Naming Things with Hashes. RFC 6920, Internet Engineering Task Force, April 2013. doi:10.17487/RFC6920. datatracker.ietf.org/doc/html/rfc6920.

[11] Software Heritage. Activity Report 2025, published January 2026, softwareheritage.org/2026/01/16/software-heritage-activity-report-2025/ (Merkle DAG content-addressed archive; 27 billion unique source files from 421 million projects; SWHID published as ISO/IEC 18670).

[12] Arita, M. et al. "The international nucleotide sequence database collaboration (INSDC): enhancing global participation." Nucleic Acids Research 53(D1):D62–D69, 2025. doi:10.1093/nar/gkae1088. GenBank Release 271.0, April 27, 2026 (53.90 trillion bases, 6.27 billion records), ncbiinsights.ncbi.nlm.nih.gov/2026/04/27/genbank-release-271-0-is-available/; DDBJ/ENA/GenBank Feature Table Definition, insdc.org/submitting-standards/feature-table/ (shared format for daily exchange across three continents).

[13] Berman, H.M., Henrick, K. & Nakamura, H. "Announcing the worldwide Protein Data Bank." Nature Structural Biology 10:980, 2003; wwPDB Consortium, wwpdb.org; RCSB PDB, "Economic Impact" summary, rcsb.org/pages/about-us/economic-impact (four-site weekly RSYNC synchronization across RCSB PDB, PDBe, PDBj, BMRB; ~227,000 experimentally determined structures; ~US$23 billion replacement cost).

[13a] Westbrook, J.D. & Burley, S.K. "How Structural Biologists and the Protein Data Bank Contributed to Recent FDA New Drug Approvals." Drug Discovery Today 24(2):412–429, February 2019. doi:10.1016/j.drudis.2018.11.020 (5,914 PDB structures providing structural coverage for 88% of the 210 new molecular entities approved by the U.S. FDA between 2010 and 2016 across all therapeutic areas; more than half of those structures were published and distributed openly more than ten years before drug approval).

[14] CLOCKSS Archive. Preservation statistics (63.6 million articles, 568,210 books, 12 mirror nodes, 300 libraries, 691 publishers, 81 triggered titles) and Triggered Content register, clockss.org and clockss.org/triggered-content/, accessed April 2026.

[15] CERN. "The Worldwide LHC Computing Grid." home.cern/science/computing/grid; WLCG collaboration, wlcg.web.cern.ch (1.5 exabytes, 1.4 million cores, >170 sites in 42 countries, >2 million tasks/day).

[16] ICANN Root Server System Advisory Committee. RSSAC002: Advisory on Measurements of the Root Server System, 2014, and subsequent quarterly reports; Verisign, Domain Name Industry Brief, 2025 (registered-domain counts); Cloudflare, "How Cloudflare analyzes 1M DNS queries per second" (aggregate global DNS query volume); Vercara/DigiCert, 2023 DNS Traffic and Trends Analysis (41.97 trillion DNS queries processed in 2023 on a single platform).

[17] Postel, J. Simple Mail Transfer Protocol. RFC 821, IETF, August 1982; Klensin, J. RFC 5321, IETF, October 2008.

[18] Radicati Group / Statista / EmailToolTester, Email Statistics reporting for 2026 (4.73 billion active users; 392.5 billion messages per day), emailtooltester.com/en/blog/how-many-emails-are-sent-per-day/.

[19] BitTorrent Inc., "BitTorrent Crosses Historic 2 Billion Installations Milestone," company blog and BusinessWire press release, August 11, 2020, bittorrent.com/blog/2020/08/11/bittorrent-crosses-historic-2-billion-installations/ (cumulative installations of BitTorrent and µTorrent clients across Windows, Mac, and Android).

[20] Internet Archive Blog, "Over 1,000,000 Torrents of Downloadable Books, Music, and Movies," August 7, 2012, blog.archive.org/2012/08/07/over-1000000-torrents-of-downloadable-books-music-and-movies/ ("the now fastest way to download items from the Archive"); Internet Archive Help Center, "Archive BitTorrents," help.archive.org/help/archive-bittorrents/ (BitTorrent described as "a supplement to traditional HTTP download," currently in beta).

[21] GitHub, Inc. Octoverse Report 2025; Git 1.0 release announcement, December 2005 / Linus Torvalds first commit April 7, 2005.

[22] Stack Overflow. 2024 Developer Survey, survey.stackoverflow.co/2024/ (93.87% Git adoption among professional developers).

[23] Corbet, J. "Kernel.org compromised." LWN.net, August 31, 2011, lwn.net/Articles/457142/; The Linux Foundation, "The Cracking of Kernel.org"; Wikipedia, kernel.org (integrity of the Linux kernel source guaranteed by Git's cryptographic hash chain).

[24] GitHub, Inc. Octoverse Report 2025, octoverse.github.com (630 million repositories; over 180 million developers).

[25] Liao, R. & Singh, M. "GitHub confirms it has blocked developers in Iran, Syria and Crimea." TechCrunch, July 29, 2019, techcrunch.com/2019/07/29/github-ban-sanctioned-countries/ (GitHub access restrictions imposed on developers in Iran, Syria, Crimea, Cuba, and North Korea under U.S. export controls); GitHub, "GitHub and Trade Controls," docs.github.com/en/site-policy/other-site-policies/github-and-trade-controls (official GitHub policy documenting the affected jurisdictions).

[25a] Saeedi Fard, H. "A Sad Day for Iranian Developers." Medium, July 2019 (verbatim email from GitHub support to a sanctioned-region developer: "Unfortunately we are not legally able to send an export of the disabled repository content. I'm sorry for the frustration here, but GitHub must comply with U.S. export control laws and sanction requirements."); archived via the Wayback Machine where the original is unavailable.

[26] National Center for Science and Engineering Statistics (NCSES), Survey of Earned Doctorates: 2023 and Doctorate Recipients from U.S. Universities: 2023, NSF, 2024–2025 (median time to degree); Council of Graduate Schools, Ph.D. Completion and Attrition: Analysis of Baseline Program Data from the Ph.D. Completion Project, 2008 (≈43% attrition); Kahn, S. & Ginther, D.K. "The impact of postdoctoral training on early careers in biomedicine." Nature Biotechnology 35(1), 2017, and Woolston, C., "Postdocs in crisis." Nature Careers survey, 2020 (median postdoc 4.5–4.6 years; ~17% tenure-track placement).

[27] Avelino, G., Passos, L., Hora, A. & Valente, M.T. "A Novel Approach for Estimating Truck Factors." IEEE 24th International Conference on Program Comprehension (ICPC), 2016. doi:10.1109/ICPC.2016.7503718. arxiv.org/abs/1604.06766 (65% of 133 popular GitHub projects have truck factor ≤ 2).

[28] Schembera, B. & Durán, J.M. "Dark Data as the New Challenge for Big Data Science and the Introduction of the Scientific Data Officer." Philosophy & Technology 33:93–115, 2020 (online March 13, 2019). doi:10.1007/s13347-019-00346-x (HLRS tape archive: 57 of 262 de-registered accounts holding ~619 TB of orphaned data as of December 1, 2017); Heidorn, P.B. "Shedding Light on the Dark Data in the Long Tail of Science." Library Trends 57(2), 2008 (dark-data typology).

[29] Kyoto University Institute for Information Management and Communication (IIMC), official incident notification on the loss of storage files on the supercomputer system, December 28, 2021, iimc.kyoto-u.ac.jp/en (buggy HPE backup script update executed December 14–16, 2021, deleting ~77 TB across 34 million files from 14 research groups; ~28 TB across ~25 million files across four groups unrecoverable); Claburn, T. "Kyoto University loses 77TB of supercomputer data after buggy update to HPE backup program," Data Center Dynamics, January 2022.

[30] National Museum of Brazil fire, September 2, 2018; Escobar, H. "In a 'foretold tragedy,' fire consumes Brazil museum." Science 361(6406):960, 2018, doi:10.1126/science.361.6406.960; American Historical Association, "The Degradation of History: The Brazil Museum Fire and the Problem of Underfunding in the GLAM Sector," Perspectives on History, December 2018 (≈18.4 million of 20 million items destroyed; chronic underfunding with operating/preservation budget collapsed to a small fraction of the ~R$520,000/yr needed); Daley, J. "Why Brazil's National Museum Fire Was a Devastating Blow to South America's Cultural Heritage." Smithsonian Magazine, September 5, 2018, smithsonianmag.com/smart-news/artifacts-destroyed-brazil-devastating-national-museum-fire-180970194/ (museum had not received its full annual $128,000 maintenance budget since 2014; received only $13,000 in 2018; maintenance budget cut by 90 percent by 2018); UNESCO and IBRAM documentation.

[31] Kaiser, J. "NIH terminations reach $2.45 billion." Science News, August 2025; Mervis, J. "NSF kills 1,700 grants." Science News, August 2025; Council on Governmental Relations and PNAS, "How the 2025 NIH grant terminations varied by researchers' demographic groups," PNAS, 2025, doi:10.1073/pnas.2527755123 (confirms 2,291 NIH grants / $2.45B terminated February–August 2025); Committee of Concerned Scientists / COSSA, "NSF Releases List of Terminated Grants," May 21, 2025 (1,752 NSF grants / $1.4B; STEM Education Directorate 839 / $888M); FY2026 President's Budget Request, 2025; American Astronomical Society, "The FY26 President's Budget Request: NASA and NSF Details," June 2025, aas.org/posts/news/2025/06/fy26-presidents-budget-request (FY2026 NSF request of $3.9 billion represents an approximately 56% cut from FY2025 enacted level of $9.06 billion); Grant Watch (grant-watch.us).

[32] NOAA National Centers for Environmental Information, "Billion-Dollar Weather and Climate Disasters" archive, ncei.noaa.gov (403 disasters / ~$2.9T in damages since 1980); NOAA Notice of Changes, May 2025, nesdis.noaa.gov/about/documents-reports/notice-of-changes/2025-notice-of-changes/billion-dollar-weather-and-climate-disasters (database retired, no updates beyond 2024); Washington Post, "NOAA will stop tracking billion-dollar weather disasters," May 8, 2025; Eos, "Proposed NOAA Budget Calls for $0 for Climate Research," 2025 (FY2026 elimination of NOAA Office of Oceanic and Atmospheric Research, including Mauna Loa's 1958-onward atmospheric CO₂ record).

[33] Environmental Data and Governance Initiative (EDGI), Federal Environmental Web Tracker, envirodatagov.org; Data Rescue Project, datarescueproject.org; Harvard Law School Library Innovation Lab, Data.gov archive project; Wikipedia, "2025 United States government online resource removals" (308,000 → 304,621 Data.gov datasets by February 21, 2025, ≈3,379 removed; >8,000 web pages modified; NOAA Eos reporting of 14 decommissioned earthquake, marine, and coastal datasets, April 2025).

[34] Blumstein, D.T. "The end of long-term ecological data?" PLOS Biology 23(4):e3003102, April 2025. doi:10.1371/journal.pbio.3003102 (cites analysis of 411 long-term mammal studies with 191 terminated; 63-year yellow-bellied marmot study at Rocky Mountain Biological Laboratory).

[35] NASA Astronomical Data Center closure notice, October 1, 2002, archived at classe.cornell.edu/~seb/celestia/adc-closure.html; Eichhorn, G. "The end of the NASA Astronomical Data Center," community discussion, 2003; Strecker et al. 2023 [5] for typology context.

[36] Arts and Humanities Research Council decision to discontinue Arts and Humanities Data Service co-funding, announced March 2007, effective March 31, 2008; Digital Curation Centre, "Arts and Humanities Data Service decision," dcc.ac.uk/news/arts-and-humanities-data-service-decision; Arts and Humanities Data Service, Wikipedia (five centres: Archaeology, History, Literature/Languages/Linguistics, Performing Arts, Visual Arts; Archaeology Data Service continued at University of York; History Data Service collections transferred to UK Data Archive at Essex).

[37] Banco de Información para la Investigación Aplicada en Ciencias Sociales (BIIACS), Centro de Investigación y Docencia Económicas, Mexico City; re3data record r3d100010400, DOI 10.17616/R3ZG6K; Data Seal of Approval certification, September 4, 2013 (first DSA-certified archive in Latin America); shutdown date December 15, 2023 per re3data record; discussed as a case study in Strecker et al. 2023 [5], Section 4.4.

[38] Murtfeldt, R., Alterman, N., Kahveci, I. & West, J.D. "RIP Twitter API: A eulogy to its vast research contributions." arXiv:2404.07340, April 2024 (33,306 studies across 8,914 venues and 610,738 citations from 2006–2024; post-2023 restriction yielded a 13% decline); X Developer Platform, API pricing tiers announced February 2023.

[39] Van Noorden, R. "GISAID in crisis: can the controversial COVID genome database survive?" Nature, d41586-023-01517-9, 2023; Enserink, M. "The 'invented persona' behind a key pandemic database." Science, 2023; re3data record r3d100010126, currently classifying GISAID as "restricted" access (registration required), re3data.org/repository/r3d100010126; GISAID, "Data availability, 21 March 2023," gisaid.org/statements-clarifications/data-availability/ (chronology of suspended access for scientists who posted Wuhan market origins analysis, March 20, 2023); Nextstrain, "Interruption to GISAID-based SARS-CoV-2 sequence analyses," November 6, 2025, nextstrain.org/blog/2025-11-06-gisaid-based-ncov-analyses (GISAID terminated regular Nextstrain flat-file feed effective October 1, 2025; Outbreak.info support ended January 2025; CoV-Spectrum updates disrupted); Think Global Health, "To Finish the Pandemic Agreement, WHO Needs a Trustworthy Viral Database," 2025, thinkglobalhealth.org/article/to-finish-the-pandemic-agreement-who-needs-a-trustworthy-viral-database; sequence-count milestone of 16.5 million SARS-CoV-2 genomes reached March 2024 (multiple academic citations including Macario, M. et al., "SARS-CoV-2 CoCoPUTs," Virus Evolution 11(1), 2025); BioNTech Project Lightspeed timeline (Şahin & Türeci decision January 24–25, 2020 following Lancet familial-cluster paper; eight vaccine candidates designed by January 26; project officially launched January 27, 2020; Pfizer joined March 17, 2020), per Stiftung Deutsches Technikmuseum Berlin "Project Lightspeed" exhibit and Leisinger, K. & Schroeder, D. "Project Lightspeed: A case study in research ethics and accelerated vaccine development," Research Ethics, 2024.

[40] CNKI (China National Knowledge Infrastructure) access restriction notification, March 2023, effective April 1, 2023 (under Measures of Data Cross-Border Transfer Assessment, in force September 1, 2022); Times Higher Education, "China block on foreign access to journal portal 'damages knowledge'," 2023; Ithaka S+R, "Reflecting on Restricted Access to a Chinese Research Lifeline," sr.ithaka.org; Center for Security and Emerging Technology (Georgetown), "U.S. Think Tank Reports Prompted Beijing to Put a Lid on Chinese Data," cset.georgetown.edu.

[41] CERN Council, "CERN Council decides to conclude cooperation with Russia and Belarus in 2024," home.cern/news/news/cern/cern-council-decides-conclude-cooperation-russia-and-belarus-2024, effective November 30, 2024 (Russia) and June 27, 2024 (Belarus); Nature, "CERN expels hundreds of Russian scientists," d41586-024-02982-6, 2024; Scientific American, "CERN Suspends Collaborations with Russia," 2024; Joint Institute for Nuclear Research (JINR) cooperation suspension per CERN announcement (March 2022).

[42] UK Biobank Research Analysis Platform (UKB-RAP) transition, 2023–2024; UK Biobank access fees and Research Analysis Platform documentation, ukbiobank.ac.uk/use-our-data/fees/; "Data access changes to UK Biobank stir unease in neuroscientists," The Transmitter, 2024, thetransmitter.org (researchers report annual costs approximately doubling under cloud-metered model); Bio-IT World, "UK Biobank Pivoting to Platform-Only Model For Big Data Sharing," June 2024; AWS and DNAnexus documented as cloud platform partners.

[43] Straumsheim, C. "Elsevier Makes Move Into Institutional Repositories With Acquisition of Bepress." Inside Higher Ed, August 3, 2017; SPARC, Elsevier acquisition of Bepress/Digital Commons, August 2017 (>500 institutions using Digital Commons at acquisition).

[44] Mendeley Desktop end-of-life, effective September 1, 2022, blog.mendeley.com; Mendeley, Wikipedia (2018 Mendeley update caused some users to lose PDFs and annotations, fixed for most users after several weeks).

[45] Matthews, D. "Do Academic Social Networks Share Academics' Interests?" Times Higher Education, April 2016 (launch of $99/year premium tier, December 2016); Fortney, K. & Gonder, J. "A Social Networking Site is Not an Open Access Repository." Office of Scholarly Communication, University of California, December 2015; Academia.edu subscription pricing surveyed January 2026.

[46] SPARC, "Developments in Publishers' Text and Data Mining (TDM) Policy," sparcopen.org/our-work/developments-in-tdm-policy/; Murray-Rust, P. "Content-Mining Elsevier's TDM," blogs.ch.cam.ac.uk/pmr, 2014; Elsevier Text and Data Mining policy, elsevier.com/about/policies-and-standards/text-and-data-mining; Springer Nature TDM API, datasolutions.springernature.com/products/tdm/; bulk-download incident documented in SPARC TDM policy review and contemporaneous reporting.

[47] Baker, M. "1,500 scientists lift the lid on reproducibility." Nature 533(7604):452–454, May 2016. doi:10.1038/533452a.

[48] Begley, C.G. & Ellis, L.M. "Drug development: Raise standards for preclinical cancer research." Nature 483(7391):531–533, March 28, 2012. doi:10.1038/483531a (Amgen replication of 53 landmark oncology papers; 6 of 53 = 11% replicated).

[49] Miyakawa, T. "No raw data, no science: another possible source of the reproducibility crisis." Molecular Brain 13(24), 2020. doi:10.1186/s13041-020-0552-2 (41 manuscripts asked to provide raw data during peer review; 21 withdrawn, 19 of remaining 20 rejected for insufficient raw data; >97% did not supply raw data).

[50] Stern, A.M., Casadevall, A., Steen, R.G. & Fang, F.C. "Financial costs and personal consequences of research misconduct resulting in retracted publications." eLife 3:e02956, 2014. doi:10.7554/eLife.02956 (median attributable cost $239,381 per retracted article; mean $392,582; NIH-only subset mean $425,072).

[51] Trisovic, A., Lau, M.K., Pasquier, T. & Crosas, M. "A large-scale study on research code quality and execution." Scientific Data 9:60, 2022. doi:10.1038/s41597-022-01143-6 (over 9,000 unique R files from ~2,000 Harvard Dataverse replication datasets, 2010–2020; 74% fail on initial execution, 56% still fail after automated cleaning).

[52] Pew Research Center. "When Online Content Disappears: Link Rot and Digital Decay on Government, News and Other Webpages." Pew Research Center Data Labs, May 17, 2024, pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/ (25% of webpages from 2013–2023 no longer accessible as of October 2023; 38% of pages from the 2013 snapshot gone by 2023).

[53] Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K. & Tobin, R. "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot." PLOS ONE 9(12):e0115253, 2014. doi:10.1371/journal.pone.0115253 (over 3.5 million articles across arXiv, Elsevier, and PubMed Central, 1997–2012; over 1 million web-resource references; one in five STM articles exhibit reference rot; seven in ten articles containing web references are affected); Jones, S.M., Van de Sompel, H., Shankar, H., Klein, M., Tobin, R. & Grover, C. "Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content." PLOS ONE 11(12):e0167475, 2016. doi:10.1371/journal.pone.0167475 (content-drift complement: >75% of resolving URI references have drifted from originally cited content).

[54] Zittrain, J., Albert, K. & Lessig, L. "Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations." Harvard Law Review Forum 127:176–199, March 2014, harvardlawreview.org/forum/vol-127/perma-scoping-and-addressing-the-problem-of-link-and-reference-rot-in-legal-citations/ (sample drawn from Harvard Law Review [1999–2012], Harvard Journal of Law & Technology [1996–2012], and Harvard Human Rights Journal [1997–2012]; more than 70% of URLs across the aggregate sample no longer resolve to originally cited content; 50% of URLs in U.S. Supreme Court opinions similarly affected).

[55] Piwowar, H.A. & Vision, T.J. "Data Reuse and the Open Data Citation Advantage." PeerJ 1:e175, 2013. doi:10.7717/peerj.175 (multivariate regression on 10,555 gene-expression microarray studies; papers with publicly available data received 9% more citations, 95% CI 5–13%, controlled for journal, author, and institutional history; estimated reuse trajectory of 40 reuse papers by year 2, 100 by year 4, and more than 150 by year 5 per 100 datasets deposited in year 0); Drachen, T.M., Ellegaard, O., Larsen, A.V. & Dorch, S.B.F. "Sharing Data Increases Citations." LIBER Quarterly 26(2):67–82, 2016. doi:10.18352/lq.10149 (astrophysics case study, 2000–2014: ~25% overall citation advantage for papers linked to data, rising to ~40% in 2009–2014).

[56] Agh, N., Bossier, P., Abatzopoulos, T.J., Beardmore, J.A., Van Stappen, G., Mohammadyari, A., Rahimian, H. & Sorgeloos, P. "Morphometric and Preliminary Genetic Characteristics of Artemia Populations from Iran." International Review of Hydrobiology 94(2):194–207, 2009. doi:10.1002/iroh.200811077 (six Iranian Artemia populations across 19 morphometric variables; 85.9% correct classification to source population; 100% separation of bisexual A. urmiana from parthenogenetic populations).

[57] CORDIS, European Commission. "Artemia biodiversity: current global resources and their sustainable exploitation" (ICA4-CT-2001-10020), FP5 INCO-DEV Concerted Action, 1 January 2002 – 31 December 2004, cordis.europa.eu/project/id/ICA4-CT-2001-10020 (coordinated by Ghent University; 14 partner organizations; total EU funding €800,000).

[58] AghaKouchak, A., Norouzi, H., Madani, K., Mirchi, A., Azarderakhsh, M., Nazemi, A., Nasrollahi, N., Farahmand, A., Mehran, A. & Hasanzadeh, E. "Aral Sea syndrome desiccates Lake Urmia: Call for action." Journal of Great Lakes Research 41(1):307–311, 2015. doi:10.1016/j.jglr.2014.12.007 (satellite analysis: Lake Urmia surface area decreased by ~88% over the preceding decades through the mid-2010s, driven primarily by over-exploitation of upstream water inputs rather than drought alone).

[59] Colavizza, G., Hrynaszkiewicz, I., Staden, I., Whitaker, K. & McGillivray, B. "The citation advantage of linking publications to research data." PLOS ONE 15(4):e0230416, 2020. doi:10.1371/journal.pone.0230416 (531,889 PLOS and BMC articles; papers with data-availability statements linking to data in a repository received up to 25.36% ±1.07% more citations than those without).

[60] Asem, A., Eimanifar, A., Van Stappen, G. & Sun, S.-C. "The impact of one-decade ecological disturbance on genetic changes: a study on the brine shrimp Artemia urmiana from Urmia Lake, Iran." PeerJ 7:e7190, 2019. doi:10.7717/peerj.7190 (documents >90% A. urmiana population loss between 1994 rainy period and 2004 drought period, with accompanying decrease in ITS1 genetic diversity; Urmia Lake surface area had declined ~80% over the two preceding decades).

[61] National Center for Science and Engineering Statistics (NCSES), Higher Education Research and Development Survey: Fiscal Year 2023, NSF 25-313, National Science Foundation, 2025, ncses.nsf.gov/pubs/nsf25313 (census of 914 universities and colleges that granted a bachelor's degree or higher and expended at least $150,000 in R&D in FY2022; total higher-education R&D expenditure $108.8 billion in FY2023, an 11.2% increase over FY2022).

[62] National Center for Science and Engineering Statistics (NCSES), Academic Institution Profiles, ncsesdata.nsf.gov/profiles/site (per-institution R&D, personnel, and degree data across U.S. universities; used here as the underlying data source from which the mid-sized R1 typology of $200M annual R&D, 1,000–1,500 tenure-track faculty, 250–500 extramurally funded, and ~3,000 peer-reviewed publications per year is constructed).

[63] National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), "Funding Trends & Support of Guiding Principles: Human Subjects Research," niddk.nih.gov/research-funding/funded-grants-grant-history/funding-trends-support-guiding-principles ("The percentage of NIDDK funding supporting human subjects research for all NIDDK extramural research awards rose from about 33% from FY 2011 to about 40% in FY 2014 and has remained at about 40% through FY 2020"); NIDDK, "Funding Trends & Support of Core Values," niddk.nih.gov/research-funding/funded-grants-grant-history/funding-trends-support-core-values (more than 17,600 awards and over $8.4 billion invested in human-subjects research FY2014–FY2023, consistent with sustained ~40% portfolio share through FY2023).

[64] United States Attorney's Office, District of Massachusetts, "Partners Healthcare and Brigham and Women's Hospital to Pay $10 Million to Resolve Allegations They Defrauded NIH in Connection with Cardiac Stem Cell Research," press release, April 27, 2017; Harvard Medical School / Brigham and Women's Hospital investigative committee report, October 2018, recommending 31 Anversa-laboratory papers for retraction (Joseph, A. & Begley, S. "Harvard and the Brigham call for more than 30 retractions of cardiac stem cell research." STAT News, October 14, 2018).

[65] United States Department of Justice, "Duke University Agrees to Pay U.S. $112.5 Million to Settle False Claims Act Allegations Related to Scientific Research Misconduct," press release, March 25, 2019, justice.gov/archives/opa/pr/duke-university-agrees-pay-us-1125-million-settle-false-claims-act-allegations-related (settlement of qui tam suit brought by whistleblower Joseph Thomas; research-fraud allegations centered on Erin Potts-Kant's falsified pulmonary-research data; 17 related retractions).

[66] United States Department of Justice, "Dana-Farber Cancer Institute Agrees to Pay $15M to Settle Fraud Allegations Related to Scientific Research Grants," press release, December 16, 2025, justice.gov/opa/pr/dana-farber-cancer-institute-agrees-pay-15m-settle-fraud-allegations-related-scientific (resolves False Claims Act allegations 2014–2024; Dana-Farber used funds from six NIH grants to conduct research producing 14 publications with misrepresented or duplicated images and data; additional implied-certification exposure from a second researcher who obtained four NIH grants after submitting applications referencing a publication with undisclosed image manipulation).

[67] National Institutes of Health, "Final NIH Policy for Data Management and Sharing," Notice NOT-OD-21-013, October 29, 2020, effective January 25, 2023, grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html (applies to competing grant applications submitted on or after January 25, 2023; non-compliance may be taken into account by NIH for future funding decisions and may trigger special terms, award termination, or downstream funding consequences); NIH, "Reminder: Reporting Data Management and Sharing (DMS) Plan Activities in Research Performance Progress Reports (RPPRs) Submitted on or After October 1, 2024," Notice NOT-OD-24-175, grants.nih.gov/grants/guide/notice-files/NOT-OD-24-175.html (requires RPPRs to document status of data sharing, repository names, and unique identifiers); NIH, Notice NOT-OD-26-046, February 25, 2026, announcing the 2026 Pilot DMS Plan format — a structured Yes/No and tabular layout required for applications with due dates on or after May 25, 2026, enabling machine-actionable compliance review.

[68] Gates Foundation, "2025 Open Access Policy," effective January 1, 2025, openaccess.gatesfoundation.org/open-access-policy/2025-open-access-policy/ (policy expanded to cover all funded research and any underlying data; continuous compliance review with grantee/author notification on non-compliance; grant-number-based verification of publication compliance); Gates Foundation grant to OA.Works (since 2021) supporting compliance tracking and reporting tooling for open-access enforcement.

[69] European Commission, Horizon Europe Programme Guide, 2021 onward, ec.europa.eu/info/funding-tenders (FAIR data mandatory for all Horizon Europe projects with no opt-out provision; Grant Agreement Article 17 ties compliance to payments; non-compliance can trigger grant reduction or financial penalties).

[70] Wellcome Trust, "Complying with our open access policy," wellcome.org/research-funding/guidance/open-access-guidance/complying-with-our-open-access-policy (articles not made open access in compliance with the policy result in Wellcome declining to accept new grant applications from the researcher as lead applicant; in extreme cases, Wellcome may suspend funding to the organization).

[71] National Science Foundation, Proposal & Award Policies & Procedures Guide (PAPPG), Chapter II.D.2.i(ii), 2011 onward, nsf.gov/publications/pub_summ.jsp?ods_key=pappg (all proposals must include a Data Management Plan; proposals submitted without a DMP are returned without review).

[72] Office of Science and Technology Policy, "Ensuring Free, Immediate, and Equitable Access to Federally Funded Research" (Nelson Memo), August 25, 2022, bidenwhitehouse.archives.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-Access-Memo.pdf (directs all federal agencies to update public-access policies eliminating the 12-month embargo for publications and data; agency plans to be finalized by December 31, 2024 and in effect by end of 2025; covers more than $90 billion in annual federal research funding).

[73] U.S. Securities and Exchange Commission, Rule 17a-4, "Records to Be Preserved by Certain Exchange Members, Brokers, and Dealers," 17 CFR § 240.17a-4, law.cornell.edu/cfr/text/17/240.17a-4 (six-year retention minimum, with first two years readily accessible; electronic records preserved in Write-Once, Read-Many (WORM) format or, per the October 2022 amendments effective May 2023, via an audit-trail alternative maintaining a complete record of modifications).

[74] SEC, CFTC, and FINRA enforcement of recordkeeping requirements, 2021–2025, aggregate (more than $3.5 billion in combined civil penalties since 2021, including more than $2 billion in SEC off-channel-communications cases and over $600 million in SEC FY2024 recordkeeping cases alone; SEC Press Release 2024-98, "Twenty-Six Firms to Pay More Than $390 Million Combined to Settle SEC's Charges for Widespread Recordkeeping Failures," sec.gov/newsroom/press-releases/2024-98).

[75] HIPAA Security Rule, 45 CFR §§ 164.308 (administrative safeguards), 164.310 (physical safeguards), and 164.312 (technical safeguards, including encryption and contingency-plan data backup and disaster-recovery requirements); HHS Office for Civil Rights, 2025 HIPAA civil monetary penalty tiers (maximum $2,190,294 per violation per calendar year following annual cost-of-living adjustment); FFIEC IT Examination Handbook, Business Continuity Management booklet, 2019 revision.

[76] U.S. Food and Drug Administration, Warning Letter to Applied Therapeutics, Inc., CMS Case 696833, December 3, 2024, fda.gov/inspections-compliance-enforcement-and-criminal-investigations/warning-letters/applied-therapeutics-inc-696833-12032024 (third-party vendor deleted electronic data in Q-global®, including audit trails, for all 47 enrolled subjects on March 27, 2024 — two days after FDA preannounced its inspection; data for 11 subjects permanently lost); Applied Therapeutics Complete Response Letter for govorestat NDA in classic galactosemia, November 27, 2024; 21 CFR Part 11 (audit-trail requirements for electronic records submitted to FDA).

[77] Industry benchmarking of financial-services IT budgets: Deloitte, Global Financial Services IT Budget Benchmarking Survey, 2023–2024; Gartner, IT Key Metrics Data — Banking and Financial Services, 2022, 2025 editions; Ramp, "IT Budgeting Basics: Best Practices for 2026" (financial-services IT spend runs 4.4–11.4% of revenue, with roughly 10–20% of that IT budget allocated to cybersecurity and recovery planning combined; financial services, government, and healthcare consistently spend the most of any sectors on disaster recovery per Gartner).

[78] Witze, A. "75% of US scientists who answered Nature poll consider leaving." Nature, March 27, 2025, doi:10.1038/d41586-025-00938-y, nature.com/articles/d41586-025-00938-y (1,608 respondents on the question of leaving the United States; 75% said yes; 79.4% among postgraduate researchers; funding cuts and infrastructure deterioration cited as top drivers); European Commission, "Choose Europe for Science," launched by President von der Leyen at La Sorbonne, May 5, 2025, research-and-innovation.ec.europa.eu/news/all-research-and-innovation-news/choose-europe-science-eu-comes-together-attract-top-research-talent-2025-05-23_en (initial €500 million package for 2025–2027; EC news of January 30, 2026 confirms the package has since been expanded through over 100 national and regional initiatives to approximately €900 million; includes the Marie Skłodowska-Curie Actions "Choose Europe" pilot and new seven-year ERC super-grants; recruiting against United States institutions on research-freedom and infrastructure commitments).

[79] EDUCAUSE Center for Analysis and Research (ECAR), Supporting Faculty Research with Information Technology, June 2018 (11,141 faculty respondents from 131 U.S. institutions in the 2017 ECAR faculty survey; "A plurality of faculty (44%) agreed or strongly agreed that their institution made adequate technology support provisions for grant-funded projects"; 44% reported access to IT staff with specialized research-computing knowledge; central IT typically spends ~2% of annual resources on research-computing services), educause.edu/ecar/research-publications/supporting-faculty-research-with-information-technology/institutional-support-for-research; Ithaka S+R, US Faculty Survey 2021, July 14, 2022 (supporting context on research-support preferences and external-funding trends); EDUCAUSE, 2024 Core Data Service Almanac and 2025 Top 10 IT Issues (supporting context on institutional IT spending).

[80] Aggregate pricing benchmark for decentralized / self-hosted infrastructure suitable for protocol-level preservation at per-node deployments: Hetzner Online GmbH, Cloud-server pricing, CX23 plan (€3.49/month ≈ $46/year for 2 shared vCPU, 4 GB RAM, 40 GB storage, 20 TB egress), hetzner.com/cloud/pricing/; BitTorrent seedbox pricing surveyed across Seedbox.io, UltraSeedbox, Feral Hosting, Iseedfast, Evoseedbox, Ultra.cc, and Whatbox (low-end shared seedboxes from $3/month, mid-tier $5–10/month); Bluesky PDS self-host documentation, github.com/bluesky-social/pds (estimated total monthly self-host cost under $7 / $84 per year on a small VPS); Kleppmann, M. et al. "Local-First Software: You Own Your Data, in spite of the Cloud," arXiv:2402.03239, 2024; Masto.host managed Mastodon hosting ($6–30/month; masto.host/pricing); Element / Matrix.org and etke.cc community hosting ($5–20/month); Forgejo runtime-memory benchmarks — Forgejo, "Recommended Settings and Tips," forgejo.org/docs/next/admin/setup/recommendations/; YunoHost forum, "RAM usage for Gitea and Forgejo," forum.yunohost.org/t/ram-usage-for-gitea-and-forgejo/34643; OSSAlt, "Self-Host Forgejo: Open Source GitHub Alternative 2026," ossalt.com/guides/how-to-self-host-forgejo-github-alternative-2026 (small-instance idle resident memory observed at 95–150 MB; 1 vCPU / 1 GB RAM recommended floor for small teams); Pinata IPFS pinning (free tier + tiered plans from ~$0.15/GB/month; pinata.cloud/pricing). Aggregate range: approximately $46 to $360 per node per year for a from-scratch standalone deployment.

[81] Subramaniam, S., Berman, H.M., Bhikadiya, C., Bi, C., Chen, L., Di Costanzo, L. et al. "Impact of structural biology and the Protein Data Bank on US FDA new drug approvals of low molecular weight antineoplastic agents 2019–2023." Oncogene 43:2229–2243, June 2024. doi:10.1038/s41388-024-03077-2 (100% of the 34 new low-molecular-weight, protein-targeted antineoplastic agents approved by the U.S. FDA between 2019 and 2023 were enabled by open-access PDB biostructure data; >80% structure-guided design).

[82] NOAA National Centers for Environmental Information, "NCEI Archive: Growth and Change," ncei.noaa.gov/news/ncei-archive-growth-and-change (over 60 petabytes of holdings as of late 2023, with projected growth to approximately 400 petabytes by 2030; four primary locations — Asheville NC headquarters, Boulder CO, Silver Spring MD, Stennis Space Center MS); NOAA, "Latest on Hurricane Helene's impacts on NOAA's National Centers for Environmental Information," October 4, 2024, noaa.gov/news-release/latest-on-hurricane-helenes-impacts-on-noaas-national-centers-for-environmental-information (NCEI Asheville headquarters experienced significant disruption when Hurricane Helene struck September 26–27, 2024; NCEI confirmed all data holdings — including paper and film records — were safe; all NCEI staff were confirmed safe).

[83] Rodriguez, D.R., Arevalo, M., Dowler, P., Espinosa, J., McLean, B. & Willott, C. "Insights from a 30-Year International Partnership on Astronomical Archives." arXiv:2506.11888v1, 2025, arxiv.org/abs/2506.11888 (collaboration since the 1990s between Space Telescope Science Institute, European Space Astronomy Centre, and Canadian Astronomy Data Centre on Hubble and James Webb Space Telescope data sharing and accessibility; interoperates via International Virtual Observatory Alliance standards).

[84] Hamilton, D.G., Hong, K., Fraser, H., Rowhani-Farid, A., Fidler, F. & Page, M.J. "Prevalence and predictors of data and code sharing in the medical and health sciences: systematic review with meta-analysis of individual participant data." BMJ 382:e075767, 11 July 2023. doi:10.1136/bmj-2023-075767 (systematic review aggregating 105 meta-research studies covering 2,121,580 articles across 31 medical specialties; across 2016–2021, 8% of medical papers declared their data were publicly available (95% CI 5–11%) and 2% actually shared their data publicly (95% CI 1–3%); declared-sharing prevalence has increased over time but does not consistently correspond to actual sharing; compliance with mandatory data-sharing policies ranges 0–100% across journals).

[85] EDUCAUSE Core Data Service, FY 2022–23 collection; EDUCAUSE, 2024 CDS Interactive Almanac: IT Spending and Staffing, educause.edu/research-and-publications/research/analytics-services/it-spending-and-staffing-interactive-almanac (median central-IT expenditure $10.6 million across 320–400 reporting U.S. institutions, interquartile range $4.8M–$25.3M); EDUCAUSE, 2023 IT Organization, Staffing, and Financing and 2025 IT Organization, Staffing, and Financing reports; EDUCAUSE QuickPoll 2025 (IT-staff turnover and hiring-freeze vs. layoff response patterns during budget reductions).

[86] University of North Carolina, Information Technology Services, "ITS studies campus networking infrastructure costs," November 30, 2020, its.unc.edu/2020/11/30/networking-infrastructure-costs/ (cabling systems represent 61% of total campus network cost; network electronics are 16%; remaining 23% is other infrastructure; finding reflects labor and materials embedded in buildings); Hamilton, J. "Overall Data Center Costs," Mvdirona Perspectives, September 18, 2010, perspectives.mvdirona.com/2010/09/overall-data-center-costs/ (data-center mechanical and electrical infrastructure amortized over 10–15 year cycles; building shell amortization is longer, typically 27–40 years per tax-life rules); Schneider Electric / APC, "Determining Total Cost of Ownership for Data Centers and Network Room Infrastructure," white paper (component lifespans: UPS batteries 4 years, CRAC/CRAH units 12 years, switchgear 15 years, electrical distribution boards 25 years, generators 30 years).

[87] LYRASIS, DSpaceDirect hosting, lyrasis.org/dspace-direct/ (subscription tiers: Small $3,940/yr, Medium $5,780/yr, Large $8,670/yr at surveyed pricing); Elsevier / bepress Digital Commons hosted institutional-repository pricing ($10,000–$12,000/year range per surveyed library reports).

[88] Royal Society, "Costs of digital repositories," royalsociety.org/news-resources/projects/science-public-enterprise/digital-repositories/ (case-study totals: DSpace@MIT $260,000/year — $76,500 infrastructure + $183,500 staff for 2.75 FTE, 71% staff share; ePrints@Southampton £116,318/year — £111,318 staff + £5,000 infrastructure, 96% staff share; arXiv $810,000/year, ~$670,000 staff, 83% staff share; Dryad $350,000/year, ~$300,000 staff, 86% staff share; UK Data Archive £3.43M/year, £2.43M staff, 71% staff share; wwPDB ~$11–12M/year, ~$6–7M staff, ~58% staff share); Burns, C.S., Lana, A. & Budd, J.M. "Institutional Repositories: Exploration of Costs and Value." D-Lib Magazine 19(1/2), January/February 2013, dlib.org/dlib/january13/burns/01burns.html (literature review of repository establishment costs); Carr, L. & Harnad, S. "Keystroke economy: A study of the time and effort involved in self-archiving." University of Southampton ECS preprint, 2005 (early staffing and workflow analysis for ePrints).

[89] Amazon Web Services, S3 Glacier storage-class pricing, aws.amazon.com/s3/pricing/ (S3 Glacier Instant Retrieval ≈ $4 per TB-month ≈ $48 per TB-year; S3 Glacier Deep Archive ≈ $1 per TB-month ≈ $12 per TB-year); Google Cloud Storage, Archive storage-class pricing, cloud.google.com/storage/pricing (Archive storage ≈ $0.0012/GB/month ≈ $14.40 per TB-year, multi-region US/EU after the 2026 price adjustment from $0.0040/GB/month).

[90] CLOCKSS Archive, "Join CLOCKSS" and membership information, clockss.org/join-clockss/ (membership fees scaled to library materials budget for libraries and to journal/ebook revenue for publishers; surveyed supporter-fee tiers range approximately $550 to $18,350 per year; continuous operation since 2006; 300 supporting libraries in 14 countries).

[91] LOCKSS Program, "Global LOCKSS Network" membership, lockss.org/gln (GLN fees scaled across 13 tiers determined by library materials budget and 6 geographical zones; consortial and multi-year discounts available; surveyed range approximately $2,642–$13,222/year plus ~$700/year for per-node hardware; 30–40 TB recommended storage; continuous operation since 1999).

[92] HathiTrust Digital Library, "Cost Model & Annual Fees," hathitrust.org/join/cost-fees/ (three-tier cost model based on member's Total Library Expenditures reported via IPEDS/CARL/CAUL; annual fees range $6,600–$13,000 for the majority of members; two components: public-domain items and copyrighted items; continuous operation since 2008).

[93] Academic Preservation Trust (APTrust), "Services and Fees," aptrust.org/about/services-and-fees/ (annual membership dues $20,000; the first 10 TB of preserved content per member is included; additional capacity at $420 per TB per year, equivalent to $0.41016/GB/year).

[94] Portico, institutional membership and publisher fees, portico.org, accessed 2026 (scaled institutional membership based on library budget; surveyed range approximately $1,500 to $25,462 per year; 1,076 participating publishers and 1,288 supporting libraries per 2023 Year in Review).
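The per-unit rates quoted across the storage- and preservation-pricing entries above ([89], [93]) reduce to simple unit arithmetic; a minimal sanity-check sketch, using only the surveyed list prices from those entries (the helper names are illustrative, not from any vendor API):

```python
# Annualized storage-cost conversions behind the figures cited in [89] and [93].

def per_tb_year_from_tb_month(rate_tb_month: float) -> float:
    """Annualize a per-TB-month rate."""
    return rate_tb_month * 12

def per_tb_year_from_gb_month(rate_gb_month: float) -> float:
    """Annualize a per-GB-month rate, using the 1 TB = 1000 GB vendor convention."""
    return rate_gb_month * 1000 * 12

# [89] AWS S3 Glacier Instant Retrieval, ~$4/TB-month:
assert per_tb_year_from_tb_month(4) == 48            # ≈ $48 per TB-year
# [89] AWS S3 Glacier Deep Archive, ~$1/TB-month:
assert per_tb_year_from_tb_month(1) == 12            # ≈ $12 per TB-year
# [89] Google Cloud Archive, $0.0012/GB-month:
assert abs(per_tb_year_from_gb_month(0.0012) - 14.40) < 1e-9   # ≈ $14.40 per TB-year
# [93] APTrust additional capacity, $420/TB-year, quoted per GB using 1 TB = 1024 GB:
assert abs(420 / 1024 - 0.41016) < 1e-4              # ≈ $0.410/GB-year
```

The only wrinkle is the TB/GB convention: the cloud vendors price per decimal GB (1 TB = 1000 GB), while the APTrust per-GB figure in [93] follows from a binary conversion (1 TB = 1024 GB).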

[95] Schonfeld, R.C. "Why Is the Digital Preservation Network Disbanding?" Scholarly Kitchen, December 13, 2018, scholarlykitchen.sspnet.org/2018/12/13/digital-preservation-network-disband/ (peak membership 62, declining to 31 at shutdown; only 27 members ever deposited content; flat $20,000 annual fee; board passed wind-down resolution in early December 2018); Rosenthal, D. "The Demise Of The Digital Preservation Network," DSHR's Blog, April 2019, blog.dshr.org/2019/04/the-demise-of-digital-preservation.html (Table 1 categorical allocation of $7,001,321 total spend: Development $2,782,693 (39.7%), Node Operations $134,454 (1.9%), Marketing $795,136 (11.4%), Overhead $3,289,038 (47.0%); "Overhead and Marketing consumed almost 60% of the total spend"; "Node Operations" in DPN's ledger refers specifically to node-storage payments that began only with 2016 production launch); Molinaro, M., Pcolar, D. & Gore, E. Digital Preservation Network Final Report. February 2019, osf.io/3p9jq/ (DPN was a single-member LLC of Internet2 from January 2017; storage federated across five Replicating Nodes — APTrust, Chronopolis, HathiTrust, Stanford Digital Repository, Texas Digital Library — each operating as a free-standing preservation service that continued after DPN dissolution); APTrust, Chronopolis, HathiTrust, SDR, and TDL, "Former DPN Nodes Respond to DPN Shutdown," joint statement December 5, 2018, aptrust.org/2018/12/05/former-dpn-nodes-respond-to-dpn-shutdown/.

[96] LOCKSS Program, "MetaArchive Transformation," April 2024, lockss.org/news/metaarchive-transformation (Educopia ending its long-standing fiscal-sponsorship arrangement with MetaArchive; nodes continue to operate during transition); Wikipedia, "MetaArchive Cooperative," en.wikipedia.org/wiki/MetaArchive_Cooperative (formally dissolved March 31, 2025 due to increased costs from its fiscal host organization and insufficient operating reserves); MetaArchive Cooperative / Educopia Institute, "MetaArchive Cooperative Transformation — Frequently Asked Questions," March 2025, educopia.org/wp-content/uploads/2025/03/MetaArchive-Cooperative-Transformation-Frequently-Asked-Questions-MetaArchive.pdf; MetaArchive Cooperative, "Sunset Announcement," March 31, 2025, educopia.org/wp-content/uploads/2025/03/MetaArchive-Cooperative-Sunset-Announcement-MetaArchive.pdf (longest-running private LOCKSS network, founded 2004 under Educopia Institute fiscal sponsorship; sunset March 31, 2025 following Educopia's January 2025 shift to a one-FTE-per-community staffing requirement, member attrition, and operational reserve drop below policy threshold; member transitions executed via mailed hard-drive returns to three members, cache returns to two members, and continued storage in the temporary Stanford-operated Dandelion Archive bridge network for the remaining members; Educopia's announcement notes that issues with insufficient replications and the automated LOCKSS polling process were resolved through manual consolidation to a Stanford anchor node prior to sunset, and that "it was not possible to secure a permanent archival home for all of MetaArchive's materials within the sunset time frame.").

[97] Cohen, J.P. & Lo, H.Z. "Academic Torrents: A Community-Maintained Distributed Repository," arXiv:1603.04395, 2016, arxiv.org/abs/1603.04395; Academic Torrents, academictorrents.com (founded 2013 at UMass by Cohen and Lo; distributes more than 298 terabytes of research datasets across volunteer seeders at zero central infrastructure cost); lightweight BitTorrent daemons (rTorrent, Transmission) typically run in the 9–14 MB resident-memory range on Linux at idle/small-torrent-set loads.

[98] Dopmann, C., Marx, M., Federrath, H. & Tschorsch, F. "Operating Tor Relays at Universities: Experiences and Considerations," arXiv:2106.04277, 2021, arxiv.org/abs/2106.04277 (resource profile of a Tor relay — 512 MB RAM, 10–16 Mbps typical bandwidth envelope — and institutional-operation considerations; finds universities have historically contributed negligible relay capacity); Electronic Frontier Foundation, Tor University Challenge, launched August 2023, toruniversity.eff.org (campaign encouraging universities to operate high-bandwidth Tor relays; university relay count reached the 45-institution range during campaign reporting).

[99] Burford, M. "WebSeed: HTTP/FTP Seeding (GetRight style)," BitTorrent Enhancement Proposal 19, bittorrent.org/beps/bep_0019.html (allows any standard HTTP or FTP URL to act as a BitTorrent seed with no software modification on the server side).

[100] Fachschaft Informatik, TU Dortmund, "Matrix," fsinfo.cs.tu-dortmund.de/fsr/edv/matrix; Fachschaften.org, "Matrix Service," fachschaften.org/services/matrix/ (matrix.fachschaften.org and element.fachschaften.org operated by Fachschaft Informatik for university-wide messaging; servers hosted in TU Dortmund's data center; service free to members of German-speaking universities, registration via TU Dortmund email address).

[101] TU Dresden, Center for Information Services and High Performance Computing (ZIH), Matrix documentation, doc.matrix.tu-dresden.de ("The Dresden Matrix service has been stable since April 2020 for more than 18,000 users through targeted load balancing on synapse workers"); Element / Matrix.org Education Sector case study (TU Dresden listed as reference installation); operated on existing ZIH information-technology staff plus successive rounds of student assistants funded initially by ZIH and subsequently by the CIO.

[102] MIT Student Information Processing Board (SIPB), sipb.mit.edu (SIPB maintains MIT's instances of Mastodon at mastodon.mit.edu and of the Forgejo Git forge at forgejo.mit.edu, on existing server hardware, staffed by student volunteers at effectively zero cost to the Institute).

[103] Cohen, B. "Incentives Build Robustness in BitTorrent," 1st Workshop on Economics of Peer-to-Peer Systems, 2003 (tit-for-tat incentive structure; the distributed-hash-table coordination layer was added to BitTorrent later, as the Mainline DHT extension); Wang, L. & Kangasharju, J. "Measuring Large-Scale Distributed Systems: Case of BitTorrent Mainline DHT," IEEE P2P 2013 (measurement study, 2011–2012: Mainline DHT sustains 16–28 million concurrent nodes, intra-day churn ≥10 million); Zhang, X.-Z., Liu, J.-J. & Xu, Z.-W. "Tencent and Facebook Data Validate Metcalfe's Law," Journal of Computer Science and Technology 30(2), 2015.

[104] Beagrie, N. & Houghton, J.W. The Value and Impact of the European Bioinformatics Institute, Charles Beagrie Ltd / EMBL-EBI, February 2016 (direct efficiency impact estimated £1B–£5B/year; £920M/year in wider future research impact; £1B/year benefits to users and funders "equivalent to more than 20 times the direct operational cost of the institute"), embl.org/documents/wp-content/uploads/2021/09/EMBL-EBI_Impact_report-2016-summary.pdf; updated value-and-impact series 2021.

[105] Beagrie, N. & Houghton, J.W. The Value and Impact of the Archaeology Data Service: A Study and Methods for Enhancing Sustainability, Charles Beagrie Ltd / Jisc / ADS, 2013, archaeologydataservice.ac.uk/blog/the-value-and-impact-of-the-archaeology-data-service-final-report/ (research-efficiency gains estimated at least £13 million per annum, approximately five times the costs of ADS operation, data deposit, and use; £2.4M–£9.7M thirty-year net-present-value return from one year of ADS investment, equivalent to 2–8× ROI).

[106] Lateral Economics, National Collaborative Research Infrastructure (NCRIS) Spending and Economic Growth Report, commissioned by a group of NCRIS infrastructure organisations, June 2021 (direct benefit of investment in NCRIS calculated at above $7 return for every $1 invested, ROI of 7.5:1; by 2022–23 expected to support employment of an additional 1,750 scientific, technical, support, supply-chain, and industry staff).

[107] Hart, D.L. & Sinkovits, R.S. and XSEDE Value Analytics team, XSEDE Value Analytics: Final Report, National Science Foundation / XSEDE, 2022; follow-on analysis using accounting-ROI concepts in Scientometrics 128(6), 2023, doi:10.1007/s11192-022-04539-8 (NSF invested approximately $257.5 million ($257,465,523) in XSEDE across two funding rounds; estimated downstream value of research enabled by XSEDE $4.7 billion to $22.7 billion; accounting-concept ROI 1.87 conservative and 3.24 best-available).

[108] Apon, A., Ahalt, S., Dantas, V., Goasguen, S., Lumsdaine, A., Stahlberg, E. et al. "Assessing the Return on Investment in Research Computing," Proceedings of PEARC '22, 2022; Apon et al., "Research Computing on Campus: Application of a Production Function to the Value of Academic High-Performance Computing," Proceedings of PEARC '21, 2021, doi:10.1145/3437359.3465564; Apon et al., SN Computer Science 5:883, 2024, doi:10.1007/s42979-024-02888-0 (production-function modeling of campus HPC against HERD R&D expenditures: each $100,000 spent on research-computing salaries is associated with a $14.3 million increase in HERD R&D expenditures; each 100 TeraFLOPs of added capacity is associated with a $1.3 million increase).

[108a] Smith, P.M. The Value Proposition of Campus High-Performance Computing Facilities to Institutional Productivity: A Production Function Model. Ph.D. dissertation, Purdue University, 2022, docs.lib.purdue.edu/dissertations/AAI30506034/; Smith, P.M. "The Value Proposition of Campus High Performance Computing Facilities to Institutional Productivity: A Production Function Model." SN Computer Science 5:883, 2024, doi:10.1007/s42979-024-02888-0 (Purdue FY2020: 55% of $443.5 million in total research expenditures — $242 million — attributable to faculty using campus HPC resources operated by the Rosen Center for Advanced Computing, a 49-fold return on RCAC investment).

[109] RCSB Protein Data Bank, "Economic Impact," rcsb.org/pages/about-us/economic-impact ("The economic impact of RCSB Protein Data Bank services is estimated at more than $5.5 billion a year — 800 times the operating cost"; NSF DBI-2321666 plus DOE and NIH grants underwrite RCSB PDB core operations at approximately $6.1 million per year, confirmed by an 8-year federal commitment of $49.4 million through 2028).

[110] American Council on Education and Carnegie Foundation for the Advancement of Teaching, 2025 Carnegie Classifications of Institutions of Higher Education — Research Activity Designations, February 2025, carnegieclassifications.acenet.edu (2025 R1 threshold: $50 million in annual total research expenditures and 70 research doctorates awarded annually, determined over three-year rolling HERD averages; 187 U.S. institutions designated R1 in 2025, 28% more than under the 2022 methodology).

[111] Battelle Technology Partnership Practice, The Impact of Genomics on the U.S. Economy, United for Medical Research, June 2013 (federal investment in the Human Genome Project and subsequent genomics research totaled approximately $14.5 billion from 1988 through 2012, yielding $965 billion in cumulative economic output, $293 billion in personal income, and 4.3 million job-years of employment; Battelle's earlier 2011 report found a 141:1 return when HGP cost was computed in inflation-adjusted dollars of the period).

[112] U.S. Geological Survey, "Landsat's Economic Value Increases to $25.6 Billion in 2023," usgs.gov/news/featured-story/landsats-economic-value-increases-256-billion-2023; Straub, C.L., Koontz, S.R. & Loomis, J.B. Economic Valuation of Landsat Imagery, USGS Open-File Report 2019-1112, 2019; Wulder, M.A. et al. "Opening the archive: How free data has enabled the science and monitoring promise of Landsat," Remote Sensing of Environment 122:2–10, 2012, doi:10.1016/j.rse.2012.01.010 (2008 USGS open-access policy change: Landsat scene downloads increased from ~53/day to 5,775/day; 2023 estimated direct economic value $25.6 billion/year).

[113] Ly, N., Viana-Ferreira, C., Gouglas, D., Miller, M. et al. "The global return-on-investment of COVID-19 vaccines in the first year of the vaccination programme," medRxiv, September 2025, doi:10.1101/2025.09.02.25334932 (global economic value of COVID-19 vaccines US $5.2 trillion, 95% CI $4.1–$6.2 trillion; Pfizer-BioNTech-specific value >US $1.9 trillion; benefit–cost ratios 13.9× overall and 30.8× for Pfizer-BioNTech); GISAID Initiative, "SARS-CoV-2 genome-sharing timeline," gisaid.org (Zhang et al. deposit of SARS-CoV-2 reference genome 10–11 January 2020; Pfizer-BioNTech initiated vaccine design the same day).

[114] PwC EU Services, Cost-Benefit Analysis for FAIR Research Data, European Commission DG Research and Innovation, 2018, op.europa.eu/en/publication-detail/-/publication/d375368c-1a0a-11e9-8d04-01aa75ed71a1 ("€10.2 billion is just the minimum cost per year of not having FAIR research data for the EU and it will rise further over the years if we do not take action").

[115] Enserink, M. "Rotterdam Marketing Psychologist Resigns After University Investigates His Data." Science, June 25, 2012, science.org/content/article/rotterdam-marketing-psychologist-resigns-after-university-investigates-his-data (2012 Erasmus University investigation panel concluded "no confidence" in the scientific integrity of Smeesters' work; Smeesters told the committee the raw data for some experiments were lost when his home computer crashed, and described selective data exclusion as "nothing out of the ordinary" in his field and department); Oransky, I. "Final report in Smeesters case serves up seven retractions." Retraction Watch, March 19, 2014 (Erasmus University's 2014 final report formally concluded research misconduct across seven of Smeesters' papers).

[116] National Academies of Sciences, Engineering, and Medicine. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: National Academies Press, June 8, 2020. doi:10.17226/25639, nap.nationalacademies.org/catalog/25639 (examines economic factors for data acquisition, curation, preservation, accessioning, and deaccessioning; provides a framework for cost-effective life-cycle decision making; finds that the current research-funding regime is structurally misaligned with multi-decade data preservation obligations).

[117] National Security Presidential Memorandum 33 (NSPM-33), "United States Government-Supported Research and Development National Security Policy," January 14, 2021; National Science and Technology Council Subcommittee on Research Security, Guidance for Implementing National Security Presidential Memorandum 33 (NSPM-33) on National Security Strategy for United States Government-Supported Research and Development, Office of Science and Technology Policy, January 4, 2022 (operationalizes disclosure requirements for federally funded researchers across conflicts of interest, conflicts of commitment, foreign affiliations, and other-support reporting; the current enforcement posture is grounded in agency-level disclosure-requirement implementations under the Implementation Guidance, with conflict-of-interest disclosure as the operational center of research-integrity enforcement).

[118] EDUCAUSE, 2025 EDUCAUSE Top 10 IT Issues, educause.edu/research-and-publications/research/top-10-it-issues-technologies-and-trends/2025 (Issue #1: "The Data-Empowered Institution — using data, analytics, and AI to drive student success, enrollment, research funding, and operational efficiency"); the 2024 EDUCAUSE Top 10 IT Issues are available at the same URL with /2024 substituted (continuity of data/analytics/AI as the lead institutional concern across both years).

[119] NSF News, "Democratizing the future of AI R&D: NSF to launch National AI Research Resource pilot," January 24, 2024, nsf.gov/news/democratizing-future-ai-rd-nsf-launch-national-ai (NAIRR Pilot launched January 24, 2024 with NSF and 10 partner federal agencies plus 25 private sector, nonprofit, and philanthropic organizations, including industry partners Anthropic, Amazon Web Services, IBM, Meta, Intel, NVIDIA, OpenAI, and Microsoft); Computing Research Association GovAffairs blog, "NSF Launches Pilot of NAIRR Program to 'Democratize the Future of AI'," January 25, 2024, cra.org/govaffairs/blog/2024/01/nsf-launches-nairr-pilot/.

[120] NSF, "National Artificial Intelligence Research Resource," nsf.gov/focus-areas/ai/nairr (NAIRR Pilot expanded by 2026 to NSF + 13 partner federal agencies and 28 nongovernmental partners; four focus areas — NAIRR Open, NAIRR Secure, NAIRR Software, NAIRR Classroom — define current operational scope).

[121] National Artificial Intelligence Initiative Act of 2020, Division E of the William M. (Mac) Thornberry National Defense Authorization Act for Fiscal Year 2021, P.L. 116-283, January 1, 2021 (incorporated NAIRR Task Force authorization into broader National AI Initiative; signed into law by congressional override of presidential veto, providing the legal foundation under which NAIRR Pilot currently operates).

[122] Creating Resources for Every American To Experiment with Artificial Intelligence Act of 2025 (CREATE AI Act), H.R. 2385, 119th Congress, congress.gov/bill/119th-congress/house-bill/2385 (would codify a full-scale National AI Research Resource at NSF; previously introduced as S.2714 in the 118th Congress; pending as of April 2026).

[123] NSF, "National AI Research Institutes," nsf.gov/focus-areas/ai/institutes (29 NSF National AI Research Institutes funded across more than 500 collaborating institutions in the United States and internationally; institutes funded at approximately $20 million each over five years; lead agencies NSF and USDA-NIFA with co-funding from DOD, NIST, ED-IES, and others); FY2025 NITRD Supplement Appendix B documents a $72.3 million FY2025 budget across the AI Institutes program.

[124] DOE Office of Science, "Artificial Intelligence Initiative," science.osti.gov/Initiatives/AI; The White House, "Launching the Genesis Mission," Executive Order, November 24, 2025, whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/ (the EO itself does not allocate funds and is implemented "subject to the availability of appropriations"); DOE press release, "Energy Department Launches 'Genesis Mission' to Transform American Science and Innovation Through the AI Computing Revolution," November 24, 2025, energy.gov/articles/energy-department-launches-genesis-mission-transform-american-science-and-innovation; DOE, "Energy Department Advances Investments in AI for Science," December 2025 rollout; and HPCwire, "Here's What's Inside DOE's $320 Million Genesis Mission Investment," December 11, 2025 (over $320 million in DOE investments announced in December 2025 across 37 projects, including the American Science Cloud (AmSC, $40 million), the Transformational AI Models Consortium (ModCon, $30 million), 14 robotics and automated-laboratory projects, and foundational AI research awards).

[125] Department of Defense, Chief Digital and Artificial Intelligence Office (CDAO), ai.mil (established 2022 as DOD's principal hub for AI capability adoption and integration across the department; absorbs and operationalizes DARPA-developed AI capabilities for departmental deployment).

[126] DARPA, "AI Next Campaign," darpa.mil/research/programs/ai-next-campaign (since 2018, more than $2 billion invested through the AI Next campaign to advance AI for national security purposes); DARPA, "AI Forward," darpa.mil/research/programs/ai-forward (successor initiative focusing on trustworthiness for national-security AI systems).

[127] Networking and Information Technology R&D Subcommittee and National Artificial Intelligence Initiative Office, Supplement to the President's FY2025 Budget, Executive Office of the President, November 2024, nitrd.gov/pubs/FY2025-NITRD-NAIIO-Supplement.pdf (NITRD agencies' requested investment in nondefense AI R&D for FY2025: $3,316.1 million, a 6.5 percent increase over FY2024; AI accounts for 18 percent of total FY2025 NITRD funding).

[128] National Science Foundation, National Artificial Intelligence Research Institutes Solicitation, NSF 23-610, nsf.gov/funding/opportunities/national-artificial-intelligence-research-institutes/nsf23-610/solicitation (Institutes required to develop shared community infrastructure including data and software supporting reproducibility; intellectual-property and data-sharing terms apply across NSF and partner-agency funding contributions).

[129] American Association of University Professors, Annual Report on the Economic Status of the Profession 2024–25; CUPA-HR Faculty in Higher Education Survey 2024–25; Chronicle of Higher Education faculty salary database; NEA Faculty Salary Report 2025 (computer science / AI faculty salary data showing a modest premium over peer disciplines within academia, e.g., approximately $105,000–$120,000 assistant professor range vs. $80,000–$140,000 across peer fields); industry compensation data via levels.fyi, Glassdoor, and trade press reporting (industry AI roles for fresh PhD hires running approximately $200,000–$400,000 base plus bonuses, producing a two-to-three-fold compensation differential between academic and industry AI roles).

[130] EOSC Association, "The European Open Science Cloud," eosc.eu/eosc-about; European Commission, "European Open Science Cloud," digital-strategy.ec.europa.eu/en/policies/open-science-cloud (EOSC EU Node in operation since October 2024; EOSC Federation in build-up phase 2025–2026 with Federation Handbook defining interim governance, legal, and operational structures; fourteen new candidate national nodes joined the federation in 2025 expanding thematic and geographic coverage).

[131] European High Performance Computing Joint Undertaking (EuroHPC JU), "Five Years of the EuroHPC Joint Undertaking: Powering Science with World-class Advanced Supercomputing," December 17, 2025, eurohpc-ju.europa.eu; EuroHPC JU, "First release of the EuroHPC Federation Platform to streamline access to Europe's supercomputing resources," April 15, 2026, eurohpc-ju.europa.eu/first-release-eurohpc-federation-platform-streamline-access-europes-supercomputing-resources-2026-04-15_en (EuroHPC Federation Platform development began January 2025 under CSC–IT Center for Science leadership following December 2024 contract signing, with the first release on April 15, 2026 providing a unified single access point to EuroHPC JU systems with federated authentication and authorization; the AI Factories program operates around EuroHPC supercomputing facilities to support European AI ecosystem development, with 19 AI Factories and 13 Antennas operational across the EU as of late 2025).

[132] Chinese Academy of Sciences Computer Network Information Center, "China Science and Technology Cloud" (CSTCloud), cstcloud.net; "Research e-infrastructures for open science: The national example of CSTCloud in China," Data Intelligence 5(2):355, 2023, direct.mit.edu/dint/article/5/2/355/114794 (CSTCloud operated by CAS CNIC; eleven of twenty national scientific data centers managed by CAS, including the National Basic Science Data Center for cross-domain basic science data, the National Space Science Data Center, and the National Ecosystem Science Data Center; National Science and Technology Infrastructure framework certified by the Ministry of Finance and the Ministry of Science and Technology).