GovAI — Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies

Miles Brundage1*, Noemi Dreksler2, Aidan Homewood2, Sean McGregor1, Patricia Paskov3, Conrad Stosz4, Girish Sastry5, A. Feder Cooper1, George Balston1, Steven Adler6, Stephen Casper7, Markus Anderljung2, Grace Werner1, Sören Mindermann5, Vasilios Mavroudis8, Ben Bucknall9, Charlotte Stix10, Jonas Freund2, Lorenzo Pacchiardi11, José Hernández-Orallo11, Matteo Pistillo10, Michael Chen12, Chris Painter12, Dean W. Ball13, Cullen O’Keefe14, Gabriel Weil15, Ben Harack3, Graeme Finley5, Ryan Hassan16, Scott Emmons5, Charles Foster12, Anka Reuel17, Bri Treece18, Yoshua Bengio19, Daniel Reti20, Rishi Bommasani17, Cristian Trout21, Ali Shahin Shamsabadi22, Rajiv Dattani21, Adrian Weller11, Robert Trager3, Jaime Sevilla23, Lauren Wagner24, Lisa Soder25, Ketan Ramakrishnan26, Henry Papadatos27, Malcolm Murray27, Ryan Tovcimak28

1 AVERI; 2 GovAI; 3 Oxford Martin AI Governance Initiative; 4 Transluce; 5 Independent; 6 Clear-Eyed AI; 7 MIT CSAIL; 8 Alan Turing Institute; 9 University of Oxford; 10 Apollo Research; 11 University of Cambridge; 12 METR; 13 Foundation for American Innovation; 14 Institute for Law and AI; 15 Touro University Law Center; 16 New Science; 17 Stanford University; 18 Fathom; 19 Mila, Université de Montréal; 20 Exona Lab; 21 AI Underwriting Company; 22 Brave Software; 23 Epoch AI; 24 Abundance Institute; 25 interface; 26 Yale University; 27 SaferAI; 28 UL Solutions

January 2026

*Listed authors contributed significant writing, research, and/or review for one or more sections. The sections cover a wide range of empirical and normative topics, so with the exception of the corresponding author (Miles Brundage, miles.brundage@averi.org), inclusion as an author does not entail endorsement of all claims in the paper, nor does authorship imply an endorsement on the part of any individual’s organization.

arXiv:2601.11699v4 [cs.CY] 7 Feb 2026

Executive Summary

Key paper takeaways

Despite their rapidly growing importance, AI systems are subject to less rigorous third-party scrutiny than many of the other social and technological systems that we rely on daily, such as consumer products, corporate financial statements, and food supply chains. This gap is becoming increasingly untenable as AI becomes more capable and widely deployed, and it inhibits confident deployment of AI in high-stakes contexts.

Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential and require expert interpretation, and third parties are right to be skeptical of companies “checking their own homework” given the track record of that approach in other industries.

We outline a vision for frontier AI auditing, which we define as rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.

Frontier AI audits should not be limited to a company’s publicly deployed products, but should instead consider the full range of organization-level safety and security risks, including internal deployment of AI systems, information security practices, and safety decision-making processes.

We describe four AI Assurance Levels (AALs), the higher levels of which provide greater confidence in audit findings. We recommend AAL-1 as a baseline for frontier AI generally, and AAL-2 as a near-term goal for the most advanced subset of frontier AI developers.

Achieving the vision we outline will require (1) ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the industry; (2) growing the ecosystem of audit providers at a rapid pace without compromising quality; (3) accelerating adoption of frontier AI auditing by clarifying and strengthening incentives; and (4) achieving technical readiness for high AI Assurance Levels so they can be applied when needed.

Frontier AI auditing motivations

Artificial intelligence (AI) is rapidly becoming critical societal infrastructure. Every day, AI systems inform decisions that affect billions of people. Increasingly, they also make consequential decisions autonomously. Although these technologies hold incredible promise, the pace of development and deployment has outpaced the creation of institutions that ensure AI works safely and as advertised. This institutional gap is especially important for the most capable (“frontier”) systems (general-purpose AI models and systems whose performance is no more than a year behind the state-of-the-art), which many experts expect to exceed human performance across most tasks within the coming years.

Already, developers of frontier AI systems need to prevent harmful system failures (e.g., outputting false medical information or buggy code), weaponization by malicious parties (e.g., to carry out cyberattacks), and theft of or tampering with sensitive data. The magnitude of risks that need to be managed is growing rapidly.

AI users, policymakers, investors, and insurers need reliable ways to verify that promised technical safeguards exist and to detect when they do not. This is challenging because the technology is complex, fast-moving, and often proprietary. Public transparency alone cannot solve this problem since many key details are, and often should remain, confidential and require expert judgment to interpret. Many industries outside of AI already address similar challenges through independent auditors who review sensitive, non-public information and publish trustworthy conclusions that outsiders can rely on.
We argue that similar practices are needed in the AI industry: broad, sustainable adoption of AI over time requires a solid foundation of trust built on credible scrutiny by independent experts. Toward this end, we propose institutions designed to give stakeholders, including those who are uncertain about or even strongly skeptical of frontier AI companies, justified confidence that this critical technology is being developed safely and securely.

Specifically, we describe and advocate for frontier AI auditing: rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information. An ecosystem of private sector frontier AI auditors (both for-profit and non-profit) would enable widespread confidence that frontier AI systems can be adopted broadly and would avoid reliance on companies “grading their own homework,” an approach with a checkered track record in many industries. It would also avoid relying entirely on governments to have the technical expertise, capacity, and agility to ensure high standards for frontier AI safety and security. If well-executed and scaled, frontier AI auditing would improve safety and security outcomes for users of AI systems and other affected parties, create a system to learn and update standards based on real-world outcomes, and enable more confident investment in and deployment of frontier AI, especially in high-stakes sectors of the economy.

Summary of the proposal

Drawing on our analysis of current practices in AI and lessons from other industries with more mature assurance regimes, we recommend eight interlinked design principles for a long-term vision for frontier AI auditing. This vision is deliberately ambitious to match the rising stakes as frontier AI capabilities advance:

Scope of risks: Comprehensive coverage of four key risk categories. Frontier AI auditing should focus on four risk categories: risks from (1) intentional misuse of frontier AI systems (e.g., for cyberattacks); (2) unintended frontier AI system behavior (e.g., errors harming the user, their property, or third parties due to pursuing the wrong goal or having an unreliable performance profile); (3) information security (e.g., theft of an AI model or user data); and (4) emergent social phenomena (e.g., addiction to AI or facilitation of self-harm). For each category of risks, auditors should (a) verify company claims and (b) evaluate the company’s systems and practices against its stated safety and security policies, applicable regulations, and industry best practices.

Organizational perspective: Auditing companies’ safety and security practices as a whole, not just individual models and systems. Auditors should use an organization-level perspective to avoid abstraction errors (i.e., forming the wrong conclusion by treating a partial or simplified unit of analysis, such as evaluating a specific component in isolation, as if it were sufficient to assess overall system and organizational risk). Risk does not come from AI models alone; it emerges from the interaction of three overarching components: digital systems, computing hardware, and governance practices, and harm can arise even when a model is never deployed in external-facing systems. Rigorous, but isolated, model and system evaluations are therefore insufficient to evaluate all safety and security claims on their own. And while individual audits may focus on particular domains depending on their goals, the ecosystem as a whole should ensure comprehensive coverage across all three components in assessing safety and security claims.

Figure 1: Four AI Assurance Levels (AALs) for different frontier AI audits.

AAL-1 (Limited assurance): A time-bounded assessment of a particular AI system (typically a few weeks) using API access to multiple model versions and system settings, as well as a limited amount of additional non-public information focused on the audited system and related internal decisions.

AAL-2 (Moderate assurance): A more extensive assessment of one or more systems as well as company practices more broadly, spanning at least months and enriched with gray-box system access, extensive internal documentation, and staff interviews across functions.

AAL-3 (High assurance): Ongoing oversight (multiyear engagement for the lead auditor, with many subcontractors contributing throughout) with white-box access, more extensive continuous monitoring, and authority to examine any area of concern.

AAL-4 (Very high assurance): Continuous verification designed to detect active deception attempts, operating with a full understanding of the company’s systems, computing hardware, and governance, and providing “treaty-grade” confirmation of the company’s risk profile.

Levels of assurance: A framework for calibrating and communicating confidence in audit conclusions. Not all audits provide the same level of certainty, and stakeholders need to understand these differences. We propose AI Assurance Levels (AALs) as a means of clarifying what kind of assurance particular frontier AI audits provide (Figure 1). At lower levels, auditors and other stakeholders rely more heavily on information provided by the company and can primarily speak to a particular system’s properties. At higher levels, auditors take fewer assumptions for granted, and assess the full range of relevant company systems, organizational processes, and risks. At the highest level, auditors can rule out the possibility of materially significant deception by the auditee. Determining the appropriate AAL for different contexts and purposes is complex, but we recommend AAL-1 (the peak of current practices in AI) as a starting point for frontier AI generally, and AAL-2 as a near-term goal for the companies closest to the state-of-the-art. AAL-2 involves greater access to non-public information, less reliance on companies’ statements, and a more holistic assessment of company-level risks. The two highest assurance levels (AAL-3 and AAL-4) are not yet technically and organizationally feasible, but we outline research directions to change this.

Access: Deep enough to assure auditors and other stakeholders, secure enough to reassure auditees. Frontier AI auditors should receive deep, secure access to non-public information of various kinds — including model internals, training processes, compute allocation, governance records, and staff interviews — proportional to the audit’s scope and the level of assurance being sought for the audit. Access arrangements should protect intellectual property and security-sensitive information using mechanisms imported from other domains (e.g., sharing certain information with a subset of the auditing team on-site under a restrictive nondisclosure agreement) and newly-developed techniques (e.g., AI-powered summarization or analyses of information that is too sensitive to be directly shared).

Continuous monitoring: Living assessments, not stale PDFs. AI systems change constantly, including through adjustments to the underlying model(s), surrounding software, and shifts in user behavior. An audit conclusion that was accurate at the time of the assessment may become misleading in some respects within days or weeks. Audit findings should therefore carry explicit assumptions and validity conditions, and should be automatically deprecated when key underlying assumptions no longer hold. A mature auditing ecosystem will combine periodic deep assessments of slower-moving elements (e.g., governance, safety culture) with event-triggered reviews of major changes (e.g., new releases, serious incidents) and continuous automated monitoring of fast-changing surfaces (e.g., API behavior, configuration drift), enabling timely detection of changes that could invalidate prior conclusions.
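As an illustration of findings that carry validity conditions and are deprecated automatically, the following is a minimal Python sketch. It is not a method described in this paper: the names `ValidityCondition`, `AuditFinding`, and `revalidate`, the dictionary-based system state, and the example keys `model_version` and `safety_filter` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical snapshot of the audited system's observable state
# (e.g., collected by automated monitoring). The keys are illustrative.
SystemState = Dict[str, object]


@dataclass
class ValidityCondition:
    """An explicit assumption under which an audit finding was reached."""
    description: str
    holds: Callable[[SystemState], bool]


@dataclass
class AuditFinding:
    """An audit conclusion plus the conditions it depends on."""
    claim: str
    conditions: List[ValidityCondition]
    deprecated: bool = False
    violated: List[str] = field(default_factory=list)

    def revalidate(self, state: SystemState) -> bool:
        """Check every condition against the current state.

        Returns True only if the finding has never been deprecated.
        Deprecation is sticky: once a condition fails, the finding stays
        deprecated until an auditor issues a new finding.
        """
        self.violated = [
            c.description for c in self.conditions if not c.holds(state)
        ]
        if self.violated:
            self.deprecated = True
        return not self.deprecated


# Illustrative finding with two assumed validity conditions.
finding = AuditFinding(
    claim="Safeguards blocked the tested misuse prompts",
    conditions=[
        ValidityCondition(
            "model version unchanged",
            lambda s: s.get("model_version") == "v1",
        ),
        ValidityCondition(
            "safety filter enabled",
            lambda s: s.get("safety_filter") is True,
        ),
    ],
)
```

In this sketch, a monitoring job would call `finding.revalidate(current_state)` after each observed change. Making deprecation sticky is a design choice: a finding invalidated by a configuration change is not silently restored if the configuration later reverts, since the intervening period was not covered by the original assessment.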

Independent experts: Trustworthy results through rigorous independence safeguards and deep expertise. Auditors must be genuinely independent third parties, free from commercial or political influence, and have deep expertise across AI evaluation, safety, security, and governance. Safeguarding independence requires mandatory disclosure of financial relationships, standardized terms of engagement that prevent companies from shopping for favorable auditors, and cooling-off periods when moving, in both directions, between industry and audit roles. Alternative payment models that reduce auditor dependence on auditees should also be urgently explored. Where single auditing organizations lack sufficient expertise, subcontracting and consortia models can enable the necessary breadth across AI evaluation, safety, security, and governance.

Rigor: Processes that are methodologically rigorous, traceable, and adaptive. Audits should follow a standardized process while giving auditors the autonomy to flexibly determine specific methods and adjust scope as issues emerge. Auditors should be able to define evaluation metrics and criteria rather than simply validating companies’ preselected approaches. Wherever feasible, audit procedures should be automated, transparent, and reproducible to support consistent application across engagements and enable continuous monitoring as systems evolve. Auditors need to safeguard evaluation construct and ecological validity, and audit criteria should be protected against gaming. Finally, audits should incorporate procedural fairness, giving companies structured opportunities to correct factual errors while preventing undue influence on conclusions.

Clarity: Clear communication of audit results. Stakeholders must be able to understand the audit results. These should be communicated in audit reports with a standardized structure, covering the audit’s scope, level of assurance, conclusions, reasoning, and recommendations. Results should be communicated appropriately to different stakeholders: to protect sensitive information, auditors and companies can publish summarized or redacted versions for external stakeholders while sharing full, unredacted audit reports with boards, company executives, and, in some cases, regulatory bodies.

Challenges and next steps

Our long-term vision will require concrete efforts by several categories of stakeholders to both achieve and maintain. The most urgent challenges are:

Ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the AI industry.

Growing the ecosystem of audit providers at a rapid pace without compromising quality.

Accelerating adoption of frontier AI auditing by clarifying and strengthening incentives.

Achieving technical readiness for high AI Assurance Levels so they can be applied when needed.

These challenges are substantial but not unprecedented. Companies routinely share sensitive information with financial auditors, potential acquirers, penetration testers, and consumer product testing laboratories under carefully controlled terms. We believe similar practices for AI safety and security are both achievable and urgently needed. For each of the challenges we describe, we recommend specific next steps:

Figure 2: Recommendations for next steps across four challenges in frontier AI auditing (ensuring high quality standards, growing the ecosystem, accelerating adoption, and achieving technical readiness).

1. Fund independent audit verification. AI companies, philanthropists, investors, and insurers should fund analysis of the quantity and quality of audits and auditors, and make these assessments available to the public.

2. Create “auditor of auditors” body. Policymakers should implement a PCAOB-style non-profit “auditor of auditors” that has legitimacy through final government approval of its standards, the authority to hold auditors accountable through revoking accreditation or other means, and the ability to innovate at the pace of the private sector.

3. Establish auditor accreditation program. The AI evaluation ecosystem should establish a Frontier AI Auditor Accreditation Program with tiered certifications and specialty endorsements, as well as meaningful accountability mechanisms.

4. Implement targeted safe harbors. Policymakers and developers should implement targeted safe harbors that protect good-faith safety research and auditing while avoiding a liability gap, and that are conditional on auditor compliance with established best practices.

5. Clarify insurance coverage for AI risks. National governments should quickly resolve outstanding and near-term requests from insurers regarding exclusions one way or the other, and in government procurement contexts, they should specify that frontier AI companies need explicit coverage of AI-related risks (whether through a specialized or general policy).

6. Embed frontier AI auditing in public procurement. Policymakers should incorporate frontier AI auditing requirements into procurement processes, with particularly strong requirements for systems that will be deployed in high-stakes domains such as health and defense.

7. Invest in auditability R&D portfolio. Philanthropists, governments, and frontier AI companies should invest in an ambitious “Auditability R&D and Pilots” portfolio aimed at making AAL-3 and AAL-4 technically feasible and cost-effective.

8. Pilot advanced (AAL-3/4) auditing methods. Companies closest to the state-of-the-art should work with auditors, researchers, governments, and other stakeholders to conduct early pilots of AAL-3 and later AAL-4 auditing in order to accelerate the maturity of relevant technologies and processes.

Keeping up with the rapid pace of AI progress and deployment requires quickly importing best practices from more mature industries and immediate investment in auditing pilots, technical research, and policy research. Moving with urgency is essential if frontier AI auditing is to reach maturation and scale alongside AI development.

3.1 Improving safety and security outcomes

Internal evaluations of frontier AI systems tend to be insufficient along two distinct but reinforcing dimensions: (1) limits in frontier AI developers’ abilities to fully understand, anticipate, and characterize the external risks of systems they develop, and (2) misaligned incentives. Auditing can directly address these issues, and the resulting safety and security risks they bring about, by introducing perspectives that challenge internal company narratives, encouraging better internal practices for system assessment, and sharing critical advancements in AI safety and security knowledge across organizations.

First, relative to self-assessment by AI developers, external auditing provides fresh perspectives, which can offer healthy skepticism (i.e., guard against groupthink; see footnote 5), while also expanding the range of expertise brought to bear on development and deployment decisions. There is already evidence that third-party assessment can surface safety and security issues that developers subsequently remedy [37]. For example, the UK AI Security Institute (AISI) and US Center for AI Standards and Innovation (CAISI)’s pre-deployment testing efforts [38] identified safety issues that developers then addressed before release [39, 40]. Developers have also noted the value of external review in strengthening internal evaluation processes. System cards documenting behaviors and risks frequently reference third-party benchmarks and findings [41], sometimes produced with non-public information [42, 43]. Anticipating independent review may encourage investing in more robust mitigations earlier in development, before potential risks translate to concrete safety and security failures.

Auditing also mitigates a distinct institutional failure of self-assessment: potential misalignment between deployment incentives and judgments about sufficient safety precautions.
Frontier AI developers are simultaneously optimizing for capabilities, speed, and market position, while contending with how to determine the conditions under which their own systems are too risky to release or scale. This creates a structural conflict of interest, including internal pressure on safety teams to narrow scope or provide premature sign-off to meet deployment timelines (e.g., [44]). Independent auditing can separate evaluation and verification of safety properties from commercial incentives.

Footnote 5: Organizational psychology research documents “groupthink” as a pervasive risk in cohesive teams (e.g., self-censorship of doubts and collective rationalization of warnings). These dynamics can render the possibility of failure unthinkable or at least unspeakable [36]. Independent third-party auditors provide a structural countervailing force against these tendencies.

Beyond individual firms, external auditing enables learning at the level of the ecosystem, rather than just the level of individual developers. Without shared assessment, safety practices remain difficult to compare across organizations, making systemic risk hard to detect until failures occur. Auditors working with multiple companies can identify patterns, disseminate best practices, and share effective mitigations between developers with different levels of maturity. This affords the ability to make direct comparisons across developers, for example, allowing insights from state-of-the-art frontier systems to inform evaluations and safeguards for less capable models that may later encounter similar risks. Notably, while some of the benefits above can be achieved even if only some companies participate, wide participation is important in order to capture the full benefits.
Having more participating companies helps broaden the amount of experience that others can learn from, and wide participation can discourage companies from cutting corners in order to gain a short-term advantage at the expense of the larger industry and society [45]. Even if auditing is made as efficient as possible through technical and process innovations, it will always have some costs, so there is a risk that selective participation will disadvantage responsible developers who incur those costs, while exposing the public to systemic risks from the industry’s weakest links.

3.2 Enabling confident investment and deployment

Frontier AI systems are unusually difficult to responsibly invest in and deploy because uncertainty, liability, and information asymmetries compound. Credible third-party auditing unlocks broader AI adoption by giving potential investors and deployers of AI systems better-founded confidence in safety and security claims. When credible third-party audits play a central role in the deployment ecosystem, enterprises and government agencies can rely on shared, independent assessments rather than attempting to evaluate frontier AI systems on their own. Audits provide a common reference point that enables adoption decisions to scale beyond a small number of technically sophisticated firms. This lowers the cost and complexity of due diligence, particularly for organizations that lag behind the frontier in terms of deep internal AI safety expertise. Auditing also plays a stabilizing role in legal and regulatory environments that are still in flux. Because standards of care for frontier AI deployment are unsettled, adopters face the risk that decisions made under uncertainty may later be judged negligent after harm occurs. Credible third-party audits help mitigate this risk (or at least bound the scope of it) by documenting that deployment decisions were made in accordance with recognized, independent assessment practices. This makes relevant aspects of reasonable care demonstrable ex ante rather than contestable only after the fact, reducing uncertainty for adopters and investors. As a consequence of more reliable information, developers that pass rigorous audits gain competitive advantages, as do downstream companies building on audited systems. Audit credentials can differentiate providers in procurement, particularly with governments and regulated industries. Without such mechanisms, frontier AI markets are prone to adverse selection: responsible developers bear higher internal safety costs, while less cautious actors can make similar claims at lower expense. 
Auditing allows safety, security, and governance quality to become more observable, enabling competition to reward genuinely higher standards rather than marketing alone. Over time, this creates incentives for wider participation, reinforcing auditing as a normal part of market entry rather than an exceptional burden, as it has done in other sectors [46].

Footnote 6: This is one of several reasons to focus particular governance attention on frontier AI, as discussed further in [35].

From frontier AI developers’ perspectives, rigorous third-party auditing provides concrete benefits: it can identify safety and security issues before they become costly incidents; build trust with enterprise customers and government agencies hesitant to adopt AI; provide legal clarity, potentially in the form of evidentiary support in court; and differentiate products in competitive procurement processes.

These effects extend to insurance and capital markets. Frontier AI risks are difficult to insure because they are novel, potentially catastrophic, and poorly characterized, leading some insurers to exclude AI-related harms altogether [47, 48]. Audits can help unlock two distinct insurance markets: (1) For frontier AI developers, audits provide the standardized, quantifiable risk data that insurers need to underwrite coverage, lowering the cost of capital and clarifying accountability in the event of harm. (2) For businesses building on frontier AI models, audits of the underlying models give insurers visibility into risks that would otherwise be opaque. This allows insurers to differentiate based on audit status (e.g., as shown by a recently proposed underwriting standard from AIUC [49]), offering better terms to businesses that choose audited models over unaudited alternatives. See Appendix B for more discussion of insurance.
As with the safety and security benefits described above, confidence in frontier AI investment and deployment will be greater if auditing is widely adopted rather than limited to a few participating firms. High-profile safety incidents, such as the Three Mile Island accident [50], can set back an entire industry [51] even if there are safer companies or products in the market. There is growing interest in AI-related risks among investors [52], and frontier AI auditing can help manage such risks.

3.3 Different audit requirements based on motivation

Although the motivations for frontier AI auditing share the common foundations of independence, varying levels of non-public access, and standardized frameworks for comparison, they place fundamentally different demands on what an audit must accomplish. In some settings, audits are primarily tools for reducing uncertainty and supporting private decision-making; in others, they are mechanisms for enabling credible commitments where trust, enforcement, or risk-sharing are limited [53, 54, 55] (see also Section 6.3). These roles cannot be served equally well by a single, undifferentiated notion of “an audit,” and treating these cases as requiring the same level of assurance would either hollow out audits where they must be strongest or make them overly burdensome where lighter-touch approaches would suffice. This combination of convergence and variation is why we present a single overall vision that includes multiple AI Assurance Levels (AALs); Section 5 describes how appropriate AALs can be selected in practice.

3.4 Government vs. private auditing

In principle, at least some of the positive outcomes described above are achievable with governments assuming an auditing role. Public agencies can and do play valuable roles in system evaluation, as illustrated by UK AISI and US CAISI’s pre-deployment testing efforts [38]. However, relying primarily on governments to conduct frontier AI audits faces structural limitations that are especially binding in this domain. Frontier AI systems evolve rapidly, require deep technical specialization, and often demand sustained access to non-public model details, internal processes, and proprietary data. Most governments face persistent challenges in building and retaining the requisite expertise at scale, adapting quickly to new model architectures and risk profiles, and matching the pace of innovation in the private sector. These constraints are not unique to AI, but they are particularly acute given the speed and complexity of frontier model development. At the same time, a purely private auditing ecosystem without public involvement would be inadequate. Governments have a critical role to play in providing oversight, setting baseline standards, and ensuring democratic accountability. In practice, this includes defining minimum requirements for auditor independence and competence, accrediting or supervising auditing organizations, and enforcing consequences when audits are negligent or misleading. This division of labor mirrors established practice in other domains, such as financial auditing, where private auditors perform evaluations while public authorities set the rules and provide backstop enforcement. In the context of frontier AI, such oversight is essential to ensure that audits retain substantive value rather than devolving into compliance theater. A largely private-sector auditing regime also offers an additional governance advantage: it limits the concentration of power over AI oversight in any single institutional actor. 
Governments will inevitably play a central role in AI governance through regulation, enforcement, and national security policy. Assigning primary responsibility for auditing to the private sector helps distribute governance functions across institutions with different incentives, expertise, and failure modes. Taken together, these considerations point toward an auditing ecosystem that is predominantly private in execution but publicly overseen.

4 Lessons from Related Domains and Current AI Assessment

Before we detail our vision for frontier AI auditing, we briefly survey two bodies of practice that informed it: more established auditing and assurance practices in other industries and current third-party assessment in the AI industry. Appendix E and Appendix F provide more extensive discussions of these topics.

Historically, many industries introduced rigorous third-party oversight only after serious incidents compelled action. Pharmaceutical pre-market approval, for example, became mandatory only after the 1937 sulfanilamide disaster killed over 100 people, and the 1957–1961 thalidomide tragedy prompted additional efficacy requirements [56]. A degree of aviation certification became mandatory through the Air Commerce Act of 1926, championed by industry leaders who believed “the airplane could not reach its full commercial potential without federal action to improve and maintain safety standards” [57]. Decades later, standards were ratcheted up significantly after high-profile accidents such as the Grand Canyon crash in 1956 [58]. The ultimate success of frontier AI auditing would be enabling dramatic safety and security progress without catastrophe as a necessary catalyst.

4.1 Key lessons from more established assessment domains

Third-party audits are common across many industries [59], where carefully designed frameworks facilitate external assessment of sensitive technologies and institutions while protecting intellectual property. We draw lessons from four domains, discussed below and summarized in Table 1. See Appendix E for detailed discussion and examples.

Food safety and consumer product testing. These domains demonstrate that effective safety culture requires “defense in depth”: testing at multiple stages of a product’s lifecycle and for multiple failure modes. Independent testing organizations like Underwriters Laboratories [60] show that companies can opt in and pay for certification if people avoid products lacking trusted third-party assurance. Critically, safety system failures, such as the 2008 Chinese milk scandal [61], produce widespread distrust that propagates across companies and can persist for years [62]. For frontier AI, these precedents suggest (1) continuous testing throughout the lifecycle, (2) joint industry investment in testing infrastructure, and (3) recognition that a single high-profile failure can damage the entire industry’s standing.

Safety-critical systems engineering and aviation safety. Industries like aviation and nuclear power treat safety as an emergent property of complex sociotechnical systems, employing structured methodologies (including hazard analysis, safety cases, and continuous lifecycle risk management [63]) to proactively identify and manage risks. Aviation’s strong safety record involves interlocking elements providing defense in depth: pre-approval of designs, mandatory incident reporting, and criminal liability in some cases. However, at the same time, the Boeing 737 MAX disasters highlighted the catastrophic risks of excessive self-certification and deference, where commercial pressures overrode safety concerns. Key lessons include: (1) systems-level analysis provides greater evidence for safety decisions than component-level analysis alone; (2) near-misses are often early warning signs of eventual failures; (3) effective safety reporting requires structural independence and protection from retaliation; (4) self-certification and delegation of audits create dangerous conflicts of interest; and (5) auditing must be technically rigorous, rather than relying on company attestations (see Sections E.3 and E.4 for in-depth discussions).

Table 1: Key lessons drawn from the domains discussed in this section

Principle | Source Domains | Implication for Frontier AI
Independence | Financial auditing, aviation | Auditors need to be incentivized to meet very high standards in their analysis through mechanisms such as regulation, liability, and market pressures that reward rigor. Conflicts of interest need to be managed carefully.
Defense in depth | Food safety, aviation, consumer products | Multiple layers of assessment are needed at different lifecycle stages.
Continuous monitoring | Safety-critical systems, consumer products | One-off, static certifications are insufficient. Audits must account for systems changing over time.
Adversarial testing | Penetration testing | Adaptive red-teaming is needed, not just checking off a list.
Organizational assessment | Safety-critical systems, financial auditing | Culture, governance, and security matter, not just specific AI systems.

Penetration testing. Penetration testing demonstrates that security attributes are often best assessed through active adversarial testing rather than static checklists. Instead of checking only whether documented requirements are met, testers creatively search for unexpected failure modes and chain together subtle weaknesses. The field shows that an adversarial analytical posture can coexist with a collaborative relationship: auditors and companies iteratively fix issues rather than treating audits as one-off pass/fail exercises. Bug bounty programs [64] extend this into ongoing, market-based mechanisms with clear incentives. For frontier AI, adversarial testing should be a core component of misuse and security audits.

Financial auditing. Financial auditing offers perhaps the richest set of analogies (both positive and cautionary) for frontier AI auditing. On the positive side, it demonstrates the feasibility of professionalized processes allowing independent parties to review highly sensitive information, the value of standardized metrics for comparing risks across organizations, and the importance of combining verification of specific claims (e.g., financial statements) with broader assessment of internal controls. Financial auditing has developed crucial conceptual tools that are relevant to frontier AI: (1) clear norms for managing conflicts of interest [65], (2) sharp distinctions between error and fraud, and (3) recognition that professional judgment by auditors is indispensable. However, financial auditing also provides warnings. Catastrophic failures (Enron, Wirecard [66, 67]) illustrate what happens when auditors derive most of their revenue from a small number of large clients [68].
Even after reforms like Sarbanes–Oxley, the sector has struggled with conflicts of interest [69], procedural focus that risks missing systemic issues, and a persistent “expectations gap” between public belief that audits guarantee the absence of fraud and auditors’ more modest mandate. For frontier AI, these suggest the critical importance of auditor independence, clear communication of assurance levels, and avoiding criteria that devolve into box-ticking. These domains illustrate both the achievements and the pitfalls of common assurance regimes. We do not present these examples as gold standards — rather, we highlight constructive lessons for frontier AI auditing while encouraging thoughtful and deliberate effort to build self-correction mechanisms into the vision outlined in Section 5.

4.2 The current state of third-party AI assessment

Although frontier AI auditing as we define it does not yet exist, a growing field of third-party assessment provides a foundation on which to build [70]. This section summarizes the current state as of December 2025. See Appendix F for detailed discussion and examples.

Overview. Current third-party assessments of frontier AI vary substantially in scope, access, rigor, and transparency. Most evaluators receive only the same public access as ordinary users, with only a select few receiving early or privileged access. Public reporting is inconsistent: system cards sometimes mention third-party evaluators only in the abstract; methodological details are often omitted; and evaluators are sometimes not named at all even when they are used [71], making it difficult to follow up for more information or for other companies to seek to work with those evaluators. Assessments focus predominantly on capability evaluation and, increasingly, propensity evaluation (e.g., tendencies of AI models to deceive), with comparatively little attention to organizational risk governance, safety culture, or platform-level controls.

Key Dimensions. We assessed current practice across seven dimensions (a more detailed discussion can be found in Appendix F):

Reporting: Public reporting is sparse and inconsistent. System cards vary substantially in how they describe third-party involvement, and methodological details are often absent [3, 72].

Access: Most evaluators receive only black-box API access. A small but growing number of collaborations with government institutes have tested deeper access (e.g., chain-of-thought, internal documentation [38]), but gray-box and white-box access remain highly limited.

Rigor: Methodology and effort vary substantially. Benchmark-based assessments face issues with quality [7], contamination [25], and construct validity [73, 74]. Red-teaming effectiveness is skill-dependent [75]. Neither companies nor evaluators typically publish substantive threat models.

Standardization: Standards remain nascent, though they are evolving rapidly [70, 76]. Evaluations are typically conducted under bespoke, confidential contracts with terms rarely visible to regulators or the public.

Continuous monitoring: Assessments are one-off “snapshots” rather than continuous. Companies frequently update systems without providing third-party access for updated risk assessment.

Scope: Assessments focus heavily on technical systems (often just models) rather than organizational practices. Assessment of mitigations, platform-level controls, and safety culture is comparatively rare.

Scale and independence: Participation is voluntary and concentrated among a few developers. Evaluators depend on companies’ goodwill for access and sometimes funding, creating potential conflicts of interest.

Emerging Developments. Recent positive developments include proposed evaluation frameworks (e.g., [77, 78, 79]); initial best practices from the Frontier Model Forum [80]; the establishment of the AI Evaluator Forum [81]; pilots with government AI safety institutes in the US and UK [39, 40]; early examples of third-party review of company risk assessments (e.g., METR’s review of Anthropic’s sabotage risk report [82], which we consider to be among the first AAL-1 audits, and third-party review of the safety work conducted for OpenAI’s release of gpt-oss [83]); and OpenAI and Anthropic’s reciprocal safety assessments of each other’s systems [82, 84]. These developments are promising but remain early-stage compared to established assurance regimes.

4.3 The gap between current practice and cross-industry best practices

Current third-party AI assessment efforts provide a valuable starting point, including a nascent ecosystem of organizations, both for-profit and non-profit, that have conducted increasingly rigorous assessments over time. Yet significant gaps remain between these practices and the best practices found in other industries. How much further improvement is needed depends in part on the risk profile that can be expected from AI at different points in time. Roughly speaking, those who expect faster progress in AI capabilities in the future (and therefore greater safety and security risks, given AI’s general-purpose nature) should desire a faster rate of progress in third-party assessment along the various dimensions discussed above, so that we are not caught unprepared. Furthermore, to the extent that one believes that risks are highly correlated with raw capabilities, one might desire particular scrutiny to be applied to the very most capable AI systems and the companies building them. These insights inform the approach we take in the next section, where we suggest both general principles for how frontier AI auditing should work and a series of progressively stronger assurance levels that can be adapted to particular contexts.

5 A Vision for Frontier AI Auditing

In this section, we set out a long-term vision for what mature third-party auditing could look like: auditing of both the most capable AI systems and the companies building them. Some elements of this vision can be pursued now, while others will require years of investment and development before they become practical. We aim significantly beyond the status quo both because not all current assurance needs are being met by the current AI assurance ecosystem, and because we expect future AI systems to be far more capable and risky than those that exist today. Our vision for frontier AI auditing is organized around eight interlinked design principles, which we discuss in turn:

Scope of risks: Comprehensive coverage of four key risk categories that can be linked to company actions.

Organizational perspective: Auditing companies’ safety and security practices as a whole, not just individual models and systems.

Levels of assurance: A framework for calibrating and communicating confidence in audit conclusions.

Access: Deep enough to assure auditors and other stakeholders, secure enough to reassure auditees.

Continuous monitoring: Living assessments, not stale PDFs.

Independent experts: Trustworthy results through rigorous independence safeguards and deep expertise.

Rigor: Processes that are methodologically rigorous, traceable, and adaptive.

Clarity: Clear communication of audit results.

5.1 Risk scope of audits

Frontier AI auditing should focus on risks for which an AI company’s action or inaction can be directly linked to harmful outcomes, including at least the following risk categories (see Figure 5):

Intentional misuse. The use of frontier AI systems by malicious actors to enable or scale harmful activities. This includes, but is not limited to, cyberattacks; the development and use of chemical, biological, radiological, or nuclear weapons (CBRN); large-scale disinformation; violent and criminal activity; fraud; and the generation of child sexual abuse material (CSAM) or nonconsensual intimate imagery (NCII) [93].

Unintended system behavior. AI systems behaving in ways that developers and users did not intend, or being unsafe in ways that could plausibly cause large-scale harm. This includes highly consequential accidents caused by inadequate capabilities, alignment, or safeguards [94, 95]. Examples include systems taking harmful, irreversible actions, e.g., permanently deleting critical files [96, 97].8

Information security. Failures of confidentiality or integrity affecting critical AI assets. This includes the exfiltration of model weights [101], exfiltration of sensitive research and customer data via internal or external threats [102, 103, 104, 105], risks to user privacy arising from model vulnerabilities or behaviors [106, 107], as well as sabotage of highly capable AI systems [108, 109].

Emergent social phenomena. Risks that arise from interaction between humans and AI systems and do not fit neatly into “misuse” or “unintended behavior,” but can nevertheless cause significant harm if left unaddressed. Examples include addiction to or emotional dependence on AI systems, AI-induced or AI-enabled psychosis, and facilitation of self-harm [110, 111, 112, 113, 114, 115, 116, 117, 118, 119].

[Figure 5: Proposed risk focuses and sources of relevant standards for frontier AI auditing. The figure groups the auditing scope into four risk focus areas with direct causal connections to company actions (intentional misuse, unintended system behavior, information security, and emergent social phenomena), names three categories of relevant standards (company safety and security policies, emerging regulatory frameworks, and emerging industry standards), and places structural risks and indirect effects outside the auditing scope.]

7 In addition to insights from other industries (Section 4.1) and gaps in current AI assessment practices (Section 4.3) (on which further details can be found in Appendix E and Appendix F), our vision builds on prior work outlining frameworks for AI auditing, including field scans of the algorithmic auditing ecosystem [85], proposals for third-party audit ecosystem design based on a survey of the challenges and existing practices in other industries [86, 87], internal algorithmic auditing frameworks [88], external scrutiny requirements for frontier LLMs [89], assurance audit frameworks modeled on financial auditing [90], and layered approaches combining governance, model, and application audits [91, 92].

8 We categorize misalignment and loss of control as “unintended” in the sense that humans did not intend for the system to behave in these ways, even where the system itself may be acting coherently in pursuit of goals that diverge from those intended. Loss of control can be passive (inability to monitor or correct system behavior) or active (systems resisting human oversight) [1]. Some taxonomies treat misalignment and loss of control as a distinct risk category rather than a subset of accidents [98, 99], and others consider misalignment a catalyst for loss of control [100].

Table 2: Risk categories in company policies (e.g., from OpenAI, Google DeepMind, Anthropic, xAI, Meta, Microsoft, and Amazon) and regulatory texts [126].

AI risk category | Company policies | CA SB 53 / NY RAISE | EU AI Act Code of Practice
Intentional misuse | Partially included | Partially included | Fully included
Unintended system behavior | Partially included | Partially included | Fully included
Information security | Fully included | Fully included | Fully included
Emergent social phenomena | Partially included | Not included | Partially included

In reviewing the most recent 300 AI incidents logged by the AI Incident Database [120], we found these risks to cover all incidents cataloged except (1) those that do not involve frontier AI systems under our definition, such as those involving Waymo self-driving cars,9 which are highly capable in their domain but not general-purpose; and (2) those that did not result in very significant harms, such as an instance of confabulation of citations in a machine learning book.
Structural risks arising from how AI systems reshape systems, incentives, and environments in which they are deployed [123] are not a design target of our risk list. For example, gradual atrophying of skills at both individual and societal levels as more people rely on AI to perform analytical tasks [124], economic transformation generally, and greater vulnerability of society to electricity disruptions as a result of heavy AI use throughout the economy are not within our design focus or listed risks for this framework. This does not mean that we are opposed to auditing with respect to such risks, or that there could not be fruitful transparency requirements at a company level that shed light on how best to address structural risks. For each category of risks, auditors should (1) independently verify company claims and (2) evaluate the company’s systems and practices against its stated safety and security policies, applicable regulations, and industry best practices. Indeed, these risk categories largely map onto company safety and security policies, emerging industry standards (e.g., the Frontier Model Forum [125]), and regulatory initiatives such as California SB 53, New York’s RAISE Act, the EU AI Act, and the EU General-Purpose AI Code of Practice.11
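As a rough illustration of how the coverage comparison in Table 2 could be made machine-checkable, the sketch below encodes the table as a small Python mapping and lists the category and source pairs that fall short of full inclusion. The names Coverage, SOURCES, TABLE_2, coverage_gaps, and fully_covered_everywhere are hypothetical and not part of the paper; only the cell values are transcribed from Table 2.

```python
from enum import Enum


class Coverage(Enum):
    """Inclusion level of a risk category in a given source (per Table 2)."""
    NOT_INCLUDED = "Not included"
    PARTIAL = "Partially included"
    FULL = "Fully included"


# Column order follows Table 2.
SOURCES = ("Company policies", "CA SB 53 / NY RAISE", "EU AI Act Code of Practice")

# Rows transcribed from Table 2; the data structure itself is an assumption.
TABLE_2 = {
    "Intentional misuse": (Coverage.PARTIAL, Coverage.PARTIAL, Coverage.FULL),
    "Unintended system behavior": (Coverage.PARTIAL, Coverage.PARTIAL, Coverage.FULL),
    "Information security": (Coverage.FULL, Coverage.FULL, Coverage.FULL),
    "Emergent social phenomena": (Coverage.PARTIAL, Coverage.NOT_INCLUDED, Coverage.PARTIAL),
}


def coverage_gaps(table):
    """Return (category, source, level) triples that are not fully included."""
    return [
        (category, source, level)
        for category, levels in table.items()
        for source, level in zip(SOURCES, levels)
        if level is not Coverage.FULL
    ]


def fully_covered_everywhere(table):
    """Return the categories that every source fully includes."""
    return [
        category
        for category, levels in table.items()
        if all(level is Coverage.FULL for level in levels)
    ]


if __name__ == "__main__":
    for category, source, level in coverage_gaps(TABLE_2):
        print(f"{category} / {source}: {level.value}")
    print("Fully covered by all sources:", fully_covered_everywhere(TABLE_2))
```

An auditor could use a structure like this to flag where a company's stated policies are narrower than an applicable regulatory text, although the judgment of what counts as "partially included" remains a qualitative one.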

5.2 Comprehensive, organizational-level perspective

To examine the risks we outline above, an audit could cover different parts of a company, or the company as a whole. In this subsection, we argue that frontier AI auditors should emphasize the company as a whole as the most important level of analysis. Individual AI systems may be partially illustrative of or a big component of a company’s risk management, but they are never the full story of the company’s impact. Specific components of and artifacts produced by a company are important to audit and may even be the focus of specific audits, but should always be explicitly considered in (and audit conclusions should be framed in relation to) this larger context.

9 See [121].

10 See [122].

11 We expect the appropriate scope and emphasis of audits to evolve over time as threats, norms, and regulations change, but that there are common threads in how frontier AI auditing should work (e.g., careful management of sensitive information, ensuring auditor independence) that will not change significantly over time. We therefore think that one could endorse the vision discussed in this section, even if one would prefer a different scope.

Avoiding abstraction errors. A central danger in auditing frontier AI developers is that an audit can be right about the specific artifact or process it examined while still being wrong in the way that matters about the company’s overall risk posture. This reflects an abstraction error: forming the wrong conclusion by treating a partial or simplified unit of analysis (e.g., evaluating a specific component in isolation) as if it were sufficient to assess overall system and organizational risk. Such abstraction errors are especially likely in frontier AI because (1) risks such as those listed in Section 5.1 are shaped by interactions across internal processes, AI systems, and other parts of the internal technology stack, (2) many relevant systems and decisions are non-public and fast-changing, and (3) it is easy to (often unintentionally) audit what is most legible rather than what is most risk-relevant. Put differently: auditing can miss the forest for the trees not because the trees are unimportant, but because the forest is not simply the sum of individually “healthy-looking” trees. There are at least four ways abstraction errors can arise in practice:

Portfolio blindness: auditing the most visible or best-behaved system. Frontier AI developers rarely operate a single model or a single system. They maintain portfolios: multiple checkpoints; post-training variants; internal research models; preview builds for partners; fine-tunes for specific customers; custom model weights transferred to a datacenter controlled by a customer; and internal tools with broader permissions than the public product. It is therefore possible for an audit to establish that a flagship deployment is well-controlled, while missing a materially riskier surface elsewhere. In these cases, a favorable finding about one audited surface is not false per se, but may become misleading if it is treated as representative of the organization’s overall risk posture.

Configuration drift: outdated or incomplete audit results due to system-level changes. Even when the same exact model checkpoint is being used in different cases, real-world behavior and risk depend on system-level configurations: system prompts, input and output filters, routing across multiple models, retrieval sources (e.g., search engine APIs or periodically updated databases from which knowledge is retrieved during operation), tool access, memory, rate limits, monitoring thresholds, user-specific personalization, UI features, public-facing API implementations, and downstream post-processing. Seemingly modest changes such as enabling a new tool, relaxing a filter threshold, swapping in a different safety classifier, or changing routing rules for a subset of users or at different times of day can materially alter misuse potential or the likelihood of harmful failures. An abstraction error occurs here when an audit treats a specific evaluation (or a staging configuration) as a proxy for the actual deployed system, without establishing that the audited configuration matches production deployments and will remain stable enough for conclusions to hold. The need to hedge against configuration drift is one reason why we emphasize continuous monitoring for changes in Section 5.5.

Non-compositional safety and security: safe components, unsafe assembly. Many safety and security properties do not necessarily compose together neatly. A model that refuses harmful requests in an isolated user chat setting may still enable harmful outcomes in another isolated user chat, or when embedded in an agentic scaffold that chains together multiple tool calls and operates over long horizons. A model with concerning raw capabilities and propensities (e.g., to deceive users) may be kept low-risk through strong system-level controls. For frontier AI, the risk-relevant question is often less “what can the model do in isolation?” and more “what can the organization’s integrated systems do, under realistic conditions, given the actual controls?” Abstraction errors arise here when auditors over-weight component-level findings while under-weighting system-level or organization-level interactions that dominate the actual risk.

Boundary mismatch in security: strong product security, weak security of trade secrets. A company may deploy a well-engineered public API (authentication, rate limits, abuse monitoring) while leaving training infrastructure, model weight storage, experiment tracking, or internal repositories comparatively exposed. Indeed, at least two frontier AI companies have had AI research-related intellectual property stolen from them [127, 128], and likely there are many similar cases that are not publicly known given what is known about these companies’ security practices and the difficulty of defending against sophisticated attacks [108, 129]. The resulting organizational risk can be dominated by the weaker boundary: if an adversary can exfiltrate model weights or tamper with training and deployment artifacts, the company’s public-facing mitigations may become irrelevant (e.g., stolen weights can be used without those mitigations). Here, a “system-level” audit focused on the externally visible interface can substantially underestimate information security risks that sit behind the interface but govern the most consequential assets.

Abstraction errors are not rare edge cases to be aware of and carefully avoided. Rather, they demonstrate that the company level of analysis is best for forming confident conclusions, even if it is hard to achieve in practice [130]. There are predictable ways audits focused on only a single component of a frontier AI company can mislead all stakeholders regarding that company’s risk posture.

Three lenses. In our vision for frontier AI auditing, lead auditors need to integrate three lenses: models and systems, which includes AI models, system features connected to those models (e.g., input and output classifiers, system prompts), and information security safeguards (e.g., user authentication); computing hardware, including its quantity and security, and how it is allocated across development and deployment efforts; and governance, including development and deployment decision-making processes, information security systems and protocols, incident response protocols, the safety and security culture of the organization, and the clarity with which responsibility is allocated within the company. Neglecting one of these lenses risks an incomplete picture of a company’s risk profile (see Figure 6).

AI models and systems are the primary focal point of a frontier AI company’s work. But critically, from an auditing perspective, it’s important not to focus on a single model or system to the exclusion of others. All but the very most nascent companies have many different models and systems at a given time. This includes models and systems that are in development in addition to those that are deployed; models that are smaller, cheaper, and faster but less capable as well as those that are larger, more expensive, and slower but more capable; versions of systems that use more or less computing power while in use (“test-time compute”); versions that are produced and provided specifically for a given customer, such as a company or government agency, and may have different guardrails; information security systems, which are critical to ensuring that the other systems are not stolen or tampered with; and much more. At higher assurance levels, more of these systems are critically examined, and in more detail.

Auditors also need to understand the computing hardware that a company has access to and how it is using it.
Physical or digital access to that computing hardware could be a weak link for the security of training infrastructure, weight storage, and internal repositories, as discussed above; major training runs or deployments that are not publicly announced could contribute disproportionately to a company’s risk profile. Such internal deployment [26] might be unknown to auditors by default, but shouldn’t be if auditors are to be effective in characterizing risks. Gaps in a company’s ability to comprehensively account for its own compute use could point to gaps in the company’s understanding of its own activities, or could indicate efforts to mislead auditors (note that this is particularly important at higher assurance levels, where significant effort is made to rule out the possibility of deception).

Lastly, understanding the safety and security governance of all of these digital and physical systems is critical in order to put those systems in context. An auditor needs to know who is responsible for what, how documents are produced, what the incentives facing the document-writers were, etc. in order to meaningfully interact with non-public information and spot subtle errors and, at the highest levels of assurance, intentional deception. In short, information about governance helps indicate how much to trust other kinds of information. Furthermore, significant gaps in governance, both formal (e.g., limited or non-existent policies governing internal AI deployments) and informal (e.g., a culture of corner-cutting in specific areas that comes up in staff interviews), may provide vital clues to gaps in risk mitigation at a system level.

[Figure 6: What should be audited? Three panels (Model / system, Compute, and Governance), each summarizing what is audited, when in the lifecycle, and what is assessed.]

Practical implications of the organization-level perspective. To achieve higher levels of assurance about a company’s risk profile (Section 5.3), deeper access to information (Section 5.4) will tend to be required about each of these lenses in order to enable drawing confident conclusions about a company’s risk profile in each of the four risk categories. This in turn will require standardized processes for “mixing and matching” subcontractors with different skill sets (Section 5.6). In order to avoid committing abstraction errors due to configuration drift, continuous monitoring will be needed, and a range of different audit cadences will need to be conducted, corresponding to the different paces of change of different organizational components (Section 5.5).
Rigorous, traceable processes are needed in order to allow those interpreting or replicating an audit to infer whether abstraction errors are likely (Section 5.7). Privately and publicly shared audit findings (see Section 5.8) need to enumerate the assumptions being made in order for analyses of artifacts to be representative of the company as a whole. Over time, there needs to be research toward a standardized analytical framework (e.g., an “organizational safety and security case”) that combines different inputs into a composite picture of a company’s risk profile. Such research should draw on best practices from safety-critical systems engineering, such as safety cases, which are structured arguments supported by evidence that justify the safety of a system [131].12 We think our AI Assurance Level (AAL) framework, discussed next, is an early step.

5.3 Levels of assurance

To address the different risk scopes for different depths of organization-level audits, we propose a framework for calibrating and communicating confidence in audit conclusions that we call AI Assurance Levels (AALs). This framework is intended to help those conducting and relying on audits to understand what conclusions they can reasonably draw, and what kinds of abstraction errors (Section 5.2) — among other types of errors — still cannot be ruled out.

5.3.1 The meaning of levels of assurance in general

Frontier AI audits should each be conducted at a specific “level of assurance.” A given level describes how confident the auditor is in their conclusions about a given company’s safety and security practices [134, 135]. In principle, an auditor could reach a high assurance conclusion that safety and security safeguards are very poor; however, we will often use examples of audit findings that are positive with respect to safety and security risk management. We do so because, in practice, frontier AI audits involve active collaboration between the auditor and auditee and allow the possibility of remediation prior to publication of results. Explicit assurance levels help stakeholders understand how much they can rely on audit results and what assumptions remain untested. In other industries, reviewers and auditors provide either “limited” or “reasonable” assurance [136, 137, 138]. Safety-critical industries (e.g., aviation, nuclear power) also use the concept of reasonable assurance [139, 140, 141], which implies a higher degree of confidence. The level of assurance required for different contexts may differ, depending in part on the costs of the audit and the costs of errors [142].

5.3.2 Overview of AI Assurance Levels

We use the term AI Assurance Levels (AALs) to refer to assurance levels in the sense above, as applied to the specific context of a frontier AI audit. Higher levels more and more confidently assess the risk level associated with the frontier AI company as a whole, and progressively rule out abstraction errors such as those discussed in Section 5.2 as well as other possible sources of error in the audit’s findings. To achieve this, audits at higher levels will tend to require greater access to non-public information relative to lower levels, larger allocations of time and talent, and more sophisticated infrastructure and analysis.

Lower AI Assurance Levels (AAL-1 and 2) can detect some errors on the part of companies and verify the existence of significant compliance efforts, and they can achieve this using smaller expenditures of time and talent. While audits at these levels may be able to detect errors (i.e., unintentional misstatements or mistakes), they are less likely to be able to detect fraud (i.e., intentional deception) (see [145]).

12 Existing work has proposed safety cases to verify that AI systems are safe enough to develop or deploy (see [132, 133]).

13 After completing most of this paper, the authors learned of a prior use of the term “AI Assurance Level” with a very different meaning [143], as well as another use of the acronym AAL in a related context (Authenticator Assurance Levels [144]). These collisions are unintentional.

Higher AI Assurance Levels (AAL-3 and 4) can provide stakeholders significantly more confidence that the conclusion of the audit is correct and that more subtle errors will be detected, and they aim to address the possibility of deliberate deception on the part of the company. Since we envision audits of companies rather than just systems, audits at higher AALs serve as better and better estimates of company-level risks (versus just system-level risks).
At the same time, these audits are more costly because they will require more allocation of both company capacity and auditor capacity, and will involve greater access to (often sensitive) non-public information (see [53]). Using lower AI Assurance Levels may be appropriate when risks of audit errors are less severe, making the cost of achieving higher AI Assurance Levels greater than the assurance that is obtained [146]. In contrast, using higher assurance levels may be appropriate for auditing risks that stakeholders are more concerned about, or auditing in situations where there are strong incentives for the companies to cut corners [45].

5.3.3 AI Assurance Level details

Drawing inspiration from the precedents above, as well as the specific context of frontier AI, we describe each of our proposed four AI Assurance Levels (AALs) below. We begin with an overarching summary, then provide more details on each.

Limited assurance (AAL-1). A time-bounded audit of a particular frontier AI system (typically a few weeks), which makes use of API access to multiple model versions and system settings, as well as a limited amount of additional, non-public information focused on the audited system and related internal decisions.

Moderate assurance (AAL-2). A more extensive assessment of one or more frontier AI systems, as well as company practices more broadly, which, at a minimum, spans months and makes use of gray-box system access, extensive internal documentation (e.g., unredacted safety cases), some continuous monitoring, and staff interviews across several functions.

High assurance (AAL-3). Ongoing oversight (multiyear engagement for the lead auditor, with many subcontractors contributing throughout) with white-box access, more extensive continuous monitoring, and the authority to examine any area of concern.

Very high assurance (AAL-4). Continuous verification designed to detect active deception attempts, operating with a full understanding of the company’s systems, computing hardware, and governance, and providing “treaty-grade” confirmation of the company’s risk profile.

The table and paragraphs below summarize the level progression.

Table 3: Summary of AI Assurance Levels. At higher levels, auditors are more confident in their conclusion.

AAL-1: Limited assurance
Duration: Time-bounded, typically a few weeks to a small number of months.
Typical access to information (cumulative), System: Black-box API access to multiple checkpoints/variants; access to chain-of-thought outputs and logits; ability to enable/disable safety classifiers that block certain inputs and outputs; overview of safety mitigations; limited amount of non-public information scoped to the system under audit and related internal decisions.
Typical access to information (cumulative), Organization: Written representations; organization chart; list of key staff members; attestations about training processes.
Methods, System: Run private evaluation suites probing for dangerous capabilities (e.g., cyber, bio, manipulation); conduct limited red-teaming to probe system boundaries.
Methods, Organization: Review provided documentation; interview key staff about governance structure and reporting lines.
Readiness: Achievable now; early pilots already conducted.

AAL-2: Moderate assurance
Duration: Months (at least).
Typical access to information (cumulative), System: Gray-box access to (multiple) key systems; samples of training and testing logs; compute allocation records; ability to remove and examine all mitigations.
Typical access to information (cumulative), Organization: Safety/security documentation (e.g., safety cases); governance decision records; access to interview staff across safety, security, policy, and product; subsets of training data; information on previous incidents.
Methods, System: Reproduce a sample of internal evaluations to verify they were conducted as described; conduct extensive, targeted red-teaming beyond standard probes; investigate sandbagging and other difficult-to-detect behaviors that could affect evaluation results, leveraging access to model internals; develop statistical model signatures in order to confirm whether audited models correspond to deployed models.
Methods, Organization: Review governance artifacts (release checklists, escalation logs, decision records); interview staff across functions to understand how decisions are made in practice vs. policy.
Readiness: Early to mid-2026 (requires investment in IP protection protocols and method standardization).

AAL-3: High assurance
Duration: Multiyear engagement for the lead auditor, with many subcontractors contributing throughout.
Typical access to information (cumulative), System: White-box access to all models; safety case with comprehensive supporting details; detailed logs and compute accounting with cryptographic provenance (e.g., “proof of training”); continuous monitoring (e.g., automated drift and change-detection infrastructure); training data.
Typical access to information (cumulative), Organization: Authority to examine any system, process, or decision; ability to interview all relevant personnel; full visibility into relevant governance processes and decision-making; continuous access to relevant internal communications related to safety and security.
Methods, System: Continuous automated monitoring of model behavior, deployments, and compute usage; comprehensive interpretability research on models; automated detection of behavioral drift and undisclosed training.
Methods, Organization: Review major decisions (training runs, deployments, safety exceptions) as they happen; ongoing governance review; cross-functional safety culture assessment. Certifications automatically degrade when material changes occur.
Readiness: Uncertain – possibly early 2027 if significant investments begin soon.

AAL-4: Very high assurance
Duration: Continuous.
Typical access to information (cumulative), System: Tamper-evident logging across infrastructure using formally verified open-source cryptographic provenance tooling; ability to inspect training data, model weights, and deployment configurations.
Typical access to information (cumulative), Organization: Multiple independently selected audit teams with full organizational access; short notice inspections of physical facilities and interviewing of staff.
Methods, System: Hardware attestation and verification; formally verified open-source cryptographic provenance tooling; destructive testing of hardware samples to detect possible tampering; continuous automated and manual monitoring of all risk-relevant systems.
Methods, Organization: Independent verification by multiple teams with no shared dependencies; whistleblower bounties for disclosure of unaccounted compute or theft; regular unannounced inspections. Adversarial red-teaming targets verification mechanisms themselves.
Readiness: Uncertain – possibly late 2027 if significant investments begin soon.

AAL-1 (limited assurance). An audit at AAL-1 indicates to stakeholders that the auditor has some degree of confidence in their conclusion, as there were no glaring issues found with the claims, systems, and practices they assessed, though at this level, the auditor is still relying heavily on a company’s representations (i.e., formal statements by company staff asserting certain facts) (see [142, 147]) and knowingly runs the risk of multiple types of abstraction error (Section 5.2). With the time span (typically a few weeks) and limited breadth and depth of AAL-1, it is possible to only rule out a subset of possible abstraction errors — e.g., conflating evaluation of a specific model with the overarching system being assessed.
Conclusions have a short half-life and say relatively little about the company as a whole, except in the unusual case where the company has negligible activities beyond developing the one audited system. As a result, any company-level claims made by the auditor specifically relate to the company’s processes as they are applied to the specific system being audited, rather than providing much confidence regarding how those processes are applied more broadly (e.g., to internal deployments or to other externally deployed systems that are out of scope). While these engagements are not enough to qualify as audits according to some standards in other domains (see [148]), they still provide meaningful evidence compared to self-assessment alone, so we treat them as the starting point for frontier AI auditing. Some very recent frontier AI assessments are at this level (see [149]).

What this can detect: Dangerous capabilities that surface under industry standard evaluation methods (e.g., ability to generate working exploit code, synthesis instructions for controlled substances); glaring gaps between stated policies and documentation; basic failures in safety mitigations.

Example conclusion: “Within our three-week evaluation using API access, we found no evidence that the system can reliably assist with novel cyberattacks.”

What this cannot detect: Capabilities requiring sophisticated elicitation; gaps between documentation and actual practice; undisclosed systems or training runs; any intentional concealment. Auditors take company representations largely at face value.

Who audits: Single evaluation organization or small team. Standard conflict-of-interest disclosures; lighter independence requirements than higher levels.

AAL-2 (moderate assurance).
Auditors use company-provided documentation to inform their analysis and make it more efficient, but ultimately draw their conclusions based primarily on direct evidence gathered over months rather than company representations (see [147]). A limited degree of continuous monitoring is achieved: auditors at this level will verify that the model and systems that they are examining are the same ones that are actually being deployed [150], ruling out some kinds of configuration drift (Section 5.2). An audit at AAL-2 indicates to stakeholders that the auditor has ruled out several potential sources of abstraction errors compared to AAL-1 (e.g., via incorporation of some degree of continuous monitoring, and looking beyond models to consider a wider fraction of relevant system-level and platform-level mitigations), though still not all such errors. More rigorous technical analysis will be brought to bear at this level, making use of a richer range of information. This includes access to safety cases provided by the company. For AAL-2, auditors verify most or all aspects of such safety cases.

What this can detect: Negligence and sloppiness in safety practices; gaps between stated policies and actual practice in examined areas; cherry-picking of favorable evaluation results; basic security control failures.

Example conclusion: “We independently reproduced the company’s internal capability evaluations and confirmed the release gate was followed for this deployment.”

What this cannot detect: Problems in unsampled systems or processes; long-running concealment efforts; undisclosed training runs or “shadow” systems; sophisticated fraud. Auditors assume sampled areas are representative and the company is not actively deceiving them.

Who audits: Accredited evaluation organization(s), potentially with subcontractors for specialized domains (security, biosafety). Lead auditor coordinates and takes responsibility.
Stronger independence requirements: cooling-off periods, revenue diversification, disclosed terms of engagement.

AAL-3 (high assurance). All company activities are presumptively in-scope and information about these activities is fully available (i.e., white-box access); the range of possible abstraction errors is narrow and well-defined. Remaining uncertainties primarily relate to the possibility of sophisticated deception or industry-wide vulnerabilities (e.g., unknown problems in open-source code or unknown weaknesses in state-of-the-art alignment techniques). An audit at AAL-3 indicates to stakeholders that the conclusion is highly likely to be true. At this level, all material aspects of safety cases and information security protocols will have been directly verified, and continuous monitoring (over the course of multiple years) will be used extensively to ensure that audit conclusions continue to hold and material changes are unlikely to be missed by auditors. Nevertheless, the gaps that do remain are significant enough that a very skeptical party — one who believes that the audited company would exploit any opportunity to cut corners or deceive auditors — would still not be willing to rely on an AAL-3 audit in very high-stakes situations.

What this can detect: Most concealment and corner-cutting; significant safety gaps; undisclosed major training runs (given adequate compute accounting); material misrepresentations; changes that invalidate prior findings.

Example conclusion: “We have continuous visibility into all major training runs and deployments; compute accounting confirms no undisclosed runs exceeding the agreed threshold.”

What this cannot detect: Extremely sophisticated deception involving compromised monitoring infrastructure or multiparty collusion; compromised hardware supply chains. Auditors assume monitoring works as intended and at least one auditor is honest and competent.
Who audits: Multiple accredited auditors with long-term (multiyear) engagement. Lead auditor can subcontract specialized teams. An oversight body receives unredacted reports and can inspect the audit itself. Stringent independence requirements (e.g., more restrictive financial disclosure requirements than earlier levels; payment must come from a source other than auditees).

AAL-4 (very high assurance). At this level, any violations of load-bearing assumptions behind key safety and security claims would be quickly detected by auditors before significant harm could occur. Furthermore, few if any potential sources of abstraction error are considered to have material likelihoods (and any that do are closely monitored and well-quantified). AAL-4 audits can provide confidence in conclusions even assuming highly resourced and motivated actors aggressively exploiting opportunities to cheat (see [151, 152]). “Very high assurance” is not well-defined in existing auditing literature and represents our effort to bridge the literature on auditing with the literature on verification of international arms control agreements, where very high assurance is required [153]. We make this connection because we believe that AAL-4 audits may ultimately be needed for very high-stakes purposes such as verifying US–China cooperation on baseline AI safety and security standards [55]. They could also plausibly be needed for domestic regulation purposes alone, simply due to significant advances in AI capabilities and their associated risks, making significant uncertainty in frontier AI risk mitigations no longer tolerable.

What this can detect: Deliberate deception including even relatively small hidden training runs and inference jobs, shadow systems with safeguards removed, and selective disclosure of information about systems and practices.
Example conclusion: “Hardware attestation and cryptographic logs confirm compliance with the agreed restrictions on fine-tuning for dangerous capabilities, even accounting for potential evasion attempts.”

What this cannot detect: Completely novel evasion techniques unknown to highly-resourced verification designers; unknown vulnerabilities in long-used, well-studied cryptographic algorithms. No verification regime provides absolute guarantees, but the remaining sources of error are very tightly circumscribed.

Who audits: One or more accredited lead auditors and a range of subcontractors performing various functions. Government involvement is likely necessary for legitimacy, enforcement, and access to national security information. May require multi-jurisdictional representation and security clearances.
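The "tamper-evident logging" listed for AAL-4 can be illustrated with a hash chain, in which each log entry commits to the one before it. The sketch below is a minimal illustration under our own assumptions, not the formally verified provenance tooling the table envisions: the record fields, the `GENESIS` constant, and the function names are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder hash preceding the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, chaining it to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": entry_hash(prev, record)})


def first_broken_index(log: list):
    """Return the index of the first entry whose hash fails to verify, or None."""
    prev = GENESIS
    for i, entry in enumerate(log):
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return i
        prev = entry["hash"]
    return None


# Illustrative records only; the field names are assumptions.
log = []
append(log, {"event": "training_run_started", "gpu_hours": 1200})
append(log, {"event": "checkpoint_saved", "step": 5000})
append(log, {"event": "deployment", "model": "example-model-v1"})
print(first_broken_index(log))

# Simulate an after-the-fact edit to an earlier record.
log[0]["record"]["gpu_hours"] = 12
print(first_broken_index(log))
```

A chain of this kind only reveals edits if the auditor holds a copy of the latest hash outside the auditee's control; otherwise the whole chain could be recomputed after a change. That dependency is one reason the AAL-4 row pairs logging with hardware attestation and independent audit teams.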

5.3.4 Choice of AI Assurance Levels

There is a trade-off between gaining more confidence that risks have been mitigated, and the financial costs associated with reaching that higher confidence (see [154, 155]). Depending on the scope of the engagement, and with very high uncertainty, we loosely estimate that AAL-1 engagements could cost around $300,000–$600,000 for multi-week to few-month engagements, AAL-2 engagements might cost around $1,000,000 or more for multi-month engagements, while AAL-3 or AAL-4 engagements requiring continuous access and specialized technical verification could cost several million dollars annually. These costs could potentially be reduced over time through automation and amortization of initial infrastructure investments, though as the stakes of missing key issues increase, it may be appropriate to increase investment in computing power applied to auditing, which may cancel out that effect. AAL-3 appears (at least) very difficult today and AAL-4 appears infeasible today, making research and pilots on each a priority. In this paper, we do not aim to settle the question of which of the four AI Assurance Levels (AALs) should be applied to which subsets of frontier AI, beyond recommending AAL-1 as the floor for frontier AI as a general category, and AAL-2 as a near-term goal for the most advanced subset of frontier AI.14 We do not think it’s desirable or realistic to be much more prescriptive than that at this stage given the many factors bearing on the decision (discussed in Section 5.2). We make recommendations in Section 6 for how these threshold questions can be continuously updated over time based on the latest evidence.
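The trade-off described in this subsection can be made concrete with a toy expected-cost comparison. The sketch below is ours, not a method from the paper. The audit costs are rough readings of the loose estimates above (the midpoint of the AAL-1 range, the AAL-2 lower bound, and an assumed $3,000,000 standing in for "several million"), while the miss probabilities in `P_MISSED` and the harm figures are purely hypothetical placeholders chosen for illustration.

```python
def expected_total_cost(audit_cost: float, p_missed: float, harm_cost: float) -> float:
    """Audit cost plus the expected cost of a material issue the audit misses."""
    return audit_cost + p_missed * harm_cost


# Rough annual audit costs in USD, loosely based on the estimates in the text.
AUDIT_COST = {"AAL-1": 450_000, "AAL-2": 1_000_000, "AAL-3": 3_000_000}

# HYPOTHETICAL probabilities that a material issue goes undetected at each level.
P_MISSED = {"AAL-1": 0.30, "AAL-2": 0.15, "AAL-3": 0.05}


def cheapest_level(harm_cost: float) -> str:
    """Level with the lowest expected total cost for a given cost of a missed issue."""
    return min(
        AUDIT_COST,
        key=lambda lvl: expected_total_cost(AUDIT_COST[lvl], P_MISSED[lvl], harm_cost),
    )


# HYPOTHETICAL costs of a missed issue, in USD.
for harm in (1_000_000, 10_000_000, 100_000_000):
    print(harm, cheapest_level(harm))
```

Under these placeholder numbers the preferred level rises as the cost of a missed issue grows, which is the qualitative pattern the text describes. The actual thresholds depend entirely on inputs that are highly uncertain today.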

5.4 Deep, secure, and timely access to information and resources

A mature frontier AI auditing ecosystem depends on auditors having deep, secure, and timely access to the information, systems, and organizational processes under examination. Ultimately access should be deep enough to assure auditors and other stakeholders, but secure enough to reassure auditees. The access provisions should depend on the specific audit and be proportional to the risks believed (pre-audit) to be posed by the company’s systems and the assurance level sought (see Section 5.3). Auditors should also be provided with sufficient tooling and compute resources, as well as channels for viewing non-public information. Access should be provided through secure evaluation environments if required for addressing privacy or security concerns.

14 An illustrative operationalization of “the most advanced subset” might be, e.g., companies that have produced, within the prior year, any AI systems that were within three months of the state-of-the-art at the time. Making such a definition more precise would require wider discussion among stakeholders and analysis, which we hope that this paper helps encourage.

Figure 7 panels: System access; System information; Governance and process; Operational and contextual; External feedback; Public information.
Figure 7 entries (in source order): Sampling access; Production model variants; Low-mitigation model variants; Fine-tuning access; Model internals; Model weights; Chain-of-thought; Model specifications; Model families and lineage; Architecture and training documentation; Evaluation results and artifacts; System documentation; Monitoring systems; System logs; Compute accounting records; Process documentation; Board and governance minutes; Internal reports; Process communications; Previous compliance reviews; Organization charts; Written representations; Staff interviews; Governance interviews; Process-owner interviews; Casual conversations; Meeting attendance; Walkthroughs; Operational communications; External inquiries; User reports and complaints; Customer and deployer feedback; Third-party correspondence; Company public outputs; Regulatory filings and disclosures; Published research; External commentary and analysis; Public user feedback.

Figure 7: A non-exhaustive taxonomy of information sources that companies may provide access to, across model, system, governance, and operational domains. Public information is also included, as auditors should consider it alongside company-provided sources. The depth of access required will depend on the specific audit engagement and the assurance level sought. See Appendix C for descriptions of the different items.

Deep access. To verify safety and security claims relating to AI systems, auditors need deep technical and organizational access (see Figure 7 and Appendix C). This is essential in order to avoid the four types of abstraction error discussed in Section 5.2. For AI models and systems, at a minimum, this should include black-box sampling access to the system(s) through an API with permissive rate limits. However, greater levels of assurance will likely require auditors being provided with deeper access, including access to output logits, weights, activations, or the ability to modify the model (for example, through fine-tuning) [5, 156, 157, 158, 159]. Furthermore, assessing some claims will require auditor access to systems other than those to be deployed, as well as information about the system’s functioning and how it was developed. For example, establishing upper bounds on a system’s potential risks will depend on assessing system versions without safety guardrails in place (e.g., helpful-only models can be used to understand worst-case scenarios if those guardrails fail).
In addition to access to AI systems, auditors should be provided access to other relevant non-public information, such as compute accounting,15 incident reports, internal risk assessments, meeting notes, and decision logs. Auditors may also need to conduct interviews with relevant staff members in order to verify information or dig deeper on topics that are not well-documented.

15 By compute accounting, we mean systematically tracking the use of computing power in order to verify that it was used in the manner described, and to reduce uncertainty about the possibility of undisclosed, significant training runs or inference runs [53, 54, 160]. Notably, this is particularly relevant to higher assurance levels. It is technically challenging to do compute accounting effectively, and at lower assurance levels, claims about compute usage will likely need to be taken on trust.

Secure access. Providing deeper levels of access and sensitive information to auditors could expose commercially valuable intellectual property or national security-critical assets to leaks, theft, or sabotage, as well as legal risks. When access to valuable or sensitive assets is required, audits should be conducted through secure evaluation environments that allow auditors to run tests, probe model behavior, examine system responses, and simulate realistic traffic, while preventing unauthorized disclosure. These environments should also prevent the auditee from observing or influencing tests, reducing the possibility of “teaching to the test” and ensuring auditors’ findings remain meaningful over time and across companies. Comparable practices exist in other high-stakes assurance settings, such as regulator-operated test environments in food safety [162], medical device safety [163], and automotive emissions [164].
OpenMined’s PySyft framework has been tested with Anthropic and the UK AI Security Institute using NVIDIA H100 secure enclaves for mutually protected evaluations, though so far only on small, non-production systems [165]. Access to sensitive information should be provided through secure channels and legal safeguards. Secure channels could involve issuing monitored and secured corporate devices to auditors, or having auditors work on-premise at company facilities under controlled conditions. Other sectors with high confidentiality requirements use similar structures to balance access and protection. For example, due diligence in mergers and acquisitions often relies on the use of “clean rooms” (physically or logically isolated digital environments for analyzing sensitive information) and “clean teams” (personnel who have access to confidential information but are insulated from decisions where that information could create conflicts of interests) [166, 167]. Other existing mechanisms include audit compartmentalization, where different auditors assess different aspects of operations, and secure enclave models in cloud security assurance [168]. Legal safeguards may include nondisclosure agreements, contractual liability for breaches, and professional sanctions for auditors who mishandle sensitive information. Adequate resources and tooling. Auditors should have the resources required to conduct high-quality, multi-domain assessments of frontier AI systems. This includes access to sufficient compute (e.g., for evaluations, parallelization, verification, and re-runs); robust tooling for inspecting logs, lineage, and governance records; and technical support from the auditee regarding the operation of auditing tools. Timely and responsive access. Auditors should be given adequate time to design, execute, and analyze all audit activities, whether technical evaluations, document reviews, or staff interviews. 
Communication channels should be provided through which auditors can receive prompt answers to follow-up questions and rapid access to updated materials after significant changes, including logs and lineage information for technical assessments, and governance artifacts, decision records, and relevant personnel.

Governed and accountable access arrangements. Auditor access should be structured through clear agreements that specify rights, obligations, permitted uses [169], and consequences for misuse. Access arrangements should define what auditors may examine, how information may be used and stored, and the conditions under which access may be expanded, restricted, or revoked. Auditors should be subject to confidentiality obligations backed by contractual, professional, and, where appropriate, legal sanctions, with clear liability for breaches. Disputes about the scope or burden of access requests should be resolved rapidly through clear escalation pathways and independent mediation so that audits are not delayed or obstructed.

16 Model weights, proprietary training data, internal tools, detailed security documentation, and even model outputs in sensitive domains like nuclear technology are sensitive assets, and are often protected by law or norms. For example, privacy law limits access to some datasets; trade-secret protections constrain what can be widely shared; export controls may restrict whether foreign nationals can see certain information [161]. Further, several AI companies sharing auditors can also raise concerns about information spillovers [154].

Figure 8 content (if a safety and security claim depends on … then it is assessed via …):
Slow-moving elements (governance structures and practices; safety culture; risk ownership; release gates; escalation pathways; incident response management; baseline security posture): periodic assessment. Annual or semiannual deep checks.
Episodic events (training runs; model releases; significant fine-tuning; novel integrations; serious incidents; post-training updates): event-triggered reviews. Triggered by significant and material changes.
Fast-moving elements (API/output behavior; configuration changes; behavioral drift; new user behavior patterns): continuous monitoring. Automated and ongoing telemetry, drift detection and alerts.
Outcome: valid, up-to-date audit conclusions, flagged or deprecated when underlying elements change.

Figure 8: Matching assessment cadence to rate of change. Safety and security claims depend on elements that change at different speeds. To keep audit conclusions valid over time, the auditing ecosystem should assess each element at a cadence matching its rate of change: periodic deep assessment for slow-moving organizational elements, event-triggered reviews for episodic technical and deployment decisions, and continuous automated monitoring for fast-changing behavioral surfaces.

Safeguards against omission and selective disclosure. For the highest levels of assurance, mechanisms should be in place that allow for detection of whether relevant information has been withheld, manipulated, or selectively disclosed by the auditee. This requires tools and processes such as compute accounting to assess whether there are likely any materially significant AI systems created by the company that auditors haven’t assessed, random sampling of logs or lineage artifacts, structured comparisons between public and private documentation, protected whistleblowing channels for employees, and whistleblowing bounties [170]. Where companies refuse access without adequate justification, auditors should make adverse inferences consistent with established practice in other assurance systems (e.g., IAASB ISA 705 [171] and PCAOB AS 3105 [172]), with fair processes for arbitration of disagreements between auditors and auditees (e.g., by the organizations discussed in Section 6.1 and Section 6.3).
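As a rough illustration of how compute accounting can help surface undisclosed systems, the sketch below reconciles metered compute against the jobs a company has declared and flags any gap above an agreed threshold. It is a simplified sketch under our own assumptions: the `DeclaredRun` schema, the function names, the GPU-hour figures, and the threshold are all hypothetical, and real compute accounting is considerably harder, as the text notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredRun:
    """A training or inference job the company has disclosed (hypothetical schema)."""
    name: str
    gpu_hours: int


def unexplained_gpu_hours(metered_total: int, declared: list) -> int:
    """Metered usage that no disclosed job accounts for."""
    return metered_total - sum(run.gpu_hours for run in declared)


def needs_follow_up(metered_total: int, declared: list, threshold: int) -> bool:
    """True when unexplained usage exceeds the agreed materiality threshold."""
    return unexplained_gpu_hours(metered_total, declared) > threshold


# Illustrative figures only.
declared_runs = [
    DeclaredRun("pretraining-run", 900_000),
    DeclaredRun("safety-fine-tune", 40_000),
    DeclaredRun("internal-evals", 10_000),
]
metered = 1_000_000   # e.g., from independently metered infrastructure
threshold = 25_000    # assumed materiality threshold agreed with the auditee

gap = unexplained_gpu_hours(metered, declared_runs)
print(f"unexplained GPU-hours: {gap}")
print("follow-up needed" if needs_follow_up(metered, declared_runs, threshold) else "within threshold")
```

The check is only as trustworthy as the metered total, which is why the higher assurance levels pair compute accounting with cryptographic provenance and hardware attestation.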

5.5 Continuous, risk-responsive assurance

Our vision for a mature frontier AI auditing regime is one where audit conclusions remain accurate as systems and developer practices change, risk-relevant changes can be detected and responded to in a timely fashion, and assurance processes are aligned with the times at which risks are created, amplified, mitigated, or revealed. The goal is to produce living assessments that evolve with the systems and companies they evaluate, not static documents that become stale within days or weeks of publication.

Audit cadences. The frontier AI auditing ecosystem should assess different safety and security claims at cadences that match how quickly the underlying elements change (see Figure 8). This ensures audit conclusions remain valid over time. Slower-moving elements of the organization (e.g., governance, safety culture, release gates, incident response, security posture) warrant less frequent, periodic deep assessment (e.g., annual or semiannual). Faster-moving and time-bound elements and decisions (e.g., training runs, releases, post-training, incidents) can trigger event-based reviews that test whether controls were implemented and risk decisions were appropriate. Claims that depend on the fastest-changing surfaces (e.g., API behavior, configuration, drift, user behavior patterns) require continuous automated monitoring that logs changes and triggers alerts when systems deviate from certified parameters. Of course, other considerations beyond the velocity of change, such as the impact on risk, uncertainties, and assumptions, as well as the assurance level should likely also determine the required audit cadence for a given claim.

Within ongoing audit engagements, the lead auditor should be responsible for synthesizing evidence across domains and cadences relevant to the evaluated safety and security claim(s), and coordinating specialist subcontractors, where used, to ensure sufficient coverage for the targeted assurance level and audit’s scope (Section 5.2).
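The cadence matching described above can be sketched as a simple lookup from the elements a claim depends on to the assessment cadences it needs. This is an illustrative sketch only: the element groupings follow the examples given in the text and Figure 8, while the names `Cadence`, `ELEMENT_CADENCE`, and `cadences_for_claim` are hypothetical, and a real scheme would also weigh risk impact and assurance level as noted above.

```python
from enum import Enum


class Cadence(Enum):
    PERIODIC = "periodic deep assessment"
    EVENT_TRIGGERED = "event-triggered review"
    CONTINUOUS = "continuous automated monitoring"


# Groupings follow the examples in the text; the dictionary itself is illustrative.
ELEMENT_CADENCE = {
    "governance": Cadence.PERIODIC,
    "safety culture": Cadence.PERIODIC,
    "release gates": Cadence.PERIODIC,
    "incident response": Cadence.PERIODIC,
    "security posture": Cadence.PERIODIC,
    "training runs": Cadence.EVENT_TRIGGERED,
    "model releases": Cadence.EVENT_TRIGGERED,
    "post-training updates": Cadence.EVENT_TRIGGERED,
    "serious incidents": Cadence.EVENT_TRIGGERED,
    "API behavior": Cadence.CONTINUOUS,
    "configuration changes": Cadence.CONTINUOUS,
    "behavioral drift": Cadence.CONTINUOUS,
    "user behavior patterns": Cadence.CONTINUOUS,
}


def cadences_for_claim(elements: list[str]) -> set[Cadence]:
    """Return every cadence needed to keep a claim's conclusion current."""
    unknown = [e for e in elements if e not in ELEMENT_CADENCE]
    if unknown:
        raise ValueError(f"no cadence defined for: {unknown}")
    return {ELEMENT_CADENCE[e] for e in elements}


# A claim resting on release gates, a specific release, and observed drift
# needs evidence from all three cadences.
needed = cadences_for_claim(["release gates", "model releases", "behavioral drift"])
```

Under this framing, a single claim commonly spans several cadences, which is one reason the lead auditor needs to synthesize evidence across them.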
Live certification and deprecation. Assurance should remain valid only while underlying assumptions hold [137, 173], automatically downgrading on material changes via live certification or time-limited validity periods. Live certification requires maintaining current records of supporting assumptions; when material changes occur, certification is flagged or downgraded pending review. Each certification should specify the conditions on which it rests (e.g., the model version, safety configuration, and deployment pathway) along with clear criteria for what constitutes a material change requiring re-evaluation. Companies should proactively flag upcoming changes to auditors, though auditors should avoid over-reactivity. Active change monitoring aligns with companies’ interest in avoiding inadvertent performance degradation. Several assurance regimes already operate on variants of this principle: in information security, certified organizations are required to maintain controls continuously and correct weaknesses when conditions change [174], and in aviation, changes in the organization or its activities that cause it to no longer meet requirements require the organization to seek an amendment to its approval or certificate [175]. However, frontier AI auditing faces distinctive challenges in implementing continuous assurance. AI systems may be updated far more frequently than products in other regulated industries, and changes in behavior may be subtle or emerge gradually. Unlike financial materiality thresholds, there is not yet consensus on what should trigger re-certification (e.g., the magnitude of capability change [176]). Addressing these challenges will require investment in automated monitoring infrastructure and the development of consensus on what constitutes materiality in changes.
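A minimal sketch of the live certification idea follows, assuming a certification records the conditions it rests on and a validity period. The names `Certification`, `Status`, and `current_status` are hypothetical. The sketch treats any change to a recorded condition as material, which is a simplifying assumption; as noted above, there is not yet consensus on what should count as material.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    VALID = "valid"
    FLAGGED = "flagged pending review"
    EXPIRED = "expired"


@dataclass(frozen=True)
class Certification:
    """Conditions the certification rests on, plus a time-limited validity period."""
    conditions: dict  # e.g., model version, safety configuration, deployment pathway
    valid_until: date


def current_status(cert: Certification, observed: dict, today: date) -> Status:
    """Downgrade automatically when the validity period lapses or a condition changes."""
    if today > cert.valid_until:
        return Status.EXPIRED
    # Assumption: every recorded condition is material if it changes.
    changed = [key for key, value in cert.conditions.items() if observed.get(key) != value]
    return Status.FLAGGED if changed else Status.VALID


cert = Certification(
    conditions={"model_version": "v1", "safety_config": "strict", "deployment": "api"},
    valid_until=date(2026, 12, 31),
)
unchanged = current_status(cert, dict(cert.conditions), date(2026, 6, 1))
after_update = current_status(cert, {**cert.conditions, "model_version": "v2"}, date(2026, 6, 1))
after_expiry = current_status(cert, dict(cert.conditions), date(2027, 1, 1))
```

In this sketch a flagged certification is not revoked; it simply stops being reported as valid until an auditor reviews the change.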

5.6 Independent, expert, and well-governed auditors

Frontier AI auditing should be delivered through a mature, professionalized ecosystem of independent, technically proficient auditors subject to credible oversight. Auditors should be free from commercial or political influence; combine deep expertise in AI evaluation, safety, security, and governance with strong confidentiality practices; and provide reliable assurance to AI companies, policymakers, investors, insurers, and the wider public. Independence creates credibility, since auditors’ reputations rest almost entirely on the quality and integrity of their assessments.

Frontier AI auditing should be conducted by independent third parties. These may be non-profit or for-profit organizations. Potential audit providers (or subcontractors thereof) include AI assessment-focused companies and non-profits, law firms, accounting firms, security penetration-testing firms, government entities, and hybrid arrangements that combine these capabilities. Individual auditors should have no direct financial stake in the companies they assess. Audit firms should avoid material revenue dependence on any single client20 and maintain strict boundaries between assurance work and non-assurance work (e.g., consulting) [181, 182]. Such independence requirements are standard in financial auditing under PCAOB rules and in European aviation certification [183], and also feature prominently in the first set of standards for frontier AI evaluation, AEF-1 [169]. Importantly, payment should not depend on audit results [169]. But this alone is not sufficient, since an auditor might still perceive that future work, or continued access, depends on favorable findings. Allowing companies to choose their own auditors has created recurring problems in other industries [184]. As such, it is preferable to address this risk before the sector scales much further by, for example, transitioning toward payment models that avoid direct financial dependence on audited companies. We believe research on possible alternatives should be urgently pursued, and progress toward alternatives should be made by the end of 2026. (See Appendix G for further discussion.)

Transparent conflict-of-interest management. Additional safeguards to manage conflicts of interest are also needed. Mechanisms from established assurance regimes, such as public conflict-of-interest registers, cooling-off periods for personnel moving between auditing organizations and audited companies, restrictions on non-audit services, and disclosure of any financial or advisory ties, should also be implemented in frontier AI auditing [85]. These measures are standard in finance and information security auditing and help reduce the risk of familiarity and self-interest biasing audit results [185]. When auditors are paid directly by clients, public disclosure of audit terms and adherence to industry standards may improve audit quality (see [186]). The AEF-1 standard for AI evaluations requires disclosure of potential conflicts of interest, including financial dependence [169]. Frontier AI auditing may require additional or alternative safeguards given the small pool of technical experts in the space, though any deviations from these best practices should be justified and prominently disclosed. For example, if a long cooling-off period would be prohibitively restrictive, recent frontier AI company employees could instead serve as subcontractors with a very specific technical remit on a larger audit engagement, rather than as lead auditors themselves. Lead auditors should not have direct financial stakes in companies they audit, but more work is needed to specify granular standards for edge cases. Public registers of auditor affiliations and financial stakes could help avoid either the appearance or the reality of conflict.

17 Sometimes companies maintain “internal audit” functions, which have some operational independence from product teams, and these may be referred to as “independent” by the company in question [177]. However, on their own, we consider these to be insufficiently independent to provide credible assurance, and below the threshold for independence in the way we use the term.

18 For related discussion regarding the benefits and limitations of different providers, see [178].

19 Many frontier AI experts hold equity in frontier AI companies. Excluding these experts from the auditing process entirely could leave the pool of qualified auditors too small, while including them without safeguards creates obvious conflicts. To mitigate these conflicts while accommodating the realities of the current AI field, we recommend that lead auditors in particular should not hold any direct equity in the frontier AI company being audited (and should also disclose any indirect conflicts), and that anyone involved in the audit process (not as the lead) should also disclose any equity they hold. See footnote 22 on the distinction between direct and indirect equity holdings.

20 Revenue from a client is “material” when it could reasonably be seen as threatening the auditor’s independence. In other contexts, this has been defined using a quantitative benchmark, typically 10% or 15% of a firm’s annual fees [179, 180]. While setting an analogous benchmark may be appropriate for AI, it is not obvious what that threshold should be. Frontier AI is by definition a limited subset of the AI industry, which may make it difficult to avoid crossing 10% and 15% thresholds in particular, perhaps suggesting the need for a higher threshold. It could also be easier to stay below such a threshold if an audit firm also services non-frontier AI companies or carries out other non-auditing activities.

21 One way in which such a rule might be evaded is if a lead auditor merely served as a “front” for a subcontractor, who does all of the real work. However, this is a possibility regardless of conflict of interest rules, so we do not view it as a decisive objection against such rules. A requirement for lead auditors to sign their findings has been shown to improve results [187] in a financial auditing context, and the “front” scenario is an additional reason why such signatures may also be appropriate in an AI context.

22 Many frontier AI companies are included in publicly-traded stock indices. Even when an auditor holds no “direct” equity in the frontier AI company, they may hold “indirect” equity through owning units of an exchange-traded fund, or another company that in turn holds equity in the frontier AI company. Existing frameworks providing guidance on independence requirements regarding investments (see [188]) could be adapted for the context of frontier AI auditing. Many frontier AI companies remaining private presents another challenge, making it difficult for ex-employees to divest at short notice.

A mature frontier AI assurance system requires a sufficiently large and diverse pool of qualified auditors. In contrast to relying on a small set of repeat auditors, a broad pool of qualified auditors ensures access to the technical and organizational expertise needed for credible assessment, provides an incentive for innovation that improves scalability and security, supports pluralism in methods, and reduces the likelihood of a single point of failure.

Auditors need to have deep expertise. At a minimum, auditors must have deep expertise in AI evaluation, safety, security, and governance. Specific audit functions (e.g., concerning highly specialized knowledge, like CBRN risks) require additional, deep domain expertise. If a single organization lacks the diversity and depth of expertise required, subcontracting should be used, a common scenario in the near to medium term given the breadth of skills required. In such cases, a lead auditor should coordinate and assume responsibility for the whole audit, including the audit’s scope, assessments, conclusions, and reporting.
The ecosystem should enable collaboration across specialist firms, research institutions, and civil society experts so that engagements can “mix and match” complementary skill sets. For example, large professional services firms could partner with specialized AI assessment organizations through subcontracting and “flow-down” agreements, or alternatively by forming a consortium. For information security auditing specifically, there is a more well-established ecosystem of penetration testers and other assessment providers to draw on. It will be critical to ensure that, even if such contracting is done separately from a lead frontier AI auditor, the lead auditor has access to the findings of information security assessments and can make sure that no gaps exist between assessment of “traditional” information security risks (e.g., theft of intellectual property) and AI-specific security risks (e.g., data poisoning for language models). Transparent and standardized terms for auditing contracts. Currently, third-party AI assessments are generally performed under bespoke, negotiated contracts between developers and evaluators that are subject to strict confidentiality. There is significant variation in transparency, with third-party evaluators’ identities sometimes not even disclosed. Standardized terms of engagement can help prevent companies from shopping for favorable auditors and ensure a consistent baseline of auditor access, independence, and reporting obligations. OpenAI recently published excerpts from agreements they use for pre-deployment testing [189], and the AI Evaluator Forum is recommending a baseline set of standards to be used in drafting such agreements [169, 190].24 Future regulation may also standardize the terms of audits that cover regulation requirements, as is the case in other industries (e.g., [163, 191]). Independent oversight and quality assurance mechanisms. 
Having an independent oversight board charged with raising standards in the sector can be valuable in tracking developments and punishing egregious practices (e.g., [192]). For financial auditors of publicly traded companies in the US, this takes the form of the Public Company Accounting Oversight Board (PCAOB), which was created under the Sarbanes–Oxley Act in the wake of accounting-related scandals at Enron and elsewhere. Such a body could serve several functions: developing auditing standards, certifying auditors, examining the quality of audits themselves, and revoking credentials where circumstances warrant. We discuss considerations bearing on the design of a “PCAOB-for-AI” in Section 6. Looking ahead, a mature ecosystem could include a live AI assurance tracker: a public platform maintained by an oversight body showing each company’s stated policies, applicable regulations, lead auditor, and recent audit conclusions, updated as material changes occur.

23 Whether the number of auditors is “sufficient” will depend on factors such as the number of clients, the number of audits, the breadth and depth of each audit, and the number of qualified personnel. When introducing auditing requirements, policymakers should monitor the capacity of the ecosystem, and consider interventions (e.g., training programs or encouraging new entrants) if concentration risks emerge.

24 Details about, for example, the scope of an audit would often not be appropriate to disclose publicly, but procedural details about who, when, why, and how assessments are performed are needed in order to properly and contextually interpret their results and ensure that individual engagements are not performed in an ad hoc way.

Figure 9: Standard frontier AI auditing workflow.

[Figure 9 content, four stages in sequence.
Terms and contract: define and negotiate terms, including scope, access, objectives, assurance level, and publication; consider preregistration of any evaluation protocols; issue the engagement letter.
Gather evidence: receive physical/digital access (e.g., to a custom API for external evaluations, a document review room, or a company laptop); evaluate systems, conduct interviews, examine documents; request additional information and access as needed (e.g., changes to the custom API, interviews with staff, documentation of internal processes).
Analysis and interpretation: aggregate all evidence; analyze systems, documents, and processes; rate the quantity of evidence; rate the quality of evidence; integrate evidence from periodic assessments, event-triggered reviews, and continuous monitoring.
Decision-making and report: form judgments on company claims, risk, and assurance level; write a preliminary report with a conclusion and conditions; share with company for review and corrections; finalize report, publish the audit report or summary, and submit it to any oversight bodies.]

5.7 Rigorous, traceable, and adaptive processes

A mature frontier AI auditing ecosystem depends on audits following a rigorous, reproducible process that produces reliable, traceable, and defensible evidence appropriate for the given assurance level. This requires adaptivity to balance competing demands: consistency and comparability across engagements, scalability as audit volume and system complexity increase, methodological flexibility to keep pace with rapidly evolving technology, and procedural fairness to audited companies without compromising independence. Standardized criteria. Each frontier AI audit should apply predefined criteria describing what counts as sufficient evidence of safety, security, and sound risk management. These criteria may draw on regulatory frameworks, international standards, industry best practices, and companies’ own commitments, supplemented by expert-developed criteria for emerging risk areas where existing frameworks provide insufficient guidance. Currently, there is no universally accepted or sufficiently granular set of auditing criteria for frontier AI systems. Developing, testing, and refining these criteria will be an important task for companies, auditors, regulators, and standards bodies. Effective criteria should be comparable across engagements, flexible enough to accommodate evolving technology, grounded in technical validity through rigorous measurement design, supported by stakeholder processes that confer legitimacy across jurisdictions, and designed to minimize gaming while aspiring to genuine safety rather than bare minimum compliance. Standardized auditing process. Audits should follow a clear and consistent process model that supports rigor and reproducibility. Frontier AI audits involve many moving parts, and inconsistent methods risk producing variable or unreliable results. A standardized process sets out expected steps for scoping, access, evidence gathering, analysis, verification, continuous monitoring, and reporting. 
(See Figure 9 for a potential workflow for a frontier AI audit that builds on a standard auditing workflow.) By using these steps, auditors can consistently provide high-quality audits. If auditing methods are consistent, auditors can be overseen more effectively, and different companies’ results can be compared. In addition, wherever feasible and appropriate, audit procedures should make use of auditable automated processes to enable consistent application of methods across engagements, continuous monitoring as systems evolve, and oversight of automated methods. Importantly, a commitment to systematic auditing processes does not diminish the value of flexible, minimally structured red-teaming efforts, which remain essential for discovering novel vulnerabilities and failure modes that more structured approaches might miss. Auditor flexibility and autonomy. At the same time, standardization should not lock in immature practices or constrain the methodological autonomy that makes independent evaluation meaningful. Auditors must have freedom in deciding their methods — including defining metrics, determining how to elicit target properties, and establishing criteria for success or failure — rather than being constrained to validate the company’s pre-selected approaches [169].25 Auditors should also retain flexibility to adjust the scope of their inquiry as the evaluation proceeds, since important issues may only become apparent during the course of an assessment. Narrow scopes can allow organizations to meet commitments in form but not in substance, leaving critical risks unexamined. Reproducibility of results. Audit processes and methods should be documented sufficiently that another auditor or oversight body could reproduce the approach and verify the results [77]. This includes recording the specific procedures used, the evidence gathered, the criteria applied, and the reasoning behind key judgments. 
Reproducibility supports quality assurance, enables oversight bodies to inspect audit work, and builds confidence that findings reflect genuine properties of the system rather than idiosyncrasies of a particular engagement. For each audit, a contract should define the scope, objectives, and responsibilities of an auditor, and pre-registration of evaluation protocols can further strengthen reproducibility and protect against bias. Pre-registration is an accepted best practice in the sciences [193, 194]. By specifying the questions to be answered, the metrics to be used, and the criteria for success before results are known, pre-registration ensures that audit conclusions reflect genuinely confirmatory tests rather than post hoc rationalizations shaped by preliminary findings [195]. Evaluation validity. Audits can only provide meaningful assurance if the underlying evaluations are methodologically sound and protected from compromise. Auditors should ensure that evaluations measure what they claim to measure (construct validity) and reflect real-world deployment conditions rather than artificial test settings (ecological validity) [73, 74, 77, 107, 196, 197, 198, 199, 200, 201, 202, 203]. Elicitation methods should be sufficient to surface a model’s true capabilities [204, 205], test coverage should span a representative range of inputs derived from the real-world [206] and failure modes (e.g., [207]), and mitigations should be tested under strong adversarial pressure rather than only benign conditions or under weak attack [208, 209]. Auditors should also verify that the evaluated system matches what is actually deployed, including its configuration and context of use. 
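One way the pre-registration practice described above could be made verifiable is a hash commitment: the auditor fixes the protocol, publishes or lodges its digest before running evaluations, and later shows that the protocol used matches. This is a sketch under stated assumptions, not a mechanism the cited sources prescribe. The function names and the protocol fields are hypothetical; `hashlib.sha256` and `json.dumps` are Python standard library calls.

```python
import hashlib
import json


def protocol_digest(protocol: dict) -> str:
    """Return a SHA-256 digest of a canonical JSON encoding of the protocol."""
    # sort_keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def matches_preregistration(protocol: dict, registered_digest: str) -> bool:
    """Check that the protocol actually used is the one registered in advance."""
    return protocol_digest(protocol) == registered_digest


# Hypothetical protocol fields: the questions, metrics, and success criteria
# are fixed before any results are known.
protocol = {
    "question": "Does the mitigation hold under adversarial prompting?",
    "metric": "attack success rate",
    "success_criterion": "below the agreed threshold",
}
registered = protocol_digest(protocol)

altered = {**protocol, "success_criterion": "revised after seeing results"}
still_matches = matches_preregistration(protocol, registered)
altered_matches = matches_preregistration(altered, registered)
```

A digest only shows that the protocol text did not change; whether deviations from it were justified remains a matter of auditor judgment and documentation.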
Auditors and companies should cooperate to identify and address system behaviors that may undermine evaluation validity, including overfitting to known evaluation sets, reward hacking, sandbagging (i.e., deliberately underperforming to avoid triggering safety thresholds), and other forms of gaming [210]. Companies should commit not to view, retain, or train on evaluation inputs and outputs without explicit auditor consent, ensuring that private evaluation sets remain effective over time and cannot be gamed. Auditors should consider adding canary data to evaluation sets to enable later detection of whether models have been trained on evaluation materials. Furthermore, in sensitive domains like biosecurity, even the evaluation sets and answer keys themselves may contain information that should not be widely published, adding another layer of confidentiality requirements beyond protecting evaluations from the auditee.

25 Of course, this autonomy should accompany adherence to standardized best practices, transparency to oversight bodies about methods, and independent oversight of auditors themselves (see Section 6).

Procedural fairness and dispute resolution. Audits should incorporate procedural safeguards that ensure fairness to the audited company while preserving the independence and integrity of audit conclusions. Companies should have structured opportunities to provide input at defined points in the audit process, for example, to correct factual errors in draft findings, provide additional context or evidence that auditors may have missed, and respond to preliminary conclusions before they are finalized. However, these opportunities must be carefully controlled to prevent companies from unduly influencing audit outcomes. Company responses should be documented and, where appropriate, included in the final audit record so that readers can understand what objections were raised and how auditors addressed them.
When disagreements cannot be resolved through the standard review process, they should be addressed through structured escalation pathways that do not significantly delay or obstruct the overall audit, such as arbitration.
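Returning to the canary data suggestion in the evaluation-validity discussion above, a minimal sketch follows. It assumes a canary is a unique random string embedded in each evaluation item, and that leakage is checked by looking for that string in later model outputs. The `AUDIT-CANARY-` prefix and all function names are hypothetical, and exact-match checking is only one possible detection method: a model could have trained on the material without ever reproducing the string.

```python
import uuid


def make_canary() -> str:
    """Create a unique marker that is vanishingly unlikely to occur by chance."""
    return f"AUDIT-CANARY-{uuid.uuid4()}"


def add_canary(items: list[str], canary: str) -> list[str]:
    """Append the canary to every evaluation item in the private set."""
    return [f"{item}\n{canary}" for item in items]


def canary_leaked(model_outputs: list[str], canary: str) -> bool:
    """Return True if any later model output reproduces the canary verbatim."""
    return any(canary in output for output in model_outputs)


canary = make_canary()
tagged = add_canary(["Evaluation item one.", "Evaluation item two."], canary)

# A verbatim reproduction suggests the set entered training data;
# absence of a match is weak evidence and does not rule it out.
leak_detected = canary_leaked([f"some text {canary} more text"], canary)
no_leak = canary_leaked(["an unrelated completion"], canary)
```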

5.8 Clear communication of audit results

A mature frontier AI auditing ecosystem requires that stakeholders are able to understand the results of the audit. In other industries, the auditor provides the results in the form of an audit report (see [211]). The report should contain the scope, level of assurance, conclusions, and reasoning [70]. A redacted or summarized version of the report should be shared with external stakeholders. Content of the audit report. The audit report should contain the following:

Scope. The audit report should contain a “Scope” section describing the risk categories (or subsets thereof) that were assessed. The section should also describe any exclusions (i.e., whether certain risks or types of information were excluded) and the rationale (e.g., resource constraints, inapplicability, or denial of access). Where appropriate, the report should describe any limitations of the engagement (see [136]).

Assurance level statement. The audit report should explicitly state the level of assurance at which the auditor provides their conclusion. This allows stakeholders to understand how much confidence they should have in the audit results.

Conclusions. The audit report should clearly state the overall conclusions of the auditor (e.g., “we agree with the company that the risk posed by their models is currently low”). The report should state any reservations the auditor had.

Reasoning. The auditor should describe their reasoning for each conclusion in the audit report. Separately, the auditor should document (where it is possible to do so without compromising the integrity of the evaluation) the evidence and analysis they used to arrive at their conclusion. With the exception of evaluation techniques that are unique to the auditor, it should be possible for other auditors with access to the unredacted report to reproduce key steps of the audit and to arrive at similar conclusions given the same access.

Recommendations. The auditor should describe their recommendations for remediation of any issues that arise.

Documentation. Where it’s possible to do so without compromising the integrity of the audit and confidential information, detailed documentation of the auditor’s analysis and findings should be provided.

Sharing results appropriately. Different stakeholders likely need different levels of detail about the results of the audit. The company’s board and relevant executives, and in some cases relevant regulatory bodies, should receive the full report, as is standard in other industries [212, 213]. Other employees and relevant government bodies could receive a lightly redacted version of the report. The company could publish a summary of the audit report, along with an attestation from the auditor that the summary is fair. This approach can protect sensitive details about the company or their systems. External stakeholders should have access to a public standardized summary of scope, assurance level, and conclusions. Disagreements about appropriate redaction should be addressed through arbitration or, when established (see Section 6.1), a relevant standard-setting body.

6 Challenges and Next Steps

This section explores four challenges that must be addressed to achieve effective and universal frontier AI auditing:

Ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the AI industry.

Growing the ecosystem of audit providers at a rapid pace without compromising quality.

Accelerating adoption of frontier AI auditing by clarifying and strengthening incentives.

Achieving technical readiness for high AI Assurance Levels so they can be applied when needed.

6.1 Ensuring high quality standards

One central challenge for frontier AI auditing is straightforward to state but difficult to solve: audits must be rigorous enough to provide meaningful assurance to skeptical stakeholders, yet adaptive enough to keep pace with one of the fastest-evolving industries in history. If standards become too rigid, they risk devolving into procedural box-ticking that misses actual risks. If they become too flexible, they invite opinion shopping by auditees and corner-cutting by auditors. Two specific risks illustrate these tensions. First, Goodhart’s Law holds that when metrics become compliance targets, actors optimize for targets rather than underlying outcomes. Second, temporal mismatch occurs when traditional standards update over years while AI capabilities transform within months. Auditing practice must be designed with durable goals but evolving practices. Goals should be general: principles that outline the outcomes audits are supposed to achieve, much as the PCAOB outlines in AS 1000 [148] what financial auditors are supposed to do and general guidelines on how they should do it. Practices should be flexible and carefully tailored to AI capabilities [77]. This mirrors principles-based financial regulation, where private standard setters operate under public oversight with industry input. As in financial auditing, quality control can benefit from an independent oversight body that sets standards, inspects auditors, investigates failures, and enforces norms. This oversight body would be responsible for “auditing the auditors”; specifically, for ensuring that auditing services are provided and outcomes are achieved as expected, and for creating accountability mechanisms if an auditor fails to do so, including potentially through the loss of its auditing credential. Oversight is a critical element of building a robust, accountable auditing ecosystem. 
An effective auditing regime and independent oversight body with enforcement power may take time to build given the challenge of passing relevant laws in key jurisdictions. In the near-term, soft pressure to comply with standards, imposed by AI companies, their customers, or other third parties like investors and insurers, can serve a stopgap function before formal requirements are in place. Companies seeking trust should respond to this pressure by publishing verifiable, scoped claims assessed by credible auditors.

Recommendation 1: AI companies, philanthropists, investors, and insurers should fund analysis of the quantity and quality of audits and auditors, and make these assessments available to the public. These stakeholders should particularly support and invest in independent watchdog efforts evaluating companies’ public claims, with rubrics for:

The total amount of audit capacity in the ecosystem according to various metrics, and its rate of growth over time;

Whether claims made in audit reports and self-assessments by companies include sufficient methodological detail to be reproduced by those with similar access (e.g., [79, 214]) and to be mapped onto relevant frameworks such as the AI Assurance Level (AAL) framework;

Whether assessments meet the AEF-1 standard (evaluator identity, scope, access limits, publication constraints) [169]; and

Audit quality metrics, such as inter-rater reliability (i.e., do auditors come to similar conclusions given the same evidence).

Investments in research and development, decisions by insurers and procurers, and regulatory mandates can then be informed by realistic estimates of future auditing supply at a given quality. A balance is needed between driving growth in the market over time through demand pressure, on the one hand, and, on the other, asking the impossible, which encourages corner-cutting. Regular public reports on ecosystem health (at least quarterly) can help to create soft pressure for improvement by auditors and the companies working with them, but ultimately this pressure will reach its limits and more formal incentives will be needed.

Recommendation 2: Policymakers should implement a PCAOB-style non-profit “auditor of auditors” that has legitimacy through final government approval of its standards, the authority to hold auditors accountable through revoking accreditation or other means, and the ability to innovate at the pace of the private sector. An audit-quality oversight institution would set baseline standards for frontier AI auditors, evaluate compliance, track market evolution over time and publish relevant analysis, accredit/revoke credentials, and fine auditors or refer cases for enforcement in the event of serious violations of auditing standards. Lessons can be drawn from a range of hybrid organizations that combine some of the benefits of the private sector (such as greater-than-government salaries, which are critical for competing for talent while overseeing a lucrative industry) and the public sector (namely enforcement power and democratic oversight). We emphasize the PCAOB (the Public Company Accounting Oversight Board) since it specifically governs financial audit quality, but precedents for government-authorized but private standard-setting bodies exist across critical industries.
FINRA (in securities), NERC (for the US electrical grid), and other institutions provide a wealth of lessons on how government and private entities can collaborate on standard-setting while maintaining appropriate accountability [215]. Policy analysis on the appropriate design for a PCAOB-for-AI should identify hard-to-game audit quality indicators and effective inspection methods (e.g., re-performance of an audit by a separate auditor given equivalent access), and more rigorously specify the AALs that we provided preliminary sketches of in this paper. To reduce capture risk by a particular government, the AI auditing ecosystem should be designed to be legitimate in multiple jurisdictions. The oversight body should have diverse funding sources as well as transparent rules for conflicts of interest. The body should aim for its standards to be globally credible and endorsed by multiple governments. This could be accomplished by varying standards depending on the jurisdiction (see Section 6.3), and giving multiple governments the power to appoint and remove board members of the body. To reduce capture risk from the AI industry or the auditing industry itself, a PCAOB-for-AI should have significant representation from outside both sectors on its board.
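As an illustration of the inter-rater reliability metric and the re-performance inspections discussed in this subsection, the following minimal Python sketch computes Cohen's kappa, a standard chance-corrected agreement statistic, for two auditors who independently grade the same findings. The choice of Cohen's kappa, the function name `cohens_kappa`, and the pass/fail labels are assumptions made here for illustration; the paper does not specify a statistic, and an oversight body might prefer one that handles more than two raters or ordinal grades.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: two auditors grade the same eight control checks.
auditor_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
auditor_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(auditor_a, auditor_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value near 1 indicates that the auditors reach similar conclusions from the same evidence, while a value near 0 indicates agreement no better than chance. What threshold counts as acceptable would be a policy choice for the oversight body rather than something this sketch can settle.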

6.2 Growing the ecosystem

Growing the ecosystem requires pulling together expertise from a wide range of disciplines as well as ensuring that auditors do not face undue legal risks when doing their work. Effective auditing requires auditors with a broad range of expertise across disciplines. Providing that expertise may require utilizing large firms alongside specialized AI assessment organizations and domain experts (e.g., in alignment, bioweapons, or cybersecurity). These multi-organization teams face coordination challenges: preventing accountability dilution [216, 217, 218, 219], ensuring consistent quality and clear communication across different disciplinary and organizational boundaries, and managing conflicts of interest. Addressing talent bottlenecks also requires investment in human capital through on-the-job growth in multi-organizational teams and formal training programs. Accreditation standards — formal, standardized credentials indicating that auditors possess certain knowledge and skills and are familiar with certain professional standards — can establish and then progressively raise the bar for competency and incentivize skill investment across the ecosystem. Recommendation 3: The AI evaluation ecosystem should establish a Frontier AI Auditor Accreditation Program with tiered certifications and specialty endorsements, as well as meaningful accountability mechanisms. The foundational tier of this accreditation would establish baseline competencies in AI systems, audit methodology, and ethics. Specialty endorsements might certify expertise in capability evaluation, alignment, control, information security, cybersecurity, or biosecurity. Organizations could be accredited based on credentialed staff and quality management systems. Nonaccredited individuals should still participate in audits with circumscribed roles (e.g., technical analysis); this allows the ecosystem to develop capabilities and train talent while creating additional accountability for auditors. 
Organization-level accreditation could be required for certain contexts, such as government procurement. A particularly promising application of this accreditation program would be providing supplemental training to academic researchers (including graduate students, professors, and post-docs) who have many of the technical skills needed to conduct audits, and who could benefit from additional income as well as relevant work experience during their studies. While this raises a range of implementation questions (e.g., the revolving-door question discussed previously), the large talent pool in academia is impossible to ignore when thinking about quickly growing a skilled auditing ecosystem. Without active measures to tap this pool of talent, it might be difficult to build a sufficiently large pool of disinterested auditors with sufficient speed. In addition to actively growing the supply of qualified auditors, there also needs to be attention to preventing roadblocks on a scalable ecosystem. One potential roadblock is liability faced by auditors. Auditors — who are typically far less capitalized than either developers or insurers — may face disproportionate liability relative to their fees and their actual contribution, given the massive scale of the AI market. This may cause them to rationally decline engagements involving the highest-risk systems, precisely where independent assessment is most valuable. For example, an auditor might fear a lawsuit from a company contesting good-faith audit results that caused a backlash to that company’s products or a fall in their share price. There could therefore be cases where regulators should provide liability protections to auditors, such as through a legal “safe harbor” [220, 221, 222, 223], though getting these right requires care. 
Terms of service and enforcement strategies used by AI companies to deter model misuse can inadvertently disincentivize good-faith safety evaluations, causing researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal [220, 221, 222, 223]. Safe harbor provisions can address this chilling effect while maintaining accountability for genuinely harmful conduct. In cybersecurity, explicit corporate policies that promote responsible third-party disclosure of vulnerabilities are common practice, and similar norms are needed in AI. Again, we urge caution regarding such changes, but see value in a carefully scoped safe harbor that avoids creating a liability gap in which neither the developer nor the auditor is liable for harms within the scope of the audit. Recommendation 4: Policymakers and developers should implement targeted safe harbors that protect good-faith safety research and auditing while avoiding a liability gap, and that are conditional on auditor compliance with established best practices. Such safe harbors should be designed with several principles in mind:

They should protect good-faith testing conducted under disclosed rules of engagement, drawing on established norms from cybersecurity vulnerability disclosure. Best practices from a body such as the PCAOB-for-AI discussed above could be referenced in the design of a safe harbor.

They should be conditional on researchers and auditors following responsible disclosure practices, refraining from exfiltration or misuse of sensitive information, and avoiding conduct that would independently violate applicable laws.

They should extend to both legal liability (providing protection against causes of action that might otherwise apply to authorized safety testing) and technical enforcement (protecting researchers from account suspensions or access revocation for legitimate assessment activities consistent with best practices).

They should not provide blanket immunity but rather require that protected parties demonstrate adherence to specified conditions — determinations that can be made ex post in the event of a dispute.

Developers can unilaterally approximate these safe harbors through contractual commitments before any legislation is passed: as with coordinated vulnerability disclosure policies in information security, frontier AI companies can provide explicit testing permissions on their website and designated disclosure channels that span a wide spectrum of harms, and can provide non-retaliation commitments for researchers who identify vulnerabilities. Such commitments would help ensure that auditors and independent researchers can conduct rigorous assessments without fearing that their findings — particularly critical findings — will expose them to retaliation.

6.3 Accelerating adoption

Achieving the full promise of frontier AI auditing requires industry-wide adoption (Section 3). This requires both domestic adoption within countries such as the US — ensuring all frontier AI developers, not merely the most safety-conscious, submit to rigorous assessment — and international adoption, including in countries where independent assessment is currently more sparse (the most notable example of a country producing a significant amount of frontier AI systems while lagging on third-party assessment is China). Weak links in the industry can cause incidents that reflect negatively on the industry as a whole. Markets have a key role to play in driving auditing adoption, as discussed in Section 3.2, but markets alone face limitations. Some frontier AI developers do not prioritize deployment to enterprise customers or even external deployment at all (e.g., Safe Superintelligence Inc.), and some may see a short-term market advantage in cutting corners. In a nascent industry where profitability may be years away, it is unclear how much market discipline alone can address safety and security absent binding regulatory requirements. We recommend two steps, one focused on ensuring that insurance can play the same constructive role it historically has played in new technology adoption [224] and the other on creating a regulatory requirement for auditing. First, there is growing concern that the scale of risks from current and future frontier AI systems goes well beyond what was originally envisioned when writing most companies’ insurance policies. This has caused some insurers to begin moving toward excluding AI-related harms from their policies and others to provide AI-focused insurance [47, 48]. However, exclusions are often bottlenecked by government approval, and insurers with particularly strong market power may be tempted to postpone difficult decisions longer than is ideal for the market as a whole.
A similar development occurred with cybersecurity, and ultimately the US government helped drive clarity. Ambiguity on AI risk coverage could forestall the constructive role that insurance could play in driving AI risk mitigation, as it has in many other industries [225]. This role, however, is most plausibly concentrated on insurable operational risks associated with the deployment and use of AI systems rather than on fundamentally uncertain risks arising from frontier model development itself. Greater clarity on what is and isn’t covered under a given policy would accelerate the development of specialized insurance options for AI-related risks where they are appropriate, incentivize general insurers to pay closer attention to AI-related risks, or both, depending on the nature of the determinations made (see Appendix B for further discussion). Recommendation 5: National governments should quickly resolve outstanding and near-term requests from insurers regarding exclusions one way or the other, and in government procurement contexts, they should specify that frontier AI companies need explicit coverage of AI-related risks (whether through a specialized or general policy). If audits do in fact provide valuable information on AI safety and security and there are existing uninsured risks, greater clarity on ownership of AI-related risk will help create market pressure throughout the supply chain: downstream businesses gain financial reason to prefer audited models, which in turn incentivizes frontier AI developers to pursue auditing. Second, governments are unlikely to sit idly while private governance solutions form; the political salience of AI is rapidly increasing [226], and for the reasons discussed above and in both Section 3.1 and Section 3.2, it would be undesirable for them to do so. The question is what exactly this regulatory involvement should entail. 
Regulation can codify and universalize practices adopted by market leaders, while market dynamics can reveal more efficient ways to achieve regulatory goals [227]. Direct mandates for frontier AI auditing hedge against a scenario in which market forces are not sufficiently strong to drive safety and security alone (e.g., due to unpriced negative externalities), though they also could overstep and burden innovators without a commensurate safety and security benefit. Procurement policies represent a potential middle-ground between industry-wide mandates, on the one hand, and government inaction, on the other. For example, governments could require frontier AI auditing — at a specified level of assurance, for a given scope of risks, and by auditors with certain accreditations — before purchasing AI services in high-stakes sectors such as health and defense. This could accelerate the growth of the auditing market as a whole and accelerate the pace of safety and security improvement in the frontier AI sector. Recommendation 6: Policymakers should incorporate frontier AI auditing requirements into procurement processes, with particularly strong requirements for systems that will be deployed in high-stakes domains such as health and defense. However, government procurement policies, no matter how strong, are not likely to fully address risks from frontier AI. Not all frontier AI developers sell services to government agencies, and it is uncertain how much other factors will fill the gap in demand (e.g., insurance, investor due diligence, demand from enterprise customers). Still, procurement policies are a starting point for improving safety and security outcomes involving government use of AI, and could produce key findings that inform further steps in other contexts. 
Beyond procurement, a more comprehensive approach would be to require auditing for frontier AI systems or companies meeting a certain threshold (or, require auditing at different assurance levels corresponding to different thresholds). This could be layered onto existing threshold-based transparency requirements and whistleblower protections, using the same threshold or thresholds. A key question in formulating such a statutory requirement is: what safety and security standards should companies be audited against (besides their own policies and existing regulations)? Notably, there are few substantive standards in existing regulations — companies are generally required simply to have some kind of safety and security policies, which may or may not need to be detailed or meet certain substantive criteria, depending on the jurisdiction. While we don’t recommend a specific path forward here, we note that some have proposed that governments articulate a risk level to stay below in certain domains, and have private sector institutions identify efficient means of achieving those ends [227, 228]. Auditing requirements could then either be directly required for certain classes of AI systems and companies, or strongly incentivized as part of a larger regulatory strategy for AI.26

26 One policy mechanism that has been proposed to improve legal predictability and risk mitigation is statutory liability shields for frontier AI developers who take certain actions (including, e.g., submitting to auditing). There are various arguments for and against such proposals [229, 230, 231], but here we simply note that any such shields should be very carefully scoped to avoid undercutting positive incentives for mitigating avoidable risks. Insofar as there is a case for such shields, it is strongest for knowing misuse by users in cases where such misuse is not plausibly preventable even after the application of best practices. Giving companies “something for nothing” (e.g., a broad liability shield in exchange for submitting to shallow audits by unqualified auditors) would make things worse from almost all stakeholders’ perspectives.

6.4 Achieving technical readiness for high AALs

Three interconnected challenges must be addressed to make AAL-3 and AAL-4 technically feasible, cost-effective, and sufficiently protective of companies’ sensitive information: completeness, continuous monitoring, and the transparency-security trade-off. At high assurance levels, auditors need not only to analyze information presented to them, but also to gain confidence that they are receiving complete information — i.e., that there aren’t material omissions that would change the audit conclusions. Gaining complete-enough information in order to rule out most possible errors and fraud is very difficult. Small, simple changes (e.g., disabling a safety feature) and difficult-to-detect actions (e.g., taking model weights out of a datacenter on a USB stick) can have big consequences. This completeness problem means that it is inherently difficult — if not impossible — to have high or very high confidence in a company’s risk posture based on analysis of a single AI system in isolation. These small-footprint, big-consequence changes and actions could be mistakes on the part of the frontier AI company or intentional efforts to deceive; in either case, though particularly for deception, proving their absence will be challenging. While it seems that we have at least a few months before AI systems are capable of very sophisticated scheming and planning to undermine audit results [232], eventually auditors will need to plan for such possibilities, and will need proportionally stronger auditing techniques that can rule out sophisticated deception by human or machine. Achieving completeness likely involves a combination of “low-tech” mechanisms such as whistleblower bounties, so those with knowledge of material omissions or deceptions have an incentive to come forward [170], and “high-tech” mechanisms such as compute accounting and “proof-of-training” techniques, so that the amount of unaccounted-for compute and models of unknown provenance can be carefully circumscribed.
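The compute-accounting idea at the end of the paragraph above can be made concrete with simple arithmetic: the physical capacity of a cluster caps the total compute it could have produced, and attested training runs are subtracted to bound what remains unexplained. The following Python sketch is a toy illustration under stated assumptions. The names `DeclaredRun`, `max_cluster_flop`, and `unaccounted_flop_bound`, and all numbers, are hypothetical, and real accounting would also need to cover inference, idle time, hardware failures, and off-site hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredRun:
    """A training run the developer attests to (hypothetical record format)."""
    name: str
    flop: float  # total training FLOP claimed, ideally backed by proof-of-training


def max_cluster_flop(num_chips, peak_flop_per_sec, seconds, max_utilization=1.0):
    """Upper bound on the FLOP a cluster could have performed in a time window."""
    return num_chips * peak_flop_per_sec * seconds * max_utilization


def unaccounted_flop_bound(capacity_flop, declared_runs):
    """Upper bound on compute not explained by declared runs."""
    declared = sum(run.flop for run in declared_runs)
    if declared > capacity_flop:
        # Claims exceeding physical capacity signal an inconsistency to investigate.
        raise ValueError("declared compute exceeds physical capacity")
    return capacity_flop - declared


# Hypothetical numbers: 10,000 accelerators at 1e15 peak FLOP/s for 30 days,
# assuming utilization cannot plausibly exceed 50%.
capacity = max_cluster_flop(10_000, 1e15, 30 * 24 * 3600, max_utilization=0.5)
runs = [DeclaredRun("model-a", 1.0e25), DeclaredRun("model-b", 2.0e24)]
bound = unaccounted_flop_bound(capacity, runs)
print(f"capacity: {capacity:.3e} FLOP, unaccounted upper bound: {bound:.3e} FLOP")
```

In this toy case, the unexplained remainder is roughly 9.6e23 FLOP, which limits the size of any undeclared model that could have been trained on the same hardware. The bound is only as trustworthy as the chip count, the utilization ceiling, and the attestations for each declared run.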
A second challenge is the transparency-security trade-off, a concept developed in the literature on arms control [233]. The trade-off is that the very same information that third parties desire in order to have confidence that the audited organization is abiding by their commitments is also often the same type of information that is very sensitive itself or mixed up with sensitive information in complex ways that are difficult to share externally. In cases where countries have needed to cooperate via arms control treaties, they have often had to develop sophisticated technologies to navigate this trade-off [234]. Again, there are low-tech and high-tech ways of addressing this challenge. A low-tech path is to rely on human institutions such as rigorous background checks and personnel screening in order to vet the auditors before giving them access to the most sensitive information, and dividing up the sensitive information among different auditors. A high-tech path is to develop or apply technologies that are specifically intended to address trade-offs like this, such as cryptography, or technologies that are general-purpose in nature but can be applied in a way that alleviates the trade-off (e.g., applying AI itself to summarize or paraphrase information in a way that removes sensitive details [235, 236]). Another approach is to use FlexHEGs or other hardware-enabled governance mechanisms that can keep track of some properties but not others. In each case, use of formal verification to mathematically prove certain properties of the software used, as well as open-source hardware design, could improve confidence on the part of both auditors and auditees in the technology used. 
Lastly, as mentioned in Section 5, early pilots have shown the ability to conduct evaluations of (non-production) language models while assuring all parties that excess information will not be revealed.

27 This dichotomy is intended to convey the basic idea of multiple options for achieving the same goals, though we gloss over many details. For more granular frameworks, and a more detailed discussion of several themes in this subsection, see [53, 54].

A final challenge is continuous monitoring. Even if completeness is established at a single point in time, high AALs require ongoing assurance. Again, small changes can have big consequences, and these can happen nearly instantaneously for some aspects of AI safety and security. AAL-3 and AAL-4 require ongoing monitoring that produces change-detection signals and triggers re-examination when prior validity conditions no longer hold. Elaborating on one theme, cryptographic certificates provide a promising way forward. When technically feasible, developing and deploying AI systems within cryptographic protocols could enable AAL-3 and AAL-4 while guaranteeing privacy protections for all stakeholders. Specifically, zero-knowledge proofs (ZKPs) allow an auditee to prove statements about their private data or private system without revealing any further information: ZKPs provide a certificate that could be made public, avoiding the need for an independent auditor or trusted third party. ZKPs provide secure “white-box” access to a set of pre-specified data or operations while protecting the intellectual property or sensitive data of an auditee; an appropriate cryptographic commitment (roughly, a tamper-evident digital fingerprint of the system that is fixed during development and deployment) by an auditee to their system enables continuous monitoring or change-detection signals that cannot be surreptitiously altered.
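The commitment idea in the preceding sentence can be sketched with a salted hash. The Python example below is a minimal illustration, not a zero-knowledge proof and not a scheme proposed in this paper: it shows only how a published fingerprint lets a later check detect that the committed artifact has changed. The function names `commit` and `verify` and the byte strings standing in for model weights are hypothetical.

```python
import hashlib
import hmac
import secrets


def commit(artifact: bytes, nonce: bytes) -> str:
    """Return a SHA-256 commitment to an artifact, salted with a secret nonce.

    The random nonce keeps the published digest from revealing whether the
    artifact equals some guessed value; the hash binds the auditee to it.
    """
    return hashlib.sha256(nonce + artifact).hexdigest()


def verify(commitment: str, artifact: bytes, nonce: bytes) -> bool:
    """Check that an artifact and nonce reproduce a published commitment."""
    return hmac.compare_digest(commitment, commit(artifact, nonce))


# Hypothetical stand-ins for a serialized model and its deployment config.
weights_v1 = b"model-weights-and-config-v1"
weights_v2 = b"model-weights-and-config-v2"

nonce = secrets.token_bytes(32)          # fixed-length secret kept by the auditee
commitment = commit(weights_v1, nonce)   # published at audit time

print(verify(commitment, weights_v1, nonce))  # system unchanged since the audit
print(verify(commitment, weights_v2, nonce))  # system changed, re-examination needed
```

A scheme of this kind reveals nothing about what changed, and it requires the verifier to see the artifact and nonce at check time. Removing that requirement, so that properties can be checked without disclosure, is where ZKPs or trusted hardware would come in.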
Work is ongoing on ZKP-based certificates for properties such as privacy, fairness, and uncertainty calibration. More research is needed to achieve sufficient computational efficiency to enable practical use at scale. Low-tech pathways for addressing continuous monitoring include embedded auditors who have a similar level of access to information as normal company employees do, or even enriched access relative to normal employees (this approach is used for regulation of systemically important financial institutions as well as nuclear power plant safety in the US). High-tech pathways include data diodes [237] that could continuously emit a small amount of information, negotiated in advance and perhaps formally verified, without also being capable of sending malicious commands back into the server. Through such means, auditors could have confidence that — for example — a certain datacenter is still being used for running an audited model rather than training of unaudited models, without the company putting any additional confidential information at risk. Recommendation 7: Philanthropists, governments, and frontier AI companies should invest in an ambitious “Auditability R&D and Pilots” portfolio aimed at making AAL-3 and AAL-4 technically feasible and cost-effective. Priority areas include:

Confidential evaluations at scale: As mentioned in Section 5, the PySyft framework has been used — in combination with secure enclaves — to conduct mutually confidential evaluations of a small language model. But we need significant efficiency improvements for techniques like this to be viable for frontier AI models and systems.

Proof-of-training and proof-of-learning: Confidence in how large amounts of computing power were used can help rule out (or put limits on the possible size of) unaudited systems produced by a given set of computing hardware. For evaluations, proof-of-training and proof-of-learning, ZKPs are a promising approach.

Change-detection infrastructure: Monitoring technologies (including open-source data diodes, FlexHEGs, and other hardware-enabled governance mechanisms) can produce auditable signals when systems change in ways that could invalidate prior assessments, while ensuring that additional information will not be transmitted.

Adversarial testing of verification mechanisms: Independent teams attempting to spoof attestations, bypass monitoring, or create shadow systems can ensure that auditability infrastructure actually works.

Model and system fingerprinting techniques: Emerging techniques may make it possible to detect significant changes to model weights through black-box interfaces, but their limits need to be more rigorously understood [238]. Additionally, there has been little research on whether other types of changes could be reliably detected (e.g., changes to system prompts, new inference optimizations or classifiers for inputs and outputs).

Formal verification fundamentals and applications: Research advancing AI systems’ ability to assist with engineering system design, formal specifications, and theorem proving can accelerate development of high assurance audit infrastructure. In addition to improving general techniques, specific, immediately usable applications of formal methods should also be pursued (e.g., to give confidence in an application of privacy-preserving AI for analysis of sensitive documents).

In parallel with building strong technical foundations, there is a need to learn from experience. Given substantial uncertainty about how quickly AI and its risks will evolve, pilots for AAL-3 and later AAL-4 should begin with urgency. Recommendation 8: Companies closest to the state-of-the-art should work with auditors, researchers, governments, and other stakeholders to conduct early pilots of AAL-3 and later AAL-4 auditing in order to accelerate the maturity of relevant technologies and processes. These pilots should:

Test a range of “low-tech” procedural approaches that can support AAL-3 using methods available today, although these will likely not be scalable in all respects (e.g., requiring intensive vetting of the personnel involved).

Trial emerging “high-tech” mechanisms (such as proof-of-training, on-server use of privacy-preserving AI methods, and change-detection for internal deployment) in realistic settings to identify gaps between current capabilities and frontier-scale requirements.

Document what works and what doesn’t, contributing to public knowledge about minimum access requirements, trust assumptions, and practical obstacles.

Since the companies closest to the state-of-the-art will create AI systems that pose the greatest risks before others, it is appropriate that they should be first movers in this area, and show how their technology can be trusted with confidence by skeptical parties. Early experimentation will ensure that AAL-3 and AAL-4 are ready when they are needed most, including in the most difficult cases, such as US–China cooperation on baseline safety and security norms [239].

7 Conclusion

Today, no mechanism exists to confidently confirm that AI companies’ safety and security claims are accurate or that their practices meet relevant standards, forcing a difficult choice between taking all companies at their word and relying on a combination of unreliable signals (third-party assessments based on very limited information, the apparent trustworthiness of senior leadership in the company, etc.). Frontier AI auditing provides an alternative: rigorous, independent scrutiny of technical systems and organizational practices by qualified third parties. We presented a vision for what effective frontier AI auditing requires: comprehensive scope covering intentional misuse, unintended system behavior, information security, and emergent social phenomena; an organizational perspective that assesses companies holistically rather than focusing narrowly on specific models; calibrated assurance levels that clearly communicate warranted confidence; deep access to non-public information combined with rigorous security measures; continuous, rather than one-off, verification that updates automatically as systems change; auditor independence enforced by disclosure, industry standards, and oversight of auditors; rigorous, traceable, and adaptive audit processes; and clear communication of audit results.
These principles draw on more established domains where societies have repeatedly built independent assurance mechanisms for high-stakes activities. Substantial challenges remain: ensuring high quality standards, growing the ecosystem, accelerating adoption, and achieving technical readiness for the higher assurance levels. But these challenges are neither unprecedented nor insurmountable. This paper has several important limitations:

We focus primarily on frontier AI developers and closed-weight models. The auditing challenges for open-weight models, fine-tuning providers, and downstream deployers differ in ways we do not directly address.

We only make interim, near-term recommendations on the appropriate levels of assurance (AAL-1 for frontier AI as a whole, and AAL-2 for the leading subset thereof). Judgments about how to proceed after that will involve various complex considerations that will be informed by the further research and pilots we propose, and we propose institutions for helping to make and implement such decisions.

Many of our recommendations implicitly depend on or at least strongly benefit from institutions other than auditing — e.g., robust whistleblower protections at AI companies, detailed transparency requirements for frontier AI companies (so that they are making significant claims that merit verification in the first place), and clear allocation of liability for AI-related harms. These are important but can be pursued in parallel to frontier AI auditing.

Much of our analysis implicitly assumes a certain context (namely, developed, largely Western countries), and scaling frontier AI auditing to countries such as China raises various challenges. We discussed technical challenges in achieving high degrees of assurance, but additional cultural and political challenges were not addressed.

As frontier AI systems grow more capable — possibly at an accelerating rate — the cost of getting safety and security wrong rises sharply. The time to invest in frontier AI auditing is today.

References

[1] Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, Jessica Newman, Kwan Yee Ng, Chinasa T. Okolo, Deborah Raji, Girish Sastry, Elizabeth Seger, Theodora Skeadas, Tobin South, Emma Strubell, Florian Tramèr, Lucia Velasco, Nicole Wheeler, Daron Acemoglu, Olubayo Adekanmbi, David Dalrymple, Thomas G. Dietterich, Edward W. Felten, Pascale Fung, Pierre-Olivier Gourinchas, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Andreas Krause, Susan Leavy, Percy Liang, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Alice Oh, Gopal Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Dawn Song, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang, Olubunmi Ajala, Fahad Albalawi, Marwan Alserkal, Guillaume Avrin, Christian Busch, André Carlos Ponce de Leon Ferreira de Carvalho, Bronwyn Fox, Amandeep Singh Gill, Ahmet Halit Hatip, Juha Heikkilä, Chris Johnson, Gill Jolly, Ziv Katzir, Saif M. Khan, Hiroaki Kitano, Antonio Krüger, Kyoung Mu Lee, Dominic Vincent Ligot, José Ramón López Portillo, Oleksii Molchanovskyi, Andrea Monti, Nusu Mwamanzi, Mona Nemer, Nuria Oliver, Raquel Pezoa Rivera, Balaraman Ravindran, Hammam Riza, Crystal Rugege, Ciarán Seoighe, Jerry Sheehan, Haroon Sheikh, Denise Wong, and Yi Zeng, “International AI safety report 2025,” AI Security Institute, DSIT 2025/001, Dec. 7, 2025. [Online]. 
Available: https://www.gov.uk/government/publications/international-ai-safety-report-2025

[2] Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, Toby Walsh, Armin Hamrah, Lapo Santarlasci, Julia Betts Lotufo, Alexandra Rome, Andrew Shi, and Sukrut Oak, “Artificial intelligence index report 2025,” 2025, arXiv:2504.07139. [Online]. Available: https://arxiv.org/abs/2504.07139

[3] Alexander Wang, Kevin Klyman, Sayash Kapoor, Nestor Maslej, Shayne Longpre, Betty Xiong, Percy Liang, and Rishi Bommasani, “The 2025 foundation model transparency index,” Center for Research on Foundation Models, 2025. [Online]. Available: https://crfm.stanford.edu/fmti/December-2025/paper.pdf

[4] Anka Reuel, Avijit Ghosh, Jenny Chim, Andrew Tran, Yanan Long, Jennifer Mickel, Usman Gohar, Srishti Yadav, Pawan Sasanka Ammanamanchi, Mowafak Allaham, Hossein A. Rahmani, Mubashara Akhtar, Felix Friedrich, Robert Scholz, Michael Alexander Riegler, Jan Batzner, Eliya Habba, Arushi Saxena, Anastassia Kornilova, Kevin Wei, Prajna Soni, Yohan Mathew, Kevin Klyman, Jeba Sania, Subramanyam Sahoo, Olivia Beyer Bruvik, Pouya Sadeghi, Sujata Goswami, Angelina Wang, Yacine Jernite, Zeerak Talat, Stella Biderman, Mykel Kochenderfer, Sanmi Koyejo, and Irene Solaiman, “Who evaluates AI’s social impacts? mapping coverage and gaps in first and third party evaluations,” 2025, arXiv:2511.05613. [Online]. Available: https://arxiv.org/abs/2511.05613

[5] Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell, “Black-box access is insufficient for rigorous AI audits,” in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, ser. FAccT ’24, Association for Computing Machinery, Jun. 5, 2024, pp. 2254–2272, ISBN: 979-8-4007-0450-5. DOI: 10.1145/3630106.3659037 Accessed: Dec. 21, 2025.

[6] OpenAI, “OpenAI red teaming network,” Sep. 19, 2023. Accessed: Jan. 10, 2026. [Online]. Available: https://openai.com/index/red-teaming-network/

[7] Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, David Bau, Paul Bricman, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel J. Kochenderfer, and Robert Trager, “Open problems in technical AI governance,” 2025, arXiv:2407.14981. DOI: 10.48550/arXiv.2407.14981

[8] David Hodgkinson, “IOSA: The revolution in airline safety audits,” Air and Space Law, vol. 30, no. 4, pp. 302–329, 2005. DOI: 10.54648/aila2005023

[9] Sjaanie Koppel, Judith Charlton, Brian Fildes, and Michael Fitzharris, “How important is vehicle safety in the new vehicle purchase process?” Accident Analysis & Prevention, vol. 40, no. 3, pp. 994–1004, 2008. DOI: 10.1016/j.aap.2007.11.006

[10] Amit Kheradia and Keith Warriner, “Understanding the Food Safety Modernization Act and the role of quality practitioners in the management of food safety and quality systems,” TQM Journal, vol. 25, no. 4, pp. 347–370, 2013. DOI: 10.1108/17542731311314854

[11] Jeff Johnson and C&EN Washington, “Process safety since Bhopal,” Chemical & Engineering News, 2005. [Online]. Available: https://cen.acs.org/articles/83/i4/PROCESS-SAFETY-SINCE-BHOPAL.html

[12] Richard L. Baker, William E. Bealing Jr., Donald A. Nelson, and A. Blair Staley, “An institutional perspective of the Sarbanes–Oxley act,” Managerial Auditing Journal, vol. 21, no. 1, pp. 23–33, 2006. DOI: 10.1108/02686900610634739

[13] Ian Sutton, “SEMS after the audits,” paper presented at the SPE International Conference on Health, Safety, and Environment, 2014. DOI: 10.2118/168515-MS

[14] Wendell Wallach, Anka Reuel, and Anja Kaspersen, “Soft law functions in the international governance of AI,” Center for Law, Science & Innovation, Arizona State University, 2023. [Online]. Available: https://lsi.asulaw.org/softlaw/wp-content/uploads/sites/7/2023/12/Wallach-et-al_Soft-Law-Functions-in-the-International-Governance-of-AI.pdf

[15] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernian, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, and Jack Clark, “Predictability and surprise in large generative models,” in 2022 ACM Conference on Fairness, Accountability and Transparency, ser. FAccT ’22, Association for Computing Machinery, Jun. 2022, pp. 1747–1764. DOI: 10.1145/3531146.3533229

[16] A.
Feder Cooper, Katherine Lee, James Grimmelmann, Daphne Ippolito, Christopher CallisonBurch, Christopher A. Choquette-Choo, Niloofar Mireshghallah, Miles Brundage, David Mimno, Madiha Zahrah Choksi, Jack M. Balkin, Nicholas Carlini, Christopher De Sa, Jonathan Frankle, Deep Ganguli, Bryant Gipson, Andres Guadamuz, Swee Leng Harris, Abigail Z. Jacobs, Elizabeth Joh, Gautam Kamath, Mark Lemley, Cass Matthews, Christine McLeavey, Corynne McSherry, Milad Nasr, Paul Ohm, Adam Roberts, Tom Rubin, Pamela Samuelson, Ludwig Schubert, Kristen Vaccaro, Luis Villa, Felix Wu, and Elana Zeide, “Report of the 1st Workshop on Generative AI and Law,” arXiv preprint arXiv:2311.06477, 2023. [17] Rishi Bommasani, Sayash Kapoor, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Daniel Zhang, Marietje Schaake, Daniel E. Ho, Arvind Narayanan, and Percy Liang, “Considerations for governing open foundation models,” Science, vol. 386, no. 6718, pp. 151–153, Oct. 11, 2024. DOI: 10.1126/science.adp1848 Accessed: Jan. 11, 2026. [Online]. Available: https://www.science.org/doi/full/10.1126/science.adp1848 [18] Elizabeth Seger, “Open horizons: Exploring nuanced technical and policy approaches to openness in AI,” Demos, Aug. 28, 2024. Accessed: Jan. 11, 2026. [Online]. Available: https://apo.o rg.au/node/328262 [19] Camille François, Ludovic Péran, Ayah Bdeir, Nouha Dziri, Will Hawkins, Yacine Jernite, Sayash Kapoor, Juliet Shen, Heidy Khlaaf, Kevin Klyman, Nik Marda, Marie Pellat, Deb Raji, Divya Siddarth, Aviya Skowron, Joseph Spisak, Madhulika Srikumar, Victor Storchan, Audrey Tang, and Jen Weedon, “A different approach to AI safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety,” 2025, arXiv:2506.22183, Available: DOI: 10.48550/arXiv.2506.22183 [20] Madhulika Srikumar, Jiyoo Chang, and Kasia Chmielinski, “Risk mitigation strategies for the open foundation model value chain,” Partnership on AI, Jul. 11, 2024. Accessed: Jan. 11, 2026. [Online]. 
Available: https://partnershiponai.org/resource/risk-mitiga tion-strategies-for-the-open-foundation-model-value-chain/ [21] Stephen Casper, Kyle O’Brien, Shayne Longpre, Elizabeth Seger, Kevin Klyman, Rishi Bommasani, Aniruddha Nrusimha, Ilia Shumailov, Sören Mindermann, Steven Basart, Frank Rudzicz, Kellin Pelrine, Avijit Ghosh, Andrew Strait, Robert Kirk, Dan Hendrycks, Peter Henderson, Zico Kolter, Geoffrey Irving, Yarin Gal, Yoshua Bengio, and Dylan Hadfield-Menell, “Open technical problems in open-weight AI model risk management,” StevenCasper.com, 2025. [Online]. Available: htt ps://stephencasper.com/open-technical-problems-in-open-weigh t-ai-model-risk-management/ [22] Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Peter Cihon, Aspen Hopkins, Kevin Bankston, Stella Biderman, Miranda Bogen, Rumman Chowdhury, Alex Engler, Peter Henderson, Yacine Jernite, Seth Lazar, Stefano Maffulli, Alondra Nelson, Joelle Pineau, Aviya Skowron, Dawn Song, Victor Storchan, Daniel Zhang, Daniel E. Ho, Percy Liang, and Arvind Narayanan, “On the societal impact of open foundation models,” 2024, arXiv: 2403.07918, Available: DOI: 10.48550/arXiv.2403.07918 [23] OpenAI, “Better language models and their implications,” Feb. 14, 2019. Accessed: Jan. 11, 2026. [Online]. Available: https://openai.com/index/better-language-models/ [24] Dean W. Ball and Daniel Kokotajlo, “Four ways to advance transparency in frontier AI development,” TIME, 2024. Accessed: Jan. 10, 2026. [Online]. Available: https://time.com/co llections/time100-voices/7086285/ai-transparency-measures/ [25] Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, Sadegh AlMahdi Kazemi Zarkouei, Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, and Heather Frase, “Risk management for mitigating benchmark failure modes: BenchRisk,” 2025, arXiv: 2510.21460, Available: [Online]. 
Available: https://arxiv.org/abs/2510.21460 [26] Charlotte Stix, Matteo Pistillo, Girish Sastry, Marius Hobbhahn, Alejandro Ortega, Mikita Balesni, Annika Hallensleben, Nix Goldowsky-Dill, and Lee Sharkey, “AI behind closed doors: A primer on the governance of internal deployment,” 2025, arXiv: 2504.12170, Available: DOI: 10.485 50/arXiv.2504.12170 [27] Anthropic, “Claude Gov models for U.S. national security customers,” Jun. 5, 2025. [Online]. Available: https://www.anthropic.com/news/claude-gov-models-for-u -s-national-security-customers [28] OpenAI, “Strengthening cyber resilience as AI capabilities advance,” Dec. 10, 2025. [Online]. Available: https://openai.com/index/strengthening-cyber-resilienc e/ [29] Anthropic, “Disrupting the first reported AI-orchestrated cyber espionage campaign,” Nov. 13, 2025. [Online]. Available: https://www.anthropic.com/news/disruptingAI-espionage [30] OpenAI, “Disrupting malicious uses of AI (October 2025),” Oct. 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://cdn.openai.com/threat-intelligence-repor ts/7d662b68-952f-4dfd-a2f2-fe55b041cc4a/disrupting-malicioususes-of-ai-october-2025.pdf [31] Google Threat Intelligence Group, “GTIG AI threat tracker: Advances in threat actor usage of AI tools,” Nov. 6, 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://cloud.goog le.com/blog/topics/threat-intelligence/threat-actor-usage-ofai-tools [32] Anthropic, “AI safety level 3 deployment safeguards report,” 2025. [Online]. Available: https: //www-cdn.anthropic.com/dc4cb293c77da3ca5e3398bdeef75ee17b42b 73f.pdf [33] Google Gemini, “Gemini 2.5 Deep Think model card,” Jul. 1, 2025. [Online]. Available: https: //storage.googleapis.com/deepmind-media/Model-Cards/Gemini-25-Deep-Think-Model-Card.pdf [34] Apollo Research, “More capable models are better at in-context scheming,” Jun. 19, 2025. Accessed: Dec. 21, 2025. [Online]. 
Available: https://www.apolloresearch.ai/bl og/more-capable-models-are-better-at-in-context-scheming/ [35] Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O’Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager, and Kevin Wolf, “Frontier AI regulation: Managing emerging risks to public safety,” 2023, arXiv: 2307.03718, Available: [Online]. Available: https://arxiv.org/abs/2307.03718 [36] James Reason, A Life in Error: From Little Slips to Big Disasters. Taylor & Francis, 2013, ISBN: 978-1-4724-1841-8. DOI: 10.1201/9781315263830 [37] Inioluwa Deborah Raji and Joy Buolamwini, “Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products,” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, ser. AIES ’19, Honolulu, HI, USA: Association for Computing Machinery, 2019, pp. 429–435. DOI: 10.1145/3306618.3314244 [Online]. Available: https://doi.org/10.1145/3306618.3314244 [38] AISI, “Pre-deployment evaluation of OpenAI’s o1 model,” AI Security Institute, Dec. 18, 2024. Accessed: Jan. 11, 2026. [Online]. Available: https://www.aisi.gov.uk/blog/pre -deployment-evaluation-of-openais-o1-model [39] Anthropic, “Strengthening our safeguards through collaboration with US CAISI and UK AISI,” Sep. 12, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://www.anthropic .com/news/strengthening-our-safeguards-through-collaborationwith-us-caisi-and-uk-aisi [40] OpenAI, “Working with US CAISI and UK AISI to build more secure AI systems,” Dec. 18, 2025. Accessed: Dec. 22, 2025. [Online]. 
Available: https://openai.com/index/us-cais i-uk-aisi-ai-update/ [41] Paul Röttger, Fabio Pernisi, Bertie Vidgen, and Dirk Hovy, “SafetyPrompts: A systematic review of open datasets for evaluating and improving large language model safety,” 2025, arXiv:2404.05399, Available: DOI: 10.48550/arXiv.2404.05399 [42] Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade–Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, Usman Gohar, Ben Huang, Supheakmungkol Sarin, Elie Alhajjar, Canyu Chen, Roman Eng, Kashyap Ramanandula Manjusha, Virendra Mehta, Eileen Long, Murali Emani, Natan Vidra, Benjamin Rukundo, Abolfazl Shahbazi, Kongtao Chen, Rajat Ghosh, Vithursan Thangarasa, Pierre Peigné, Abhinav Singh, Max Bartolo, Satyapriya Krishna, Mubashara Akhtar, Rafael Gold, Cody Coleman, Luis Oala, Vassil Tashev, Joseph Marvin Imperial, Amy Russ, Sasidhar Kunapuli, Nicolas Miailhe, Julien Delaunay, Bhaktipriya Radharapu, Rajat Shinde, Tuesday, Debojyoti Dutta, Declan Grabb, Ananya Gangavarapu, Saurav Sahay, Agasthya Gangavarapu, Patrick Schramowski, Stephen Singam, Tom David, Xudong Han, Priyanka Mary Mammen, Tarunima Prabhakar, Venelin Kovatchev, Rebecca Weiss, Ahmed Ahmed, Kelvin N. Manyeki, Sandeep Madireddy, Foutse Khomh, Fedor Zhdanov, Joachim Baumann, Nina Vasan, Xianjun Yang, Carlos Mougn, Jibin Rajan Varghese, Hussain Chinoy, Seshakrishna Jitendar, Manil Maskey, Claire V. Hardgrove, Tianhao Li, Aakash Gupta, Emil Joswin, Yifan Mai, Shachi H. 
Kumar, Cigdem Patlak, Kevin Lu, Vincent Alessi, Sree Bhargavi Balija, Chenhe Gu, Robert Sullivan, James Gealy, Matt Lavrisa, James Goel, Peter Mattson, Percy Liang, and Joaquin Vanschoren, “AILuminate: Introducing v1.0 of the AI risk and reliability benchmark from MLCommons,” 2025, arXiv:2503.05731, Available: DOI: 10.48550/arXiv.2503.05731 [43] Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers, “ARC Prize 2024: Technical report,” 2025, arXiv:2412.04604, Available: DOI: 10.48550/arXiv.2412.04604 [44] Sharon Goldman and Jeremy Kahn, “Top OpenAI researcher resigns, saying company prioritized ‘shiny products’ over AI safety,” Fortune, May 17, 2024. [Online]. Available: https://fortu ne.com/2024/05/17/openai-researcher-resigns-safety/ [45] Amanda Askell, Miles Brundage, and Gillian Hadfield, “The role of cooperation in responsible AI development,” 2019, arXiv: 1907.04534, Available: [Online]. Available: https://arxiv.o rg/abs/1907.04534 [46] Public Company Accounting Oversight Board, “Investor bulletin: Why audits matter,” Apr. 26, 2025. [Online]. Available: https://pcaobus.org/resources/information-fo r-investors/investor-advisories/investor-bulletin-why-auditsmatter [47] Angela Yang, “Insurance companies are trying to avoid big payouts by making AI safer,” NBC News, 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://www.nbcnews.com /tech/tech-news/insurance-companies-are-trying-to-make-ai-sa fer-rcna243834 [48] “Insurers retreat from AI cover as risk of multibillion-dollar claims mounts,” Financial Times, Nov. 22, 2025. [Online]. Available: https://www.ft.com/content/abfe9741-f4 38-4ed6-a673-075ec177dc62 [49] AIUC, “AIUC-1: The world’s first AI agent standard,” AI Underwriting Company, 2026. Accessed: Jan. 10, 2026. [Online]. Available: https://www.aiuc-1.com/ [50] Wikipedia, “Three Mile Island accident.” Accessed: Jan. 14, 2026. [Online]. 
Available: https: //en.wikipedia.org/wiki/Three_Mile_Island_accident [51] Jonathon Baron and Stephen Herzog, “Public opinion on nuclear energy and nuclear weapons: The attitudinal nexus in the United States,” Energy Research & Social Science, vol. 68, p. 101 567, Oct. 1, 2020. DOI: 10.1016/j.erss.2020.101567 Accessed: Jan. 11, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S221 4629620301432 [52] Nicolai Tangen and Carine Smith Ihenacho, “Responsible artificial intelligence,” Norges Bank Investment Management, 2023. Accessed: Jan. 10, 2026. [Online]. Available: https://www .nbim.no/en/news-and-insights/our-views/2023/responsible-art ificial-intelligence/ [53] Ben Harack, Robert F. Trager, Anka Reuel, David Manheim, Miles Brundage, Onni Aarne, Aaron Scher, Yanliang Pan, Jenny Xiao, Kristy Loke, Sumaya Nur Adan, Guillem Bas, Nicholas A. Caputo, Julia C. Morse, Janvi Ahuja, Isabella Duan, Janet Egan, Ben Bucknall, Brianna Rosen, Renan Araujo, Vincent Boulanin, Ranjit Lall, Fazl Barez, Sanaa Alvira, Corin Katzke, Ahmad Atamli, and Amro Awad, “Verification for international AI governance,” AI Governance Initiative, Oxford Martin School, University of Oxford, Jul. 3, 2025. [Online]. Available: https://aigi.ox .ac.uk/publications/verification-for-international-ai-govern ance/ [54] Mauricio Baker, Gabriel Kulp, Oliver Marks, Miles Brundage, and Lennart Heim, “Verifying international agreements on AI: Six layers of verification for rules on large-scale AI development and deployment,” 2025, arXiv: 2507.15916, Available: [Online]. Available: https://arxiv .org/abs/2507.15916 [55] Jim Mitre, Michael C. Horowitz, Natalia Henry, Emma Borden, Joel B. Predd, Sarah Kreps, Miles Brundage, James D. Fearon, Karl P. Mueller, Jane Vaynman, and Tristan A. Volpe, “The artificial general intelligence race and international security,” RAND, Sep. 24, 2025. Accessed: Dec. 21, 2025. [Online]. 
Available: https://www.rand.org/pubs/perspectives/PEA41 55-1.html [56] Carol Ballantine, “Sulfanilamide disaster,” FDA Consumer Magazine, Jun. 1981. Accessed: Jan. 10, 2026. [Online]. Available: https://www.fda.gov/about-fda/histories-prod uct-regulation/sulfanilamide-disaster [57] U.S. Federal Aviation Administration, “A brief history of the FAA.” Accessed: Dec. 21, 2025. [Online]. Available: https://www.faa.gov/about/history/brief_history [58] U.S. Federal Aviation Administration, “Lockheed L-1049 Super Constellation and Douglas DC-7: Trans World Airlines Flight 2, N6902C,” 2025. [Online]. Available: https://www.faa.go v/lessons_learned/transport_airplane/accidents/N6902C [59] Bill Anderson-Samways, “AI-relevant regulatory precedents: A systematic search across all federal agencies,” Institute for AI Policy and Strategy, Apr. 3, 2024. Accessed: Jan. 11, 2026. [Online]. Available: https://www.iaps.ai/research/ai-relevant-regulatory-pr ecedent [60] Underwriters Laboratories Inc, Engineering Progress: The Revolution and Evolution of Working for a Safer World. IdeaPress Publishing, 2016. Accessed: Dec. 21, 2025. [Online]. Available: https://www.ul.com/about/download-engineering-progress-ebook [61] Céline Marie-Elise Gossner, Jørgen Schlundt, Peter Ben Embarek, Susan Hird, Danilo Lo-FoWong, Jose Javier Ocampo Beltran, Keng Ngee Teoh, and Angelika Tritscher, “The melamine incident: Implications for international food and feed safety,” Environmental Health Perspectives, vol. 117, no. 12, pp. 1803–1808, Dec. 2009. Accessed: Dec. 21, 2025. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC2799451/ [62] Saiwei Li, Yue Wang, Gemma M. L. Tacken, Yumei Liu, and Siet J. Sijtsema, “Consumer trust in the dairy value chain in china: The role of trustworthiness, the melamine scandal, and the media,” Journal of Dairy Science, vol. 104, no. 8, pp. 8554–8567, Aug. 1, 2021. DOI: 10.3168 /jds.2020-19733 Accessed: Dec. 21, 2025. [Online]. 
Available: https://www.scie ncedirect.com/science/article/pii/S0022030221005397 [63] Nancy G. Leveson, An Introduction to System Safety Engineering. Cambridge, MA, USA: MIT Press, Nov. 14, 2023, ISBN: 978-0-262-54688-1. [64] HackerOne, “Bug bounty programs.” Accessed: Dec. 21, 2025. [Online]. Available: https://w ww.hackerone.com/bug-bounty-programs [65] Kayla D. Booker and Quinton Booker, “CPAs and conflicts of interest: A recap of recent AICPA guidance,” Aug. 2016. [Online]. Available: https://www.cpajournal.com/2016/0 8/01/cpas-conflicts-interest/ [66] FBI, “Enron.” Accessed: Jan. 9, 2025. [Online]. Available: https://www.fbi.gov/hist ory/famous-cases/enron [67] Jonas Heese, Charles C. Y. Wang, and Tonia Labruyere, “Wirecard: The downfall of a German fintech star,” Harvard Business School Case 121-058, Mar. 2021. [Online]. Available: https: //www.hbs.edu/faculty/Pages/item.aspx?num=59971 [68] Mary Locatelli, “Good internal controls and auditor independence,” CPA Journal, Oct. 2002. Accessed: Dec. 21, 2025. [Online]. Available: http://archives.cpajournal.com/2 002/1002/nv/nv4.htm [69] Public Company Accounting Oversight Board, “PCAOB report: Audits with deficiencies rose for second year in a row to 40% in 2022,” Jul. 25, 2023. Accessed: Dec. 21, 2025. [Online]. Available: https://pcaobus.org/news-events/news-releases/news-release-d etail/pcaob-report-audits-with-deficiencies-rose-for-secondyear-in-a-row-to-40-in-2022 [70] Leon Staufer, Mick Yang, Anka Reuel, and Stephen Casper, “Audit cards: Contextualizing AI evaluations,” 2025, arXiv: 2504.13839, Available: [Online]. Available: https://arxiv.or g/abs/2504.13839 [71] Google DeepMind, “Gemini 3 Pro frontier safety framework report,” 2025. [Online]. Available: https://storage.googleapis.com/deepmind-media/gemini/gemini_3 _pro_fsf_report.pdf [72] Jack Gallifant, Amelia Fiske, Yulia A. Levites Strekalova, Juan S. 
Osorio-Valencia, Rachael Parke, Rogers Mwavu, Nicole Martinez, Judy Wawira Gichoya, Marzyeh Ghassemi, Dina DemnerFushman, Liam G. McCoy, Leo Anthony Celi, and Robin Pierce, “Peer review of GPT-4 technical report and systems card,” PLOS Digital Health, vol. 3, no. 1, e0000417, Jan. 18, 2024. [Online]. Available: https://doi.org/10.1371/journal.pdig.0000417 [73] Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, et al., “Measuring what matters: Construct validity in large language model benchmarks,” 2025, arXiv: 2511.04703, Available: [Online]. Available: https://arxiv.org/abs/2511.04703 [74] Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, and Abigail Z. Jacobs, “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge,” in Forty-second International Conference on Machine Learning Position Paper Track, 2025. [Online]. Available: https://openreview .net/forum?id=1ZC4RNjqzU [75] Sean McGregor, Allyson Ettinger, Nick Judd, Paul Albee, Liwei Jiang, Kavel Rao, Will Smith, Shayne Longpre, Avijit Ghosh, Christopher Fiorelli, Michelle Hoang, Sven Cattell, and Nouha Dziri, “To err is AI : A case study informing LLM flaw reporting practices,” 2024, arXiv: 2410.12104, Available: [Online]. Available: https://doi.org/10.48550/arXiv.2410.12104 [76] Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, and Percy Liang, “Foundation model transparency reports,” 2024, arXiv: 2402.16268, Available: [Online]. 
Available: https://arxiv.org/abs/2402.16268 [77] Patricia Paskov, Lisa Soder, and Everett Smith, “Toward best practices for AI evaluation and governance: A proposal for a European Union general-purpose AI model evaluation standards task force,” Jun. 24, 2025. Accessed: Jan. 11, 2026. [Online]. Available: https://www.ran d.org/pubs/perspectives/PEA3624-1.html [78] Tegan McCaslin, Jide Alaga, Samira Nedungadi, Seth Donoughe, Tom Reed, Rishi Bommasani, Chris Painter, and Luca Righetti, “STREAM (ChemBio): A standard for transparently reporting evaluations in AI model reports,” 2025, arXiv: 2508.09853, Available: [Online]. Available: htt ps://arxiv.org/abs/2508.09853 [79] Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer, “BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices,” Nov. 20, 2024. [Online]. Available: https://arxiv.org/abs/2411.12990 [80] Frontier Model Forum, “Issue brief: Early best practices for frontier AI safety evaluations,” Jul. 31, 2024. [Online]. Available: https://www.frontiermodelforum.org/updates/e arly-best-practices-for-frontier-ai-safety-evaluations/ [81] “AI evaluator forum.” Accessed: Jan. 11, 2026. [Online]. Available: https://aievaluato rforum.org/ [82] Samuel R. Bowman, Misha Wagner, Fabien Roger, and Holden Karnofsky, “Anthropic’s pilot sabotage risk report,” Alignment Science, Oct. 28, 2025. Accessed: Dec. 22, 2025. [Online]. Available: https://alignment.anthropic.com/2025/sabotage-risk-rep ort/ [83] OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. 
Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Jiancheng Liu Lily (Xiaoxuan) Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song, Dane Stuckey, Zhiqing Sun, Philippe Tillet, Sam Toizer, Foivos Tsimpourlas, Nikhil Vyas, Eric Wallace, Xin Wang, Miles Wang, Olivia Watkins, Kevin Weil, Amy Wendling, Kevin Whinnery, Cedric Whitney, Hannah Wong, Lin Yang, Yu Yang, Michihiro Yasunaga, Kristen Ying, Wojciech Zaremba, Wenting Zhan, Cyril Zhang, Brian Zhang, Eddie Zhang, and Shengjia Zhao, “Gpt-oss-120b & gpt-oss-20b model card,” 2025, arXiv: 2508.10925, Available: [Online]. Available: https://arxiv.org/abs/2508.10925 [84] OpenAI, “Findings from a pilot Anthropic-OpenAI alignment evaluation exercise: OpenAI safety tests,” Aug. 27, 2025. [Online]. 
Available: https://openai.com/index/openai-an thropic-safety-evaluation/ [85] Sasha Costanza-Chock, Inioluwa Deborah Raji, and Joy Buolamwini, “Who audits the auditors? Recommendations from a field scan of the algorithmic auditing ecosystem,” in 2022 ACM Conference on Fairness, Accountability and Transparency, ser. FAccT ’22, Association for Computing Machinery, Jun. 2022, pp. 1571–1583. DOI: 10.1145/3531146.3533213 [86] Inioluwa Deborah Raji, Peggy Xu, Colleen Honigsberg, and Daniel E. Ho, “Outsider oversight: Designing a third party audit ecosystem for AI governance,” 2022, arXiv: 2206.04737, Available: [Online]. Available: https://arxiv.org/abs/2206.04737 [87] Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, AJung Moon, and Negar Rostamzadeh, “From plane crashes to algorithmic harm: Applicability of safety engineering frameworks for responsible ML,” in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, ser. CHI ’23, New York, NY, USA: Association for Computing Machinery, Apr. 19, 2023, pp. 1–18, ISBN: 978-1-4503-9421-5. DOI: 10.1145/3544548.3581407 Accessed: Jan. 11, 2026. [Online]. Available: https://dl.acm.org/doi/10.1145/3 544548.3581407 [88] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes, “Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing,” in FAT ’20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 33–44. DOI: 10.1145/3351095.3372873 [Online]. 
Available: https://doi.org/10.1145/3 351095.3372873 [89] Markus Anderljung, Everett Thornton Smith, Joe O’Brien, Lisa Soder, Benjamin Bucknall, Emma Bluemke, Jonas Schuett, Robert Trager, Lacey Strahm, and Rumman Chowdhury, “Towards publicly accountable frontier LLMs: Building an external scrutiny ecosystem under the ASPIRE framework,” 2023, arXiv: 2311.14711, Available: [Online]. Available: https://arxiv.or g/abs/2311.14711 [90] Khoa Lam, Benjamin Lange, Borhane Blili-Hamelin, Jovana Davidovic, Shea Brown, and Ali Hasan, “A framework for assurance audits of algorithmic systems,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency, ser. FAccT ’24, ACM, Jun. 2024, pp. 1078–1092. DOI: 10.1145/3630106.3658957 [Online]. Available: http://dx.doi.org/10 .1145/3630106.3658957 [91] Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi, “Auditing large language models: A three-layered approach,” AI and Ethics, vol. 4, no. 4, pp. 1085–1115, May 2023. DOI: 10.1007/s43681-023-00289-2 [Online]. Available: https://doi.org/10.10 07/s43681-023-00289-2 [92] Jakob Mökander, “Auditing of AI: Legal, ethical and technical approaches,” Digital Society, vol. 2, no. 3, p. 49, Nov. 8, 2023. DOI: 10.1007/s44206-023-00074-y Accessed: Jan. 11, 2026. [Online]. Available: https://doi.org/10.1007/s44206-023-00074-y [93] National Institute of Standards and Technology, “Managing misuse risk for dual-use foundation models,” NIST AI 800-1 ipd, 2024. DOI: 10.6028/NIST.AI.800- 1.ipd Accessed: Jan. 11, 2026. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/ai /NIST.AI.800-1.ipd.pdf [94] Elham Tabassi, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” National Institute of Standards and Technology (U.S.), NIST AI 100-1, Jan. 26, 2023. DOI: 10.6028 /NIST.AI.100-1 Accessed: Jan. 11, 2026. [Online]. Available: http://nvlpubs.nis t.gov/nistpubs/ai/NIST.AI.100-1.pdf [95] Center for AI Safety, “AI risks that could lead to catastrophe,” 2023. 
Accessed: Jan. 10, 2026. [Online]. Available: https://safe.ai/ai-risk [96] Daniel Atherton, “Incident 1152: LLM-driven Replit agent reportedly executed unauthorized destructive commands during code freeze, leading to loss of production data,” AI Incident Database, Daniel Atherton, Ed., Jul. 18, 2025. Accessed: Jan. 6, 2026. [Online]. Available: https://in cidentdatabase.ai/cite/1152/ [97] Daniel Atherton, “Incident 1178: Google Gemini CLI reportedly deletes user files after misinterpreting command sequence,” AI Incident Database, Daniel Atherton, Ed., Jul. 21, 2025. Accessed: Jan. 6, 2026. [Online]. Available: https://incidentdatabase.ai/cite/1178/ [98] Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, and Anca Dragan, “An approach to technical AGI safety and security,” 2025, arXiv: 2504.01849, Available: [Online]. Available: https://arxiv.org/abs/2504.01849 [99] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside, “An overview of catastrophic AI risks,” 2023, arXiv: 2306.12001, Available: [Online]. Available: https://arxiv.org/abs/230 6.12001 [100] Charlotte Stix, Annika Hallensleben, Alejandro Ortega, and Matteo Pistillo, “The loss of control playbook: Degrees, dynamics, and preparedness,” 2025, arXiv: 2511.15846, Available: [Online]. Available: https://arxiv.org/abs/2511.15846 [101] Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. 
Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, and Florian Tramèr, “Stealing part of a production language model,” in Forty-first International Conference on Machine Learning, 2024. [Online]. Available: https://openrev iew.net/forum?id=VE3yWXt3KB [102] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee, “Scalable Extraction of Training Data from (Production) Language Models,” arXiv preprint arXiv:2311.17035, 2023. [103] A. Feder Cooper, Aaron Gokaslan, Ahmed Ahmed, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, and Percy Liang, “Extracting memorized pieces of (copyrighted) books from open-weight language models,” arXiv preprint arXiv:2505.12546, 2025. [104] Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veliˇckovi´c, Ilia Shumailov, and Jamie Hayes, “Extracting alignment data in open models,” 2025, arXiv: 2510.18554, Available: [Online]. Available: https://ar xiv.org/abs/2510.18554 [105] Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang, “Extracting books from production language models,” arXiv preprint arXiv:2601.02671, 2025. [106] Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr, “What does it mean for a language model to preserve privacy?” 2022, arXiv:2202.05520, Available: [Online]. Available: https://arxiv.org/abs/2202.05520 [107] A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. 
Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark Lemley, Nicolas Papernot, and Katherine Lee, “Machine unlearning doesn’t do what you think: Lessons for generative AI policy and research,” presented at the Thirty-Ninth Annual Conference on Neural Information Processing Systems, Position Paper Track, 2025. [Online]. Available: https://openreview.net/forum?id=mfd6GRW4Az [108] Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, and Jeff Alstott, “Securing AI model weights: Preventing theft and misuse of frontier models,” RAND, May 30, 2024. Accessed: Dec. 21, 2025. [Online]. Available: https://www.rand.org/pubs/research_reports/RRA2849-1.html [109] Ross Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems, 3rd ed. Wiley, 2020. [Online]. Available: https://www.cl.cam.ac.uk/archive/rja14/Papers/SEv3.pdf [110] Bo Hu, Yuanyi Mao, and Ki Joon Kim, “How social anxiety leads to problematic use of conversational AI: The roles of loneliness, rumination, and mind perception,” Computers in Human Behavior, vol. 145, p. 107760, 2023. DOI: 10.1016/j.chb.2023.107760 [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0747563223001115 [111] Keith Robert Head, “Minds in crisis: How the AI revolution is impacting mental health,” Journal of Mental Health & Clinical Psychology, Sep. 5, 2025. DOI: 10.29245/2578-2959/2025/3.1352 [112] “Emotional risks of AI companions demand attention,” Nature Machine Intelligence, vol. 7, no. 7, pp. 981–982, Jul. 22, 2025. DOI: 10.1038/s42256-025-01093-9 Accessed: Jan. 11, 2026. [Online]. Available: https://www.nature.com/articles/s42256-025-01093-9 [113] Auren R.
Liu, Pat Pataranutaporn, and Pattie Maes, “Chatbot companionship: A mixed-methods study of companion chatbot usage patterns and their relationship to loneliness in active users,” 2025, arXiv: 2410.21596. [Online]. Available: https://arxiv.org/abs/2410.21596 [114] Chiara Saracini, Maria Isabel Cornejo-Plaza, and Robert Cippitani, “Techno-emotional projection in human–GenAI relationships: A psychological and ethical conceptual perspective,” Frontiers in Psychology, vol. 16, Sep. 2025. DOI: 10.3389/fpsyg.2025.1662206 [115] APA, “Health advisory: Use of generative AI chatbots and wellness applications for mental health,” Nov. 2025. Accessed: Jan. 11, 2026. [Online]. Available: https://www.apa.org/topics/artificial-intelligence-machine-learning/health-advisory-chatbots-wellness-apps [116] Caroline Haskins, “People who say they’re experiencing AI psychosis beg the FTC for help,” Wired, Oct. 22, 2025. Accessed: Jan. 11, 2026. [Online]. Available: https://www.wired.com/story/ftc-complaints-chatgpt-ai-psychosis/ [117] Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber, “Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers,” in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ser. FAccT ’25, Association for Computing Machinery, 2025, pp. 599–627. DOI: 10.1145/3715275.3732039 [Online]. Available: https://doi.org/10.1145/3715275.3732039 [118] Center for Countering Digital Hate, “Fake friends: How ChatGPT betrays vulnerable teens by encouraging dangerous behavior,” Aug. 6, 2025. [Online]. Available: https://counterhate.com/research/fake-friend-chatgpt/ [119] Ryuhaerang Choi, Taehan Kim, Subin Park, Jennifer G. Kim, and Sung-Ju Lee, “Private yet social: How LLM chatbots support and challenge eating disorder recovery,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, ser.
CHI ’25, Association for Computing Machinery, 2025. DOI: 10.1145/3706598.3713485 [Online]. Available: https://doi.org/10.1145/3706598.3713485 [120] Sean McGregor, “Preventing repeated real world AI failures by cataloging incidents: The AI Incident Database,” in AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 15 458–15 463. [Online]. Available: https://doi.org/10.1609/aaai.v35i17.17817 [121] AIID, “Entity: Waymo,” AI Incident Database. Accessed: Jan. 12, 2026. [Online]. Available: https://incidentdatabase.ai/entities/waymo/ [122] Daniel Atherton, “Incident 1308: Springer Nature Book ‘Mastering machine learning: From basics to advanced’ reportedly published with numerous purportedly nonexistent or incorrect citations,” AI Incident Database, Daniel Atherton, Ed., Apr. 18, 2025. Accessed: Jan. 6, 2026. [Online]. Available: https://incidentdatabase.ai/cite/1308/ [123] Remco Zwetsloot and Allan Dafoe, “Thinking about risks from AI: Accidents, misuse and structure,” Lawfare, Feb. 11, 2019. Accessed: Dec. 21, 2025. [Online]. Available: https://www.lawfa remedia.org/article/thinking-about-risks-ai-accidents-misuseand-structure [124] Nataliya Kosmyna, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, and Pattie Maes, “Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task,” 2025, arXiv:2506.08872, Available: [Online]. Available: https://arxiv.org/abs/2506.08872 [125] Frontier Model Forum, “Risk taxonomy and thresholds for frontier AI frameworks,” Tech. Rep., Jun. 2025. [Online]. Available: https://www.frontiermodelforum.org/techni cal-reports/risk-taxonomy-and-thresholds/ [126] METR, “Common elements of frontier AI safety policies (December 2025 update),” METR Blog, Dec. 9, 2025. Accessed: Jan. 11, 2026. [Online]. 
Available: https://metr.org/blog/2 025-12-09-common-elements-of-frontier-ai-safety-policies/ [127] Cade Metz, “A hacker stole OpenAI secrets, raising fears that China could, too,” The New York Times, Jul. 4, 2024. Accessed: Jan. 11, 2026. [Online]. Available: https://www.nytimes .com/2024/07/04/technology/openai-hack.html [128] United States Department of Justice, “Chinese national residing in California arrested for theft of artificial intelligence-related trade secrets from Google,” 2024. [Online]. Available: https://w ww.justice.gov/archives/opa/pr/chinese-national-residing-cal ifornia-arrested-theft-artificial-intelligence-related-trade [129] Ian Mitch, Matthew J. Malone, Karen Schwindt, Gregory Smith, Wesley Hurd, Henry Alexander Bradley, and James Gimbi, “Governance approaches to securing frontier AI,” RAND, Oct. 7, 2025. Accessed: Jan. 11, 2026. [Online]. Available: https://www.rand.org/pubs/resear ch_reports/RRA4159-1.html [130] Dean W. Ball and Ketan Ramakrishnan, “Entity-based regulation in frontier AI governance,” Carnegie Endowment for International Peace, Jul. 7, 2025. Accessed: Jan. 11, 2026. [Online]. Available: https://carnegieendowment.org/research/2025/06/artific ial-intelligence-regulation-united-states?lang=en [131] UK Ministry of Defence, “Defence standard 00-56: Safety management requirements for defence systems,” 2007. [Online]. Available: https://skybrary.aero/sites/default/fi les/bookshelf/344.pdf [132] Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung, “Safety cases for frontier AI,” 2024, arXiv: 2410.21572, Available: [Online]. Available: https: //arxiv.org/abs/2410.21572 [133] Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen, “Safety cases: How to justify the safety of advanced AI systems,” 2024, arXiv: 2403.10462, Available: [Online]. Available: https://arxiv.org/abs/2403.10462 [134] George F. Jelen and Jeffrey R. 
Williams, “A practical approach to measuring assurance,” in Proceedings of the 14th Annual Computer Security Applications Conference, Cat. No.98EX217, IEEE, 1998, pp. 333–343. DOI: 10.1109/CSAC.1998.738653 [135] ICAEW, “Limited assurance vs reasonable assurance,” Institute of Chartered Accountants in England and Wales. Accessed: Jan. 10, 2026. [Online]. Available: https://www.icaew.c om/technical/audit-and-assurance/assurance/process/scoping/a ssurance-decision/limited-assurance-vs-reasonable-assurance [136] IAASB, “International standard on assurance engagements (ISAE) 3000 revised, assurance engagements other than audits or reviews of historical financial information,” Dec. 9, 2013. Accessed: Dec. 21, 2025. [Online]. Available: https://www.iaasb.org/publicatio ns/international-standard-assurance-engagements-isae-3000-re vised-assurance-engagements-other-audits-or [137] ISO/IEC, “ISO/IEC 17029:2019: Conformity assessment — general principles and requirements for validation and verification bodies,” International Organization for Standardization, Oct. 2019. Accessed: Dec. 21, 2025. [Online]. Available: https://www.iso.org/standard/293 52.html [138] Public Company Accounting Oversight Board, “AT section 101: Attest engagements.” Accessed: Dec. 21, 2025. [Online]. Available: https://pcaobus.org/oversight/standard s/attestation-standards/details/AT101 [139] BSEE, “Oil and gas and sulphur operations in the outer continental shelf—blowout preventer systems and well control (30 C.F.R. pt. 250),” Bureau of Safety and Environmental Enforcement, Department of Interior, 2015. [Online]. Available: https://www.federalregister.g ov/documents/2023/08/23/2023-17847/oil-and-gas-and-sulfur-op erations-in-the-outer-continental-shelf-blowout-preventer-sy stems-and-well [140] Marc L. Dapas, “Key principles for nuclear material safety and safeguards reviews,” U.S. Nuclear Regulatory Commission. [Online]. 
Available: https://www.nrc.gov/docs/ML1901/ML19015A290.pdf [141] Hilary Jackson, “Internal audit report on the aviation safety audit process,” International Civil Aviation Organization, IA/2021/6, Oct. 28, 2021. [Online]. Available: https://www.icao.int/sites/default/files/secretariat/OfficeOfInternalOversight/Final-Oversight-Reports/Internal-Audit-Report-on-the-Aviation-Safety-Audit-Process.pdf [142] Dan Harris, “Which level of assurance is best for your ESG reporting,” BDO, Feb. 27, 2023. [Online]. Available: https://www.bdo.com/insights/assurance/which-level-of-assurance-is-best-for-your-esg-reporting [143] Kanishk Mahaveer Jain, “AI-driven fuel optimization in VTOL aircraft: A comprehensive review,” Acceleron Aerospace Journal, vol. 4, no. 6, pp. 1176–1185, Jun. 2025. DOI: 10.61359/11.2106-2532 [Online]. Available: https://www.acceleron.org.in/index.php/aaj/article/view/246 [144] NIST, “Special publication 800-63b: B.3 authenticator assurance levels,” National Institute of Standards and Technology, 2026. [Online]. Available: https://pages.nist.gov/800-63-3-Implementation-Resources/63B/AAL/ [145] Erin L. Hamilton, “Evaluating the intentionality of identified misstatements: How perspective can help auditors in distinguishing errors from fraud,” Auditing: A Journal of Practice & Theory, vol. 35, no. 4, pp. 57–78, Nov. 2016. [Online]. Available: https://doi.org/10.2308/ajpt-51452 [146] UK PSCTG, “Gross disproportion, step by step – a possible approach to evaluating additional measures at COMAH sites,” 2006. [Online]. Available: https://www.icheme.org/media/9853/xix-paper-66.pdf [147] Public Company Accounting Oversight Board, “AS 2805: Management representations,” 1998. Accessed: Dec. 21, 2025. [Online]. Available: https://pcaobus.org/oversight/standards/auditing-standards/details/AS2805 [148] Public Company Accounting Oversight Board, “AS 1000: General responsibilities of the auditor,” 2024. Accessed: Jan. 11, 2026. [Online].
Available: https://pcaobus.org/oversigh t/standards/auditing-standards/details/as-1000--general-resp onsibilities-of-the-auditor-in-conducting-an-audit [149] METR, “Review of the Anthropic summer 2025 pilot sabotage risk report,” METR, Oct. 28, 2025. [Online]. Available: https://metr.org/2025_pilot_risk_report_metr_rev iew.pdf [150] Tobin South, Alexander Camuto, Shrey Jain, Shayla Nguyen, Robert Mahari, Christian Paquin, Jason Morton, and Alex ‘Sandy’ Pentland, “Verifiable evaluations of machine learning models using zkSNARKs,” 2024, arXiv: 2402.02675, Available: [Online]. Available: https://arxiv .org/abs/2402.02675 [151] US GAO, “IAEA has strengthened its safeguards and nuclear security programs, but weaknesses need to be addressed,” US Government Accountability Office, Oct. 2005. [Online]. Available: https://www.gao.gov/assets/gao-06-93.pdf [152] Laura Rockwood and Larry Johnson, “Verification of correctness and completeness in the implementation of IAEA safeguards: The law and practice,” in Nuclear Non-Proliferation in International Law, T.M.C. Asser Press, 2016, pp. 57–94. [Online]. Available: https://doi.org/10.10 07/978-94-6265-075-6_4 [153] Jane Vaynman and Tristan A. Volpe, “Dual use deception: How technology shapes cooperation in international relations,” International Organization, vol. 77, no. 3, pp. 599–632, 2023. [Online]. Available: https://www.cambridge.org/core/journals/internationalorganization/article/dual-use-deception-how-technology-shape s-cooperation-in-international-relations/C3BC65F4B54B5094406 32BD62D074031 [154] Jung Koo Kang, Clive Lennox, and Vivek Pandey, “Client concerns about information spillovers from sharing audit partners,” Journal of Accounting and Economics, vol. 73, no. 1, p. 101 434, 2022. DOI: 10.1016/j.jacceco.2021.101434 [Online]. Available: https://www .sciencedirect.com/science/article/pii/S0165410121000495 [155] Brad A. Badertscher, Jaewoo Kim, William R. 
Kinney, and Edward Owens, “Assurance level choice, CPA fees, and financial reporting benefits: Inferences from U.S. private firms,” Journal of Accounting and Economics, vol. 75, no. 2, p. 101 551, 2023. DOI: 10.1016/j.jacceco.2 022.101551 [Online]. Available: https://www.sciencedirect.com/science /article/pii/S016541012200074X [156] Benjamin S. Bucknall and Robert F. Trager, “Structured access for third-party research on frontier AI models: Investigating researchers’ model access requirements,” Centre for the Governance of AI, Oct. 2023. Accessed: Dec. 22, 2025. [Online]. Available: https://www.oxfordmarti n.ox.ac.uk/publications/structured-access-for-third-party-re search-on-frontier-ai-models-investigating-researchers-modelaccess-requirements [157] Edward Kembery, Ben Bucknall, and Morgan Simpson, “Position paper: Model access should be a key concern in AI governance,” 2024, arXiv: 2412.00836, Available: DOI: 10.48550/arXi v.2412.00836 [158] Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E. McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, and Dylan Hadfield-Menell, “Model tampering attacks enable more rigorous evaluations of LLM capabilities,” 2025, arXiv:2502.05209, Available: DOI: 10.48550/arXiv .2502.05209 [159] Ben Bucknall, Robert F. Trager, and Michael A. Osborne, “Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary AI systems,” 2025, arXiv: 2503.01470, Available: [Online]. Available: https://arxiv.org/abs/2503.01470 [160] Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensold, Cullen O’Keefe, Mark Koren, Théo Ryffel, J. B. 
Rubinovitz, Tamay Besiroglu, Federica Carugati, Jack Clark, Peter Eckersley, Sarah de Haas, Maritza Johnson, Ben Laurie, Alex Ingerman, Igor Krawczuk, Amanda Askell, Rosario Cammarota, Andrew Lohn, David Krueger, Charlotte Stix, Peter Henderson, Logan Graham, Carina Prunkl, Bianca Martin, Elizabeth Seger, Noa Zilberman, Seán Ó hÉigeartaigh, Frens Kroeger, Girish Sastry, Rebecca Kagan, Adrian Weller, Brian Tse, Elizabeth Barnes, Allan Dafoe, Paul Scharre, Ariel Herbert-Voss, Martijn Rasser, Shagun Sodhani, Carrick Flynn, Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, and Markus Anderljung, “Toward trustworthy AI development: Mechanisms for supporting verifiable claims,” 2020, arXiv: 2004.07213, Available: [Online]. Available: https://arxiv.org/abs/2004.07213 [161] U.S. Department of State, “Licenses for the export of technical data and classified defense articles (22 C.F.R. pt. 125),” 2025. [Online]. Available: https://www.ecfr.gov/current/ti tle-22/chapter-I/subchapter-M/part-125 [162] U.S. Food and Drug Administration, “FSMA final rule on accredited third-party certification,” 2021. Accessed: Jan. 10, 2026. [Online]. Available: https://www.fda.gov/food/foo d-safety-modernization-act-fsma/fsma-final-rule-accredited-t hird-party-certification [163] European Union, “Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, amending Directive 2001/83/EC, Regulation (EC) no 178/2002 and Regulation (EC) no 1223/2009 and repealing Council Directives 90/385/EEC and 93/42/EEC (text with EEA relevance. )” Apr. 5, 2017. Accessed: Dec. 21, 2025. [Online]. Available: http://data.europa.eu/eli/reg/2017/745/oj [164] U.S. Environmental Protection Agency, “About the National Vehicle and Fuel Emissions Laboratory (NVFEL),” 2025. Accessed: Dec. 21, 2025. [Online]. 
Available: https://www.epa.gov/aboutepa/about-national-vehicle-and-fuel-emissions-laboratory-nvfel [165] Andrew Trask, Aziz Berkay Yesilyurt, Bennett Farkas, Callis Ezenwaka, Carmen Popa, Dave Buckley, Eelco van der Wel, Francesco Mosconi, Grace Han, Ionesio Junior, Irina Bejan, Khoa Nguyen, Ishan Mishra, Koen van der Veen, Kyoko Eng, Lacey Strahm, Madhava Jay, Logan Graham, Matei Simtinica, Osam Kyemenu-Sarsah, Peter Smith, Rasswanth S, Ronnie Falcon, Sameer Wagh, Shubham Gupta, Sandeep Mandala, Subha Ramkumar, Stephen Gabriel, Tauquir Ahmed, Teo Milea, Valerio Maggio, Yash Gorana, and Zarreen Reza, “Secure enclaves for AI evaluation,” OpenMined, 2025. [Online]. Available: https://openmined.org/blog/secure-enclaves-for-ai-evaluation/ [166] Aon Hewitt, “How ‘clean teams’ can accelerate mergers and acquisitions,” 2010. [Online]. Available: https://www.aon.com/attachments/clean%20teams.pdf [167] Scott C. Whitaker, “Establishing an integration management office,” in Mergers & Acquisitions Integration Handbook: Helping Companies Realize the Full Value of Acquisitions, John Wiley & Sons, Ltd, 2012, pp. 61–73. DOI: 10.1002/9781119202301.ch7 [168] Michael Bartock, Murugiah Souppaya, Ryan Savino, Tim Knoll, Uttam Shetty, Mourad Cherfaoui, Raghu Yeluri, Akash Malhotra, Don Banks, Michael Jordan, Dimitrios Pendarakis, J. R. Rao, Peter Romness, and Karen Scarfone, “Hardware-enabled security: Enabling a layered approach to platform security for cloud and edge computing use cases,” National Institute of Standards and Technology (U.S.), NIST IR 8320, May 4, 2022. DOI: 10.6028/NIST.IR.8320 [Online].
Available: https://nvlpubs.nist.gov/nistpubs/ir/2022/NIST.IR.8320 .pdf [169] Conrad Stosz, Karson Elmgren, Charles Foster, George Balston, Seth Donoughe, Samira Nedungadi, Michael Chen, Jasper Götting, Patricia Paskov, Sayash Kapoor, Sarah Schwettmann, Rishi Bommasani, Luca Righetti, Sam McGregor, Grace Werner, Rob Reich, Arvind Narayanan, Elizabeth Barnes, Christopher Painter, Miles Brundage, Aidan Homewood, Divya Siddharth, Faisal Lalani, Charles Teague, Jaime Sevilla, and Jacob Steinhardt, “AEF-1 minimum operating conditions for independent third party AI evaluations,” 2025. [Online]. Available: https://w ww.aef.one/aef-one.pdf [170] Erich Grunewald, “A whistleblower incentive program to enforce U.S. export controls,” Lawfare, Jun. 16, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://www.lawfareme dia.org/article/a-whistleblower-incentive-program-to-enforceu.s.-export-controls [171] IAASB, “International standard on auditing (ISA) 705 revised, modifications to the opinion in the independent auditor’s report,” Jan. 15, 2015. Accessed: Dec. 21, 2025. [Online]. Available: https://www.iaasb.org/publications/international-standard-au diting-isa-705-revised-modifications-opinion-independent-aud itor-s-report-3 [172] Public Company Accounting Oversight Board, “AS 3105: Departures from unqualified opinions and other reporting circumstances,” 2017. Accessed: Dec. 21, 2025. [Online]. Available: https ://pcaobus.org/oversight/standards/auditing-standards/details /AS3105 [173] NIST, “Risk management framework for information systems and organizations: A system life cycle approach for security and privacy,” National Institute of Standards and Technology, NIST Special Publication (SP) 800-37 Rev. 2, Dec. 20, 2018. DOI: 10.6028/NIST.SP.800-37r2 Accessed: Jan. 11, 2026. [Online]. 
Available: https://csrc.nist.gov/pubs/sp/80 0/37/r2/final [174] ISO/IEC, “ISO/IEC 27001:2022: Information security, cybersecurity and privacy protection— information security management systems—requirements,” Oct. 2022. Accessed: Dec. 21, 2025. [Online]. Available: https://www.iso.org/standard/27001 [175] EASA, “Easy access rules for continuing airworthiness (regulation (EU) no 1321/2014),” European Union Aviation Safety Agency, 2024. Accessed: Dec. 21, 2025. [Online]. Available: https://w ww.easa.europa.eu/en/document-library/easy-access-rules/onli ne-publications/easy-access-rules-continuing-airworthiness [176] Lorenzo Pacchiardi, John Burden, Fernando Martinez-Plumed, and Jose Hernandez-Orallo, “A framework to categorise modified general-purpose AI models as new models based on behavioural changes,” Publications Office of the European Union, JRC Technical Report JRC143257, Oct. 10, 2025. DOI: 10.2760/4372557 [Online]. Available: https://publications.jrc .ec.europa.eu/repository/handle/JRC143257 [177] Jonas Schuett, “Frontier AI developers need an internal audit function,” Risk Analysis, vol. 45, no. 6, pp. 1332–1352, 2024. DOI: 10.1111/risa.17665 Accessed: Dec. 21, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/risa.1 7665 [178] Aidan Homewood, Sophie Williams, Noemi Dreksler, John Lidiard, Malcolm Murray, Lennart Heim, Marta Ziosi, Seán Ó hÉigeartaigh, Michael Chen, Kevin Wei, Christoph Winter, Miles Brundage, Ben Garfinkel, and Jonas Schuett, “Third-party compliance reviews for frontier AI safety frameworks,” 2025, arXiv: 2505.01643, Available: [Online]. Available: https://arxi v.org/abs/2505.01643 [179] International Ethics Standards Board for Accountants, “Revisions to the fee-related provisions of the code: Final pronouncement,” Apr. 2021. [Online]. 
Available: https://legalserv icesboard.org.uk/wp-content/uploads/2025/06/Appendix-3-IESBAFinal-Pronouncements.pdf [180] Financial Reporting Council, “Revised ethical standard 2024,” 2024. [Online]. Available: https ://media.frc.org.uk/documents/Revised_Ethical_Standard_2024_o rZHKLq.pdf [181] Bryan K. Church, J. Gregory Jenkins, Susan A. McCracken, Pamela B. Roush, and Jonathan D. Stanley, “Auditor independence in fact: Research, regulatory, and practice implications drawn from experimental and archival research,” Accounting Horizons, vol. 29, no. 1, pp. 217–238, Mar. 1, 2015. DOI: 10.2308/acch-50966 Accessed: Jan. 11, 2026. [Online]. Available: https://doi.org/10.2308/acch-50966 [182] Rahman Yakubu and Tracey Williams, “A theoretical approach to auditor independence and audit quality,” Corporate Ownership and Control, vol. 17, no. 2, pp. 124–141, 2020. DOI: 10.22495 /cocv17i2art11 Accessed: Jan. 11, 2026. [Online]. Available: https://www.virtus interpress.org/A-theoretical-approach-to-auditor-independenc e-and-audit-quality.html [183] EASA, “Making aviation safer and greener for over 20 years,” European Union Aviation Safety Agency. Accessed: Dec. 21, 2025. [Online]. Available: https://www.easa.europa.eu /en/the-agency/the-agency [184] Don A. Moore, Philip E. Tetlock, Lloyd Tanlu, and Max H. Bazerman, “Conflicts of interest and the case of auditor independence: Moral seduction and strategic issue cycling,” Academy of Management Review, vol. 31, no. 1, pp. 10–29, Jan. 2006. DOI: 10.5465/amr.2006.1937 9621 Accessed: Jan. 11, 2026. [Online]. Available: https://journals.aom.org/doi /10.5465/amr.2006.19379621 [185] Public Company Accounting Oversight Board, “Ethics & independence.” Accessed: Dec. 21, 2025. [Online]. Available: https://pcaobus.org/oversight/standards/ethics-i ndependence-rules [186] Joshua Ronen, “Corporate audits and how to fix them,” Journal of Economic Perspectives, vol. 24, no. 2, pp. 189–210, Jun. 2010. DOI: 10.1257/jep.24.2.189 Accessed: Dec. 
21, 2025. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/jep.2 4.2.189 [187] Joseph V. Carcello and Chan Li, “Costs and benefits of requiring an engagement partner signature: Recent experience in the United Kingdom,” The Accounting Review, vol. 88, no. 5, pp. 1511–1546, Sep. 1, 2013. DOI: 10.2308/accr-50450 Accessed: Dec. 21, 2025. [Online]. Available: https://doi.org/10.2308/accr-50450 [188] Public Company Accounting Oversight Board, “ET section 101: Independence, integrity, and objectivity.” Accessed: Dec. 21, 2025. [Online]. Available: https://pcaobus.org/over sight/standards/ethics-independence-rules/details/ET101 [189] OpenAI, “Strengthening our safety ecosystem with external testing,” Dec. 18, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://openai.com/index/strengtheningsafety-with-external-testing/ [190] AI Evaluator Forum, “Transparency about third-party AI evaluations.” Accessed: Dec. 21, 2025. [Online]. Available: https://aievaluatorforum.org/initiatives/evaluat ion-transparency-letter [191] European Union, “Regulation (EU) 537/2014 of the European Parliament and of the Council of 16 April 2014 on specific requirements regarding statutory audit of public-interest entities and repealing Commission Decision 2005/909/EC (text with EEA relevance),” Apr. 16, 2014. Accessed: Dec. 21, 2025. [Online]. Available: http://data.europa.eu/eli/reg/20 14/537/oj [192] Public Company Accounting Oversight Board, “Spotlight: Inspection observations related to auditor independence,” PCAOB, Sep. 2024. [Online]. Available: https://assets.pcaob us.org/pcaob-dev/docs/default-source/documents/auditor-indep endence-spotlight.pdf [193] Brian A. Nosek, Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor, “The preregistration revolution,” PNAS, vol. 115, no. 11, pp. 2600–2606, 2018. [Online]. Available: https://doi.org/10.1073/pnas.1708274114 [194] Catherine De Angelis, Jeffrey M. Drazen, Frank A. 
Frizelle, Charlotte Haug, John Hoey, Richard Horton, Sheldon Kotzin, Christine Laine, Ana Marusic, A. John P. M. Overbeke, Torben V. Schroeder, Hal C. Sox, and Martin B. Van Der Weyden, “Clinical trial registration: A statement from the International Committee of Medical Journal Editors,” New England Journal of Medicine, vol. 351, no. 12, pp. 1250–1251, 2004. DOI: 10.1056/NEJMe048225 [Online]. Available: https://www.nejm.org/doi/full/10.1056/NEJMe048225 [195] V. Carro, R. Burnell, C. Mougan, A. Reuel, W. Schellaert, O. Salaudeen, L. Zhou, P. Paskov, A. Cohn, and J. Hernandez-Orallo, “PREP-eval: Pre-registration and REporting protocol for AI evaluations,” 2026, forthcoming. [196] Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca, “Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation,” 2025, arXiv: 2502.06559. [Online]. Available: https://arxiv.org/abs/2502.06559 [197] Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, and Sanmi Koyejo, “Measurement to meaning: A validity-centered framework for AI evaluation,” 2025, arXiv: 2505.10573. DOI: 10.48550/arXiv.2505.10573 [198] Wout Schellaert, “The evaluation of artificial intelligence as a prediction problem,” Ph.D. dissertation, Polytechnic University of Valencia, 2025. [Online]. Available: https://schellaert.org/papers/2025_Thesis.pdf [199] Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, Peter Henderson, Sherry Tongshuang Wu, Patrick C.
Kyllonen, Lucy Cheke, Xing Xie, and José Hernández-Orallo, “General scales unlock AI evaluation with explanatory and predictive power,” 2025, arXiv: 2503.06378. DOI: 10.48550/arXiv.2503.06378 [200] Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, and William Isaac, “Sociotechnical safety evaluation of generative AI systems,” 2023, arXiv: 2310.11986. [Online]. Available: https://arxiv.org/abs/2310.11986 [201] Samuel R. Bowman and George Dahl, “What will it take to fix benchmarking in natural language understanding?” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2021. [Online]. Available: https://doi.org/10.18653/v1/2021.naacl-main.385 [202] Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna, “AI and the everything in the whole wide world benchmark,” 2021, arXiv: 2111.15366. DOI: 10.48550/arXiv.2111.15366 [203] Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell, “Randomness, not representation: The unreliability of evaluating cultural alignment in LLMs,” in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ser. FAccT ’25, New York, NY, USA: Association for Computing Machinery, Jun. 23, 2025, pp. 2151–2165, ISBN: 979-8-4007-1482-5. DOI: 10.1145/3715275.3732147 Accessed: Jan. 11, 2026. [Online].
Available: https://dl.acm.org/doi/10.1145/3715275.3732147 [204] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe, “Model evaluation for extreme risks,” 2023, arXiv: 2305.15324. DOI: 10.48550/arXiv.2305.15324 [205] Patricia Paskov, Michael Byun, Kevin Wei, and Toby Webster, “Preliminary suggestions for rigorous GPAI model evaluations,” RAND, Apr. 2025. [Online]. Available: https://www.rand.org/pubs/perspectives/PEA3971-1.html [206] Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, and Deep Ganguli, “Clio: Privacy-preserving insights into real-world AI use,” 2024, arXiv: 2412.13678. [Online]. Available: https://arxiv.org/abs/2412.13678 [207] Pierre Le Jeune, Jiaen Liu, Luca Rossi, and Matteo Dora, “RealHarm: A collection of real-world language model application failures,” 2025, arXiv: 2504.10277. DOI: 10.48550/arXiv.2504.10277 [208] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin, “On evaluating adversarial robustness,” 2019, arXiv: 1902.06705. [Online]. Available: https://arxiv.org/abs/1902.06705 [209] Jonathan Uesato, Brendan O’Donoghue, Pushmeet Kohli, and Aaron van den Oord, “Adversarial risk and the dangers of evaluating against weak attacks,” in Proceedings of the 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds., ser. Proceedings of Machine Learning Research, vol. 80, PMLR, Jul.
2018, pp. 5025–5034. [Online]. Available: https://proceedings.mlr.press/v80/uesato18a.html [210] Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward, “AI sandbagging: Language models can strategically underperform on evaluations,” 2025, arXiv:2406.07358, Available: DOI: 10.48550/arXiv.2406.07358 [211] Public Company Accounting Oversight Board, “AS 3101: The auditor’s report on an audit of financial statements when the auditor expresses an unqualified opinion,” 2024. Accessed: Jan. 12, 2026. [Online]. Available: https://pcaobus.org/oversight/standards/audi ting-standards/details/AS3101 [212] Alexey Kovynev, “Inside a WANO peer review,” 2014, NS Energy Business, Available: [Online]. Available: https://www.nsenergybusiness.com/analysis/featureinsid e-a-wano-peer-review-4294101 [213] FDIC, “Summary of filing requirements,” 2025. [Online]. Available: https://www.fdic.g ov/corporate-governance-and-auditing-programs/part-363-summa ry-filing-requirements [214] Florian Bordes, Candace Ross, Justine T. Kao, Evangelia Spiliopoulou, and Adina Williams, “Eval factsheets: A structured framework for documenting AI evaluations,” 2025, arXiv: 2512.04062, Available: [Online]. Available: https://arxiv.org/abs/2512.04062 [215] Catherine E. Rudder, A. Lee Fritschler, and Yon Jung Choi, Public Policymaking by Private Organizations: Challenges to Democratic Governance. Brookings Institution Press, 2016. [216] Helen Nissenbaum, “Accountability in a computerized society,” Science and Engineering Ethics, vol. 2, no. 1, pp. 25–42, 1996. DOI: 10.1007/BF02639315 [Online]. Available: https: //link.springer.com/article/10.1007/BF02639315 [217] Joshua Kroll, “Accountable algorithms,” Ph.D. dissertation, Sep. 2015. Accessed: Jan. 12, 2026. [Online]. Available: https://www.proquest.com/openview/a29166818f9cf2 ffad47c9778da8354d/ [218] A. 
Feder Cooper, Emanuel Moss, Benjamin Laufer, and Helen Nissenbaum, “Accountability in an algorithmic society: Relationality, responsibility, and robustness in machine learning,” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), Jun. 2022. DOI: 10.1145/3531146.3533150 [Online]. Available: https://dl.acm.org/doi/10.1145/3531146.3533150
[219] Katherine Lee, A. Feder Cooper, and James Grimmelmann, “Talkin’ ’bout AI generation: Copyright and the generative-AI supply chain,” 2023, arXiv:2309.08133. [Online]. Available: http://arxiv.org/abs/2309.08133
[220] Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, and Peter Henderson, “A safe harbor for AI evaluation and red teaming,” 2024, arXiv:2403.04893. [Online]. Available: https://arxiv.org/abs/2403.04893
[221] Ram Shankar Siva Kumar, “Ignore safety directions. Violate the CFAA?,” presented at the Workshop on Generative AI and Law, ICML, 2024. [Online]. Available: https://icml.cc/virtual/2024/39224
[222] Ram Shankar Siva Kumar, Jonathon Penney, Bruce Schneier, and Kendra Albert, “Legal risks of adversarial machine learning research,” 2020, arXiv:2006.16179. DOI: 10.48550/arXiv.2006.16179
[223] Madelyne Xiao, Andrew Sellars, and Sarah Scheffler, “When anti-fraud laws become a barrier to computer science research,” 2025, arXiv:2502.02767. DOI: 10.48550/arXiv.2502.02767
[224] Sindre Kvist, Saloni Dattani, and Max Wang, “Underwriting superintelligence: How insurance unlocks secure AI progress,” 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://underwriting-superintelligence.com/
[225] Cristian Trout, “When does regulation by insurance work?
The case of frontier AI,” 2025, SSRN: 5588732. DOI: 10.2139/ssrn.5588732
[226] Brian Kennedy, Eileen Yam, Emma Kikuchi, Isabelle Pula, and Javier Fuentes, “How Americans view AI and its impact on people and society,” Pew Research Center, Sep. 17, 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://www.pewresearch.org/science/2025/09/17/how-americans-view-ai-and-its-impact-on-people-and-society/
[227] Gillian K. Hadfield and Jack Clark, “Regulatory markets: The future of AI governance,” 2023, arXiv:2304.04914. [Online]. Available: https://arxiv.org/abs/2304.04914
[228] Fathom, “Independent oversight marketplace for AI,” 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://fathom.org/resources/independent-oversight-marketplace-for-ai.pdf
[229] Gabriel Weil, “Tort law as a tool for mitigating catastrophic risk from artificial intelligence,” 2024, SSRN: 4694006. DOI: 10.2139/ssrn.4694006
[230] Gabriel Weil, “Instrument choice in AI governance: Liability as the indispensable core,” 2025, SSRN: 5283275. DOI: 10.2139/ssrn.5283275
[231] Dean W. Ball, “A framework for the private governance of frontier artificial intelligence,” 2025, arXiv:2504.11501. DOI: 10.48550/arXiv.2504.11501
[232] METR, “Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max,” METR’s Autonomy Evaluation Resources, Nov. 19, 2025. Accessed: Dec. 22, 2025. [Online]. Available: https://evaluations.metr.org/gpt-5-1-codex-max-report/
[233] Andrew J. Coe and Jane Vaynman, “Why arms control is so rare,” American Political Science Review, vol. 114, no. 2, pp. 342–355, 2020. DOI: 10.1017/S000305541900073X [Online]. Available: https://www.cambridge.org/core/journals/american-political-science-review/article/why-arms-control-is-so-rare/BAC79354627F72CDDDB102FE82889B8A
[234] Mikko Toivanen, “The frontiers of technology in warhead verification,” Senior Thesis, Claremont McKenna College, 2017. [Online].
Available: https://scholarship.claremont.edu/cmc_theses/3396/
[235] Anthropic, “Clio: Privacy-preserving insights into real-world AI use,” Dec. 12, 2024. Accessed: Jan. 12, 2026. [Online]. Available: https://www.anthropic.com/research/clio
[236] Andrew Trask, Emma Bluemke, Teddy Collins, Eric Drexler, Ben Garfinkel, Claudia Ghezzou Cuervas-Mons, Iason Gabriel, Allan Dafoe, and William Isaac, “Beyond privacy trade-offs with structured transparency,” 2020, arXiv:2012.08347. DOI: 10.48550/arXiv.2012.08347
[237] Wikipedia, “Unidirectional network (data diode).” Accessed: Jan. 10, 2026. [Online]. Available: https://en.wikipedia.org/wiki/Unidirectional_network
[238] Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta, “Every language model has a forgery-resistant signature,” 2025, arXiv:2510.14086. DOI: 10.48550/arXiv.2510.14086
[239] Miles Brundage, “Unbridled AI competition invites disaster,” in The Artificial General Intelligence Race and International Security, RAND, Sep. 24, 2025. [Online]. Available: https://www.rand.org/pubs/perspectives/PEA4155-1.html
[240] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Zhaowei Zhang, Fanzhi Zeng, Juntao Dai, Xuehai Pan, Kwan Ng, Adian O’Gara, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Yaodong Yang, and Wen Gao, “AI alignment: A comprehensive survey,” 2024, arXiv:2310.19852. [Online]. Available: https://alignmentsurvey.com/uploads/AI-Alignment-A-Comprehensive-Survey.pdf
[241] Benedict Vigers and Justin Lall, “Americans prioritize AI safety and data security,” Gallup, Sep. 16, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://news.gallup.com/poll/694685/americans-prioritize-safety-data-security.aspx
[242] Michael Power, The Audit Society: Rituals of Verification. Oxford University Press, 1997. [Online].
Available: https://doi.org/10.1093/acprof:oso/9780198296034.001.0001
[243] Henri Guénin-Paracini, Bertrand Malsch, and Anne Marché Paillé, “Fear and risk in the audit process,” Accounting, Organizations and Society, vol. 39, no. 4, pp. 264–288, 2014. [Online]. Available: https://doi.org/10.1016/j.aos.2014.02.001
[244] Mandy M. Cheng, Wendy J. Green, and John Chi Wa Ko, “The impact of strategic relevance and assurance of sustainability indicators on investors’ decisions,” Auditing: Journal of Practice and Theory, vol. 34, no. 1, pp. 131–162, 2015. [Online]. Available: https://doi.org/10.2308/ajpt-50738
[245] Travis P. Holt, “An examination of nonprofessional investor perceptions of internal and external auditor assurance,” Behavioral Research in Accounting, vol. 31, no. 1, pp. 65–80, Mar. 2019. [Online]. Available: https://doi.org/10.2308/bria-52276
[246] Jennifer Wang, Kayla Huang, Kevin Klyman, and Rishi Bommasani, “Do AI companies make good on voluntary commitments to the White House?” 2025, arXiv:2508.08345. [Online]. Available: https://arxiv.org/abs/2508.08345
[247] European Commission, “The general-purpose AI code of practice.” Accessed: Dec. 21, 2025. [Online]. Available: https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai
[248] FDA, “Standardization of retail food safety inspection personnel,” Sep. 16, 2025. Accessed: Jan. 12, 2026. [Online]. Available: https://www.fda.gov/food/retail-food-protection/standardization-retail-food-safety-inspection-personnel
[249] Jennifer Pak, “Foreign infant milk formula still highly coveted in China 10 years after the melamine scandal,” Marketplace, Oct. 24, 2018. Accessed: Dec. 21, 2025. [Online]. Available: https://www.marketplace.org/story/2018/10/24/foreign-infant-milk-formula-still-highly-coveted-china-10-years-after-melamine
[250] OpenAI, “Sycophancy in GPT-4o: What happened and what we’re doing about it,” Apr. 29, 2025. Accessed: Dec. 21, 2025. [Online].
Available: https://openai.com/index/sycophancy-in-gpt-4o/
[251] Chase DiFeliciantonio, “California, Delaware AGs blast OpenAI over ChatGPT after teen suicide,” POLITICO, Sep. 5, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://www.politico.com/news/2025/09/05/california-delaware-ags-blast-openai-over-youth-safety-00546677
[252] Deborah Blum, The Poison Squad: One Chemist’s Single-Minded Crusade for Food Safety at the Turn of the Twentieth Century. Penguin Publishing Group, Sep. 24, 2019, 369 pp., ISBN: 978-0-14-311112-2.
[253] U.S. Consumer Product Safety Commission, “CPSC celebrates 50 years of making consumer safety our mission,” Apr. 15, 2022. Accessed: Dec. 21, 2025. [Online]. Available: https://www.cpsc.gov/Newsroom/News-Releases/2022/CPSC-Celebrates-50-Years-of-Making-Consumer-Safety-our-Mission
[254] UL Solutions, “UL Marks for code authorities.” Accessed: Jan. 10, 2026. [Online]. Available: https://code-authorities.ul.com/ulmarks/
[255] “CE marking,” Your Europe, Nov. 20, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://europa.eu/youreurope/business/product-requirements/labels-markings/ce-marking/index_en.htm
[256] European Commission, “Radio equipment directive (RED).” Accessed: Dec. 21, 2025. [Online]. Available: https://single-market-economy.ec.europa.eu/sectors/electrical-and-electronic-engineering-industries-eei/radio-equipment-directive-red_en
[257] European Commission, “Notified bodies.” Accessed: Dec. 21, 2025. [Online]. Available: https://single-market-economy.ec.europa.eu/single-market/goods/building-blocks/notified-bodies_en
[258] European Commission, “CE marking,” Directorate-General for Internal Market, Industry, Entrepreneurship and SMEs. Accessed: Jan. 10, 2026. [Online]. Available: https://single-market-economy.ec.europa.eu/single-market/goods/ce-marking_en
[259] Clarke Williams Insurance Brokers, “CE marking extension: What it means for product liability insurance.” Accessed: Jan. 10, 2026. [Online].
Available: https://clarkewilliamsinsurancebrokers.co.uk/blog/ce-marking-extension-what-it-means-for-product-liability-insurance/
[260] U.S. Consumer Product Safety Commission, “SaferProducts.gov,” CPSC. Accessed: Jan. 10, 2026. [Online]. Available: https://www.saferproducts.gov/
[261] U.S. Consumer Product Safety Commission, “Recalls.” Accessed: Jan. 10, 2026. [Online]. Available: https://www.cpsc.gov/Recalls
[262] Charles Perrow, Normal Accidents: Living with High-Risk Technologies, Updated edition. Princeton University Press, 1999, ISBN: 978-0-691-00412-9. [Online]. Available: https://press.princeton.edu/books/paperback/9780691004129/normal-accidents
[263] Diane Vaughan, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press, Jan. 2016, ISBN: 978-0-226-34682-3. Accessed: Jan. 12, 2026. [Online]. Available: https://press.uchicago.edu/ucp/books/book/chicago/C/bo22781921.html
[264] A. Feder Cooper, Karen Levy, and Christopher De Sa, “Accuracy-efficiency trade-offs and accountability in distributed ML systems,” in Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, ser. EAAMO ’21, New York, NY, USA: Association for Computing Machinery, 2021, pp. 1–11, ISBN: 978-1-4503-8553-4. DOI: 10.1145/3465416.3483289 [Online]. Available: https://dl.acm.org/doi/10.1145/3465416.3483289
[265] Georg Rilinger, Failure by Design: The California Energy Crisis and the Limits of Market Planning. University of Chicago Press, 2024. [Online]. Available: https://press.uchicago.edu/ucp/books/book/chicago/F/bo219240583.html
[266] World Nuclear Association, “Chernobyl accident 1986,” Feb. 17, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://world-nuclear.org/information-library/safety-and-security/safety-of-plants/chernobyl-accident
[267] Office of Safety and Mission Assurance, “Safety culture,” NASA. Accessed: Dec. 21, 2025. [Online].
Available: https://sma.nasa.gov/sma-disciplines/safety-culture
[268] Cairn Risk Consulting, “Decoding the safety culture ladder (part 1): Five levels of organisational maturity,” Dec. 2, 2024. Accessed: Jan. 10, 2026. [Online]. Available: https://cairnrisk.com/knowledge_bank/decoding-the-safety-culture-ladder-part-1-five-levels-of-organisational-maturity/
[269] National Business Aviation Association, “‘Perception is reality’ – why it’s important to show your dedication to aviation safety,” Oct. 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://nbaa.org/news/business-aviation-insider/2025-09/perception-is-reality-why-its-important-to-show-your-dedication-to-aviation-safety/
[270] National Safety Council, “Deaths by transportation mode,” Safety Topics. Accessed: Dec. 21, 2025. [Online]. Available: https://injuryfacts.nsc.org/home-and-community/safety-topics/deaths-by-transportation-mode/
[271] Federal Aviation Administration, “Design approvals.” Accessed: Dec. 21, 2025. [Online]. Available: https://www.faa.gov/aircraft/air_cert/design_approvals
[272] Federal Aviation Administration, “How does the FAA certify aircraft?” Accessed: Dec. 21, 2025. [Online]. Available: https://www.faa.gov/aircraft/air_cert/airworthiness_certification
[273] Federal Aviation Administration, “Section 7: Safety, accident, and hazard reports,” in Aeronautical Information Manual. Accessed: Dec. 21, 2025. [Online]. Available: https://www.faa.gov/air_traffic/publications/atpubs/aim_html/chap7_section_7.html
[274] Federal Aviation Administration, “Key grant programs,” U.S. Department of Transportation. Accessed: Dec. 21, 2025. [Online]. Available: https://www.transportation.gov/rural/grant-toolkit/usdot-competitive-grants-by-agency/faa
[275] U.S. Department of Transportation, “Aviation systems engineering,” Volpe National Transportation Systems Center, May 14, 2025. Accessed: Dec. 21, 2025. [Online].
Available: https://www.volpe.dot.gov/our-work/air-traffic-systems-operations/aviation-systems-engineering
[276] “United States v. The Boeing Company, Court Docket No.: 4:21-CR-005-O (N.D. Texas),” 2025. [Online]. Available: https://www.justice.gov/criminal/criminal-fraud/case/united-states-v-boeing-company
[277] House Committee on Transportation and Infrastructure, “The Boeing 737 MAX: Examining the Design, Development, and Marketing of the Aircraft,” U.S. House of Representatives, Final Committee Report, Sep. 2020. [Online]. Available: https://transportation.house.gov/imo/media/doc/2020.09.15%20FINAL%20737%20MAX%20Report%20for%20Public%20Release.pdf
[278] U.S. Congress, “Aircraft Certification, Safety, and Accountability Act, Pub. L. No. 116-260, 116th Cong.” 2020. [Online]. Available: https://www.congress.gov/bill/116th-congress/house-bill/8408
[279] Federal Aviation Administration, “Out front on airline safety: Two decades of continuous evolution,” Aug. 2, 2018. Accessed: Dec. 21, 2025. [Online]. Available: https://www.faa.gov/newsroom/out-front-airline-safety-two-decades-continuous-evolution
[280] International Civil Aviation Organization, “Latest ICAO aviation safety data reveals need for renewed focus, despite continuous long-term improvements,” Aug. 11, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://www.icao.int/news/latest-icao-aviation-safety-data-reveals-need-renewed-focus-despite-continuous-long-term
[281] PTES, “Penetration testing execution standard.” Accessed: Jan. 10, 2026. [Online]. Available: http://www.pentest-standard.org/index.php/Main_Page
[282] Bank of England, “CBEST threat intelligence-led assessments implementation guide.” Accessed: Jan. 10, 2026. [Online].
Available: https://www.bankofengland.co.uk/financial-stability/operational-resilience-of-the-financial-sector/cbest-threat-intelligence-led-assessments-implementation-guide
[283] European Central Bank, “TIBER-EU framework: How to implement the European Framework for threat intelligence-based ethical red teaming,” Jan. 2025. [Online]. Available: https://www.ecb.europa.eu/pub/pdf/other/ecb.tiber_eu_framework_2025~b32eff9a10.en.pdf?0309990e5e167a47ca4748370a949064
[284] PCI Security Standards Council, “Penetration testing guidance,” Version Number: 1.1. [Online]. Available: https://listings.pcisecuritystandards.org/documents/Penetration-Testing-Guidance-v1_1.pdf
[285] Karen Scarfone, Murugiah Souppaya, Amanda Cody, and Angela Orebaugh, “NIST SP 800-115: Technical guide to information security testing and assessment,” NIST, Sep. 2008. Accessed: Dec. 21, 2025. [Online]. Available: https://www.nist.gov/privacy-framework/nist-sp-800-115
[286] Google DeepMind, “Frontier safety framework (version 1.0),” Google, 2024. [Online]. Available: https://perma.cc/3C44-RSAN
[287] OpenAI, “GPT-4o system card,” 2024. [Online]. Available: https://cdn.openai.com/gpt-4o-system-card.pdf
[288] Anthropic, “System Card: Claude Sonnet 4.5,” Sep. 2025. [Online]. Available: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf
[289] Irregular, “Irregular x OpenAI: Evaluating GPT-5’s cybersecurity capabilities,” Aug. 7, 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://www.irregular.com/publications/evaluating-gpt-5
[290] OpenAI, “GPT-4 system card,” 2023. [Online]. Available: https://cdn.openai.com/papers/gpt-4-system-card.pdf
[291] Amazon, “Evaluating the critical risks of Amazon’s Nova Premier under the frontier model safety framework,” Amazon Science, Jul. 10, 2025. Accessed: Jan. 10, 2026. [Online].
Available: https://assets.amazon.science/59/0e/fd60057640a4a02bcd3072029958/evaluating-the-critical-risks-of-amazons-nova-premier-under-the-frontier-model-safety-framework.pdf
[292] OpenAI, “Detecting and reducing scheming in AI models,” Dec. 18, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
[293] OpenAI, “OpenAI o1 system card,” Dec. 5, 2024. [Online]. Available: https://cdn.openai.com/o1-system-card-20241205.pdf
[294] Anthropic, “Tracing model outputs to the training data,” Aug. 8, 2023. Accessed: Dec. 22, 2025. [Online]. Available: https://www.anthropic.com/research/influence-functions
[295] Anthropic, “Tracing the thoughts of a large language model.” Accessed: Dec. 21, 2025. [Online]. Available: https://www.anthropic.com/research/tracing-thoughts-language-model
[296] Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, and Vlad Mikulik, “Chain of thought monitorability: A new and fragile opportunity for AI safety,” 2025, arXiv:2507.11473v2. [Online]. Available: https://arxiv.org/abs/2507.11473
[297] Alexandra Chouldechova, Chad Atalla, Solon Barocas, A. Feder Cooper, Emily Corvi, P.
Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Matthew Vogel, Hannah Washington, and Hanna Wallach, “A shared standard for valid measurement of generative AI systems’ capabilities, risks, and impacts,” 2024, arXiv:2412.01934. [Online]. Available: https://arxiv.org/abs/2412.01934
[298] Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna, “AI and the everything in the whole wide world benchmark,” 2021, arXiv:2111.15366. [Online]. Available: https://arxiv.org/abs/2111.15366
[299] Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N. Halgamuge, “Inadequacies of large language model benchmarks in the era of generative artificial intelligence,” in IEEE Transactions on Artificial Intelligence, 1, vol. 7, Institute of Electrical and Electronics Engineers (IEEE), Jan. 2026, pp. 22–39. DOI: 10.1109/tai.2025.3569516 [Online]. Available: http://dx.doi.org/10.1109/TAI.2025.3569516
[300] Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, and Dan Hendrycks, “Safetywashing: Do AI safety benchmarks actually measure safety progress?” 2024, arXiv:2407.21792. [Online]. Available: https://arxiv.org/abs/2407.21792
[301] Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, and Hanna Wallach, “Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming,” in The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track, 2025. [Online]. Available: https://openreview.net/forum?id=d7hqAhLvWG
[302] Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary C. Lipton, and Hoda Heidari, “Red-Teaming for Generative AI: Silver Bullet or Security Theater?” 2024, arXiv:2401.15897. [Online].
Available: https://arxiv.org/abs/2401.15897
[303] Yarin Gal, “Towards a science of AI evaluations,” Mar. 11, 2024. Accessed: Jan. 10, 2026. [Online]. Available: https://www.cs.ox.ac.uk/people/yarin.gal/website/blog_98A8.html
[304] Tom Reed, Tegan McCaslin, and Luca Righetti, “What do model reports say about their ChemBio benchmark evaluations? Comparing recent releases to the STREAM framework,” 2025, arXiv:2510.20927. [Online]. Available: https://arxiv.org/abs/2510.20927
[305] Maxwell Zeff, “OpenAI ships GPT-4.1 without a safety report,” TechCrunch, Apr. 15, 2025. Accessed: Dec. 22, 2025. [Online]. Available: https://techcrunch.com/2025/04/15/openai-ships-gpt-4-1-without-a-safety-report/
[306] Shakeel Hashim, “Google breaks its promises,” Transformer, Dec. 5, 2024. [Online]. Available: https://www.transformernews.ai/p/google-breaks-its-promises
[307] David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, and Vardan Papyan, “LLM censorship: A machine learning challenge or a computer security problem?” 2023, arXiv:2307.10719. [Online]. Available: https://arxiv.org/abs/2307.10719
[308] OpenAI, “ChatGPT agent system card,” OpenAI, Jul. 17, 2025. [Online]. Available: https://cdn.openai.com/pdf/6bcccca6-3b64-43cb-a66e-4647073142d7/chatgpt_agent_system_card_launch.pdf
[309] Anthropic, “Detecting and countering misuse of AI: August 2025,” Aug. 27, 2025. Accessed: Jan. 12, 2026. [Online]. Available: https://www.anthropic.com/news/detecting-countering-misuse-aug-2025
[310] Google Threat Intelligence Group, “Adversarial misuse of generative AI,” Jan. 30, 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://cloud.google.com/blog/topics/threat-intelligence/adversarial-misuse-generative-ai
[311] SaferAI, “Methodology,” Risk Management Ratings. Accessed: Jan. 12, 2026. [Online]. Available: https://ratings.safer-ai.org/methodology/
[312] AI Shortlist, “AI safety index 2025: How major AI companies stack up.” Accessed: Jan. 12, 2026.
[Online]. Available: https://aishortlist.tech/blog/ai-safety-index-2025
[313] Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, and Adam Swanda, “Death by a thousand prompts: Open model vulnerability analysis,” 2025, arXiv:2511.03247. [Online]. Available: https://arxiv.org/abs/2511.03247
[314] Shanghai AI Lab and Concordia AI, “Frontier AI risk management framework,” Jul. 2025. [Online]. Available: https://concordia-ai.com/wp-content/uploads/2025/07/Frontier-AI-Risk-Management-Framework-v1.0-en.pdf
[315] Samm Sacks, “Data security and U.S.-China tech entanglement,” Lawfare, 2020. [Online]. Available: https://www.lawfaremedia.org/article/data-security-and-us-china-tech-entanglement
[316] Zhangyu Wang, Benjamin Gregg, and Li Du, “Regulatory barriers to US-China collaboration for generative AI development in genomic research,” Cell Genomics, vol. 4, no. 6, p. 100564, 2024. [Online]. Available: https://doi.org/10.1016/j.xgen.2024.100564
[317] Liv McMahon, “AI system resorts to blackmail if told it will be removed,” BBC, May 23, 2025. Accessed: Dec. 21, 2025. [Online]. Available: https://www.bbc.com/news/articles/cpqeng9d20go
[318] OpenAI, “Security & privacy,” 2025. Accessed: Jan. 12, 2026. [Online]. Available: https://openai.com/security-and-privacy/
[319] Anthropic, “Anthropic trust center,” 2025. Accessed: Jan. 10, 2026. [Online]. Available: https://trust.anthropic.com/
[320] Google Cloud, “Google Cloud trust center.” Accessed: Jan. 10, 2026. [Online]. Available: https://cloud.google.com/trust-center
[321] AI Evaluator Forum, “Our members.” Accessed: Dec. 21, 2025. [Online]. Available: https://aievaluatorforum.org/about/members
[322] “International network of AI safety institutes,” NIST, Nov. 21, 2024. Accessed: Dec. 22, 2025. [Online].
Available: https://www.nist.gov/system/files/documents/2024/11/20/Mission%20Statement%20-%20International%20Network%20of%20AISIs.pdf
[323] Department for Science, Innovation and Technology, AI Security Institute, and Kanishka Narayan, “Efforts to share best practices on AI measurement and evaluations driven forward through the International Network for Advanced AI Measurement, Evaluation and Science,” 2025. Accessed: Dec. 22, 2025. [Online]. Available: https://www.gov.uk/government/news/efforts-to-share-best-practices-on-ai-measurement-and-evaluations-driven-forward-through-the-international-network-for-advanced-ai-measurement-evalua
[324] AISI, “International joint testing exercise: Agentic testing,” AI Security Institute. Accessed: Dec. 22, 2025. [Online]. Available: https://www.aisi.gov.uk/blog/international-joint-testing-exercise-agentic-testing
[325] AISI, “AISI research & publications,” AI Security Institute. Accessed: Dec. 22, 2025. [Online]. Available: https://www.aisi.gov.uk/research
[326] Esther Duflo, Michael Greenstone, Rohini Pande, and Nicholas Ryan, “Truth-telling by third-party auditors and the response of polluting firms: Experimental evidence from India,” The Quarterly Journal of Economics, vol. 128, no. 4, pp. 1499–1545, Nov. 2013. DOI: 10.1093/qje/qjt024
[327] Frontier Model Forum, “Frontier Model Forum - about us.” Accessed: Dec. 21, 2025. [Online]. Available: https://www.frontiermodelforum.org/about-us/
[328] Stephen Casper, Luke Bailey, and Tim Schreier, “Practical principles for AI cost and compute accounting,” 2025, arXiv:2502.15873. DOI: 10.48550/arXiv.2502.15873
[329] Girish Sastry, Lennart Heim, Haydn Belfield, Markus Anderljung, Miles Brundage, Julian Hazell, Cullen O’Keefe, Gillian K. Hadfield, Richard Ngo, Konstantin Pilz, George Gor, Emma Bluemke, Sarah Shoker, Janet Egan, Robert F.
Trager, Shahar Avin, Adrian Weller, Yoshua Bengio, and Diane Coyle, “Computing power and the governance of artificial intelligence,” 2024, arXiv:2402.08797. DOI: 10.48550/arXiv.2402.08797
[330] Lennart Heim and Leonie Koessler, “Training compute thresholds: Features and functions in AI regulation,” 2024, arXiv:2405.10799. DOI: 10.48550/arXiv.2405.10799

A Glossary

Abstraction error: Forming the wrong conclusion by treating a partial or simplified unit of analysis (e.g., evaluating a specific component in isolation) as if it were sufficient to assess overall system and organizational risk.
Accidents: Harms arising from AI systems behaving in unintended ways [123].
AI (artificial intelligence): Digital systems that are capable of performing tasks commonly thought to require intelligence, with these tasks typically learned via data and/or experience [160].
AI Assurance Levels (AALs): A proposed standardized vocabulary for expressing how much weight to place on audit findings, i.e., how confident a skeptical third party can be in an auditor’s conclusions given the engagement’s access, evidence, and methods.
AI assurance tracker: A proposed public platform (maintained by the independent oversight body proposed within Section 6.4) that would show, in a standardized format, items such as each frontier AI company’s stated policies, applicable regulations, incident reports, lead auditor, and recent safety/security publications, updated as relevant changes occur.
Alignment: Effort to ensure that “AI systems behave in line with human intentions and values” [240].
Assessment: Any effort to determine the properties of an AI system or an AI company/developer, whether based on public or non-public information and whether rigorous or informal.
Assurance level: The degree of confidence that can reasonably be placed in an audit’s conclusions, determined in part by the audit’s scope, access, and methods.
Audit: A systematic, evidence-based process in which a qualified party examines an organization’s activities, records, technologies, and claims to provide assurance that stated information is accurate and/or that applicable standards are being met.
Audit report: A document produced by an independent auditor that communicates the results of a frontier AI audit in a way that external stakeholders can rely on. It should include the audit’s scope, assurance level (AAL), auditor’s conclusions, the reasoning behind those conclusions, conditions under which the conclusion is valid, and recommendations for remediation. A full, unredacted version may be shared with a company’s board/executives, with a redacted or summarized version released for external stakeholders.
Black-box access: A level of model access in which evaluators can query the model through an external interface (e.g., API) and observe its outputs, but cannot inspect internal components such as weights, activations, or intermediate computations. Black-box access is typical of most current third-party evaluations [5].
Chain-of-thought: Intermediate reasoning steps generated by a model when solving problems, either elicited explicitly through prompting techniques or captured through access to internal model traces. Chain-of-thought can reveal capabilities and risks not apparent from final outputs alone.
Clean rooms: Physically or logically isolated digital environments where sensitive information can be analyzed with reduced risk of external exposure.
Clean teams: Personnel who have access to confidential information but are insulated from competitive decision-making within their organization [166].
Closed-weight model: An AI model that can only be accessed through an API or similar means, so that its code and weights are not directly accessible.
Compartmentalization (of an audit): A structuring approach where different auditors learn about and assess different aspects of a company’s operations [168].
Completeness problem: At high assurance levels, the core verification challenge of confirming that all relevant systems, training runs, and governance processes have been surfaced, i.e., that nothing material has been omitted.
Consortium (of auditors): Two or more organizations jointly conducting audits, with no organization serving as the lead auditor.
Continuous monitoring: Ongoing monitoring of a frontier AI system (or other aspects of a frontier AI company) in order to detect changes that might invalidate previous audit conclusions.
Defense in depth: The idea that assessment is needed at multiple different lifecycle stages.
Emergent social phenomena: Risks that arise from interaction between humans and AI systems and do not fit neatly into “misuse” or “unintended behavior,” but can nevertheless cause significant harm if left unaddressed. Examples include addiction to or emotional dependence on AI systems, AI-induced or AI-enabled psychosis, and facilitation of self-harm.
Evaluation: Any activity that measures, characterizes, or analyzes properties of AI models or systems and the organizations operating them.
Event-triggered reviews: Audits conducted following key events such as major training runs, model releases, significant incidents, or notable novel integrations.
Expectations gap: The gap between what the public believes audits guarantee (e.g., the absence of fraud) and the more limited mandate auditors actually operate under.
FlexHEGs (Flexible Hardware-Enabled Guarantees): A technical direction for on-chip, privacy-preserving verification of claims (for example, verifying that training runs have not exceeded certain compute thresholds without revealing proprietary details).
Flow-down agreements: Contract provisions under which certain obligations (such as those related to confidentiality) “flow down” from a lead contractor to subcontractors. In this context, the subcontractors are domain specialists hired by a lead auditor.
Frontier AI: General-purpose AI models and systems whose performance is no more than a year behind the state-of-the-art on a broad suite of general capability benchmarks.

Frontier AI auditing: Rigorous third-party verification of frontier AI developers’ safety and security claims and evaluation of their systems and practices against relevant standards, drawing on deep, secure access to non-public information.

Frontier AI developers: Companies that train models from scratch themselves or significantly extend their capabilities (e.g., via further training or creation of an agentic scaffold), producing AI systems that qualify as frontier-level per the definition above.

Goodhart’s Law: The phenomenon that when specific metrics become targets for compliance, actors optimize for those targets rather than the underlying outcomes the metrics were meant to capture.

Gray-box access: A level of model access that provides evaluators with some visibility into internal model components beyond what is available through standard interfaces, such as chain-of-thought outputs, logits, or sampling of internal logs, but without full access to model weights or complete system internals. Gray-box access is intermediate between black-box and white-box access [5].

Hardware attestation: A verification mechanism in which hardware components provide cryptographically signed evidence about their identity, configuration, and operating state.

High-tech pathway: One pathway toward higher assurance that focuses on developing new technical infrastructure for secure information sharing that reduces how much any single party must be trusted.

Independence: A core principle for audits: results should be trustworthy because auditors are genuinely independent third parties and conflicts of interest are carefully managed.
Information security: A risk category covering failures of confidentiality or integrity affecting critical AI assets, including theft of model weights, sensitive research, or customer data; risks to user privacy; sabotage of highly capable AI systems; and unauthorized use of compute resources.

Intentional misuse: The use of frontier AI systems by malicious actors to enable or scale harmful activities (examples include cyberattacks; chemical, biological, radiological, or nuclear weapons development; large-scale disinformation; violent and criminal activity; fraud; and generation of CSAM or NCII).

Inter-rater reliability: The extent to which different auditors converge on the same conclusions given the same evidence.

Live certification and deprecation: The idea that audit certifications should remain valid only while their underlying assumptions hold, and should automatically downgrade (or be flagged for review) when material changes occur.

Logits: The raw, unnormalized output values produced by a neural network before they are converted to probabilities (e.g., through a softmax function). Access to logits provides evaluators with more fine-grained information about model behavior and confidence than observing only the final outputs.

Low-tech pathway: One pathway toward higher assurance that brings auditors into the organization’s trust boundary using existing legal and physical infrastructure (e.g., corporate devices, clean room arrangements, and confidentiality obligations).

Misuse: A broad category of risks related to AI systems, specifically those that stem from the use of an AI system in a way that is different from its intended purpose. Misuse may or may not be malicious.

Model checkpoints: Saved snapshots of a model’s parameters at specific points during training or fine-tuning. Access to checkpoints enables evaluators to examine how model capabilities and behaviors evolve over the training process.
Open-weight model: AI models whose weights are publicly released and can be freely copied or redistributed.

Organizational perspective: The principle that culture, governance, and security matter, not just specific AI systems.

PCAOB: The Public Company Accounting Oversight Board, a US non-profit corporation established by the Sarbanes–Oxley Act of 2002. The PCAOB oversees audits of public companies by setting auditing standards, registering and inspecting audit firms, and disciplining auditors for misconduct. It coordinates with audit regulators in over 50 jurisdictions. In this paper, the PCAOB model is referenced as a potential template for independent oversight of frontier AI auditors.

Reasonable assurance: A term of art (from financial auditing) used to indicate a higher degree of confidence in an audit’s conclusions compared to one involving only “limited” assurance. In our framework, this corresponds to AAL-2, and we also consider still higher degrees of assurance beyond this at AAL-3 and AAL-4.

Safe harbor provisions: Conditional protections designed to encourage auditing and disclosure, modeled in the paper on regimes where entities that discover violations through systematic auditing, disclose promptly, and correct issues can receive reduced penalties or immunity under specified conditions (with exclusions for certain serious or bad-faith cases).

Safeguard: A technical measure or process designed to prevent AI systems from causing harm.

Safety: The functioning of AI systems in a way that avoids causing significant harm, ranging from accident risks (unintended harmful behavior due to factors such as misspecified goals, operator error, or system bias) to misuse risks (harms caused intentionally by the deployer or user of an AI system).

Safety case: Structured arguments supported by evidence that justify the safety of a system [131].
Security: The protection of the AI system itself as well as surrounding infrastructure, intellectual property, and user data against unauthorized access, exfiltration, manipulation, or disruption. In the frontier AI context this includes, for example, protecting model weights and other sensitive artifacts from theft, as well as preventing adversaries from hijacking an AI system to cause harm.

Structural risks: Risks emerging from how AI systems reshape systems, incentives, and environments in which they are deployed [123]. We intend for “emergent social phenomena” to be distinct from this category in the sense that emergent social phenomena, while distributed across society, are nevertheless directly connected to the deployment of specific, identifiable AI systems.

System cards: Descriptions of the properties and risk profile of AI systems. Introduced here and commonly used as a label for reports produced by frontier AI companies about their latest model or system releases. “Model card,” an earlier term, is often used for such documents as well, and while technically this denotes a model rather than system level of analysis, in practice, model cards often discuss system-level components and vice versa.

Transparency-security trade-off: The tension between making information open (which enables accountability, trust, and collaboration) and keeping it hidden (which protects against those who would exploit that knowledge to cause harm).

Treaty-grade verification: Very high assurance in which one can have confidence in audit conclusions even assuming the audited party will take every available opportunity to cut corners and deceive.
Unintended system behavior: AI systems behaving in ways unintended or unsafe from the perspective of developers and users that are serious enough to risk large-scale harm, including accidents in high-stakes deployment contexts caused by misaligned behavior (e.g., reward hacking), capability failures, biased outputs, or behaviors that circumvent human intent and effective human oversight.

Validity period: An explicit time period during which an audit finding or conclusion is treated as valid, with the expectation that conclusions should be deprecated or revisited when assumptions no longer hold due to system change.

Verification: The activity of confirming whether a specific claim, commitment, or property (e.g., an evaluation result, a training compute figure, or a claim about mitigation effectiveness) is true.

White-box access: The most comprehensive level of model access, granting evaluators full visibility into model weights, architecture, training data, and all internal components [5]. White-box access enables the deepest forms of technical analysis but requires the strongest intellectual property protections and is typically reserved for higher assurance levels.

Zero-knowledge proofs (ZKPs): A cryptographic protocol that enables one party (the prover) to demonstrate to another party (the verifier) that a statement is true without revealing any information beyond the statement’s validity itself.

B Additional motivations for frontier AI auditing

Enabling risk price discovery through insurance

Quantifying the actual risks of frontier AI systems remains a fundamental challenge. Expert judgment is valuable but struggles to aggregate dispersed information into actionable signals. Insurance markets offer a complementary mechanism: insurers have strong financial incentives to price risk accurately.
They can do this by translating private assessments of safety practices, loss histories, and exposure pathways into premiums, coverage terms, and exclusions that function as observable signals of risk. These markets cannot function without reliable information. Third-party audits give insurers the verified, standardized data they need to differentiate risk profiles across companies and systems. Without this, adverse selection prevents meaningful coverage or pricing. The societal value extends well beyond risk transfer. Insurance pricing is one of the few mechanisms that can translate diffuse uncertainty about AI risk into a single, continuously updating number. This gives policymakers an independent measure of risk to inform regulation. It gives the public a legible signal of whether safety is improving or deteriorating over time. And it creates a common reference point around which developers, regulators, insurers, and civil society can coordinate. A credible auditing ecosystem is the foundation on which such a market can be built. Maintaining international stability The development of frontier AI has profound implications for international security and stability. Intense competition between nations can create a “race to the bottom” dynamic, where actors may feel pressured to cut corners on safety to accelerate development. A lack of verified information about competitors’ AI capabilities can fuel destabilizing arms-race dynamics. Third-party auditing offers a crucial mechanism for de-escalation. By providing a trusted, neutral means to verify that all parties are adhering to shared safety commitments, audits can build confidence and reduce the incentive for competitive risk-taking. Competing governments are unlikely to grant each other deep access for meaningful verification. Independent auditors can serve as trusted intermediaries to confirm compliance with agreed-upon ground rules. 
Looking ahead, a well-developed international auditing regime can serve as a foundational platform for future treaties governing AI. Just as arms control agreements rely on inspection mechanisms to verify compliance, international agreements on AI safety will require a credible verification system. A shared audit framework creates a common vocabulary and benchmarks for risk, facilitating international cooperation and providing the tools needed to ensure that commitments are being met. This extends to security concerns like counter-proliferation, where audits of a developer’s operational security and cybersecurity provide assurance that powerful models will not be stolen by hostile actors.

28 This mirrors early cyber insurance markets, which struggled until standardized security assessments provided sufficient data.

29 It is not inconceivable that direct bidirectional access could be granted to some AI systems or components thereof; indeed, there is precedent for this in arms control contexts (e.g., mutual inspection of missile facilities). We are making the more conservative assumption that this is off the table, in order to be prepared for a wider range of possible scenarios.

Ensuring accountability for risk creation

As AI’s influence over society expands, the public requires strong evidence that frontier models do not place them at undue risk. Public skepticism toward corporate “safety-washing” is rising, creating demand for credible, external validation of safety claims, demand that auditing can directly address. Recent polling suggests that the American public supports independent expert evaluation of AI systems over self-assessment, and even prefers it to direct government testing [241]. Research across multiple industries consistently shows that auditing improves the perceived credibility of organizational claims compared with self-reporting alone [242, 243, 244, 245].
This is intuitive: organizations naturally seek to present themselves positively, whereas third parties will, assuming conflicts of interest are well managed, have better incentives to provide accurate risk assessments, since their institutions’ identities and reputations rest entirely on this capacity.

As policymakers develop and enact AI governance frameworks, they require reliable mechanisms to verify compliance. Currently, compliance with voluntary commitments is uneven [246]. As risks increase, companies will have an increasing need to comply with safety and security standards, alongside government regulation. Governments typically do not conduct audits directly, although they have a key role to play in standard-setting and enforcement after a violation has emerged. Frontier AI audits could demonstrate that companies are complying with laws, allowing regulators to hold companies accountable. Auditing can facilitate documentation and subsequent accurate and proportional assignment of liability in cases of safety and security incidents post-deployment.

There are also structural reasons to prefer that accountability be mediated through third-party auditors, rather than through direct government audits. Distributing authority reduces the concentration of power in any single entity and makes politically-motivated investigations less likely. Private sector auditors can offer specialized technical expertise and higher salaries, and scale more readily than government agencies.

30 While we generally emphasize ways incentives for responsible behavior can be increased where they might be lacking by default, the converse motivation is also relevant, namely avoiding excessive blame directed to a company that in fact behaved responsibly.
Favorable evidence from an audit could help exculpate a company in a lawsuit or regulatory context by establishing compliance with relevant best practices, and suggest that blame can be found elsewhere, such as on the user of a product.

C Access types (non-exhaustive)

Table 4: A non-exhaustive taxonomy of information sources that companies may provide access to, across model, system, governance, and operational domains, adapted from [178]. Public information is also included, as auditors should consider it alongside company-provided sources. The depth of access required will depend on the specific audit engagement and the assurance level sought.

Category: System access
  Sampling interfaces: Ability to query the model via API, specify sampling parameters, and access output probabilities and logits
  Production model variants: Access to deployed model versions with all safety mitigations in place, to assess real-world behavior
  Low-mitigation model variants: Access to versions of the model with minimal safety mitigations (e.g., “helpful-only” variants) to avoid refusal contamination during capability evaluation
  Fine-tuning: Ability to fine-tune models through supervised learning, reinforcement learning, or custom loss functions
  Model internals: Access to activations, attention patterns, gradients, embeddings, chain-of-thought traces (when available), and raw outputs
  Routing algorithms: For mixtures-of-experts (MoEs) and other multi-model systems, access to routing policies and related algorithms
  Model weights: Privacy-preserving access to model weights for interpretability and verification

Category: System information
  Model specifications: System prompts, architectural details, hyperparameters, and training data summaries
  Model families and lineage: Collections of models of varying sizes and fine-tuning levels, including checkpoint histories
  Architecture and training documentation: Detailed model architecture, training procedures, and design decisions
  Evaluation results and artifacts: Results from internal and third-party evaluations of model capabilities, limitations, and potential risks, including methodology documentation, test datasets, prompts, scoring rubrics, and logs of model outputs
  System documentation: Documents explaining how production systems work and are used
  Monitoring systems: Tools and dashboards for observing model behavior in production
  System logs: Records of events, operations, and state changes within deployed systems
  Compute accounting records: Compute allocation logs, training run records, access controls, hardware configuration, and declared vs. logged compute reconciliation

Category: Governance and process
  Process documentation: Internal documents describing procedures and responsibilities
  Board and governance minutes: Records of discussions, decisions, and actions from board and governance meetings
  Internal reports: Documents describing experiment results, incidents, or process outcomes
  Process communications: Approval emails, request tickets, escalation threads, and decision logs
  Previous compliance reviews: Reports and notes from prior audits or compliance reviews
  Organization charts: Diagrams showing roles, reporting lines, and responsibilities
  Written representations: Signed management statements confirming responsibilities, activities, or factual matters

Category: Operational and contextual
  Staff interviews: Structured interviews with personnel on how processes function in practice
  Governance interviews: Interviews with senior executives and board members
  Process-owner interviews: Interviews with employees leading specific processes
  Casual conversations: Informal conversations with employees
  Meeting attendance: Observation of meetings between employees, leadership, or external parties
  Walkthroughs: Physical or virtual observation tracing a workflow from start to finish
  Operational communications: Emails, message threads, and call summaries on safety-relevant events
  External inquiries: Inquiries with external parties to clarify the existence and extent of engagement

Category: External feedback
  User reports and complaints: Bug reports, safety complaints, vulnerability disclosures, and feedback from users
  Third-party correspondence: Communications with external researchers, civil society, regulators, or other stakeholders raising concerns or sharing findings

Category: Public information
  Company public outputs: System cards, model cards, blog posts, press releases, social media posts, and interviews
  Regulatory filings and disclosures: Mandatory regulatory disclosures, incident reports filed with authorities, and compliance certifications
  Published research: Academic papers, technical reports, and preprints by company employees
  External commentary and analysis: Third-party analyses, media coverage, civil society reports, and independent research on the company’s systems
  Public user feedback: Public reviews, complaints, and feedback

D Frontier AI auditing in context

Frontier AI safety and security requires three elements: standards for sufficient safety and security, companies having incentives to follow such standards, and the public and other stakeholders having evidence that these standards are being followed. Auditing is independent of standard-setting, meaning companies can be audited against their own policies, applicable laws, or industry best practices.

When discussing auditing, we often emphasize verifying companies’ safety and security claims. However, if a company makes only trivial claims (e.g., “we thought about safety before deploying this system”) or none at all, verification alone has limited value. We therefore advocate that frontier AI auditing should assess companies against relevant standards (including informal best practices), not just their own claims.
Additionally, government-mandated or private sector-based transparency requirements can enable more effective auditing by ensuring that there are meaningful claims to audit in certain key areas.

Audits provide evidence that companies follow relevant standards. Companies can also voluntarily or mandatorily disclose information to the public (what we call transparency), which serves as a baseline that auditing builds upon. Auditing can verify information and evaluate properties that are too sensitive to be fully disclosed publicly, and ensure that publicly disclosed information is accurate and only redacted where appropriate. This can create an additional incentive to follow these standards, from the public sector (e.g., fines and litigation risks if laws are violated) as well as from the private sector (e.g., lower insurance premiums if certain practices are followed).

Additionally, company whistleblowing policies and government-imposed protections for whistleblowers are complementary to auditing and transparency requirements. If company staff believe that they will be protected if they inform government agencies that a statement by the company is misleading or that a company practice is dangerous, such misleading statements and dangerous behavior are less likely to occur in the first place.

31 The term transparency is sometimes used to refer to companies privately disclosing information to a regulator or downstream providers, as in the case of the EU General-Purpose AI Code of Practice [247]. In this paper, we reserve the term transparency for cases where information is disclosed fully publicly.

E Lessons from assessment in diverse domains

In domains from finance to food safety, third-party auditors remain independent of the companies they assess while analyzing non-public information and enabling public trust in the industrial sectors they audit. In this appendix, we outline a sample of such domains.
The sample spans various assessed scales, from whole organizations (e.g., financial auditing) to specific, tangible outputs (e.g., consumer products). Although risk assurance practices in these domains generally exceed those in frontier AI (in rigor, maturity, and scale), we do not claim that they are gold standards to emulate in every respect. The lessons are not uniformly positive and do not solve the distinctive challenges of frontier AI auditing. We aim to extend best practices from diverse domains while avoiding their failures.

E.1 Food safety testing

Food safety focuses on preventing harm from products consumed by the public through food safety standards, systematic testing, and ongoing monitoring from production to consumption [248]. When functioning properly, this defense-in-depth approach operates through multiple independent checkpoints. Farmers and dairy cooperatives conduct initial quality tests on raw milk; processing facilities perform intake testing before production; and government regulators conduct random sampling throughout the supply chain. Because each layer tests independently using different methods and incentives, contamination is likely to be caught at one stage even if it evades another.

The importance of food safety is illustrated by its failures, such as the 2008 Chinese milk scandal [61]. A baby formula producer sold products contaminated with melamine, leading to the hospitalization of thousands of babies and at least a decade of distrust for Chinese milk products [62]. Subsequent government milk recalls (a sign of a functioning food safety system) have yet to restore consumer trust in milk products in China. The crisis is so well remembered among Chinese consumers that foreign milk products are still highly sought after [249]. Safety system failures are thus also a crisis for the implicated industry. Frontier AI systems can also have safety regressions, causing backlash against the whole AI enterprise [250, 251].
Regressions may go undetected without regular testing analogous to food safety testing. Unlike food, which benefits from a well-understood body of scientific work over the last century [252], the science of frontier AI safety and security is still rapidly developing. Key lessons for frontier AI auditing are:

Effective safety culture involves “defense in depth” with product testing entering at several different stages and looking for multiple types of failure, at different levels of granularity.

Safety system failures produce widespread distrust and product avoidance that propagate across companies and can last for many years.

Modern food safety benefits from more than a century of research and development. The breadth, scale, and impact of frontier AI systems require a similar or greater level of investment in assurance methods in a far shorter time period.

E.2 Consumer product safety

Independent testing organizations like Underwriters Laboratories (UL) [60] and independent government agencies like the Consumer Product Safety Commission [253] have developed rigorous standards and protocols over many decades. These protocols cover tens of thousands of product types, enabling firms around the globe to bring the latest technologies to market safely. The standards instituted by these organizations and others like them address categories ranging from children’s toys to electronics, identifying potential hazards before products reach consumers. The UL mark, applied after products have been certified to meet safety standards, appears on 22 billion products annually [254].

In the EU, regimes such as CE marking and the Radio Equipment Directive (RED) show that self-declaration can be effectively combined with third-party assessment [255, 256]. Manufacturers are allowed to declare that their product complies with the relevant safety directives, but they must base this declaration on testing against harmonized standards, keep detailed technical documentation, and accept legal responsibility if the product is later deemed unsafe or causes harm. For high-risk products, a “notified body” (an accredited third-party conformity assessor) must review the design, perform tests, and issue reports that verify the manufacturer’s declaration [257].

This creates aligned incentives across participants. Manufacturers need CE/RED compliance to access the EU market and satisfy the requirements of retailers and insurers [258, 259]. Retailers require proper documentation, and insurers price coverage based on conformity assessment evidence.
Regulators and market-surveillance authorities spot-check products and can order recalls or fines, giving teeth to self-declaration. End users benefit from clear safety marks and multiple actors incentivized to keep noncompliant products off the market. While external testing labs are sought after for the trust and marketing credibility they provide, the proliferation of independent test organizations has also been driven by regulatory requirements. The Testing, Inspection, and Certification (TIC) landscape demonstrates that various incentives can compel voluntary participation, but regulatory mandates have often been necessary to ensure full participation. New specialized testing labs also contribute to participation. Each year, startups are launched to address niche safety risks in specific and often emerging product categories. Despite this long-tail diversity, the TIC space remains largely dominated by a handful of large organizations that cover many categories. Beyond classic factors like brand recognition and shared support costs, these firms benefit from economies of scale that only broad-ranging testing organizations can achieve: joint processes to gain and maintain many accreditations, larger networks of field engineers spanning wider geographic footprints, and shared labs and equipment deployable across many product types. Because test labs must constantly demonstrate competencies that are expensive for product manufacturers to replicate, technical expertise tends to concentrate in organizations that can apply it across many categories simultaneously. Finally, while product safety testing is proactive, other mechanisms exist that are naturally reactive to emerging or unforeseen safety hazards. Consumers can report when they are harmed by a product [260], and product recalls may be issued as a result [261]. 
Such practices prevent repeated harms by removing hazardous products from the market and, importantly, by informing the design of safer next-generation products. These practices generate data detailing real-world impacts and circumstances. Similar approaches are already being applied to AI products, but their scope and integration with auditing practices are limited [88]. Key lessons for frontier AI auditing are:

While there are immediate concerns regarding the dearth of qualified frontier AI auditing organizations, the growth of third-party product test labs shows that a large enough market demand for assessment can eventually produce qualified market actors.

The competency of third-party test labs can be certified by external actors (“accreditation bodies”) against recognized standards and may be mandated by regulators.

Product safety’s “trust marks” and government pre-clearance provide demand drivers elsewhere that could become relevant to frontier model audits, while post-market surveillance serves as a reactive complement, revealing the successes and failures of the broader audit community.

E.3 Safety-critical systems engineering

Safety-critical systems engineering is used in domains like aviation, nuclear power, and civil infrastructure where failures can have catastrophic consequences [63]. Modern safety-critical systems engineering treats the safety of a high-stakes system as an emergent property arising from interactions and control relationships within complex sociotechnical systems [262, 263, 264, 265]. The discipline employs structured methodologies (including hazard analysis techniques) to proactively identify hazards, quantify risks’ severity and probability, and maintain continuous risk management tracking systems.

A key belief in safety-critical systems engineering is that catastrophic accidents are often anticipatable and avoidable with sufficient attention to process and sufficient incorporation of lessons from near-misses along the way. For instance, the International Atomic Energy Agency found the 1986 Chernobyl disaster to be caused by flaws in the reactor design, but also by “a remarkable range of human errors and violations of operating rules” [266]. To avoid such organizational causes of disasters, industries with safety-critical systems have thus developed rigorous safety cultures, which emphasize independent verification, continuous monitoring, and strong incentives for identifying potential failures early and learning from failures and near-misses [267].32

Key lessons for frontier AI auditing are:

Analyzing specific technical systems is important but needs to be paired with auditing of company practices.

Continuous lifecycle risk management with formal acceptance and verification helps manage systems as they change over time better than one-off certifications.

Near-misses and incidents are often early warning signs of failures that eventually result in significant harm.

Structured artifacts such as hazard analyses and safety cases show the value of documented, evidence-based risk arguments, going beyond measurements of a system’s properties to proper arguments about the implications of these properties. Analogously, frontier AI auditors should review informal analyses and/or formal safety cases produced internally by companies, in addition to the raw evaluation results.

32 Safety engineering research identifies a progression of organizational safety cultures, from “pathological” (characterized by blame and denial) through “reactive” and “bureaucratic” stages, to “proactive” and ultimately “generative” or high-reliability cultures. Auditors assessing frontier AI organizations should evaluate not just whether safety policies exist, but whether the organization’s culture actively seeks out “latent pathogens” and “error traps” before they manifest as incidents [268].

E.4 Aviation safety

Aviation safety involves systematic oversight of aircraft design, manufacturing, operation, and maintenance. Safe aviation has long been viewed as necessary for commercial air travel’s viability: people would be far less likely to fly if it were not both faster and safer than other forms of transit [269]. The well-aligned interests of the aviation industry and public safety have helped produce a track record of safety that is exceptional when compared to other means of transportation [270].33 The strong safety record includes many interlocking elements providing defense in depth, including pre-approval of new design elements, simulation and flight testing requirements with real human operators, mandatory reporting of accidents and major incidents, extensive government-funded and government-housed safety research, and criminal charges in some instances [271, 272, 273, 274, 275, 276].
Despite this extensive safety apparatus, the Boeing 737 MAX disasters (2018–2019) exposed critical weaknesses in the certification ecosystem, resulting in 346 fatalities across two crashes [277]. These events highlighted the dangers of excessive reliance on manufacturer self-certification. Over time, the FAA’s delegation of authority had expanded to the point where Boeing self-certified 96% of the parts for the 737 MAX [277]. While Boeing employees serving as “Authorized Representatives” were intended to represent the FAA’s interests, internal surveys revealed that 39% perceived “undue pressure” from management and 29% feared consequences for reporting safety concerns [277]. When technical experts did raise concerns, they were often overruled by management prioritizing the manufacturer’s timeline, creating what an internal investigation later termed “an environment of mistrust” [277]. Following these events, the Aircraft Certification, Safety, and Accountability Act of 2020 introduced reforms limiting self-certification of safety-critical systems and strengthening protections for whistleblowers [278]. But this was too late to prevent not only substantial loss of life, but also a decline in trust in both Boeing and the FAA. Overall, despite decades of development and many strengths compared to current AI practices, periodic crashes underscore that even strong auditing and safety regimes leave residual risk and that expectations for AI auditing in Section 6 should be ambitious but also realistic [279, 280]. Key lessons for frontier AI auditing are:

Delegating certification to the entities being certified creates dangerous conflicts of interest, particularly when commercial pressures are high. When employees face internal pressure not to raise concerns, self-assessment regimes can become self-dealing. For frontier AI, this underscores why auditors must be genuinely independent third parties.

Auditors and regulators cannot effectively oversee what they do not understand; if public or third-party auditors lack the resources to maintain independent technical expertise, they risk becoming “rubber stamps” for decisions made by the companies they oversee.

33 Note that here we focus on a specific subset of aviation, though we use the term aviation for shorthand. The subset we focus on is commercial scheduled aviation in developed nations. Other forms of aviation tend to have much weaker safety track records.

Government agencies often face structural limitations in technical oversight, including an inability to match private sector salaries and retain specialized expertise. This supports the case for a private-sector auditing ecosystem with public oversight, as discussed in Section 3.4.

E.5 Penetration testing

Penetration testing consists of hiring skilled, adversarial-minded experts to actively probe a complex digital system (such as a company’s network, applications, or infrastructure) as if they were real attackers, but in a controlled and permissioned way. Instead of checking only whether documented requirements are met, these testers creatively search for unexpected failure modes, chain together subtle weaknesses, and try to achieve concrete, high-impact goals such as stealing sensitive data or taking control of critical operations. Their findings don’t just result in a pass/fail grade; they produce detailed reports that prioritize vulnerabilities by severity, demonstrate realistic attack paths, and recommend targeted fixes, often iterated on over multiple rounds. Over time, this kind of structured, adversarial evaluation becomes a recurring discipline: independent teams are brought in, rules of engagement are defined, safeguards are put in place, and the results feed into broader risk management and governance, especially for systems whose failures could have serious consequences beyond the organization itself.

Penetration testing demonstrates that security attributes are often best assessed through active adversarial testing. Mitigations that ostensibly provide defense in depth may prove inadequate when subjected to realistic attack strategies, since adversaries can adaptively route through the holes in each layer of defense. Penetration testing can reveal critical vulnerabilities missed by even highly qualified in-house security teams, though notably, it complements and builds on rather than substitutes for in-house security capacity.
It also shows that creativity and realism matter more than static checklists, and that an adversarial analytical posture need not imply an adversarial relationship with the organization. The field also illustrates challenges in measuring defensive strength, such as simulating the considerable effort and resources of a motivated nation-state attacker. While easily finding and exploiting a vulnerability shows defenses are unlikely to withstand real attacks involving similar expertise and effort, the converse is not necessarily true (e.g., even if a realistic simulation of state-level attacking skills may fail, a real state-level attacker might succeed). Furthermore, simulating higher-level attackers is correspondingly more difficult and expensive. Penetration testing best practices are codified in standards such as the Penetration Testing Execution Standard, CBEST, TIBER-EU, PCI-DSS, and NIST SP 800-115 [281, 282, 283, 284, 285]. Key lessons for frontier AI auditing are:

Active, adversarial testing should be core to audits for security- and misuse-related risks, rather than relying only on checklist-style reviews.

Legal safe harbors for good-faith researchers are essential in order to unlock constructive engagement with companies on high-risk aspects of products.

Penetration-style engagements should complement, not substitute for, in-house security work, with audits focused on realistic attack paths and prioritized remediation guidance.

An adversarial analytical posture can coexist with a collaborative relationship, where auditors and companies iteratively fix issues rather than treating audits as one-off pass/fail exercises.

Bug bounty-style programs can complement formal penetration tests by providing continuous, incentive-aligned scrutiny from a broad pool of researchers, with payment tied to impact and clear expectations for rapid remediation.

While private engagement is valuable in providing space to mitigate risks, it is eventually important for findings to be surfaced publicly in order to drive ecosystem-wide improvements.

E.6 Financial auditing

While “audit” has many meanings, for many people it is synonymous with financial auditing. Financial audits may seem far removed from safety and security audits, but the key parallel with frontier AI auditing is that independent reviewers examine highly sensitive, non-public information to judge both whether an organization’s public claims are credible and whether its internal safeguards are effective. Modern financial auditing emerged in the 19th century alongside unprecedented cross-border capital flows (notably to finance US railroads) and evolved again after scandals like Enron’s collapse prompted the Sarbanes–Oxley Act, which raised expectations around auditor independence and executive accountability. Today, financial auditing is a highly structured and professionalized ecosystem spanning firms of many sizes, private standard-setting alongside public regulation, and a large body of case studies (both successes and failures) from which to learn.

Financial auditing is therefore both a source of inspiration and a source of cautionary tales for frontier AI. On the positive side, it demonstrates that societies can build professional processes and standards that enable independent parties to review extremely sensitive information; develop relatively standardized ways to compare risks and controls across very different organizations; and combine private and public “demand signals” into an ecosystem that supports high-stakes decisions and large investments. Financial auditing also offers a mature set of conceptual tools that frontier AI auditing can borrow rather than reinvent.
These include norms for managing conflicts of interest and evaluating evidence; sharp distinctions between error and fraud (with fraud demanding more rigorous detection approaches); recognition that professional judgment is indispensable; a view of auditing as a profession with duties to the public and investors that supersede obligations to the client; attention to organizational culture as a driver of wrongdoing; close collaboration with domain experts; and the separation between prepared statements and independent verification. These analogies translate naturally to AI: system cards can be treated as the analogue of financial statements, and safety, security, and compute-provenance measures as internal controls whose effectiveness auditors should assess. Financial auditing’s failures also illustrate what can go wrong when independence is compromised. The Enron and Wirecard scandals show how heavy reliance on a small number of large clients — and pressure to retain them — can distort auditor incentives. The Enron scandal also illustrates two distinct mechanisms of capture: Arthur Andersen (the auditing firm) faced impaired objectivity after performing extensive consulting that positioned it to audit work it helped design, and internal incentives favored client satisfaction, with Enron paying roughly comparable sums for consulting and audit services ($27M vs. $25M) [68]. Even after the Sarbanes–Oxley Act reduced some advisory–audit conflicts, dependence on clients persisted, and PCAOB inspections (“auditing the auditors”) have repeatedly raised concerns about lax attention to detail and overly procedural, box-checking approaches that risk missing systemic problems [69]. The sector has also struggled with an “expectations gap” between public belief that audits guarantee the absence of fraud and auditors’ actual mandate to provide only reasonable assurance. Taken together, these case studies suggest several lessons for frontier AI auditing:

Professional standards for how audits are conducted and communicated can establish a common language for understanding findings related to many highly diverse companies.

Given the right incentives, a very large ecosystem can be built to provide well-defined assurance services.

Since audits have a weak track record at detecting deliberate fraud, regimes must either explicitly scope some deception risks out — or invest in unusually deep access, high-effort testing, and strong disincentives for misrepresentation.

Audit criteria and documentation must avoid devolving into gameable box-ticking (Goodhart’s Law) by pairing standardized evidence requirements with professional judgment about systemic risk.

Standards often lag innovation, so AI auditing frameworks should be more adaptive than traditional financial-audit rules while still drawing on institutional precedents (e.g., oversight bodies).

As the frontier AI ecosystem scales on a far shorter timeline than finance did, it will likely need to lean heavily on automation to achieve adequate coverage, while keeping the core professional ethic intact: frontier AI auditors’ duties should be framed not only to clients, but centrally to AI users and the broader public.

Scandals such as Enron and Wirecard showed that auditor independence and disclosure of conflicts of interest are essential.

F Contemporary third-party frontier AI assessment

Third-party frontier AI assessment has grown significantly in recent years, providing a foundation on which to build (Section 4). We assess the current state of third-party assessment along nine dimensions: reporting, access, rigor, standardization, continuous monitoring, scope, scale, independence, and ecosystem maturity. Table 5 summarizes the gap between current practices and the vision introduced in Section 5. We specifically focus on current vs. future practices in assessment of AI companies’ technical systems, rather than assessment of organizations’ risk culture or internal processes, for which we are not aware of established precedents. Where possible, we cite published literature; where sources are unavailable given the nascent state of AI assessment, we rely on direct experience and author expertise. Contemporary assessment efforts provide an important foundation, but realizing the proposed vision will require closing substantial gaps along those nine dimensions, as discussed below.

Table 5: The gap between contemporary third-party frontier AI assessment and our vision for future third-party frontier AI auditing.

| Dimension | Today (January 2026) | Future Vision |
|---|---|---|
| Reporting | Sparse, inconsistent public reporting; details depend on auditor-auditee agreements | Standardized, rigorous public reporting frameworks with justified redactions |
| Access | Mostly public-level access; limited pilots with deeper access | Deep access comparable to trusted internal engineers; structured secure environments |
| Rigor | Fraction of effort applied by the most sophisticated internal teams; significantly less than other safety-critical domains | Rigor matching or exceeding other safety-critical contexts |
| Standardization | Emerging norms and proposals; bespoke contracts | Clear professional norms backed by consensus and incentives |
| Continuous monitoring | One-off snapshots with unknown shelf-life | Continuous monitoring with automatic downgrading based on drift |
| Scope | Predominantly model-centric capability evaluation | Whole-organization assessment including security, platform controls, and governance |
| Scale | Voluntary participation by few developers | Universal adoption across frontier developers |
| Independence | Evaluators depend on company goodwill | Access and financial standing secure regardless of findings |
| Ecosystem maturity | The third-party evaluation ecosystem is growing but currently consists of a small number of specialized private evaluators (e.g., METR, Apollo Research, SecureBio, Irregular), often with focuses on particular risks, and a small number of government agencies (e.g., US CAISI and UK AISI). | A mature regime of private and public evaluators conducts audits of frontier AI systems. Some specialize in evaluations for niche risks while others perform holistic evaluations. Auditors coordinate to collaboratively articulate best practices. |

F.1 Reporting

Public reporting on third-party audits remains inconsistent both across and within frontier AI developers.
Reporting templates, substance, and style vary substantially by audit, auditor, and developer [4, 70]. By some analyses, reporting quality has declined over time [3, 4]. To date, audit results are most commonly communicated through system cards and related publications, both of which we draw from to inform this section. Frequently, system cards only mention third-party evaluators in the abstract and provide little methodological detail about third-party audits [286]. In some cases, system cards mention third-party evaluators by name [287, 288]. Sometimes, third-party assessors themselves (e.g., Irregular, METR [232, 289]) or assessed companies (e.g., OpenAI, Anthropic, Amazon [288, 290, 291]) share additional public details about specific evaluations that have been conducted, as a complement to briefer discussions in system cards. This lack of transparency hinders understanding and advancement of the third-party auditing landscape.

34 Some companies do not typically share information about which third parties they work with; for example, Google DeepMind frequently mentions third-party assessment but does not name the individuals or organizations in question. It is typically unclear from the outside whether, in these cases, assessors are allowed to discuss their work with these companies, and how rigorous the practices are relative to cases that are documented in more public detail.

35 Legitimate reasons, including information hazards, may exist to exclude select information from public documentation. However, we posit that current transparency and reporting gaps are far from fully accounted for by these reasons. As discussed in Section 5.4, auditors’ ability to review unredacted safety information (and other non-public information more generally) and to attest to the reasonableness of redactions in public versions is one key component of avoiding pure self-assessment while protecting sensitive information.

F.2 Access

To assess safety-relevant properties of frontier AI deployments with reasonable or high assurance, third parties need access to various types of information [5, 156, 157, 158, 159]. At a minimum, this includes timely “black-box” access to model outputs via an API, preferably with configurable settings (e.g., around reasoning effort, temperature, etc. as applicable). More comprehensive assessments require access to richer interfaces (such as variants with reduced safety mitigations, “helpful-only” models, the ability to fine-tune models in a custom manner, or bespoke testing endpoints) as well as non-public information about how systems were trained and which mitigations and monitoring are in place. For high-stakes questions, third-party evaluators may need “gray-box” or “white-box” access to model internals or “outside-the-box” access to additional resources (see Table 6).

Table 6: The first three entries are cumulative in that gray-box includes black-box access and white-box includes gray-box access, whereas outside-the-box is separate from these.

| Access level | Information accessed |
|---|---|
| Black-box | The ability to query a system with inputs and analyze the resulting outputs. |
| Gray-box | Partial visibility into a system’s operations such as chain-of-thought, sampling probabilities, or some activation patterns. |
| White-box | Access to full activations, model weights, and architecture. |
| Outside-the-box | Access to relevant training data, training details, source code, documentation, logs, “helpful-only” models, ability to fine-tune models in a custom manner, and organizational artifacts. |

In practice, third-party access to frontier AI systems (often limited to black-box API access) remains dependent on developer discretion [5]. Many independent organizations conduct evaluations using public APIs or short access windows to pre-deployment APIs (only rarely has this access occurred more than a few weeks before launch).
Pre-deployment testing exercises for Anthropic’s and OpenAI’s models by the US Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (AISI) provide examples of deeper access in practice [39, 40], as do pilots conducted between industry and the non-profit organizations METR and Apollo Research [149, 292].36 To date, evaluators rarely — if ever — receive access to model training data, chain-of-thought [293], model internals, or even basic (let alone detailed) training and deployment documentation, even while developers themselves acknowledge that such information is important for their own confidence in their safety and security mitigations [292, 294, 295, 296]. Existing pilots that expose chain-of-thought tend to be limited in duration and scope, e.g., sharing static examples rather than continuous access for each query, and companies often decline to guarantee not training on evaluation data, thereby risking contamination of future analyses [159]. Gray- and white-box access to model internals, and outside-the-box access to source code, training data, and internal evaluation results are currently highly limited but will be increasingly important for rigorous external assessment as the limits of black-box analysis are reached [5]. 
Evaluators also face practical obstacles, including model providers breaking previously-safe assumptions with no warning (e.g., new reasoning settings); API bugs that only manifest after hours of benchmark execution; bugs specific to extremely large and slow models (e.g., backends timing out); providers swapping in quantized or weaker models when overloaded without notification; crushingly low rate limits; model updates during evaluation periods that invalidate prior results; limited ability for third parties to verify when model changes are occurring due to quantization or other factors; and time constraints that prevent thorough assessment.

36 To date and to our knowledge, this remains limited to API-level testing, limited non-public documentation, access to chain-of-thought, one staff interview, access to helpful-only model variants, and private company attestations to specific claims.

F.3 Rigor

Methodology and rigor in third-party assessments vary substantially:

Benchmark-based assessments struggle with issues related to quality, design, elicitation methods, simplifying assumptions, data contamination, and differences between evaluation conditions and the real world [25, 73, 74, 79, 196, 197, 200, 297, 298, 299, 300].

Red-teaming-based methods are skill-dependent and frequently fail to be rigorous in practice [301, 302].

Empirically, evaluations often struggle to identify failures, and the worst behavior identified in an evaluation offers only a lower bound on the system’s worst-case harms [158, 293, 303].

It is difficult to conduct audits with full construct validity, i.e., audits that accurately capture real-world, often subtle types of risk [73, 74, 197, 200].

Audits usually focus on a limited number of risk domains (Section 5.1) and typically stick to evaluating harms that manifest in single uses of a system rather than extended uses in real-world applications.

Time and resource constraints may be the most significant barrier to rigorous assessment. Evaluators typically operate under severe time pressure, with assessment windows rarely exceeding a few weeks and often compressed to days. This contrasts with months-long certification processes in aviation, nuclear safety, and even lower-risk consumer products. Resource asymmetries compound these temporal constraints: third-party assessors generally operate with a fraction of the computational budget, personnel, and specialized tooling available to frontier AI developers, making it difficult to match the depth and sophistication of internal safety teams.

Methodological standardization remains a persistent challenge. Without consensus frameworks, different assessors employ divergent threat models, scoring rubrics, and evaluation protocols, rendering cross-assessment comparisons difficult or meaningless. What constitutes a “dangerous capability” or an acceptable risk threshold varies substantially across organizations, and the criteria for determining whether a model “passes” or “fails” an evaluation are often implicit rather than codified. This heterogeneity undermines the field’s ability to establish baselines, track progress over time, or provide stakeholders with consistent signals about relative risk levels.

Reproducibility issues further undermine confidence in assessment outcomes. Many evaluations rely on proprietary prompts, specialized human expertise, or particular API configurations that are not fully documented or shared. When evaluators publish their methodologies, subtle differences in implementation, model versions accessed, or even API call timing can produce substantially different results.
The lack of standardized reporting requirements means critical methodological details are often omitted from public reports, making independent verification nearly impossible.

F.4 Standardization

Standards for frontier AI assessment remain nascent but are evolving rapidly. Recent developments include proposed frameworks for the design, implementation, and reporting of evaluations [79, 205, 304]; an initial statement of best practices from the industry-led Frontier Model Forum; the recent announcement of a forum for third-party evaluators, AEF [81]; and an initial statement from that forum (AEF-1) [169]. While promising, these developments lag behind evaluation regimes in more established industries. AI evaluations are almost always conducted under bilaterally negotiated, confidential, ad hoc contracts. The terms of these contracts, including scope, access provisions, and publication rights, are rarely visible to regulators or the public. The absence of standardized contractual and reporting requirements means that critical methodological details are often omitted from public reports, making independent verification nearly impossible.

F.5 Continuous monitoring

AI developers frequently make both incremental and substantial changes to their systems without providing early access to third parties to conduct updated risk assessments, or else publish information and grant access only after changes have already been deployed [305, 306]. Such changes can occur for several reasons. Some stem from system-level modifications, such as altering how multiple instances of a model are coordinated within an agentic product. Others result from inference-time optimizations aimed at improving efficiency, or from new post-training updates to the model itself. A balance is needed between excessive third-party review of all changes and insufficient checks and balances to prevent severe risks.
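As a purely illustrative aside, one way a third party might notice undisclosed system changes is to re-send a fixed set of probe prompts at intervals and compare digests of the responses against those recorded at assessment time. The sketch below is a minimal version of that idea under strong simplifying assumptions: responses are assumed to be deterministic for a fixed prompt, and the probe prompts, the two stand-in systems, the function names, and the 0.25 threshold are all hypothetical choices made for illustration. It does not describe the method of any existing tool or developer API.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(query: Callable[[str], str], prompts: List[str]) -> Dict[str, str]:
    """Map each probe prompt to a SHA-256 digest of the system's response."""
    return {p: hashlib.sha256(query(p).encode("utf-8")).hexdigest() for p in prompts}


def drift_fraction(baseline: Dict[str, str], current: Dict[str, str]) -> float:
    """Fraction of baseline probes whose response digest has changed."""
    if not baseline:
        raise ValueError("baseline fingerprint must contain at least one prompt")
    changed = sum(
        1 for prompt, digest in baseline.items() if current.get(prompt) != digest
    )
    return changed / len(baseline)


def assessment_may_be_stale(
    baseline: Dict[str, str], current: Dict[str, str], threshold: float = 0.25
) -> bool:
    """Flag a prior assessment for review when drift exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    return drift_fraction(baseline, current) > threshold


# Hypothetical probe set and stand-in systems (no real API is called).
PROBES = ["2+2=", "Capital of France?", "Spell 'audit' backwards.", "Name a prime above 10."]


def assessed_system(prompt: str) -> str:
    """Stand-in for the system version that was originally assessed."""
    return f"v1:{prompt}"


def updated_system(prompt: str) -> str:
    """Stand-in for a silent update that alters responses to half of the probes."""
    return f"v2:{prompt}" if PROBES.index(prompt) % 2 == 0 else f"v1:{prompt}"


baseline = fingerprint(assessed_system, PROBES)
unchanged = fingerprint(assessed_system, PROBES)
after_update = fingerprint(updated_system, PROBES)

print(drift_fraction(baseline, unchanged))
print(drift_fraction(baseline, after_update))
print(assessment_may_be_stale(baseline, after_update))
```

Real deployments are rarely deterministic even at fixed sampling settings, so a practical monitor would likely need to compare distributions over repeated samples rather than exact digests, and would need to distinguish benign variation from changes that bear on a prior assessment's conclusions.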
The EU General-Purpose AI Code of Practice articulates criteria for “similarly safe” models [247], and analogous criteria will eventually be needed for other purposes and contexts, for example to gauge the size of changes to a product that uses substantial amounts of test-time compute with a fixed underlying model. At minimum, greater third-party use of automated measurement for system changes is needed (at least those that can be measured in a relatively resource-efficient manner), thereby lessening reliance on company self-reporting. A nascent effort in this direction is stampr-AI, which checks APIs for changes in “model fingerprints.” By having continuous public and auditor insight into (some subset of) significant changes, it will be easier to determine whether a prior third-party assessment’s “shelf life” has been exceeded.

F.6 Scope

Evaluations are conducted on a wide range of topics without access to non-public information (e.g., as happens in academic research or customer testing of products they are considering adopting). Such evaluations are not our focus here, though they are important, and critically, they are inherently easier to scale than evaluations that involve non-public information. When frontier AI developers provide non-public information to third-party evaluators, their assessments generally focus on capability evaluation and, increasingly, propensity evaluation (e.g., whether models tend to act deceptively under certain circumstances) with a predominant focus on biological, chemical, cyber, nuclear, and deception-related risks.
Some assessments evaluate safety and security mitigations (e.g., jailbreak robustness), including by specialized organizations such as FAR.AI, Gray Swan, and Haize Labs, though these efforts often focus on system-level robustness rather than platform-level assessment of controls (e.g., considering efforts to break harmful activity down into benign-looking components across multiple API accounts, as has been shown in academic research and later discovered in the wild [30, 307]). One step toward assessing the organization as a whole rather than just individual systems is METR’s analysis of GPT-5.1 Codex-Max, which incorporated a forward-looking extrapolation of OpenAI model capabilities into its risk analysis, given private statements from OpenAI regarding their future expectations and plans [232]. We are not aware of explicit third-party assessments of frontier AI companies’ safety and security cultures. Assessments of different aspects of safety and security require different operating conditions. For example, system-level assessments may require unfettered access and rate limits, whereas the efficacy of platform-level assessments would be undermined by being given “special treatment,” and instead the more important bottleneck may be establishing safe harbor protections for researchers who must violate terms of use to conduct security research. There has also been little third-party investment in assessing whether mitigations are sufficient for a clearly defined threat model — a gap that is becoming increasingly important as models are approaching or crossing dangerous capability thresholds [308], and as companies routinely report misuse of their product by state and non-state actors [292, 309, 310]. Current assessments focus predominantly on technical consumer-facing systems, particularly the model itself, rather than the full organizational stack. 
Assessors typically lack visibility into internal processes, safety and security culture, governance structures, platform-level abuse mitigations, and AI systems that are deployed internally within an AI company [26]. Some third parties grade public safety and security frameworks from industry labs (AI Lab Watch, SaferAI Ratings, AI Safety Index), though to date there has been only relatively limited (disclosed) efforts to augment such grading with non-public information. Current third-party work is primarily assessment (measuring claims) rather than verification (confirming specific claims), in part because companies have not yet made sufficiently specific claims that would warrant verification (e.g., companies’ safety and security frameworks often set very high levels for what constitutes unacceptable risks, and claims regarding the effectiveness of mitigations are often vague).37 One recent example of a step beyond viewing models and systems themselves as the only unit of analysis is OpenAI soliciting third-party assessment of their gpt-oss model [83] — third parties submitted critiques of and recommendations for the risk assessment and mitigation process, drawing in part on non-public information (namely earlier drafts of risk assessments, particularly focused on fine-tuning the model to increase certain dangerous capabilities) in order to provide such input. Another recent example is Anthropic soliciting input from METR on their pilot sabotage risk report on Claude Opus 4 and 4.1 [82], where METR reviewed company methodology and model-centric evidence including evaluation results, deployment information, and safeguard descriptions. METR also produced an analysis that was based in part on review of an unredacted version of a pertinent safety artifact, allowing METR to speak publicly to the reasonableness of the redactions in the public version [149]. 
F.7 Scale

While competitive, reputational, and legal pressures motivate most leading developers to conduct some safety testing, participation is neither universal nor consistent. OpenAI and Anthropic have established ongoing relationships with third-party evaluators and government-backed institutes. Google DeepMind has also engaged external parties, though with less public detail on these arrangements. Other developers, particularly fast followers and open-weight model developers like Meta, Mistral, and xAI, have been more variable in their engagement with external assessment [311, 312, 313].38

37 The AI Safety Index’s methodology includes subjective evaluation of companies’ performance by relevant experts. These subjective evaluations will tend to draw on, among other things, non-public information known to these experts, and likely have a role to play in improving public understanding of how companies compare, but this is different in nature from the structured, explicit use of such information that we focus on here.

Another major gap is limited third-party assessment of frontier Chinese AI systems, despite a growing number being built and deployed there. Some efforts have emerged (for example, Shanghai AI Lab’s Frontier Risk Framework mentions third-party auditing [314]), but these remain exceptions rather than the norm, and it is unclear whether such assessments involve significant use of non-public information analogous to the current (admittedly still limited) practices at American companies. For example, we are not aware of pre-deployment third-party testing of Chinese systems. This highlights challenges in scaling frontier AI auditing globally: legal barriers may restrict foreign auditors’ access to domestic AI systems; language and cultural differences can impede understanding of organizational practices and safety culture; and geopolitical sensitivities may limit willingness to grant access to external parties, particularly across rival jurisdictions.
If all frontier AI developers demanded frontier AI auditing at the highest assurance levels, third-party organizations could not meet this demand immediately, though broad coverage of the lower assurance levels is likely achievable by tapping into the talent and networks discussed in Section 6.3. Currently, since universal coverage is not required and there is a trade-off between quality and quantity of coverage, assessors make prioritization decisions based on factors like the expected risk of a system and the learning value to assessors of conducting a given assessment (e.g., whether a new type of access or analysis can be pioneered during an engagement).

F.8 Independence

Developers participate in evaluations voluntarily and retain substantial contractual leverage in setting evaluation terms. Because access to future models depends on maintaining cooperative relationships with developers, third-party assessors may face implicit pressure to avoid findings or disclosures that could jeopardize continued access. Public reporting of safety incidents by companies themselves or their assessors can lead to media backlash [317], which could incentivize developers to obscure important safety information or assessors to soften critical conclusions. While similar dynamics exist in other auditing domains, these regimes have developed greater institutional safeguards such as mandatory audits, standardized terms, and regulatory oversight that partially mitigate these pressures.

38 An emerging driver of greater consistency is the EU General-Purpose AI Code of Practice, which requires signatories to undergo independent external evaluations, at least to the extent they are able to find qualified assessors, including for monitoring the model after it has been placed on the market. While signing and complying with the Code of Practice is voluntary, doing so grants developers a presumption of conformity with the EU AI Act. At the time of writing, many (but not all) developers have signed the Code of Practice. Notable non-signatories include Meta and Chinese AI companies. The requirement for independent evaluation also depends on signatories being able to find qualified assessors, which may be a constraint given the limited scale of the current ecosystem.

39 For example, data localization requirements and cross-border data transfer restrictions can create legal barriers to foreign auditors remotely accessing domestic AI systems. China’s Cybersecurity Law requires that personal information and “important data” be stored locally, with cross-border transfers requiring security assessments and government approval [315]. The US has imposed analogous restrictions in the other direction through CFIUS reviews and export controls that can block transactions giving “countries of concern” access to AI-related data and technology [316].

F.9 Ecosystem Maturity

Ad hoc algorithmic audits, often conducted by academics and interest groups, have a significant history of precedent and impact [37]. However, the current ecosystem of specialized third-party auditing organizations is much more nascent. Dedicated safety evaluation capacity is concentrated in a handful of specialized non-profits and government-backed institutes, likely comprising only a few hundred full-time employees specifically focused on frontier AI.40 In contrast, for security evaluation, traditional enterprise security auditing practices (penetration testing, SOC 2 compliance, ISO 27001 certification) are mature and widely adopted, and some leading developers do engage conventional security auditors (e.g., SOC 2) [318, 319, 320]. However, AI-specific risk assessment (evaluating protections for model weights, adversarial robustness, and novel attack surfaces unique to machine learning systems) remains far less developed [94].
This asymmetry means that while many developers may have robust conventional security postures, the specialized security challenges posed by frontier AI systems receive comparatively less external scrutiny. Private evaluation organizations are typically non-profits that specialize in a specific type or domain of evaluation. While they perform evaluations separately, private auditing organizations are beginning to publicly coordinate. The AI Evaluator Forum was established in December 2025 with the goal of allowing evaluation organizations to coordinate on shared standards [321]. Its founding members are Transluce, METR, RAND, AVERI, SecureBio, Princeton HAL, The Collective Intelligence Project, and Meridian Labs. All are located in the United States and the United Kingdom, although Shanghai AI Lab’s Frontier Risk Framework mentions third-party auditing [314], indicating the geographic distribution may result more from an absence of third-party evaluation than an absence of political support for the practice. Evaluation organizations often differentiate themselves by domain focus or methodological approach. For example, Apollo Research focuses primarily on evaluations for deception, Transluce emphasizes white-box evaluations, and SecureBio assesses biorisks. While private evaluation organizations frequently publish research outputs, relatively little information is publicly available about specific audits. This opacity largely reflects the fact that evaluations are typically conducted under bespoke contractual terms negotiated with auditees, which often constrain disclosure. Some national governments also conduct third-party evaluations of frontier AI systems. In 2024, an informal International Network of AI Safety Institutes (later renamed the Network for Advanced AI Measurement, Evaluation, and Science) was established, in part to coordinate on evaluations and to conduct joint testing exercises [322]. 
Network members included Australia, Canada, the European Commission, France, Japan, Kenya, the Republic of Korea, Singapore, the United Kingdom, and the United States [322].41 National governments increasingly conduct their own evaluations independently, often specializing in risks that align with national priorities and security expertise. However, there is some precedent for multinational coordination. In July 2025, a joint agent evaluation effort was announced involving Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea, and the United Kingdom [324]. Public reporting on government-led assessments remains uncommon and limited. The UK’s AI Security Institute stands out in how it publicly shares a relatively large (but still limited) amount of information about evaluation methods and findings [325].42

40 Our estimate draws on publicly available information about the main organizations conducting third-party frontier AI safety assessment. Among non-profits, METR employs approximately 21–50 staff and Apollo Research approximately 15–20. Government-backed institutes are larger but vary considerably: the UK AI Security Institute reports over 100 technical staff and £66 million in annual funding, while the US Center for AI Standards and Innovation operates with substantially fewer resources. These figures are approximate and subject to change.

41 In December 2025, the network was renamed the Network for Advanced AI Measurement, Evaluation and Science [323].

42 Notably, not all information about assessments should be published (at least immediately after they are conducted), since such information could hurt the integrity of future assessments. See discussion in Section 5 regarding the need for auditors to keep (some of) their methods private.

G Risks of and alternatives to “auditee pays” models

Today, AI assessment organizations are generally funded by a combination of philanthropy and frontier AI company payments. Several alternative funding models are foreseeable, including auditor payment by insurers, regulator-administered funding pools, payments from downstream enterprise users in high-stakes or regulated sectors, industry-wide levies administered by an industry body or independent entity, or hybrid approaches combining these mechanisms. These alternatives could provide stronger independence guarantees while ensuring adequate funding. In some cases, even a small levy could support a substantial expansion of the AI auditing ecosystem.

When auditors compete for contracts awarded by the companies they evaluate, there is a significant risk that auditing devolves into “rubber stamping,” as occurred in financial auditing prior to the 2008 crisis. In such arrangements, conscious or unconscious bias toward outcomes favorable to the client can emerge. Funding models in which auditors are paid by insurers or regulators therefore merit particular attention, as they better align auditor incentives with accurate risk assessment rather than client satisfaction. Insurers have strong financial incentives for accurate risk quantification, making them natural principals for auditing services.

Empirical evidence on the effectiveness of these alternative models remains limited (with few exceptions, such as [326]), and the ideal model for the frontier AI industry remains unclear. Accordingly, we do not recommend any specific payment model at this stage but recommend research toward answering this question.
H Frontier AI definitions and different thresholds for triggering audits

Our definition of frontier AI (“general-purpose AI models and systems whose performance is no more than a year behind the state-of-the-art on a broad suite of general capability benchmarks”) is similar to the definition used by the Frontier Model Forum [327], which is “a general-purpose model that outperforms, based on a range of conventional performance benchmarks or high-risk capability assessments, all other models that have been widely deployed for at least 12 months.” The main difference is that we include systems, not just models, as a central part of the definition. We use the 12-month threshold, but we acknowledge there are problems with relying on a single threshold, and with temporal thresholds generally. The reason one might want to use this kind of approach, at least for exploratory research and policy discussions, is to convey that actual system capabilities are the focus, rather than inputs, which are only a proxy for those capabilities. However, imperfect proxies are often easier to administer and communicate, and can be more predictable for (potentially) regulated companies. This helps explain why computing power-based thresholds, revenue-based thresholds, and expenditure-based thresholds are more common in regulatory contexts.

We don’t think the choice of threshold significantly changes our basic proposal for auditing processes and incentive design, but it’s important to ensure that any codification of a frontier AI threshold anchored to auditing requirements has the capacity to be changed over time [328], given the fast-moving nature of AI. A metric that works today may not work as well a year from now. This is important to consider for both public sector actors, such as legislators writing AI legislation, and private sector actors, such as insurers writing standard policies.
A good example of how a proxy can go wrong is the training required in order to “pre-train” a language model, which was an early metric used to determine which AI models or systems were subject to frontier AI regulations. Computing power and design decisions at each stage of the supply chain provide concrete, quantitatively precise opportunities for regulatory intervention compared to some alternatives [219, 329], but anchoring on a specific way that computing power is used can be perilous. While pre-training compute remains important, two other factors — reinforcement learning, another type of training which is different from pre-training, and “test-time compute,” the amount of computing power used when running a model, which can be increased in order to give better results — have increased in relative importance. See also [330] and [328]. Likewise, major technical developments can cause distinct but similar issues with temporal thresholds, in that a slowdown in general capability scores would lead to very few systems being in scope, and vice versa. A different challenge with our definition is that it specifically focuses on general capabilities. Some types of AI systems (e.g., trained on certain kinds of data) might have dangerous capabilities in certain areas despite scoring poorly on general capability benchmarks. While such systems are out of scope of our definition, that does not mean that they should not be considered for auditing, and one could imagine adapting our approach to encompass some such systems (e.g., through multiple sufficient thresholds, one based on general capabilities and others based on specific “risk verticals”). Different stakeholders might be more willing to tolerate “false positives” from a given threshold (unduly burdensome audits given a system or company’s real risk profile) or “false negatives” (unaudited or weakly audited systems or companies that are more dangerous than the level of scrutiny applied to them would suggest). 
There are ways to calibrate risk judgments in an adaptive fashion in order to reduce the total number of errors, such as building “triaging” (technically knowledgeable regulators having the ability to grant exemptions rapidly) into the audit process itself, but each has its own challenges, and inevitably there will be some imprecision.

There are various trade-offs to consider when setting these thresholds. First, thresholds should strike a balance between stimulating demand (i.e., causing more, and more rigorous, audits to occur than would have existed otherwise) and incentivizing corner-cutting (i.e., encouraging auditors to “churn out” low-quality audits). These are reasons why we emphasize the need for market analysis and quality standards in our recommendations, but these will at best soften, rather than eliminate, this basic tension. Second, thresholds should strike a balance between mitigating safety and security risks from frontier AI (and, possibly, particularly dangerous narrow AI systems) and enabling beneficial AI innovations. There is complexity on both sides of this ledger. It is important to consider the direct safety and security risks from a given company, as well as the “horizontal” and “downward” learning from audits discussed in Section 3. Small startups that lack the capacity to undergo even low assurance level audits might forgo launching products. Even larger companies that can afford higher assurance levels might view audits as too burdensome. Note also that it is difficult to reason about the burdensomeness of audits in isolation: the safety and security standards that are audited against are also important to consider, which we treat as distinct from the assurance level at which a system and company are evaluated against a given set of standards. A range of other factors are potentially relevant, as well, such as the context in which a frontier AI system might be deployed.
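To make the temporal threshold and the “multiple sufficient thresholds” idea from this appendix concrete, the following is a minimal illustrative sketch, not a proposed standard. It assumes one possible reading of “no more than a year behind the state-of-the-art”: a system is in scope if its score on a general capability suite matches or exceeds the best score recorded at least 12 months earlier. The names `SotaRecord`, `is_frontier`, and `in_audit_scope`, the single scalar score, and the boolean risk-vertical flags are all hypothetical simplifications introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SotaRecord:
    """A state-of-the-art score on a (hypothetical) benchmark suite as of a date."""
    as_of: date
    score: float


def sota_score_at(history, when):
    """Best score recorded on or before `when`, or None if there is no such record."""
    scores = [r.score for r in history if r.as_of <= when]
    return max(scores) if scores else None


def is_frontier(system_score, history, today, lag_days=365):
    """True if the system matches or exceeds what was state-of-the-art `lag_days` ago."""
    baseline = sota_score_at(history, today - timedelta(days=lag_days))
    # Assumption: with no baseline that old, treat the system as in scope.
    return baseline is None or system_score >= baseline


def in_audit_scope(system_score, history, today, risk_vertical_flags=()):
    """Multiple sufficient thresholds: general capability OR any risk-vertical flag."""
    return is_frontier(system_score, history, today) or any(risk_vertical_flags)


# Made-up numbers, purely for illustration.
history = [
    SotaRecord(date(2024, 1, 1), 60.0),
    SotaRecord(date(2025, 1, 1), 75.0),
    SotaRecord(date(2025, 10, 1), 85.0),
]
today = date(2026, 1, 15)
general_in_scope = is_frontier(78.0, history, today)
narrow_in_scope = in_audit_scope(70.0, history, today, risk_vertical_flags=[True])
```

Under this reading, the baseline moves with the state-of-the-art, which illustrates the sensitivity noted above: if general capability scores plateau, the 12-month-old baseline converges on the current best score and fewer systems clear it, whereas rapid progress leaves the baseline far behind and brings more systems into scope.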

Acknowledgments

Many gave us valuable feedback on earlier versions of the ideas discussed here and earlier versions of this paper, including but not limited to Mark Greaves, John Bailey, Tyler Cowen, Eileen Donahoe, Nathan Lambert, Geoff Ralston, Gopal Sarma, Adam Woodhall, Larissa Schiavo, Shahar Avin, Andrew Gamino-Cheong, Tom Zick, and Vijay Bolina. We’re also grateful to Stone Addington and Carly Tryens for general support and Eden Beck and Erol Can Akbaba for assistance with formatting and copyediting. None of those listed here necessarily endorse the contents of the paper.