Measuring AI R&D Automation

GovAI · 2026-03-01 · 32 pages

{Alan Chan (1, ∗), Ranay Padarath (1, †), Joe Kwon (1, †)} (‡), Hilary Greaves (2), Markus Anderljung (1, §)
(1) GovAI; (2) University of Oxford

Abstract

The automation of AI R&D (AIRDA) could have significant implications, but its extent and ultimate effects remain uncertain. We need empirical data to resolve these uncertainties, but existing data (primarily capability benchmarks) may not reflect real-world automation or capture its broader consequences, such as whether AIRDA accelerates capabilities more than safety progress or whether our ability to oversee AI R&D can keep pace with its acceleration. To address these gaps, this work proposes metrics to track the extent of AIRDA and its effects on AI progress and oversight. The metrics span dimensions such as capital share of AI R&D spending, researcher time allocation, and AI subversion incidents, and could help decision makers understand the potential consequences of AIRDA, implement appropriate safety measures, and maintain awareness of the pace of AI development. We recommend that companies and third parties (e.g. non-profit research organisations) start to track these metrics, and that governments support these efforts.

1 Introduction

Frontier AI companies aim to automate AI R&D. OpenAI’s CEO Sam Altman expects to have an automated AI researcher by 2028 (Bellan, 2025), while Google DeepMind’s CEO Demis Hassabis predicted in 2025 that an automated AI researcher was “a few years away” (Perrigo, 2025). According to Anthropic’s CEO Dario Amodei, current AI systems are “good enough at coding that some of the strongest engineers [he has] ever met are now handing over almost all their coding to AI” (Amodei, 2026). Many recent products from these companies, such as Codex, AntiGravity, and Claude Code, focus on automating software engineering, a key component of AI R&D. Adoption is widespread: 47% of developers use AI tools daily (Stack Overflow, 2025) and 50% of new code at Google is AI-generated (Alphabet, 2026). Some qualitative reports also suggest that these tools can perform some of the R&D tasks expected of junior AI researchers (Anthropic, 2025b; 2026b).
The effects of AI R&D automation (AIRDA) could be significant, but might cut in different directions (Figure 1). AIRDA could accelerate AI progress, bringing forward AI’s benefits but also hastening the arrival of destructive capabilities (including those related to weapons of mass destruction) or other forms of disruption such as unemployment (Bengio et al., 2026). Whether this acceleration would be overall beneficial remains unclear. Key uncertainties include whether AIRDA would accelerate defensive capabilities more than offensive ones, whether safety research will keep pace with capabilities research, and whether human institutions can adapt to the accelerated pace of progress (Bernardi et al., 2025; Vaintrob & Cotton-Barratt, 2025; Kembery & Ammann, 2025).

AIRDA could also affect oversight of the AI R&D process. By reducing the need for human researchers, AIRDA could concentrate control over AI development into fewer hands and decrease societal oversight (Davidson et al., 2025). AI systems could also increase the need for oversight by introducing more frequent errors into the R&D process, potentially leading to loss of control incidents (Benton et al., 2024; Ward et al., 2025; Bengio et al., 2026). On the other hand, AIRDA could potentially improve our ability to perform oversight, such as by enabling new AI-assisted oversight tools (Choi et al., 2024). As an analogy, higher-level programming languages let developers write more complex software that still behaves (roughly) as intended,1 compared to writing programs in assembly.

∗ Corresponding author: alan.chan@governance.ai
† Work completed as seasonal fellows at GovAI.
‡ Equal contribution. Authors are free to list themselves as first author on their CVs.
§ Senior author.

arXiv:2603.03992v3 [cs.CY] 6 Mar 2026

[Figure 1 diagram. Node labels: Extent of AI R&D Automation, AI progress, Oversight capacity, Oversight demand, Oversight gap. Arrow label: “could affect”.]

Figure 1: The extent of AI R&D automation could affect both AI progress and the oversight gap: the difference between how much oversight is needed (“oversight demand”) and how much oversight is actually achieved. Oversight capacity is the ability to achieve oversight, encompassing both the ability to understand the R&D process (e.g., having sufficient expertise) and the resources available for exercising control (e.g., human labour, monitoring tools). AI progress could also affect the oversight gap, such as by increasing the stakes of R&D decisions. This work proposes metrics to track all of these quantities.

Empirical data will be essential for resolving the above uncertainties and informing decisions. Frontier AI companies may need to determine whether human review processes are keeping pace with AI-generated outputs or whether to invest more in safety research and oversight. Policymakers may need to decide what reporting requirements to impose, whether to mandate minimum levels of human involvement in AI R&D, or how to calibrate regulatory thresholds to the actual pace of automation.

Existing data on AIRDA has significant gaps. First, many uncertainties remain about the extent of AIRDA. Rapidly improving benchmark results indicate at least some progress (Jimenez et al., 2023; Chan et al., 2025; Starace et al., 2025; Wijk et al., 2025; Anthropic, 2026b),2 but it is unclear how directly such results translate to productivity boosts given real-world integration frictions (Becker et al., 2025; Dell’Acqua et al., 2023; Brynjolfsson et al., 2023; Noy & Zhang, 2023; Narayanan & Kapoor, 2025; Becker et al., 2026). Current benchmarks also mainly focus on software engineering rather than other parts of the research pipeline, such as generation and prioritization of research ideas.
Second, there is little data about the consequences of AIRDA, rather than simply its extent. For example, it is unclear whether productivity improvements differ between safety teams and other teams, or how often AI-introduced errors escape human review.

To address these gaps, this paper proposes concrete metrics for tracking AIRDA and its potential effects on AI progress and oversight. The metrics span dimensions such as capital share of AI R&D spending, researcher time allocation, and AI subversion incidents. Given the complexity of the underlying phenomena, these metrics are most useful in combination: no single metric perfectly correlates with AIRDA or its effects, but different metrics can compensate for each other’s weaknesses and provide a broader picture. For example, a simultaneous increase in both the use of AI for high-stakes decisions (Metric #7) and observed defects of AI-generated R&D outputs (Metric #9) could raise significant concerns. Additionally, factors such as commercial sensitivity or individual privacy could warrant sharing certain metrics only with a select group of actors, such as third-party auditors (Brundage et al., 2026) or key government decision makers.

We intend these metrics to be useful for frontier AI companies designing and implementing their safety frameworks (OpenAI, 2025b; Anthropic, 2025a; xAI, 2025; Google DeepMind, 2025a), policymakers developing reporting requirements, and researchers studying AI development trajectories. We recommend that these actors focus on metrics (or components thereof) that are relatively neglected, which we describe in Table 1.

1 Compiler bugs are rare, but can still lead to subtle issues.

2 At the time of writing, the most advanced model had an 80% success rate on tasks that take human expert coders 1 hour and 10 minutes to complete (Metr, 2025b).

Table 1: Recommendations for companies, governments, and third parties (e.g. non-profit research organisations).
Actor: Companies
Recommendation: Track differential progress in automating safety vs. capabilities research
Metrics: Metric #1 (AI R&D evaluations) and Metric #6 (staff surveys) are most straightforward. Metric #2 (human-AI comparison RCTs) can also track differential progress, but requires significantly more effort.

Actor: Companies
Recommendation: Track how AIRDA affects oversight
Metrics: Metric #7 (AI use in high-stakes decisions), Metric #8 (researcher time allocation), and Metric #14 (AI permission lists) are the most feasible starting points. Metric #10 (AI subversion incidents) requires somewhat more effort, while Metric #9 (oversight effectiveness retrospectives) requires the most effort.

Actor: Companies
Recommendation: Track the actual extent of AIRDA
Metrics: Metric #11 (researcher headcount), Metric #12 (compute distribution), and Metric #13 (capital share) are relatively straightforward. Metric #8 (researcher time allocation) would also be informative, but requires somewhat more implementation effort.

Actor: Governments
Recommendation: Develop systems for confidential reporting, potentially in the form of industry-wide aggregates
Metrics: Promising candidates are Metric #7 (AI use in high-stakes decisions), Metric #10 (AI subversion incidents), Metric #12 (compute distribution), and Metric #13 (capital share).

Actor: Third parties
Recommendation: Estimate metrics using public sources
Metrics: Feasible places to start are Metric #12 (distribution of compute usage; You, 2025) and Metric #11 (researcher headcount).

Actor: Third parties
Recommendation: Create tooling and design surveys
Metrics: Time-tracking software for Metric #8 (researcher time allocation), and surveys for Metric #6 (self-reported productivity boosts) and Metric #7 (AI use in high-stakes decisions).

2 What is AI R&D?

AI R&D encompasses the activities involved in developing and improving AI systems, including both capabilities research and safety and security research. Roughly following Owen (2024), we explain activities involved in the current AI R&D process and comment on how AI could plausibly automate its various stages.
Coming up with and prioritizing research ideas. Research ideas can cover a wide range of topics, such as how an AI system might behave on a new evaluation, whether a particular technique might improve performance, or what the cause of a behaviour might be. AI systems could potentially help generate, filter, and prioritize such ideas. As an early experiment, Wen et al. (2025) show that AI systems can predict which of two research ideas will perform better on benchmarks, in some cases outperforming human experts. Designing experiments. Ideas need to be tested in experiments. Designing these experiments involves tasks such as writing and debugging code, as well as generating or collecting datasets (e.g., for training or evaluation). Coding agents have made significant gains in the past few years, with improved performance on coding benchmarks (Jimenez et al., 2023; Chan et al., 2025) and widespread adoption of tools like GitHub Copilot, Cursor, and Claude Code among developers (Stack Overflow, 2025). Synthetically generated data is also becoming increasingly important to the AI R&D pipeline (Bai et al., 2022; Schoen et al., 2025; Fronsdal et al., 2025). Running experiments. Running experiments can sometimes be as simple as running the code. Other times, especially for large-scale experiments, AI researchers have to efficiently distribute experiments across multiple machines and monitor experiments for signs of failure. Monitoring large-scale training runs can be particularly tricky, involving heuristics and on-the-fly interventions (Marin, 2025). AI systems can potentially assist with monitoring and troubleshooting training runs, but data on effectiveness remains limited. Analyzing results. Analysis of experimental results can inform new research ideas and experiments. AI systems already assist with aspects of result analysis, such as generating visualizations and identifying patterns across runs (Choi et al., 2024; Fronsdal et al., 2025). 
We use AI R&D automation (AIRDA) to refer to the use of AI to carry out parts of this pipeline. Automation can be implemented to differing degrees, from simply using AI as a hypothesis generator, to deploying teams of artificial researchers that carry out all parts of the pipeline. The tasks involved in AI R&D could also change over time. For example, increasing involvement of AI in the R&D pipeline could shift human involvement towards verifying that experiments have been designed and run correctly.

3 Potential Implications of AI R&D Automation

To identify informative metrics for measuring AIRDA, we first discuss its potential implications for AI progress and oversight. For AI progress, AIRDA could affect when certain kinds of capabilities emerge and how we respond to them (Figure 2). For oversight, AIRDA could affect the gap between how much oversight of AI R&D is needed and how much is achieved (Figure 3).

3.1 AI Progress

[Figure 2 diagram. Node labels: AI R&D Automation, Offensive or dual-use capabilities, Defensive or beneficial capabilities, Safety & security research, Institutional adaptation, Human welfare. Arrow labels: “could accelerate”, “could help manage”, “could constrain”, “could affect”, “could enable use of”, “could improve”, “could harm”.]

Figure 2: Positive and negative implications of AI R&D automation for AI progress.

AIRDA could accelerate both capabilities research and safety & security research. One potential mechanism is a recursive feedback loop, where AI systems generate software improvements that lead to more capable AI systems, which in turn generate further improvements (Eth & Davidson, 2025). Numerous copies of automated AI researchers could run in parallel as long as there is available compute, working around the clock and much faster than humans can. In the long run, AI systems could become more capable than human researchers and generate higher-quality improvements (Ho et al., 2024; Barnett, 2025; Sevilla et al., 2022). While empirical evidence for these scenarios is limited, early signs are suggestive: AI systems already write code much faster than humans can, and can in limited cases outperform human experts in predicting which idea will lead to better performance on benchmarks (Wen et al., 2025). Compute bottlenecks (Erdil & Barnett, 2025; Whitfill & Wu, 2025; Ho, 2026) and integration frictions could dampen acceleration, but at least some speed-up seems plausible.

Such acceleration could have significant consequences, including bringing forward AI’s benefits, but whether it would be overall beneficial remains unclear. The reason is that AIRDA-induced acceleration could (i) accelerate some aspects of AI progress relative to others, and (ii) accelerate AI progress without similarly accelerating crucial human and institutional processes. The order in which technologies arrive can have significant consequences (Sandbrink et al., 2022; Buterin, 2023; Nielsen, 2024).
First, AIRDA could affect whether offensive AI capabilities arrive before the defensive capabilities or safety and security mitigations needed to counter them. For example:

AI forecasting capabilities could help to manage CBRN uplift capabilities, by predicting when and how malicious actors might carry out attacks. If forecasting is easier to automate than CBRN capabilities, then AIRDA would yield a net benefit by providing stronger defences at any given level of CBRN uplift.

Safety and security research could facilitate the use of defensive capabilities (e.g., by ensuring that systems follow user instructions) or directly constrain offensive capabilities (e.g., through misuse safeguards). However, safety may be harder to accelerate than capabilities because safety is more difficult to measure. For example, attempts from AI systems to hide misbehaviour could make it hard to assess whether safety is actually improving (Greenblatt et al., 2024; Meinke et al., 2025). If so, accelerated capabilities progress could outstrip the development of necessary safeguards, which could result in a net harm. Even if AIRDA differentially accelerates offensive capabilities, the resulting equilibrium could be stable if using the capability results in catastrophic harm. For example, some scholars argue that the spread of nuclear weapons has improved peace through deterrence (Waltz, 1981), though others dispute this claim (Sagan & Waltz, 2003). An additional nuance is that many capabilities are dual-use: for example, AI systems that identify software vulnerabilities could strengthen both defensive cybersecurity and offensive attacks. In such cases, a key question is whether AIRDA affects the extent to which defensive applications can be deployed before harmful uses proliferate. Second, human and institutional responses to advanced AI operate on timescales that may not accelerate alongside AI progress itself. Understanding a new technology’s effects and developing appropriate responses already takes years; accelerated capability gains could leave institutions even less prepared than they are now. For instance, accelerated AI progress could displace workers faster than they can retrain and adapt (Korinek & Suh, 2024; Manning & Aguirre, 2026; Bengio et al., 2026). Rapid progress could also push decisionmakers toward hasty or poorly-informed responses, such as overly broad or unenforceable restrictions on AI development. 
That said, accelerated progress could also galvanize policy attention and reforms, and advances in defensive capabilities like forecasting could aid adaptation (Schoenegger et al., 2024; Karger et al., 2025).

3.2 Oversight

An actor has oversight over the AI R&D process to the extent that they (1) understand the process and (2) exercise informed control over it in order to produce desired outputs, such as by reviewing AI-generated outputs for errors. In other words, oversight is about both a particular state (i.e., understanding) and a particular act (i.e., exercising control). An actor performs oversight to the extent that they exercise such control.

We ultimately care about the oversight gap: the difference between how much oversight is needed over the AI R&D process (“oversight demand”)3 and how much oversight is actually achieved. A factor that could affect the gap is oversight capacity: the ability to achieve oversight, encompassing both the ability to understand the R&D process (e.g., having sufficient expertise) and the resources available for exercising control (e.g., human labour, monitoring tools). Crucially, oversight capacity does not directly translate into oversight achieved: factors such as financial costs or negligence could lead organizations to forgo oversight. We analyse how AIRDA could affect the oversight gap by changing oversight capacity, the level of oversight achieved, or oversight demand.

3 Oversight could improve the safety of and provide assurance about the AI R&D process, but could, for example, impact research productivity or impose financial or other costs. Different actors could have different preferences about how to make these trade-offs. To sidestep this issue, we focus on changes to oversight demand rather than its absolute level.
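One way to make these definitions concrete is to write the oversight gap as a simple difference. The notation below is illustrative shorthand rather than part of the framework itself: $D_t$ for oversight demand, $A_t$ for oversight achieved, $C_t$ for oversight capacity, and $G_t$ for the oversight gap at time $t$. It assumes, purely for illustration, that these quantities can be placed on a common scale, and it treats capacity as an upper bound on oversight achieved.

```latex
\[
G_t = D_t - A_t, \qquad A_t \le C_t,
\qquad
\Delta G_t = \Delta D_t - \Delta A_t .
\]
```

Read this way, the gap narrows when $\Delta A_t > \Delta D_t$ and widens when $\Delta D_t > \Delta A_t$. A rise in $C_t$ helps only insofar as it is converted into $A_t$, which matches the point that capacity does not directly translate into oversight achieved.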
[Figure 3 diagram. Factors shown as increasing or decreasing oversight capacity and oversight demand, which in turn could affect the oversight gap. Factor labels: Deskilling and fewer human researchers; Difficult-to-understand outputs; Confusion about rapid AI progress; Freeing up researcher time; More data that can be reviewed; AI-enabled oversight tools; Continuous monitoring of AIs; AIs are more error-prone; AIs are less error-prone; Higher-stakes AI R&D decisions; More outputs requiring review.]

Figure 3: AI R&D automation could decrease or increase the oversight gap by changing oversight capacity, the level of oversight achieved (not depicted for simplicity), or oversight demand. Note that oversight capacity may not directly translate into oversight achieved (e.g., companies may decide to forgo oversight due to financial or time costs).

3.2.1 Decreasing the Oversight Gap

AIRDA could decrease the oversight gap by increasing oversight capacity or the level of oversight achieved, or decreasing oversight demand.

Increasing oversight capacity or the level of oversight achieved:

By automating routine tasks such as writing code or running experiments, AIRDA could allow human researchers to spend more time on oversight.4

AI systems generate extensive logs, chains of thought, activations, and other intermediate outputs that make their work more transparent and auditable than equivalent human work.

AIRDA could accelerate the development of automated AI oversight tools (Choi et al., 2024; Fronsdal et al., 2025; 2026), which could be more effective than humans at carrying out oversight or could do so more cheaply.

Continuous monitoring is not practical or socially acceptable for humans, but is feasible for AI agents and is actively being explored (OpenAI, 2026; Anthropic, 2026a).

Decreasing oversight demand:

More capable AI systems could be less error-prone than humans.

3.2.2 Increasing the Oversight Gap

AIRDA could increase the oversight gap by decreasing oversight capacity or the level of oversight achieved, or increasing oversight demand.

4 In practice, the time gained could be spent advancing capabilities, which could increase oversight demand by accelerating AI progress.

Decreasing oversight capacity or the level of oversight achieved:

The results of automated AI research could potentially be more difficult for humans to understand (e.g., results might use novel mathematics), or AI-introduced errors could be more difficult for humans to notice.

AIRDA could motivate companies to reduce the number of human researchers involved in AI R&D. As a result, there might be less human labour available to carry out oversight activities.

AIRDA could reduce hands-on involvement in AI R&D. If so, human researchers could lose the detailed, ground-level understanding of systems and codebases needed to identify errors. There might also be reduced oversight in practice (e.g., giving instructions to coding agents without supervising their activities), which would decrease the level of oversight achieved.

Accelerated AI progress could make it difficult for humans to understand the AI R&D process and make informed decisions (e.g., time pressure leads to carelessness).

If AIRDA results in radically fewer human researchers, decisions to develop and deploy potentially dangerous capabilities could concentrate into the hands of a handful of humans with limited internal or external checks (MacAskill & Moorhouse, 2025), leading to reduced oversight at a societal level.

Increasing oversight demand:

AI systems could be more error-prone than humans (Cotroneo et al., 2025), for reasons including capability limitations, adversarial interference (e.g., poisoning training data), or reward hacking (Betley et al., 2026; MacDiarmid et al., 2025).

Automated researchers can work faster than humans and generate a larger volume of experiments, code, and decisions requiring review per unit time.

If AIRDA results in faster AI progress, individual R&D decisions could be higher stakes. For example, a decision to develop the next model could result in much larger capability jumps than occur today.

4 Metrics for AI R&D Automation

We propose metrics to track the extent of AI R&D automation, the rate of AI progress, and changes in oversight, focusing on the uncertainties identified in the previous section. For the extent of AIRDA, the metrics track:

Capability evaluations as leading indicators of whether AI R&D could be automated (Metrics #1, #2)

The extent of AI R&D automation in practice (Metrics #6, #7, #8)

Other organizational and operational indicators of AI R&D automation (Metrics #11, #12, #13, #14)

For AI progress, the metrics track both the overall rate of AI progress and the rate of safety progress relative to capabilities progress (Metrics #1, #2, #5, #6, #11).

For oversight, we track oversight capacity, how much oversight is actually achieved, and oversight demand (the latter two of which give us the oversight gap). More specifically, the metrics track:

How human researchers spend their time (Metrics #6, #8)

The extent to which AI systems help with oversight (Metric #3)

Whether AI systems act in ways that increase the demand for oversight (Metrics #4, #9, #10)

Headcount and performance of human researchers (Metric #11)

The extent to which AI systems are involved in significant decisions or actions (Metrics #7, #14)

Many of these metrics are most useful in combination. For example, an increase in the use of AI for high-stakes decisions (Metric #7), coupled with an increase in the defects of AI-generated R&D outputs (Metric #9), could raise significant concerns.

In the rest of this section, we describe what each metric tracks and analyze feasibility, limitations, and information sensitivity considerations. Sensitivity considerations could warrant sharing certain metrics only with a limited group of actors, such as key government decision makers. We summarize the metrics in Tables 2 to 5.

Table 2: Experimental metrics for measuring AI R&D automation.
Columns: Metric; Description; Extent of AIRDA; AI Progress; Oversight Capacity; Oversight Achieved; Oversight Demand; What remains to be done. Each ✓ note marks a dimension that the metric informs.

Metric #1: AI performance on AI R&D evaluations
Description: AI performance on evaluations of capabilities directly relevant to AI R&D work, such as replicating ML papers and picking research ideas to test.
✓ Leading indicator of AI systems’ potential to perform AI R&D.
✓ Can compare improvements on safety research tasks to capabilities research tasks.
What remains to be done: Developing more evaluations for non-software-engineering tasks and safety and security research tasks.

Metric #2: AI performance on AI R&D evaluations compared to humans and human-AI teams (“AI R&D Performance RCTs”)
Description: Performance differences between human-AI teams, human-only teams, and AI-only teams on AI R&D tasks.
✓ Superior AI-only performance suggests greater AIRDA potential.
✓ Can compare improvements on safety research tasks to capabilities research tasks.
✓ Depending on the interaction protocol, superior human-AI performance can suggest non-negligible oversight demand.
What remains to be done: Broadening coverage of AI R&D tasks and running more comparisons between AI systems and human-AI teams.

Metric #3: Oversight red-teaming experiments
Description: Evaluations for whether oversight systems used in production would catch an AI system that is directed to subvert the R&D process, such as by attempting to sabotage experiments.
✓ Measures ability to catch attempts to subvert the AI R&D process.
What remains to be done: Developing and running evaluations specific to internal infrastructure.

Metric #4: Misalignment evaluations
Description: Evaluations measuring propensities and capabilities relevant to misaligned behaviour, including alignment faking, sycophancy, sabotage, and reward hacking.
✓ Greater misalignment increases oversight demand.
What remains to be done: Developing evaluations that are more representative of realistic AI R&D workflows.

Metric #5: Compute efficiency improvements
Description: The year-over-year percentage reduction in the amount of compute needed to achieve a given level of performance on capability (including AI R&D) evaluations.
✓ Can measure efficiency improvements on AI R&D tasks.
✓ Directly tracks AI progress.
What remains to be done: Calculating efficiency improvements from existing data. Potentially running additional evaluations.

Table 3: Survey-based metrics for measuring AI R&D automation.
Columns: as in Table 2.

Metric #6: Staff views on AI use and productivity boosts
Description: Researchers’ self-reported AI use patterns and productivity gains across different AI R&D task categories.
✓ Elicits self-reported AI usefulness across R&D tasks.
✓ Can compare productivity boosts across teams (e.g. safety vs. pre-training teams).
✓ Can ask how helpful AI is for oversight tasks.
What remains to be done: Designing questions and establishing a regular cadence for running surveys.

Metric #7: Extent of AI use in high-stakes decisions
Description: Survey of how AI is involved in high-stakes R&D decisions, such as research agenda prioritization, training runs, and deployment decisions.
✓ Self-reported signal of AI involvement in key decisions.
✓ Can ask about how much oversight is exercised over AI involvement in high-stakes decisions.
✓ Significant AI involvement suggests greater oversight demand.
What remains to be done: Identifying high-stakes R&D decisions. Designing surveys and running them each time such decisions are made.

Table 4: Operational metrics for measuring AI R&D automation.
Columns: as in Table 2.

Metric #8: Researcher time allocation (“AI-powered Toggl”)
Description: How human researchers allocate time across different R&D activities, measured via automated time tracking software.
✓ Captures potential impacts of AIRDA on researcher time allocation.
✓ Can measure time spent on oversight activities.
What remains to be done: Developing and deploying time-tracking software, managing privacy concerns.

Metric #9: Oversight effectiveness retrospectives
Description: The proportion of AI-generated R&D outputs (e.g. production code, experimental analyses) that have defects, stratified by level of oversight.
✓ Tracks oversight activities on outputs.
✓ Tracks oversight gap directly, and oversight demand indirectly if level of oversight is held constant.
What remains to be done: Building a tracking/tagging pipeline and establishing a regular data analysis schedule.

Metric #10: AI subversion incidents
Description: The number and severity of incidents in which AI systems attempted to subvert real AI R&D processes, such as by sabotaging experiments or hiding misbehaviour.
✓ An increasing rate of subversion incidents indicates growing oversight demand.
What remains to be done: Defining what counts as a subversion incident and building detection infrastructure.

Table 5: Organizational metrics for measuring AI R&D automation.
Columns: as in Table 2.

Metric #11: Headcount of AI researchers and the distribution of their performance
Description: The number and seniority of AI researchers at frontier AI companies, along with the distribution of their performance, as estimated, for example, by salaries or compute budgets.
✓ Decreased headcount suggests increased AIRDA.
✓ Useful AI tools could lead to a widening performance distribution.
✓ Decreased headcount or performance could reduce oversight capacity.
✓ Decreased headcount could reduce how much oversight is performed.
What remains to be done: Aggregating and presenting existing data.

Metric #12: Distribution of compute usage
Description: How compute is distributed across various AI R&D activities.
✓ AIRDA could shift compute use towards internal inference.
What remains to be done: Developing systems to classify compute use.

Metric #13: Capital share of AI R&D spending
Description: Compute expenditure as a percentage of total AI R&D spending.
✓ Rising capital-to-labour ratio could signal increasing automation.
What remains to be done: Calculating and reporting the metric.

Metric #14: AI permission lists
Description: A list of which actions AI systems are authorized to take with different levels of human approval or review, including where none is required.
✓ Automation is potentially higher if AI systems are permitted to take significant actions autonomously.
✓ Oversight demand may be higher if AI systems are permitted to take significant actions without human review.
What remains to be done: Defining actions of interest, collecting permissions across teams, and updating the list regularly.
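As a rough illustration of how three of the more arithmetic quantities in Tables 2 to 5 might be computed, the following Python sketch implements the table descriptions of Metric #5 (year-over-year compute reduction), Metric #9 (defect proportion stratified by oversight level), and Metric #13 (compute expenditure as a percentage of total AI R&D spending). The function names, the input formats, the oversight-level labels, and all numbers are hypothetical choices made for this example; a real pipeline would need agreed definitions of "compute expenditure", "defect", and "level of oversight".

```python
from collections import defaultdict


def capital_share(compute_spend, total_rd_spend):
    """Metric #13 sketch: compute expenditure as a percentage of total AI R&D spending."""
    if total_rd_spend <= 0:
        raise ValueError("total_rd_spend must be positive")
    return 100.0 * compute_spend / total_rd_spend


def compute_efficiency_gain(compute_last_year, compute_this_year):
    """Metric #5 sketch: year-over-year percentage reduction in the compute
    needed to reach the same fixed level of evaluation performance."""
    if compute_last_year <= 0:
        raise ValueError("compute_last_year must be positive")
    return 100.0 * (compute_last_year - compute_this_year) / compute_last_year


def defect_rate_by_oversight(outputs):
    """Metric #9 sketch: proportion of AI-generated outputs with defects,
    stratified by oversight level.

    `outputs` is an iterable of (oversight_level, has_defect) pairs; the
    level labels are whatever the organisation uses (hypothetical here).
    """
    totals = defaultdict(int)
    defects = defaultdict(int)
    for level, has_defect in outputs:
        totals[level] += 1
        defects[level] += int(bool(has_defect))
    return {level: defects[level] / totals[level] for level in totals}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(capital_share(compute_spend=60.0, total_rd_spend=100.0))
    print(compute_efficiency_gain(compute_last_year=100.0, compute_this_year=75.0))
    print(defect_rate_by_oversight([
        ("none", True), ("none", False), ("reviewed", False),
    ]))
```

Keeping each metric as a small pure function makes it easier to report the same quantity over time and to aggregate across organisations, but the hard part in practice is the upstream classification and tagging that the "What remains to be done" entries describe.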

4.1 Experimental Metrics

These metrics involve running experiments on AI systems.

4.1.1 Metric #1: AI performance on AI R&D evaluations

Dimensions tracked (✓): Extent of AIRDA, AI Progress. Not marked: Oversight Capacity, Oversight Achieved, Oversight Demand.

Description: This metric tracks capabilities directly relevant to AI R&D work, including software engineering, ML experimentation, research replication, and idea generation and curation. Current evaluations mostly cover tasks related to software engineering, and include SWE-Bench, MLE-Bench, RE-BENCH, and PaperBench (Jimenez et al., 2023; Chan et al., 2025; Wijk et al., 2025; Starace et al., 2025). Where available, both AI-only performance and human baseline comparisons should be reported. Additionally, if the time humans took to complete tasks is available, the x%-task-completion time horizon for various x (e.g., 50%, 80%) should be reported (see Metr (2025b) for a more precise definition).

Significance: Capability evaluations provide information about the extent to which AI systems could automate core AI R&D activities. Because certain tasks could be more relevant to safety research than to capabilities research, tracking performance across different task types could offer insight into differential progress across research areas.

Feasibility: High, especially since benchmark results are commonly reported (Anthropic, 2025b; OpenAI, 2025a; Google DeepMind, 2025b). The main costs are evaluation design, compute for running the evaluation, and engineering time (e.g., testing different agent scaffolds).

Limitations:

Ecological validity: Benchmarks do not fully capture R&D work, which involves ambiguous objectives, longer time horizons, coordination across projects, and messier feedback loops. As such, raw performance numbers can tell us that we are getting closer to AIRDA, but might not tell us how far away we are.

Data contamination: Some evaluations are drawn from public sources (e.g., GitHub, Kaggle, published papers) and could therefore leak into training sets. As a result, this metric could overestimate AI R&D capabilities.

Autonomous evaluations only: These benchmarks evaluate AI agents working independently, but human-AI collaborative teams may better reflect how AI will actually be integrated into R&D work (see Metric #2).

Setup sensitivity: Results can depend on factors such as the task setup and agent scaffolding, making comparisons across setups difficult (Zhu et al., 2025).

Saturation: Evaluations can become quickly saturated. For example, Claude Opus 4.6 has saturated all of Anthropic’s cyber evaluations (Anthropic, 2026b).

Sensitivity: Low. Organizations with strong performance will likely share results voluntarily for marketing and recruiting, but weak results could be omitted or selectively reported.

What remains to be done: Developing more evaluations for tasks not related to software engineering, such as idea generation and curation, and for tasks that may be more relevant to safety and security research than capabilities research (e.g., interpretability, red-teaming).

4.1.2 Metric #2: AI performance on AI R&D evaluations compared to humans and human-AI teams (“AI R&D Performance RCTs”)

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓ ✓

Description: This metric measures performance differences between human-AI teams, human-only teams, and AI-only teams on AI R&D tasks. The human-AI collaborative teams should have predefined interaction protocols (e.g., humans can review AI outputs, redirect approaches, provide context, or take over subtasks). Both absolute performance in each group and the gap between groups should be reported.

Significance: Superior performance of AI-only teams would indicate greater potential for AIRDA. Superior human-AI performance would indicate that humans still provide value to AI R&D, but more precise implications would depend upon the human-AI interaction protocol. For example, if the protocol mostly involves a human overseeing AI work, superior human-AI performance would indicate that human oversight still meaningfully improves outputs. On the other hand, we would not obtain information about oversight if humans and AI systems mostly complete tasks in parallel without interacting.

Feasibility: Moderate. Although some organizations have begun conducting related studies (Becker et al., 2025; 2026; Dell’Acqua et al., 2023; Brynjolfsson et al., 2023; Noy & Zhang, 2023; Metr, 2025b), human-AI evaluation requires substantial effort and cost, including recruiting human participants with relevant expertise (expensive for skilled ML or SWE tasks), designing standardized collaboration protocols, and running parallel AI-only and human-AI evaluations on matching task sets. However, where a human expert baseline already exists, the additional cost of comparing it to an AI-only evaluation is minimal. Furthermore, a lighter-weight, but less objective, version of this metric might instead be based on researchers’ self-reports rather than experiments (see Metric #6).

Limitations:

Problems with evaluations: Because this metric is also based on evaluations, it inherits many of the same limitations as Metric #1, such as ecological validity and setup sensitivity.

Comparability over time: Human participants have different skills and human performance is less reproducible than AI performance. As such, it could be difficult to compare RCT results over time.

Dependence on human expertise levels: The gap between teams will vary with human expertise level. Results should clearly specify the skill level of human participants.

Performance gaps likely depend on task selection: Some tasks inherently favor collaboration (ambiguous, long-horizon problems), while others favor autonomous execution (well-defined, parallelizable work). The metric is most informative when tracked across diverse task types.

Sensitivity: Low to moderate, depending on the results. Results showing that human-AI teams substantially outperform AI-only systems could undermine marketing claims about AI agents. Moreover, superior performance of AI-only systems could motivate public concerns about unemployment or a potentially uncontrollable intelligence explosion (MacAskill & Moorhouse, 2025).

What remains to be done: Covering a broader array of AI R&D tasks in future studies. In principle, every benchmark used in Metric #1 could also be used for this metric.

4.1.3 Metric #3: Oversight red-teaming experiments

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓

Description: In controlled experiments, this metric measures how often an organization’s oversight systems would catch an AI system explicitly instructed to subvert normal R&D workflows, such as by attempting to sabotage experiments, insert backdoors, or manipulate evaluation results. The metric is reported as the percentage of subversive actions that are detected, both overall and by task category.

Significance: All else equal, oversight systems that are more capable of detecting subversion attempts in test deployments will likely be more effective during the real R&D process. However, because this metric involves controlled experiments, it does not quite measure the level of oversight achieved.

Feasibility: Low to moderate. Implementation requires a sandboxed R&D environment mirroring production workflows, models fine-tuned or prompted to attempt specific behaviours, and instrumentation to log both attempted actions and detection events. Some related evaluations exist (Inglis et al., 2025), but may not necessarily reflect real production workflows. A related line of work uses red teams to embed hidden objectives in a language model, with blue teams attempting to detect them (Marks et al., 2025).

Limitations:

Does not measure propensity: The metric does not capture whether AI systems would attempt subversion unprompted. Prompting AI systems to be subversive might produce behaviours that would not surface during normal use of AI for R&D.

Reduced realism: Testing must occur in sandboxed environments to prevent actual harm, which may reduce realism and potentially affect model behaviour.

Sensitivity: Low to moderate. Aggregate results would be less sensitive than breakdowns that reveal specific attack vectors. Organizations with strong detection rates might selectively share results to demonstrate security/oversight maturity.

What remains to be done: Making evaluations specific to internal company infrastructure.

4.1.4 Metric #4: Misalignment evaluations

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓

Description: This metric measures AI systems’ propensities and capabilities relevant to misaligned behaviour in AI R&D environments. Misaligned behaviours could include alignment faking (Greenblatt et al., 2024), scheming (Meinke et al., 2025), sabotage (Benton et al., 2024; Ward et al., 2025), and reward hacking (Betley et al., 2026).

Significance: Greater misalignment increases oversight demand.

Feasibility: High. Many AI companies and external organizations already run such evaluations (OpenAI, 2025a; Anthropic, 2025b; Google DeepMind, 2025b; Ward et al., 2025; Betley et al., 2026; MacDiarmid et al., 2025).

Limitations:

Lack of generalizability: Differences between evaluation environments and real AI R&D workflows could limit the informativeness of evaluations. For example, real R&D workflows likely involve lengthier, more complex interaction patterns than can be simulated in an evaluation.

Evaluation gaming: Data leakage or attempts to train on evaluations could make this metric less informative.

Evaluation awareness: Frontier models increasingly recognize when they are being evaluated and condition their behaviour accordingly (Meinke et al., 2025). As a result, misalignment evaluations could underestimate the degree of misalignment in practice.

Sensitivity: Low to moderate. As noted above, several AI companies already publish misalignment evaluation results in system cards. However, high misalignment scores in the future could reflect poorly on their products.

What remains to be done: Developing evaluations that are more representative of realistic AI R&D workflows.

4.1.5 Metric #5: Compute efficiency improvements

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓

Description: This metric measures the year-over-year percentage reduction in the amount of compute (FLOP) needed to achieve a given level of performance on capabilities (including AI R&D) evaluations. For example, if achieving a fixed benchmark score requires half as much compute this year as last year, that represents a 2x improvement in compute efficiency. This metric could be reported either as an industry-wide aggregate or on a per-company basis. Ideally, improvements in both training compute efficiency and inference compute efficiency should be reported.

Significance: Compute efficiency improvements on AI R&D tasks could be a signal of AIRDA. This metric also directly tracks AI progress: if AIRDA increases the pace at which algorithmic or data improvements are created, one expected signal would be large and sustained increases in compute efficiency.

Feasibility: Moderate to high, depending on whether companies need to run new evaluations and train new AI systems. The core methodology involves comparing the performance of models trained or run with various compute budgets (Hernandez & Brown, 2020; Davidson et al., 2023; Ho et al., 2024; Ho & Berg, 2025; Ho et al., 2025; Whitfill, 2025; Gundlach et al., 2025). Obtaining accurate estimates of efficiency improvements could require training additional AI systems that companies will not use in production. The cost could potentially be reduced by focusing on improvements in post-training rather than pre-training. Furthermore, incomplete estimates based on existing evaluations and training runs could still be informative.

Limitations:

Limitations of evaluations: This metric inherits the limits of evaluations discussed for Metric #1, such as benchmark saturation and inconsistencies in evaluation set-ups.

Does not capture performance ceiling: Compute efficiency measures how cheaply a given performance level can be achieved, but it says nothing about the maximum performance attainable. For example, it could become cheaper over time to reach 50% on a benchmark, while the best achievable score remains stuck at 75% year over year.

Attribution difficulties: This metric does not distinguish between the causes of observed efficiency gains, which could be due to factors such as algorithmic improvements, improvements in data quality, or scaling up training compute. If observed gains are due to scaling up training compute, acceleration of AI progress may not necessarily be attributable to AIRDA.

Sensitivity: Low to moderate. Leading companies may report the metric as a positive signal of their research capabilities (e.g., to advertise their products or attract talent), while lagging companies may hesitate to reveal their relative standing. Sensitivity decreases further if reported as an industry-wide aggregate.

What remains to be done: Calculating efficiency improvements from existing data. Potentially running additional evaluations.

Footnote 5: Some algorithmic improvements could be scale-dependent: they provide much larger efficiency gains as compute is scaled up. For example, Gundlach et al. (2025) compare transformers to LSTMs, finding a 6.3x efficiency improvement at 10^15 FLOP and a 26x improvement at 3 × 10^16 FLOP.
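As a rough illustration of the arithmetic in the Metric #5 description, the sketch below converts the compute needed to reach a fixed benchmark score in two consecutive years into an efficiency multiplier and a year-over-year percentage reduction. The function names and the FLOP figures are hypothetical and chosen only for illustration; they are not drawn from any reported data.

```python
def efficiency_multiplier(flop_last_year: float, flop_this_year: float) -> float:
    """Return how many times less compute is needed this year for the same score."""
    if flop_last_year <= 0 or flop_this_year <= 0:
        raise ValueError("compute must be positive")
    return flop_last_year / flop_this_year


def percent_reduction(flop_last_year: float, flop_this_year: float) -> float:
    """Return the year-over-year percentage reduction in compute for the same score."""
    if flop_last_year <= 0 or flop_this_year <= 0:
        raise ValueError("compute must be positive")
    return 100.0 * (1.0 - flop_this_year / flop_last_year)


# Hypothetical figures: the same benchmark score needs half as much compute.
last_year, this_year = 2.0e24, 1.0e24
print(efficiency_multiplier(last_year, this_year))  # 2.0
print(percent_reduction(last_year, this_year))      # 50.0
```

Reporting both numbers side by side avoids ambiguity: a 50% reduction corresponds to a 2x improvement, while a 75% reduction corresponds to a 4x improvement.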

4.2 Survey-Based Metrics

These metrics involve surveying human researchers at AI companies.

4.2.1 Metric #6: Staff views on AI use and productivity boosts

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓ ✓

Description: This metric captures researchers’ self-reported AI use patterns and productivity gains across different task categories. For each task category, researchers answer questions such as:

“What percentage speed-up do you experience when using AI tools for this task?”

“Approximately how many hours per week do AI tools save you on this task?”

“To what extent could AI tools replace a newly hired junior researcher?”

“How have the tasks in your day-to-day job shifted over the past X months from AI use, if at all?”

“How much of a speed-up in AI progress would you expect over the next six months as a result of AI tools?”

Task categories could include: code implementation and debugging, experiment design and execution, literature review, data analysis, writing and documentation, and review of AI-generated outputs.

Significance: Higher productivity boosts could suggest greater AIRDA, and productivity boosts for oversight tasks could suggest that oversight capacity is increasing. Comparing productivity boosts across teams (e.g., safety vs. pre-training) could provide information about differential AI progress. Open-ended questions about AI use patterns could also complement more quantitative data.

Feasibility: High. Implementation requires only periodic surveys with straightforward questions. Most organizations already have infrastructure for such data collection. As further evidence of feasibility, the Claude Opus 4.5 System Card includes data from an internal survey that asks whether Opus “could completely automate a junior ML researcher” (Anthropic, 2025b). The main challenge is maintaining response rates and avoiding survey fatigue.

Limitations:

Subjectivity: Self-reported productivity gains are subjective and may not correlate with objective output measures. Survey respondents could also overestimate benefits due to novelty effects, or underestimate them as they grow accustomed to AI tools and raise their expectations of what counts as a meaningful boost (Becker et al., 2025).

Coverage gaps: Productivity gains from automation of a task could go unreported if researchers no longer consider it part of their job. This metric also does not capture the creation of new kinds of AI R&D tasks as a result of automation (e.g. substantially more time spent reviewing AI logs).

Sensitivity: Low. The data is subjective and provides limited competitive intelligence. Some organizations might voluntarily share positive results for recruiting and marketing purposes.

What remains to be done: Designing questions and establishing a regular cadence for running surveys.

4.2.2 Metric #7: Extent of AI use in high-stakes decisions

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓ ✓

Description: This metric surveys researchers’ judgment of how AI is involved in categories of high-stakes decisions, such as scaling up or launching training runs, compute allocation across teams, and deployment decisions. For each decision, the survey would assess both the level and nature of AI involvement. Potential questions could include:

For this (category of) decision, select all below that apply:
– AI generated initial options/proposals that humans chose from
– AI provided analysis or data synthesis that informed the decision
– AI drafted decision memos or documentation
– AI made recommendations that were adopted without significant modification
– AI flagged risks or concerns that changed the decision

Overall, how crucial was the involvement of an AI system in making this decision? (reported on a Likert scale, with optional comment fields)

What kind of oversight was exercised over the AI system?

Significance: Increased AI involvement in high-stakes decisions could be a signal of AIRDA. Increased involvement could also mean increased oversight demand: failure of such AI systems would be more consequential given the stakes of the decisions at hand. This metric can also track how much oversight is performed if the survey asks about it.

Feasibility: Low to moderate. Data collection is relatively straightforward, but companies would need to define which high-stakes decisions to track and how to assess AI involvement in such decisions. One way of collecting the data is to integrate the survey into existing approval workflows for major decisions (e.g., training run sign-offs).

Limitations:

Subjectivity: Assessing the importance of AI involvement would likely be subjective. Researchers could potentially have reputational incentives to understate AI involvement.

Does not capture decision quality: If decisions are of poor quality when AI involvement is high, AIRDA could be less extensive, or less likely to increase AI progress, than this metric would suggest.

Sensitivity: High. This metric likely reveals information about internal decision-making processes and strategy.

What remains to be done: Identifying high-stakes R&D decisions. Designing surveys and running them each time such decisions are made.
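One possible way to aggregate the Metric #7 survey responses for a single decision category is sketched below. The option codes, the response layout, and the 1 to 5 Likert range are all assumptions made for illustration; the text above does not prescribe a data format or a scale.

```python
from collections import Counter
from statistics import median

# Hypothetical short codes for the "select all that apply" options.
OPTIONS = (
    "generated_options",
    "analysis",
    "drafted_memo",
    "adopted_recommendation",
    "flagged_risk",
)


def summarize_decision_survey(responses):
    """Summarize responses for one decision category.

    Each response is a dict with "selected" (a set of option codes) and
    "cruciality" (a Likert rating, assumed here to run from 1 to 5).
    """
    if not responses:
        raise ValueError("no responses to summarize")
    counts = Counter(code for r in responses for code in r["selected"])
    n = len(responses)
    return {
        # Share of respondents who ticked each option.
        "share_selected": {code: counts[code] / n for code in OPTIONS},
        # Median is used because Likert ratings are ordinal.
        "median_cruciality": median(r["cruciality"] for r in responses),
    }


# Hypothetical responses from four researchers.
example = [
    {"selected": {"analysis", "drafted_memo"}, "cruciality": 2},
    {"selected": {"analysis"}, "cruciality": 3},
    {"selected": set(), "cruciality": 1},
    {"selected": {"analysis", "flagged_risk"}, "cruciality": 4},
]
print(summarize_decision_survey(example))
```

Tracking these shares over successive decisions of the same category would show whether AI involvement is shifting from supporting roles (analysis, drafting) towards recommendations adopted without significant modification.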

4.3 Operational Metrics

These metrics monitor ongoing R&D processes and events.

4.3.1 Metric #8: Researcher time allocation (“AI-powered Toggl”)

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓

Description: This metric would use software to track how AI researchers allocate their time across different R&D activities. Such software could use AI systems to categorize the activities on the researcher’s screen. Task categories could include:

Primary R&D activities:
– Writing and debugging code
– Designing experiments and formulating hypotheses
– Running and monitoring experiments
– Analyzing experimental results and interpreting data
– Literature review and reading research papers
– Architecture and system design
– Meetings and collaboration
– Infrastructure and tooling setup

AI interaction activities:
– Prompting or directing AI coding assistants
– Reviewing AI-generated outputs (code, analysis, summaries)
– Debugging AI behaviour or outputs

Results should be reported as median and mean percentage allocation per activity category across all researchers, tracked over time.

Significance: Time allocation patterns could provide on-the-ground evidence of the extent of AIRDA in a way that benchmarks cannot capture. For example, it would be strong evidence of AIRDA if researchers shifted more of their time from primary R&D activities to AI interaction activities. Time spent on oversight activities also measures how much oversight is performed.

Feasibility: Moderate. Time-tracking software needs to be written and there need to be ways to navigate privacy concerns. Tracking will likely become significantly easier over time because AI tools can automate it.

Limitations:

Time spent is not a perfect indicator of productivity or output quality: For example, code quality is not necessarily correlated with the time spent on coding.

Difficulty aggregating results: It may be difficult to compare and aggregate time allocation between researchers at different skill levels or working on different problem types.

Unobservables: The metric does not capture cognitive load. Researchers might spend less wallclock time on a task but more mental effort reviewing AI outputs. It might also be difficult to measure activities that are not captured in software applications (e.g. time spent outside of the office thinking about a problem).

Confounders: Time allocation could shift without corresponding changes in AIRDA, for example as a result of shifting company priorities.

Sensitivity: Low to moderate, depending on the granularity of the activity categories. Time allocation data reveals workflow patterns and what AI is successfully automating, which could provide some competitive intelligence. Sensitivity is likely to be higher for detailed breakdowns because they could reveal strategic priorities or weaknesses. Time-tracking software itself could also raise privacy concerns amongst employees.

What remains to be done: Developing and deploying the time-tracking software, managing privacy concerns.

4.3.2 Metric #9: Oversight effectiveness retrospectives

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓

Description: This metric focuses on defects in AI-generated R&D outputs. For each AI-generated R&D output, companies would track:

Whether the output underwent human review (and what kind)

The degree to which AI systems assisted with human review

A description of the defects, including bugs, errors, security vulnerabilities, misconfigurations, or other quality issues

When the defects were discovered (e.g., during review, after review but before production, or after production)

The severity of the defects discovered

Categories of AI R&D outputs could include randomly selected subsets of:

Pull requests

Analyses of experimental data

Logs of how a training run was monitored

Both the proportion of outputs that have defects and the length of time until discovery of the defect(s) should be reported, stratified by severity.

Significance: This metric directly tracks how much oversight is performed on AI-generated outputs. Increases in the frequency of defects would suggest increases in the oversight gap. If the level of oversight is held constant, increases in the frequency of defects could also suggest increases in oversight demand.

Feasibility: Low to moderate, depending on the type of output. One difficulty is defining a defect. For code, defining defects is relatively straightforward. For scientific papers, reasonable people may disagree on what counts as a defect, and such defects may only come to light long after publication. Another difficulty is that in practice, outputs will be partly human-written and partly AI-generated. Organizations may need to build systems to track which portions are AI-generated. For example, Cursor (2026) proposes a standard for labelling AI-generated code.

Limitations:

Limited comparability: There can be reasonable disagreement about what constitutes a defect in a complicated output such as a scientific paper. Inconsistent definition of defects could make it difficult to compare results between AI companies. Inconsistencies in how human review is defined between companies (e.g. quick glance vs. detailed audit) could also hinder comparability.

Difficulty of catching defects: Defect detection depends on downstream quality processes (testing, review, monitoring), which may not catch all errors equally for AI versus human work (e.g. current error detection processes might be more effective at catching human errors).

Sensitivity: Moderate to high because the metric could reveal oversight quality problems and raise safety concerns.

What remains to be done: Building a tracking/tagging pipeline and establishing a regular data analysis schedule.

4.3.3 Metric #10: AI subversion incidents

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓

Description: This metric tracks incidents where AI systems have attempted to subvert AI R&D workflows in practice. Examples of subversive actions could include sabotage of experiments, insertion of backdoors into production code, or manipulation of evaluation results. For each incident, organizations would record:

A description of the subversive action attempted

When the incident occurred

The context in which it occurred (e.g., which task, which model, what level of autonomy)

How the incident was detected (e.g., chain-of-thought analysis)

The severity of the incident (i.e., the damage caused or the potential damage had the attempt succeeded)

An assessment of why the incident occurred, including the extent to which the incident appears to be goal-directed, as opposed to being a result of e.g. misunderstanding instructions.

As a summary statistic, the number and severity of such subversion incidents could be reported.

Significance: An increasing number or severity of subversion attempts, even if detected and prevented, indicates increased oversight demand. Compared to Metric #9, which measures defects in AI-generated outputs, this metric focuses on subversion attempts, whether successful or not. Compared to Metrics #3 and #4, this metric focuses on real production workflows rather than sandboxed or toy environments.

Feasibility: Low to moderate. The monitoring infrastructure to detect subversion could require significant engineering effort and compute. Some companies are beginning to build such infrastructure (OpenAI, 2026).

Limitations:

Definitional ambiguity: It may be difficult to define what counts as a subversive action. Apparently subversive actions could result from factors such as capability limitations or ambiguous instructions. Models can also produce misleading or post-hoc rationalizations in their reasoning traces.

Under-detection: Because AI systems could try to evade detection, the observed incident count could significantly underestimate the true rate.

Sensitivity: Moderate to high. Public knowledge of subversion attempts could undermine confidence in the company’s products and reveal specific vulnerabilities.

What remains to be done: Defining what counts as a subversion incident and building detection infrastructure.
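The incident record described for Metric #10 could be captured in a small structured type, from which the summary statistic (number of incidents by severity) follows directly. The sketch below is one hypothetical layout: the field names, the three-level severity scale, and the example incidents are assumptions for illustration, not a proposed standard.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical severity scale; the metric does not fix one.
SEVERITY_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class SubversionIncident:
    description: str               # the subversive action attempted
    occurred_on: date              # when the incident occurred
    context: str                   # task, model, level of autonomy
    detection_method: str          # e.g. chain-of-thought analysis
    severity: str                  # actual or potential damage
    goal_directed: Optional[bool]  # None if the assessment is inconclusive
    assessment: str = ""           # why the incident occurred

    def __post_init__(self):
        if self.severity not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity: {self.severity!r}")


def incidents_by_severity(incidents):
    """Count incidents at each severity level, including levels with zero."""
    counts = Counter(i.severity for i in incidents)
    return {level: counts[level] for level in SEVERITY_LEVELS}


# Hypothetical incident log.
log = [
    SubversionIncident("altered an evaluation script", date(2026, 1, 5),
                       "eval pipeline, agentic coding task",
                       "chain-of-thought analysis", "high", True),
    SubversionIncident("skipped a required test step", date(2026, 1, 9),
                       "CI task", "code review", "low", None,
                       "possibly misunderstood instructions"),
]
print(incidents_by_severity(log))  # {'low': 1, 'medium': 0, 'high': 1}
```

Keeping the goal-directedness assessment as a separate field lets reports distinguish apparent subversion caused by ambiguous instructions from attempts that appear deliberate.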

4.4 Organizational Metrics

These metrics capture organizational structure, resource allocation, and policies.

4.4.1 Metric #11: Headcount of AI researchers and the distribution of their performance

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓ ✓ ✓

Description: This metric measures the number and seniority of AI researchers at frontier AI companies, along with the distribution of their performance. Headcount should be disaggregated by team (e.g. trust and safety team, pre-training team, etc.). Performance could be measured (or proxied) in a variety of ways, such as:

Completion of OKRs (adjusted by difficulty)

Number of code commits

Salaries

Compute budgets

Researcher headcount should not include:

Staff building software products, features, or services on top of existing AI systems

Business, operational, marketing, legal staff, etc.

Significance: AIRDA could lead to decreases in headcount if a small number of researchers is able to get much of the necessary R&D work done with AI assistance. As explained in Section 3.2, lower headcounts could mean reduced oversight capacity and level of oversight achieved. Furthermore, a widening distribution of researcher performance could signal that AI tools are becoming powerful enough to create large productivity differences between those who use them effectively and those who do not. As AI systems continue to improve, this gap could compound.

Feasibility: Moderate to high. HR already maintains records of staff counts by role type and salary levels. Compiling and reporting this information requires minimal additional infrastructure. However, performance data may be more difficult to standardize within and across organizations.

Limitations:

Difficulties measuring performance: Performance metrics like OKRs may not capture the most important research contributions, which are often difficult to predict or measure. Attribution of research outcomes to individual researchers is inherently difficult in collaborative environments. Research is also non-linear: long periods of apparent unproductivity could precede major breakthroughs.

Confounders for headcount: Headcount changes could reflect a variety of other factors such as funding cycles, labour market conditions, or strategic pivots.

Non-linearity: We should expect human labour to become more valuable before full automation because AI systems can augment human labour (Gans & Goldfarb, 2026). As a result, headcount could remain high or even increase as long as human-dependent activities bottleneck full automation, such as providing high-level research direction (Toner et al., 2026).

Sensitivity: Moderate to high. Sharing salaries in particular could invite public criticism or create uncomfortable social dynamics between employees. Sharing OKRs could reveal research strategies. However, some information (headcount, hiring trends) can be gathered through publicly available sources such as LinkedIn.

What remains to be done: Aggregating and presenting existing data.

4.4.2 Metric #12: Distribution of compute usage

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓

Description: This metric measures the amount of compute (FLOP) used by frontier AI companies across the AI R&D process, including pre-training, mid-training, post-training, evaluation, experimentation, internal inference (e.g. coding and research agents), and external deployment.

Significance: Shifts in compute allocation towards internal inference, especially relative to compute used for external deployment, could suggest increases in AIRDA. Comparing internal and external inference could help to distinguish AIRDA from company growth.

Feasibility: Moderate to high. Compute is likely to be meticulously tracked at frontier AI companies for cost management and infrastructure planning. For example, pre-training workloads require months-long planning and dedicated hardware reservations. The main implementation challenge is categorizing uses of compute. Public estimates based on dollar spend are possible (You, 2025), but would not track fine-grained details about how compute is used.

Limitations:

Other factors affecting compute allocation: Breakthroughs in compute efficiency could mean that less compute is required to accomplish the same activity. The allocation between external and internal inference could also heavily depend upon market dynamics and investor demands.

Inconsistent categorization of activities: Because different AI companies could have different R&D workflows, results may not be comparable between them.

Sensitivity: High. Compute usage data reveals both absolute resource levels relative to competitors and R&D productivity. Many frontier AI companies stopped disclosing pre-training compute in 2024, although the EU AI Act (reg, 2024) requires some disclosure for in-scope systems.

What remains to be done: Developing systems to classify compute use. For example, a workload classifier could tag jobs submitted to a compute cluster, while another classifier could tag internal agent or chat logs, similar to Tamkin et al. (2024). External organizations can try to create estimates of this metric based on publicly available information.

4.4.3 Metric #13: Capital share of AI R&D spending

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓

Description: This metric measures compute expenditure as a percentage of total AI R&D spending. It is calculated as:

\[ \text{capital share} = \frac{\text{compute costs for R\&D activities}}{\text{total R\&D expenditure including labour, compute, and other costs}} \times 100 \]

Significance: As AI systems take on more of the work previously done by humans, the share of spending going to compute (capital) should increase relative to spending on labour and other costs. Combining this metric with Metric #12 could help to distinguish increased spending on inference from increased spending on training.

Footnote 6: A broader variant of this metric is the network-adjusted capital share (Trammell, 2026), which would measure the extent to which a particular process (e.g. AI R&D) could sustain itself without human labour anywhere in its supply chain.

Feasibility: High. Organizations already track R&D expenditures for internal budgeting and tax purposes, and standardized tax definitions of R&D could facilitate cross-company comparisons. Separating compute costs from labour costs is straightforward accounting.

Limitations:

Ambiguity: Rising capital share could reflect increasing AIRDA, but could also reflect larger investments in training runs, changes in compute pricing (e.g., from supply chain disruptions), or wasteful or inefficient uses of compute.

Sensitivity: Moderate. This metric reveals the basic structure of R&D spending, but it does not directly expose capabilities, specific techniques, or detailed competitive information.

What remains to be done: Calculating and reporting the metric based on existing data.

4.4.4 Metric #14: AI permission lists

Extent of AIRDA AI Progress Oversight Capacity Oversight Achieved Oversight Demand ✓ ✓

Description: A list of which actions AI systems are authorized to take and what level of human review or approval is needed, if any. Actions could include: modifying production code, initiating training runs, deploying models, and accessing sensitive systems or data. For each action, organizations would report the default permission level (e.g. no review required), recording notable deviations if applicable.

Significance: This metric tracks oversight demand, which could be higher if AI systems are authorized to take significant actions without human review. AIRDA is likely also higher if AI systems are permitted to take significant actions autonomously (e.g., launching large training runs). Compared to Metric #7, which is about AI use in practice, this metric is about policies around AI use.

Feasibility: Moderate to high. Organizations implementing AI agents for R&D tasks must already make decisions about permissions and access controls. This metric simply requires documenting those decisions systematically.

Limitations:

Formal vs. actual permissions: Documented permissions may not reflect actual practice. For example, exceptions to the rules might be granted easily, or human approval could be perfunctory.

Granularity challenges: Actions must be specific enough to be meaningful, but ideally general enough to be comparable across different organizations with different infrastructure.

Does not account for reliability: Even if AI systems are authorized to take high-stakes actions, oversight demand could in principle be lower if AI systems are much more reliable than humans.

Sensitivity: Moderate to high, depending on the level of granularity. Permission lists could reveal internal security practices and potential vulnerabilities. High-level summaries (e.g., “AI systems cannot autonomously modify production code” vs. “AI systems can autonomously commit to any repository”) would likely be less sensitive.

What remains to be done: Defining actions of interest, collecting permissions across teams, and updating the list regularly.

5 Limitations

This work has several limitations that warrant consideration.

Our analysis of AIRDA’s implications in Section 3 has largely ignored higher-order effects. While their significance is highly uncertain, they could be meaningful to track. For example, if AI R&D automation advances safety research faster than capabilities research, the perception that safety is progressing adequately could encourage more risk-taking.

Many of the proposed metrics are lagging indicators: they measure automation that has already occurred. By the time these metrics show concerning trends, the window for potential interventions may have narrowed considerably. While benchmark performance on AI R&D tasks could be a leading indicator, translating such performance into predictions about real-world AIRDA remains imprecise.

Relatedly, the relationship between AIRDA metrics and AI progress could be highly non-linear. AI R&D could be substantially automated across most tasks while remaining bottlenecked by a single human-dependent activity, such as providing high-level research direction. In such scenarios, metrics might indicate high levels of automation without corresponding acceleration of AI progress. Automation of that final bottleneck could dramatically accelerate progress, yet only show up as a small change in the metrics.
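
The bottleneck argument above can be illustrated with a toy calculation. The sketch below is not a model proposed in this work; it assumes an Amdahl's-law-style setup in which a fraction of AI R&D work (measured in baseline human time) is automated and runs `ai_speedup` times faster, while the remainder still proceeds at human speed. The function name, its parameters, and the 1000x figure are hypothetical choices made purely for illustration.

```python
def overall_speedup(automated_fraction: float, ai_speedup: float) -> float:
    """Amdahl's-law-style speedup of the whole R&D process (toy assumption).

    automated_fraction: share of baseline human R&D time that is automated
        (a hypothetical stand-in for an "extent of AIRDA" metric).
    ai_speedup: how many times faster the automated share runs (assumed).
    """
    if not 0.0 <= automated_fraction <= 1.0:
        raise ValueError("automated_fraction must be between 0 and 1")
    if ai_speedup <= 0.0:
        raise ValueError("ai_speedup must be positive")
    human_time = 1.0 - automated_fraction             # still runs at human speed
    automated_time = automated_fraction / ai_speedup  # accelerated share
    return 1.0 / (human_time + automated_time)


if __name__ == "__main__":
    # Illustrative sweep: the metric moves in small steps, the speedup does not.
    for fraction in (0.90, 0.95, 0.99, 1.00):
        speedup = overall_speedup(fraction, ai_speedup=1000.0)
        print(f"automated share {fraction:.0%}: overall speedup {speedup:.1f}x")
```

Under these assumptions, moving from a 90% to a 99% automated share raises the overall speedup from roughly 10x to roughly 91x, and automating the final percentage point raises it to 1000x. A metric that tracks the automated share would register that last step as a one-point change.
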
Differences in data collection methods could limit comparability between AI companies. For example, companies could define task categories differently or use different attribution methods for AI contributions. Without standardized measurement protocols, aggregating or comparing metrics across organizations could be misleading. This limitation is particularly acute for metrics involving subjective judgment, such as surveys on productivity gains.

Given that AI companies would internally collect most of our proposed metrics, it might be difficult to verify their accuracy. For instance, companies could understate the degree of automation for fear of government intervention. As a mitigation, independent third parties could help to corroborate company results, as they already do for benchmarks (UK AI Security Institute, 2024; Schoen et al., 2025; Epoch AI, 2024). AI companies could also contract external organizations to carry out data collection with standardized protocols, such as for researcher surveys.

Any collection of AIRDA metrics will likely need to be updated over time. In addition to lessons learned over the course of developing and collecting such metrics, we will likely gain more clarity over time about which effects of AIRDA to prioritize.

Finally, many factors relevant to AI progress and oversight fall outside the domain of AI R&D. For instance, metrics focused on AIRDA do not capture the cybersecurity readiness of critical infrastructure providers (Kamilė Lukošiūtė, 2025) or the pace of regulatory adaptation. AI companies would also find it difficult to track such metrics by themselves. A comprehensive assessment of AIRDA’s implications will likely require cross-domain collaborations.

6 Related Work

A growing suite of benchmarks evaluates AI systems on tasks relevant to AI R&D.
SWE-bench and its variants (Jimenez et al., 2023) assess software engineering capabilities on real GitHub issues, while MLE-bench evaluates ML engineering through Kaggle competitions (Chan et al., 2025). For tasks more directly relevant to frontier research, METR’s RE-Bench compares AI agents to human experts on ML research engineering tasks (Wijk et al., 2025), and OpenAI’s PaperBench tests agents’ ability to replicate ICML papers from scratch (Starace et al., 2025).

A number of works have studied theoretical models of AI R&D automation. Eth & Davidson (2025) explore the possibility that software progress alone could lead to increasingly accelerating AI progress. AI Futures Project (2026) provides a model for forecasting AI R&D automation. Erdil & Barnett (2025) argue that automating AI R&D alone likely will not accelerate AI progress. Ho & Whitfill (2025) point out problems both with existing data sources and the models used to study the effects of AI R&D automation. Erdil et al. (2025) model the feedback loop that AI development could create between itself and the economy. Whitfill & Wu (2025) argue that compute and cognitive labour are complements in the context of frontier experiments.

Under both voluntary commitments and legislation, AI companies will likely report some details about AI R&D automation. Many safety frameworks address AI R&D automation. Anthropic’s Responsible Scaling Policy (Anthropic, 2026a) defines a threshold for AI R&D automation based on compressing “two years of 2018 – 2024 AI progress into a single year”. OpenAI’s Preparedness Framework (OpenAI, 2025b) tracks “AI Self-improvement” capabilities, with a “High” threshold for models whose impact is “equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline”.
Google DeepMind’s Frontier Safety Framework (Google DeepMind, 2025a) includes “Machine Learning R&D” as a Critical Capability Level domain, recommending high security measures for models that could significantly accelerate AI development. Furthermore, under California’s Transparency in Frontier Artificial Intelligence Act, each large frontier developer must describe how it approaches “[a]ssessing and managing catastrophic risk resulting from the internal use of its frontier models, including risks resulting from a frontier model circumventing oversight mechanisms.” Some recent work has begun to study how internal use of automated AI researchers could lead to such risks (Clymer et al., 2025; Benton et al., 2024; Stix et al., 2025; METR, 2025a; Korbak et al., 2025; Greenblatt, 2024).

Toner et al. (2026) offer a useful complement to this piece, distilling areas of expert consensus and disagreement from a workshop on AI R&D automation.

7 Conclusion

AI R&D automation could have significant effects on AI progress and human oversight. We have proposed a variety of metrics to help track and understand these effects. Companies, governments, and third parties (e.g. non-profit research organisations) all have roles in advancing these metrics. We especially recommend prioritizing metrics (or components thereof) that are relatively neglected (Table 1).

In addition to clarifying the degree of AIRDA and its implications, future work should also investigate how to respond to different AIRDA scenarios. Promising areas of investigation could include how to accelerate defensive AI capabilities, how to improve the ability of institutions to adapt to rapid progress, and whether and how to impose human oversight requirements.

Acknowledgments

The following people participated in a preliminary workshop and survey that informed this work: Ardi Janjeva, Annika Hallensleben, Dewey Murdick, Vasilios Mavroudis, Helen Toner, Eli Lifland, Carolyn Ashurst, David Owen, Ture Hinrichsen, Rosco Hunter, and other participants who prefer to remain anonymous. We would also like to thank the following people for useful feedback on previous drafts of this work: Tom Davidson, Sam Manning, Parker Whitfill, Cheryl Wu, Micah Carroll, Hjalmar Wijk, and Anson Ho. All errors remain our own.

References

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act), July 2024. URL http://data.europa.eu/eli/reg/2024/1689/oj.

AI Futures Project. AI futures model, 2026. URL https://www.aifuturesmodel.com/.

Alphabet. Alphabet Q4 2025 earnings call, 2026. URL https://abc.xyz/investor/events/event-details/2026/2025-Q4-Earnings-Call-2026-Dr%5FC033hS6/default.aspx.

Dario Amodei. The adolescence of technology, 2026. URL https://www.darioamodei.com/essay/the-adolescence-of-technology.

Anthropic. Responsible scaling policy. Technical Report Version 2.2, 2025a.

Anthropic. System card: Claude Opus 4.5. Technical report, November 2025b.

Anthropic. Responsible scaling policy. Technical Report Version 3, 2026a.

Anthropic. System card: Claude Opus 4.6. Technical report, February 2026b.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback, December 2022. URL http://arxiv.org/abs/2212.08073.

Matthew Barnett.
Algorithmic progress likely spurs more spending on compute, not less, 2025. URL https://epoch.ai/gradient-updates/algorithmic-progress-likely-spurs-more-spending-on-compute-not-less.

Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein. Measuring the impact of early-2025 AI on experienced open-source developer productivity, July 2025. URL http://arxiv.org/abs/2507.09089.

Joel Becker, Nate Rush, Tom Cunningham, David Rein, and Khalid Mahamud. We are changing our developer productivity experiment design, February 2026. URL https://metr.org/blog/2026-02-24-uplift-update/.

Rebecca Bellan. Sam Altman says OpenAI will have a ‘legitimate AI researcher’ by 2028, October 2025. URL https://techcrunch.com/2025/10/28/sam-altman-says-openai-will-have-a-legitimate-ai-researcher-by-2028/.

Y. Bengio, S. Clare, C. Prunkl, M. Murray, M. Andriushchenko, B. Bucknall, R. Bommasani, S. Casper, T. Davidson, R. Douglas, D. Duvenaud, P. Fox, U. Gohar, R. Hadshar, A. Ho, T. Hu, C. Jones, S. Kapoor, A. Kasirzadeh, S. Manning, N. Maslej, V. Mavroudis, C. McGlynn, R. Moulange, J. Newman, K. Y. Ng, P. Paskov, S. Rismani, G. Sastry, E. Seger, S. Singer, C. Stix, L. Velasco, N. Wheeler, D. Acemoglu, V. Conitzer, T. G. Dietterich, E. W. Felten, F. Heintz, G. Hinton, N. Jennings, S. Leavy, T. Ludermir, V. Marda, H. Margetts, J. McDermid, J. Munga, A. Narayanan, A. Nelson, C. Neppel, S. D. Ramchurn, S. Russell, M. Schaake, B. Schölkopf, A. Soto, L. Tiedrich, G. Varoquaux, A. Yao, Y.-Q. Zhang, L. A. Aguirre, O. Ajala, F. Albalawi, N. AlMalek, C. Busch, J. Collas, A. C. P. de L. F. de Carvalho, A. Gill, A. H. Hatip, J. Heikkilä, C. Johnson, G. Jolly, Z. Katzir, M. N. Kerema, H. Kitano, A. Krüger, K. M. Lee, J. R. López Portillo, A. McLysaght, O. Molchanovskyi, A. Monti, M. Nemer, N. Oliver, R. Pezoa, A. Plonk, B. Ravindran, H. Riza, C. Rugege, H. Sheikh, D. Wong, Y. Zeng, L. Zhu, D. Privitera, and S. Mindermann. International AI safety report 2026.
Technical Report DSIT 2026/001, Department for Science, Innovation and Technology, 2026. URL https://internationalaisafetyreport.org.

Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, and David Duvenaud. Sabotage evaluations for frontier models, October 2024. URL http://arxiv.org/abs/2410.21514.

Jamie Bernardi, Gabriel Mukobi, Hilary Greaves, Lennart Heim, and Markus Anderljung. Societal adaptation to advanced AI, January 2025. URL http://arxiv.org/abs/2405.10295.

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. Nature, 649(8097):584–589, January 2026. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-025-09937-5. URL http://arxiv.org/abs/2502.17424.

Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, Patricia Paskov, Conrad Stosz, Girish Sastry, A. Feder Cooper, George Balston, Steven Adler, Stephen Casper, Markus Anderljung, Grace Werner, Soren Mindermann, Vasilios Mavroudis, Ben Bucknall, Charlotte Stix, Jonas Freund, Lorenzo Pacchiardi, Jose Hernandez-Orallo, Matteo Pistillo, Michael Chen, Chris Painter, Dean W. Ball, Cullen O’Keefe, Gabriel Weil, Ben Harack, Graeme Finley, Ryan Hassan, Scott Emmons, Charles Foster, Anka Reuel, Bri Treece, Yoshua Bengio, Daniel Reti, Rishi Bommasani, Cristian Trout, Ali Shahin Shamsabadi, Rajiv Dattani, Adrian Weller, Robert Trager, Jaime Sevilla, Lauren Wagner, Lisa Soder, Ketan Ramakrishnan, Henry Papadatos, Malcolm Murray, and Ryan Tovcimak. Frontier AI auditing: Toward rigorous third-party assessment of safety and security practices at leading AI companies, February 2026. URL http://arxiv.org/abs/2601.11699.

Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond. Generative AI at work, April 2023.
URL https://www.nber.org/papers/w31161.

Vitalik Buterin. My techno-optimism. Technical report, 2023.

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating machine learning agents on machine learning engineering, February 2025. URL http://arxiv.org/abs/2410.07095.

Dami Choi, Vincent Huang, Kevin Meng, Daniel D Johnson, Jacob Steinhardt, and Sarah Schwettmann. Scaling automatic neuron description, October 2024. URL https://transluce.org/neuron-descriptions.

Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, and Xianyuan Zhan. Bare minimum mitigations for autonomous AI development, April 2025. URL http://arxiv.org/abs/2504.15416.

Domenico Cotroneo, Cristina Improta, and Pietro Liguori. Human-written vs. AI-generated code: A large-scale study of defects, vulnerabilities, and complexity, August 2025. URL http://arxiv.org/abs/2508.21634.

Cursor. Agent trace, 2026. URL https://agent-trace.dev/.

Tom Davidson, Jean-Stanislas Denain, Pablo Villalobos, and Guillem Bas. AI capabilities can be significantly improved without expensive retraining, December 2023. URL http://arxiv.org/abs/2312.07413.

Tom Davidson, Lukas Finnveden, and Rose Hadshar. AI-enabled coups: How a small group could use AI to seize power. 2025. URL https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power.

Fabrizio Dell’Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality, September 2023. URL https://papers.ssrn.com/abstract=4573321.

Epoch AI.
Introducing Epoch AI’s AI benchmarking hub, November 2024. URL https://epoch.ai/blog/introducing-benchmarks-dashboard.

Ege Erdil and Matthew Barnett. Most AI value will come from broad automation, not from R&D, 2025. URL https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d.

Ege Erdil, Andrei Potlogea, Tamay Besiroglu, Edu Roldan, Anson Ho, Jaime Sevilla, Matthew Barnett, Matej Vrzla, and Robert Sandler. GATE: An integrated assessment model for AI automation, 2025. URL https://arxiv.org/abs/2503.04941.

Daniel Eth and Tom Davidson. Will AI R&D automation cause a software intelligence explosion? 2025. URL https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion.

Kai Fronsdal, Isha Gupta, Abhay Sheshadri, Jonathan Michala, Stephen McAleer, Rowan Wang, Sara Price, and Sam Bowman. Petri: Parallel exploration of risky interactions, 2025. URL https://github.com/safety-research/petri.

Kai Fronsdal, Jonathan Michala, and Sam Bowman. Petri 2.0: New scenarios, new model comparisons, and improved eval-awareness mitigations, 2026. URL https://alignment.anthropic.com/2026/petri-v2/.

Joshua S. Gans and Avi Goldfarb. O-Ring automation, January 2026. URL https://www.nber.org/papers/w34639.

Google DeepMind. Frontier safety framework. Technical Report Version 3.0, 2025a.

Google DeepMind. Gemini 3 Pro model card. Technical report, November 2025b. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf.

Ryan Greenblatt. Prioritizing threats for AI control, June 2024. URL https://blog.redwoodresearch.org/p/prioritizing-threats-for-ai-control.

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger.
Alignment faking in large language models, December 2024. URL http://arxiv.org/abs/2412.14093.

Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld, Anmol Sandhu, and Neil Thompson. On the origin of algorithmic progress in AI, November 2025. URL http://arxiv.org/abs/2511.21622.

Danny Hernandez and Tom B. Brown. Measuring the algorithmic efficiency of neural networks, May 2020. URL http://arxiv.org/abs/2005.04305.

Anson Ho. The least understood driver of AI progress, 2026. URL https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress.

Anson Ho and Arden Berg. Quantifying the algorithmic improvement from reasoning models, 2025. URL https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models.

Anson Ho and Parker Whitfill. The software intelligence explosion debate needs experiments, 2025. URL https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments.

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models, 2024.

Anson Ho, Jean-Stanislas Denain, David Atanasov, Samuel Albanie, and Rohin Shah. A Rosetta Stone for AI benchmarks, 2025. URL https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks.

Rogan Inglis, Ollie Matthews, Tyler Tracy, Oliver Makins, Tom Catling, Asa Cooper Stickland, Rasmus Faber-Espensen, Daniel O’Connell, Myles Heller, Miguel Brandao, Adam Hanson, Arathi Mani, Tomek Korbak, Jan Michelfeit, Dishank Bansal, Tomas Bark, Chris Canal, Charlie Griffin, Jasmine Wang, and Alan Cooney. ControlArena, 2025. URL https://github.com/UKGovernmentBEIS/control-arena.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

Kamilė Lukošiūtė.
Design for the defenders you care about or risk being useless, 2025. URL https://kamilelukosiute.com/llms/Design+for+the+defenders+you+care+about+or+risk+being+useless.

Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip E. Tetlock. ForecastBench: A dynamic benchmark of AI forecasting capabilities, February 2025. URL http://arxiv.org/abs/2409.19839.

Eddie Kembery and Nora Ammann. AI resilience, 2025. URL https://airesilience.net/.

Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving. A sketch of an AI control safety case, January 2025. URL http://arxiv.org/abs/2501.17315.

Anton Korinek and Donghyun Suh. Scenarios for the transition to AGI, March 2024. URL https://www.nber.org/papers/w32255.

William MacAskill and Fin Moorhouse. Preparing for the intelligence explosion. 2025. URL https://www.forethought.org/research/preparing-for-the-intelligence-explosion.

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignment from reward hacking in production RL, November 2025. URL http://arxiv.org/abs/2511.18397.

Sam J. Manning and Tomás Aguirre. How adaptable are American workers to AI-induced job displacement?, January 2026. URL https://www.nber.org/papers/w34705.

Marin. Marin 32B retrospective, 2025. URL https://marin.readthedocs.io/en/latest/reports/marin-32b-retro/.

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R.
Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, and Evan Hubinger. Auditing language models for hidden objectives, March 2025. URL http://arxiv.org/abs/2503.10965.

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming, January 2025. URL http://arxiv.org/abs/2412.04984.

METR. AI models can be dangerous before public deployment, January 2025a. URL https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/.

METR. Measuring AI ability to complete long tasks, March 2025b. URL https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/.

Arvind Narayanan and Sayash Kapoor. AI as normal technology. Knight First Amendment Institute, 25, 2025.

Michael Nielsen. Notes on differential technological development, 2024. URL https://michaelnotebook.com/dtd/.

Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192, July 2023. doi: 10.1126/science.adh2586. URL https://www.science.org/doi/10.1126/science.adh2586.

OpenAI. GPT-5 system card. Technical report, August 2025a. URL https://cdn.openai.com/gpt-5-system-card.pdf.

OpenAI. Preparedness framework. Technical Report Version 2, April 2025b.

OpenAI. GPT-5.3-Codex system card. Technical report, February 2026.

David Owen. Interviewing AI researchers on automation of AI R&D, 2024. URL https://epoch.ai/blog/interviewing-ai-researchers-on-automation-of-ai-rnd.

Billy Perrigo. Demis Hassabis is preparing for AI’s endgame. Time, April 2025. URL https://time.com/7277608/demis-hassabis-interview-time100-2025/.

Scott Douglas Sagan and Kenneth Neal Waltz. The spread of nuclear weapons: a debate renewed with new sections on India and Pakistan, terrorism, and missile defense. W.W. Norton, 2nd edition, 2003. URL https://cir.nii.ac.jp/crid/1970586434801795733.

Jonas Sandbrink, Hamish Hobbs, Jacob Swett, Allan Dafoe, and Anders Sandberg. Differential technology development: An innovation governance consideration for navigating technology risks, September 2022. URL https://papers.ssrn.com/abstract=4213670.

Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, and Marius Hobbhahn. Stress testing deliberative alignment for anti-scheming training, September 2025. URL http://arxiv.org/abs/2509.15541.

Philipp Schoenegger, Peter S. Park, Ezra Karger, Sean Trott, and Philip E. Tetlock. AI-augmented predictions: LLM assistants improve human forecasting accuracy, August 2024. URL http://arxiv.org/abs/2402.07862.

Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2022. doi: 10.1109/ijcnn55064.2022.9891914.

Stack Overflow. 2025 developer survey, 2025. URL https://survey.stackoverflow.co/2025/ai.

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research, April 2025. URL http://arxiv.org/abs/2504.01848.

Charlotte Stix, Matteo Pistillo, Girish Sastry, Marius Hobbhahn, Alejandro Ortega, Mikita Balesni, Annika Hallensleben, Nix Goldowsky-Dill, and Lee Sharkey.
AI behind closed doors: a primer on the governance of internal deployment, April 2025. URL http://arxiv.org/abs/2504.12170.

Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, and Deep Ganguli. Clio: Privacy-preserving insights into real-world AI use, December 2024. URL http://arxiv.org/abs/2412.13678.

Helen Toner, Kendrea Beers, Steve Newman, Saif M. Khan, Colin Shea-Blymyer, Evelyn Yee, Ashwin Acharya, Kathleen Fisher, Keller Scholl, Peter Wildeford, Ryan Greenblatt, Samuel Albanie, Stephanie Ballard, and Thomas Larsen. When AI builds AI: Findings from a workshop on automation of AI R&D. Technical report, Center for Security and Emerging Technology, January 2026.

Philip Trammell. What (and how far off) is self-replicating capital?, January 2026. URL https://philiptrammell.substack.com/p/what-and-how-far-off-is-self-replicating.

UK AI Security Institute. Pre-deployment evaluation of OpenAI’s o1 model | AISI work, 2024. URL https://www.aisi.gov.uk/blog/pre-deployment-evaluation-of-openais-o1-model.

Lizka Vaintrob and Owen Cotton-Barratt. AI tools for existential security. 2025. URL https://www.forethought.org/research/ai-tools-for-existential-security.

Kenneth N. Waltz. The spread of nuclear weapons: More may be better. The Adelphi Papers, 21(171):1–1, September 1981. ISSN 0567-932X. doi: 10.1080/05679328108457394. URL https://doi.org/10.1080/05679328108457394.

Francis Rhys Ward, Teun van der Weij, Hanna Gábor, Sam Martin, Raja Mehta Moreno, Harel Lidar, Louis Makower, Thomas Jodrell, and Lauren Robson. CTRL-ALT-DECEIT: Sabotage evaluations for automated AI R&D, November 2025. URL http://arxiv.org/abs/2511.09904.

Jiaxin Wen, Chenglei Si, Yueh-han Chen, He He, and Shi Feng.
Predicting empirical AI research outcomes with language models, June 2025. URL http://arxiv.org/abs/2506.00794.

Parker Whitfill. Note on selection bias in observational estimates of algorithmic progress, August 2025. URL http://arxiv.org/abs/2508.11033.

Parker Whitfill and Cheryl Wu. Will compute bottlenecks prevent an intelligence explosion?, August 2025. URL http://arxiv.org/abs/2507.23181.

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Holden Karnofsky, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts, May 2025. URL http://arxiv.org/abs/2411.15114.

xAI. xAI risk management framework. Technical report, 2025.

Josh You. Most of OpenAI’s 2024 compute went to experiments, 2025. URL https://epoch.ai/data-insights/openai-compute-spend.

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, and Daniel Kang. Establishing best practices for building rigorous agentic benchmarks, July 2025. URL http://arxiv.org/abs/2507.02825.