FMF — Components of Safety Frameworks

Issue Brief: Components of Frontier AI Safety Frameworks

Safety frameworks have recently emerged as an important tool for frontier AI safety. By specifying capability and/or risk thresholds, safety evaluations, and mitigation strategies for frontier AI models in advance of their development, safety frameworks position frontier AI developers to address potential safety challenges in a principled and coherent way. Both government and industry recognized the importance of safety frameworks through the Frontier AI Safety Commitments announced at the AI Seoul Summit in May 2024.

Yet safety frameworks remain a relatively nascent concept. Only a handful of firms have published safety frameworks to date, and until recently few research organizations had published on the topic. While there is an emerging consensus about the function and core components of safety frameworks, there is still a clear need for further research, related norms, and established guidance to enable implementation.

This brief proposes a set of core components for inclusion in safety frameworks. Drawn from the Frontier AI Safety Commitments as well as published member firm frameworks and expert input, this piece reflects a preliminary consensus among member firms about how to structure safety frameworks. However, we note other approaches may emerge in the future that also meet the Frontier AI Safety Commitments. We hope this brief will serve as a useful resource for broader discussion about how to develop frontier AI safety frameworks. Future briefs will explore key elements of safety frameworks in greater depth.

Components of Safety Frameworks

Frontier AI safety frameworks are designed to enable developers to take a robust, principled, and coherent approach to anticipating and addressing the potential safety challenges posed by frontier AI.

At a high level, safety frameworks include the following components:

Risk Identification

Frontier AI safety frameworks are intended to manage potential severe threats to public safety and security. To effectively manage these threats, safety frameworks should identify and analyze model risks stemming from advanced capabilities in chemical, biological, radiological, and nuclear (CBRN) weapons development and cyber attacks. As the technology evolves and more research is conducted on frontier AI risks, additional risk domains may emerge, such as those associated with systems with increasingly autonomous capabilities. However, there is not yet broad consensus about what these risk domains might or should be.

Firms should also explain how they plan to ensure an accurate picture of model risks and concerning capabilities, such as by carrying out threat modeling exercises.

Capability and Risk Thresholds

Capability thresholds have largely been used in safety frameworks as a proxy for risk to date. These are specific capabilities at which, absent mitigation measures, models may pose unacceptable or intolerable levels of risk to society. While using capability thresholds is current best practice, this is an area of active research and risk thresholds may also be defined using likelihood or severity estimates of specific risks or other relevant risk factors, as appropriate. Future issue briefs will explore this and related issues, such as differences between capability, compute, and quantitative risk thresholds, as well as their associated benefits and shortcomings.

Safety frameworks should define clear capability or risk thresholds at which the risks posed by a firm's models would merit heightened safeguards, and at which such risks (if unmitigated) would be deemed unacceptable. The framework should provide a rationale for why these thresholds were chosen and, as appropriate, disclose how the firm received input from external experts or other stakeholders, such as home governments, when defining its thresholds.
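As a purely illustrative sketch of the two-level structure described above, a threshold can be represented as a pair of cut-offs on an evaluation score: one at which heightened safeguards apply and one at which unmitigated risk would be deemed unacceptable. The `Threshold` class, the 0-to-1 score scale, and the numeric values are hypothetical and are not drawn from any published framework; in practice, thresholds are often stated as qualitative capability descriptions rather than numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical two-level threshold for a single risk domain."""
    domain: str
    heightened_at: float    # score at which heightened safeguards apply
    unacceptable_at: float  # score at which unmitigated risk is deemed unacceptable

    def classify(self, score: float) -> str:
        """Map an evaluation score to a tier label."""
        if score >= self.unacceptable_at:
            return "unacceptable_if_unmitigated"
        if score >= self.heightened_at:
            return "heightened_safeguards"
        return "baseline"


# Illustrative values only (assumed, not sourced from any framework).
cyber = Threshold(domain="cyber", heightened_at=0.5, unacceptable_at=0.8)
```

Keeping both cut-offs in one record makes the rationale for each easy to document alongside the values themselves.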

Capability and Risk Assessments

Safety frameworks should outline the process the firm will follow to assess risks related to capability and risk thresholds. More specifically, frameworks should describe how the firm plans to assess when a given threshold has been met or surpassed. This should include a description of the planned approach to frontier AI safety evaluations.

Safety frameworks should also outline the procedural components for risk assessments, including when to conduct risk assessments (e.g. before deployment, and before and during training) and the frequency of such assessments (e.g. assessing risks at every 2x, 4x, or 6x increase in capabilities on relevant dimensions). They may also include a rationale for this approach.
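The multiplicative cadence mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the relevant dimension can be summarized by a single positive number; the function names, the default factor of 4, and the choice of scalar are hypothetical rather than taken from any framework.

```python
from typing import List


def assessment_due(last_assessed: float, current: float, factor: float = 4.0) -> bool:
    """Return True when `current` has grown by at least `factor` since the last assessment."""
    if last_assessed <= 0 or factor <= 1:
        raise ValueError("last_assessed must be positive and factor must exceed 1")
    return current / last_assessed >= factor


def assessment_checkpoints(start: float, end: float, factor: float = 4.0) -> List[float]:
    """List the levels between `start` and `end` at which an assessment would be triggered."""
    if start <= 0 or factor <= 1:
        raise ValueError("start must be positive and factor must exceed 1")
    checkpoints = []
    level = start * factor
    while level <= end:
        checkpoints.append(level)
        level *= factor
    return checkpoints
```

For example, with a starting level of 1.0, an end level of 100.0, and a factor of 4, `assessment_checkpoints` yields assessments at 4.0, 16.0, and 64.0. A smaller factor means more frequent assessments at a higher evaluation cost.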

Firms should also consider committing to sharing information about their risk assessments and evaluations publicly as appropriate, while being mindful of safety and security risks (as well as other potential issues, such as information hazards or evaluation contamination). For example, firms might consider sharing a report confirming that a safeguard has caused a model to meet the predefined standard for safety.

Risk Mitigation

Safety frameworks should, to the extent possible while maintaining the effectiveness and integrity of such measures, describe the mitigation measures the firm plans to apply at each threshold to reduce the risk posed to an acceptable level. These may include, for example, security or containment measures to ensure a model with dangerous capabilities is not stolen, as well as deployment mitigations or safety measures to manage the risk of misuse once the model is deployed.

If a model or system's capabilities are above a pre-stated capability threshold and the firm has not developed effective safeguards for capabilities above that threshold, or if the risks of a model or system are above a pre-stated risk threshold and the firm has not developed effective safeguards for keeping risks below that threshold, the firm commits to taking appropriate action, such as restricting further development and public deployment, until effective safeguards have been developed. This commitment is present in all existing safety frameworks and is a key requirement of the Frontier AI Safety Commitments. Safety frameworks may also outline the criteria for determining when models no longer create unacceptable risks and are safe to continue training and/or deploying.
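The conditional commitment described above reduces to a simple rule: action is required only when a pre-stated threshold is exceeded and effective safeguards are not yet in place. The sketch below is a hypothetical illustration of that rule; the `Action` enum and `required_action` function are invented for this example, and real frameworks involve far more judgment than two boolean inputs can capture.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of the threshold check."""
    PROCEED = "proceed"
    RESTRICT = "restrict further development and public deployment"


def required_action(threshold_exceeded: bool, safeguards_effective: bool) -> Action:
    """Restrict only when a pre-stated threshold is exceeded without effective safeguards."""
    if threshold_exceeded and not safeguards_effective:
        return Action.RESTRICT
    return Action.PROCEED
```

Under this rule, a model above a threshold may still proceed once effective safeguards have been developed, which mirrors the "until effective safeguards have been developed" condition in the text.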

Finally, safety frameworks should outline the process a firm intends to follow to continually assess and monitor the adequacy of mitigation measures.

Risk Governance

To the extent feasible and not covered by other governance materials, safety frameworks may list firm commitments in place to ensure internal compliance with the framework. Similarly, safety frameworks should also specify the internal and external oversight mechanisms in place to ensure proper implementation of the framework, as appropriate.

Firms should also specify a process and approach for updating the framework over time as appropriate, including any predefined trigger points for updates. This may include detailing, if appropriate, any internal processes and special approvals required to make changes. Firms should also consider noting when material updates to or revisions of a safety framework will be made public.

Finally, the framework should indicate how external actors were involved in the process of developing the framework, if applicable. Future briefs will elaborate on specific governance and transparency mechanisms in more depth. The briefs may describe internal processes for ensuring adherence to safety frameworks, as well as external processes by which third parties may assess adherence. They may also elaborate on the development and publication, as appropriate, of model cards, safety evaluations, and other safety-specific documentation.

Table 1: Components of Safety Frameworks

| Component | Description | Sub-Components |
|-----------|-------------|-----------------|
| Risk Identification | Identify potential safety and security risks stemming from future frontier models, especially those based on advanced capabilities. | • Consult with domain experts • Analyze risks to determine specific threat models • Define critical capability and risk thresholds |
| Capability and Risk Thresholds | Define and explain the rationale behind critical capability and risk thresholds. | • Define critical capability and risk thresholds |
| Capability and Risk Assessment | Outline the process for conducting pre-mitigation capability and risk assessments, including the planned approach to frontier AI safety evaluations. | • Conduct pre-mitigation capability assessments and safety evaluations • Explain the process of capability and risk assessments • Share information about the safety evaluations as appropriate, while being mindful of safety and security risks |
| Risk Mitigation | Explain the risk mitigation measures in place to ensure critical risk thresholds are not passed, and the measures for keeping risk within tolerable thresholds. | • Explain risk mitigation measures • State a commitment to pause development or deployment to the public if post-mitigation risks remain too high • Routinely assess and monitor the adequacy of mitigation measures |
| Risk Governance | Indicate the internal accountability and governance frameworks for implementing the safety framework, including the process for updating the safety framework, and outline the process for transparency in implementation and accountability for the safety framework. | • Outline measures for ensuring internal compliance • Explain measures for maintaining oversight • Delineate the framework's update process, where appropriate • State commitments to transparency |

Conclusion

Based on existing safety frameworks and the recent Frontier AI Safety Commitments, the preliminary list of core components above offers a strong foundation for the development of frontier AI safety frameworks. Though non-exhaustive and preliminary, this guidance serves as a useful starting point for firms seeking to create their own safety frameworks and as a resource for broader discussion about frontier AI safety frameworks. Future briefs in this workstream will explore elements of safety frameworks in greater depth, from the tradeoffs associated with different thresholds to the timing and frequency of capability assessments. The ultimate aim of this and future briefs is to inform public debate about how best to design and implement safety frameworks.