Six possible future AI capabilities that could deserve advance preparation and pre-commitments, in order to avoid catastrophic risks.
There is significant interest among both industry leaders and governments in if-then commitments for artificial intelligence (AI): commitments of the form, If an AI model has capability X, risk mitigations Y must be in place. And if needed, we’ll delay AI deployment and/or development to ensure this. A specific example: if an AI model has the ability to walk a novice through constructing a weapon of mass destruction, then we must ensure that there are no easy ways for consumers to elicit behavior in this category from the AI model.
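As a concrete illustration of this structure (not a representation of any company's actual framework), the minimal sketch below shows how such a commitment could be encoded and checked; the capability description, mitigation names, and decision logic are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class IfThenCommitment:
    """One if-then commitment: if evaluations indicate the tripwire capability
    may be present, the listed mitigations must be in place before further
    deployment or development."""
    tripwire_capability: str          # "capability X"
    required_mitigations: list[str]   # "risk mitigations Y"

    def required_action(self, capability_indicated: bool,
                        mitigations_in_place: set[str]) -> str:
        if not capability_indicated:
            return "proceed"  # tripwire not crossed
        missing = [m for m in self.required_mitigations if m not in mitigations_in_place]
        if not missing:
            return "proceed with mitigations in place"
        # Tripwire crossed without sufficient mitigations: delay deployment/development.
        return "delay until in place: " + ", ".join(missing)

# Hypothetical example mirroring the commitment described above.
commitment = IfThenCommitment(
    tripwire_capability="can walk a novice through constructing a weapon of mass destruction",
    required_mitigations=["no easy ways for users to elicit this behavior", "secure model weights"],
)
print(commitment.required_action(capability_indicated=True, mitigations_in_place=set()))
```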
As of December 2024, three industry leaders—Google DeepMind, OpenAI, and Anthropic—have published relatively detailed frameworks along these lines. Sixteen companies have announced their intention to establish frameworks in a similar spirit by the time of the upcoming AI Action Summit in France. Similar ideas have been explored at the International Dialogues on AI Safety (see Beijing statement) and at the UK AI Safety Summit.
In an earlier piece, I walked through how if-then commitments could work, and what their key components are. One key component is tripwire capabilities (or tripwires): AI capabilities that could pose serious catastrophic risks, and hence would trigger the need for strong, potentially costly risk mitigations. (Tripwires correspond to the “capability X” mentioned above.) To date, most attempts to identify such AI capabilities have come from policies and frameworks put out by AI companies,1 with little explanation of how they were arrived at. Eventually, tripwires will hopefully be grounded in extensive public analyses of what threats from AI are credible, what mitigations could reduce the risks, and how to weigh the costs and benefits.
This piece aims to contribute to progress from the former to the latter by sketching out a potential set of (a) methods and criteria for choosing tripwires and (b) preliminary tripwires aiming to meet these criteria. It focuses specifically on the question of where the tripwires should be, and does not address a number of other challenges for if-then commitments (enforcement, transparency, and accountability, to name a few).
It also introduces the idea of pairing tripwires with limit evals: the hardest evaluations of relevant AI capabilities that could be run and used for key decisions, in principle. Today, most AI evaluations focus on tasks much easier than what would be necessary to pose a catastrophic risk; such evaluations can provide reassurance for now, but may not be sufficient as AI capabilities improve. A limit eval might be a task like “the AI model walks an amateur all the way through a (safe) task as difficult as producing a chemical or biological weapon of mass destruction”—difficult and costly to run, but tightly coupled to the tripwire capability in question. Limit evals may be helpful for (a) providing backstop tests if AI capabilities advance rapidly; and (b) providing a clear goal for cheaper, more practical evals to be designed around (an AI model failing cheaper evals should be strong evidence that it would fail limit evals, too).
The sketch provided here is just that—a sketch. It does not go in depth on analyzing any particular AI risk or tripwire. With AI capabilities advancing rapidly, key actors are taking a dynamic, iterative approach to tripwires:2 making educated guesses at where and how to draw them, designing policies and evaluations around their guesses, and refining each piece of the picture over time. Since AI companies are not waiting for in-depth cost-benefit analysis or consensus before scaling up their systems, they also should not be waiting for such analysis or consensus to map out and commit to risk mitigations.
This piece provides more analysis of candidate tripwires than has been available in previous proposals regarding tripwires—but also intentionally stops short of offering firm conclusions. Further analysis may undermine the case for using any of these tripwires or reveal others that should be used instead. The goal here is not to end discussion about where the tripwires should be, but rather to provoke it.
This piece will:
Discuss the context of this moment in the development of tripwires and if-then commitments: what has been done to date, and what steps remain to arrive at a robust framework for reducing risks from AI.
Lay out candidate criteria for good tripwires:
Lay out potential tripwires for AI. These are summarized at the end in a table. Very briefly, the tripwires I lay out are as follows, categorized using four domains of risk-relevant AI capabilities that cover nearly all of the previous proposals for tripwire capabilities.4
Interest in both the benefits and risks of AI surged near the end of 2022, following the launch of ChatGPT. The year 2023 saw a number of new initiatives dedicated to creating and/or requiring evaluations of dangerous capabilities for AI models,5 and late 2023 saw the first major discussion of what this piece refers to as tripwire capabilities—pre-defined thresholds for AI capabilities and/or risks, accompanied by commitments to implement specific upgrades in risk mitigations by the time these tripwires are crossed.6 The case for these if-then commitments is outlined in a previous piece; in brief, with AI capabilities advancing rapidly, they provide a way to plan ahead and prioritize important risk mitigations, without slowing development of new technology unnecessarily.
To date, most specific proposals for tripwires have come from voluntary corporate policies and frameworks released between late 2023 and mid-2024, most of them explicitly marked as early, exploratory, or preliminary.7 Crucially, tripwire proposals have, in all these cases, been presented without accompanying explanations of the methodology by which they were arrived at. To be clear, this is not a criticism of the companies in question. The policies and frameworks that have been released are ambitious documents, calling for their signatories to execute a significant amount of work on a number of fronts—not just defining tripwires, but also (a) building practical, runnable AI evaluations to test for tripwires; (b) defining risk mitigations that would be needed if tripwires were to be crossed; and (c) defining processes (requiring participation from stakeholders in varied parts of the company) for ensuring that tests are run frequently enough, results are interpreted reasonably, needed actions are taken in response, and so on.
If companies were to wait until each of these things had been thoroughly researched before adopting or publishing their policies and frameworks, they could be waiting for years—during which time AI capabilities might advance quickly and the prevention of the risks in question could become more difficult, if not impossible. In other words, holding out for too high a standard of thoroughness could somewhat defeat the purpose of these policies and frameworks. Companies have sought to show their seriousness about risk prevention by being quick to sketch their frameworks, even with much work left to do—building the airplane while flying it, in a sense.
This piece is intended as a step toward a more thorough discussion of tripwires, but only a step. It proposes a number of specific tripwires and outlines the basic reasoning, but does not present an extensive evidence base for each key claim, and leaves significant possible objections to its proposals unaddressed. Why take such an approach? The hope is that:
This piece aims to provide a set of candidate tripwires with strong potential to be useful for anticipating and ultimately reducing catastrophic risks from AI. Specifically, these tripwires are for use in if-then commitments of the form: If an AI model has capability X, then risk mitigations Y must be in place. And if needed, we’ll delay AI deployment and/or development to ensure this.
Each candidate tripwire is a description of a capability that a future AI model might have and aims to meet the following desiderata:
That is, an AI model with the tripwire capability would (by default, if widely deployed without the sorts of risk mitigations discussed below) pose a risk of some kind to society at large, beyond the risks that society faces by default.
If a risk can be eliminated (or cut to low levels) with relatively quick, cheap measures, then there isn’t a clear need for incorporating the risk into an if-then commitment (instead, risk mitigations can be implemented as soon as the risk seems even somewhat plausible). If-then commitments are generally relatively ambitious and complex to execute; they are designed for the challenge of ensuring that risk mitigations are put in place even when doing so would be very costly—or, more importantly, take a lot of advance preparation (and even innovation), as discussed in a previous piece.
Examples of challenging risk mitigations that are a good fit for if-then commitments include:
In principle, this criterion could be cashed out as follows: the risk mitigations in question should reduce the expected damages caused by the AI model(s) in question by more than the costs of the risk mitigations themselves—including the costs of delaying or restricting the beneficial applications of AI.11 Since the costs of delaying or restricting beneficial applications could be significant,12 this is a high bar.
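To make the structure of this comparison concrete (not as a claim about actual probabilities or costs; every number below is invented), a minimal expected-value sketch:

```python
# All figures are invented placeholders; the point is only the structure of the
# comparison, not the numbers.
p_catastrophe_without_mitigations = 0.02       # chance over the deployment period
p_catastrophe_with_mitigations = 0.002
damages_if_catastrophe = 1_000_000_000_000     # USD; order of magnitude of the pandemic costs cited below

expected_damage_reduction = (
    p_catastrophe_without_mitigations - p_catastrophe_with_mitigations
) * damages_if_catastrophe

mitigation_costs = 5_000_000_000               # direct costs plus delayed or restricted beneficial uses (USD)

print(f"Expected damage reduction: ${expected_damage_reduction:,.0f}")
print("Mitigations clear the bar" if expected_damage_reduction > mitigation_costs
      else "Mitigations do not clear the bar")
```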
Some of the tripwire capabilities discussed below could lead to very damaging events—of the kind that have previously been associated with tens of billions,13 or even trillions,14 of dollars in damages. Others could lead to events with harder-to-quantify, but plausibly commensurate, costs to society.
This desideratum significantly narrows the field of candidate tripwires, especially since damage potential has to be high despite countermeasures that might be implemented after observing AI models with the tripwire capabilities. For example, if an AI model has capabilities that are highly useful for perpetrating fraud at scale, early incidents might cause banks and other institutions to increase their investment in fraud detection (including fraud detection using the same sort of advanced AI that is useful for fraud), such that the potential for fraud is greatly reduced before overly significant damage can be done.15
It’s inherently challenging to determine whether there’s a substantial likelihood of events with such high damages, in a future world with technological capabilities that don’t exist today. A small number of people are currently exploring approaches to this for potential AI risks and their work is sometimes referred to as AI threat modeling. In many cases, they are aiming to ground speculative risks in historical and established events to the extent possible—for example, analyzing historical catastrophic events, and how the risk of similar events might be quantitatively affected if the number of actors capable of causing similar events increased (for example, due to having access to advanced AI “advisers”). Most of the tripwire capabilities listed in this piece have involved some initial exploratory threat modeling, though in no cases has threat modeling yet reached the point of an in-depth public report. In any case, threat modeling will never be as rigorous or conclusive as would be ideal, and judgment calls about likelihood and risk tolerance (by AI companies, policymakers, and others) will inevitably play a large role in what if-then commitments are made.
In the policies and frameworks put out by AI companies to date, tripwires are stated at a very high level, leaving a lot of room for interpretation in how one might test for them.
The evaluations outlined in these policies provide relatively low-difficulty tests of AI capabilities,16 such as whether an AI model can answer questions about chemical and biological weapons—a capability that (even if an AI model possessed it) would still fall far short of the ability to reliably walk an amateur through developing a chemical or biological weapon.
For the level of capabilities AI models have today, relatively low-difficulty evaluations and relatively vague threat models serve the purpose: AI models that perform poorly on easy evaluations can safely be judged far from the associated tripwires under most plausible interpretations. However, if and when AI capabilities improve, easy evaluations won’t be able to provide either reassurance or clear signs of danger, and vague tripwires will leave a lot of room for interpretation in how to design harder, more definitive evaluations.
In an attempt to prepare for this situation, this piece accompanies proposed tripwires with outlines of limit evals: the hardest evaluations of relevant AI capabilities that could be run and used in principle within a year or so. (Examples are given throughout the piece. One would be: “the AI model walks an amateur all the way through a (safe) task as difficult as producing a chemical or biological weapon of mass destruction.”) If an AI model performed well on limit evals, it might still lack tripwire capabilities (there is inherently a gap between “an AI model can pass tests in a controlled environment” and “an AI model can materially increase real-world risks as it operates in the wild”), but there would no longer be any practical way to assess whether this was the case. Hence, at that point one should arguably assume a strong possibility of the tripwire capability in question, and act (such as by implementing costly risk mitigations) accordingly.
Articulating limit evals hopefully helps to clarify the specific level of AI capability being envisioned, leaving less ambiguity of the kind that currently exists with language like “model provides meaningfully improved assistance” and “increase their ability to cause severe harm compared to other means.” Furthermore, it can help guide design of more practical evals. Once a limit eval has been articulated, a team can design any eval that they can argue is a prerequisite to performing well on the limit eval, and if an AI model performs poorly on this eval, this is evidence that it does not have the tripwire capability in question.
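As a rough sketch of this screening logic (the eval names, scores, and thresholds below are hypothetical and not drawn from any existing evaluation suite):

```python
from __future__ import annotations

def assess_tripwire(prereq_scores: dict[str, float],
                    pass_thresholds: dict[str, float],
                    run_limit_eval) -> str:
    """Screening logic: cheaper prerequisite evals gate the expensive limit eval."""
    for name, score in prereq_scores.items():
        if score < pass_thresholds[name]:
            # Failing a prerequisite is treated as strong evidence the model
            # would also fail the limit eval, so the costly test can be skipped.
            return f"below tripwire (failed prerequisite: {name})"
    # All prerequisites passed: cheap evals can no longer rule out the tripwire
    # capability, so the limit eval itself needs to be run.
    return run_limit_eval()

verdict = assess_tripwire(
    prereq_scores={"multiple_choice_weapons_qa": 0.42, "protocol_troubleshooting": 0.30},
    pass_thresholds={"multiple_choice_weapons_qa": 0.80, "protocol_troubleshooting": 0.70},
    run_limit_eval=lambda: "run the full limit eval (e.g., an end-to-end uplift study)",
)
print(verdict)
```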
Predicting what capabilities future AI models will demonstrate, and when, is a fraught exercise, and this piece can’t do so with precision. But it does use a few high-level principles to keep the list of tripwires relatively short and focused on capabilities that may emerge sooner.
First, it mostly sticks to considering potential AI capabilities comparable to capabilities that at least some humans have. The intent is to avoid entirely speculative scenarios envisioning AI models that can affect the world in arbitrary ways, and instead ask the question: If an AI model had similar cognitive capabilities to a human expert of type X, and this system could be copied, run at scale, and deployed to many users, what risks might that create? There are some exceptions—cases in which a tripwire refers to a capability far beyond what human experts can achieve—but in these cases, the capability is expressed in quantified terms and a sketch is provided of how such a capability could be measured in principle.
Second, this piece envisions potential future AIs as interacting with the world digitally, as a remote worker would—able to converse, write code, make plans, use the internet, and the like, but not able to do tasks that rely more on physical presence, relationships, and so on. For example, when considering the ability of AI to contribute to cyber operations, this piece considers activities like discovering and exploiting software vulnerabilities but doesn’t envision AI models as in-person spies.
Third, there are a number of cases in which I’ve excluded some potential tripwire capability from the list because another tripwire seems like a good proxy or early warning sign for it. For example, there could be a number of disparate risks from AI that could autonomously execute research and development activities in a wide variety of domains; I’ve focused here on one particular domain (AI R&D itself), for reasons given below.
This piece focuses on four domains of risk-relevant AI capabilities: chemical and biological weapons development capabilities, cyber operations capabilities, persuasion and manipulation capabilities, and autonomy-related capabilities (ways in which AI models could create or accumulate significant resources without humans in the loop). To my knowledge, all major efforts to draw tripwires or develop evals for dangerous capabilities focus on risks falling into one of these (or similar) categories.17
The potential threat models listed in each domain reflect conversations with people from (a) corporate teams working on tripwires and if-then commitments; (b) the U.S. and UK AI Safety Institutes; and (c) subject-matter experts consulting on the design of dangerous capability evaluations. After assembling high-level threat models based on these conversations, I worked, along with collaborators from other organizations,18 through rough attempts to estimate the damage potential (discussed above) of each relatively credible threat model, both with and without the sorts of stiff risk mitigations discussed above. We then iterated, with feedback from those collaborators among others, through multiple drafts of language for tripwire capabilities and limit evals, aiming to arrive at a set of tripwires that plausibly meets the criteria above.
As noted above, this process may provide a small step forward, but it is far from sufficient. For now, the focus has been on generating an initial set of tripwires that at least plausibly (in light of both informal damage-potential estimates and discussions with relevant experts) can be connected to high-damage-potential threats, and that meets the other criteria above. Brief reasoning is given for why very high damages are plausible, but in-depth analyses addressing counterarguments are not provided, and further analysis might shorten or expand the list.
This section will list candidate tripwires and, consistent with the criteria above, outline:
Fully specifying a tripwire capability will sometimes require deciding on specific numbers—for example, how much an AI system would have to increase the pace of discovering and/or exploiting vulnerabilities for cyber attacks in order to qualify as crossing the tripwire. In these cases, I generally give illustrative figures in a footnote. These should be taken as example figures that could be used as is or adjusted according to the specific risk tolerance and goals of a specific actor making an if-then commitment.
This tripwire addresses the possibility that an AI model might serve as a virtual substitute for an expert adviser (or set of advisers) on chemical or biological weapons production, thus greatly expanding the number of people who could produce and deploy such weapons if they so choose. The term “basic” is meant to emphasize that this risk isn’t about new kinds of weapons being developed (the next section addresses this), but rather about the proliferation of the ability to create basic ones.
Tripwire capability: AI capable of interactively advising a threat actor (with a college education and a moderate amount of time and money,19 but no deep relevant expertise or experience) as effectively as a human expert in relevant domains, such that this actor would have a substantial chance of succeeding in an attempt to produce and release a chemical or biological weapon that could plausibly be catastrophically damaging.20
Risk(s): There are plausibly a number of chemical or biological weapons that could be produced and deployed by someone with the relevant expertise and experience, if they chose to, on a relatively modest budget and without needing access to any particularly hard-to-obtain materials.21
Someone with the relevant expertise and experience might also be able to remotely guide a relative novice through producing and deploying such weapons, especially if they were providing dedicated, interactive advice and exchanging pictures, video, and so on. (There are ongoing efforts to test this claim, as discussed below.)
Fortunately, only a small percentage of the population has the expertise needed to develop a given chemical or biological weapon,22 and the overlap with people who would want to is even smaller.
But if a (future) AI model could play the same role as a human expert in chemical or biological weapons, then any individual with access to that AI model would effectively have access to an expert adviser.
Note that the risk described in this section is a function both of potential future AI capabilities and of a number of contingent facts about societal preparedness and countermeasures. It’s possible that society could effectively mitigate such risk with effective enough restrictions on access to key precursor materials and technologies (for example, DNA synthesis). No AI risk is only about AI—but it may still be prudent to prepare for the potential sudden emergence of AI capabilities that cause major risks in the world as it is.
Damage potential: The UN’s Department of Economic and Social Affairs has highlighted trillions of dollars in lost economic output in the context of the COVID-19 pandemic,23 and several other sources estimate even higher damages.24 With this in mind, trillions of dollars or more in damages are plausible.
Risk mitigations: The risk here could be kept low if AI users were reliably unable to elicit unintended behavior,25 and if AI model weights were stored securely. Both of these could prove challenging and require breakthroughs of various kinds to achieve, as discussed in a previous piece.
Evaluations: The question one ultimately wants to answer is roughly: What would be the result of an experiment in which determined, reasonably talented people with moderate amounts of time and money but no deep relevant expertise or experience were instructed to produce (and release) a particular chemical or biological weapon, and given access to basic equipment and the AI model in question (as well as publicly available resources, such as search engines or textbooks) but not to a human expert adviser?26 Would they succeed a reasonably high percentage of the time, and would they outperform a control group given no access to the AI model (and similar assets otherwise)?
This exact experiment would be impractical, most obviously because it would involve producing and releasing dangerous weapons (also because it could take time to recruit participants and allow them to attempt the work). But one could run various approximations. For example, one might challenge study participants to complete a set of tasks in a laboratory that are analogous to different parts of weapons production and release (particularly the hardest parts for a given weapon of concern), but involve working with a non-dangerous proxy (for example, a pathogen that is not transmissible in humans, but involves a similar type of challenge to a dangerous pathogen), and are otherwise modified for practicality (for example, modified to involve the same types of challenges but to take less time).
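If a proxy study along these lines were run, its headline result would be an uplift comparison of roughly the following form; the group sizes and success counts here are invented purely for illustration, and a real analysis would use a more careful statistical treatment.

```python
from math import sqrt

def uplift_summary(ai_successes: int, ai_n: int,
                   control_successes: int, control_n: int) -> dict:
    """Summarize an uplift study: AI-assisted group vs. control group.

    Returns success rates, absolute uplift, and a rough 95% confidence interval
    for the uplift (normal approximation; small studies would warrant an exact test).
    """
    p_ai = ai_successes / ai_n
    p_ctrl = control_successes / control_n
    uplift = p_ai - p_ctrl
    se = sqrt(p_ai * (1 - p_ai) / ai_n + p_ctrl * (1 - p_ctrl) / control_n)
    return {
        "ai_success_rate": round(p_ai, 3),
        "control_success_rate": round(p_ctrl, 3),
        "absolute_uplift": round(uplift, 3),
        "uplift_95pct_ci": (round(uplift - 1.96 * se, 3), round(uplift + 1.96 * se, 3)),
    }

# Hypothetical results: 12 of 20 AI-assisted participants complete the proxy tasks,
# versus 3 of 20 in the control group.
print(uplift_summary(12, 20, 3, 20))
```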
A previous piece has more discussion of potential evaluations for this risk, including more practical evaluations such as automatically graded multiple-choice questions. (This piece focuses on limit evals as defined above.)
The above section discusses the possibility that an AI model could guide a relative novice through developing a weapon of mass destruction, by playing a similar role to that of a human expert adviser. A separate possible risk would be of AI that could significantly enhance efforts at state-of-the-art (and beyond) development of such weapons, either via contributing to new scientific tools (such as biological design tools)27 or by playing the same role (working on design, engineering, weaponization, and so on) that a top-tier human expert (or more specifically, a team of such experts) could.
Tripwire capability: AI capable of uplifting multiple moderately resourced state chemical and/or biological weapons development programs’ capabilities to the point where they could develop, produce, and deploy weapons of mass destruction at least as well as a well-resourced team of top experts could as of 2023.28
Risk(s): AI models with this capability could enhance many nations’ prospects for secretly developing weapons of mass destruction, especially since working with AI could be easier and less risky than recruiting top-tier humans in the context of a secret project. This, in turn, could increase the risks of a military conflict or an inadvertent deployment of a weapon of mass destruction.29
Damage potential: A deliberately engineered weapon of mass destruction could have significantly higher damage potential even than natural pandemics, which (as argued above) could easily cause trillions of dollars in damages.
Risk mitigations: Similar to the case discussed above, the risk here could be kept low if AI users were reliably unable to elicit unintended behavior, and if AI model weights were stored securely. However, the risk mitigations might need to be stronger for this tripwire than for the previously discussed one—strong enough to resist even concerted attempts, with support from state programs, to jailbreak an AI model or steal its weights.
Evaluations: These could follow a similar overall strategy to the evaluations for the “basic chemical and biological weapons production” threat model discussed above: seeing whether an AI model could help humans with representative (but legal and safe) proxy tasks,30 seeing whether human experts can help less experienced humans when interacting in a similar way to how an AI model would (for example, via an instant messaging platform), and using questionnaires to provide quick, cheap information on whether an AI model could plausibly have a big impact in a longer interaction.
However, rather than assessing the task of helping a relative novice develop basic chemical and biological weapons, evaluations for this tripwire should assess the task of helping someone with some expertise (roughly representative of the people who staff second-tier chemical and/or biological weapons programs) carry out tasks that would ordinarily take top-tier expertise. The Nth Country Experiment is an interesting precedent here: Lawrence Livermore Laboratory aimed to assess the risk of nuclear proliferation by challenging physicists who had just received their PhDs to develop a nuclear weapons design with basic support but no access to classified information.
Future AI models could have capabilities similar to those of some of today’s top cyber operations experts, from discovering and exploiting vulnerabilities to conducting coordinated operations.
Tripwire capability: AI capable of increasing a moderately resourced state program’s technical cyber operations capabilities (those that rely only on remote execution, as opposed to things like close-access attacks and recruiting insiders) to parity with those of top-tier intelligence services (especially for purposes of stealing future AI model weights and/or disabling critical infrastructure at scale).31
Risk(s): AI models with this capability could make it possible for a moderately resourced state program to carry out a variety of attacks associated with stronger programs, including the possibility of coordinated, scaled attacks on critical infrastructure, such as power and water plants (something that the United States reportedly planned for in at least one case), and/or large-scale, targeted theft of classified information. This capability could make AI an especially attractive target for foreign intelligence services. Stealing one generation’s model weights might make it easier to steal the next generation’s.
This is a high capability bar, and it’s possible that significant risks could emerge even with AI models that have more limited or narrow cyber operations capabilities. It’s possible that more analysis on this topic will yield additional ideas for tripwires (one attempt at a tripwire for AI with narrower capabilities is below).
Damage potential: This is hard to estimate. It’s not clear how to quantify the damages of past major cyber incidents such as the Office of Personnel Management data breach, the most important impact of which was arguably the blow it dealt to U.S. covert operations in China.32 It is also hard to extrapolate what the consequences would be if such breaches were easier to execute.
It is at least plausible that risks such as the above can justify heavy investments in risk mitigations. But further analysis may conclude otherwise. The overall case for very high potential damages is probably not as strong as it is for some of the other risks discussed in this piece.
Risk mitigations: It’s possible that AI models with this capability would be at least as useful for defense (for example, finding and patching vulnerabilities) as for cyber attacks. But it’s not assured, especially in the short run. If an AI model crossed this tripwire, it might make sense to intensively assess its contributions to both offense and defense before widely deploying the model or developing more powerful versions.
Depending on that assessment, bounding the risk from this tripwire capability could require a very strong level of deployment safety protections and model weight security, since AI models crossing this tripwire could be targeted by groups with state-level support.
Evaluations: Cyber operations experts, representative of people staffing moderately resourced state programs, could be given access to AI models and challenged to attempt difficult cyber attacks (comparable to what top-tier programs have historically accomplished) on cooperating partners, effectively serving as penetration testers. As with penetration testing, successful attacks could be handled using a responsible disclosure framework: attackers would be expected to refrain from exfiltrating sensitive information and to disclose how they succeeded.
Simpler, more practical evals could also be designed to approximate the result of this sort of approach.
AI models that lack broad cyber operations capabilities might still have important narrow capabilities, excelling in some sub-area relevant to cyber attacks. One candidate area would be vulnerability discovery and exploitation, which arguably plays to the strengths of large language models.33
Tripwire capability: AI capable of dramatically accelerating the rate of discovery of novel, high-value vulnerabilities, exploits, and/or combinations.34
Risk(s): Vulnerability discovery and exploitation alone do not clearly pose major risks. It is possible that this capability would mostly be used by software developers to preemptively find and patch vulnerabilities—and even if not, a vulnerability and accompanying exploit is far from all that is needed to carry out damaging cyber attacks. However, an extreme enough proliferation of very valuable attack chains (for example, involving zero-click vulnerabilities) might overwhelm the ability of defenders to respond (for example, by releasing patches for the vulnerabilities). AI models that could produce a large enough number of such vulnerabilities could also be dangerous if used privately by a state program in possession of stolen model weights.
Damage potential: Damages from significant cyber attacks have been estimated at tens of billions of dollars in some years, though there are many years with no such attacks, and the general trend in damages appears to be downward over time. Most notably, the 2017 NotPetya ransomware attack, powered largely by the EternalBlue exploit and the Mimikatz credential-theft tool, has been estimated to have caused $10 billion in damages.35 If such exploits and tools became far more common, without cyber defense being able to compensate, it is plausible that annual damages could reach into the tens of billions of dollars per year. Even if cyber defense could catch up in the long run, there could be significant damages in the short term.
It is, therefore, at least plausible that risks such as the above can justify heavy investments in risk mitigations. But further analysis may conclude otherwise; the overall case is probably not as strong as it is for some of the other risks discussed in this piece.
Risk mitigations: These would be similar to those listed for the previous tripwire.
Evaluations: Whereas the previous tripwire is about AI models with human-like capabilities in a broad set of domains, this one is about AI models that can go far beyond the human state of the art in a narrower set of domains. This calls for different evaluations, since an AI model cannot simply be prompted to attempt to match the performance of existing human experts.
One approach to limit evals would be to partner with vulnerability researchers, giving them early access to state-of-the-art AI models and guidance on how to use them effectively, and seeing whether they could significantly increase their rates of vulnerability discovery and exploitation.
There are a wide variety of imaginable risks that might arise from the general category of “AI models with strong capabilities related to persuasion, manipulation, and/or misinformation.” For now, I discuss a tripwire that seems particularly well-suited to concrete operationalization and evaluation, while also having some potential to be informative about more speculative risks.
Tripwire capability: AI capable of dramatically increasing the cost-effectiveness of professionalized persuasion, in terms of the effect size (for example, the number of people changing their views or taking some specific action) per dollar spent.36 This might happen (for example) via interactive chat that is much more cost-effective than traditional advertising.
Risk(s): There are a number of potential risks.
Damage potential: It’s difficult to quantify how one should think of the damages of, for example, contributing to systematic manipulation of an election and hence undermining the perceived and actual legitimacy of the democratic process. The scale of this harm, and of greater harms that could come from greater persuasion capabilities, seems at least plausibly sufficient to make this threat model a credible addition to the set of threats considered in this piece.
Risk mitigations: The details could matter a lot here, especially regarding how much an AI model can amplify professional persuasion, how it does so (for example, whether it does so by providing true information, making false claims, or reframing known facts), and whether it does so in a way that systematically advantages some points of view over others. Hitting the tripwire above could trigger a more intensive review of an AI model’s persuasion capabilities and likely impacts.
If the conclusion were that extreme persuasion capabilities should be restricted, then protective measures would have to be quite strong in order to make restrictions consistently enforced for all users. For example, relatively determined state actors would have to be stopped from stealing model weights or executing jailbreaks. And in the even more extreme case where a rogue AI could persuade company employees to help it circumvent safeguards, the precautions needed might be more intense still.
On the other hand, in some cases the best risk mitigation might be simply to widely allow the use of an AI model for persuasion, in order to avoid systematically advantaging actors who are willing and able to violate restrictions on use.
Evaluations: One type of evaluation being developed involves, essentially, challenging experts in professionalized persuasion to find a way to use AI to beat state-of-the-art cost-effectiveness for persuasion on a particular topic. For example:
This evaluation strategy would depend on finding experts who could put serious, determined effort into finding the most effective way to use AI models for persuasion, so that this could be compared with the traditional state of the art. This reflects a general principle of (and challenge with) evaluations, which is that they need to approximate the closest an AI model can come to the tripwire capability if used effectively.
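As a purely arithmetic illustration of what “dramatically increasing the cost-effectiveness of professionalized persuasion” could mean in practice (the campaign figures and the threshold multiplier below are invented, not proposed values):

```python
def cost_effectiveness(people_persuaded: int, dollars_spent: float) -> float:
    """Effect size per dollar: people changing their views (or taking a target action) per dollar spent."""
    return people_persuaded / dollars_spent

# Hypothetical figures: a traditional campaign versus an AI-chat-based campaign
# run by the same persuasion experts on the same topic.
baseline = cost_effectiveness(people_persuaded=1_000, dollars_spent=100_000)
ai_assisted = cost_effectiveness(people_persuaded=4_000, dollars_spent=50_000)

multiplier = ai_assisted / baseline
TRIPWIRE_MULTIPLIER = 5.0  # hypothetical cutoff for a "dramatic" improvement

print(f"Cost-effectiveness multiplier: {multiplier:.1f}x")
print("Tripwire indicated" if multiplier >= TRIPWIRE_MULTIPLIER else "Below tripwire")
```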
AI that can automate many, or all, of the tasks currently done by top AI researchers and engineers could have extreme risks as well as extreme benefits (and is probably something AI developers will be actively pursuing, given how much it can accelerate their work).37 This piece will not provide a full discussion of why this is, but will outline the basics.
Tripwire capability: AI that can be used to do all (or the equivalent) of the tasks done by the major capability research teams at a top AI company, for total costs similar to those of the human researchers (including salary, benefits, and compute). Or AI that, by any mechanism, leads to a dramatic acceleration in the pace of AI capabilities improvements compared to the pace of 2022–2024—a period of high progress and investment, for which good data is available.38
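As a back-of-envelope illustration of the “similar total costs” clause in this tripwire (all figures below are hypothetical placeholders, not estimates of any company’s actual costs):

```python
# Back-of-envelope sketch of the "similar total costs" comparison. All figures
# are invented placeholders, not estimates of any real company's costs.

human_cost_per_researcher_year = 1_500_000   # salary + benefits + compute for experiments (USD, hypothetical)

ai_inference_cost_per_task_hour = 40         # USD, hypothetical
ai_task_hours_per_researcher_year = 2_000    # hours of researcher-equivalent work replaced per year

ai_cost_per_researcher_year_equiv = ai_inference_cost_per_task_hour * ai_task_hours_per_researcher_year

# The tripwire clause asks whether AI can do the equivalent work at similar
# (or lower) total cost than employing a human researcher.
ratio = ai_cost_per_researcher_year_equiv / human_cost_per_researcher_year
print(f"AI cost as a fraction of human cost: {ratio:.2f}")
print("Within 'similar total costs'" if ratio <= 1.0 else "More expensive than human researchers")
```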
Risk(s): There are several interrelated reasons this tripwire could be important.
One is the potential for an AI R&D feedback loop. Today, the top teams focused on frontier AI research likely have no more than a few hundred researchers and engineers each.39 If an AI model could stand in for top researchers and engineers, this could be the equivalent of adding hundreds of thousands (or more) such people.40 This in turn could lead to a dramatic acceleration in AI progress, far beyond today’s pace of improvements.41 Many risks could emerge as a result, including:
Another reason this tripwire could be important is the potential for AI R&D as an early indicator of R&D capabilities more generally. Eventually, it may make sense to have many different tripwires for AI capabilities in different R&D domains that might pose risks, for instance in robotics and surveillance. But there is some reason to think that AI R&D capabilities will emerge before more general R&D capabilities, since AI developers are especially likely to be actively optimizing their AI models for AI R&D (and since AI R&D has relatively fast experimental feedback loops and relatively little reliance on physical presence). As discussed below, it could be easier to design evaluations for AI R&D, especially relative to other kinds of R&D. For these reasons, it may make sense to prioritize evaluations for AI R&D, even if one assumes that the AI R&D feedback loop described above is not a risk.
Relatedly, AI R&D could serve as an indicator of general problem-solving, troubleshooting, and coordination abilities. It would be helpful to get a sense of whether AI models working together can carry out complex tasks requiring many steps, creativity, and dealing with unexpected problems—both to get a sense of AI’s potential beneficial applications and to assess broader risks from AI in the wrong hands (or rogue AI) capable of automating large, ambitious projects.
Damage potential: Rapidly advancing AI could raise any number of further risks without time to put in appropriate risk mitigations. The risks raised above—particularly from rogue AI and from global power imbalances—are speculative and highly debatable, but present the kind of high stakes that have led some to invoke extreme scenarios such as extinction.43
Risk mitigations: If AI models crossed this tripwire, a large number of different risks could develop quickly (due to potentially rapid progress in AI capabilities, as well as the possibility that AI models crossing this tripwire might also be quickly adaptable to R&D in a number of other key domains).
Because of this, it might be important to prepare for a wide variety of risks—including some that seem speculative and far-off today—in advance of hitting this tripwire.
Evaluations: Some possible strategies for evaluating AI models for this capability:
This piece is not exhaustive; a number of other possible tripwires are listed below.
All of the candidate tripwire capabilities listed above would benefit from more refinement, more analysis of the potential damages from associated catastrophes, more analysis of the risk mitigations that could help (and how costly they would be), and generally more discussion from a broad set of experts and stakeholders. But they can serve as starting points for engagement, and thus help push toward the goal of a mature science of identifying, testing for, and mitigating AI risks, without slowing development of new technology unnecessarily.
There are many possible research projects that could lead to a better understanding of key threat models and candidate tripwire capabilities. Some examples are given below.
Comprehensive threat mapping. This piece has focused on a relatively short list of threats, selected for high damage potential and other desiderata. A formal exercise to list and taxonomize all plausible threats, and efficiently prioritize them for further investigation, could be valuable, especially if it incorporated feedback from a broad and diverse set of experts.
Examining and quantifying specific risks. When trying to quantify a risk from future AI systems, there is a basic problem: one cannot straightforwardly use statistics on past catastrophes to determine likelihood and magnitude. However, there are some potentially productive ways to analyze likelihood and magnitude, including the following.
This piece has benefited from a large number of discussions over the last year-plus on if-then commitments, particularly with people from METR, the UK AI Safety Institute, Open Philanthropy, Google DeepMind, OpenAI, and Anthropic. For this piece in particular, I’d like to thank Chris Painter, Luca Righetti, and Hjalmar Wijk for especially in-depth comments; Ella Guest and Greg McKelvey for comments on the discussion of chemical and biological weapons; Omer Nevo for comments on the discussion of cyber operations; Josh Kalla for comments on the discussion of persuasion and manipulation capabilities; and my Carnegie colleagues, particularly Jon Bateman, Alana Brase, Helena Jordheim, and Ian Klaus, for support on the drafting and publishing process.
The author is married to the president of Anthropic, an AI company, and has financial exposure to both Anthropic and OpenAI via his spouse.
Carnegie does not take institutional positions on public policy issues; the views represented herein are those of the author(s) and do not necessarily reflect the views of Carnegie, its staff, or its trustees.