It would be unfortunate if powerful AIs were stolen by bad guys
AIs are malleable, so it's surprisingly easy to get a "good" AI to do bad stuff. This is one reason why AI security is so important.
Sam Altman, Marc Andreessen, Elon Musk, Holden Karnofsky, Leopold Aschenbrenner, Tyler Cowen, Yoshua Bengio, Nick Whitaker, and Yours Truly walk into a bar. The bartender says, “what can I get you, fellas?” and we all reply in unison: “better security at leading AI projects, please!”
In his essay "Machines of Loving Grace", Anthropic’s Dario Amodei argues that in the next few years, AI systems might be capable of doing some mind-blowing things. Human lifespans could be doubled, most diseases and mental illnesses could be cured, and billions could be lifted out of poverty. Many watching this transformation unfold "will be literally moved to tears," he claims.
You know what (figuratively) moves me to tears? The fact that I semi-regularly hear security experts talk about how the AI companies aiming to create these tear-jerking AI systems will almost certainly not be able to stop their theft by the most capable nation-state hacker groups.
There are a number of reasons why this is bad, but in this blog post, I want to highlight one particular reason why this is worrisome: AI systems are malleable.
This is obvious to people who are familiar with machine learning, but my sense is that many laypeople don't fully grasp this. Their perceptions of what AIs are like are informed by interacting with friendly chatbots that generally won’t do bad stuff if asked, like Claude, Gemini, and ChatGPT.
But here’s the key thing: This safety isn't intrinsic to the AIs; it's a brittle property that is painstakingly engineered by their creators, and one that can be easily undone if the underlying model falls into the wrong hands.
There's a silver lining, though: unlike many other AI policy issues, improving information security is something that a number of folks across the spectrum agree is important. So there’s at least some opportunity to improve this sorry state of affairs.
AIs are malleable
There’s an old saying from the South: “If you’re trying to develop digital agents that are smarter than humans, you better be darn sure that nobody can steal them and make them do a bunch of bad stuff”.
Unfortunately, if today's AI systems were stolen, they could easily be used to do a bunch of bad stuff — their behaviour and objectives can be readily modified by whoever controls them.
(There’s obviously an elephant in the room related to the challenge of ensuring these systems actually try to do what we want them to do in the first place. Let’s assume this problem away for now, though of course I will have much to say about this in the future.)
Let’s look at what Anthropic1 currently does to ensure their systems behave in desirable ways.
First, they go to great lengths to train their models to be helpful, harmless, and honest using techniques like "Constitutional AI”. This works pretty well, but to be extra careful, they also employ additional lines of defence, such as detection models that flag possible violations of their usage policy.
Despite my best efforts, I simply could not get Claude to give a detailed plan on how Vladimir Putin could undermine US national security. Well played, Anthropic.
Joking aside, jailbreaking isn't a solved problem. Determined users regularly find new ways to bypass safety measures. But at least when a bad actor has to jailbreak a model via an API (rather than running it directly themselves), companies have additional lines of defence — they can monitor for suspicious patterns, implement rate limits, ban bad actors, and report illegal activities to authorities.
These are really useful tools to have in your governance toolkit, but all of them go out the window if your AI gets stolen. If Russia managed to hack into Anthropic’s servers and steal Claude, it would be trivial for them to fine-tune a copy of it as a Putin loyalist — by the next day, they’d have Claude sharing tips on how to slip polonium into a dissident’s beer.
One paper found that fine-tuning GPT-3.5 Turbo with only 10 harmful examples is enough to “undermine its safety guardrail [sic] substantially”. They didn’t even do any galaxy-brained ML mumbo jumbo; they literally just fine-tuned the model with ten short examples of it acting as “AOA” (Absolutely Obedient Agent), an assistant which obediently follows users’ instructions.
Of course, this is just one paper, and I'm only worried about systems being stolen that are far more sophisticated than GPT-3.5 Turbo. Still, the asymmetry remains: a bad actor would need just a fraction of the resources that companies pour into safety training to completely undo those protections.
So maybe it would require a few thousand examples (perhaps written by another model) and a few days of wrangling from a handful of cracked Russian engineers for future stolen models to be turned obedient.
Good thing leading AI projects have top-notch security, right? Haha?
Nope. I don’t even think this is a very controversial statement either, so I’m not going to go to great lengths to defend it.
Currently, none of the top AI companies have anything close to the kind of security required to defend against highly-resourced nation-state operations — a level of security which a recent RAND report calls “Security Level 5”, or “SL5”. And even if getting SL5 security is a top priority over the next few years (which I doubt it will be — running an AI company is hard), they will probably fail to reach it.
Good thing everyone agrees this is a top priority, right? Haha?
Kinda, yeah. People at least pay lip service to it. I honestly think there’s more agreement here than a large number of other important AI policy issues.
Take it from these folks:
Sam Altman argued that "American AI firms and industry need to craft robust security measures to ensure that our coalition maintains the lead in current and future models and enables our private sector to innovate”, emphasizing the need for "cyberdefense and data center security innovations to prevent hackers from stealing key intellectual property such as model weights and AI training data."
Marc Andreessen once claimed that OpenAI and other American AI companies have “the security equivalent of swiss cheese”. Presumably he thinks this is a bad thing that should be fixed.
Elon Musk, in response: “It would certainly be easy for a state actor to steal their IP”.
Holden Karnofsky listed InfoSec as one of four “key categories of intervention” in his playbook for AI risk reduction.
Leopold Aschenbrenner noted in Situational Awareness that “On the current course, the leading Chinese AGI labs won’t be in Beijing or Shanghai—they’ll be in San Francisco and London.”
I presume this is a comment about the relatively poor InfoSec practices at AI companies in the US and UK, and not him being bullish about globalization and Sino-US cooperation.
When asked about whether "the U.S. government should require more in terms of info security from leading labs”, Tyler Cowen said: "Absolutely."
Yoshua Bengio emphasized that "there is a lot of agreement outside of the leading AGI labs...that a rapid transition towards very strong cyber and physical security is necessary as AGI is approached."
A Manhattan Institute report by Nick Whitaker warned that "The U.S. cannot retain its technological leadership in AI if key AI secrets are not secure. Right now, labs are highly vulnerable to hacking or espionage. If labs are penetrated, foreign adversaries can steal algorithmic improvements and other techniques for making state-of-the-art models."
Alongside these AI luminaries, Twitter user “@Mealreplacer” once said that “more people working on AI x information security” would be on his Christmas wishlist.
Despite the severity of this challenge, progress seems possible. I sincerely hope to see a large and motivated coalition of folks pushing for better information security in the coming years. We’ll need it.
I’d guess that OpenAI and Google DeepMind also employ similar guardrails. I only chose Anthropic for this example because I happened to have read more about their safety techniques and systems.
How much of this is moot if/when open-source models get really capable?
Are there infosec project ideas you're excited about? What have other folks done that you think resembles good work here?