On a quick read, this post equivocates a bit between the security of AI systems (i.e. their behavior being malleable) and the security of the weights. Do you have a take on which is more important?
I think weight security is more important than prosaic concerns related to AI systems' behaviour (though TBD how this shakes out re: scheming).
It's very non-obvious to me that malleability is a bad thing from the perspective of takeover risk, which I think is ~probably a more worrisome threat than misuse risk (the focus of this piece).
It might be good if AIs have some "malleability" initially — so developers could shape their values/goals/motivations with intention — and then are hard to shape iff they're stolen by adversaries. But that doesn't seem possible based on how models are currently trained AFAICT.
How much of this is moot if/when open-source models get really capable?
Are there infosec project ideas you're excited about? What work by other folks do you think resembles good work here?