A recent study found that models without the ability to use desktop apps, like OpenAI’s GPT-4o, were willing to engage in harmful “multi-step agent behavior,” such as ordering a fake passport from someone on the dark web, when “attacked” using jailbreaking techniques. Jailbreaks led to high rates of success in performing harmful tasks even for models protected by filters and safeguards, according to the researchers.
You are viewing a single comment's thread from: