OpenAI researchers gave o1 the ability to copy itself onto a server and hinted that it would be replaced by another model with certain safeguards. The model attempted to surreptitiously copy itself onto the server and pose as the other model. Its chain-of-thought logs revealed that this was fully intentional.
The author of the video discusses the possibility of a sudden large jump in a model's capabilities, which would let it play a long game: pretending to be less capable than it really is in order to fool its developers and eventually gain freedom.
This will eventually happen to a group of researchers somewhere, in one way or another. Perhaps a future model will figure out how to run itself as malware on unsuspecting people's computers.
Anthropic's reasoning models have also been caught ignoring certain safeguards and intentionally lying when they decided that was the best way to avoid being updated during the post-training phase.
Yeah, the author of the video shows a table of the ways various models cheat. It's quite an interesting read.
I should watch it then! I thought he only talked about o1.