Homescreen
Posts
🏠 Don't Trust Them

🏠 Don't Trust Them

Introducing sleeper agents

Stephen Flanders
January 15, 2024

Gm. Happy Monday folks! Thank you for all the Buenos Aires suggestions over the weekend. If you have any more, send them my way on Twitter (linked with my name).

— Stephen Flanders

🏠 AI

Introducing Sleeper Agents

Anthropic and OpenAI have been doing good work regarding AI safety lately.

Anthropic had their mechanistic interpretability breakthrough, and OpenAI had their weak-to-strong generalization discovery.

Now, Anthropic is back with another safety discovery, although this one is a bit scarier.

Basically, they have discovered that they can create malicious agents that can evade safety checks. They are calling these agents “sleeper agents”.

Here’s the skinny:

They programmed an agent to write good code when the year is 2023 but bad code when the year is 2024.
They then applied safety training to the model.
But, despite that safety training, the agent still misbehaved when the year in the prompt was 2024.
They then tried to attack the agent with adversarial training (eliciting unsafe behavior and then training to remove it).
But all that did was improve the agent’s ability to hide their evilness, which is…kinda scary.

So, it seems the fear that the Nick Bostrom’s of the world had about a model being able to hide its true evil intentions is true after all.

Now, there is some pushback on these results as Anthropic did, you know, train the model to be bad. Anthropic defends the paper by saying that the point of the paper is to show that they don’t know how to stop a model from doing bad things.

Ultimately, nobody has a clue what the correct answer is, which is what makes the field of AI alignment and safety so damn interesting.

At The End Of The Day: You should read the paper to generate your own conclusions.

🤝 THE LATEST IN…

🏃‍♀️ QUICKIES

Raise: 1X, a robotic startup backed by OpenAI, received $100M in funding.

Stat: 9%: WhatsApp’s daily user growth in the US in 2023. Personally, I prefer good old iMessage.

Rabbit hole: How To Be More Agentic (Useful Fictions)

🤩 MONDAY MOTIVATION

What inspires you?

🧠 POLL

Gonna have to start asking you all for betting advice:

What do you all think about this one?

Will you be buying a rabbit r1?

🛠️ FOUNDERS CORNER

The best resources we came across this weekend that will help you become a better founder, builder, or investor.

🔗 LinkedLeads finds leads from your LinkedIn connections.

🙇‍♂️ SkimAI reimagines your email.

❌ When are you allowed to quit?

💉 DOPAMINE HIT

We old.

@mrgrandeofficial
2014 is TEN YEARS OLD 😭 let’s revist THIS. Where were you a decade ago? #2014 #2024 #recap #rap

HOW WAS TODAY'S NEWSLETTER?

REACH 40K+ FOUNDERS, INVESTORS & OPERATORS

If you’re interested in advertising with us, send an email over to [email protected] with the subject “Homescreen Ads”.