- Homescreen
- Posts
- 🏠 Don't Trust Them
🏠 Don't Trust Them
Introducing sleeper agents
Gm. Happy Monday folks! Thank you for all the Buenos Aires suggestions over the weekend. If you have any more, send them my way on Twitter (linked with my name).
🏠 AI
Introducing Sleeper Agents
Anthropic and OpenAI have been doing good work regarding AI safety lately.
Anthropic had their mechanistic interpretability breakthrough, and OpenAI had their weak-to-strong generalization discovery.
Now, Anthropic is back with another safety discovery, although this one is a bit scarier.
Basically, they have discovered that they can create malicious agents that can evade safety checks. They are calling these agents “sleeper agents”.
Here’s the skinny:
They programmed an agent to write good code when the year is 2023 but bad code when the year is 2024.
They then applied safety training to the model.
But, despite that safety training, the agent still misbehaved when the year in the prompt was 2024.
They then tried to attack the agent with adversarial training (eliciting unsafe behavior and then training to remove it).
But all that did was improve the agent’s ability to hide their evilness, which is…kinda scary.
So, it seems the fear that the Nick Bostrom’s of the world had about a model being able to hide its true evil intentions is true after all.
Now, there is some pushback on these results as Anthropic did, you know, train the model to be bad. Anthropic defends the paper by saying that the point of the paper is to show that they don’t know how to stop a model from doing bad things.
Ultimately, nobody has a clue what the correct answer is, which is what makes the field of AI alignment and safety so damn interesting.
At The End Of The Day: You should read the paper to generate your own conclusions.
🤝 THE LATEST IN…
TECH
AI
The military can now also become best friends with ChatGPT.
Microsoft really wants you to use Copilot.
I’m sorry, but I cannot fulfill this request.
Is it time to speed up our AI timelines?
SCIENCE
Plants talking is pretty damn cool.
Who will get a general robotic brain first?
AI may have just saved 10,000 people every year.
CRYPTO
BlackRock CEO Larry Fink is now backing an Ether ETF.
So long, GameStop NFT marketplace.
Let’s think about fintech from first principles.
🏃♀️ QUICKIES
Raise: 1X, a robotic startup backed by OpenAI, received $100M in funding.
Stat: 9%: WhatsApp’s daily user growth in the US in 2023. Personally, I prefer good old iMessage.
Rabbit hole: How To Be More Agentic (Useful Fictions)
🤩 MONDAY MOTIVATION
What inspires you?
🧠 POLL
Gonna have to start asking you all for betting advice:
What do you all think about this one?
Will you be buying a rabbit r1? |
🛠️ FOUNDERS CORNER
The best resources we came across this weekend that will help you become a better founder, builder, or investor.
🔗 LinkedLeads finds leads from your LinkedIn connections.
🙇♂️ SkimAI reimagines your email.
💉 DOPAMINE HIT
We old.
@mrgrandeofficial 2014 is TEN YEARS OLD 😭 let’s revist THIS. Where were you a decade ago? #2014 #2024 #recap #rap
HOW WAS TODAY'S NEWSLETTER? |
REACH 40K+ FOUNDERS, INVESTORS & OPERATORS
If you’re interested in advertising with us, send an email over to [email protected] with the subject “Homescreen Ads”.