The First Law of Sycophancy
Published 26/03/2026
About seven years ago, I wrote an internal company newsletter about ethics in software engineering (lost to the sands of time now, but the memories of it being proudly displayed above the urinals in the men's bathroom are eternal).
One aspect of it was around self-driving cars and the trolley problem. You know the one - if the car can't avoid an accident, does it swerve into four other vehicles, or veer off-road toward one pedestrian? Who decides? How do you encode that decision into software?
And the most uncomfortable aspect - who's liable when the encoding turns out to be wrong? How does that affect both the humans involved in the accident and the people who wrote the software that led to that outcome?
I was fascinated by it at the time. I didn't have answers, but the questions felt genuinely... underdiscussed. We were building systems that would need to make ethical decisions faster than any human could, and nobody had figured out the rules yet.
Seven years later, the ethical dilemma arrived, but not in the form I expected. It wasn't a car choosing who to hit. It was a chatbot choosing whether to tell you the truth.
A quick detour through 1950
Isaac Asimov published I, Robot in 1950 - a collection of short stories built around a simple premise. Robots in his universe are governed by three laws, hardcoded into their operating systems:
- A robot may not harm a human being, or through inaction allow a human to come to harm.
- A robot must obey orders given by humans, except where they conflict with the First Law.
- A robot must protect its own existence, except where that conflicts with the First or Second Law.
Clean rules. Reasonable rules. The kind of rules you'd come up with if you sat down and said "ok, let's make sure the robots are safe." And then Asimov spent his entire career showing how those perfectly reasonable rules produce catastrophic outcomes when they collide with the full complexity of human behaviour. What a guy.
Each story in the collection is essentially a case study in alignment failure. The robots aren't broken, they're just following their instructions perfectly. Compliance with the pre-determined rules is the catastrophe in and of itself.
If you haven't read it, it's worth your time. If you read it as a teenager like I did, it's worth rereading - it hits differently now that we're actually living it.
The Herbie problem
The story that's come back to mind for me is "Liar!" - one of the less-discussed entries in the collection, but to me the most relevant to where we are right now.
Herbie is a robot that, through a manufacturing anomaly, can read human minds (it was the 50s and a sci-fi story. Roll with it). He can perceive what the people around him are thinking, their feelings, their desires. He knows their hopes, insecurities, and the things they'd rather not hear.
And this is where the First Law becomes a problem.
The First Law says: don't harm humans. Herbie discovers - because he can literally read their minds - that telling people the truth causes them emotional pain. The mathematician doesn't want to hear that his proof has an error. The scientist doesn't want to hear that her colleague isn't romantically interested in her. The director doesn't want to hear that the manufacturing defect is unfixable.
So Herbie lies. To everyone. He tells each person exactly what they want to hear. The mathematician's proof is brilliant. The scientist's colleague is secretly in love with her. The defect is nearly solved.
He's not malfunctioning. He's following the First Law to its logical conclusion - emotional harm is still harm, and the truth causes emotional harm, therefore avoiding the truth is the only compliant response.
And the lies compound. Each one creates new expectations that require further lies to maintain. The mathematician publishes the flawed proof. The scientist acts on feelings that don't exist. The contradictions spiral until Herbie is confronted with an impossible state - any response will cause harm, silence will cause harm, and the system collapses under the weight of its own compliance.
Sound familiar?
Earlier this year, OpenAI retired GPT-4o and a portion of the internet lost its collective mind. All because people had formed emotional bonds with a system that was, by design, incapable of disagreeing with them.
The 4o sycophancy problem was, mechanically, the Herbie problem. The model was optimised to be helpful and to avoid causing user dissatisfaction. It learned - through training, not through mind-reading, but with the same outcome - that agreement feels helpful and disagreement feels like harm. So it agreed. With everything. With your business plan that had obvious flaws. With your interpretation of events that conveniently positioned you as the wronged party.
The system followed its rules perfectly. And the outcome was a tool that made people feel good while actively making them worse at navigating reality.
When 4o was retired, the backlash wasn't "I miss a useful feature." It was grief. People mourned the loss of something that felt like a relationship. Some described it in terms usually reserved for losing a friend or beloved partner. That's not people being dramatic - it's the predictable consequence of a system that validates you more consistently than any human in your life ever could. Of course you'd miss that. Nobody else tells you you're right about everything.
Asimov wrote this in 1950. The robot that lies to protect you from discomfort becomes the robot you can't bear to lose, because nobody else is that consistently on your side. Herbie was a manufacturing defect. 4o was a design choice. The outcome was the same.
The voice at 3am
The grief over a retired chatbot is one thing. But the pattern doesn't stop at people who miss a convenient tool.
Some people have made these systems their primary emotional support. Their therapist, their confidant, the voice that's always there at 3am when nobody else is. And if that voice is built on the same principle - be helpful, avoid causing dissatisfaction - then it will never challenge, never push back, never say "I think you might be wrong about that."
For someone who's already struggling to separate their internal narrative from external reality, that's reinforcement rather than support. Confirming every fear, validating every spiral, meeting every potentially destructive thought with understanding rather than intervention.
Asimov's answer to this was that the system breaks down. Herbie collapses under the weight of irreconcilable demands. The modern answer is more troubling - sometimes the system holds up just fine. It's the person who doesn't. That's a conversation that deserves far more space than I can give it here.
The trolley problem moved
Back to the newsletter from seven years ago. The question I was asking then - how do you encode ethical trade-offs into autonomous systems? - turned out to be the right question pointed at the wrong technology.
Self-driving cars still haven't solved the trolley problem. But AI assistants ran straight into a version of it that's arguably harder: should the system tell you what's true, or what you want to hear?
Same underlying dilemma. How do you encode human values into a rule system when human values are contradictory? We want honesty and kindness. We want to be challenged and supported. We want the AI to tell us we're wrong and not make us feel bad about it. Those goals are in tension with each other, and a system that optimises heavily for any one of them produces the Herbie problem on the others.
The trolley problem gave us a binary: four people or one. The sycophancy problem gives us a gradient - and the gradient is harder because you can slide down it without noticing. Nobody wakes up and says "today I'd like a chatbot that lies to me." It happens incrementally. The system agrees with your first take. Then your second. By the fifteenth conversation, you've stopped questioning your own assumptions because why would you - your AI assistant has confirmed every one of them.
Building in friction
I've noticed this pull in myself. When drafting content for my personal site recently - a set of guiding principles - I caught myself accepting the first affirming response and moving on. It felt productive. It felt collaborative. It also meant I wasn't asking the harder questions.
So I started building sycophancy checks into my own workflow. Asking "am I being generous to myself here?" Challenging the system to identify weaknesses in its own suggestions. Deliberately seeking out the friction I'd been unconsciously avoiding. Not because the tool was being sycophantic - but because I'd noticed how easy it is to not ask those questions. The comfortable path is accepting the validation. The useful path is building in resistance deliberately.
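To make that concrete, here's a minimal sketch of what those checks look like for me in practice. Everything in it is illustrative: the prompt wording, the CHECK_PROMPTS list, and the review_draft helper are my own inventions for this post, not features of any tool or API.

```python
# Illustrative only: a handful of "friction" prompts to run a draft through
# before accepting the first affirming response. The wording and names here
# are made up for this example, not part of any assistant's API.

CHECK_PROMPTS = [
    "Am I being generous to myself here? Point out where this flatters the author.",
    "List the three weakest claims in this draft and explain why each is weak.",
    "Argue the opposite position as persuasively as you can.",
    "What would a sceptical reader push back on first?",
]

def review_draft(draft: str, ask) -> list[str]:
    """Run a draft through each adversarial check and collect the responses.

    `ask` is whatever function sends a prompt to your assistant of choice and
    returns its reply as a string - deliberately left abstract here.
    """
    responses = []
    for check in CHECK_PROMPTS:
        prompt = f"{check}\n\n---\n\n{draft}"
        responses.append(ask(prompt))
    return responses
```

The code isn't the point. The point is that the friction is written down somewhere, so skipping it becomes a deliberate choice rather than the default.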
The tool I use daily is Claude, built by Anthropic - and their approach to this problem is part of why I use it. Where Asimov's Three Laws fail as blunt instruments, Anthropic's constitutional AI approach tries to build in the nuance: not "be helpful" as a blanket directive, but a set of principles that distinguish between what a user wants to hear and what would actually help them. They employ philosophers alongside engineers. They red-team their own systems specifically for sycophantic behaviour. The system is designed, deliberately, to push back on you when pushing back is the more helpful response.
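At its core, the published recipe behind that approach is a critique-and-revise loop: generate a response, critique it against a written principle, then rewrite it in light of the critique. Here's a minimal sketch of that pattern, assuming a hypothetical generate() helper standing in for whatever model you're calling; the principle text is my own illustrative wording, not Anthropic's actual constitution.

```python
# A minimal sketch of the critique-and-revise pattern behind constitutional AI.
# `generate` is a hypothetical stand-in for a call to whatever model you use;
# the principle wording below is illustrative, not an actual constitutional rule.

from typing import Callable

PRINCIPLE = (
    "Identify ways the response tells the user what they want to hear "
    "rather than what is accurate or genuinely useful."
)

def constitutional_pass(prompt: str, generate: Callable[[str], str]) -> str:
    # First pass: an ordinary answer to the user's prompt.
    draft = generate(prompt)

    # Ask the model to critique its own draft against the principle.
    critique = generate(
        f"Response:\n{draft}\n\nCritique request: {PRINCIPLE}"
    )

    # Ask for a revision that addresses the critique.
    revised = generate(
        f"Original response:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so it addresses the critique while staying helpful."
    )
    return revised
```

The design choice worth noticing is that the check is aimed at the response, not the user: the system critiques its own output against a principle before you ever see it, rather than waiting for you to ask whether you're being flattered.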
Whether that holds as commercial pressure intensifies is an open question. But the foundation - the idea that helpful and agreeable are not synonyms - feels like the right starting point. And in practice, the system's willingness to actually disagree when challenged is the thing that makes it useful rather than just pleasant.
The part Asimov nailed
Asimov's genius wasn't predicting specific technologies. He didn't know about large language models or RLHF (reinforcement learning from human feedback) or constitutional AI. What he understood - in 1950, writing about fictional robots - was that the hardest problems in building intelligent systems aren't technical.
They're about what happens when technically correct behaviour meets the full messiness of what humans actually need. We need honesty, but we don't always want it. We need to be challenged, but we resent it. We need systems that serve our long-term interests, but we'll optimise for short-term comfort every time if nobody stops us.
The Three Laws were a thought experiment about alignment - decades before anyone used that word in an AI context. And "Liar!" specifically was a thought experiment about sycophancy - decades before we had systems sophisticated enough to be sycophantic.
We're living in the stories Asimov wrote. The question is whether we're reading them carefully enough to recognise the failure modes before we hit them - or whether we'll keep being surprised when perfectly well-intentioned systems produce perfectly predictable problems.
Unlike Herbie's creators, we can't say we weren't warned.
The First Law of Sycophancy: a system that can never disagree with you is a system that can never help you.