
Coding Agents, Actually — or: How I Learned to Stop Coding and Love the Agent

Slide version — this essay has a companion deck. View the slides →

The survey that started this

A company I consult with ran an internal survey. Somewhere between 80 and 90 percent of their engineers almost never use AI coding agents. When asked why: too unpredictable, too slow to set up, and the last time they tried, it made a mess.

I understand that answer. I had a version of it myself, for a while.

I read AI news every day — AI Twitter, research papers, practitioner blogs. The fascination with Claude Code had been building for months. People kept describing something that sounded implausible: assigning a task, going for a walk, coming back to working code. But every time I’d tried an agent previously, the results were underwhelming. Nothing in those early tries suggested anything qualitatively different was possible. The mental model I had was wrong. I was thinking of it like multi-file autocomplete — AI making changes across several files. Not like delegation. Not like assigning a task to a person and getting it done.

So I had this friction, and I had the wrong expectations, and I never saw a “wow” moment that reset them.

Then came Christmas.

I took a few days off over the holidays. I had a pet project I’d been meaning to start for a long time — a website for book trailers. Never had the time. Over those few days, I decided: I’ll learn how to actually use Claude Code by trying to build this thing.

What I thought would take a month took a couple of days. I finished the entire project in less than a week. I ended up doing more than I’d planned, and the result came out nicer than I’d imagined. The website launched at booktrailers.ai.

Nearly 20,000 lines of Python. A full frontend. Without writing a line of code myself. Without knowing JavaScript, without knowing how to build landing pages, without knowing anything about generating promo videos.

The thing emerged from my imagination. Not from my ability to implement it. That was the moment.

This post is what I wish I could hand to the 80% in that survey — not a list of tools, but an honest account of what actually changed, what the workflow looks like now, and the four investments that make that shift happen.


Something changed in November 2025

In August 2025, Andrej Karpathy — a co-founder of OpenAI, the person who built Tesla’s Autopilot team, and one of the most hands-on practitioners in machine learning (multiple repositories with over 50,000 GitHub stars, much of that code written by him personally) — tweeted about coding agents. His advice: stop using them by default; they’re over-engineered. Work one file at a time. Don’t use tools.

Six months later, same person: “coding agents basically didn’t work before December and basically work since.” He described an agent working for 30 minutes, running into issues, researching solutions, writing code, testing it, coming back with a report. “I didn’t touch anything.”

What actually happened?

Three breakthrough model releases — Anthropic (Opus 4.5), OpenAI (GPT-5.1-Codex-Max), Google (Gemini 3 Pro) — within six days of each other, November 18–24. That simultaneity wasn’t a coincidence. It was a competitive response.

The real catalyst was Claude Code. Anthropic’s coding agent went from $0 to $1 billion in annual recurring revenue in roughly six months. To put that number in context: the fastest-growing SaaS companies in history have taken five to ten years to reach $1 billion ARR. Six months is not just fast — it has no precedent. It was the first agentic coding tool to go genuinely viral among professional developers. It proved there was a massive market — and OpenAI and Google had to respond. They shipped.

The timing wasn’t just commercial pressure. The underlying technology had also matured at the same moment: coding-specific reinforcement learning on real software environments (models trained against actual repos and test suites, not synthetic tasks), and agent harnesses that were themselves trained alongside the models. The models could maintain coherent work for hours instead of minutes. The METR benchmark — which measures how long a task an agent can complete autonomously, in human-equivalent time — jumped from roughly 1 hour to nearly 5 hours for Opus 4.5. That crossed the line for real work.

Karpathy experienced it in December. I experienced it over Christmas. The transition was that fast.


Why most engineers haven’t made the jump

There are two honest reasons I didn’t make this shift sooner, and I’ve heard both from almost everyone I’ve talked to since.

The first is friction and a wrong mental model. I used Cursor for a long time, and its agent UI kept nudging me toward thinking about it as fancy multi-file editing: you describe a change, AI modifies several files, you review the diff. That’s not useless, but it’s not what agents actually are. The right mental model is: you assign a task, like you would to a person. The agent works on it. You look at the result. The gap between those two mental models is enormous, and the UX of most tools in 2024 didn’t make it obvious.

The second is that the early results genuinely weren’t impressive enough. I tried agents a few times in 2024, and the outcome was: it changed two files, I approved, I moved on. I could have done that in two quick edits myself. I didn’t see evidence that anything qualitatively different was possible. So I stopped trying.

What changed is not just my tolerance for friction. The models got materially better — and the workflows around them got figured out. The people who stuck with it through 2024 built up a body of practice: how to structure documentation, how to give the agent a verifiable goal, how to communicate. By late 2025, there was a template to copy. You don’t have to figure it out from scratch anymore.

This is not autocomplete

The mode switch is the thing engineers miss. Most people who “tried AI coding” tried autocomplete — Copilot, Cursor’s tab completion — and concluded it was okay but not transformative. Or they tried chat — ask a question, get an answer — and found it useful for looking things up.

Agents are neither of these. They’re delegation.

You assign a task. The agent plans it, writes code, runs tests, hits errors, looks them up, fixes them, iterates — and comes back when it’s done or when it needs a decision from you.


The four investments

Everything else is downstream of these four. I’ve watched smart engineers try agents and fail because they skipped one of them. I’ve watched engineers who set these up correctly move at a speed that looks implausible.

1. Delegation

AI is not a smarter autocomplete. It’s a very strange, very productive colleague — with real strengths and real blind spots. The better you understand them, the better you collaborate.

Three things that change everything once you internalize them. First: assign tasks, not instructions. “Build me a settings page that passes the e2e test” beats “edit these three files.” One frames an outcome; the other micromanages the path. Second: give a verifiable goal. The agent iterates tirelessly toward a target it can check. Without one, it guesses when to stop — and often stops wrong. Third: run several in parallel. I routinely run five or more agents simultaneously on independent tasks. It’s a small team, not a single assistant. You are the manager.

This last one took me a while to actually do. It felt weird. Then I did it and everything changed again.
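
Mechanically, running agents in parallel doesn’t require special infrastructure. A minimal sketch, assuming the Claude Code CLI’s headless `-p` mode; the task descriptions are illustrative, not from a real project:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Independent tasks, each framed as an outcome with a verifiable goal,
# not as a list of files to edit.
tasks = [
    "Build the settings page; done when tests/e2e/settings_test.py passes.",
    "Add rate limiting to the API; done when tests/test_ratelimit.py passes.",
    "Fix the flaky auth test; done when it passes 20 consecutive runs.",
]

def run_agent(task: str) -> str:
    # `claude -p` runs Claude Code non-interactively: it plans, edits,
    # runs tests, and prints a final report instead of opening a session.
    result = subprocess.run(["claude", "-p", task], capture_output=True, text=True)
    return result.stdout

# You are the manager: assign everything, then review the reports.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    for report in pool.map(run_agent, tasks):
        print(report, "\n---")
```

In practice you’d give each agent its own git worktree or checkout so they don’t trample each other’s files. The shape is the point: independent tasks out, reports back.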

2. Documentation

Two rules. First: when the agent finishes something, ask it to update the docs. Second: if you’ve said the same thing twice — corrected the agent the same way, given the same instruction again, watched it fail in the same spot — that’s a gap in the documentation. Write it down. Add a page. The repetition is the signal. If you find yourself typing “no, don’t do it that way” for the third time, that instruction should have been in the docs after the first.
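
For concreteness, here’s the shape this takes in a hypothetical CLAUDE.md excerpt (Claude Code reads this file as project memory; every entry below stands in for an instruction I’d otherwise have typed twice):

```markdown
## Conventions the agent kept getting wrong
- Run tests with `uv run pytest`, never bare `pytest`; the venv is managed by uv.
- All timestamps are UTC. Never call `datetime.now()` without a timezone.
- New API routes go in `app/routes/`, one file per resource.

## After finishing a task
- Update this file and `docs/` if anything above changed.
```

None of these rules matter in themselves; what matters is that each one retired a correction I was making by hand.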

3. Inner and outer verification

Agents are exceptionally good at one specific thing: trying, failing, adjusting, and trying again until they reach a goal you’ve defined. That’s genuinely where they shine. The iteration is tireless and fast. This is inner verification — the loop the agent uses to check its own work. Give it a clear, verifiable target and it will work toward it — running tests, reading errors, changing approach, running tests again — without you needing to be there.

The flip side is equally true. Without a clear goal, results are poor. The agent does something, declares it done, and stops. It may have technically followed your instructions while missing the point entirely.

This is also the mechanism that makes it possible to delegate really big tasks. You define what “done” looks like precisely enough that the agent can check it. You leave. Two or three hours later you come back and the thing works exactly as you intended. That’s not magic — it’s the agent iterating hundreds of times toward a measurable target while you were away.

Verifiable goals look like: “end-to-end test passes,” “the feature is visible at /dashboard,” “accuracy on the eval set is above 0.74.” Vague goals look like: “make it better,” “fix the styling,” “clean this up.” The difference in outcomes is enormous.
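
To make the contrast concrete, here’s a minimal sketch of a machine-checkable “done”; `eval_set.jsonl` and `predict` are hypothetical stand-ins for your own eval data and model:

```python
# check_done.py: a machine-checkable definition of "done".
# The agent runs this after every change; exit code 0 means the goal is met.
import json
import sys

from mymodel import predict  # hypothetical: the function being developed

THRESHOLD = 0.74

with open("eval_set.jsonl") as f:
    examples = [json.loads(line) for line in f]

correct = sum(predict(ex["input"]) == ex["label"] for ex in examples)
accuracy = correct / len(examples)

print(f"accuracy={accuracy:.3f} (target > {THRESHOLD})")
sys.exit(0 if accuracy > THRESHOLD else 1)
```

The agent can run this script hundreds of times without asking you anything; the exit code is the conversation.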

There’s a second half — outer verification — that most people miss. It isn’t about distrust — the agent’s coding ability is fine. It’s about the fact that your documentation has gaps, your instructions can be misread, and your features accumulate fast. When you run five agents in parallel, you have five times the surface area for things to go wrong. Bugs, inconsistencies between agents, misinterpreted requirements — these are what your automated outer verification needs to catch before they reach you.

The rule is simple: no manual step that runs on every change. The moment you’re doing manual outer verification on each agent’s output, you’ve become the bottleneck. You’re slower than the agents. The whole system stalls waiting for you. Design your outer verification so the pipeline filters what matters and surfaces only the decisions that genuinely require human judgment.
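
One way to wire that up is a single gate script that runs on every agent change and escalates only failures. A sketch, assuming a Python project checked by ruff, mypy, and pytest; swap in whatever your stack uses:

```python
# verify.py: automated outer verification. Runs on every change;
# surfaces only the failures that need human judgment.
import subprocess
import sys

CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["mypy", "src"]),
    ("tests", ["pytest", "-q"]),
    ("e2e", ["pytest", "-q", "tests/e2e"]),
]

failures = []
for name, cmd in CHECKS:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Keep only the tail of the log; that's usually where the error is.
        failures.append((name, (proc.stdout + proc.stderr)[-2000:]))

if failures:
    for name, log in failures:
        print(f"FAILED: {name}\n{log}\n")
    sys.exit(1)  # only now does a human need to look

print("all checks passed; no human attention needed")
```

Hook something like this into CI or a pre-merge check, and five parallel agents mostly filter themselves.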

4. Voice

Most people still type to their agents. I stopped. I talk to Claude Code the way I’d talk to a colleague — describing what I want, explaining context, asking follow-up questions, changing direction mid-thought. I use Wispr Flow for this. It works anywhere.

The obvious benefit is speed. Talking is faster than typing, so you give more context, ask more follow-ups, correct course sooner. More nuance in → better results out. The compounding effect is real: better communication → better understanding → less time correcting → faster iteration.

But the less obvious benefit is the one that surprised me most. Higher bandwidth doesn’t just make you faster — it changes what you’re willing to attempt. When expressing an idea costs almost nothing, your ideas get bolder. Your ambitions get bigger. You start projects you’d have previously dismissed as too complicated. You ask for nicer things. You push further.

This is not a productivity multiplier. It’s an ambition multiplier. The friction of typing was quietly capping what I thought was worth trying. Removing it didn’t just speed me up — it changed the quality and scale of what I built.


Tokenmaxxing

Something shifted in leadership conversations this year. Two years ago, companies were scared of AI — worried about leaks, liability, engineers going off-script. Today the conversation has completely reversed. CTOs and VPs at AI conferences aren’t asking how to slow down AI adoption. They’re frustrated their engineers aren’t moving faster. The question being asked is: why aren’t we using more AI? Why are your engineers using it so little? Why can’t you ship this faster?

There’s a word circulating right now in those rooms: tokenmaxxing — the idea that more AI token usage is a leading indicator of team productivity, and leadership wants to see those numbers go up.

I feel this directly. The company that invited me to give this talk ran that internal survey because their leadership already suspects the answer — and they want to close the gap. The conversation has changed. If you’ve been waiting for organizational permission to go all in, it’s here.

The drudgery is gone

I want to be honest about something personal. I like building things. I always have. But I didn’t always love my day-to-day work as an engineer — and for a long time I thought that was just how it was. A lot of the actual work wasn’t inventing things; it was wrestling with things. Reading through pages of library documentation. Trying to stitch two packages together, hitting an error, reading more docs, trying a different approach, hitting a different error. Hours of that. Useful, necessary, occasionally interesting — but mostly not why I got into this.

That part is gone now.

When I need to connect two systems that don’t quite fit together, I describe the problem to the agent. It doesn’t read ten web pages and try one approach. It tries ten different approaches, learns from each attempt, and finds the one that works — while I’m doing something else. I come back and it works. I barely see the seams. It genuinely feels like magic, because in a way it is: problem-solving at a speed and breadth I couldn’t manage myself.

What’s left is the part I actually love. What should this thing do? How should it work? What would make it genuinely good? Those questions are still entirely mine.

I can finally say honestly that I love my job. That’s new.

The game has changed

There’s a framing I’ve seen that gets it wrong: “AI does 80% of the work, and you still handle the hard 20%.” This undersells what’s actually happening.

The profession hasn’t been incrementally improved. It’s been transformed. What you are has changed.

You’re not a code writer anymore. You’re a creator. An architect. You generate ideas, you see them implemented in near-real time, you manipulate them, you push them further. The feedback loop between imagination and result has collapsed from weeks to hours.

The complexity of the craft hasn’t gone away — it’s moved. It no longer lives in knowing how to write the code. It lives in the quality of your ideas, the clarity of your direction, the design of your inner and outer verification, the architecture of your collaboration with agents. Those are deeply creative skills, and they matter more now, not less.

The game is different. It’s a better game. It’s more fun — at least for me — because it asks more of the things I’m actually good at and care about, and less of the things I could do but found wearing.

If you’re in this profession and you haven’t made the shift yet, I genuinely think you’re missing something. Not just productivity. The work itself is different on the other side.

A few more things that work

These don’t fit neatly into the four investments, but they compound the effect noticeably.

Browser automation. If your outer verification architecture needs to cover a frontend, don’t test it manually. Add Playwright MCP to your agent’s toolbox. Write the test instructions once — where to navigate, what to click, what credentials to use — and from then on, the agent can run and iterate on the full end-to-end test without you. The same inner verification loop that makes agents good at backend work applies just as well to a browser.
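
Playwright MCP lets the agent drive a live browser directly; the same idea also works as a plain script the agent runs and iterates against. A minimal sketch with Playwright’s Python API; the URL, selectors, and credentials are hypothetical stand-ins:

```python
# e2e_login.py: written once; from then on, the agent runs and iterates on it.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # The instructions you'd otherwise give a human tester, written down once:
    page.goto("http://localhost:3000/login")    # where to navigate
    page.fill("#email", "test@example.com")     # what credentials to use
    page.fill("#password", "test-password")
    page.click("button[type=submit]")           # what to click

    # The verifiable goal: the dashboard greeting appears after login.
    # wait_for_selector raises (and the script exits nonzero) if it never does.
    page.wait_for_selector("text=Welcome back")

    browser.close()
    print("e2e login flow passed")
```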

Deep Research. Whenever I hit a non-obvious decision — which framework to use, whether a particular architectural choice is sound, how to compare tools — I run Deep Research instead of asking the agent to guess. It works through hundreds of sources and comes back with a dense, specific report. I also use it for bugs that an agent genuinely can’t crack in a single conversation: some problems are just hard enough that you need a broad survey of solutions before you can pick one. For this talk alone I ran five separate research runs.

Use the wider tool landscape. The useful AI ecosystem is bigger than the coding agent. I alternate between Claude Code for interactive iteration and Codex for async, background work. For this presentation I used Claude Design — it’s powerful for visually complex, aesthetically specific work. Gemini I use for something I couldn’t do before: I have a long backlog of podcasts and talks I want to follow. I feed Gemini the YouTube link. It watches the whole thing and produces a dense summary. I read the summary. Then I decide: satisfied, or does this need a proper watch? Sometimes I end up just having a conversation with the video through Gemini. None of this is directly about coding. It’s about staying informed in a field that changes every week, with less overhead than it used to take.

Pet projects. This one might be the most important. Pick something small that you actually want to exist in the world. Something you’ve been putting off because you didn’t have the time, or the frontend skills, or the patience. Start it. Watch it emerge from nothing — from pure thought. It’s very revealing. It’s very liberating. This is also, in my experience, the fastest path to internalizing the shift. You can read about what agents can do. But seeing something go from imagination to deployed reality in a day, something that’s entirely yours, changes the way you think about what’s possible. This is how the mental model actually moves.


If this is useful, I occasionally write about AI and ML workflows. More posts →