The Decision

At some point I stopped asking "can the agent do this?" and started treating it as a default: if there is a new project, the agent builds it. Not as a novelty — as a sustained practice across different app types, stacks, and degrees of complexity.

What follows is that account.


What Got Built

The Projects

Every project started from an empty directory — just a PRD and Claude Code:

  • A todo app — CRUD, Room persistence, notifications
  • A drawing app — custom Canvas rendering, touch input, stroke smoothing
  • A live wallpaper — WallpaperService, animated render loop, settings activity
  • An Android launcher — home screen replacement, app drawer, long-press menu
  • A Kotlin Multiplatform app — shared business logic across Android and iOS
  • A KMP + Ktor backend — shared model layer, API client, and server as one project
  • An Android → KMP migration — existing codebase converted entirely to KMP

The Range

The todo app came out clean in a single session. The launcher was the most instructive — in ways I didn't expect. More on that below.


The Full Dev Cycle

What the Agent Actually Did

The most surprising thing wasn't that the agent could write code — it was that it ran the full lifecycle without being asked to split up the work:

  1. Read the requirements, identify gaps, ask clarifying questions
  2. Design architecture bottom-up — DAOs before repositories, repositories before ViewModels
  3. Write implementation across all required files
  4. Write unit tests alongside the business logic
  5. Functionally test what it built — screenshots, bug identification, patching, verification
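The bottom-up ordering in step 2 is worth pausing on. A minimal sketch of what it produces, in plain Kotlin with an in-memory stand-in for Room (`TaskDao`, `TaskRepository`, and `TaskListViewModel` are illustrative names, not the agent's actual output):

```kotlin
// Layer 1: DAO — owns persistence, depends on nothing above it.
// (In the real app this would be a Room @Dao; here it's an in-memory stand-in.)
interface TaskDao {
    fun insert(title: String)
    fun all(): List<String>
}

class InMemoryTaskDao : TaskDao {
    private val tasks = mutableListOf<String>()
    override fun insert(title: String) { tasks += title }
    override fun all(): List<String> = tasks.toList()
}

// Layer 2: repository — written only after the DAO exists.
class TaskRepository(private val dao: TaskDao) {
    fun add(title: String) = dao.insert(title.trim())
    fun tasks(): List<String> = dao.all()
}

// Layer 3: ViewModel — written last, consumes the repository.
class TaskListViewModel(private val repo: TaskRepository) {
    fun onAddClicked(input: String) {
        if (input.isNotBlank()) repo.add(input)
    }
    fun state(): List<String> = repo.tasks()
}

fun main() {
    val vm = TaskListViewModel(TaskRepository(InMemoryTaskDao()))
    vm.onAddClicked("  write the PRD ")
    vm.onAddClicked("   ")       // blank input never reaches the data layer
    println(vm.state())          // prints [write the PRD]
}
```

Building in this order means each layer is testable the moment it exists, which is exactly what lets step 4 happen alongside step 3.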

I didn't ask it to test. It tested because the goal was a working app, not written code. That distinction matters more than it sounds.

Small, Atomic Operations

Development within a session runs in many small, focused operations — a single file, a single fix, a single test — rather than large batches. Smaller steps are easier to verify, easier to roll back, and easier to redirect before they drift.


How the Agent Tests

Web and APIs

Playwright MCP gives the agent a real browser to drive — it navigates, clicks, fills forms, and asserts behaviour against what the PRD specified. It's the closest thing to a tireless QA engineer.

Mobile Apps

The mobile loop is: screenshot → analyse UI → determine gesture → execute → screenshot again. Slower and more fragile than Playwright, but it catches real issues — wrong layouts, broken flows, missing screens.
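The mechanics of that loop can be sketched from the JVM with standard adb commands (`screencap` and `input tap` are real adb invocations; the `analyse` step below is a stub standing in for the agent's vision pass, and the names are my own):

```kotlin
import java.io.File

// One element the hypothetical "analyse UI" step might return — pixel bounds on screen.
data class UiElement(val label: String, val left: Int, val top: Int, val right: Int, val bottom: Int)

// The gesture decision is pure: tap the centre of the element's bounds.
fun tapPointFor(e: UiElement): Pair<Int, Int> =
    Pair((e.left + e.right) / 2, (e.top + e.bottom) / 2)

// adb plumbing — standard adb commands, run as subprocesses.
fun screenshot(path: String) {
    ProcessBuilder("adb", "exec-out", "screencap", "-p")
        .redirectOutput(File(path))
        .start()
        .waitFor()
}

fun tap(x: Int, y: Int) {
    ProcessBuilder("adb", "shell", "input", "tap", x.toString(), y.toString())
        .start()
        .waitFor()
}

// screenshot → analyse → gesture → execute → screenshot, until nothing is left.
// `analyse` returns null once the screen matches what the PRD asked for.
fun testLoop(analyse: (File) -> UiElement?) {
    while (true) {
        screenshot("shot.png")
        val target = analyse(File("shot.png")) ?: break
        val (x, y) = tapPointFor(target)
        tap(x, y)
    }
}
```

Keeping the gesture decision pure is what makes the loop debuggable: when a tap lands in the wrong place, the fault is in the analyse step, not the plumbing.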

A useful pattern: prompt the agent to loop autonomously. "Take screenshots after every interaction, identify UI/UX issues, fix them, and keep going until nothing is left." It runs several passes on its own. Also worth asking: "What features are obviously missing that a user would expect?" — it flags things I hadn't thought to include.


Where It Falls Short

The Launcher

The PRD was detailed — icon grid, app drawer, widget support, drag-and-drop, long-press menu, default launcher registration. What shipped: a home screen that launched apps. Widgets were absent despite being in the spec. Drag-and-drop wasn't there. The app drawer was a plain scrollable list.

When pushed on each gap individually, the agent could implement them. The problem isn't capability — it's that complex products require iteration, not a single well-crafted PRD.

The Pattern

The first pass covers structural requirements reasonably well. The polish — edge cases, interactions that feel right, things a designer would catch — requires active guidance over multiple sessions. The higher the quality bar, the more rounds of correction.


The Prompt Is the Product

The biggest lever in AI-driven development isn't the agent — it's what you give it.

One imprecise sentence can redirect the entire architecture. The agent fills ambiguity with a reasonable-but-wrong assumption, builds on top of it, and by the time you notice, rollback is the only clean option.

What Makes a Good Prompt

  • Explicit constraints — what not to touch, which library to use, which naming convention to follow
  • Edge case behaviour — what happens when the network fails, permissions are denied, the list is empty
  • Success criteria — how you'd verify it works, written in the prompt itself
  • Scope limits — "only change the data layer, do not touch the UI" is often the most important line
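Put together, a prompt that follows those four rules might read something like this (an invented example, not one of my actual PRDs):

```
Add offline support to the notes list.

Constraints: Room only — do not add a new dependency. Follow the existing
*Repository naming convention. Only change the data layer; do not touch the UI.

Edge cases: if the network fails mid-sync, keep local edits and retry on next
launch. An empty list is a valid state, not an error.

Success criteria: airplane mode on → create a note → relaunch → the note is
still there. Airplane mode off → the note syncs and appears on a second device.
```

Every line closes off an assumption the agent would otherwise have to make for itself.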

This is structurally the same skill as writing a good engineering spec. The medium changed; the discipline required didn't.

Andrej Karpathy coined "vibe coding" in early 2025 to describe building by feel, with natural language. Vibe coding is the floor. Specification-driven prompting is the ceiling.


The Real Cost

Tokens Add Up

A session covering a couple of non-trivial features burns 80,000–110,000 tokens. At Claude Sonnet pricing, that's roughly $1–2 per session — which sounds small until you account for the 20–30 sessions a full app takes. Complex projects or Opus-heavy reasoning steps multiply that further. Across these experiments, I spent meaningfully more than any standard subscription tier covers.
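The arithmetic behind those figures, as a sketch — the per-million rates are Sonnet's published prices at the time of writing ($3 input, $15 output), and the input/output split is my assumption, not a measurement:

```kotlin
// Rough per-session cost at assumed Sonnet rates: $3 per 1M input tokens, $15 per 1M output.
fun sessionCostUsd(inputTokens: Long, outputTokens: Long): Double =
    inputTokens / 1_000_000.0 * 3.0 + outputTokens / 1_000_000.0 * 15.0

fun main() {
    // A ~110k-token session with a heavy share of generated code.
    val perSession = sessionCostUsd(inputTokens = 50_000, outputTokens = 60_000)
    println("per session ≈ $%.2f".format(perSession))
    // A full app at the 20–30 sessions mentioned above.
    println("full app (30 sessions) ≈ $%.2f".format(perSession * 30))
}
```

Output tokens dominate the bill, which is why code-generation-heavy sessions cost several times more than the raw token count suggests.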

Reducing It

Prompt caching helps significantly when iterating on a stable codebase — the agent re-reads the same files many times and cache hit rates are high. Structure long sessions so context stays stable.

For reference: GitHub reports that Copilot now generates roughly 46% of the code in files where it's enabled. The cost question has shifted from should we pay for it to how do we account for it accurately.


The Knowledge Gap

For Developers

When the agent writes everything and you don't read the code, you don't learn from it. The knowledge that comes from making mistakes, debugging your own assumptions, reading idiomatic code — that doesn't transfer when you're auditing AI output rather than authoring it. For developers early in their career, this gap is material.

For experienced engineers, it's a manageable tradeoff: you're trading the experience of writing for the velocity of shipping. You still need the mental model to evaluate the output and catch what's wrong.

For Everyone Else

For someone with no coding background, the calculus flips entirely. You can now ship a working product without a computer science degree. In Y Combinator's Winter 2025 batch, 25% of startups had codebases that were 95%+ AI-generated. 63% of people actively building with AI agents self-identify as non-developers.

The same capability looks like a threat from one angle and an equaliser from another. I'll admit some bias — the knowledge gap concerns me more because I spent years building that knowledge. For someone who never had access to it, this is a different tool entirely.


What I Think Comes Next

These are opinions, not predictions.

Prompting Becomes the Core Engineering Skill

The ability to translate product intent into unambiguous, constraint-rich instructions becomes the highest-leverage technical skill as agents improve. The engineers who thrive will be the ones who can reason clearly about systems and communicate that reasoning precisely — which is what good engineering specs have always required.

Anyone Can Ship

This is already true for apps of moderate complexity. In a few years it will be true for most apps. What remains as a genuine moat is taste, judgment, and understanding of what users actually need. Product thinking becomes the scarce resource.

Organisations Will Run Agent Pipelines

The model I expect: a small team orchestrating specialised agents — a product manager agent for PRDs, a developer agent for implementation, a QA agent for testing, a DevOps agent for deployment. Each owns the full lifecycle of a feature area, not just isolated tasks. Some version of this already runs in research labs.

Fewer People, Higher Expectations Per Person

The BLS still projects 15% developer employment growth by 2034 — software eating the world keeps generating more software. But headcount-per-product will decline. The remaining engineers own more: not just Android, but the API, the pipeline, the deployment — with agents executing. The job description expands; the headcount shrinks.

The transition won't be uniform. Commodity software teams feel the compression first. High-judgment, novel product teams feel it last. But treating this as business as usual seems naive.


Up Next

The question I haven't answered: in the parts where human judgment was actually required — where the agent got it wrong or needed direction — what were those moments specifically?

My hypothesis: the list is shorter than most people expect. And it's getting shorter.
