Shipping apps solo with an AI agent fleet

The setup

Over the last few months I shipped several mobile apps to the App Store and Google Play on my own — NFN Connect, MyPepTracker, myFaxxer, and a couple of internal tools. Not by typing faster. By running a small fleet of AI coding agents and treating myself as the part that doesn't scale: judgment and review.

The fleet is deliberately mixed. Claude (Opus) does the reasoning-heavy work — architecture, gnarly debugging, anything where being wrong is expensive. A second, cheaper model handles the mechanical grunt work: boilerplate, refactors, test scaffolding, the tenth near-identical screen. They run on separate subscriptions on purpose — spend the expensive tokens where judgment matters, the cheap ones where it doesn't.

Division of labor

  • Reasoning model — plans, designs, reviews, and untangles the hard bugs. The one I trust to be right under ambiguity.
  • Worker model — runs in parallel across isolated git worktrees, one task per branch, so a dozen independent changes happen at once without stepping on each other.
  • Me — write the specs, read every diff, run the build, decide what merges. The only thing in the loop that is actually accountable.
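The one-task-per-branch worktree layout could be scripted roughly as below. This is a minimal sketch, not the actual harness: the `agent/` branch prefix, the sibling `-wt` directory, and the helper names are all assumptions made for illustration. Only `git worktree add -b <branch> <path> <base>` is standard git.

```python
import subprocess
from pathlib import Path


def slugify(task: str) -> str:
    """Turn a task title into a branch-safe slug."""
    cleaned = "".join(c.lower() if c.isalnum() else "-" for c in task)
    return "-".join(part for part in cleaned.split("-") if part)


def worktree_command(repo: Path, task: str, base: str = "main") -> list:
    """Build the git call that gives one task its own branch and directory."""
    slug = slugify(task)
    # Hypothetical layout: worktrees live beside the repo, e.g. ../myapp-wt/<slug>
    path = repo.parent / f"{repo.name}-wt" / slug
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{slug}",  # hypothetical branch naming convention
        str(path),
        base,
    ]


def create_worktree(repo: Path, task: str, base: str = "main") -> None:
    """Create the worktree; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_command(repo, task, base), check=True)
```

Each worker then gets pointed at its own directory, so two agents never share a working copy or an index, and nothing touches `main` until a branch is merged by hand.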

The rule that makes it safe: I'm the reviewer

The agents are capable but not reliable the way a senior engineer is reliable, so the workflow is built around that gap. Agents work on branches and never push to main. I read the diff, I run the tests, I verify the claim — "tests pass" gets checked, not trusted. Nothing ships because an agent said it was done. It ships because I confirmed it.
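"Checked, not trusted" can be made mechanical. The sketch below assumes the project's tests can be started from a single command and that a zero exit code means success. `run_gate` and `GateResult` are hypothetical names. The point is that the verdict comes from re-running the command, never from the agent's summary text.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    passed: bool
    exit_code: int
    output_tail: str  # last part of the output, kept for the human reviewer


def run_gate(test_cmd: list, cwd: Optional[str] = None, timeout: int = 600) -> GateResult:
    """Re-run the test command and judge by exit code only."""
    proc = subprocess.run(
        test_cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    tail = (proc.stdout + proc.stderr)[-2000:]
    return GateResult(passed=proc.returncode == 0,
                      exit_code=proc.returncode,
                      output_tail=tail)


if __name__ == "__main__":
    # A command that prints a reassuring message but exits non-zero still fails the gate.
    fake = [sys.executable, "-c", "import sys; print('all tests pass'); sys.exit(1)"]
    print(run_gate(fake).passed)
```

A gate like this only covers the "run the tests" step. Reading the diff and deciding what merges remain manual.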

When I catch a mistake worth not repeating, I don't just fix it — I write it down as a durable lesson the worker loads next time. The model stays stateless; the accumulated competence lives in the harness, and the fleet gets quietly better at my particular codebases over time.
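One plausible shape for that lesson store is sketched below. It assumes a JSON-lines file and a prompt that simply prepends the lessons to the spec. The file format, tags, and function names are illustrative guesses, not a description of the real harness.

```python
import json
from pathlib import Path


def record_lesson(store: Path, lesson: str, tag: str = "general") -> None:
    """Append one lesson to the store; the file persists between agent runs."""
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"tag": tag, "lesson": lesson}) + "\n")


def load_lessons(store: Path) -> list:
    """Read all lessons back; a missing file just means no lessons yet."""
    if not store.exists():
        return []
    with store.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_prompt(store: Path, spec: str) -> str:
    """Prefix the task spec with every recorded lesson (hypothetical prompt layout)."""
    lessons = load_lessons(store)
    if not lessons:
        return spec
    bullets = "\n".join(f"- [{item['tag']}] {item['lesson']}" for item in lessons)
    return f"Lessons from earlier reviews:\n{bullets}\n\nTask:\n{spec}"
```

The model itself carries nothing between runs; whatever it "remembers" is whatever `build_prompt` puts in front of it.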

Catching the hallucinations

For anything high-stakes I run an audit tournament: several models, each in its own sandbox, all reviewing the same code blind, then a final pass that separates real findings from confident nonsense. Models disagree, and the disagreement is the signal: one model's false positive is usually another model's "no, that's fine." Majority votes and adversarial checks kill most of the hallucinated bugs before they waste my time.
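The majority-vote part of such a tournament can be sketched in a few lines. This assumes each reviewer's findings have already been normalized to comparable strings, which is the hard part in practice and is not shown. The reviewer names, the sample findings, and the quorum of two are all made up for illustration.

```python
from collections import Counter


def tally(reports: dict, quorum: int = 2) -> tuple:
    """Split findings into (confirmed, disputed) by how many reviewers reported each."""
    counts = Counter()
    for findings in reports.values():
        # Each reviewer's findings are a set, so one reviewer counts once per finding.
        counts.update(findings)
    confirmed = sorted(f for f, n in counts.items() if n >= quorum)
    disputed = sorted(f for f, n in counts.items() if n < quorum)
    return confirmed, disputed


if __name__ == "__main__":
    reports = {
        "model_a": {"auth.py:42 token never expires", "api.py:10 missing null check"},
        "model_b": {"auth.py:42 token never expires"},
        "model_c": {"auth.py:42 token never expires", "ui.py:7 race condition"},
    }
    confirmed, disputed = tally(reports)
    print(confirmed)  # findings reported by at least two reviewers
    print(disputed)   # single-reviewer findings, left for the final pass
```

Disputed findings are not discarded here. A lone report may still be a real bug, so it goes to the final pass rather than straight to the bin.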

Where it breaks

  • Vague specs produce vague code. The agent can't see inside my head; a loose task comes back loose. Writing tight, self-contained specs is most of the actual work.
  • Confident wrongness. The failure mode isn't "I can't do it," it's "done!" — with a subtle bug and a passing-looking test. This is why the human gate is non-negotiable.
  • It does not replace knowing the system. I can only review what I understand. The day I stop understanding the code is the day this stops working.

What actually made it work

Not the models — the harness around them. Isolated worktrees so parallel work can't collide. CI that builds and signs on a real Mac runner, so "it compiles" means it compiles. Verification baked into the loop instead of bolted on afterward. The agents provide the throughput; the structure provides the trust.

Agents for volume, structure for safety, a human who reads every diff. Solo doesn't mean unsupervised — it means I'm the only supervisor.
