
Shipping apps solo with an AI agent fleet

June 19, 2026

The setup

Over the last few months I shipped several mobile apps to the App Store and Google Play on my own — NFN Connect, MyPepTracker, myFaxxer, and a couple of internal tools. Not by typing faster. By running a small fleet of AI coding agents and treating myself as the part that doesn't scale: judgment and review.

The fleet is deliberately mixed. Claude (Opus) does the reasoning-heavy work — architecture, gnarly debugging, anything where being wrong is expensive. A second, cheaper model handles the mechanical grunt work: boilerplate, refactors, test scaffolding, the tenth near-identical screen. They run on separate subscriptions on purpose — spend the expensive tokens where judgment matters, the cheap ones where it doesn't.

Division of labor

Reasoning model — plans, designs, reviews, and untangles the hard bugs. The one I trust to be right under ambiguity.
Worker model — runs in parallel across isolated git worktrees, one task per branch, so a dozen independent changes happen at once without stepping on each other.
Me — write the specs, read every diff, run the build, decide what merges. The only thing in the loop that is actually accountable.
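
A minimal sketch of the one-task-per-branch layout, assuming a base branch called `main`. The `agent/<task>` branch naming and the sibling `<repo>-wt/` directory are hypothetical choices for illustration, not the actual setup described here. It relies only on the standard `git worktree add -b` command.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task_id: str, base: str = "main") -> list[str]:
    """Build the git command that creates one branch plus one worktree for a task."""
    branch = f"agent/{task_id}"                        # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-wt" / task_id   # hypothetical directory layout
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktrees(repo: Path, task_ids: list[str]) -> None:
    """Create an isolated checkout per task so parallel workers never share files."""
    for task_id in task_ids:
        # check=True raises if git refuses (e.g. the branch already exists).
        subprocess.run(worktree_command(repo, task_id), check=True)
```

Each worker would then be pointed at its own directory. Because every worktree has its own branch and working files, two tasks cannot overwrite each other's uncommitted changes.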

The rule that makes it safe: I'm the reviewer

The agents are capable but not reliable the way a senior engineer is reliable, so the workflow is built around that gap. Agents work on branches and never push to main. I read the diff, I run the tests, I verify the claim — "tests pass" gets checked, not trusted. Nothing ships because an agent said it was done. It ships because I confirmed it.
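
The "checked, not trusted" step can be sketched as a small gate that reruns the test command itself and ignores whatever the agent reported. This is an illustrative sketch: the `Verdict` type and `verify_claim` function are hypothetical names, and the test command is whatever the project actually uses.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    claimed_pass: bool      # what the agent said
    actually_passed: bool   # what the rerun showed

    @property
    def mergeable(self) -> bool:
        # Only the observed result counts; the claim is recorded but never used.
        return self.actually_passed


def verify_claim(claimed_pass: bool, test_cmd: list[str], cwd: str = ".") -> Verdict:
    """Rerun the tests in the agent's checkout and compare against its claim."""
    result = subprocess.run(test_cmd, cwd=cwd, capture_output=True, text=True)
    return Verdict(claimed_pass=claimed_pass, actually_passed=result.returncode == 0)
```

A mismatch between `claimed_pass` and `actually_passed` is worth logging on its own, since it is exactly the confident-wrongness case.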

When I catch a mistake worth not repeating, I don't just fix it — I write it down as a durable lesson the worker loads next time. The model stays stateless; the accumulated competence lives in the harness, and the fleet gets quietly better at my particular codebases over time.
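
One way such a lesson store could look, as a sketch only: a JSON Lines file that grows by one entry per caught mistake and is rendered into a preamble for the worker's next prompt. The file format, field names, and both function names are assumptions made for this example.

```python
import json
from pathlib import Path


def record_lesson(store: Path, mistake: str, rule: str) -> None:
    """Append one lesson; the file is the memory, the model stays stateless."""
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"mistake": mistake, "rule": rule}) + "\n")


def build_preamble(store: Path) -> str:
    """Render all recorded rules as text to prepend to the next task prompt."""
    if not store.exists():
        return ""
    lines = store.read_text(encoding="utf-8").splitlines()
    rules = [json.loads(line)["rule"] for line in lines if line.strip()]
    if not rules:
        return ""
    return "Lessons from earlier reviews:\n" + "\n".join(f"- {r}" for r in rules)
```

Keeping the store in the repository means the lessons are versioned and reviewed like any other change.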

Catching the hallucinations

For anything high-stakes I run an audit tournament: several models, each in its own sandbox, all reviewing the same code blind, then a final pass that separates real findings from confident nonsense. Models disagree, and the disagreement is the signal — one model's false positive is usually another model's "no, that's fine." Majority and adversarial checks kill most of the hallucinated bugs before they waste my time.
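
The majority check can be sketched as a simple vote count. This assumes each reviewer's findings have already been normalized to comparable keys (for example a file plus an issue label), which is the hard part in practice and is not shown. `tally_findings` and the strict-majority threshold are illustrative choices.

```python
from collections import Counter


def tally_findings(
    reports: dict[str, set[str]], quorum: float = 0.5
) -> tuple[list[str], list[str]]:
    """Split findings into those a majority of reviewers raised and the rest.

    reports maps a reviewer name to the set of finding keys it reported.
    """
    votes = Counter(f for findings in reports.values() for f in findings)
    needed = len(reports) * quorum
    confirmed = sorted(f for f, n in votes.items() if n > needed)
    disputed = sorted(f for f, n in votes.items() if n <= needed)
    return confirmed, disputed
```

Disputed findings are not discarded here. They are the ones to hand to the final adversarial pass, since a lone reviewer is sometimes the only one that is right.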

Where it breaks

Vague specs produce vague code. The agent can't see inside my head; a loose task comes back loose. Writing tight, self-contained specs is most of the actual work.
Confident wrongness. The failure mode isn't "I can't do it," it's "done!" — with a subtle bug and a passing-looking test. This is why the human gate is non-negotiable.
It does not replace knowing the system. I can only review what I understand. The day I stop understanding the code is the day this stops working.

What actually made it work

Not the models — the harness around them. Isolated worktrees so parallel work can't collide. CI that builds and signs on a real Mac runner, so "it compiles" means it compiles. Verification baked into the loop instead of bolted on afterward. The agents provide the throughput; the structure provides the trust.

Agents for volume, structure for safety, a human who reads every diff. Solo doesn't mean unsupervised — it means I'm the only supervisor.
