Teaching the cheap model: a learning flywheel for AI coding agents
The cheap model problem
I run a fleet of AI coding agents. The expensive one handles architecture, hard bugs, and anything where being wrong is costly. The cheap one — the flat-rate worker — churns through the long tail of mechanical tasks: refactors, boilerplate, test scaffolding, and all the near-identical screens that come with shipping apps. The economics are obvious: pay for judgment where it matters, pay for tokens where it doesn't.
But cheap models come with a predictable failure mode. They are fast, eager, and confidently wrong in ways that look almost right. The bug isn't "I can't do this"; it's "done!" with a subtle regression hidden behind a passing-looking test. Without a tight loop, the cheap worker doesn't get better — it just gets faster at making the same mistakes.
So I built the feedback loop around the thing that already has to happen anyway: review.
Review first, teach second
Every cheap-model change lands on a branch and sits there until I or the expensive model reads the diff. We don't trust the claim. We run the build, we read the code, we check that the test actually proves what it says it proves. When we catch something worth remembering, we don't just fix it and move on — we write it down as a lesson.
A lesson is a single, durable note: what the worker got wrong, why it matters, and what to do instead. "Don't mock the database in integration tests." "Always validate the URL before fetch in user-triggered flows." "Swift 6 strict concurrency means capture lists need explicit semantics here." Small, specific, tied to a real mistake.
The expensive model writes most of these. It has the context from the review, it knows what the cheap model missed, and it can phrase the lesson in a way the cheap model will understand next time. The review isn't punishment; it's curriculum.
Why injection beats storage
At first I just stored the lessons in a folder and hoped the cheap model would remember. It didn't. These models are stateless. Every task starts fresh. If the lesson isn't in the prompt, it might as well not exist.
That observation changed the shape of the system. The lessons had to be injected, not archived. Now, every time the cheap model picks up a task, we look at what it is being asked to do, find the lessons most relevant to that task, and prepend them to the prompt. The model still has no memory, but the harness gives it the right memory at the right time.
The matching has to be semantic, not keyword-based. Keyword recall fails when the same idea shows up with different words: a lesson about "Swift concurrency" needs to surface for a task about "actor isolation" or "Sendable closures." I index the lessons as vector embeddings and search by cosine similarity, then inject the top matches above the task description.
The result is a worker that improves without being retrained. The model weights stay frozen; the harness around it gets smarter.
Building the flywheel
The loop has four steps, and each one is cheap enough to run every time:
- Run the cheap model on a scoped task. One branch, one change, tight spec.
- Review the diff. Human or expensive model; someone accountable reads it.
- Distill any new mistake into a lesson. One note, one insight, written in the style the cheap model needs.
- Inject matched lessons into the next task prompt. Rebuild the index when lessons change; otherwise just query it.
The lessons live in a small shared git repo so every machine in the fleet gets them. The vector index itself is machine-local and rebuilt from the shared lessons — binary indexes don't belong in version control, but the text lessons absolutely do.
A threshold keeps noise out. Too low and the prompt fills with barely-relevant reminders; too high and the model misses the one lesson it needed. I tuned it by running real tasks and checking whether the injected lessons would have prevented the last few bugs. The right knob is usually a cosine floor plus a relative margin, not a fixed top-k.
What I'd do differently
I spent too long on the index before I had enough lessons to make it useful. The search works best once you have fifty or a hundred notes; before that, a flat list and grep would have been fine. Start simple, add vectors when retrieval quality actually starts to matter.
I also learned that lesson quality matters more than lesson quantity. A vague reminder like "be careful with concurrency" is worse than useless — it wastes tokens and trains the model to ignore the whole block. Each lesson needs a concrete failure and a concrete fix.
Takeaways
- Stateless models don't learn on their own. If you want the worker to get better, you have to put the lesson in the prompt.
- Review is the curriculum. The expensive model's real job isn't just catching bugs; it's turning each catch into a reusable note.
- Match by meaning, not by keyword. Semantic retrieval is what makes the injection scale across different phrasings of the same problem.
- Keep the index local, share the lessons. Text lessons belong in version control; binary embeddings are machine-local build artifacts.
- Tight specs make the whole loop work. A vague task produces vague mistakes, which produce vague lessons, which nobody retrieves.
A cheap model with good retrieval and a careful reviewer beats an expensive model left to guess. The flywheel isn't in the weights — it's in the loop.