Felix runs a small events business in London, ten to twelve nights a month, on his own. I built him an agent to wire his promotions together and tell him where his ticket sales actually came from. This is the story of what building it taught me: how the architecture forced me into evals, and how a stronger model, acting as judge, caught my agent having a panic attack on a friend's behalf.
Felix is a close friend who runs a small events business in London, ten to twelve nights a month, on his own. He promotes every one of them across email, WhatsApp, Instagram and his website, with no real idea which channel sells a single ticket. I set out to build him an agent that wired everything together, ran his promotions, and learned where his sales actually came from.
I built it twice. The first version was a conventional state machine with the language model bolted on as a garnish. It was brittle. The second was an agent, with the model doing the reasoning and tools doing the fetching. That was the better system, but it moved the hard part somewhere I hadn't expected, and then somewhere I couldn't see at all.
Testing that invisible layer is where I taught myself evals. I'd been running questions past the bot and eyeballing the answers without knowing the practice had a name. The eval caught something I never thought to look for: under partial data, the agent stopped sounding like a chief of staff and started sounding like a panic attack. Fixing that, in the prompt and in the model routing, is the real subject here. It's also the discipline I now use in professional work.
Felix's problem was never effort. It was visibility. When a night sold well he couldn't tell you why, and when it sold slowly he couldn't tell you where to push.
He runs ecstatic dance and movement nights with a real community around them, promoted across five WhatsApp groups, a thousand-odd person mailing list, Instagram, and his own site. Ticketing sits on one platform, email on another, the links scattered across channels that report nothing back. So every cycle he was promoting hard and guessing, with no thread connecting a WhatsApp share to a ticket sold.
The idea was a mobile-first agent, living in Telegram where Felix already was, that connected his ticketing, his email, the WhatsApp links and the site into one loop. It would run his promotions, track every link, and feed the attribution back to him, so each cycle taught the next: see what worked, lean into it, learn, lean in again. And because Felix wanted a colleague rather than a lecture, the agent had to advise like someone who actually knew event marketing. So part of the build was research, a real read of email and WhatsApp promotion best practice, baked into the agent's context so its advice came from current practice rather than vibes.
The product thesis was a flywheel, not a dashboard. Most analytics tools stop at showing you a number. The point here was to close the loop: the agent runs a promotion, tracks the clicks per channel, watches what converts, and folds that back into the next recommendation. Over enough cycles, Felix's sense of his own channels gets sharper and the agent's advice gets more specific to his business in particular.
That framing mattered for the build, because it meant the agent needed memory and a feedback loop from day one, not as a later feature. It logs what it recommended, an evening job resolves those recommendations against real outcomes, and a nightly pass extracts patterns into the agent's long-term memory of Felix's business.
The obvious way to make a bot reliable is to make it deterministic. So the first version was a workflow engine, and the model only smoothed the wording.
It had a scripted onboarding, a scenario engine with a state for every situation Felix might be in, and a content pipeline that scored and generated copy. The language model sat on top as a garnish. Every situation Felix could land in was a branch I had to anticipate and write by hand. Miss one and the bot fell over or said something wooden.
The trouble with that shape is that the world has more situations than you can enumerate. The failure lived in my branching logic, and there was always another edge case waiting. I was spending most of my time covering cases rather than making the thing genuinely useful, and the model, the most capable part of the whole system, was reduced to tidying sentences.
The first architecture carried a five-day scripted onboarding flow, a seventeen-state scenario engine that tried to classify which situation a client was in, and a content engine with a scoring model, seven copy angles and a multi-layer prompt. Roughly fifty Python files and somewhere near nineteen thousand lines, most of it written to anticipate situations rather than reason about them.
It worked, in the narrow sense that the happy paths ran. But it was expensive to extend, fragile at the edges, and the effort of maintaining the branches scaled with Felix's unpredictability, which is to say without limit.
So I turned it inside out. Instead of code that decided what to do with the model smoothing the edges, I made the model the thing that reasons, and gave it tools to fetch data and take actions.
The agent was better straight away: more capable, more natural to talk to, and a far smaller codebase. But brittleness turned out to be conserved. It didn't disappear when I changed architecture. It relocated.
With the reasoning now in the model, the failures moved to the tools. The agent was only ever as good as the data its tools handed it, and messy or partial tool output was where things broke. So I spent the next stretch making the tools solid, returning clean, well-shaped, trustworthy data. And when I'd done that, the brittleness moved one last time, to the model's own reasoning: how it behaved when data was missing, whether it actually followed the instructions in the prompt. That layer is invisible. You can't read it off the code, and you can't read it off the tool outputs. The only way to see it is to test it.
The agent rewrite roughly halved the codebase, from about fifty files to thirty, and deleted whole categories of code: the scripted onboarding, the seventeen-state scenario engine, the scoring-and-angles content pipeline. The data layer, the link tracking and the feedback loop stayed. Reasoning that used to live in branches now lived in the prompt and the context.
It isn't a free trade. A deterministic checker formats the same way every time; an agent varies. When it says something wrong, debugging is no longer reading a code path, it's reading the prompt and the context and working out why the model reasoned the way it did. The mitigation is to keep the tools deterministic and to log everything, so the one part that can hallucinate is fenced by data it can't make up. That is the whole reason the next step, testing the reasoning itself, became unavoidable.
Once the tools were solid and the agent was working, the only way to know if it was any good was to ask it things and look hard at what came back.
I sat down with Felix and we went through the questions he'd actually ask: how's this night selling, who should I email, what should I post. I wrote them down. That list was a golden set, though I didn't know the term yet. As I ran the questions past Claude, I asked whether there was a formal way to test an agent. That's where evals surfaced. I read into it properly, wrote up a methodology for the project, and did it for real: a golden set per data source, run through the bot, scored pass, partial or fail, with a stronger model acting as the judge.
The thing that struck me later is that the golden set was user discovery wearing different clothes. The most ordinary product habit I have, sitting with a user and learning what they'll ask, turned out to be the same muscle the eval needed. I'd built the test before I had the word for it. That's the part worth sitting with: I didn't adopt evals as a framework. I backed into them because the architecture left me no other way to know if the thing worked.
The methodology ran in three phases. First, capability mapping: for each data source, write out the full space of questions a client would ask and classify every one. Answerable today, partial, or not answerable, and for each "not answerable", say why: an API limitation (the data doesn't exist), a tooling gap (it exists but I don't fetch it), or a deliberate choice (I could build it and have decided not to). That last category is pure product management hiding inside an eval: deciding what not to build, and writing down the reason so it doesn't get relitigated.
Then gap prioritisation, then the golden set and the eval sessions themselves: state the question, run it, react before analysing, then a joint verdict with a failure category. Any failure category that showed up three or more times became a development priority.
The scores were honest, including the one I'd rather not show:
The 14% wasn't a bug to bury. It was the eval doing its job: telling me the truth about what the system genuinely couldn't do. Felix's ticketing platform exposes no attribution on orders, so the chain from "which channel got the click" to "which channel sold the ticket" is broken at the source. No prompt fixes that. I'd rather know, and so would Felix.
When the data came back clean, the agent was calm and useful. When it came back partial, ticket counts but no prices, this week's orders but no baseline, it catastrophised.
“That's not a chief of staff. That's a panic attack in text form. Felix would be alarmed reading this, and it's based on one week of data for events that are still weeks away.”
Partial-data replies are verbatim from the eval set. The calm version is representative.
The judge's verdict was the cleanest line I got all session. And I hadn't asked it to grade tone. I'd asked it to grade accuracy. One of the partial-data replies even invented a capability: the click-to-sale conversion isn't moving the needle, a metric the agent can't compute, because the data to compute it doesn't exist. Faced with a gap, the model filled it with a sentence that sounded like a number.
Here's why it happens. Deterministic software fails silently or loudly, and the error is just the error. A model hits uncertainty and fills the gap, and what it fills it with is learned from us. People who are uncertain and under pressure hedge, over-explain, apologise, catastrophise. The model gives that back. It wasn't inventing facts so much as importing the wrong emotional register, reading "incomplete data" as a cue to escalate, in exactly the way an anxious junior employee would. Felix would have read it at 7am on a Tuesday.
Some of what the eval caught was a prompt problem. Each failure became a named rule the agent now carries. Browse them: every one traces back to something the eval surfaced.
By the end, the system prompt had quietly become the specification, and the eval is what wrote it. None of those rules existed in advance. Each one is the scar of a specific failure the test surfaced.
But some failures aren't promptable. The cheap default model is reliably weak at date arithmetic and multi-step planning, and no amount of rewording fixes that. So the deeper fix is architectural: route by intent. The agent runs on the cheap model by default, and when it detects campaign planning, the workflow that leans hardest on reasoning, it upgrades to a stronger model for the rest of that conversation and gives it more room to work. Cheap when it can be, capable when it has to be. You can usually tell which workflow needs which by the tools the agent is reaching for.
The default is the cheap model. The upgrade triggers on intent expressed through tool use: when the agent calls the campaign-planning tools, it switches to the stronger model for the remainder of that conversation, lifts the tool-call budget from three to six, and widens the short-term memory from six turns to fourteen so a multi-step planning session doesn't lose its thread. The conversation stays upgraded until it goes quiet, then later sessions drop back to cheap.
The trigger earned its place. In one exchange Felix said "Thursday the 16th"; the cheap model confidently called that "after the event", then offered "the Thursday before" as the 9th, which is a Wednesday. The stronger model handles that reliably. Rather than pay for it on every message, you pay for it only on the workflow that needs it, and you detect that workflow by the tools in play, not by trying to classify the user's sentence.
No growth chart came out of this. What came out was a way of working that I didn't have before, learned on a friend's business with nothing riding on it but his 7am peace of mind.
The most ordinary product habit I have, sitting with the user and learning what they'll ask, was the eval all along. I'd built the test before I knew the name. Evals aren't a new muscle for a product person. They're an old one, pointed at a system whose behaviour you can't read off the code.
Including the truth you'd rather skip. The 14% on link attribution was the single most useful result of the session, because it named a real limit at the source rather than letting the agent paper over it with a confident sentence.
Going agent-first didn't remove the hard work, it relocated it: first to the tools, then to the model's reasoning. I'd planned for a finish line and got a relay. Worth knowing before you start, so the second and third legs don't feel like failure.
I lost time wording and rewording prompts at things only a bigger model could do. Some failures yield to a better instruction; some need a better model; some need a tool that returns real data. Sorting which is which is most of the craft, and getting it wrong is most of the wasted effort.
When your client is talking to a model they don't know they're talking to, how it behaves under uncertainty is the relationship. The eval was the only thing standing between a panic attack in text form and Felix's phone on a Tuesday morning. Tone became a pass criterion, not just accuracy, and that's a permanent change to how I test anything built on a model.
I went in to build Felix a tool. I came out with a way of working. With deterministic software you can write the spec up front, because you can read the behaviour off the code. With an agent you can't, so you find the spec by testing: you watch what the thing does, you catch what it gets wrong, and you encode the correction. The golden set, the judge, the named rules, the model routing, all of it is one loop for arriving at a specification I couldn't have written in advance.
That's the part I keep. Not the agent itself, useful as it is to Felix, but the loop. The same discipline now travels into professional work, where the stakes are higher and the principle is identical. The architecture forced the question, the evals answered it, and the answer was never going to come from a document written before the system existed.
I taught myself evals on a friend's events business, on my own time, with nothing on the line but whether Felix opened a calm message in the morning instead of an alarming one. That's the same discipline I now deploy professionally, on real enterprise systems.