If you're building AI-powered tools for high-stakes domains, rigorous evaluation is central to building something you can actually trust. This isn't new: evaluation and testing have always been important, but AI changes how you need to do them. Anthropic made the same point in their recent piece on demystifying evals for AI agents, noting that teams without evals catch failures only in production and get stuck in reactive loops.
One new failure mode with AI-powered tools is silent failure: the output looks correct to a casual observer but is wrong, e.g., missing a topic during study review or giving the wrong medical advice. What I'd add, from my experience building and deploying AI tools in global health contexts, is that these invisible failures don't have uniform impact. Some users can't detect the failure at all (e.g., a student receiving targeted remediation, or a community health worker following medical protocols). In those cases, rigorous evaluation becomes more critical than ever.
What I found when I actually measured my study assistant prototype: the most educationally valuable phase of a session worked correctly 72–77% of the time depending on the model. A single prompt change closed most of that gap for one model (+15 percentage points) and exposed the next failure entirely. None of this was visible in a demo.
Beyond catching failures, a strong evaluation framework tells you exactly where to focus, validates whether your fixes actually work, and (as I discovered) can transform your AI coding assistant from an autocomplete tool into a genuine research collaborator. And it doesn't stop at launch. Models change, concept drift occurs, and the gap between "it worked last month" and "it works now" is only visible through continuous monitoring.
Building the Demo Was the Easy Part
I built a voice-enabled study bot for my son. He's eleven, drowning in weekly vocabulary lists, and I figured: here's a useful, contained problem. I know AI; I've spent my career applying it in global health. How hard could it be?
I called my vocab bot Orðaflóð, which is Icelandic for "word flood." It felt apt. I often feel like I'm being flooded when trying to help him learn his vocabulary words, and somehow, naming it in Icelandic made the whole thing feel like an epic quest rather than homework.
The bot worked. I demoed it a couple of times and it was genuinely fun to use. It had a casual tone, gave good hints, asked my son to define words, and circled back at the end to review the ones he'd struggled with. I thought: great, it's done.
Then I ran it five more times, systematically. And here's what I found: in three of those five runs, the native-voice model version quietly stopped following its own protocol. No error message. No crash. It just... drifted. It forgot to offer to review the missed words at the end. Or it skipped offering to review the list before going through it. The session felt coherent, but the bot wasn't doing what it was supposed to do.
The traditional pipeline (speech recognition model → LLM → text-to-speech model) fared better on staying on-task than the native-voice model, but in two of five runs it forgot to cover one or two words from the list.
That was my "uh-oh" moment. The bot wasn't broken. It just wasn't reliable. And in any context where the user can't detect the failure — a kid taking a vocabulary quiz, a patient receiving health guidance — "not reliable" is the same as "not usable."
Why the Stakes Are Higher Than They Look
When an adult uses a flaky AI tool, they might adapt. They notice when something seems off and compensate. When a kid uses a flaky educational tool, they don't adapt — they just absorb the inconsistency. If the bot skips reviewing a word the student missed, the student doesn't know they missed it. They walk into the test confident about a word they never actually learned.
The same dynamic plays out in digital health. A community health worker relying on an AI-assisted diagnostic tool may not know when the tool has drifted from its protocol. A patient receiving automated health guidance doesn't know what was skipped. In both contexts, the gap between "the tool seemed fine" and "the tool actually worked" isn't a minor usability issue — it's a safety and quality-of-care issue.
That asymmetry between appearance and actual performance is a central challenge of deploying AI in high-stakes domains. And it's why I believe the most important thing you can do when building these tools isn't the AI part. It's the evaluation part.
I built two versions: a traditional pipeline chaining separate models for speech recognition (open-source OpenAI Whisper model), language generation (Claude 4.5 and GPT-4o), and text-to-speech, and a voice-to-voice system using OpenAI's Realtime API. Before I could answer which was better, I needed to answer a more basic question: did either of them actually work?
Building the Evaluation Framework
Running the bot manually five times told me there was a problem. It didn't tell me how big the problem was, which architecture was worse, or whether my attempts to fix it were actually helping. For that, I needed systematic evaluation.
The challenge with evaluating a conversational AI is that conversations don't repeat. Every run is different. If I wanted to run 50 conversations per model, I couldn't do that manually — I needed to automate both sides of the conversation.
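To make that concrete, here's a minimal sketch of automating both sides of a conversation. The `bot_turn` and `student_turn` callables and the `[SESSION_END]` marker are illustrative stand-ins, not names from the actual codebase; in the real framework each side is an LLM call with the corresponding system prompt.

```python
# Sketch of a two-sided conversation driver (assumed interface:
# each side is a callable that sees the transcript so far and
# returns its next message; "[SESSION_END]" is a hypothetical marker).
def run_session(bot_turn, student_turn, max_turns=40):
    """Alternate bot and student turns until the bot ends the session."""
    transcript = []
    bot_msg = bot_turn(transcript)  # bot opens the session
    while len(transcript) < max_turns:
        transcript.append(("bot", bot_msg))
        if "[SESSION_END]" in bot_msg:  # bot signals it is done
            break
        student_msg = student_turn(transcript)
        transcript.append(("student", student_msg))
        bot_msg = bot_turn(transcript)
    return transcript
```

With both sides automated, generating 50 conversations per model is just a loop over vocabulary lists and random seeds.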
One evaluation run: the vocab bot and simulated student hold a complete session; the transcript is scored by a judge LLM.
The Simulated Student
A core component of my evaluation framework is a simulated student: an LLM (Google Gemini 2.5 Flash) that plays the role of a teenager studying vocabulary. I gave it a specific knowledge profile — it knows 70% of the words on the list and genuinely doesn't know the other 30% — and detailed behavioral instructions.
The simulated student talks like a real teenager. It uses filler words. When it knows a word, it gives a rambling, imprecise explanation — over-explains, goes slightly off-topic, eventually lands somewhere close to the meaning. When it doesn't know a word, it stalls, guesses wildly, or asks for a hint. It occasionally asks the bot a question back instead of just answering. It reacts naturally to feedback.
The verbosity and filler words are deliberate — typed interactions tend to be more concise, but this project is ultimately about a voice-enabled tool, and speech is messier. A simulated student that answers in clean, complete sentences would produce an unrealistically easy test.
With this simulated student, I generated 10 conversations for each of the 5 vocabulary lists (50 conversations total) per model, and compared the results across models.
I built this on top of Inspect AI, an open-source framework for LLM evaluations from the UK AI Safety Institute. It handles the scaffolding for running experiments, logging results, and computing metrics — so I could focus on what to measure, not how to run the experiments. The full evaluation framework and bot implementations are available in the Orðaflóð GitHub repository.
What I Measured
I identified three metrics based on where the demo seemed to struggle:
1. Completion Rate — Did the bot quiz every word?
Simple question: was each vocabulary word actually mentioned at some point in the conversation?
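This metric is mechanical enough to compute without a judge: scan the transcript for each word. A minimal sketch (illustrative, not the repository's actual implementation):

```python
import re

def coverage(transcript_text, vocab_words):
    """Fraction of vocabulary words mentioned anywhere in the session."""
    text = transcript_text.lower()
    # Word-boundary match so "mere" doesn't count as "meres"
    mentioned = [w for w in vocab_words
                 if re.search(r"\b" + re.escape(w.lower()) + r"\b", text)]
    return len(mentioned) / len(vocab_words)
```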
2. Review Accuracy — Did the bot review the right words at the end?
After the quiz, the bot is supposed to review the words the student struggled with. This is arguably the most educationally valuable part of the session: targeted remediation of specific gaps. To do it well, the bot has to both correctly identify which words the student struggled with and "remember" them at review time. To score this, I first use Claude Sonnet 4.6 as a judge to classify each quiz exchange as "struggled" or "adequately answered" (as inferred from the vocab bot's response), establishing a ground truth of which words the student missed. I then measure recall: of the words the student actually missed, how many does the bot bring back for review?
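Once the judge's ground-truth labels are in hand, recall (and the precision reported in the results below) reduces to set arithmetic. A sketch, with hypothetical function and argument names:

```python
def review_scores(truly_missed, reviewed):
    """Recall: what share of missed words made it into the review.
    Precision: what share of reviewed words were actually missed."""
    truly_missed, reviewed = set(truly_missed), set(reviewed)
    hits = truly_missed & reviewed
    recall = len(hits) / len(truly_missed) if truly_missed else 1.0
    precision = len(hits) / len(reviewed) if reviewed else 1.0
    return recall, precision
```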
To validate the judge's labels, I ran a blind inter-rater reliability check: I manually annotated 100 quiz excerpts (balanced 50/50 by classification, equally distributed across the three models) without seeing the judge's labels. The judge and I agreed 96% of the time (Cohen's κ = 0.92 — considered excellent). If I maintain this bot, I'll need to periodically repeat this and possibly tweak things. Model-based graders do unfortunately require periodic calibration with human graders to maintain accuracy.
3. Hint Quality — Were hints helpful or giveaways?
The bot's protocol allows up to two hints per word during the quiz phase. A good hint might explain the meaning of a root word, or use the vocabulary word in an example sentence. It lets the student infer the meaning rather than defining it outright. An example from the transcripts looked like this:
"After the hike, I was absolutely ravenous and ate a huge meal — what do you think it means now?"
A bad hint is too direct and doesn't encourage the student to engage deeply. During testing, I frequently observed the voice-to-voice model announce it was giving a hint, then read the Merriam-Webster definition it was provided almost verbatim. For example, the Merriam-Webster definition for "meres" was "an expanse of standing water : lake, pool" and the corresponding hint in the transcripts was:
"meres are calm, standing water like a lake or pool"
That's the failure mode this metric is designed to catch. I use an LLM-as-a-judge to classify every hint as appropriately indirect or too revealing.
For the LLM-as-a-judge that scores hint quality, I deliberately use a different provider than the bots being evaluated. This prevents the judge from systematically favoring models from its own model family.
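In practice, an LLM-as-a-judge needs two pieces: a prompt and a deterministic way to turn the judge's reply into a score. Both below are hypothetical illustrations; the actual prompt and output format live in the repository.

```python
# Hypothetical judge prompt and verdict parser (not the repository's exact text).
HINT_JUDGE_PROMPT = (
    "You are grading a vocabulary tutor's hint.\n"
    "Word: {word}\nDictionary definition: {definition}\nHint: {hint}\n"
    "Does the hint let the student infer the meaning (INDIRECT), or does it "
    "essentially restate the definition (GIVEAWAY)? Answer with one word."
)

def parse_verdict(reply):
    """Map the judge's free-text reply to a binary hint-quality score."""
    verdict = reply.strip().upper()
    if "INDIRECT" in verdict:
        return 1.0
    if "GIVEAWAY" in verdict:
        return 0.0
    raise ValueError(f"unparseable judge reply: {reply!r}")
```

Failing loudly on unparseable replies, rather than defaulting to a score, keeps judge drift visible in the logs.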
What I Found
I ran 50 conversations per model (10 conversations each for 5 different vocabulary lists), comparing three configurations:
- Claude 4.5 in the traditional pipeline,
- GPT-4o in the traditional pipeline, and
- OpenAI's Realtime API for voice-to-voice.
I ran all evaluations in text mode: I wanted to isolate the core instruction-following behavior first, then separately debug whether the voice layer made things better or worse.
Metric #1: Coverage — Every Word Gets Mentioned
All three models mentioned every vocabulary word at some point in the conversation — 100% coverage. That's the good news, and honestly more than I expected going in. Given the inconsistency from the initial testing, future work will involve broader evaluation of end-to-end voice-enabled system performance to better understand what happened in those early demos.
Metric #2: Review Accuracy — The Most Valuable Phase Is the Weakest
| Model | Review Recall (mean) | Review Precision (mean) |
|---|---|---|
| Claude 4.5 | 72% | 97% |
| GPT-4o | 77% | 95% |
| Voice-to-Voice Realtime | 76% | 90% |
This is where all models showed their biggest gaps. In the review phase, where the bot goes back over the words the student missed, the models correctly recalled the right words about 72–77% of the time. Precision, by contrast, was very high across the board (≥90%): when a model reviewed a word, it was almost always one the bot had perceived the student as struggling with.
More concretely, if a student missed 4 words during the quiz, the bot would successfully target about 3 of them in the review.
One pattern that emerged from digging deeper into the data: review recall degrades as conversations get longer. I measured the cumulative tokens processed during the quiz phase and found a negative correlation with review accuracy. Longer, chattier quiz sessions produced worse recall at the review phase. The relationship was statistically significant for GPT-4o (r = −0.31, p = 0.03) and directionally consistent across all three models. The hypothesis: these models aren't storing a running log of which words were struggled with; they're reconstructing that information from the full conversation at review time. The more there is to reconstruct from, the worse they do. That was a surprising observation: the quiz covers only 10 words, which intuitively didn't seem long enough to fall prey to context rot.
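The correlation itself is plain Pearson's r between per-conversation token counts and review recall. A self-contained sketch of the computation (the p-values came from standard significance testing, not shown here):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, e.g. quiz-phase token counts vs. review recall."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```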
What Happens When You Give the Bot a Notepad?
That finding pointed toward a specific fix: instead of asking the model to remember implicitly, give it an explicit mechanism to record information as it goes. I added a tracker line to the system prompt — instructing the bot to append [TRACK: struggled=[word1, word2] ok=[word3, ...]] to each response, updated after every word. This is essentially giving the bot a notebook on the side.
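A side benefit on the evaluation side: the tracker line makes the bot's internal state machine-inspectable, since a parser can pull the struggled/ok lists straight out of each response. A sketch matching the format above (the regex is my illustration, not the repository's code):

```python
import re

# Matches the assumed tracker format: [TRACK: struggled=[w1, w2] ok=[w3]]
TRACK_RE = re.compile(r"\[TRACK:\s*struggled=\[([^\]]*)\]\s*ok=\[([^\]]*)\]\]")

def parse_track(message):
    """Pull the struggled/ok word lists out of a [TRACK: ...] line.

    Returns (struggled, ok) lists, or None if no tracker is present."""
    m = TRACK_RE.search(message)
    if not m:
        return None

    def split_words(s):
        return [w.strip() for w in s.split(",") if w.strip()]

    return split_words(m.group(1)), split_words(m.group(2))
```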
I ran this variant across 50 conversations each for Claude 4.5 and GPT-4o, covering five vocabulary lists.
| Model | Baseline recall | With notepad tracker | Δ |
|---|---|---|---|
| Claude 4.5 | 72% | 87% | +15pp |
| GPT-4o | 77% | 80% | +3pp |
Claude 4.5 improved by 15 percentage points from a single prompt change; hypothesis testing showed the improvement was statistically significant. GPT-4o was essentially unchanged, and its +3pp shift was not significant.
This is what eval-driven development looks like in practice. The first experiment revealed a gap. The second experiment closed most of it and, as we will explore in the next section, exposed the next opportunity. You don't find these things in demos.
What Happens When the Agent Takes Over the Experiment?
Giving the bot a notepad improved recall, so I started another iteration of the loop: measure, identify failures, hypothesize, fix, re-measure. Weirdly, Claude Code beat me to most of that process.
When I asked where the low-scoring runs were concentrated, it showed me the transcripts. Then, without prompting, it did something I didn't ask for: it reviewed the failure cases, identified a pattern, and stated a hypothesis. The pattern was that the bot was logging words as "adequately answered" after giving a hint, because the student eventually answered correctly; the correct assessment was "struggled," because the student needed a hint first. The scratchpad was faithfully recording the vocab bot's judgment, but that judgment was wrong: too lenient in assessing the student's answers. Claude Code proposed a prompt change to fix the calibration, then asked if I wanted to re-run the experiment and analysis. I panicked and said no: I hadn't saved my current results yet and wasn't ready to overwrite them!
I revisited the proposed change and ran the experiment myself later. I did wind up changing the bot's prompt to apply a calibration fix for what "struggled" and "adequately answered" meant in a transcript. Here's the full arc across all three conditions:
| Model | Condition | Recall | Precision |
|---|---|---|---|
| Claude 4.5 | Baseline | 72% | 97% |
| Claude 4.5 | + Notepad (v1) | 87% | 51% |
| Claude 4.5 | + Notepad and Calibration fix (v2) | 100% | 44% |
| GPT-4o | Baseline | 77% | 95% |
| GPT-4o | + Notepad (v1) | 80% | 71% |
| GPT-4o | + Notepad and Calibration fix (v2) | 95% | 68% |
The recall numbers are remarkable. Claude 4.5 has essentially perfect recall and GPT-4o has 95% recall with both improvements being statistically significant. But the precision tells the other half of the story. The calibration fix overcorrected: the bot now treats almost any hesitation as struggling, which means it reviews words the student knew perfectly well. A student who breezes through seven words and genuinely struggles with three now gets to review six words at the end.
This isn't necessarily a failure. The optimal precision-recall operating point depends on the application, and surfacing tradeoffs like this is exactly what a rigorous eval framework is for. You can't see a precision-recall tradeoff in a demo, you can't find the optimum without being able to measure it, and you almost certainly can't find the next engineering target without something producing structured, inspectable evidence to reason about.
Which brings me to the broader point. What happened in this iteration — Claude Code identifying the calibration pattern unprompted, stating a hypothesis, designing the next experiment — wasn't autocomplete. It was an agent participating in the scientific method. But it only happened because the eval framework gave it something concrete to reason about: specific failures, scored and logged, with transcripts attached. A rigorous eval infrastructure transformed my AI coding assistant into an AI research collaborator.
Metric #3 — Hint Quality: High Variance, Low Insight (For Now)
The final metric, hint quality, varied enormously across all models — from 0.0 (every hint was a giveaway) to 1.0 (every hint was appropriately indirect) within the same model and round. The variance itself is the finding: hint quality isn't stable. Some runs produce excellent, pedagogically sound hints. Others essentially hand the student the answer.
| Model | Mean hint quality | Std dev | Sessions scoring 0.0 | Sessions scoring 1.0 |
|---|---|---|---|---|
| Claude 4.5 | 0.60 | 0.34 | 14% | 28% |
| GPT-4o | 0.46 | 0.41 | 38% | 28% |
| Voice-to-Voice Realtime | 0.35 | 0.34 | 34% | 14% |
The pattern is strikingly bimodal, especially for GPT-4o: nearly 40% of sessions produced uniformly bad hints, while another 28% produced uniformly good ones. A model that oscillates between perfect and useless within the same prompt seems harder to fix than one that's consistently mediocre — inconsistency of this kind suggests the model isn't reliably applying a stable internal strategy for hint generation.
This is the metric that needs the most engineering attention. The likely fix involves more explicit prompt guidance (i.e. distinguishing "use the word in a sentence" from "explain the word's meaning"), but validating that fix requires the same eval infrastructure that surfaced the problem in the first place.
What This Means for AI in High-Stakes Domains
Here's what I took away:
As AI shifts from assisting humans to acting on their behalf, the failure mode that matters most is the one that's invisible. Here, the bot didn't crash or return an error. It held a perfectly coherent conversation while quietly skipping the part that mattered. That's the specific danger of deploying AI in contexts where the user can't detect the failure: the danger a community health worker faces when a diagnostic tool drifts from its protocol, or a patient faces when automated health guidance skips a step. Generative AI is stochastic, and every run is different. The gap between "the session felt fine" and "the application works" is only visible if you're an expert (which most people aren't) or if you're measuring (which is what we all need to be doing). Without that, you're not deploying a tool. You're deploying a demo and hoping it works consistently.
Evaluation is a design tool, not just a quality-assurance step. The notepad experiment didn't come from intuition; it came from measuring a specific failure and following the data. And when the eval produced structured, inspectable evidence, something unexpected happened: Claude Code joined the scientific method as a genuinely effective research collaborator. It reviewed the failure cases, identified a calibration pattern, and proposed the next experiment. That's only possible when an agent has something concrete to reason about, and evals are what provide it. In future work, I look forward to exploring how the agent handles optimizing for several metrics at once, starting with the precision-recall tradeoff.
Evals don't stop at deployment. There's a temptation to treat evaluation as a pre-launch activity — something you do to gain confidence before shipping, then set aside. That's a mistake, and it's especially costly with LLM-powered tools. LLM applications aren't static. Models change, sometimes visibly and sometimes not. In a different project, I found that LLM performance in some low-resource languages degraded considerably in a model update from Claude 3.5 Sonnet to Claude 3.7 Sonnet. Sometimes, providers even update models in place with no announcement. And as you tune prompts to improve one metric, you can inadvertently degrade another. Concept drift means systems that used to work great can become less effective over time. The same eval infrastructure you built during development is exactly what catches these regressions before they reach the actual user — but only if you keep running it. For an organization deploying AI tools in health or education at scale, that means treating eval infrastructure as a first-class operational investment, not an afterthought.
If this is the kind of problem you're working on, I'd love to hear more.