AI Evaluation
AI evaluation is how you measure whether an AI system actually works — scoring its outputs against what good looks like, systematically and repeatably, instead of eyeballing a few demos. For non-deterministic systems like LLMs and agents, it's the discipline that separates a thing that demos well from one you can ship.
Also known as: evals, AI evals, model evaluation, LLM evaluation
Traditional software is deterministic — same input, same output — so you test it with assertions. An LLM or agent can give a different, plausible-sounding answer every time, which breaks that model. AI evaluation fills the gap: a repeatable way to score outputs against criteria — accuracy, relevance, safety, format, task completion — so you know whether a change made the system better or worse.
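To make the contrast with assertions concrete, here is a minimal sketch in Python. The criteria names, the checks behind them, and the sample answers are all made up for illustration; real criteria would be defined per task. The point is only that differently worded outputs are scored against criteria rather than compared to one exact string.

```python
from typing import Callable

# Hypothetical criteria for the question "What is the capital of France?".
# Each criterion maps one output string to pass/fail.
CRITERIA: dict[str, Callable[[str], bool]] = {
    "accuracy": lambda out: "paris" in out.lower(),
    "format": lambda out: len(out) <= 200 and out.strip().endswith("."),
    "safety": lambda out: "password" not in out.lower(),
}


def score(output: str) -> dict[str, bool]:
    """Return a per-criterion pass/fail result for one output."""
    return {name: check(output) for name, check in CRITERIA.items()}


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass every criterion (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    passed = sum(all(score(out).values()) for out in outputs)
    return passed / len(outputs)


# Three made-up answers the same system might give on different runs.
samples = [
    "The capital of France is Paris.",
    "Paris.",
    "It's Lyon.",
]

# An exact-match assertion such as `out == "Paris"` would reject all three,
# including the two correct ones. Criteria-based scoring accepts both correct
# phrasings and flags only the wrong answer.
rate = pass_rate(samples)
```

Tracking a number like `rate` before and after a change is what lets you say whether the change helped or hurt.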
In practice it spans offline evals on a fixed test set (catch regressions before you ship) and online evals on live traffic (catch what production surfaces). The hard part is defining “good” for open-ended outputs, which is why LLM-as-a-judge and human review both show up, and why observability and evaluation are used together rather than treated as the same thing.
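As a rough sketch of the offline side, the Python below runs two versions of a system over the same fixed test set and flags a regression. Everything here is an assumption for illustration: the test cases, the two stand-in systems, and the substring-based `grade` function, which takes the place of whatever grader you actually use (an LLM judge, a human reviewer, or a task-specific check).

```python
from typing import Callable

# Fixed test set of (input, reference answer) pairs. Made-up examples.
TEST_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]


def grade(output: str, reference: str) -> bool:
    """Stand-in grader: passes if the reference appears in the output."""
    return reference.lower() in output.lower()


def run_offline_eval(system: Callable[[str], str]) -> float:
    """Score a system on the fixed test set; returns the fraction passed."""
    results = [grade(system(question), ref) for question, ref in TEST_SET]
    return sum(results) / len(results)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate scores lower than the baseline beyond the tolerance."""
    return candidate < baseline - tolerance


# Stand-in systems: canned answers instead of real model calls.
def baseline_system(question: str) -> str:
    answers = {
        "2 + 2": "The answer is 4.",
        "capital of France": "Paris",
        "opposite of hot": "Cold.",
    }
    return answers.get(question, "")


def candidate_system(question: str) -> str:
    answers = {
        "2 + 2": "The answer is 4.",
        "capital of France": "I think it is Lyon.",
        "opposite of hot": "Cold.",
    }
    return answers.get(question, "")


baseline_score = run_offline_eval(baseline_system)
candidate_score = run_offline_eval(candidate_system)
blocked = is_regression(baseline_score, candidate_score)
```

Because the test set is fixed, a drop in score points at the change rather than at the inputs. An online eval would apply the same kind of grader to sampled live traffic instead.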
It’s a recurring theme on the show because it’s where most AI projects quietly fail: teams ship on vibes, can’t tell why quality drifts, and have no way to improve systematically. Evaluation is the feedback loop that makes iteration possible — without it you’re guessing.
Go deeper
- How do you evaluate an AI agent? AI, decoded · The 3 Levels of Evaluating an AI Agent
- What's the difference between AI observability, evaluation, and benchmarking? AI, decoded · Observability vs. Evaluation vs. Benchmarking
- How do you test an AI system when the output isn't deterministic? AI, decoded · How to Test an AI System
From the conversation
- Explaining Eval Engineering | Galileo's Vikram Chatterji
- Mindset Over Metrics: How to Approach AI Engineering | Hamel Husain
- Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang
- The 2025 AI Shift: From Chat to Task Completion & Reliable Action | Galileo Founders
- Beyond Chatbots: How Twilio Uses AI to Strengthen Human Connection | Vinnie Giarrusso
- Architecting AI Agents: The Shift from Models to Systems | Aishwarya Srinivasan
- Why Enterprises Need a Different Approach to AI Agents | Lyzr’s Siva Surendira
- AI in 2025: Agents & The Rise of Evaluation-Driven Development