Best Chain of Thought Episodes on AI Evaluation
Measuring whether AI actually works — evals, reliability, trust.
- 1 Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI: Alex Ratner on the evaluation gap every team hits the moment an agent leaves the demo, and how to close it.
- 2 Explaining Eval Engineering | Galileo's Vikram Chatterji: Vikram Chatterji on treating evaluation as an engineering discipline, not a one-time check.
- 3 The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn: Philipp Krenn on bridging the gap from risky to reliable, and why trust is the real bar for shipping agents.
- 4 AI in 2025: Agents & The Rise of Evaluation-Driven Development | Vikram Chatterji & Andrew Zigler: The case for evaluation-driven development, which builds the eval loop into how you ship instead of bolting it on after.
- 5 Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang: Chip Huyen and Vivienne Zhang on practical GenAI evaluation, the most-cited starting point for evals on the show.
- Best Chain of Thought Episodes for AI Founders: Building an AI company, covering defensibility, GTM, and the market reality.
- Best Chain of Thought Episodes for Engineers: Shipping with AI, covering the craft, the workflow, and the reality checks.
- Best Chain of Thought Episodes on AI Agents: How agents actually work, covering architecture, memory, context, and frameworks.