Multimodal AI

Models that see, hear, and read at once.

Multimodal AI describes models that take in and reason across more than one kind of data (text, images, audio, and video) in a single system, aligning those modalities so the model can answer questions about an image, transcribe and act on speech, or ground a response in a chart.

6 episodes

Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI · Apr 29, 2026 · Transcript
Why LLMs Are Plausibility Engines, Not Truth Engines | Dan Klein, Scaled Cognition · Apr 8, 2026 · Transcript
Gemini 3 & Robot Dogs: Inside Google DeepMind's AI Experiments | Paige Bailey, Google DeepMind · Jan 14, 2026 · Transcript
Breaking the Language Barrier: Smartling's AI Translation Pipeline | Olga Beregovaya, Smartling · Apr 23, 2025 · Transcript
The Making of Gemini 2.0: DeepMind's Approach to AI Development and Deployment | Logan Kilpatrick, Google DeepMind · Feb 12, 2025 · Transcript
Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang · Dec 4, 2024 · Transcript

Explainer on this topic

What Is Multimodal AI · Explainer · Jun 16, 2026

Term on this topic

Multimodal AI · Glossary

Guests on this topic

Alex Ratner · Dan Klein · Paige Bailey · Olga Beregovaya · Logan Kilpatrick · Chip Huyen · Vivienne Zhang