Episodes · S3 E46

Explaining Eval Engineering | Galileo's Vikram Chatterji

Dec 19, 2025 · Vikram Chatterji, Galileo · 37 min

AI Agents · AI Evaluation & Reliability · AI Observability · AI Engineering

Key takeaways

Generic LLM-as-a-judge metrics plateau around 70-75% accuracy — unshippable for customer-facing AI. A GPT-5 judge is roughly 70% accurate, ~$1.25 per million tokens, and 2-3 seconds of latency. At 70%, three in ten interactions fail and you often can’t tell, so it “doesn’t cut it” for anything important.
Galileo’s path off the plateau is three steps: auto-author the judge prompt from a subject-matter expert’s plain language (“Auto-Gen,” ~75%), close the “last-mile measurement” gap with continuous learning through human feedback (“Auto-Tune”/CLHF — a handful of feedback points gets it to ~95%), then distill it into a fine-tuned small language model (the Luna evaluation models) for low latency and cost at scale.
Evals are already most of the job. When Vikram’s team asks serious AI builders — AI SRE orgs, billing agents for sales and banks — what their day looks like, roughly 70% of the time is just evals, mostly tweaking judge prompts. Because of that, he predicts “eval engineer” becomes a standard role, the way “prompt engineer” once was.
The customer math is staggering. One large telco went from 1 agent to 47 in eight months at 8,000+ queries per second with evals on 100% of traffic — and by moving off a GPT-4.1 judge to Galileo’s Luna stack, their annualized eval cost dropped from ~$26M to ~$350K. People budget for inference but forget you’re also running inference on every eval.
Today’s evals become tomorrow’s guardrails — but not all of them. Once an eval is accurate enough, it can run as a runtime control (Galileo’s “Protect” enforces SLM-powered evals within ~200ms). Stack-rank your evals: a high-priority one like banking-policy adherence becomes an enforced guardrail, but Vikram estimates only about a quarter (“X divided by four”) should be promoted to controls.
“Which eval should I even create?” is the real step zero. Vikram calls out a game of whack-a-mole with non-deterministic systems and warns against “frivolous evals like correctness and toxicity” that aren’t even the tip of the iceberg. Galileo’s “Auto-Insights” feeds logs and eval prompts into an eval-specific reasoning engine to surface the 10-20 places things are actually going wrong — explicitly not a gimmicky “chat with your logs” feature.
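
The cost point above is easy to check with back-of-the-envelope arithmetic. The sketch below is a rough illustration, not the telco's actual accounting: the 8,000 queries per second, the $1.25 per million tokens, and the $26M and $350K figures come from the episode, while one eval per query and 100 judge tokens per eval are assumptions picked only to make the numbers concrete.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_eval_cost(queries_per_second, evals_per_query,
                     tokens_per_eval, usd_per_million_tokens):
    """Annualized judge cost in USD when evals run on 100% of traffic."""
    evals_per_year = queries_per_second * evals_per_query * SECONDS_PER_YEAR
    tokens_per_year = evals_per_year * tokens_per_eval
    return tokens_per_year / 1_000_000 * usd_per_million_tokens


# From the episode: 8,000 queries per second, $1.25 per million tokens.
# Assumed for illustration: 1 eval per query, 100 judge tokens per eval.
cost = annual_eval_cost(8_000, 1, 100, 1.25)
print(f"${cost:,.0f} per year")  # $31,536,000 per year

# Reduction reported for the telco: roughly $26M down to roughly $350K.
reduction = 1 - 350_000 / 26_000_000
print(f"{reduction:.1%} lower")  # 98.7% lower
```

Even with these modest assumptions, a single LLM-judged metric on all traffic lands in the tens of millions of dollars per year, which is why teams sample traffic or move to smaller evaluation models.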

Frequently asked questions

What is eval engineering?: Per Galileo CEO Vikram Chatterji, eval engineering is the discipline of treating evaluations as scalable infrastructure rather than one-time checkbox tests. You don’t just author an eval — you author it, test it, measure it, then convert it from a static score into something that’s simultaneously high-accuracy, low-latency, and low-cost so it can run across all your production traffic. Chatterji predicts “eval engineer” will become as common a role as “prompt engineer” once was.
Why can’t generic LLM-as-a-judge evals get past ~70% accuracy?: Because a generic judge only knows the public internet it was trained on — it’s missing your “last-mile context.” Vikram Chatterji’s example: the model doesn’t know “CC” means “credit card” in your parlance, or your internal banking policies. A GPT-5 judge lands around 70% accuracy, costs ~$1.25 per million tokens, and adds 2-3 seconds of latency. Getting past the plateau requires injecting that organization-specific context, which Galileo does via human feedback rather than full fine-tuning.
How does Galileo make evals cheaper and faster to run at scale?: Galileo distills an accurate LLM judge into a small language model purpose-built for evaluation — its Luna evaluation models — which only need to emit a single token or a few tokens rather than generate freely. Chatterji notes they reworked the decoder steps so the model is eval-focused, not generative, then auto-fine-tune it on ground-truth data created through the human-feedback loop. The result: a large telco’s annualized eval cost fell from roughly $26M on a GPT-4.1 judge to about $350K.
Should every eval become a runtime guardrail?: No. Vikram Chatterji’s rule is “today’s evals are tomorrow’s guardrails, but not every single eval needs to be a guardrail.” You stack-rank your evals by priority — something like banking-policy adherence is high enough to be enforced as a runtime control, while many others are just for monitoring. He estimates if you build X evals, maybe X-divided-by-four become guardrails. Some controls don’t even need an eval score; they can be purely conditional, but they still need a runtime engine (Galileo’s Protect) operating within ~200ms latency.
What does the emerging “eval engineer” role actually look like?: Vikram Chatterji compares it to the go-to-market engineer who sits in the middle of sales and marketing — the eval engineer needs the same straddle: real product understanding of the business use case plus real engineering skills. They understand systems, code hygiene, and actually submit PRs. He frames it as a software engineer who is more product-minded, and notes it’s already happening, since AI engineers spend most of their time on prompts and evals rather than boilerplate code.
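
The stack-ranking rule of thumb from the guardrail answer above can be sketched in a few lines. This is an illustration only: the eval names and priority scores are made up, and the one-quarter fraction is Chatterji's rough estimate rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Eval:
    name: str
    priority: int  # hypothetical scale: higher means more business-critical


def pick_guardrails(evals, fraction=0.25):
    """Stack-rank evals and promote the top fraction to runtime guardrails."""
    ranked = sorted(evals, key=lambda e: e.priority, reverse=True)
    count = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:count], ranked[count:]


# Hypothetical eval inventory for a banking agent.
evals = [
    Eval("banking_policy_adherence", 10),
    Eval("tool_error", 8),
    Eval("context_adherence", 7),
    Eval("handoff_quality", 6),
    Eval("correctness", 5),
    Eval("tone", 4),
    Eval("toxicity", 3),
    Eval("verbosity", 1),
]

guardrails, monitoring_only = pick_guardrails(evals)
print([e.name for e in guardrails])  # ['banking_policy_adherence', 'tool_error']
print(len(monitoring_only))          # 6
```

With eight evals, two are enforced at runtime and the other six stay as monitoring metrics.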

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Evaluation · Accuracy · Latency · LLM as a Judge · Context Engineering · Inference · Model Context Protocol (MCP) · Prompt Engineering · AI Agent · Precision and Recall

Show notes

You've heard of evaluations—but eval engineering is the difference between AI that ships and AI that's stuck in prototype.

Most teams still treat evals like unit tests: write them once, check a box, move on. But when you're deploying agents that make real decisions, touch real customers, and cost real money, those one-time tests don't cut it. The companies actually shipping production AI at scale have figured out something different—they've turned evaluations into infrastructure, into IP, into the layer where domain expertise becomes executable governance.

Vikram Chatterji, CEO and Co-founder of Galileo, returns to Chain of Thought to break down eval engineering: what it is, why it's becoming a dedicated discipline, and what it takes to actually make it work. Vikram shares why generic evals are plateauing, how continuous learning loops drive accuracy, and why he predicts "eval engineer" will become as common a role as "prompt engineer" once was.

In this conversation, Conor and Vikram explore:

Why treating evals as infrastructure—not checkboxes—separates production AI from prototypes
The plateau problem: why generic LLM-as-a-judge metrics can't break 90% accuracy
How continuous human feedback loops improve eval precision over time
The emerging "eval engineer" role and what the job actually looks like
Why 60-70% of AI engineers' time is already spent on evals
What multi-agent systems mean for the future of evaluation
Vikram's framework for baking trust AND control into agentic applications

Plus: Conor shares news about his move to Modular and what it means for Chain of Thought going forward.

Chapters:
00:00 – Introduction: Why Evals Are Becoming IP
01:37 – What Is Eval Engineering?
04:24 – The Eval Engineering Course for Developers
05:24 – Generic Evals Are Plateauing
08:21 – Continuous Learning and Human Feedback
11:01 – Human Feedback Loops and Eval Calibration
13:37 – The Emerging Eval Engineer Role
16:15 – What Production AI Teams Actually Spend Time On
18:52 – Customer Impact and Lessons Learned
24:28 – Multi-Agent Systems and the Future of Evals
30:27 – MCP, A2A Protocols, and Agent Authentication
33:23 – The Eval Engineer Role: Product-Minded + Technical
34:53 – Final Thoughts: Trust, Control, and What's Next

Connect with Chain of Thought host Conor Bronsdon:
Substack – https://conorbronsdon.substack.com/
LinkedIn – https://www.linkedin.com/in/conorbronsdon/
X (Twitter) – https://x.com/ConorBronsdon

Learn more about Eval Engineering: https://galileo.ai/evalengineering

Connect with Vikram Chatterji:
LinkedIn – https://www.linkedin.com/in/vikram-chatterji/

Transcript

Conor Bronsdon 0:00 You've heard of evaluations. But have you heard of eval engineering? Eval engineering is the difference between AI that ships and AI that's stuck in prototype. Most teams are still treating evals like they did unit tests. Write them once, check a box, move on, maybe even just ignore them if they don't pass. But when you're deploying agents that make real decisions, that touch real customers, that cost money,

Conor Bronsdon 0:25 these one-time tests don't cut it. The companies actually shipping production AI at scale, they've figured out something different. They've turned evaluations into infrastructure, into IP, into the layer where domain expertise becomes executable governance. Welcome back to Chain of Thought. I'm your host Conor Bronsdon. Before we dive in, I do want to note some news that I have to share, which is that I have recently

Conor Bronsdon 0:48 joined Modular as head of technical ecosystem where I am working on next generation AI infra and the Mojo programming language. I've got a huge open source community. It's an incredible opportunity and I'm super excited about what we're building there. But that also means that I'm no longer at Galileo. And Chain of Thought has always been about these conversations that matter, whether it's the technical depth that brings

Conor Bronsdon 1:08 builders to the next level for them, that helps them ship production AI, diving into evaluations like we're doing today. And I'm really delighted that Galileo has graciously agreed to sponsor season three of this podcast, which means we can keep bringing you these conversations. And I'm even more excited to have Galileo's CEO, Vikram Chatterji, joining us once again today to talk about the crucial discipline of eval engineering,

Conor Bronsdon 1:30 how it's becoming a thing, why it's becoming so popular, and what it actually takes to make it work. Vikram, thank you so much for joining me. It's great to see you.

Vikram Chatterji: Likewise, Conor. Thank you for having me again on this pod. Super excited to be here. Always love chatting with you.

Conor Bronsdon: You have been a fantastic guest several times here. For folks who are listening and are maybe new, definitely recommend checking out some of Vikram's past episodes. He's had quite a few good ones. Vikram, you and I have been talking about evaluating agents this year. It's obviously been

Conor Bronsdon 1:57 a deep topic of discussion. I think most folks really are still thinking sometimes about evaluating LLMs at a base level, which is something that's been happening for a while now. But the field has rapidly evolved. And with it has come this idea of eval engineering, evaluation engineering. Can you start by just giving us a definition of what eval engineering is? And then maybe we can get into a bit of why it matters.

Vikram Chatterji 2:11 Essentially, like you said, if I take a step back around why evals are becoming IP, over the last year, year and a half, especially, we've been noticing a lot of teams move away from just thinking about building out these AI applications as just, hey, let's try to get a response from this chatbot, or let's get this agent to actually perform a task, towards

Vikram Chatterji 2:37 thinking more about how do I actually take this thing to production? How do I make this super useful for my business unit? And as a part of that, evals have been coming up over and over again. That's why in the last year, it's become extremely crucial for teams to build high quality, high accuracy evals. Galileo has been pioneering eval engineering, as we call it now, for the last several years at this point. But what we've been noticing in the industry is folks have been really confused about what it means to create high quality evals. How do you actually create evals? What kind of evals to create? Those are questions which plague everyone. So it's a weird dichotomy that we've been seeing in the industry where on the one hand folks realize that if you're a mature organization and you have a mature AI team and you really wanna be able to create high quality AI applications and agents,

Vikram Chatterji 3:24 you know that you need evals. But at the same time, they also need to be educated around what those evals should be and how you can actually build them out. And the way we think about this at Galileo is, an eval is something that you don't just author. If it's an LLM powered metric, you can't just create an LLM as a judge and call it a day. You have to be able to, you know, author it, test it, measure it, but also make sure that it can go from not just a static

Vikram Chatterji 3:51 metric or a score that you get at the end of the day, but take that and make sure that it's high accuracy, but also low latency and low cost. And there are many, many different reasons for that, which we can get into. But this act of being able to take an eval and then convert that into something that is scalable is extremely crucial. And that's the infrastructure that Galileo has been building for the last two years, and we call that eval engineering. It's something that we built with that very developer first mindset.

Vikram Chatterji 4:18 It's something that any developer can take and create and scale out with, in a very self serve way.

Conor Bronsdon 4:28 And speaking of that self serve developer approach, I know Jim Bennett at Galileo is actually running a course right now or that's starting very shortly. That's gonna be running through December and January of 2026. I'm talking about eval engineering for AI developers and enabling folks. Is this something that you're gonna start running regularly to help educate the community?

Vikram Chatterji 4:44 That's right. And you know this better than anybody else, Conor, since you were here, that we're very big on talking about the stuff that we've been hearing from developers everywhere and our customers and packaging that into content which everybody can use. We think education should be democratized. And when it comes to the forefront of evals, we've been trying to do that as much as we can. So this whole idea of eval engineering, how do you actually go about that in a very systematic step by step fashion, is something that we're

Vikram Chatterji 5:11 packaging in the form of a course based on popular demand. I'm super excited about that so that developers can actually learn what best practices are here, but also give us feedback and teach us about how they've been performing eval engineering.

Conor Bronsdon 5:24 And part of the rationale for this, I know, is that generic evals are really plateauing. They're not getting to that 90 plus percent accuracy you need to really ensure you're getting what you need in production. What have you been seeing on the research side around this? I know Shreya Shankar has been doing some work around this at Berkeley. And what do you think developers need to know about breaking that plateau?

Vikram Chatterji 5:46 Well, I think there's two parts to this. One is there's a research aspect. There's also an organization aspect. Increasingly in teams, when we talk to folks who are very, very serious about building out high quality agents, like AI SRE organizations and billing agents for sales and billing agents for banks, what we've been noticing is they have a bunch of software engineers who are working on building out these agents. But when we really, really ask them what their day to day looks like, 70% of the time is just evals.

Vikram Chatterji 6:13 Right? And when we really break that down even further, what they're trying to do is take these evals and make sure that those are getting to higher accuracy by just simply trying to tweak the prompts in these LLMs as a judge and trying to make sure that they're high accuracy enough, and then they're trying to monitor their traffic. And they typically just do around 10% monitoring of the traffic because beyond that, it gets extremely expensive to use these LLM based judges. So two things. One is organizationally,

Vikram Chatterji 6:37 I am of the opinion that we're gonna start seeing more and more folks come out as not just AI engineers, but also eval engineers, just given how pivotal this is for every team and given how important it is from an IP perspective to have your own custom scalable evals. So eval engineering as an actual discipline, as a science, as a concept is gonna be bigger and bigger as we go. And therefore, the most mature organizations are gonna have eval engineers. You don't hear about prompt engineers anymore, but eval engineering is 100% gonna be a bigger thing. And that's why we're also publishing the course so that eval engineers can learn that, get certified, and more and more eval maturity starts to show up in the market. To your other question about the research component of this, there's very, very limited R&D that's happening right now, which is why, again, at Galileo, you know this, we're not just a software layer company. We've been pioneering stuff on the data science side of things, infrastructure side of things, such that we can actually provide evals at scale.

Vikram Chatterji 7:33 And this requires a lot of different kinds of competence to come together. So one example of that was the idea that you can provide natural language feedback when you actually look at an eval that's not performing really well. To give you a concrete example, if you have built a simple eval, like correctness, to measure the accuracy and correctness of a response from a chatbot, for instance, when you look at that and you see that it's 100% wrong, most of the time you look at that score and you can't really do anything about it. And that's where, as a developer, as well as a subject matter expert, you feel very powerless.

Vikram Chatterji 8:06 And so we built out this technique by which you could just add natural language feedback. And we abstract away the notion of how you can auto prompt engineer that response. We use different kinds of techniques there. That's completely differentiated technology that's taken out of the developer's hands, such that they don't have to think about going and trying to tweak the prompt in different kinds of ways. We call that continuous learning through human feedback. And the feature is called Auto-Tune, but that itself has dramatically reduced the amount of time that it takes from seeing an eval that's not performing really well to actually getting to around 95% plus accuracy with an LLM as a judge, with three or four points of feedback. Right? So that's one example of R&D that we've been seeing has been really, really useful, but there's many, many more such things that you can do to make sure that it's high accuracy, low latency, low cost.

Vikram Chatterji 8:54 We have started also working with certain academic institutions and folks across the industry to make sure that we can further the idea of eval engineering. And that's also frankly why we're making this big push towards people in the industry understanding that there's a bigger and bigger need for people to think more about eval engineering versus just focusing on getting that next agent out the door, which is not gonna work.

Conor Bronsdon 9:22 Yeah. Part of the rationale here is that change in accuracy from generic evaluations, where you may be at 70-ish percent, the research shows, to actually providing that auto tuning human feedback, creating custom evals with that, where you can get 90% plus, 100% even sometimes, accuracy. Because without that, I mean, 70 percent evaluation accuracy is fine, but that means that three in 10 interactions are failing and you can't tell necessarily.
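
One way to make the "three in 10" point measurable is to score a judge against human labels on a sample of interactions. The sketch below is a minimal illustration with made-up data; the function name and the sample are assumptions, not anything described in the episode.

```python
def judge_report(pairs):
    """Compare judge verdicts with human labels.

    pairs: list of (judge_pass, human_pass) booleans, one per interaction.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for judge_pass, human_pass in pairs:
        if judge_pass and human_pass:
            counts["tp"] += 1
        elif judge_pass and not human_pass:
            counts["fp"] += 1  # judge passed something a human would fail
        elif not judge_pass and human_pass:
            counts["fn"] += 1  # judge failed something a human would pass
        else:
            counts["tn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return {"accuracy": accuracy, **counts}


# Hypothetical sample: 10 interactions, judge and human agree on 7.
sample = (
    [(True, True)] * 5
    + [(False, False)] * 2
    + [(True, False)] * 2
    + [(False, True)]
)
report = judge_report(sample)
print(report["accuracy"], report["fp"], report["fn"])  # 0.7 2 1
```

The false positives are the dangerous part of that 30%: they are interactions the judge waved through that a human reviewer would have flagged.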

Conor Bronsdon 9:46 So that's just not deployable for customer facing AI, if you're trying to do anything that's particularly important. So I think it's really clear there's an opportunity here, but it might be helpful to define the different key elements of eval engineering. So we've talked a bit about this customization of evaluations. We've talked a bit about this latency problem, which I know is something that you're looking at addressing.

Conor Bronsdon 10:11 How do you break down this discipline and what are the key facets of it?

Vikram Chatterji 10:16 Yep. That's a great question. So essentially there are three different facets to this and, to go down that route, maybe let's start with the core problem. Right? When you're using an LLM as a judge, let's say, take an example. If you're using GPT-5 as the LLM for your judge, what you're doing effectively is you build out a judge which is around 70% accurate. Per million tokens, it's about $1.25,

Vikram Chatterji 10:44 which is very expensive. It's got a latency of around two to three seconds. So it's super, super expensive, super high latency. So now as a developer, you've built something which is just wrong three out of 10 times, but also you can't really scale this for when you're monitoring all of your traffic. So step number one is just authoring the judge. That's step number one. Step number two is making sure that it's high accuracy,

Vikram Chatterji 11:09 which is where auto prompt engineering comes in.

Conor Bronsdon: That's that continuous learning piece, which I think all developers are familiar with in other areas.

Vikram Chatterji: That's right. That's correct. So how do you actually tweak the actual LLM as a judge so that you can create a high accuracy prompt at the minimum, right, for that LLM as a judge. And I'll talk about that and what pieces are important there. Step number three is auto adapting that such that you now make this a small language model version of itself. And this is where you need an SLM,

Vikram Chatterji 11:37 which is gonna be very useful for that particular task, but you also need to be able to fine tune that SLM with data from your production traffic really quickly. So those are the three different ways by which you can basically get from an LLM as a judge towards a high accuracy, low cost, low latency SLM version of itself. Now, if you take the first step, which is just authoring an LLM as a judge, question number one there becomes, how do you even know what the task at hand should be and what the prompt should be in the first place?

Vikram Chatterji 12:05 We've noticed that a lot of these LLMs as a judge are actually created not by the developer. They're created by subject matter experts, the AI product manager, the folks who actually understand the business use case. And those folks are not prompt engineers. And so part of that is, how do you make sure that you could take natural language input from these folks and then author a very, very high performing prompt, which can get you to that 75%

Vikram Chatterji 12:30 accuracy. So that's step number one. That is something that we've already built. We call it Auto-Gen, where we can build the eval prompt for you. Step number two becomes going from 75% to 95% accuracy with the LLM as a judge, which is what we call the last mile measurement problem. And what you realize there, Conor, is that it's very much a factor of all the information that is not what the LLM was trained on in the first place, which is the public Internet. But instead it's the last mile context, which you have as an organization. It's about your use case, your organization, your context, and no LLM on the planet can have all of that. So now, if for some reason the model doesn't realize that the abbreviation CC stands for credit card, but that's what you've been using in your parlance all the time, or there are some very specific banking policies that the model has no idea about, that are internal to you guys, then you have to add that context and provide that context to the model without necessarily fine tuning it. So that's the part which I call CLHF.

Vikram Chatterji 13:30 That's the research that we've been doing for the last couple of years. So continuous learning through human feedback basically means that now you add that natural language feedback, like I said, and we auto tune the LLM as a judge. And now very quickly with four or five pieces of feedback, you can get it to a fairly high degree of accuracy.
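
To make the "last mile context" idea concrete, here is a deliberately simple sketch of folding an organization's glossary and reviewer feedback into a judge prompt. It is not Galileo's Auto-Tune or CLHF implementation, which the episode does not detail; the function name, prompt wording, and example policy are all hypothetical.

```python
def build_judge_prompt(criteria, glossary, feedback_notes, response):
    """Assemble a judge prompt that carries organization-specific context."""
    lines = ["You are an evaluation judge.", f"Criteria: {criteria}", ""]
    if glossary:
        lines.append("Organization glossary:")
        lines += [f"- {term}: {meaning}" for term, meaning in glossary.items()]
    if feedback_notes:
        lines.append("Reviewer feedback from earlier mistakes:")
        lines += [f"- {note}" for note in feedback_notes]
    lines += [
        "",
        "Response to evaluate:",
        response,
        "",
        "Answer with one word: PASS or FAIL.",
    ]
    return "\n".join(lines)


prompt = build_judge_prompt(
    criteria="The response follows internal banking policy.",
    glossary={"CC": "credit card"},
    # Hypothetical feedback a subject matter expert might leave.
    feedback_notes=["Quoting a CC limit without verifying identity is a FAIL."],
    response="Sure, your CC limit is $5,000.",
)
print(prompt)
```

Each new piece of feedback becomes one more line of context for the judge, which is the general shape of the loop described above, whatever the production mechanism actually looks like.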

Vikram Chatterji 13:47 What's happening during this time is that as a part of that feedback, Conor, you're also basically providing feedback when there is a false positive or a false negative. Right? So what that means is you're also basically telling the platform what the actual right metric should be or the right response should be. So you're creating ground truth effectively. And as part of creating ground truth, you're creating the dataset. You're essentially authoring this dataset, which is the ground truth dataset for that metric. And that helps with step number three, which is we have these SLMs,

Vikram Chatterji 14:17 which have been specifically built out for evals. We have done a lot of jujitsu around their weights, around the decoder steps, etcetera, to make sure that it's low latency, but also specifically for evals. We call them our Luna evaluation models. We also built an auto fine tuning engine on top of that, which automatically fine tunes with this data which you've now created, with or without you knowing, through CLHF.
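
The ground-truth step can be pictured as turning each reviewer correction into a labeled training row. The sketch below assumes a simple PASS/FAIL metric and a JSONL layout chosen for illustration; the field names and record type are hypothetical, and the actual format used to fine tune the Luna models is not described in the episode.

```python
import json
from dataclasses import dataclass


@dataclass
class Correction:
    response: str      # the text the judge scored
    judge_label: str   # "PASS" or "FAIL" from the LLM judge
    human_label: str   # what the reviewer says it should have been


def error_type(c):
    """Classify the judge's mistake, treating PASS as the positive class."""
    if c.judge_label == c.human_label:
        return "agree"
    return "false_positive" if c.judge_label == "PASS" else "false_negative"


def to_jsonl(corrections):
    """Emit one ground-truth row per reviewed item, labeled by the human."""
    rows = [
        json.dumps({
            "input": c.response,
            "label": c.human_label,
            "judge_error": error_type(c),
        })
        for c in corrections
    ]
    return "\n".join(rows)


reviews = [
    Correction("Your CC limit is $5,000.", "PASS", "FAIL"),
    Correction("I can't share that until we verify your identity.", "FAIL", "PASS"),
]
print(to_jsonl(reviews))
```

The human label, not the judge's, is what gets stored, so every correction doubles as a training example for the smaller model.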

Vikram Chatterji 14:41 And that's how you automatically create a very high accuracy, fine tuned SLM version of that metric. Right? But it doesn't end there because now you have an eval, but you also need to be able to actually do inference on that eval at scale. Some of our customers have thousands of queries per second, and they have tons of evals for every single one of those queries, which is a lot. And so how do you actually perform inference for evals at scale? That becomes more of a GPU optimization question. So as you can see, there's these three different parts, but overall, it becomes very much a data science problem, it becomes an infrastructure challenge. But net net, from an eval engineering perspective, we think that if you can at least follow these three steps, overall it leads to a much, much higher performing eval versus just a vanilla LLM as a judge, which is step zero.

Conor Bronsdon: Nice starting point. It's not gonna get you where you need to be.

Vikram Chatterji: Exactly. That's exactly right.

Conor Bronsdon 15:42 And I think one of the really exciting things about eval engineering as a discipline is this idea that you just continue to improve over time. So you maybe start with that generic evaluation, whether that's one of the built-in ones on Galileo's platform, an open source one, whatever it is you're using. You can then fine tune that, obviously, with human feedback. You can apply some of these, you know, automations for auto tuning, and then there's this major opportunity to actually, once you've gotten to the success level you need, whether that's, you know, 95% success, 99, whatever it is,

Conor Bronsdon 16:06 you can apply this eval as a runtime control to actually manage agent behavior in real time. Can you tell us a bit more about how this evolving part of the discipline is starting to take shape?

Vikram Chatterji 16:15 This is something we've been noticing more and more where, with agents, we feel like it's less about just the input and the output. It's a lot to do with all the inner workings of the agents when it comes to evals. And what I mean by that is these tools and planners and the handoffs between agents, there needs to be a lot of rerouting and traffic copying that needs to happen in runtime. And what that means is,

Vikram Chatterji 16:45 observability platforms need to evolve from being purely reactive to being proactive. And again, as you know, we've been talking about this and shouting from the rooftops for the last two years, ever since we launched Protect about two years ago, which is our runtime engine for taking these SLM powered evals and using them for protection within 200 millisecond latencies at scale. So we're seeing that that same level of protection and that infrastructure

Vikram Chatterji 17:09 can also be used for actually rerouting these micro controls within the agent. So I think the future is gonna be very much about who can control the different aspects of the agent objects, like your tools and MCP server and your handoffs. And that kind of agent control is what's gonna lead to folks who can build agents that are truly autonomous versus the ones who cannot.

Conor Bronsdon 17:36 I assume you don't think every evaluation should turn into a runtime guardrail. How should developers be making the decisions about, hey, this seems to be a live control versus here's something we're using in other circumstances?

Vikram Chatterji 17:46 Yeah, it's a great question. I almost feel like there are two dissociated concepts at that point, and they all fall under the broader umbrella of observability. So think of observability as something which, for agents especially, has to work offline, online, as well as at runtime. Now, evals is one component of all of this where you build out this measurement system. Right? And you can use that measurement system for monitoring,

Vikram Chatterji 18:10 but you could also use this measurement system at, let's say, a tool level to say, look, if there's an error in the tool, then do X. That being said, some of these controls needn't have an eval at their core. They could be purely conditional, but they'll still need a runtime engine to actually operate. But the way I think about this is

Vikram Chatterji 18:28 that today's evals are tomorrow's guardrails, but not every single eval needs to be a guardrail. Not every eval needs to be a control. You know, when you're thinking about controls within an organization, certain evals are more system based. What I mean by that is they're much more about adherence to the context, or about the errors in the tools. Then there are some evals that are much more about the use case itself, like whether there's banking policy adherence happening or not. Right? So if you list out all these different evals and stack rank them, what we've noticed is that the ones that are highest priority are the ones that get converted into guardrails.

Vikram Chatterji 19:04 As an example, adherence to banking policy. Let's say that one is extremely high priority for the bank, for that particular business unit. It's gonna be very important for them to make sure that it's converted into a guardrail and enforced. So you might have X number of evals that have been built out, but you'll only have maybe X divided by four that are converted into guardrails and then also used as controls.

Conor Bronsdon 19:25 Tell me more about how you then gain insights from what's happening in production. Because obviously, if you are starting to run all these controls at runtime and you are operating a lot of this infrastructure around these agents, there's a lot of data that comes from that. And there are a lot of lessons and learnings that can then be reapplied to your entire AI infra system here. You mentioned

Conor Bronsdon 19:47 earlier this idea of stacking agents on top of that to take the learnings back into the system. How is that process actually working, or what are you recommending to eval engineers as they try to bring those insights back from production and leverage them to improve the agents?

Vikram Chatterji 20:02 Yeah. It's a great question. So essentially, all of this is a loop. If you notice something in production where you have a failure mode, converting that into an eval in the first place becomes extremely important. So the way we think of this is: you log your data, you enforce these evals on those logs, you start to fix and update your agent based on that, and then you monitor it again. Within that process, once you realize that, hey, there's a new failure mode being discovered, you bring that back into the overall system and the overall flow. Now, what we've also noticed is that a particular eval has a life cycle of its own. So there's an overall agent life cycle, but if you look at an eval, because it's also powered by an LLM for the most part, it's almost like a mini app within an app. And now you have many such mini apps because you have many such evals. So just like you have a prompt and a data flywheel for your overall agent, you also need a prompt and a data flywheel for your eval as well. Now, this can sound like a lot of work, but that's why I feel like it's very, very important for observability platforms like ours at Galileo to take on the work, to make it dead easy for developers

Vikram Chatterji 21:14 to be able to just, you know, start by creating an eval and then, with minimal work and effort, get to the other side where they can actually author high-accuracy versions of it, but also discover when they need to re-fine-tune these SLM-powered metrics, and also abstract away the fine-tuning piece from them, because they're not data scientists. So there's a lot of work there that needs to be done to

Vikram Chatterji 21:37 abstract the complexity away from developers. And that's precisely where I feel like the R&D needs to come in. That's something which is important in 2026, just given how it's gonna be the year of agents, the year when agents are gonna be going into production more than ever before. 2025 was the year where we had a proliferation of agent builder platforms. So we're gonna see millions and millions of agents built out, but only a handful of them, as you know, are actually in production right now. We're gonna see more and more of them coming into production, but only because we're gonna see more people actually figure out how to do this eval engineering piece for their agents and become really good at thinking about evals as a life cycle, versus just thinking about the agents as a life cycle.

Conor Bronsdon 22:26 As these evaluation life cycles evolve from, let's call it, 10% observability to closer to 100% observability, that creates a ton more opportunities to learn, a lot more data you can bring in. But I imagine there's also a risk of these positive feedback loops going wrong, where they can spiral in the wrong direction. Are there particular failure modes that you think eval engineers need to watch out for as they architect these systems?

Vikram Chatterji 22:43 Well, that's what's interesting, because there are some failure modes, but it's also really, really hard to figure out what the failure modes are, because they come in all shapes and sizes. They could be at the tool level, the MCP server level, the handoffs; they could be in many different parts. They could also just be very use-case specific in nature. So something which we've been working on is

Vikram Chatterji 23:05 what we call our auto insights. What it means is essentially taking all of the different logs, all of the evals that have been built out, and the prompts for those evals, and throwing that into an eval-specific reasoning engine, which can then automatically tell the developer exactly where the failure modes are. So this is not a chat-with-the-logs feature, which is extremely gimmicky in my opinion. Instead, it's more about automatically telling the developer that here are the 10, 15, 20 different places where something is potentially going wrong,

Vikram Chatterji 23:33 and you can actually take action on it. The reason I feel like that's extremely important is that it's very difficult otherwise for developers to constantly be on top of what the failure modes are. That is the big game of whack-a-mole that starts getting played when it comes to figuring out where things are going wrong with these nondeterministic systems.

Vikram Chatterji 23:52 And frankly, Conor, I feel like that's actually part of the step zero of eval engineering, which is: which evals should I even create? What are my failure modes outside the usual suspects? Because otherwise, I just see people creating pretty frivolous evals like correctness and, you know, toxicity and stuff like that, which is not even the tip of the iceberg.
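
As a rough illustration of the "where are my failure modes" question, here is a minimal Python sketch that counts failed evals per agent component and ranks the hotspots. It is not how Galileo's insights feature works (the interview describes an eval-specific reasoning engine, not a simple count); the `EvalResult` schema, the component names, and the eval names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One eval verdict attached to one logged agent step (hypothetical schema)."""
    trace_id: str
    component: str   # e.g. "tool:search", "mcp_server", "handoff"
    eval_name: str   # e.g. "tool_error", "context_adherence"
    passed: bool


def rank_failure_hotspots(results, top_n=3):
    """Count failed evals per (component, eval_name) and return the worst offenders."""
    failures = Counter(
        (r.component, r.eval_name) for r in results if not r.passed
    )
    return failures.most_common(top_n)


# Made-up log entries, purely for illustration.
results = [
    EvalResult("t1", "tool:search", "tool_error", False),
    EvalResult("t2", "tool:search", "tool_error", False),
    EvalResult("t3", "handoff", "context_adherence", False),
    EvalResult("t4", "mcp_server", "tool_error", True),
]
hotspots = rank_failure_hotspots(results)
print(hotspots)
```

A ranked list like this only covers evals that already exist; discovering failure modes nobody has written an eval for is the harder problem described above.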

Conor Bronsdon 24:11 What kind of customer results have you been seeing, Vikram, from teams that are successfully implementing this eval engineering stack?

Vikram Chatterji 24:18 The customer results from the last year of us doing this with folks have been absolutely staggering. So for instance, with one large telco that we've worked with for the last year, what we've seen is that eval engineering has unblocked them on two fronts. One of them is that they actually have the trust and control to be able to go from one agent when we started working with them to 47 agents at scale within the span of the last eight months.

Vikram Chatterji 24:41 And when I say at scale, this is not, like, startup-in-San-Francisco scale. This is like 8,000 queries per second or more. It's insane. And with all their evals in place for 100% of their traffic. What this means from a cost perspective for them, outside of the observability impact, is that before we came into the picture and helped them from an eval engineering perspective,

Vikram Chatterji 25:01 they were using a GPT model. At the time, they were using GPT-4.1 as the judge. And across all their evals, across all their logs, it cost them around $26 million a year annualized just for evals. Right? It's extremely expensive. People don't realize this, but it's there. People only think about inference costs, but actually you're also doing inference for all of your evals, and it's extremely expensive.

Vikram Chatterji 25:28 With Galileo's eval engineering, because we helped them convert that into a higher-accuracy, low-latency, low-cost version of itself, with the Luna models and the entire stack that we have on top of that, it's gone down from $25 million to $350,000 annualized. We've done this with this large telco. We're also seeing the same thing happen across a few of our other customers in different kinds of fields. Which is why we're now taking all of those learnings that we've had over the last one or two years of pioneering in the space, capturing them in this course that I was talking about, and making that available for free for all developers. A large part of the work that we've done with Luna models, as well as the auto-tune feedback feature that I mentioned before, is also available for free in our product, for any developer that's building out an agent to go and use at galileo.ai.

Vikram Chatterji 26:18 And we're very happy to get some feedback on all of those pieces too.
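
To put the quoted cost figures in proportion, here is a small worked calculation in Python. It uses only the two annualized numbers from this answer (about $26 million on the GPT-4.1 judge, about $350,000 afterwards); the function name is just an illustrative helper.

```python
def cost_reduction(before_usd, after_usd):
    """Return (how many times cheaper, percent saved) for two annualized costs."""
    factor = before_usd / after_usd
    percent_saved = 100 * (1 - after_usd / before_usd)
    return factor, percent_saved


# Annualized eval costs as quoted in the interview.
factor, percent_saved = cost_reduction(26_000_000, 350_000)
print(f"about {factor:.0f}x cheaper, {percent_saved:.1f}% saved")
```

Using the $25 million figure mentioned a moment later instead of $26 million changes the result only slightly (roughly 71x rather than 74x).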

Conor Bronsdon 26:23 One of the things I think is really cool is to see how, over the last, what, three and a half years, as Galileo has really grown, you've continued to get deeper on this pipeline where it's like, okay, initially we're helping solve basic metrics. We're getting you to that 70% problem. But, you know, once you saw that start to plateau, you said, okay, how can we enable more customization? How can we make it really easy to customize?

Conor Bronsdon 26:44 How can we bring in data from production to actually improve this? How can we add synthetic data that you can use to test this against to, again, improve? But I'm sure not everything has worked the first time. In fact, I can speak to that a little bit myself, having been there for a year and a half. But I'm curious, from your perspective, what are some of the approaches that you've tried as you've worked through eval engineering

Conor Bronsdon 27:08 that have not paid off as well as you maybe expected them to? And what did you learn from that process that you're now applying to the discipline?

Vikram Chatterji 27:15 Yeah, it's a great question. I mean, that's always the boon and the bane of trying to pioneer versus just, you know, following the herd. One of our values is "wander bravely," and we've always been pushing the envelope. Our thesis on this market has always been that at some point, people are gonna get very serious about these agents and these apps that they're building out, because they're gonna be critical for their business unit. And you also know that for maybe the first two years of us doing this, that was not the case in the market. People did not care about scalable evals. People did not care about high-accuracy evals. The ask was just, "Give me my traces, man, and I'll be happy," which, honestly, you could vibe code today. Right? So along the path, we had certain ideas for how to solve two bespoke problems. One is how do you actually solve for that last-mile accuracy problem? And the second was how do you actually reduce the latency on this? From a latency perspective, we always knew that SLMs would potentially be the answer. You don't have to use a super, super large language model for doing a simple task like an eval. The first thing we did, you probably remember this, is we used the BERT model from Google way back. And we realized that compared to these much larger language models, the BERT model just doesn't generalize. So where do you go from here? Then we started using a Llama 80-billion-parameter model, and we noticed that, wait a second, that's much, much better from an accuracy perspective and can also be fine-tuned, but it's extremely expensive and extremely high latency. So can we launch this for every developer? Sure. But we wouldn't be doing it justice.
So there's a lot of research that we had to do to figure out which of the large volume of open source models out there we could actually use and which ones had the right licenses. There was a lot of back and forth and a huge amount of experimentation that had to happen there. The other thing we realized was that even after fine-tuning these models,

Vikram Chatterji 29:05 there was this other problem of how do you actually provide this at scale? That's when we realized that these models can't just be generative models, right? You can't just take, let's say, a Llama 8B and say, look, I'm gonna fine-tune this and that's it, kumbaya. It doesn't solve the problem, because they are generative models, whereas an eval model has to be something which just spits out a single token or a few tokens. So now how do you actually change the weights? How do you work on the decoders such that it becomes an eval-focused model? That's more work we had to do. So it's been a series of things that we've had to do over time such that we've now been able to commercialize this at scale. Lots of learnings along the way, but we've tried to commoditize that into infrastructure and product.
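
To make the "single token instead of free generation" idea concrete, here is a minimal Python sketch. It assumes an eval model exposes the logits for its first output token and that each verdict maps to one label token; the score is then a softmax over just those label logits. This illustrates the general technique only, not the Luna architecture, and the token ids and logit values are made up.

```python
import math


def verdict_from_logits(next_token_logits, label_token_ids):
    """Turn one decoding step into a score by comparing only the label tokens.

    next_token_logits: mapping of token id -> raw logit for the first output token.
    label_token_ids: mapping of label -> token id, e.g. {"pass": 17, "fail": 42}.
    """
    logits = {label: next_token_logits[tid] for label, tid in label_token_ids.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical logits from a single forward pass; token 99 is ignored
# because it is not one of the label tokens.
fake_logits = {17: 2.0, 42: 0.0, 99: 5.0}
probs = verdict_from_logits(fake_logits, {"pass": 17, "fail": 42})
print(probs)
```

Because only one decoding step is needed, latency and cost scale with the input length rather than with a generated explanation, which is the trade-off described in this answer.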

Conor Bronsdon 29:49 And now that we're actually seeing people putting agents into production, really caring about moving to that 90%-plus accuracy mark. I mean, you can look at a million examples, but Google just announced enabling people to build agents within their Google Workspaces, for example, very simply. That's just one obvious example, but I think you can look anywhere and you'll see this behavior happening.

Conor Bronsdon 30:11 Now that that's happening, obviously this means the importance of eval platforms and the eval engineering discipline has vastly increased. What do you see coming next? What are the things that you're keeping your eye on, or charging towards perhaps, as far as continuing to innovate the discipline?

Vikram Chatterji 30:27 Something which is very top of mind for us right now is what all of this means for multi-agentic systems at scale. Because 2026 is probably also gonna be the year when we're gonna see that. Right now, we're very much in the earliest innings. The industry doesn't have that many agents in production. It's rapidly changing. The demand is there. The stack is there. So we're obviously gonna see more and more multi-agentic systems. We're gonna see much more maturation with MCP

Vikram Chatterji 30:56 as well as A2A frameworks. What we're also gonna start seeing is more of a focus on auth for agents. You're gonna see more standardization across multiple APIs. We feel like a lot of platforms like Salesforce and HubSpot and many others, like Epic, are gonna start becoming more like databases with very easily accessible tools for different agents to use. So we're gonna see a lot more of that. So now in

Vikram Chatterji 31:22 that future, where first-party and third-party agents start to talk to each other to perform very, very complex tasks, what does it mean for evals? It's definitely not an LLM as a judge. It's much more about how you actually help control the different smaller, fine-grained parts of these agents. So that's the direction that we've been thinking a lot more about of late, and I think it's gonna become much more real maybe to eight months from now.

Conor Bronsdon 31:47 Do you expect that this new eval engineer role that's starting to appear is gonna be sitting more on, like, the platform team side, or is it gonna be embedded within the main engineering team that's working on, say, an agent product? How do you see this breaking down from an organizational perspective? Because obviously there are a lot of different stakeholders that are gonna care about this.

Vikram Chatterji Yep. Yep. That's a great question. I think it's very similar to how a go-to-market engineer came about, who was kinda sitting in the middle of sales and marketing. I feel like it's gonna be similar, where the eval engineer needs to have a very good idea of what the product is supposed to do, as well as be able to engineer

Vikram Chatterji 32:24 the eval and scale it. So it's a person who has to have a really good understanding of systems and engineering, but also of the product. I think of it as very similar to a product engineer workflow, where they have to have that product intuition and an understanding of the business, but also understand how code works and how code hygiene works, and actually submit PRs. So it's gonna be very similar to a software engineer, but somebody who is more product minded. And frankly, we're already seeing that happen today with a lot of the software engineers who are actually working on AI applications. Most of their time, Conor, as you probably know already, is spent less on the code side of things and more on the prompt side of things, and that goes for agents as well. I know that agents are 90%

Vikram Chatterji 33:08 code, but that code is extremely boilerplate. Most of the time that's spent is spent on the prompts for the planners, the handoffs, etcetera. And so I feel like it's exactly those individuals who are also gonna start thinking a lot more about evals and, as a discipline, start to think of it more as engineering.

Conor Bronsdon And we've transitioned from prompt engineering to context engineering, and now we're adding eval engineering to really nail this in.

Vikram Chatterji That's correct. Yeah. And people are gonna very quickly realize that the more eval engineers you have on your team, the more, quote unquote, magically the AI application actually becomes much more accurate and much more scalable than your competition's.

Conor Bronsdon 33:43 Vikram, thank you so much for coming on. It's been great to chat with you again here and dive into eval engineering as a discipline. I think it's clear that there's just so much opportunity to improve the agentic stack. Most of us in the industry are just seeing this agentic future really come into clear focus here. We've gone from demos to prototypes,

Conor Bronsdon 34:05 to actually having agents in production, major companies launching agent builders, and agents starting to just expand everywhere, including everything from being able to use a voice agent to call a company and get its hours, which someone like Google is now providing, to builders like n8n and Gumloop and so many others where you can build custom agents for your business use cases.

Conor Bronsdon 34:25 And as you mentioned, this includes the ability of a lot of these coding agents to kind of build the boilerplate for you, let you focus on the prompts, and let you really nail your evals to ensure the system's working the way it needs to, to customize. It's been a fantastic tour through eval engineering, but I'd love to get your perspective as we close out here on what's next. What else do folks need to know about eval engineering and what's coming, so they can, I guess, anticipate the future or prepare to be excellent in these roles?

Vikram Chatterji 34:41 Great question, Conor. So as I think about it, the biggest blocker right now for agent application development is how do you actually bake in trust and control, so that you can sleep at night knowing that things are okay. And I think of eval engineering as a very, very important piece for baking that trust into your system, so you can do proper regression testing and bake that into your CI/CD process.
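
One way to picture evals as regression tests in CI/CD is a gate that compares eval pass rates for a candidate agent version against a baseline. The Python sketch below is an assumption-laden illustration, not a Galileo API: the eval names, pass rates, and the 2-point tolerance are all hypothetical.

```python
def regression_failures(baseline, candidate, tolerance=0.02):
    """Return evals whose pass rate dropped by more than `tolerance` vs. baseline."""
    failures = {}
    for eval_name, base_rate in baseline.items():
        new_rate = candidate.get(eval_name, 0.0)  # a missing eval counts as failing
        if base_rate - new_rate > tolerance:
            failures[eval_name] = (base_rate, new_rate)
    return failures


# Hypothetical pass rates from running the same eval suite on two agent versions.
baseline = {"policy_adherence": 0.97, "tool_error_free": 0.99}
candidate = {"policy_adherence": 0.91, "tool_error_free": 0.99}

failed = regression_failures(baseline, candidate)
print(failed)
# In a CI job, a non-empty result would fail the build,
# for example with: sys.exit(1 if failed else 0)
```

The tolerance exists because LLM-powered evals are themselves noisy; a gate with zero tolerance would tend to fail on run-to-run variation rather than on real regressions.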

Vikram Chatterji 35:16 However, a really big thing that's gonna be very important for agentic application development is how do you actually bake in those fine-grained micro controls for different parts of your agentic system at scale. That's something which I think is gonna be extremely crucial going forward in 2026. And we're actually very excited about a launch that we're gonna be doing

Vikram Chatterji 35:40 in about two months from now on exactly these grounds. We've been beta testing it with our customers right now too, and getting some incredible responses. So I'm super excited to talk more about that very, very soon. So there's more coming. We're gonna keep pushing the boundaries of eval engineering and move it from just trust towards control as well. That's something which we're really excited about going forward.
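
The move from trust to control can be sketched as a small table of runtime controls attached to individual agent components. In the Python sketch below, one control is purely conditional (a tool error triggers a reroute) and one is backed by an eval score (a low policy score blocks a handoff), mirroring the distinction made earlier in the conversation. Everything here, including the `Control` type, the component names, the 0.5 threshold, and the action strings, is hypothetical and not the product being described.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Control:
    name: str
    applies_to: str                    # component, e.g. "tool:payments" or "handoff"
    condition: Callable[[dict], bool]  # returns True when the control should fire
    action: str                        # e.g. "reroute_to_fallback" or "block"


def first_action(controls, component, event) -> Optional[str]:
    """Return the action of the first control that fires for this component, if any."""
    for control in controls:
        if control.applies_to == component and control.condition(event):
            return control.action
    return None


controls = [
    # Purely conditional control: no eval score involved.
    Control("tool_error", "tool:payments",
            lambda e: e.get("error") is not None, "reroute_to_fallback"),
    # Eval-backed guardrail: fires when a (hypothetical) policy eval scores low.
    Control("policy_guardrail", "handoff",
            lambda e: e.get("policy_score", 1.0) < 0.5, "block"),
]

print(first_action(controls, "tool:payments", {"error": "timeout"}))
print(first_action(controls, "handoff", {"policy_score": 0.2}))
```

In a real system the eval-backed condition would call a low-latency eval model inline, which is why the latency budget of the runtime engine matters so much.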

Conor Bronsdon 36:01 Fantastic. Well, Vikram, thanks again for coming on the show. It's been great catching up with you, and hope to chat again soon.

Vikram Chatterji Likewise. Thanks, Conor.

Conor Bronsdon And for everyone listening, just a quick reminder: if you have a guest that you would like to see us bring on the show, drop a comment on this episode on Spotify, on YouTube, or message me on LinkedIn. Would love to hear from you and know who else you think we should be bringing on for season three and season four in 2026.

Conor Bronsdon 36:26 Vikram, thanks again for coming on the show. It's been a pleasure seeing you.

Vikram Chatterji Thanks, Conor. Appreciate it.

Conor Bronsdon Thanks to Galileo for sponsoring this episode. Their new 165-page comprehensive guide to mastering multi-agent systems is freely available on their website at galileo.ai and provides you the lens you need to understand when multi-agent systems add value

Conor Bronsdon 36:52 versus single agent approaches, how to design them efficiently, and how to build reliable systems that work in production. Download it for free at the link in the show description to discover how to continuously improve your AI agents, identify and avoid common coordination pitfalls, master context engineering for agent collaboration, measure performance with multi agent metrics, and much more.