Why does Hamel Husain say buying an evals tool won’t fix AI reliability?

Hamel Husain says the instinct to “abstract this entire thing away to some tool” and make accuracy “not my problem” doesn’t work — until something like AGI, no tool can just figure out whether your AI is doing the right thing for the user. The number-one question he gets is “What are the tools?”, and he calls that the wrong question. No matter which tool you pick, you still have to go through the same process to evaluate AI correctly, so the real question is “What’s the right process?”

What does Hamel mean by “look at your data,” and how many traces should you review?

For Hamel Husain, “look at your data” is shorthand for error analysis: open a trace viewer, read through traces, write notes on what’s going wrong, then categorize and count those error types to decide what to prioritize. He cites the social-science concept of “theoretical saturation” — keep looking until you’re not learning anything new — but gives a concrete starting heuristic of at least 100 traces, because the abstract version makes people anxious and they never begin. Once they start, he says, they stop caring about the 100 because of how much they’re learning.

Does Hamel think AI engineers need a data-science background?

Hamel Husain tried to see how far he could teach engineers evals without a data-science background and says you hit a limit fast. In the eval course he co-teaches — over 700 students of all backgrounds so far — topics like LLM-as-a-judge require validating the judge against human labels, and questions like why sampling works push you back to classic statistics, e.g. bootstrap sampling to measure judge noise. His take: you don’t need to train models, but you need real data literacy and exploratory data skills, which lands you near the skill set of an ML engineer or data scientist.

Why does Hamel recommend building custom data-annotation apps instead of using a generic dashboard?

Hamel Husain argues that real applications have domain-specific context — rendered widgets, emails being written, external data sources, traces that should be viewed exactly as the user sees them, or token-heavy content you want to hide. To do fast error analysis you want all the data you need in one place, rendered the right way. Because AI is now good at vibe-coding simple data-rendering apps, the value of building your own annotation tool outweighs the cost a lot of the time — though, he’s careful to add, not 100% of the time. He still wants a trace viewer like Galileo’s as a supplement.

How does Hamel say teams should involve domain experts in evals and prompting?

Hamel Husain calls outsourcing evaluations to developers one of the biggest failure modes he sees — fine for a developer tool, but otherwise developers lack the context, so you’re just guessing. If you’re building for lawyers, involve the legal expert. He warns that many people treat “prompt” as an abstract concept and expect a developer to write it — “the worst thing that can possibly happen.” His fix: give the domain expert an admin view to edit the prompt directly inside the real, user-facing application — rather than a playground that can’t call your tools, do RAG, or run your actual code.

Episodes · S2 E37 ← Prev Next →

Mindset Over Metrics: How to Approach AI Engineering | Hamel Husain

Aug 20, 2025 · Hamel Husain , Parlance Labs · 42 min

AI Evaluation & Reliability AI Observability AI Engineering

Listen on any app

Key takeaways

The first question Hamel Husain gets is “What are the tools?” — and he says that’s the wrong question. No tool abstracts away whether your AI is doing the right thing for the user; until something like AGI arrives, that’s not possible. The right question is “What’s the right process?” because every tool still forces you through one.
Generic dashboard metrics — hallucination score, toxicity score, conciseness score — are, in Hamel’s words, “most of the time not helpful at all.” They’re too generic, don’t necessarily correlate with actual failures in your product, and create an illusion you’ve checked the evals box while you’re really monitoring nothing.
Hamel grounds evals in failures through error analysis: open a trace viewer, read traces, take notes on what’s going wrong, then categorize and count those error types to decide what to prioritize. He notes the technique predates machine learning — it comes from the social sciences — and most people skip it only because no one taught them to do it.
Borrowing the social-science idea of “theoretical saturation” — keep looking until you stop learning anything new — Hamel still gives a concrete starter heuristic: aim for at least 100 traces. People freeze on the abstract version; 100 is a goal they can act on, and once they begin they stop caring about the number because they’re learning so much.
Hamel’s “spicy hot take”: an AI engineer doesn’t need to train models, but does need real data literacy. Teaching LLM-as-a-judge means validating the judge against human labels, and explaining why sampling works (e.g., bootstrap sampling for judge noise) drags you back to classic statistics — the skill set of a machine learning engineer or data scientist.
The biggest failure mode Hamel sees — and a major driver of his consulting business — is outsourcing evaluations to developers because teams treat AI like a software-engineering task. Unless you’re building a developer tool, the developer isn’t the domain expert, so you guess. If you’re building for lawyers, involve the lawyer.

Frequently asked questions

Why does Hamel Husain say buying an evals tool won’t fix AI reliability?: Hamel Husain says the instinct to “abstract this entire thing away to some tool” and make accuracy “not my problem” doesn’t work — until something like AGI, no tool can just figure out whether your AI is doing the right thing for the user. The number-one question he gets is “What are the tools?”, and he calls that the wrong question. No matter which tool you pick, you still have to go through the same process to evaluate AI correctly, so the real question is “What’s the right process?”
What does Hamel mean by “look at your data,” and how many traces should you review?: For Hamel Husain, “look at your data” is shorthand for error analysis: open a trace viewer, read through traces, write notes on what’s going wrong, then categorize and count those error types to decide what to prioritize. He cites the social-science concept of “theoretical saturation” — keep looking until you’re not learning anything new — but gives a concrete starting heuristic of at least 100 traces, because the abstract version makes people anxious and they never begin. Once they start, he says, they stop caring about the 100 because of how much they’re learning.
Does Hamel think AI engineers need a data-science background?: Hamel Husain tried to see how far he could teach engineers evals without a data-science background and says you hit a limit fast. In the eval course he co-teaches — over 700 students of all backgrounds so far — topics like LLM-as-a-judge require validating the judge against human labels, and questions like why sampling works push you back to classic statistics, e.g. bootstrap sampling to measure judge noise. His take: you don’t need to train models, but you need real data literacy and exploratory data skills, which lands you near the skill set of an ML engineer or data scientist.
Why does Hamel recommend building custom data-annotation apps instead of using a generic dashboard?: Hamel Husain argues that real applications have domain-specific context — rendered widgets, emails being written, external data sources, traces that should be viewed exactly as the user sees them, or token-heavy content you want to hide. To do fast error analysis you want all the data you need in one place, rendered the right way. Because AI is now good at vibe-coding simple data-rendering apps, the value of building your own annotation tool outweighs the cost a lot of the time — though, he’s careful to add, not 100% of the time. He still wants a trace viewer like Galileo’s as a supplement.
How does Hamel say teams should involve domain experts in evals and prompting?: Hamel Husain calls outsourcing evaluations to developers one of the biggest failure modes he sees — fine for a developer tool, but otherwise developers lack the context, so you’re just guessing. If you’re building for lawyers, involve the legal expert. He warns that many people treat “prompt” as an abstract concept and expect a developer to write it — “the worst thing that can possibly happen.” His fix: give the domain expert an admin view to edit the prompt directly inside the real, user-facing application — rather than a playground that can’t call your tools, do RAG, or run your actual code.

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Evaluation LLM as a Judge Retrieval-Augmented Generation (RAG)Accuracy AI Hallucination Artificial General Intelligence (AGI)F1 Score Tokenization

Show notes

As we enter the era of the AI engineer, the biggest challenge isn't technical - it's a shift in mindset. Hamel Husain, a leading AI consultant and luminary in the eval space, joins the podcast to explore the skills and processes needed to build reliable AI.

Hamel explains why many teams relying on vanity dashboards and a "buffet of metrics" experience a false sense of security, which is no substitute for customized evals tailored to domain-specific risks. The solution? A disciplined process of error analysis, grounded in manually looking at data to identify real-world failures

This discussion is an essential guide to building the continuous learning loops and "experimentation mindset" required to take AI products from prototype to production with confidence. Listen to learn the playbook for building AI reliability, and derive qualitative insights from log data to build customized quantitative guardrails.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Connect with Hamel on LinkedIn

Follow Hamel on X/Twitter

Check out his blog: hamel.dev

Check out Galileo

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Try Galileo⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Agent Leaderboard

Transcript

104 segments

Hamel Husain 0:00 These generic metrics are most of the time not helpful at all. They're way too generic. They don't necessarily correlate with actual failures in your AI product, and they don't actually mean anything. And it's it can be extremely destructive because, like, you know, people get seduced into this. Like, okay. I could just plug in this dashboard to my system and get this dashboard of metrics and can tell me, like, how I'm doing. And then you kind of have this illusion of, oh, like, I checked the box. I'm doing evals, and I am monitoring my system. In reality, like, you're not monitoring anything. Just wasted a whole bunch of time.

Conor Bronsdon 0:35 Welcome back to Chain of Thought, everyone. I am your host, Conor Bronson, and joining me is my cohost, Atin Drea Sanyal, co founder and CTO of Galileo. Atin, as always, great to have you behind the mic with me. Always great to be here. Yeah. I'm excited for this conversation because it's gonna be on a few topics you and I are particularly passionate about, and which I hate to tell our audience, but they've probably heard us opine about a few times because we have a special guest joining us who's been at the forefront of applied AI, helping over 30 companies navigate the complexities of building and productionizing their products. Hamel Hussein is an independent AI consultant, a luminary in the eval space,

Conor Bronsdon 1:13 and has worked with innovative companies such as Airbnb and GitHub, which included early LLM research, used by OpenAI for code understanding. Hamel, great to see you. Welcome to the show. Thank you for having me. Yeah. It's absolutely a pleasure because we've chatted a couple times about product philosophy, how to approach AI products today, and you're obviously well known for

Conor Bronsdon 1:37 your blogs, your courses, and a particular favorite of mine is your field guide to rapidly improving AI products. You get right to the core of what actually makes AI products successful in the real world, And that feels like the perfect place for us to start our conversation. You open that guide with this concept of a tools trap that many companies are falling into. Can you start by giving our audience a bit of an explanation of this idea and why so many smart AI teams are falling into this trap. Yeah.

Hamel Husain 2:09 So a lot of times when people think about evals or measuring, you know, the reliability or performance of their LMs in terms of is it doing the right thing for the user, the first thing that a lot of people's minds go to is, like, okay. What tools can I use? Can I just, like, abstract this entire thing away to some tools? Can I just buy tool like, you know, can I make it not my problem?

Hamel Husain 2:35 You know? Is there some abstraction or something that I can use to, like, have it so that I don't have to worry about the accuracy of my AI product or the performance of it or if it's doing the right thing. Like, maybe something can just figure it out for me. And, you know, until we get to like AGI or something like that, I don't think that's possible. But, you know, it is where people's mind goes goes towards.

Hamel Husain 3:05 And so I think the number one question I get is, hey. What are the tools? And that's the wrong question. The question should be like, hey. What's the right process? Like, how do you how do you evaluate AI? You know, getting into tools later, but what is the right process to go through? Because no matter what tool you use, you have to go through a certain process

Hamel Husain 3:31 to evaluate AI correctly.

Conor Bronsdon 3:33 Yeah, and, Otten, I know you have a ton of thoughts about that process, and both of you, I've seen, discuss this idea of generic metrics not being enough for many companies. You know, fancy dashboards being a panacea and, you know, a Band Aid solution, not something that actually solves everyone's problems. And this idea in your guide, as you put it, Hamill, creating of a false sense of measurement and progress, or as you described it, that, Atin,

Conor Bronsdon 4:03 you know, AI has a measurement problem. Hamill, could you give us an example of how you think vanity metrics have led teams astray? Yeah. So it's really

Hamel Husain 4:12 tempting if you're building AI tools, you know, Autin probably can provide more color around this, but I've seen this with other vendors, not not trying to pick on Galileo or really anyone, to be honest, is when you go into a pitch meeting around, hey, like, we can help you with your evals, you know, people wanna see a solution. And they want, you know, it's easy to kind of present a dashboard. It's convincing to some extent, if you don't know any better, to present a dashboard with all a bunch of generic metrics, hallucination score, toxicity score,

Hamel Husain 4:51 conciseness score, you name it. It just so happens that these generic metrics are most of the time not helpful at all. You know, they're they're way too generic. They don't necessarily correlate with actual failures in your AI product, and they don't actually mean anything. And it's it can be extremely destructive because, like, you know, people get kinda seduced into this. Like, okay. I could just again, going back to the tools

Hamel Husain 5:22 discussion, I can plug in this dashboard to my system. I can get this dashboard of metrics. It can tell me, like, how I'm doing. And then you kind of have this illusion of, oh, like, I checked the box of doing evals, and I am monitoring my system. But in reality, like, you're not monitoring anything. You just wasted a whole bunch of time. You don't really know what your failures are and, like, what the most important things you should be focusing on.

Hamel Husain 5:46 And so it just creates a lot of churn, and I think people are getting a lot better with recognizing that now. I think, especially six months ago, it was the cause of almost all of my consulting business, with people that were confused and hit kind of a a roadblock or a wall in terms of, okay, we plugged in this tool. We got this generic dashboard thing, and we don't really know what to do now. No. I I absolutely second what Hamel is saying. In fact, I would go to the extent of saying that this

Speaker 6:22 generic metrics problem has existed in machine learning even before GenAI. Even erstwhile ML workflows, you usually have a held out test set, measure F1 scores, and just say the model is good or bad based on that score. Those approaches as well are akin to generic approaches where they treat all kinds of errors the same way. In some situations they're necessary but they're in no way sufficient.

Speaker 6:51 And these were also some of the realizations that personally had as well before even starting the company when we built Michelangelo at Uber, there was no one stop metric that would be the panacea for your problems, and the same patterns are emerging again. I'm curious to ask you, Hamel, what kind of patterns you're seeing, but basically, just to take an example,

Speaker 7:18 with agents, customized architectures are kind of the way to go. You can build agentic architectures in a million different ways, and customized architectures need customized personalized evals, which also need to evolve as your application grows and evolves and meets the new new kinds of data. So one good question to ask, think, for practitioner, for a developer is rather than, oh, what metrics do I need from a buffet of metrics,

Speaker 7:49 rather, what are the pains or potential risks in the workflows of my app? Let's list them down, and then author evals which are customized to those panes, and then constantly monitor those panes because those panes will also evolve, panes as in potential risks and pitfalls in your application, and then accordingly update the set of evals that you're using based on those

Speaker 8:15 evolving pitfalls. But I'm curious to know if you've seen similar patterns, Mel. Yeah, so one of the things that's really important

Hamel Husain 8:23 to do with evals is to ground it in your failures. So how do you know what your failures are? And like, the thing that we harp on a lot and what we teach in evals and what I write in my blogs constantly is look at your data. But what does look at your data mean? Look at your data so what's behind look at your data is this process called error analysis. And error analysis is

Hamel Husain 8:48 has been around for a really long time, even before machine learning. So it's like been around in social sciences. I recently learned it. I thought, you know, the first time I was exposed to it was machine learning, of course. But it is a kind of this process where you go through and you look at data and you take notes about what is going wrong, and you then

Hamel Husain 9:10 use those notes and you kind of categorize them. You say, okay, like, what kinds of errors am I seeing? And you can do it starts very simple, like counting those categories and seeing like, okay, what types of errors are happening the most? And then you make a decision like what to prioritize from there. And it's a very powerful technique that most people don't do

Hamel Husain 9:34 because no one has taught them to do it, I think. And it's very simple. It's like the most simple kind of thing. We like, you know, We're talking about opening a trace viewer and then writing notes and going through a bunch of traces. And there's some, okay, the same questions always come up, how many traces should I look at? So on and so forth. And there's some useful heuristics. There's this concept from social sciences

Hamel Husain 10:01 called theoretical saturation, which just means like, hey, keep looking at traces until you're not learning anything new. So what we teach is, like, try to look at at least a 100 traces just as a heuristic to get people started because they have a lot of anxiety. If you just say theoretical saturation, they get they don't even begin. They just get scared of the whole process. But 100 is, like, concrete number people can know, like, have a goal.

Hamel Husain 10:26 And then, like, you know, after you begin, you don't really care about the 100. You're like, oh, I'm learning so much. I think that's the counterintuitive part of, like, going through individual data points and reading what is happening in a focused session provides immense value, and people don't know that until they do it. They're very surprised you know, at the amount of value that it provides. And so that that can inform all of your evals activity.

Hamel Husain 10:58 You know, it'll, like, it'll motivate everything, like what you should focus on, what you should write an eval for, etcetera. And it's not really like this error analysis can, like, kind of bucket it into this activity of evals, but it's not even evals. It's, like, just development. So I'll just

Speaker 11:19 stop there. No. That's super fascinating. I I actually kind of as you were talking, I'm drawing parallels to certain sort of opinions that we make on the Galileo platform itself, because we are an evals and observability platform, is this new notion of quantitative insights or metrics and qualitative, and the qualitative bit to me sounded very similar to the theoretical saturation workflow that you're describing, which is the error analysis process where it's less about numbers between zero and one measuring low and high, and it's more about

Speaker 11:57 more abstract. It's at a much more abstract level where are you achieving what you were set out to do? And along the way, what pitfalls or errors are you seeing? Something we do in Galileo is kind of drive the developer or the user to using what we call LogStream Insights, And LogStream insights are more qualitative insights on hoards of your data, like segments of your

Speaker 12:25 long running sessions, whether it's like a chatbot session or any kind of long running agent, we would analyze data in bulk and give you qualitative insights and then try to correlate them to potentially having you build some quantitative measures based on those qualitative insights. And hopefully the more qualitative insights you find, you reach that theoretical saturation

Speaker 12:52 that you're talking about. So I can draw a lot of parallels and it's very fascinating to hear kind of the theoretical sort of side of error analysis and the practice of it being much beyond AI and machine learning. I'm curious if the two of you think that part of the reason

Conor Bronsdon 13:12 this approach to error analysis hasn't really truly been popularized in current AI development circles is because we've seen this change in persona where most of the people who were doing machine learning work, like, there were engineers involved, but it's a lot of data scientists who have kind of more classically been trained on some of these error analysis techniques. Whereas software debugging

Conor Bronsdon 13:37 is a different approach often. And we're now seeing kind of the the marrying of these two approaches with engineers who are now becoming AI engineers and and working very differently and having to transform both how they think about the software they create from deterministic to nondeterministic, and also having to think about their approaches in different ways. Is that what's driving this kind of gap, you think, or is it something different? I think so. Yes. I mean, I I would say the first epoch or phase of AI engineering

Hamel Husain 14:08 was very much focused on, okay, like, we need to build stuff. We need to get go to zero to one really fast, and let's see what's possible in a rough sense. And, you know, now that you know, and so it was very much the narrative and, you know, also the truth. Like, you know, one of the most important skill to get started was software engineering, you know, in that. Like, you need to, you know, glue together a lot of things, use APIs,

Hamel Husain 14:36 you know, kind of full stack engineering, really important. And when it comes to, okay, like, how do you know that this stochastic system is reliable? That's a whole different skill set that takes time to learn. And there's a very large intersection between machine learning, data science, and the skills you need to do evals often. And the reason you know, I tried to actually

Hamel Husain 15:11 see how how much I could get get away with in terms of, like, teaching engineers evals without data science background or, you know, the requisite, let's say, background. And you do hit a limitation really fast. And, like, for example, you know, Srain and I are teaching a class on evals. We've taught over 700 students so far of all different kinds of backgrounds.

Hamel Husain 15:38 And, you know, like for example, when we get into building LLM as a judge, what we teach people is like, okay, one of the things that's important with LLM as a judge is that you can trust the LLM as a judge. And to trust LLM as a judge, you have to compare it to some human labels. There's things like the And questions always come up such as hey like why is it okay to sample

Hamel Husain 16:08 data how can you know and like we we show people, like, okay. If you wanna know how much noise there is in your judge, you can do stuff like boot strap bootstrap sampling. People don't understand that. They're like, why is it okay to, like, just continue, like, sample a whole bunch of times from a dataset to get the distribution? And so we we found that, like, we almost have to go back to classic statistics and see people that you. Which

Hamel Husain 16:31 is not super tractable, to be honest. Like, you know, not in the format of, okay. Let me teach you evals real quick. You can teach fundamentals, and you don't necessarily need all that stuff to get started, but you can you need, like, a fair amount of data literacy. That's one side of the equation. It's, like, statistics, but, also, it's all the analytical tools.

Hamel Husain 16:55 Right? So like how do you how do you like dig into data? You know, like let's say we're talking about traces earlier and like clustering those traces or navigating them or analyzing them. Like you want to be able to like really pick at data really fast and just do open ended exploratory analysis on it. And a lot of those data skills come into play again when it comes to, like, digging into a problem.

Hamel Husain 17:21 And so, like, you very quickly arrive at the very similar skill set of a machine learning engineer or a data scientist. You don't necessarily you don't need to be training models, but I would argue that you shouldn't be spending most of your time training models anyways. Like, you were looking at, you were doing a lot of error analysis and debugging and whatnot, so

Hamel Husain 17:46 that's my spicy hot take perhaps in this podcast.

Speaker 17:50 Don't think it's a hot take at all. I think it's a very, very legit take on just the distinction between software engineers and data scientists, and answering that key question. Like in the new world of sort of meshed roles and the AI engineer and what is, you know, mostly, like technical people are kind of undergoing this minor identity crisis. And the answer kind of lies in what you said, which is if you were to cherry pick one skill that's needed

Speaker 18:23 for the software engineer to become the AI engineer or, you know, to be efficient in the modern era is really just the skill of understanding data and knowing the difference between good and bad data or how to take bad data and step by step move it to good data and just data literacy is how you put it. Think that is the main skill because there's the other skills which are, you know, knowing the semantics of a decision tree,

Speaker 18:50 which is totally commoditized and you don't need to know. You don't even need to know how to train models or fine tune them. But to be able to understand this basic process of comparing an output with a pre generated ground truth, which is either human labeled or synthetic, but just knowing the goods and bads of the practices, that is what data literacy is, and if this skill is adopted

Speaker 19:17 by a tier A software engineer, I think they've set themselves up for the future.

Hamel Husain 19:22 Definitely, and there's lot of other related skills as well, like designing metrics, and the list goes on, how to tell stories with data, how to have a sense of when your metrics are leading you astray, all the way down to like having good product sense and having that be aligned with with metrics, you know, potentially doing AB tests, the whole suite of things is is important.

Hamel Husain 19:49 My friends and I joke that we might have a new, job title coming called AI scientist, but I try not to be the one who is coining you Wait wait a second. Job title. We're talking about AIPMs,

Conor Bronsdon 20:02 AI engineers, AI scientists now. Oh, man.

Hamel Husain 20:06 You know, there's always every time there's a technological shift of some kind, there is kind of this sort of gravitation towards the idea of a unicorn. So we saw it we saw it actually, like, many times. Like, you know, the most recent time we've seen it is, like, actually in data science itself, where, you know, initially at the outset of the data scientist,

Hamel Husain 20:29 we had the person that did everything, software engineering, statistics, DevOps, so on and so forth. I think, like, people realize there's a little bit too much service area, honestly, and then kinda split it into different kinda sub disciplines. Then we may be seeing that with AI engineer, if I were to predict.

Conor Bronsdon 20:48 Speaking of AI engineers, I know one of the recommendations that you've made, Hamel, has been that when teams are making AI investments, particularly when AI engineers are helping make their decisions here, it's really important just to have a customized way of viewing their data, not necessarily a complex dashboard, so that they can approach this debugging as error analysis the right way, so they can make decisions in the right way.

Conor Bronsdon 21:14 Because as I think as Austin and I have certainly experienced working with folks, it's very easy to overwhelm teams with too much data instead of enabling focus. Why do you think giving everyone an easy way to see what their AI system is doing

Hamel Husain 21:30 is more impactful than some of the sophisticated analytics that I think often we're trying to reach for? Yeah. So the guidance there is like, okay. There's a lot of tools out there that provide a good way to get started, like Galileo. Like, you know, you have a a way that you can, like, plug in your AI application and see your traces in a stream, and kind of

Hamel Husain 21:53 go through that. A lot a lot of times in, you know, in your applications, there are a lot of domain specific things going on. Like, might have widgets that you're rendering. Your application might be writing emails. You might have external data sources that you need to reference to evaluate particular trace. You might want to have you might want to view the trace in the exact way the user is is seeing it,

Hamel Husain 22:27 For example, you might have things in your trace that by default are usually not helpful, but that take up a lot of space in terms of tokens. All kinds of, like, little nuances. So what you want to do is really dial in the data viewing experience so that you can do this error analysis and, like, review lots of data really fast in a way that is very customized to you, that is very contextualized

Hamel Husain 22:55 to how you want to see data, all the data you need to see in one place, rendered in exactly the right way. And so the reason that's my that's our advice is just because of AI. Because, like, AI, you can vibe code. So, you know, AI is really good at producing simple applications that can render data and, like, you know, have simple yeah. Simple web applications like render data where you have, like, input fields and stuff like that.

Hamel Husain 23:26 That's something that is probably, you know, below the bar where AI can clear clear those tasks very well. And so because of that reality, we recommend that people in a lot of cases like create their own data annotation apps because there's just way too much value to be had relative to the cost of doing so. Isn't that the case like 100% of the time, but it's the case like a lot of times.

Conor Bronsdon 23:55 Atsin, I know a big part of our recent product philosophy at Galleo has been to give people more simplified views, whether it's, you know, the graph view or timeline view, which we've kind of designed with the idea of like, okay, let's give them other options to debug agents in particular as we look at these kind of more complex systems, as well as other views that, you know, may or may not be live by the time that this podcast launches.

Conor Bronsdon 24:19 And I know this is something you're thinking about a lot this is something you're thinking a lot about too, because as I alluded to, we've kind of had conversations together with AI engineers, I think, just like Hamill has, who are going, hey, I need help focusing here. I'm not really sure what to look at necessarily. I'm not sure where to spend my time in that error analysis.

Conor Bronsdon 24:38 What's your philosophy on how to approach this, I guess, observability and focus layer that Hamill is talking about? Yeah. I think beyond the

Speaker 24:49 graph views, which is a feature that we offer, features like graph views kind of tend to point to the broader philosophy of giving the right abstractions to the user to be able to kind of do the segmented root causing of these ever growing sophistication in systems which have evolved from simple REG to agentic REG to multi agent. You want to give the users the right abstractions so that they can shine the torch in the right areas,

Speaker 25:24 and that's where views like the graph view, session views, interaction views, these come in to be able to give the tools to the user to just be able to root cause effectively. And what that means what that entails is you run your application end to end, and each request may sort of touch certain parts of your application and light up the nodes there. And each request will run through a different

Speaker 25:54 sort of path in in in your application, which you can visualize as a dagger or workflow. The first step is to be able to spot the anomaly where kind of the ground level customization on the metrics, as well as the qualitative insights come in, but then these right abstractions and the right views to be able to make sense of, yeah, what's going on. And then there's the data that's associated with, because all this is really is just data flowing through a bunch of nodes and edges.

Speaker 26:28 So once you spot the anomaly, you want to look at the data and what, you know, went wrong with that. So simplifying the views around the data is kind of the next step from there.

Hamel Husain 26:39 And just to be clear, like, what I described does not add odds with these things and tools. They're just, like, supplementary. Like, I I also always want a trace viewer like the ones in Galileo because this can be, like, a lot faster to search through that and just look at that without you know, sometimes I'm looking sometimes I'm looking for something that maybe by accident wasn't in the annotation tool or something else. So it it is really useful. And also, like, a lot of these platforms, like Galileo, have APIs where you can connect your annotation tool to,

Hamel Husain 27:13 and, you know, write data back and forth to it.

Conor Bronsdon 27:16 So, you know, that's just something to think about. Yeah. And I think we all agree that's a a great best practice, is to leverage the APIs of whatever evaluation tool you're leveraging. Obviously, we we hope that's GALLAYO. But whichever eval tool you're using, like, using that API to bring that data into other places where you can look at it look at it in different ways,

Conor Bronsdon 27:37 and kind of consume that information and highlight it to business users, think is a fantastic thing to do. And, Hamel, I know you've talked about this idea of empowering domain experts who may not be in an eval product every day to add their insights and help improve these nondeterministic systems. How do you think about, you know, writing and iterating on prompts

Conor Bronsdon 28:02 with domain experts versus with engineers?

Hamel Husain 28:06 Yeah. So one of the biggest failure modes I see, and is also one of the biggest drivers of my consulting business, is people outsourcing evaluations to developers, which is fine if you're building a developer tool where the developer is a domain expert, but usually they're not. And the symptom there are the root cause of people outsourcing eval developers because they're thinking of AI like software engineering. They're like, oh, AI

Hamel Husain 28:39 development is a software engineering task. The, you know, the moment you'd say anything about AI development process, they're like, this is outsourced to developers. That turns out that always goes really badly because, yeah, like, you're you're only guessing, you know, and the developers don't have enough context, so you want to involve the domain expert. It's like, you know, if you're

Hamel Husain 29:05 working building something for lawyers, you want to involve the lawyer. You wanna involve the the legal expert at some point. And so, you know, when it comes time to doing things like iterating on prompts, you shouldn't have the prompt so removed from the domain expert. The whole point of LLMs is, like, humans can talk to computers. And so if you obfuscate everything so much that the domain expert can't talk to the computer, then you're

Hamel Husain 29:31 kind of burning the whole, you know, the value proposition of AI to begin with. Like, because you wanna direct, you know, line of communication between your domain expert and your in, like, what's going into the AI in terms of prompting. And so what I described in that blog post is a lot of like, a good pattern that I've seen work really well is if you have a user facing application,

Hamel Husain 29:59 you know, have, like, an admin view where you expose the prompt and allow the person to change the prompt. Even if you don't want the user to change the prompt, you have the like, for your internal purposes, you have an admin view that allows the domain expert to change the prompt and and fiddle with it. It gives them, like, a more direct connection to what exactly is happening rather than, like, having conversate abstract conversations about AI, and it should do this and it should do that. It's really important that they get in there, and they are, like, experimenting.

Conor Bronsdon 30:30 Yeah. And I think it very much aligns to what Galileo has done with our continuous learning through human feedback feature, because we feel the same way. You need to leverage this domain expert feedback. You can't simply have it just be the engineers who may be, depending on your business, you know, divorced from the bare metal of what the product's doing. Like, hopefully they are are very aligned to that, but sometimes they have business users who are translating,

Conor Bronsdon 30:55 you know, key pieces of that for them, or domain experts who bring a lot of context. And I know it's part of why, especially when we're looking at custom metrics, but all of our metrics, we leverage, you know, feedback from SMEs, you know, whatever type they may may be, can go in and say, okay, like, let me get feedback on these 10 traces and say, hey, this this metric feels a little off, actually. This is pretty accurate, or, you know, here's a little contextual feedback,

Conor Bronsdon 31:18 and then use a judge to translate that and apply it and, you know, retune the metrics. It's something we're we're finding a lot of success with. But I I think there's a lot more opportunity to go deeper here, to your point. Like, it feels like too often, even in highly customized evaluation systems for enterprises, we are just scratching the surface of the human context that we can bring in.

Conor Bronsdon 31:41 I mean, it's it's a very common problem for many organizations that there is too much tribal knowledge that's not living in documentations, that's not necessarily making its way into systems. And to your point, it's so necessary that we bring that human knowledge into our AI systems because they perform best when they have the data they need. And it can be as simple as

Conor Bronsdon 32:04 friction between technical teams and understanding that domain experts have of some of the jargon of your AI systems. Like, you gave this great example in one of your pieces about translating REG to just making sure the model has the right context and really saying, hey, like, let's just put this in a term that anyone can understand, even if they're not deep in AI.

Conor Bronsdon 32:25 What's your advice for AI teams who are looking to bridge that gap and really bring their domain experts into the fold, so that they can be part of improving their AI systems and their AI data. Yeah. Let me, like, clarify

Hamel Husain 32:40 the last point with some, like, concrete failure modes, like, to look out for. Like, one is okay. There's a there's there's an aspect of, like, a prompt store or, like, a centralized place that you could put prompts, which is fine. But a lot of times what happens is folks don't build, like, properly enough around around that. They don't build, an experimentation

Hamel Husain 33:04 environment. And so, like, you have to change the prompt there and then, like, commit it and then wait and then, like, go somewhere else and, like, try something. And that's, like, way too much friction. So that is kind of you know, that prevents the domain expert from experimenting. A lot of tools have prompt playgrounds, which are great. It's a good place to get started. However, most pump playgrounds, they don't have access to your tools and your infrastructure and your application code. So they can't call

Hamel Husain 33:33 they can't perform rag, they can't call tools, they can't do all the things that your application is doing. And so, you know, you can't necessarily rely on that either. That's why you need this, like, integrated I forgot what I called it in the in the blog post thing. Called it, like, integrated prompting environment or something. Try to make up a name for Basically it's, you need to be able to play

Hamel Husain 33:55 with the prompt in your user facing application directly. Because that's the only pattern, at least that I've seen, that's worked reliably in terms of bringing the domain experts in. Yeah, I'll just add a couple of points here. First, of course the need for

Speaker 34:11 easy to, sort of easy to use human feedback is critical, And some of our, yeah, like to your point, Connor, some of our human feedback features, which go much beyond, you know, just offering like binary signals, thumbs up, thumbs down, and the ability to create your own sort of feedback, kinds of feedback becomes important. But to Hamil's other point about just

Speaker 34:40 managing the prompts and offering the subject matter experts the ability to tweak the prompts to interact with the app. I think engineering wise, the matter gets a little bit tricky, especially for more sophisticated applications like multi agents where things are not necessarily driven by one prompt, might have a series of prompts which are triggered one after the other, you don't have control over

Speaker 35:06 many of them, but more often than not, it is driven by a kind of a seed query, which is kind of the natural language interface to any GenAI app. So the engineering challenge kind of becomes how do you abstract the entire application and make it available in front of the user through a natural language interface, the user being the subject matter expert, not the developer,

Speaker 35:32 but being able to actually run the developer's app seamlessly, so that to the SME, it's all about, here's my input. I have pure knowledge about my input and the expected output, but all the machinery in the middle, you should be able to abstract out from me. So the trickiness kind of comes in the fact that, I guess, the challenges around how do you use our APIs and the SDKs and, of course, all the, you know, containerization

Speaker 36:01 technology to be able to kind of simulate version of the app, which may be a distributed app. It might be running on, you know, two different availability zones for that matter. It is just software. So I think that's where the the challenge comes, and we are kind of at a point where it's it's doable to simulate sort of singular monolithic applications and make this workflow available to the SME, but it gets challenging when the app

Speaker 36:28 itself becomes distributed, and that's where kind of the a lot of engineering

Hamel Husain 36:34 innovation is going. Yeah. It's really nontrivial. Like, have to be you have to think, like, you often can't, like, expose everything to the SME. You have to say, is there a high value thing I can expose? And you know, it just if anything else, it just helps give them intuition. So they don't think rag is a very abstract concept or prompt is even an abstract concept.

Hamel Husain 36:58 You know, you'll be surprised, like, how many people think prompt is an abstract concept because they say something in a meeting and the expectation is a developer's gonna write the prompt. That's the worst thing that can possibly happen. So whatever way possible you needed to get away from that. And so what I'd love to close the conversation with,

Conor Bronsdon 37:16 and Hamel, thank you again so much for joining us. It's been a distinct pleasure having you. It is just some advice. Like, what what would be your summation, your advice to a team that is looking to build their eval system, that is looking to

Hamel Husain 37:31 improve their AI product? What would you tell them? Yeah. So the biggest two kind of things that I can think of is like one, error analysis, also known as look at your data. It just it can yeah. It solves so many problems. I was like, maybe 90% of the whole evals process is is like looking at your data. Like, you find so much even before writing evals. Like, just find you'll just find so many bugs, so many things, opportunities for improvement, so on and so forth.

Hamel Husain 38:05 And then the next thing that I can think of that makes a huge difference is having the experimentation mindset. And this one you have to cultivate a little bit. There's some talks that I can point you to about how you might, you know, reframe your thinking. I mean, this is something that's innate to machine learning folks and data science folks is, like, you know, you don't have, like, this waterfall

Hamel Husain 38:31 chart of, like, how to build a machine learning system. Like, you have to you you have an idea of, different experiments you wanna try. You don't even know it's gonna work. But what you do have is a hypothesis of, like, hey. Like, this might work. This might not work. Let's try this. Let's look at this afterwards. And so you have to reorient a lot of things in order to do that. You have to kind of, you know, have a different language that you talk

Hamel Husain 39:01 about within your teams and sort of make sure that you're not don't have those rigid approaches when it comes to this. It's hard yeah. That's probably another podcast, but

Conor Bronsdon 39:16 those are my thoughts. I mean, we can definitely have you back for for another conversation, because I think there is so much more we can go into here. Ottin, how about you? Any closing thoughts from your side of the house? Yeah. I would say that, you know, erstwhile,

Speaker 39:28 before LLMs, AI was considered garbage in garbage out, and now with LLMs, AI has become software, so software three point zero is AI, and now software is garbage in garbage out. So to Hamil's point, do look at your data because of garbage in garbage out. And secondly, I would say that there's three specific things that I've learned as kind of the layers of AI reliability.

Speaker 39:56 The bottom most layer is the kind of the brass tacks, set up basic monitoring traceability. That's just stuff that we've solved before AI happened. Traditional observability is a partially solved problem, and there are certain things that are done well there, adopt those practices. The second layer of the three layers is set up your prompts and your metrics and consider them as your evaluation assets, they're your first class citizens,

Speaker 40:26 they will evolve over time, have disciplined versioning, lineage around them, set up a good system there. And the third is the insights layer, which is the whole qualitative insights turned into customized quantitative insights. So if you practice these three things and kind of consider them the three pillars of your AI reliability, you've built a good three sixty evaluations and observability layer

Hamel Husain 40:51 in your software. And I'll add one more thing, is take my course. So it's a shameless plug, take the Evals course. It'd be a good way to learn about how to get set up with the Evals.

Conor Bronsdon 41:03 And I'll second that and say also check out Hamill on X and on LinkedIn, he shares a lot of fantastic content. We will certainly link both those, in the show notes, yeah, Hamel's blog as well is is a great place to to go learn. Hamel, thank you so much for joining us on the show. It's been a pleasure. Yeah. Thank you. And to our listeners, if you want more fantastic content,

Conor Bronsdon 41:28 from Hamil and many other thought leaders, make sure you subscribe to the podcast because we share information from industry experts, perspectives from AI luminaries, and hot takes, plus much more both in the podcasting app of your choice and on YouTube. So whether you wanna watch the conversation, listen in, or check out any of our other content here from Galileo, you can find us all over the Internet. We appreciate your support, and Hamel Ottin, thank you again for joining me today. Thank you. Thank you.