S2 E11

How DeepSeek Changed the AI Race Overnight

Feb 5, 2025 · Atindriyo Sanyal, Galileo · 33 min

AI Agents · Open Source AI · RAG & Retrieval · AI Evaluation & Reliability · AI Hardware

Key takeaways

Atin argued R1 was no secret sauce: DeepSeek leaned on the standard methodologies already used to train the larger OpenAI and Anthropic models, pushing distillation and reinforcement learning hard enough to train a very large model at “like 95% less cost.” The novelty, he said, was efficiency — not some technique no one had seen before. A very large base model still sat underneath it all (the V3 base, trained on 15 trillion tokens), so the cost collapse rode on the scaling work the giants had already done.
On the cost curve, Atin predicted that within “twelve to eighteen months, someone should be able to train a similar model just on their laptop.” He pointed to cheaper models already built on top of DeepSeek — including, by his own double hedge, a Berkeley student project he “already heard” about that “what I heard was $500” — as the start of a pattern of ever-cheaper foundation models.
Atin pushed back on the hype that R1 was “some kind of all knowing all encompassing thing.” He framed it as a reasoning experiment: reinforcement learning plus auto-generated chain-of-thought data to produce high-quality reasoning models — which, by design, left it “not that great at some non reasoning based tasks.”
On policy, Atin said export controls “just stifle innovation.” He described AI progress as “symbiotic” across China, India, Europe, and the US — citing how work at Stanford leans on a Tsinghua reference that leans on an Indian university — and argued innovation and technology should be kept separate from political discussions.
On agent evaluation, Atin stressed there is “no one metric that can tell you agent quality.” An agent is a directed acyclic graph of LLM calls, tools, and vector lookups, and “one mistake in any part of this workflow” compounds downstream. His answer paired out-of-the-box action-advancement and task-completion metrics with custom ones teams define themselves.
Atin said the customizations teams actually build are “very simplistic functions” — small Python, Golang, or TypeScript checks needing “little to no compute” and no GPUs. The real value, in his telling, is the management layer that lets a team register a metric, share it, and track its lineage as it evolves.
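
To make the last takeaway concrete, here is a minimal sketch of the kind of "very simplistic" check described: a string-presence test and a sort-order test, stored in a tiny in-memory registry. The registry, the decorator, and the function names are hypothetical illustrations and are not Galileo's API. Reading "whether a graph was sorted correctly" as a topological-order check is also an assumption.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical in-memory registry (not Galileo's API). A real metrics
# platform would also track owners, versions, and lineage.
METRICS: Dict[str, Callable[..., bool]] = {}


def register_metric(name: str) -> Callable:
    """Decorator that stores a metric function under a name."""
    def decorator(fn: Callable[..., bool]) -> Callable[..., bool]:
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("contains_string")
def contains_string(output: str, needle: str) -> bool:
    """True if `needle` appears anywhere in the agent's text output."""
    return needle in output


@register_metric("is_topological_order")
def is_topological_order(
    order: Sequence[Hashable],
    edges: List[Tuple[Hashable, Hashable]],
) -> bool:
    """True if every edge (u, v) places u before v in `order`."""
    position = {node: i for i, node in enumerate(order)}
    return all(
        u in position and v in position and position[u] < position[v]
        for u, v in edges
    )
```

Both checks are plain CPU-bound functions, which matches the "little to no compute" point. For example, `METRICS["contains_string"]("Your balance is $42.", "balance")` evaluates to `True`.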

Frequently asked questions

How much cheaper did DeepSeek claim to train R1?: Atin put the saving at “like 95% less cost” versus earlier techniques, achieved by leaning harder on distillation and reinforcement learning. He was careful to note a very large base model still sat underneath — the V3 base, trained on 15 trillion tokens — so the savings built on the giants’ prior scaling work rather than replacing it.
Did DeepSeek invent new techniques, or reuse existing ones?: Existing ones, per Atin. He said “the methodology is not anything that’s new,” describing the same standard methodologies used to train the larger OpenAI and Anthropic models. The innovation, he argued, was in how DeepSeek put the building blocks together to hit very low cost — efficiency, not a brand-new recipe.
What was Atin’s view on export controls?: He took what he called a simplistic take: these kinds of controls “just stifle innovation.” He saw AI as a symbiotic, intricately connected system across countries and argued that innovation and technology should be kept separate from political discussions, warning the US would impose a “massive penalty” on itself with such restrictions.
Why is there no single metric for agent quality?: Because, Atin explained, an agent is a directed acyclic graph of components — LLM calls, tool and function calls, vector-store lookups — and “one mistake in any part of this workflow” compounds into large downstream errors. Quality depends on which components a given agent has, so the right metrics vary system to system.
What kinds of custom evals do teams actually build?: Mostly “very simplistic functions,” Atin said — small Python, Golang, or TypeScript checks like whether a string appears in a blob of text or whether a graph was sorted correctly. They need “little to no compute” and no GPUs; the harder, more valuable part is the management layer for registering, sharing, and tracking the lineage of those metrics.
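
The compounding point in the agent-quality answer above can be illustrated with simple arithmetic. This sketch assumes each step of a workflow succeeds independently, which is a simplification, and the 95% figure is an illustrative number that does not come from the episode.

```python
import math
from typing import Sequence


def end_to_end_success(step_success: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_success)


# Illustrative numbers only: five steps, each correct 95% of the time.
steps = [0.95] * 5
overall = end_to_end_success(steps)
print(f"{overall:.4f}")  # prints 0.7738
```

Under these assumptions, a pipeline of five individually reliable steps is right only about 77% of the time end to end, which is one way to see why per-component metrics matter.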

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

Reasoning Models · Chain-of-Thought Prompting · Test-Time Compute · AI Agent · Foundation Model · AI Evaluation · Retrieval-Augmented Generation (RAG) · Knowledge Distillation · Vector Database · Tokenization

Chapters

02:09 DeepSeek's Impact and Innovations
03:43 Open Source AI and Industry Implications
13:44 Export Controls and Global AI Competition
18:55 Software as a Service
19:29 Agentic Evaluations
25:14 Metrics for Success
31:34 Conclusion and Farewell

Show notes

This week, hosts Conor Bronsdon and Atindriyo Sanyal discuss the fallout from DeepSeek's groundbreaking R1 model, its impact on the open-source AI landscape, and how its release will impact model development moving forward. They also discuss what effect (if any) export controls have had on AI innovation and whether we’re witnessing the rise of “Agents as a Service”.

To tackle the increasing complexity of agentic systems, Conor and Atin highlight the need for robust evaluation frameworks, discuss the challenges of measuring agent performance, and explain how Galileo's recently launched agentic evaluations are empowering developers to build safer and more effective AI agents.

Chapters: 00:00 Introduction

Check out Galileo

Try Galileo

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Show Notes

On DeepSeek and Export Controls

Introducing Agentic Evaluations

Transcript

Speaker 0:00 We've heard Satya Nadella and others talking about the death of SaaS and agents as a service. I guess you'd call that ASS, but maybe that's not the right acronym for us to be using. We are back on Chain of Thought. I'm Conor Bronsdon, head of developer awareness at Galileo. And joining me today is Galileo CTO, Atindriyo Sanyal. Atin, great to have you back on the podcast.

Speaker 0:30 Hey, Conor. Yeah. It's great to be back here. Absolutely. You always add invaluable insights in every episode you're able to appear on, so this should be a lot of fun, especially as we have a lot to dig into. There's so much happening in AI right now between DeepSeek and Alibaba's new models, our own agentic evaluations, so many capabilities around agents with, you know, Goose and Operator and all these different systems.

Speaker 0:52 So, yeah, you're the perfect person to have on here to dive into more about what developers and engineering leaders need to be paying attention to around AI today, especially when we have so much incredible AI news happening. This last week has been a firestorm. The story that really lit the AI world on fire was obviously DeepSeek, and the launch of their new R1 and R1-Zero models

Speaker 1:14 led to massive conversations, some of which I think is spot on and we'll discuss and some of which maybe is a little far flung. And they obviously made it to the number one spot in the App Store, but that wasn't the only piece of the conversation. It wasn't only limited to model efficiency or scaling. Export controls started to become a topic of conversation once again and much more as the potential end of US AI dominance

Speaker 1:41 was a key topic of discussion. And this was followed rapidly about a week later by fellow Chinese tech company Alibaba dropping their Qwen 2.5 model, which they claim outpaces DeepSeek R1. There's so much we could talk about here. There's agents and the movement happening there with, you know, multi-agentic systems. But, Atin, let's just start by kind of laying some groundwork and saying, what's your take on the massive reaction, especially in the markets,

Speaker 2:08 to these new models? I think my take is it's not surprising in the sense that we all expected open source and, you know, work that's going on outside the big tech of AI to eventually take center stage. This is kind of basically what's happened. And it is very interesting to see the kind of model that has come out of DeepSeek. There's three specific models that have come out, the R1 in particular,

Speaker 2:36 I think the techniques that have been used to eventually get to R1, they're not surprising in the sense that there's some secret sauce that no one knows about. They've used the fundamentals of AI and to some degree, they've pretty much used the standard methodologies which have been used to train some of the larger models that have come out from OpenAI and Anthropic and others. But they've made some very interesting innovations

Speaker 3:04 to be able to do it at very low cost, which is really good to see. In a way, it's kind of like the secret sauce. Certain secret sauces are out now and they're kind of in the hands of anyone who can sort of, you know, follow that recipe and build a very large model, or rather a largely intelligent model, but do it at very low cost. So it's awesome, and I think this will just lead to a further proliferation of great open source models which will come out, and it'll be interesting to see how the industry adopts it. Absolutely agree, particularly around the open source piece. We just had a conversation about open source AI last week focused on agents.

Speaker 3:49 And you have been, I think, very cognizant of this, talking about the opportunity for open source AI in 2025 back during our predictions episodes a month or two back for folks who listen to season one. So kudos to you for already having some of your predictions come true here. As we've seen these open source AI models built off of the work that's already been done with, you know, OpenAI and other models, we're seeing this creation happen in real time. We're seeing open source catch up.

Speaker 4:17 But to the point of many folks like Dario over at Anthropic, this is also an example of where the scaling law work has already kind of been done by these giants and a lot of the kind of push to get the model to a certain level of intelligence has happened. And now we're seeing the catch up of this cost factor where the cost is rapidly going down to get an amount of inference. I think that's really exciting for the opportunity to have AI involved in a variety of applications.

Speaker 4:50 Do you feel like this unlocks kind of a next layer of AI where it could be, you know, more often being leveraged at a local level or we could see it proliferate more rapidly here? I think in context of the whole DeepSeek moment, I think the one main conclusion or outcome of this is that we are able to leverage certain techniques that have been used in training some of these models,

Speaker 5:21 particularly distillation and reinforcement learning, to a much higher degree in training some of these larger foundation models. And that's really the one main outcome of this: you can train a very big model at like 95% less cost than what it was using some of the erstwhile techniques. But the methodology is not anything that's new per se. The innovation is, I think, in how they've kind of put the, you know, building blocks together.

Speaker 5:54 So as far as industry usage is concerned, I'm excited to see more proliferation of smarter models, reasoning based models, reinforcement-learned models come out and people leveraging just the reinforcement learning paradigm to sort of build applications that kind of automatically learn over time as opposed to just relying on, you know, some pre-trained data beforehand.

Speaker 6:23 So there'll be some tweaks and changes, I think for the better, at the application layer. Fundamentally, I think there'll be an availability of equally good reasoning based models for much cheaper. And at some point, I think in the next, you know, twelve to eighteen months, someone should be able to train a similar model just on their laptop. In fact,

Speaker 6:46 I already heard there's a couple of these cheaper models that have come out on top of DeepSeek who have further optimized it. There was one launch from a lab at Berkeley. A few students created something for what I heard was $500, but this is just the beginning of a pattern that I'm seeing where, you know, foundational models will be trained at cheaper and cheaper cost

Speaker 7:10 and some of the toolscape will certainly change to be more reinforcement learning and distillation oriented, which are some of the new discoveries, if you will, or pieces of knowledge that have come out. There's still a need for a large base model. Like, if you do go deeper into, you know, how DeepSeek really did it, there's still a very large base model, which was the V3 base, and that's a very large foundation model that was trained on 15 trillion tokens.

Speaker 7:42 So in some ways it's not a surprise. Like, it's not a shape shifting moment in the sense that, you know, something that we did not know about AI has happened. It's just that there's some nuances in the changes that have been brought into kind of the instruction tuning and preference tuning paradigm, which was largely believed to be, you know, sort of the recipe for creating smart models like o1 and the GPT models.

Speaker 8:09 They have kind of used reinforcement learning to make equally smart models with very, very low need for fine-tuned data, which is kind of the biggest impediment in training these models. One of the intermediate models that was created was actually used to generate chain of thought data. So they were using techniques that we also use, like, as an evaluations and observability company. We certainly rely

Speaker 8:34 on techniques like chain of thought. Of course, the name of the podcast is the same. So that term is not going away. In fact, now it's more about how you can use reasoning to auto generate step by step thinking data, which is more for reasoning based tasks and which is what DeepSeek did, and that was what was different from the way some of the GPT models were trained. And it's exorbitantly costly to label this kind of data and,

Speaker 9:02 of course, there's labeling platforms that help, but it's a very time consuming process. And you can use these kind of generative models to generate, you know, high quality reasoning data, which eventually leads to high quality reasoning models. Yeah. There's no question that DeepSeek has done some incredible stuff around efficiency and model training efficiency in particular.

Speaker 9:24 But in a lot of ways, this is an example of the shifting cost curve for models where as models are kind of catching up to breakout models, it costs a lot less to train them. We have these efficiency gains that are regularly coming up. Whereas it feels like a lot of the hot takes around this were, oh, this is a complete paradigm shift. This is a total breakthrough.

Speaker 9:46 And I'm not trying to discredit the work DeepSeek's done. It's really fantastic work, but it does feel like this is a situation where certain folks have seized on it in the media and in online circles and have maybe added their own narratives to what is actually happening. Well, I fully agree with you. In fact, I would credit companies like DeepSeek to sort of open source these models and these techniques so that it's essentially

Speaker 10:12 available for anyone to use. It's a methodology, so there is nothing proprietary that should be about it and anyone should be able to train these models. And it's all about efficiency, right? Like, we're all aiming towards more and more efficient ways not only to train these models but also to use them and, you know, build cost effective ways to scale Gen AI applications. So we're all kind of solving the same sort of problem. They just did it at the foundational model layer and

Speaker 10:41 that's just how open source should work and that's how the AI community should be, right? The people are, you know, taking this to various political levels and all this is to say that the recipe is out there and there will be constant innovation around the world I would say not only in China or The United States but any developer sitting anywhere can sort of employ this and use this in their Gen AI systems, and that's what's exciting to me. Are there any particular learnings that you're taking from what DeepSeek and Alibaba have done where you're saying, oh, these are the things I'm particularly paying attention to? I think the main thing that came out of my understanding of all that has transpired in the last two weeks or so is really

Speaker 11:28 there's many sort of pieces of the puzzle that you can kind of stitch together to build something that's generalizable and that's smart. In this case, of course, reinforcement learning was one of their main ingredients, along with chain of thought and distillation. So for me it was just exciting to see that, hey, we know all of these techniques exist and they've been employed in some form or another in the last two to three years to produce very smart models

Speaker 12:02 and there's always a price you pay. I think the other thing that people are maybe blowing out of proportion is that DeepSeek's model is some kind of all knowing all encompassing thing that will solve, you know, all of Gen AI and will render all the other companies useless. That's obviously not the case at all. In fact, these models are, it's almost like an experiment that they've done to show how you can get very good quality

Speaker 12:32 reasoning based models. So the whole topic is around reasoning and, you know, that makes them not that great at some non reasoning based tasks. And it has been shown, right? They have just built something great and published about it, and kudos to them for doing that. And it's important for us to not misinterpret it for what it is, and the focus, I think, should be on the good, novel things that have come out of it so that we can sort of take it to the next level or use them even as we are building these GenAI applications.

Speaker 13:05 Completely agree. There's this massive opportunity for open source to influence positively both the accessibility of AI and models, the utilization, the advancement. And I think this is a great example of we can have this great opportunity to learn from this company that is doing cutting edge research, that has made innovations based off of existing techniques. If we let this competition become only about the competitive nature,

Speaker 13:33 we will move more slowly, and we'll have negative externalities that come out of this. That said, there is, I think, an important competitive conversation that has started to bubble back up about export controls on NVIDIA chips to China. There were accusations as soon as DeepSeek released R1 that, oh, this was trained with all these smuggled chips. When you really got down to it, it looks like, oh, this is trained off of chips that were previously exported before they were banned, or these are trained off of older chips. And this is much more about that change in the curve of

Speaker 14:11 the cost curve based off of all these other models kind of paving the way before them. And obviously, these innovations that we're talking about here. How do you think the release of these new models, particularly with another Chinese company in Alibaba following rapidly behind, impacts the case for these export control policies on chips that are obviously a very hot topic now? You know, look, I have a pretty simplistic take on

Speaker 14:39 at least, you know, the non technology slash political side of things: these kind of things just stifle innovation. That's just the nature of it. And it's hard to sort of root cause exactly what causes the loss of innovation, but it's such an intricately connected world, right? And, you know, when you put these kind of curbs and impediments on foundational technology,

Speaker 15:08 just take the DeepSeek example, right? Like, they have published about a better way to train a foundation model at cheaper cost, and that will lead to so many more companies coming out both in China and the United States and around the world. And, you know, to imagine that they wouldn't have resources to be able to publish about it, and, you know, we all wouldn't even know what we would lose

Speaker 15:33 because of these kind of, you know, policies if you put them in place. So I think as far as innovation goes, you know, you should leave innovation and technology on one side and keep it separate from political discussions. You know, in some sense, that's what diplomacy should all be about. In the next few years we'll see tremendous benefits of innovation that comes out of universities and industries

Speaker 16:02 in all different countries whether it's China, India, countries in Europe, Australia. Everyone's innovating in AI and everyone has it's a symbiotic relationship that we all have. And for The United States, I would think that they're kind of at the forefront of a lot of innovation they've historically been. So it would be a massive penalty that they would sort of impose on themselves by putting these kind of restrictions

Speaker 16:30 because as I said, it's very symbiotic, and the fact that the open source community is, you know, so well connected and there's a lot of knowledge sharing and a lot of GitHub repositories with open models, it's really great to see, and, you know, that's what can accelerate innovation. And, you know, it's such a complicated and intricate system that if you put an impediment on chips, which is at such a foundational level,

Speaker 16:56 it will stifle innovation at all layers above. Yeah. And I think we often don't reckon with those unintended consequences when we focus on the intended ones of, oh, we wanna be ahead here in AI because we view it as this transformational super important technology, which it is. But the challenges it'll cause around the open source community, around global knowledge sharing are myriad. And I'm with you. I worry that

Speaker 17:23 the discussions around things like imposing tariffs on chips from Taiwan and, you know, increasing import export controls are really going to distract from the inherent technological benefits. And, you know, I understand that there's this geopolitical element where, you know, obviously, it's an important technology for, you know, major countries, but I do worry that we're going to set ourselves back as an industry if we lean too far in that direction.

Speaker 17:52 Absolutely. I mean, because it's never simple cause and effect. Right? Like, anything that has to do with something that's global, especially like technology and open source technology, it's never, you know, first degree cause and effect, where you do this and you get that. It's never like that. There's so much interdependence we have on each other and, you know, some innovation that happens at Stanford

Speaker 18:14 relies on a paper reference they do from Tsinghua University in China, which references some other university in India. It's all interrelated and no one really knows how the graph really works. And it applies to most things, right? Like, especially in order for us to move technology forward, we need to just get rid of all impediments and, you know, get rid of all

Speaker 18:41 differences, whether at the political level or otherwise, and let things just flow freely. That's the only way real innovation can happen. We've heard Satya Nadella and others talking about the death of SaaS and agents as a service. I guess you'd call it ASS, but maybe that's not the right acronym for us to be using. And because of, you know, this conversation around agents, we've seen

Speaker 19:07 agents start to be deployed in a real world context. We've seen agents actually make an impact for customer service. We've seen whole teams that are changing how they function because of these new digital employees that are essentially being added to the team. And Galileo recently announced new agentic evaluations for every company building AI agents or leveraging them. Can you talk a bit about how these new agentic evaluations increase AI agent safety and alignment? Absolutely.

Speaker 19:37 I think it's one of my favorite topics lately, primarily because we've had the privilege to work with a lot of cutting edge teams who are building pretty advanced agents and for us to be sort of the eval layer for it. A lot of lessons learned but high level I think the distinction between say a non agentic system and agentic system I've realized is that an agentic system there's a bunch of extra components that have come into the mix as opposed to

Speaker 20:06 say just a simple call to an LLM, which is a question answer response, or even RAG with a vector store thrown in. Agents kind of get a little bit more complicated in terms of there's additional components which are really stitched together as like a directed acyclic graph or a DAG, and when you execute it then the nodes of the, you know, DAG kind of execute and each node

Speaker 20:29 could represent an LLM call, a call to a function or a tool, or could even be a database lookup like a vector store lookup. But all in all it's kind of like this end to end system which is doing more than just making LLM calls and trying to actually achieve something or do an action. So I think we kind of started with this definition and aligned with many of our customers. The most intuitive

Speaker 20:56 idea from an evaluation standpoint was that is there a measure of completion that we can surface to the user to essentially guide the user to know if the agent is sort of, you know, on the right track, whether it's a financial chat bot, which is chatting with the user, is the user's goal being you know are we moving towards it so that's kind of one like a notion of task completion or objective achievement was one of the first things.
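
As a rough illustration of the DAG framing described above, the sketch below runs three placeholder nodes in dependency order and records a trace that a step-level metric could inspect. The node names, the `state` dictionary, and `run_agent` are hypothetical stand-ins for real vector lookups, LLM calls, and tool calls, and do not reflect any particular framework.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, Set

State = Dict[str, Any]


def retrieve(state: State) -> None:
    # Stand-in for a vector store lookup.
    state["context"] = "placeholder context for: " + state["question"]


def draft(state: State) -> None:
    # Stand-in for an LLM call that uses the retrieved context.
    state["draft"] = "Based on " + state["context"]


def finalize(state: State) -> None:
    # Stand-in for a deterministic formatting tool.
    state["answer"] = state["draft"].strip() + "."


NODES: Dict[str, Callable[[State], None]] = {
    "retrieve": retrieve,
    "draft": draft,
    "finalize": finalize,
}

# Each node maps to the set of nodes it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "retrieve": set(),
    "draft": {"retrieve"},
    "finalize": {"draft"},
}


def run_agent(question: str) -> State:
    """Execute every node once, in an order that respects dependencies."""
    state: State = {"question": question, "trace": []}
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        NODES[name](state)
        state["trace"].append(name)  # the trace is what evals would read
    return state
```

The recorded `trace` is the hook for evaluation: a step counter, a rerun counter, or a per-node check can all be computed from it without touching the nodes themselves.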

Speaker 21:28 Some of the other things that we realized as some of these agentic systems matured: number one, there's no one size fits all. There's no one metric that can tell you agent quality. It really depends on the different components that you have in your agentic system. Just to be more concrete, say you have an agent which is like a financial chatbot advisor, just to take that example. Say you have a RAG system in there that does vector store lookups.

Speaker 21:55 Before you make an LLM call, you have a couple of tools that can make maybe a few API calls. What you really want to, you know, sort of be aware of is that one mistake in any part of this workflow can lead to massive errors downstream. The errors kind of compound. And this is something that we've known just principally in ML, or machine learning: that, you know, one error kind of compounds downstream, and by the point when it meets the user, it's blown out of proportion and it's hard to root cause. I think those same simple parallels kind of apply to agentic systems where, you know, say you have an agentic system that does reruns

Speaker 22:37 of calls to LLMs just to fine tune its response and one of the reruns hallucinates. So that one error in, you know, the intermediate output can lead to completely wrong outputs of the agent. And in order to solve that, we realized there's a need for a customization layer on top of a platform like Galileo, which allows you to really, like, almost define from scratch what you really care about. Say as a developer you care about the number of steps that the

Speaker 23:13 agent has taken till now and you want some kind of real time tracking of that, or the number of reruns or recalls that have been made, or you just want to do simple embedding similarity from the, you know, RAG lookup to measure RAG quality, or, you know, similarity checks. It can be anything. So one of the innovations we've done on Galileo to support more complex agentic systems is we offer certain out of the box metrics that give you some high level of understanding around task completion and agent advancement, but we literally allow you to do two things. Number one, create

Speaker 23:52 your own agentic metric which you care about, and it can be sort of an LLM-as-a-judge based metric, or it could be a function that you've written in Python or Golang or TypeScript, and you register it to the metrics platform and we kind of apply it to the parts of the agents that you deem relevant. And I think it's the combination of this that kind of gives you like a set of evaluation

Speaker 24:18 criteria that you want to apply to your agentic system that allows you to kind of high level monitor the health of your agent. And the combination of all this is what has been, I feel, very useful to our customers. And I think it's really exciting to see both these out of the box opportunities and the customization because we know that every AI application has a potential to be deeply different,

Speaker 24:48 and we know every agent may have different things that they're trying to achieve. But it's also important to have a framework in place where you can understand from, you know, zero to one, how is the success of this agent, or how is this agent functioning? What's the success metrics here? In in your mind, what are the specific metrics folks should be looking at out of the box before they get into fine tuned specific metrics that they're developing themselves?

Speaker 25:14 Yeah. I think the standard agentic systems that have, you know, been built at least across the the slew of customers that we have typically involve a few default components. A vector store is certainly one which is kind of become the mainstay of any kind of GenAI system. Then there's the LLMs themselves along with their memory. And then there's a suite of tools that usually folks register

Speaker 25:42 to the agentic system to essentially make it behave in a slightly more deterministic manner. One example I would give you is that a lot of people have used agents and LLMs to, say, traverse complex JSON graphs or JSON objects, and it can get pretty hairy. A lot of data can potentially be represented as key-value pairs in JSON, but then to be able to extract specific keys from the JSON,

Speaker 26:12 you can build a fairly simple agent that employs LLM calls along with memory, but also tools that help you parse JSON in a more deterministic manner. I thought of this example based on one of the companies I advise; they essentially represent the world that they operate in as graphs. So there's a lot of tooling that they apply to their agentic systems that allows them to parse graphs in a more deterministic way.
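
[Editor's illustration] The kind of deterministic JSON tool described above might look roughly like the following. This is only a minimal sketch: the function names `find_key` and `extract_key_tool` and the sample document are hypothetical and are not taken from the episode or from any Galileo API. The idea is that the LLM decides which key it needs, while the lookup itself is plain, repeatable code.

```python
import json
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth, in document order."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may also appear deeper down.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def extract_key_tool(raw_json: str, key: str) -> list:
    """Hypothetical tool entry point: JSON text in, list of matching values out."""
    return list(find_key(json.loads(raw_json), key))


# Hypothetical sample payload, used only for illustration.
doc = '{"order": {"id": 7, "items": [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}]}}'
print(extract_key_tool(doc, "sku"))  # ['A', 'B']
print(extract_key_tool(doc, "id"))   # [7, 1, 2]
```

Because the traversal is ordinary code, the same input always yields the same output, which is the "more deterministic" behavior that registering a tool is meant to buy.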

Speaker 26:44 So as far as evaluation goes, again, there's the high-level notion of whether the agent is proceeding towards the goal, because every agentic system has a goal. The whole point of agents is that they do stuff rather than just, you know, spew out tokens. So there's always a goal. So there's a high-level notion of whether, as either the conversation is happening or the graph traversal is happening, the agent is moving towards it. That's action advancement, and that's certainly something Galileo offers out of the box. We've battle-tested it with many customers and then some, so we can assure you that those kinds of default metrics are very high quality. And then there's the customization piece that I was talking about, and honestly a lot of customization

Speaker 27:28 we've seen are very simplistic functions. They're not boiling the ocean or some kind of advanced thing that can be built in a lab. They're simple Python functions that take text input and text output, and you want to do basic things like check whether a string is present in a blob of text. Or, where the agent takes an action to sort the nodes of a graph in some form, you want a measure of whether the sorting was correct or not. So very basic things. So these are very little functions which require

Speaker 27:57 little to no compute. You can literally run them on any kind of server. You don't need GPUs or any sophisticated hardware. But it's really the management layer, where a user is allowed to define a metric they care about, register it to the platform, then share it with the rest of the team who are also potentially contributing to building the agent, and then sort of

Speaker 28:24 track lineage, because these functions will evolve over time, that becomes critical, is what we've seen. If you have a good, cohesive management layer around the set of scores or metrics that contribute to agent quality, I think that goes a much longer way than saying, hey, I provide three very advanced agentic evaluation metrics that have been built in a lab somewhere. Based on these early successes,

Speaker 28:53 what are your thoughts on the future of agentic evaluations and how they might evolve alongside advancements in these very same agents' capabilities? It'll be interesting to see where agentic evals go, primarily from the point of view that, number one, models are getting cheaper, they're getting better, they're getting smarter. So, you know, just out of the box, the baseline LLM that you would use would

Speaker 29:19 potentially produce much higher quality output. That might shift the focus to evaluating certain non-LLM parts of the system. That's certainly one area where you'll have to provide metrics to measure the ancillary pieces of your agents beyond just the LLMs. And then there's newer tools and newer software components that will come in this

Speaker 29:48 ever-evolving toolscape. Newer orchestration systems, newer libraries. That's why I think a platform play is a lot more adaptive, even from an agent evals perspective, because the principle would still hold that you wouldn't need very advanced out-of-the-box metrics that just work and are generalizable. It'll be very interesting to see how much you can take a default metric and make it generalizable

Speaker 30:18 and have it evolve and adapt to your data. That's certainly something that Galileo has worked a ton on in its metrics platform, providing features like continuous learning with human feedback, really incorporating the element of human feedback, which is the other aspect. I think there will always be a need for high-efficacy human feedback in the loop, because

Speaker 30:42 agentic systems will get more and more complicated as the months roll by. So it's much less likely that evaluation will go down the lane of "here's a metric that works everywhere." Rather, what we'll see is that what wins is "here's a platform that allows you to create and share and manage your metrics, apply them to the flavor of agentic system that you've built, and apply them to any part of the system." So those are the things. I think we're moving more towards

Speaker 31:15 customized sort of platform plays as far as agent evals are concerned, and that, combined with the fact that models are getting better and the cost of intelligence is reducing, means it'll be interesting to see where agentic evals fall, say, twelve months from now. Absolutely. And for folks who wanna follow along and learn more about how we're thinking through agentic evals, check out galileo.ai.
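
[Editor's illustration] To make the discussion of custom metrics and the management layer concrete, here is a rough sketch of the two "very simplistic" checks mentioned in the conversation (string presence and sort correctness), plus a toy in-memory registry that keeps every version of a metric so its lineage can be traced. Everything here is an assumption made for illustration: `MetricRegistry`, `contains_string`, `is_sorted_metric`, the text-in, score-out signature, and the comma-separated node format are hypothetical and do not represent Galileo's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumed shape of a custom metric: (input text, output text) -> score.
Metric = Callable[[str, str], float]


def contains_string(needle: str) -> Metric:
    """Build a metric that scores 1.0 if `needle` appears in the output text."""
    def metric(_input: str, output: str) -> float:
        return 1.0 if needle in output else 0.0
    return metric


def is_sorted_metric(_input: str, output: str) -> float:
    """Score 1.0 if the comma-separated node ids in `output` are ascending."""
    nodes = [int(tok) for tok in output.split(",") if tok.strip()]
    return 1.0 if nodes == sorted(nodes) else 0.0


@dataclass
class MetricRegistry:
    """Hypothetical registry: stores every version of each named metric."""
    _versions: Dict[str, List[Metric]] = field(default_factory=dict)

    def register(self, name: str, fn: Metric) -> int:
        """Add a new version of `name` and return its 1-based version number."""
        self._versions.setdefault(name, []).append(fn)
        return len(self._versions[name])

    def score(self, name: str, input_text: str, output_text: str,
              version: Optional[int] = None) -> float:
        """Run the latest version of `name`, or a specific older one."""
        fns = self._versions[name]
        fn = fns[-1] if version is None else fns[version - 1]
        return fn(input_text, output_text)


registry = MetricRegistry()
registry.register("mentions_refund", contains_string("refund"))  # version 1
registry.register("mentions_refund", contains_string("Refund"))  # version 2
registry.register("nodes_sorted", is_sorted_metric)

reply = "Your refund is on its way"
print(registry.score("mentions_refund", "", reply))             # 0.0 (latest, case-sensitive "Refund")
print(registry.score("mentions_refund", "", reply, version=1))  # 1.0
print(registry.score("nodes_sorted", "", "1, 2, 5"))            # 1.0
```

None of this needs a GPU; the interesting part is the bookkeeping. Keeping old versions around means a score recorded last month can still be reproduced after the team changes the metric's definition.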

Speaker 31:41 Atin, it was a pleasure to chat with you today. We'll be counting on your continued and valuable advice to help us navigate the hectic AI world in 2025. Thank you so much. Thank you so much. And, yeah, it was a pleasure chatting. And we'll make sure for those listening that everything we talked about today is in the show notes, including the articles we referenced.

Speaker 32:00 You can also find out where to follow Atin there, as well as myself and Galileo, so you can stay up to date on all the news happening here, especially as we continue to roll out awesome new features like agentic evaluations for the Galileo platform. Thanks for listening, everyone. And if you're not already watching the show, you can check it out on Galileo's YouTube channel. We'll see you next week.

Speaker 32:30 We've heard Satya Nadella and others talking about the death of SaaS. You know what? Just gonna have to redo it. That's I I can't I I SaaS. I know I can say it. The the the the death of SaaS.