Episodes · S2 E9

AI in 2025: Agents & The Rise of Evaluation Driven Development

Jan 15, 2025 · Yash Sheth, Galileo, Atindriyo Sanyal, Galileo · 33 min

AI Agents · AI Evaluation & Reliability · AI Engineering


Key takeaways

Atin Sanyal’s hot take: “in the next three to five years, every piece of software that is built on this planet will have some sort of AI baked into it.” He argues AI software development is “back to square zero” — “there’s no eval tooling,” and people are using “caveman tools” to “just look at vibe checks and eyeballs.” Conor pushes back that it’ll take longer and bets Atin a meal that he is wrong.
Yash Sheth’s one-word theme for 2025: “automation.” He argues the real ROI comes not from making software more conversational but from automating workflows across industries — LLMs that make API calls, execute code, and process multi-modal data (images, audio).
Atin notes that “25 percent of Google’s code is now code gen automated,” but says under a magnifying glass most of it is boilerplate — “a couple of for loops” and basic things. His excitement for 2025 is moving past boilerplate toward “better code understanding” that grasps the context around the code.
Atin argues code writing “is just probably 10 percent of a software engineer’s job” — design and other work make up the rest — so a human (or developer) in the loop stays needed. Comparing the path to autonomous driving, he sees engineers freed up to focus on “connecting boxes with arrows and building awesome systems.”
Yash explains that because agents make API calls, tool calls, and code execution, they cause “irreversible changes,” which raises the penalty for a wrong action. He describes Galileo Protect as a “control pane” for a generative AI application — likening it to a firewall — that aims to detect bad behavior and prevent it “within seconds.”
On cost and scale, Atin claims Galileo has “the only hallucination detection model and algorithm that literally works at zero dollars,” at low latency across “both P50 and P95.” Yash frames the constraint bluntly: “no one wants to double their OpenAI bills,” and no one wants an evaluation system that “takes 10 seconds to evaluate one prompt” when traffic runs to “thousands of QPS.”
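
The "P50 and P95" figures in the last takeaway are latency percentiles: the median request and the slow tail. A minimal sketch of how they might be computed from per-request evaluation timings follows. The helper name `latency_percentiles` and the sample values are made up for illustration and do not reflect Galileo's implementation or measurements.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return (p50, p95) for a list of per-request latencies in milliseconds.

    Hypothetical helper for illustration only.
    """
    # With n=100, quantiles() returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Made-up timings: most requests are fast, two are slow outliers.
samples = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]
p50, p95 = latency_percentiles(samples)
print(f"p50={p50} ms, p95={p95} ms")
```

A low P50 with a high P95, as in these made-up samples, is the pattern that hurts at high throughput: the typical request is fast, but the slow tail still blocks a meaningful share of traffic.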

Frequently asked questions

What did Yash and Atin predict for AI in 2025?: Yash Sheth’s one-word prediction was “automation” — real ROI coming from automating workflows across industries rather than just conversational software. Atin Sanyal predicted AI would find two kinds of fit: product market fit (real user and business benefit) and “product tool stack fit,” as the engineering and systems around the LLM mature. On code, Atin expected progress beyond boilerplate generation toward “better code understanding.”
Why do AI agents need stronger evaluation than chatbots?: Atin and Yash argue agents don’t just return text — they take actions through API calls, tool calling, and code execution, which means “the penalty you pay for the right action versus the wrong action is potentially much higher.” Because agents can make “irreversible changes,” they say you need robust evaluation that runs at scale and enforces good-behavior metrics in real time within an agentic flow.
What is Galileo Protect?: Yash describes Galileo Protect as a “control pane” for a generative AI application — he compares it to a firewall for security threats. As he frames it, the idea is to detect bad behavior, create a metric to catch it, and prevent the application from acting on it “within seconds.” These are the speakers’ descriptions of their own product, not independent claims.
How can AI evaluations be cheaper and more scalable?: Atin claims Galileo offers “the only hallucination detection model and algorithm that literally works at zero dollars,” with low latency across “both P50 and P95” so it works past a certain throughput. Yash frames the cost constraint directly: pushing past LLM-as-judge toward methods that scale to “thousands of QPS” at a price point that works, because “no one wants to double their OpenAI bills” or run an evaluation that “takes 10 seconds to evaluate one prompt.” These are the speakers’ claims about their own product.
What advice did they give businesses adopting AI in 2025?: Yash’s advice was to “start quantifying the behavior of your application into metrics early on,” establishing rigor before scaling rather than shipping a cool POC without it. Atin first joked “Get Galileo for your LLM evaluation needs,” then seconded Yash on rigor — comparing the moment to cloud adoption a decade ago, where security was the missing unlock, and arguing evaluation is the equivalent unlock for AI.
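
The idea of enforcing a metric in real time before an agent makes an irreversible change can be sketched as a simple gate around the tool call. This is an illustrative sketch only: `ToolCall`, `guarded_execute`, `toy_score`, and the 0.8 threshold are hypothetical names and values, not Galileo Protect's API.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict
    irreversible: bool  # e.g. a payment or a delete, as opposed to a read

def guarded_execute(call, score_action, execute, threshold=0.8):
    """Run execute(call) only if the metric clears the threshold.

    Hypothetical gate: reversible calls pass through unchecked in this sketch.
    """
    if call.irreversible:
        score = score_action(call)
        if score < threshold:
            # Block before the action happens; nothing needs to be undone.
            return {"status": "blocked", "score": score}
    return {"status": "executed", "result": execute(call)}

def toy_score(call):
    """Stand-in metric: treat large transfers as suspicious."""
    return 0.2 if call.args.get("amount", 0) > 1000 else 0.95

risky = ToolCall("wire_transfer", {"amount": 5000}, irreversible=True)
print(guarded_execute(risky, toy_score, lambda c: "sent"))
```

The design choice worth noting is that the check runs before the action rather than after it, which is what matters when the action cannot be rolled back.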

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

Accuracy · AI Agent · AI Evaluation · Latency · Model Drift · Vector Database · AI Hallucination · Human in the Loop · Retrieval-Augmented Generation (RAG) · Tool Use (Function Calling)

Chapters

02:55 Advancements in LLMs and Code Generation
05:16 Challenges and Opportunities in AI Development
10:40 Evaluating AI Agents and Applications
16:07 Building Evaluation Intelligence
23:41 Research Opportunities
29:50 Advice for Leveraging AI in 2025
32:00 Closing Remarks

Show notes

"In the next three to five years, every piece of software that is built on this planet will have some sort of AI baked into it." - Atin Sanyal

Chain of Thought is back for its second season, and this episode dives headfirst into the possibilities AI holds for 2025 and beyond. Join Conor Bronsdon as he chats with Galileo co-founders Yash Sheth (COO) and Atindriyo Sanyal (CTO) about major trends to look for this year. These include AI finding its product "tool stack" fit, decreasing generation latency, AI agents and their potential to revolutionize code generation and other industries, and the crucial role of robust evaluation tools in ensuring the responsible and effective deployment of these agents.

Yash and Atin also highlight Galileo's focus on building trust and security in AI applications through scalable evaluation intelligence. They emphasize the importance of quantifying application behavior, enforcing metrics in production, and adapting to the evolving needs of AI development.

Finally, they discuss Galileo's vision for the future and their active pursuit of partnerships in 2025 to contribute to a more reliable and trustworthy AI ecosystem.

Chapters: 00:00 AI Trends and Predictions for 2025

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon


Transcript


Atindriyo Sanyal ¶ In the next three to five years, every piece of software that is built on this planet will have some sort of AI baked into it. New era of software development, right? Like, this is a new paradigm, new ways to build software. And if you turn the clock back to the 1980s, how we used to build traditional software, we're kind of there within the new era of AI software: there's no eval tooling, there's no machinery around, you know, building a piece of software in a robust way. People are using caveman tools, you know, just to look at vibe checks and eyeballs. That's how potentially we used to do, you know, software development in the earliest days when software became a thing, so we're kind of back to square zero in a way.

Conor Bronsdon 1:00 It's a new year and we are back with a new season of the Chain of Thought podcast. Welcome to season two of Chain of Thought. I'm Conor Bronsdon, head of developer awareness at Galileo. And I'm delighted to be joined once again by my fellow hosts and the co-founders of Galileo. We've got two out of three here today: Yash Sheth, COO, and Atindriyo Sanyal, CTO. Atin, Yash, thanks so much for joining me today.

Yash Sheth ¶ Excited for the new year. Let's do it.

Conor Bronsdon ¶ Yeah, it's great to be here in 2025.

Conor Bronsdon ¶ I think there's so much excitement happening in AI coming out of this incredible growth year we saw in 2024. What do you expect the theme for 2025 to be with AI?

Yash Sheth 2:00 If one word comes to mind, it's automation. So far, we have been really leveraging this technology to answer questions, to make technology more conversational, and while that has been a great start to adopting language models as part of software, the real ROI is going to come from leveraging this technology to automate so many, so many workflows out there across industries.

Conor Bronsdon ¶ Atin, what's your take on that?

Atindriyo Sanyal ¶ Yeah, I think AI will find two kinds of fit in 2025. One is it will start getting towards product market fit, where we'll start seeing some user benefits and, you know, these LLM systems will start reaping some business results. But it will also start seeing product tool stack fit. What that means is we spent a lot of [2024], you know, putting early prototypes into production, but the tool stack was still evolving and still immature. A lot of libraries come in and go out. We'll start seeing a lot of the engineering and systems around the LLM starting to mature, which will lead into building more practical and better systems.

Conor Bronsdon ¶ Yeah.

Conor Bronsdon 3:00 Is there a particular advancement that you're anticipating, that you're thinking, Hey, we're seeing the early signs of this and this is going to happen later this year.

Atindriyo Sanyal ¶ I think, from my perspective, a lot of focus will be on getting LLMs to actually do work and take action, as opposed to just showing some generation that looks smart, which is where agents come in. A lot of the foundational models are also focusing on multimodality, so that's the other thing that'll sort of pick up. And it'll be interesting to see a hybrid system where, you know, the LLMs are taken to the next level: they start actually achieving goals for users, not only taking language as input but also images and audio.

Yash Sheth 4:00 Just one more thing I'd like to add there: the ability for LLMs to generate high quality outputs in a lower token count, basically, you know, to generate things faster, is going to improve drastically. The Gemini 2.0 Flash model is just a quick example that came out not too long ago, last year. And, you know, we're going to see a trend in that. As I mentioned earlier, and to Atin's point as well, if true automation is going to be unlocked and LLMs are going to be able to make, you know, API calls, execute code, and process large amounts of multimodal data, images, audio, etc., then generation latency has to come down, and we're already seeing that trend happening.

Atindriyo Sanyal 5:00 Totally. One more additional point on the automation bit, since Yash mentioned code generation and code related use cases: one thing I'm personally very excited about is taking code generation beyond just generating boilerplate template code, which is kind of the low hanging fruit. I know we got the statistic recently that 25 percent of Google's code is now code gen automated. A lot of that, if you put a magnifying glass on it, you'll see that it's stuff that they would have to write, but code that's essentially boilerplate. But we'll take steps towards better code understanding and, you know, giving more nuanced generated code, which is not only, you know, a couple of for loops and, you know, basic things, but really understanding the context around the code. So very excited to see, you know, progress on that front.

Conor Bronsdon ¶ Do you think you're going to be ready to implement that autonomous AI dev agent this year, Atin or not? Not quite yet.

Atindriyo Sanyal 6:00 I'm excited to. I do think a lot of this was kind of, you know, we had prototypes released in 2024, which many people sort of criticized and thought that, oh, it's just toys. But we'll truly see some very incredible advancements towards that. The most exciting thing, I think, is there'll still be a human in the loop or developer in the loop needed, because you're not just writing the code. Writing is just probably 10 percent of a software engineer's job. There's design and a lot of other things. So it'll free the developer up from writing code. And at some point we'll get fully automated code writing, kind of like autonomous driving. And then you can just focus on, you know, connecting boxes with arrows and building awesome systems.

Conor Bronsdon ¶ At the same time, though, as someone who enjoys writing code occasionally, even if I'd personally be the worst one on this call at actually doing it, I don't know that I want to fully free up my code gen. I want to spend my time, as you point out, on the kind of higher level, more strategic pieces. It's that boilerplate that I'm really excited to get rid of, where it's like, great, let me get to stage one here more rapidly. Let me make my update from one framework to another more rapidly. And I'm curious if there are particular use cases, whether it's something with software developers or something else, that you're hearing from customers about: what they want to see from AI, whether it's this year or in the coming future.

Yash Sheth 7:00 In terms of particular use cases, it really depends on the industry and the vertical. You know, what I've been hearing right from the inception of the generative AI landscape is: can we use this technology to convert, like, COBOL and Fortran code in financial services and mainframes to something more performant, maybe Rust or, you know, even Python for that matter? I know that there's a lot of effort being put behind just that. You know, I'm not seeing a lot of financial institutions really talking about it, because it's a sensitive topic. You don't want to change the world's transaction capabilities overnight. But, you know, I'd love to see some amazing progress on that front this year, because that's truly going to be transformational for the world's economy.

Atindriyo Sanyal 8:00 And I fully agree. And it's very interesting, the point Yash mentions about translating code from, you know, old school legacy systems, but also some of the modern software that we've built, potentially on programming languages which are chosen for reasons other than performance. And you end up building so much tech debt, tech debt that is like a cancer in every organization you go to. And it leads to slow systems. So it'll be very interesting if we can totally automate code translation into languages like C++, which are just natively 1000x faster than some of the more application layer languages.

Conor Bronsdon 9:00 I absolutely think that's one of the almost current opportunities. It's one of the big things Amazon did in 2024 and trumpeted during their, I believe, Q3 earnings: hey, you know, our Q software generation tool allowed us to upgrade from, I think it was Java 8 to Java 17, with this massive savings across the board. I had a conversation with LinkedIn's VP of engineering, Aarathi Vidyasagar, who's a fantastic leader on this front and thinks very deeply about developer experience in particular. And she thought about it from the perspective of what they're trying to do at LinkedIn, and said, okay, we don't want our devs to spend their time having to do this major translation. That's not exciting work. Let's free up their time for new features. Let's free up their time for more exciting parts of their role and kind of help with that translation layer of, great, let's get into C, or whatever it is we want to upgrade to. I totally think you're both spot on that this is a major opportunity, both right now with all the kind of copilots of the world, but also with agents here pretty shortly, if not already.

Yash Sheth 10:00 Yeah. I mean, speaking of agents, right, I think until now there's been a lot of focus around code and code capabilities for these language models. And the big reason is that even if you look at the biggest, you know, highest valued companies, a lot of their OPEX goes into developer costs. You know, developers are some of the most expensive resources, and that's where a lot of spend goes. A lot of innovation happens there. So if you can free up time, there's massive ROI to be unlocked. If you look industry-wise, there are tens of thousands of people processing transactions and documents in various verticals, whether it's financial services again, or even, you know, telecom or regulatory or defense; there's so much manual inspection happening. And the reason why these things haven't come up yet is because agents are just being productionized this year. That's where true automation will help all of these folks get upskilled, to not be doing the manual, grungy tasks of inspection and to almost, like, develop on top of this technology: okay, this automation is already helping me review all of this. What more can we build on top of that to help the end user?

Atindriyo Sanyal 11:00 Just to add to that point, I think this is where it also underscores the need for better evaluation tooling for these kinds of LLM applications. Number one, these systems are more complicated, with agents and various other components in the mix; it's not just you querying an LLM. It's also that you're not just getting a textual output as a generation, you're actually executing an action. So, you know, the penalty you pay for the right action versus the wrong action is potentially much higher than, say, a simple generation, which is why you need robust evaluation tooling that can be achieved at scale, and at an efficient cost as well.

Conor Bronsdon ¶ Yeah. Atin, I'd love for you to unpack a bit more how is thinking through the approach to evaluating agents in particular, as we see that as such a theme. Here in 2025 is the rise of agents. The opportunity for agents. How do you think businesses should be thinking through agentic evaluations and the different pieces of that process?

Atindriyo Sanyal 12:00 I'd love to hear from Yash as well. But from my perspective, there are a few different components to this. Number one is the accuracy of the metrics that help you truly evaluate the task at hand. Second is customizability, because there are no two tasks which are exactly the same, which one metric can, you know, be faithful to at a hundred percent accuracy. It's like one size fits some. So that's where the platform kind of comes in and allows the developer, or rather, it's like inversion of control: giving the developer the power to use these ingredients but build custom evaluations on top of them, which can adapt to your use case. Which brings me to my third point, which is adaptability. Once you productionize these applications, it's similar to machine learning in general: we would productionize models and the data would change over time, the models would meet new data, and there'd be drift, and you'd want to address the drift by taking action. It's similar here, right? As the users are using the system, data is changing. The patterns of usage are changing. So your metrics also need to evolve over time. They might lose their accuracy otherwise. So all this needs to be baked into an evaluation platform. But Yash, I would love to hear your take as well.

Yash Sheth 14:00 I've always stated that when we're productionizing AI, the rigor in AI has gone from curating the best data sets and fine tuning the most accurate model to using out of the box models that are amazing and really spending that time to create the best set of metrics. And this is beyond guardrails, because, you know, when we think about guardrails, it's typically these prompts or these instructions that we can set in the system prompt for the model, but the model may choose to disregard those guardrails at some point, right? So it's very important to measure. If we can do an amazing job at quantifying the behavior of your application through metrics, that's what frees us up to scale that application in production. Now, why is that even more important for agents? It's because with these API calls, the tool calling, the code execution, there are going to be irreversible changes that agents make out there. And for that, being able to run these metrics that quantify what good behavior looks like, and enforce that in real time in an agentic flow, is going to be absolutely critical. We have what we call Galileo Protect, and essentially what that is, is a control pane for your generative AI application. It's like, you know, you may call it a firewall for some security threats, but it's more so a control pane where, as soon as you detect bad behavior, you can immediately create a metric to detect that and prevent your application from bad behavior within seconds. That's going to be absolutely critical for every agentic application.

Conor Bronsdon 15:00 And as you both point out, this is an area where Galileo is putting a lot of time in. You know, we're looking at not only the step level of, hey, is this agent making the right tool selections? Is it doing the correct actions? But also the turn level of, is it performing these actions in the right order? And then also, is the final result accurate? And we can get more granular around that. I'm sure we'll have a broader discussion there. But I would also encourage folks who are interested in thinking through these kinds of agentic frameworks to check out our recent episode, our last one of season one, with Vinnie from Twilio, where he goes in depth on how Twilio is thinking through their platform for their customers to build AI agents and how Galileo is enabling them with evaluations and observability at every step of the way to build those agents. We're really glad to have them as a partner and very excited to continue to grow that relationship.

Conor Bronsdon 16:00 And as we think ahead to the rest of 2025, I'd love to give the audience some context on some of the other more forward thinking initiatives we are working on behind the scenes, if not already starting to bring on the scenes. Atin, Yash, I'd love to hear from you both. Maybe Yash, if you want to start, what do you view as the top priorities for Galileo and for AI evaluations in 2025?

Yash Sheth 17:00 Galileo is squarely focused on building evaluation intelligence for the trust layer in the generative AI side. What does that mean? To build true evaluation intelligence, we have to help our users firstly solve that measurement problem: how do we quantify your application's behavior into metrics? Being able to do that quickly, within minutes, and accurately is the first thing that's important. The second most important component of evaluation intelligence is, you know, what use are these metrics if they're just offline? We need to be able to scale these metrics and enforce them in production at scale. Because again, with agentic evaluation, it's not like just people talking to a chatbot; there are going to be many transactions that happen, maybe for every data entry in a table or every code file that is updated. There is an agentic flow that gets kicked off with it. So when we think about these applications and enforcing metrics at scale, those are going to be kind of the top two priorities. And, you know, I'd love to have Atin also talk more in that direction.

Atindriyo Sanyal 18:00 Yeah. I mean, here's a hot take from me. I think in the next three to five years, every piece of software that is built on this planet will have some sort of AI baked into it. And it's kind of like a new era of software development, right? Like, this is a new paradigm, new ways to build software. And if you turn the clock back to the 1980s, how we used to build traditional software, we're kind of there within the new era of AI software: there's no eval tooling, there's no machinery around, you know, building a piece of software in a robust way. People are using caveman tools, you know, just to look at vibe checks and eyeballs. That's how potentially we used to do, you know, software development in the earliest days when software became a thing, so we're kind of back to square zero in a way. That's why evaluation is super critical. And it took us many decades to get to the level of sophistication around the tooling for traditional software that has allowed us to literally change the world. 90 percent of the world today has used some sort of software in their life. And to get to that with AI, I think the time frame will be much shorter, because we are just more advanced as a species. But in the next 5 to 10 years, we'll certainly see a revolution in the engineering and the machinery around these LLMs, and there'll be progress on both fronts: you'll have better software, but also better models. And the combination of the two will be very exciting to see, you know, the possibilities.

Conor Bronsdon ¶ That is a spicy hot take, and I think it'll take longer. I think there's still a lot of room for deterministic systems for at least the next several years, but I'm excited to see if you get proven right here. I will owe you a meal if so; you said five years was your top end. So let's keep an eye on it, because we'll check back here on, so what'll that be, season seven for us? And we'll have a review of the different hot takes we've had at that point. So as you think about this advancement of AI and how fast the space is moving, as you pointed out, it is arguably iterating faster than any prior technological advancement, because in and of itself, we are building self-learning models that are helping speed the development themselves in a lot of ways. How is Galileo going to contribute to the advancement of AI this year and beyond?

Atindriyo Sanyal 20:00 , my perspective is that we think of, not just evaluations as a necessity, but. scalable evaluations. So we're tackling the problem, not just from the perspective of, Hey, we need to give something is accurate and actionable for a user, but also how do we do it at scale and allow the user to take their application literally to the world. And they get a customized suite of evaluation methodologies that scales with their data, that scales with their application. So I think the focus for us is on two fronts. One is, of course, we have, a scalable platform that allows a user to use a lot of the metrics and create metrics and customize metrics on Galileo, but also, work on, baseline metrics that we offer in the product at dirt cheap. If, for example, we have the only hallucination detection model and algorithm that literally works at zero dollars, And there's no one in the industry that does that. And a lot of research and a lot of brainpower has gone into building methods like that. So we'll continue to push the envelope on both. You know, research and newer ways, advanced ways, cheaper ways off of achieving a high accuracy evals at scale, but also offer a world class platform that scales with the user and allows them to, design, test cases and. security measures that adapts to their application, like Yash talked about AI security being such a critical thing. I think the software engineering or the SDK API layers that we offer becomes very critical, but also latencies, right? Like, we're the only ones who offer a very low latency, both P50 and P95. that actually works for applications beyond a certain throughput, but pass the baton to Yash to talk more about that and also get his take.

Yash Sheth 22:00 Yeah, I mean, absolutely. I think, in terms of advancing AI, right, again, I go back to the fundamentals: advancing AI means increasing AI adoption in software across the board. Now, as AI replaces parts of software or even augments software, and as of today we're already seeing a lot of that, having a trust layer that can speed up the adoption and automation of, you know, AI powered software is going to be critical. That includes CI/CD, monitoring, and the firewall, you know, the typical trust layer of the software stack; that's changing massively, and we see how every single hyperscaler out there or model provider is making evaluations front and center. That's because, you know, that's super, super important to adopt. Now, to Atin's point about scaling: today, a lot of the evaluation is happening via LLMs; there aren't any, you know, stochastic or statistical metrics that can evaluate these applications, right? And we're all aware of how LLMs are being used in this space. Now, how can we push the state of the art to a point where things can actually scale to millions and millions of requests, to thousands of QPS of traffic, at a cost point, a price point, that can scale as well? No one wants to double their OpenAI bills. No one wants to have an evaluation system that takes 10 seconds to evaluate one prompt.

Conor Bronsdon ¶ I think you're spot on there. We would all like to decrease those bills if we had our druthers. And I also think it speaks to the value of the proprietary research Galileo has done around our ChainPoll and Luna methodology. Are there opportunities that you see on the research front to continue to further that unique proprietary advantage that Galileo has been leveraging, and hopefully help the entire industry do these evaluations across the board, but in a cheaper and more scalable way?

Atindriyo Sanyal 24:00 We've been working on two aspects of that research. One is building better, higher accuracy, fundamental sort of foundational models that actually measure things like RAG hallucinations and task completions and all the things that people care about, at a reasonably high baseline accuracy that's respectable, but then beyond that, making them fine tunable. And to the point of fine tuning, that's the second aspect of our work, which we term Luna flow, which is essentially this framework that allows you to... it's almost like a metrics authoring slash fine tuning system that caters to anyone who has a loose definition of what they want to evaluate. They can literally start with a natural language text definition. And from there, we've built the proprietary tech to be able to create a high accuracy metric that adapts to their data and takes feedback. There's RLHF happening behind the scenes, but also there's an AutoML layer in the loop, which fine tunes to the data. So you can bring your data to the table, upload it to the Galileo system, and magically the metric accuracy improves over time. So that's a lot of the machinery our engineering team works on, and then doing that at scale. So those are the two or three main areas of focus for us.

Conor Bronsdon ¶ Yash, is there a particular aspect of Galileo's research around Luna or ChainPoll that you think maybe we haven't talked about enough, that we should spend more time on?

Yash Sheth 26:00 Oh, absolutely. I think one of the most amazing recent launches is about how we can quickly adapt our out-of-the-box metrics to the use case itself: the whole continuous learning flow on identifying not only what needs to be measured, but also understanding your data and how we can tweak the metrics to best represent your task. Another piece of research that Galileo has developed over the years is our capability of measuring semantic drift in the traffic. How can that help curate the best datasets? How can that help identify skews in our traffic? Because, as you know, when we think about good observability in production, these applications that we're building are so broad sometimes that users can use them in varied ways over time. If we can capture that meaningfully and give users strong workflows, very, very strong workflows, and, you know, I think Atin talked about the Luna flow, which is basically a workflow to keep their metrics layer and their datasets up to date and most representative, then teams will feel most confident in delivering those applications in production.
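
The conversation does not say how semantic drift is measured. One common generic approach, shown here purely as a sketch and not as Galileo's method, is to embed a reference window and a recent window of traffic and compare the two centroids by cosine distance. The toy two-dimensional vectors stand in for real embeddings, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def drift_score(reference: Sequence[Vector], current: Sequence[Vector]) -> float:
    """Assumed drift measure: cosine distance between window centroids."""
    return cosine_distance(centroid(reference), centroid(current))


if __name__ == "__main__":
    # Toy embeddings: last month's traffic versus this week's traffic.
    reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
    similar = [[1.0, 0.05], [0.95, 0.1]]
    shifted = [[0.1, 1.0], [0.0, 0.9]]
    print("similar traffic:", round(drift_score(reference, similar), 4))
    print("shifted traffic:", round(drift_score(reference, shifted), 4))
```

A centroid comparison only catches shifts in the average topic of traffic. Detecting new clusters or changes in spread would need a richer comparison, which is outside what this sketch attempts.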

Conor Bronsdon 27:00 Do either of you see opportunities for us to partner with other players in the AI space this year to increase that research opportunity or that technology opportunity?

Yash Sheth 28:00 Absolutely. I mean, I think we've shown some early partners through our Series B announcements last year, with, you know, Databricks and ServiceNow, and, you know, we have the partnerships with the cloud providers, obviously. But I think on a technology front, there's a lot happening behind the scenes, where we work with the vector DB providers, we work with the model providers, and we are working on essentially building a model-agnostic or technology-agnostic system. Our focus is to help the application developers. By partnering with these technology providers, developers can integrate our systems into an end-to-end flow that can help them leverage these models at a higher scale. Today, a lot of the POCs are stuck in the POC phase and not able to go into production because of the missing trust layer. And that's where we're calling on all partners to actually work with us to embed this trust layer in the stack, as we can jointly help developers unlock more value and scale, whether from the model perspective, from the vector DB perspective, or from the agentic framework perspective.

Conor Bronsdon ¶ To your point, it's the same kind of work we've already done with Databricks and Google Cloud and others, and there's definitely an opportunity to continue to expand that and have Galileo continue to form this trust layer across the ecosystem. Atin, how about yourself? Are there any other particular collaborations that you foresee or want to pursue?

Atindriyo Sanyal 29:00 I mean, anything that gets us closer to the user is a valuable collaboration, in my opinion. One of the issues has been that the backbone of a lot of even the industrial research is academic benchmarks, and, you know, not to throw shade at academia at all. I mean, they're doing fantastic work. In fact, a lot of this revolution comes from universities and academia. But what it's lacking is industrial benchmarks and standards which are more practical for industry use cases. So any kind of partnership that helps us get closer to developers, or acts as a channel to thousands of developers so that we can navigate an ever-changing toolscape and ecosystem, I think that will really help us connect the dots, and it'll be great for Galileo to build holistic evaluations that, you know, work for many use cases, many users.

Conor Bronsdon ¶ Excellent. Well, guys, I have very much enjoyed having a chance to connect with you here to kick off 2025.

Conor Bronsdon 30:00 If you could close with one piece of advice to businesses and engineers who are looking to leverage AI in 2025, what would that piece of advice be? Yash, if you want to start.

Yash Sheth ¶ I think the one piece of advice, and I'll kind of harp on my point, is establishing rigor in the workflows. When we're adopting AI in our applications, it's very easy to build a cool POC and start to launch it out there, but not having that rigor is the big mistake that most people make. So however you want to implement it, one big piece of advice would be: start quantifying the behavior of your application into metrics early on, because that is going to be essential as you scale these applications.

Conor Bronsdon ¶ How about you, Atin?

Atindriyo Sanyal 31:00 Get Galileo for your LLM evaluation needs, really. I mean, that's my advice. But anyway, on a serious note, I truly second Yash's point that the rigor has been completely and grossly missing. And that has been kind of the bane of the developer's experience, where they have this magic ball in their hand which can do so much. The possibilities are endless. And it's so easy to get to a prototype so quickly, and then everything falls apart in shambles because you don't have a robust evaluation framework. And there are so many issues to cite, right? Like just the discrepancy between the data that you test with versus the real-world data that hits your application, or the cost and the scale of your evaluation methodologies, including manual eyeballing. There's no dollar value to it, but the amount of time it takes, it's just impractical and untenable. So the need for robust evaluation is sort of the unlocking power. It's kind of like going back 10 years, right? The one big thing that was missing in massive cloud adoption was security. And once there were security solutions, it was just free flow. Cloud became a universal thing. I think the same thing for AI is evaluations.

Conor Bronsdon 32:00 Fantastic.

Conor Bronsdon ¶ Well, thank you both so much for joining me to kick off season two of Chain of Thought. And for our listeners, we are so excited to be back this year with you. We hope you've been enjoying the show. What we would love is for you to let us know what else you would like to see. Is there a guest you want us to have on? Is there a topic you want us to cover? Are we wrong about something and you need to tell us? Let us know. We have open comments on Spotify. You can reach out to us on LinkedIn. You can reach out to us on X slash Twitter. We'd love to hear from you, and hope you all are having a fantastic start to the year. And Atin, Yash, thanks again for joining me.

Yash Sheth ¶ Thanks, Conor. Looking forward to an amazing season two of the podcast. And yes, it's such an exciting space. Please comment, give us suggestions. You know, we can bring in the experts here. This is meant to be a point where we discuss the most important things for Gen AI. So yeah, looking forward to a great season two, Conor, and welcome to 2025.

Conor Bronsdon ¶ Love it.

Atindriyo Sanyal 33:00 Thank you.