Episodes · S2 E17

Inside IBM's watsonx: Building Enterprise AI That Ships | Dr. Maryam Ashoori

Maryam Ashoori, IBM · 45 min

AI Agents · Open Source AI · AI Evaluation & Reliability · Enterprise AI

Key takeaways

  • Maryam Ashoori frames watsonx around the three enterprise challenges she sees consistently: responsible (trustworthy) implementation of AI, cost-performance optimization, and capturing the ROI of generative AI through automation. These — not a single “wow” model — were the guiding principles for the platform’s design.
  • On models, watsonx’s bet is optionality over a single provider: state-of-the-art open-source models available the same day they release, commercial models via partnerships with Meta and Mistral (Mistral Large), and the ability to import your own. Ashoori’s conviction: “one single model is not the solution” — it’s mix-and-match across sizes, architectures, and license terms.
  • IBM trains its Granite models from scratch so it can stand behind them — documenting training-data lineage, filtering toxic and copyrighted content “to best of our knowledge,” being transparent about how they’re trained, and providing client indemnification. Ashoori ties this to customers in regulated sectors like finance, insurance, and health care.
  • Ashoori breaks agent quality into four areas to assess per customer: the LLM itself (hallucination, jailbreak, the usual guardrails); agent-specific guardrails like faithfulness of the action taken; agent evaluation — a superset of LLM evals plus tool-calling consequences; and observability and governance across build time, run time, and over time.
  • Ashoori’s 1,000-developer US survey found only 24% of AI app developers called themselves knowledgeable and skilled on GenAI — and more than half use 5 to 15 tools daily yet will spend “not more than two hours” evaluating a new one. Her prescription: know your problem and build a point of view, so you can tell noise from what’s worth your limited time.
  • Ashoori is blunt that what the market calls “agents” today is mostly “LLM with function calling and tool calling,” not the autonomous reasoning, planning, and decision-making she studied 20 years ago. She doesn’t think current models reliably make sound decisions yet — there’s “some sort of preliminary planning” — but is “personally super excited” about the next six to nine months.

Frequently asked questions

What is watsonx and how does Maryam Ashoori position it against other AI platforms?
Ashoori is Head of Product for watsonx AI, IBM’s AI development studio. She names three differentiators. First, hybrid deployment so customers aren’t locked in — the platform can run on premises or the cloud of their choice, rather than being tied to one model provider. Second, trust: IBM trains its own Granite models with documented lineage and indemnification, and ships governance and observability (watsonx.governance) with guardrails on inputs, outputs, and orchestration. Third, simplicity — integrating frameworks like CrewAI and LangGraph behind the scenes so a developer works through one SDK.
How does IBM approach build-versus-buy for models in watsonx?
Ashoori’s answer is optionality rather than one model. watsonx offers a range of state-of-the-art models across different sizes, architectures, and license terms, because global customers face varying license and GPU restrictions by region. IBM adds open-source models the same day they release, commercial models through partnerships with Meta and Mistral (she cites Mistral Large), and lets customers import their own. The same optionality applies to customization — from prompt engineering and RAG up to full fine-tuning, parameter-efficient fine-tuning, and alignment tuning. Her stated belief: “one single model is not the solution.”
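As a rough illustration of the "mix and match" idea in this answer, the sketch below filters a small model catalog by license, available GPU memory, and minimum size, then picks the cheapest match. The catalog entries, field names, prices, and the `pick` helper are all hypothetical and are not taken from watsonx.

```python
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    params_b: float     # size in billions of parameters
    license: str
    min_gpu_gb: int     # GPU memory assumed necessary to serve the model
    cost_per_1k: float  # illustrative price per 1k tokens

# Hypothetical catalog: names and numbers are made up for the example.
CATALOG = [
    ModelOption("small-open", 3, "apache-2.0", 16, 0.10),
    ModelOption("mid-open", 8, "apache-2.0", 24, 0.20),
    ModelOption("large-commercial", 120, "commercial", 320, 2.00),
]

def pick(catalog, allowed_licenses, gpu_gb, min_params_b=0):
    """Return the cheapest model that satisfies every constraint, or None."""
    ok = [m for m in catalog
          if m.license in allowed_licenses
          and m.min_gpu_gb <= gpu_gb
          and m.params_b >= min_params_b]
    return min(ok, key=lambda m: m.cost_per_1k) if ok else None

# A region that only permits Apache-licensed models and has 24 GB GPUs.
choice = pick(CATALOG, {"apache-2.0"}, gpu_gb=24, min_params_b=5)
# A region whose GPUs are too small for anything in the catalog.
no_choice = pick(CATALOG, {"apache-2.0"}, gpu_gb=8)
```

The point of the sketch is only that different constraints select different models from the same catalog, which is why a single model rarely fits every customer.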
Why does Ashoori say agents make governance and evaluation harder than LLMs alone?
Because agents act. With last year’s LLMs, she says the worst case was generating inappropriate content. Agents take actions, and actions have consequences — her example is an agent connecting to a sensitive structured database and deleting or combining customer data, which amplifies the impact. So watsonx aims not only to document action lineage but to proactively detect inappropriate actions and stop them, keeping a human in the loop so no high-stakes action runs automatically. She frames agent evaluation as a superset of LLM evals plus tool-calling: accuracy now depends on external APIs and the data they return.
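A minimal sketch of the pattern described in this answer, assuming a simple allow-list policy: every attempted action is written to a lineage log, and actions labeled high-stakes run only if a human reviewer approves them. The `ActionGate` class, the `HIGH_STAKES` set, and the tool names are hypothetical and do not represent the watsonx.governance API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy: action names that must never run without review.
HIGH_STAKES = {"delete_records", "merge_customer_data"}

@dataclass
class ActionGate:
    approve: Callable[[str, dict], bool]         # human reviewer callback (assumed)
    lineage: list = field(default_factory=list)  # append-only log of attempts

    def run(self, name: str, args: dict, tool: Callable[..., object]):
        needs_review = name in HIGH_STAKES
        allowed = self.approve(name, args) if needs_review else True
        # Record every attempt, including blocked ones, before executing.
        self.lineage.append({"action": name, "args": args,
                             "reviewed": needs_review, "allowed": allowed})
        if not allowed:
            return None  # blocked: the tool is never called
        return tool(**args)

# Example reviewer that rejects every high-stakes request.
gate = ActionGate(approve=lambda name, args: False)
result_read = gate.run("read_record", {"record_id": 7},
                       lambda record_id: f"record {record_id}")
result_delete = gate.run("delete_records", {"table": "customers"},
                         lambda table: f"dropped {table}")
```

Here the read goes through, the delete is stopped before the tool runs, and both attempts remain in `gate.lineage` for later audit.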
What did IBM’s developer survey find about AI skills and tool fatigue?
Ashoori ran a survey of 1,000 US developers building AI applications. Only 24% described themselves as knowledgeable and skilled on GenAI — a gap she ties to why watsonx invests in automation and guidance for people entering AI development without deep AI backgrounds. On tooling, more than half said they use 5 to 15 tools daily in a fast-moving market, yet will spend “not more than two hours” evaluating whether a new tool fits. On AI-assisted coding, she recalled — hedging on the exact figure, “maybe it was 49%” — that a sizable share use it often, saving one to two hours a day on average, with a small handful saving more than four hours.
What does Ashoori think is overhyped about agentic AI right now?
Ashoori — who did multi-agent systems for her master’s degrees roughly 20 years ago, before deep learning — says the misconception is that today’s agents already deliver autonomous reasoning. What’s commonly called an “agent” is really an LLM with function and tool calling, which she values for bringing GenAI into every corner of the enterprise, but it isn’t the autonomous reasoning, planning, and decision-making she means. She doesn’t think models reliably make sound decisions yet — there’s “some sort of preliminary planning” — but is excited about the next six to nine months as the stack matures.
Where does Ashoori see GenAI delivering the most value in the enterprise?
A second misconception she flags: that GenAI fits every enterprise use case. It doesn’t — the work is understanding what GenAI is actually capable of. The pattern she highlights as highest-value is content-grounded question-answering, especially in customer care, where you equip people with answers drawn from a verifiable body of information. That was last year’s RAG story; now, with agents, the system can fire up a web-search API when the answer isn’t in that body and bring it back — always with a human in the loop to verify.

Chapters

  1. 00:00 Introducing Dr. Maryam Ashoori
  2. 01:13 Overview of IBM's AI Strategy
  3. 01:47 Enterprise AI Challenges and Solutions
  4. 04:40 IBM's Approach to AI Models and Tooling
  5. 09:52 Simplifying the AI Stack
  6. 12:20 Challenges in Agentic AI
  7. 15:55 Importance of Data Management and Lineage
  8. 21:11 IBM's Strategy for Gen AI Products
  9. 23:43 Scaling Challenges with Agents
  10. 27:40 Effective Agent Evaluation Systems
  11. 35:18 Gaps and Opportunities in AI Tooling
  12. 41:35 Success Stories with watsonx
  13. 44:00 Closing Remarks

Show notes

Building trustworthy, scalable AI isn't just about models; it's about navigating a complex ecosystem of tools and regulations. 

Join hosts Conor Bronsdon and Atindriyo Sanyal as they explore these challenges with Dr. Maryam Ashoori, Head of Product for watsonx AI at IBM. To meet these challenges, Maryam explains how watsonx simplifies the AI stack, automates pipelines, and empowers enterprises to scale their AI operations while optimizing costs rapidly.

Maryam also explores IBM's strategy for leveraging open-source and commercial models, enabling the potential of agentic systems. Plus, she shares insights from a recent survey of 1,000 developers, revealing key takeaways about the current landscape for enterprise AI implementation, and what results mean for both developers and the enterprises they support.


Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

watsonx.ai


Check out Galileo

Try Galileo

Transcript

Speaker 0:00 I think the market is underestimating how complicated the modern AI stack is and what developers need to master and harvest in order to deliver on those potentials that generative AI is promising.

Conor Bronsdon 0:20 Welcome, everyone. I am your host, Conor Bronsdon, here with my fellow co-host, Atindriyo Sanyal, CTO and co-founder at Galileo. Atin, great to see you. Great to have you back on the mic with me. Always great to be here, Conor. Yeah, we've had an exciting conversation planned here. We're joined by IBM's Dr. Maryam Ashoori, Head of Product for watsonx.ai. Maryam, welcome to the show. Great to have you here. Thanks for having me.

Conor Bronsdon 0:44 We're really excited about it because you've got over fifteen years of experience in developing data-driven technologies, and now you're responsible for leading watsonx.ai, IBM's AI development studio. Our listeners may know you as an expert in enterprise AI with two master's degrees in artificial intelligence and a PhD in system design engineering from the University of Waterloo,

Conor Bronsdon 1:05 where you are also an adjunct professor, which is fantastic. I have to admit, I've always wanted to teach a university class. That's really cool. So it's such a pleasure to have you on the show. Let's jump right into your work at IBM. Can you provide an overview of IBM's AI strategy and how watsonx is fitting into that vision or driving it?

Speaker 1:24 Everything that we do is a reflection of the market. And it's been exciting to just watch how the market has been evolving over the past, I would say, 24 months at this point. Just looking back, I feel like in 2023 or even early 2024, most of the market was exploring and investigating generative AI. They were looking for a wow factor

Speaker 1:48 and moment. But at this point, the majority of the enterprise market has moved past that moment. They have moved toward production and scale. And that's been the area that we've been focused on designing for since day one. What are the requirements for enterprise production and scale when it comes to generative AI or AI in general? And it's been interesting to just see how consistent the market has been in terms of the challenges. The top three challenges that I've been seeing, and basically the guiding principles for the design of our platform, have centered around, one, ensuring a responsible implementation of AI.

Speaker 2:29 The second one is cost-performance optimization. Like, you're talking about the scale of enterprise. The very large general-purpose models with large compute are not necessarily delivering what you need. So optimization, exactly the cost. And the last area to focus on, which has accelerated with agents over the past year, is how can we increase and capitalize on the

Speaker 2:58 ROI of generative AI, bringing this generative AI potential to every single corner of the enterprise, which is really taking advantage of automation to bring that level of productivity to every single corner. So these three areas, optimization, ensuring a responsible implementation of the technology, and bringing automation to every single corner of the enterprise, have been the driving force for what we are designing as part of the platform for watsonx. Atin, I know a lot of these ideas resonate with you. Do you want to chime in here?

Speaker 3:34 I think there's a lot of similarity between the kind of market that IBM's going after, big enterprise, and us. Even though we're a much smaller company, I think my experience with our customers is very similar to what Maryam just described. We spent a lot of time in 2022, 2023, really just sort of in simmer mode. Like, AI was in simmer mode. And the last few months have really seen a Cambrian explosion of agentic applications and the maturity of some of these orchestration frameworks.

Speaker 4:12 So very similar observations on my side as well. But one follow-up question for you, Maryam, given what you've seen in big enterprise, how does IBM kind of think about build versus buy when it comes to open source models as well as, you know, bigger models like OpenAI and Anthropic? Does IBM look at it as leveraging these models in their platform, or do you focus more on building foundational models from scratch?

Speaker 4:41 Yeah, so let's talk about, for example, the cost-performance challenge that we just brought up as one of the challenges. Cost-performance optimization. If you want an optimized solution, what are you gonna need? You're gonna need an LLM or a collection of LLMs, and you're gonna need a bunch of model customization tooling that is allowing you to take advantage of your proprietary data about your users, your domain specific

Speaker 5:10 data to build something that is differentiated from the market, but is also delivering the performance that you want for your target use case for a fraction of the cost, right? So that's basically the problem that we are trying to solve. So in this story, in order to deliver on this challenge, what you're gonna need is to have access to a series of state of the art models, right?

Speaker 5:33 Different sizes, different architectures, different license terms. We have global customers. Maybe there are some license restrictions in one part of the world versus others. There are GPU restrictions in some parts of the world. It's like, what are the requirements for running these models? Right? So you wanna have access to all of those. And it's the same for the tooling for model customization. There is a range of those approaches. You might just need prompt engineering and RAG, or in some cases, you need full fine-tuning and parameter-efficient fine-tuning or even alignment tuning, right? So the needs are different.

Speaker 6:09 We wanna make sure that there is optionality and there are choices for your customers. And in terms of choices and flexibility, we've been also looking into where are these options coming from, right? So for example, for models, we've been leveraging the state-of-the-art open source models. Like, the first day that a state-of-the-art model goes out, we are out on the same day. So it's like our users should have access to that. But also some of them are commercial models. Like for example, we established partnerships with Meta and Mistral over the last year. So Mistral Large is an example of a commercial model that we have available in our platform.

Speaker 6:47 Or a customer may come in and say, hey, I have a custom model that I made myself on premises or on another platform and I want to bring it in, import it, so it's not in your catalog. So we have also expanded to import your own custom models. So basically delivering a range of options because we strongly and firmly believe that one single model is not the solution for a range of those. It's a mix and match. So that's the model part. The same for the tooling part. We wanna make sure that we are integrated with ecosystems. So for example, you mentioned agents.

Speaker 7:22 Where are the developers these days? They are experimenting. They are exploring. Some are exploring with CrewAI. Some are exploring with LlamaIndex. Some are exploring with LangGraph. We are bringing all of them together through integration to make sure that that optionality is available for experimentation. But also, we care about production and scale, security, privacy, deployment,

Speaker 7:45 like availability of the service, robustness, all of those in production, not just experimentation.

Conor Bronsdon 7:51 Maryam, I'd love to understand more about how watsonx as a platform is differentiating from other AI platforms in the market. You've talked a bit here about enabling open source, trying to make sure that you access other ecosystems. What are the other ways that watsonx is different?

Speaker 8:10 Very good question. So back to the optionality and choices that we talked about, we wanna make sure that our customers are not locked in, and that's one of the foundations that we have been focusing on, being hybrid. So in our platform today, you can take the whole platform and deploy it on basically the platform of your choice, either on premises or the cloud of your choice, right? Versus being locked into one single model provider. So hybrid is one of them. The second one that we are really

Speaker 8:43 serious about is trust. Ensuring a responsible implementation of AI. You saw that with tooling and models. For models, we started training our own models from scratch. That's our Granite collection, because we wanted to be comfortable standing behind those models and provide indemnification for our customers. Right? And we've been very transparent in terms of how the models are trained.

Speaker 9:06 And that's on the model side of the house. The same story for tooling. Like, we've been heavily investing in our governance and observability platform, watsonx.governance. Like, every step of the way, like, who touched the model to do what, we automatically document the lineage of what happened. But also, we are putting guardrails in place. Guardrails on input, guardrails on output, guardrails on orchestrators.

Speaker 9:32 Right? Basically monitoring everything. So I would say that trust is very close to our heart and very critical for us, especially because we have a lot of customers coming from highly regulated environments like finance, insurance, health care, you can imagine. It's a serious thing for them. So trust is the second angle. The third angle is we recognize the complexity of the stack.

Speaker 9:58 I think the market is underestimating how complicated the modern AI stack is and what developers need to master and harvest in order to deliver on those potentials that generative AI is promising. Right? So from our perspective, we've been looking into simplifying the stack as much as we can, integrating behind the scenes, integration between different components of the platform, but also integration with the ecosystem. For example, we mentioned CrewAI, we mentioned LangGraph. We are integrated behind the scenes. So a developer comes in, they don't need to be worried about

Speaker 10:35 maintaining the code coming from a third party or learning that. It's one single SDK. They get the job done. Right? So simplicity is the next factor that I would mention. And last but not least is automating, providing guidance as much as we can. We acknowledge that, for example, in this world, there are lots of developers getting into AI application development when

Speaker 11:00 the depth of AI knowledge and skills may not be there. Like for example, we asked a thousand developers. We ran a survey. They were based in the US, they were building AI applications. We asked how comfortable they are with GenAI. And surprisingly, of the AI app developers that we talked to, only 24% described themselves as knowledgeable and skilled on GenAI.

Speaker 11:23 So there is a big gap in AI skills and development. Right? So we've been trying because of that. And why is it important? When you think about cost-performance optimizations or any sort of optimizations, you need to have a knowledge of AI. Like, what parameter is that? Hyperparameter optimization is one angle. Like, what is the right model to use? What is the right technique for model customization

Speaker 11:48 tool to use? So we've been heavily also investing on automating those pipelines. So for example, for Rack, it's like how can I automate multiple Rack pipelines with different parameters and show developers the performance of which one yields a better one just to pick that? Right?

Conor Bronsdon 12:05 And that's the fourth area that we've been heavily investing in. Atin, I see you nodding along to a lot of that.

Speaker 12:11 Yeah, I think I do agree on a lot of things that Maryam mentioned, especially around simplicity, sort of meeting the developer where they are already in familiarity. And scale is certainly a very big sort of challenge, which is I would say unsolved in the agentic workflows because we're in the prototypical phase of agents and it's all very exciting. But one additional observation I have made is when it comes to agents,

Speaker 12:38 the interesting pattern of development we have seen, and this comes from my own experience having worked on the first version of Siri over a decade ago, is that a lot of the software engineering paradigms are coming back into play. And this includes frameworks like LangGraph and LangChain, which are essentially incorporating software paradigms and design patterns that have been known to work for traditional software systems.

Speaker 13:03 They've kind of been augmented and sprinkled with LLMs, and there's newer and fewer paradigms which need to be learned, like how do you do rethinking or how do you incorporate any agentic specific things. I almost see this as a sort of an amalgamation of what we've already learned around scaling traditional software, sort of meeting the new world of GenAI and the GenAI components that they bring.

Speaker 13:29 And I'm very optimistic and excited about the next twelve to fifteen months when all this will truly come together. And the real challenge will be some of the more fundamental things which Maryam mentioned, like trust and accuracy, especially from an evaluation perspective. I can talk about it from a lens of evaluation and observability. We've always had problems of

Speaker 13:55 observability in traditional software systems to root cause something, like how do you do it in an effective way. The same challenges are here, except that there's a few more newer paradigms that you're dealing with with LLMs in the mix. So I would say that a lot of things are coming together, And the challenge for effective, you know, building high quality and effective agents is really high quality observability,

Speaker 14:19 high quality evaluations. And that's something that is very bread and butter to Galileo, and we focus a lot on that day in and day out.

Conor Bronsdon 14:27 That seems like it aligns with your perspective on agents as well, Maryam, and enabling developers to really go and explore this technology.

Speaker 14:36 No, he's right on. Just going back to the beginning, when I started to talk about how the market has been evolving, it's interesting to see that last year, they were experimenting with LLMs. They moved to production. Now we are going through the same thing with agents. The market is exploring. They are looking for the wow factor. But when they go to production and scale,

Speaker 15:01 they cannot go to production without observability and governance. It's essential. You got to have transparency and traceability of actions and monitor that for agents. It's like, well, action was taken. Under what circumstances? And can I control that, right? And also over time, not just one time. So you're gonna need to see that tracing of information at build time and run time and after that over time, right?

Speaker 15:31 So these are the areas that, like, I don't think enterprise has been heavily looking into, but once they are past that excitement about agents, these are among the essential elements that they need to seriously follow up on. Absolutely. And I'd also want to add to what Maryam had said earlier about versioning and lineage. I think that's a topic which is not talked about enough, I feel. I'm drawing parallels to the age of MLOps

Speaker 16:00 when feature stores were created and evangelized. As part of that, model monitoring solutions also came. Lineage and versioning and management, the data management side of AI, is extremely critical, and it will be all the more critical as agentic systems start to slowly scale because in the end, data is the fuel at the end of the day that powers any system. That is something that's never changed in machine learning,

Speaker 16:28 and that will still be the same for agentic applications. So having that layer of data management, lineage, and combining that with observability to be able to actually root cause when something goes wrong, and to be able to track lineage off. A lot of problems in AI essentially boil down to the data and it's the same story here with Gen AI and even with agents.

Speaker 16:51 So that layer of data management is very critical. I had an insurance company coming to me and saying, Maryam, it doesn't matter what the model customization approach is. If you don't have lineage and governance, I just can't use this because I need to know exactly what version of what model was trained on what version of data. And now with synthetic data generation everywhere, like, we need to know what was synthetic data, what was real, retain that data, and be able to audit it back and trace it back. That was an example of highly

Speaker 17:24 regulated industries. They just can't go to production without that knowledge.

Conor Bronsdon 17:29 So I'd love to know more about what's made for successful implementations in these highly regulated industries. Obviously, IBM is an expert here. So something that for decades now IBM has done successfully. How are you making sure that you translate that success to this new Gen AI agentic era?

Speaker 17:42 At the foundations of all of those, we really have the models and we have the tooling stack, right? Let's go back to that. On the model side, there is a deep need for a model that they can trace back, one that is transparent in terms of the training data that went into it. Right?

Speaker 18:07 And for just a portion of the models out there, you can actually get access to that information. So from our perspective, at least with our Granite models, we wanted to make sure that, one, we establish a trustworthy governance process around the training of the model so that we can document the lineage of what happens and filter out the data, like for example, toxic information and copyrighted information and all of those

Speaker 18:31 to best of our knowledge and keep updating that. But also be transparent with the market about how they are trained and provide client indemnification. So basically we are like, okay, so this is our strategy to cover you on the model side, right? On the tooling side, it's hand in hand. It's like you grab the pre trained models, but then through the input and instructions,

Speaker 18:56 you can nudge the model to potentially create an undesirable output, right, independent from what the model was. So it's essential for you to automatically document the lineage of that and stop that. So for agents, it amplifies. Why? Because for LLMs of last year, the absolute worst thing that could happen was the LLM creating some content that was inappropriate. Agents

Speaker 19:22 can take actions. Actions have consequences. What if the agent decides to take an action to connect to a sensitive structured database and delete some of the customer sensitive information or grab them and combine them with other workloads. Then we are talking about amplified impacts of this. And that's why it's essential to not only document that lineage and track that, but also stop that.

Speaker 19:54 Proactively detect those actions that are not appropriate, or make sure that a human is in the loop and no action is taken automatically when the stakes are high. These are the areas that we've been looking into and we've been trying to surface and educate the market about. These are the consequences. The stack might not be mature yet, but it will get there. This is a stack that is evolving in the market rapidly around agents.

Speaker 20:24 And I think that's the part that the technology providers and the technology consumers hand in hand are trying to figure out: what is that uncharted territory? What are the gaps, and let's resolve them. Might be a dumb question for you, Maryam, but I'm curious to know from IBM's perspective, a lot of the things that you've talked about are sort of platform challenges and tooling challenges to enable developers.

Speaker 20:52 And Galileo, for example, is essentially an evaluation observability platform. So a lot of the problems you're talking about, we think about it day in, day out. But IBM's also a very big company, and it has the ability to also enable developers to build great GenAI products which is kind of the layer above the platform. So I'm curious to know from your perspective is IBM's strategy to

Speaker 21:17 just provide the tools to build any kind of application for developers or also provide products, GenAI products, to your users? Yeah, very good question. At the end of the day, the goal is to solve customer problems. So for every single line, we sit together with them to identify what is the gap. Sometimes the gap is to use a product and out of the box use it. Sometimes the gap is you need a custom build approach,

Speaker 21:49 build it. And not every single customer has the talent to actually build that custom solution. The majority of the customers at this point are interested in out-of-the-box solutions that solve that. So as IBM also, we looked into our own platform. We said, hey. We have a platform that is providing Gen AI technologies, but we have a series of software products that can benefit from that technology. So we have a category of

Speaker 22:17 software and products that are enriched with generative AI technologies. One example is Environmental Intelligence Suite. It's a product that you can use for your disaster management. Behind the scenes, you can use foundation models. We have geospatial models we built with NASA to use for the purpose of disaster management, right? Out of the box, you can use that. You don't really need to know what the foundation models are or what the tools are.

Speaker 22:46 There is a second category of software that is powered by GenAI. These are the new sets of products that now we can bring to the market. One example of that is watsonx Code Assistant. So now behind the scenes, there's a Granite Code model powering the whole product that the developer can use for the purpose of productivity. So within IBM, we are actually thinking about four different areas. One is the platform providing these technologies to the market. The second one is the products that can enrich the experience and benefit from generative AI. The third one is the new products that are powered by GenAI.

Speaker 23:26 And the fourth area is the services that we offer to sit together with the customer and figure out, hey, is one of these helping you or there is something else that we can pull in a custom solution to solve your problems. Thanks for explaining that. That makes a lot of sense.

Conor Bronsdon 23:41 I'm curious to dig into the agentic side more. So you brought up that there's this scaling challenge happening with agents where getting to production, particularly in these highly regulated industries, can be challenging. And some of the ways that we've been tracking on this, and trying to understand the impact of agents, align very closely, I think, with your perspective. It's like,

Conor Bronsdon 24:05 did the LLM planner for this agent select the right tool and start on that right path? Did those tools work? Are they having errors? Are they actually advancing towards the ultimate goal, that is, does the trace reflect this action advancement? And then, of course, completion around, does the final action align with the agent's original instructions? And you brought up this challenge around

Conor Bronsdon 24:32 that final completion metric. I'm saying like, look, we have to be really careful, particularly in these highly regulated industries, to go, we actually need to achieve this goal. We can't risk this action completion being incorrect. We have to ensure that these agents are doing not just the right job, but they're doing it well. How are you approaching that with your customers?
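
A minimal sketch of how the four questions Conor lists (tool selection, tool errors, action advancement, action completion) might be scored over a single agent trace. The `Step` and `Trace` types, the field names, and the idea of a reviewer supplying `expected_tool` and `advanced_goal` labels are all assumptions made for illustration; this is not the metric implementation of any product discussed in the episode.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a hypothetical agent trace."""
    chosen_tool: str             # tool the LLM planner selected
    expected_tool: str           # tool a reviewer judged correct (assumed label)
    errored: bool = False        # did the tool call fail?
    advanced_goal: bool = True   # reviewer label: did this step move toward the goal?


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)
    completed: bool = False      # did the final action match the original instructions?


def score_trace(trace: Trace) -> dict[str, float]:
    """Return a rate between 0.0 and 1.0 for each of the four questions."""
    n = len(trace.steps)
    if n == 0:
        # An empty trace did nothing, so every rate is zero.
        return {"tool_selection": 0.0, "tool_error_rate": 0.0,
                "action_advancement": 0.0, "action_completion": 0.0}
    return {
        "tool_selection": sum(s.chosen_tool == s.expected_tool for s in trace.steps) / n,
        "tool_error_rate": sum(s.errored for s in trace.steps) / n,
        "action_advancement": sum(s.advanced_goal for s in trace.steps) / n,
        "action_completion": 1.0 if trace.completed else 0.0,
    }


example = Trace(
    steps=[
        Step(chosen_tool="search_policy", expected_tool="search_policy"),
        Step(chosen_tool="send_email", expected_tool="create_ticket",
             errored=True, advanced_goal=False),
    ],
    completed=False,
)
print(score_trace(example))
```

Keeping completion as its own field, rather than folding it into an average, reflects the point made in the conversation: in a regulated setting a trace with good per-step scores can still be unacceptable if the final action is wrong.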

Speaker 24:55 Yeah. There are four areas that we are looking into. The first one is the LLM itself. An agent is powered by an LLM. So all the concerns that we have with LLMs in terms of hallucination, lack of explainability, transparency, all of those are applicable here. Even the guardrails in terms of filtering HAP, jailbreak, they are all applicable to agents here. So that is coming into the picture. The second category is what I call agent guardrails.

Speaker 25:24 So these are the guardrails that are specific to the agents that you wanna develop. Like, for example, faithfulness of the action that was taken. There are a series of metrics that stem from the specific action calling and reasoning. Right? Those guardrails. So that's the second category that we wanna make sure that we have a solid story around. The third one is something that I call agent

Speaker 25:47 evaluation. So this agent evaluation is a superset of the LLM evals that we had in the past plus all the consequences of tool calling that is coming into the picture. Accuracy, performance, now you're talking about external APIs, the data that you're getting back from that API, how are you gonna evaluate the quality of the response that you got back. Right? Are you gonna have some sort of content filtering on those? Like, how are you gonna deal with those? So that's the third area that I'm looking into, agent eval. And last but not least is one area that Atin mentioned, observability and governance

Speaker 26:27 throughout the whole life cycle. Build time all the way to run time and, after that, over time, like checking what's going on with the agents. These four areas are the areas that we need to look into deeply and see if each is a requirement for the customer. If it's a requirement, what level of maturity do they need to have in order to go to production, and what is the custom subset of the metrics that they need to track? If it's not provided by the existing ones, how can we expand it or

Speaker 27:00 build shared metrics in the market around those to address those?
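
The per-customer assessment described here (is each area a requirement, what maturity is needed for production, which metrics must be tracked, and what is missing from existing tooling) can be pictured as a small gap report. This is a hedged sketch only: the `Area` names paraphrase the four areas, while the numeric maturity scale, the metric names, and `find_gaps` are hypothetical and not part of any IBM product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Area(Enum):
    LLM = "llm"                          # hallucination, jailbreak, content filtering
    AGENT_GUARDRAILS = "guardrails"      # e.g. faithfulness of the action taken
    AGENT_EVAL = "agent_eval"            # LLM evals plus tool-calling consequences
    OBSERVABILITY = "observability"      # build time, run time, over time


@dataclass
class Requirement:
    area: Area
    required: bool
    target_maturity: int = 0             # hypothetical scale, higher is more mature
    metrics: set[str] = field(default_factory=set)


def find_gaps(requirements, current_maturity, available_metrics):
    """Return (area, maturity_shortfall, missing_metrics) for each required area with a gap."""
    report = []
    for req in requirements:
        if not req.required:
            continue  # the customer does not need this area for production
        shortfall = max(0, req.target_maturity - current_maturity.get(req.area, 0))
        missing = sorted(req.metrics - available_metrics)
        if shortfall or missing:
            report.append((req.area, shortfall, missing))
    return report


requirements = [
    Requirement(Area.LLM, required=True, target_maturity=2,
                metrics={"hallucination_rate"}),
    Requirement(Area.AGENT_EVAL, required=True, target_maturity=3,
                metrics={"tool_call_accuracy", "api_response_quality"}),
    Requirement(Area.OBSERVABILITY, required=False),
]
current = {Area.LLM: 2, Area.AGENT_EVAL: 1}
available = {"hallucination_rate", "tool_call_accuracy"}

for area, shortfall, missing in find_gaps(requirements, current, available):
    print(area.name, "maturity shortfall:", shortfall, "missing metrics:", missing)
```

The `missing` list corresponds to the last question in the passage: metrics the customer needs that the existing set does not yet provide.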

Conor Bronsdon 27:05 Yeah. Agentic evaluations is definitely an area where we're spending time researching and improving our products as well. Atin, I'm curious if you wanna speak to your perspective on it, because I think there's a ton of ground to cover here. As Maryam and you have both highlighted, there are misconceptions about agentic AI, where folks just think, oh, this is gonna solve my problem, and they don't think about the entire structure that has to go around this to consistently solve the right problem in the right way. May from Writer said to us a couple months ago, you know, this is not magic. We have to actually put guardrails in place. We have to build the right systems. So, Atin, what's your perspective on this?

Speaker 27:42 Yeah, my perspective is slightly orthogonal, potentially, to what Maryam was saying, although I do agree with what she was saying, double clicking into essentially what an effective end-to-end agentic evaluation system comprises. I kind of flip the words. People talk about agentic evals. I see it as: we need to build evaluation agents, because every piece of evaluation essentially needs to be agentic by nature

Speaker 28:13 because they need to adapt to the different variables in your system once you productionize it, data being a key variable because data is always changing and there's no one size fits all metric or any kind of statistical measure that will give you a good sense of whether the entire agentic system or parts of it are doing well or not. What needs to happen is you need to curb the false positives very quickly

Speaker 28:39 and give the instructions to the agent, which is the evaluation agent, to not make those same mistakes so that the agent itself evolves with your product. So that's kind of my view in that you need to solve the problem end to end. One is you need to break it down into different components because each component needs to be evaluated in a different way. There's different measures for the health of your RAG system, which includes retrieval quality, ranking quality,

Speaker 29:06 and then there's the output which can be measured by a different set of metrics. But each of those metrics can't be one and done. They need to be adapted through human feedback, which is where there's a massive human feedback component. But the challenge for a platform play here becomes how do you take that human feedback in a minimal and least intrusive manner

Speaker 29:27 and incorporate it into your ecosystem so that the agent doesn't make that same mistake ever again. So it's a lot of these things that need to come together to build a cohesive platform, and we at Galileo kind of see them as self evolving evaluation agents, which is kind of like a multilayered system, at the bottom of which are the fundamental metrics, the statistical measures that give you some sense of kind of the leading indicators of good or bad. But on top of that, there's this automation

Speaker 29:56 layer that you have to build which can be seen as agentic in itself, which adapts to the data and any variable in the system, whether it's the usage patterns which are evolving over time or the data that's changing over time. And how do you make those evaluation paradigms adapt to that? And I'm glad you brought it up, Atin. I think it's two ways,

Speaker 30:19 both of them, not this or that. So for example, in agent evaluation, one area that we are looking into is LLM-as-a-judge, to go and look into, let's say, a chain of thought, break it down into different pieces, multiple nodes, and at every node evaluate if that was the right action to take. And if it wasn't, automatically regenerate the prompt

Speaker 30:49 template that can cover that and fix it. So an LLM behind the scenes is part of the process in this case to do the agent evaluation. So I think it goes two ways: using agents for evaluations, but also agent eval itself. No, absolutely. I totally agree. And the other side of the coin is the scale and cost, which we talked about initially. And the same challenges apply to evaluation

Speaker 31:18 as well. To your point about using LLM-as-a-judge and chain of thought, they do give you great results, but behind the scenes, chain of thought is a very expensive process which has, not to get too technical, multiple forward passes across the deep learning model, and those costs add up because they're running on very expensive hardware. So one of the challenges that we've been going down the journey of is how do you build effective

Speaker 31:44 but cheaper evaluations which need not necessarily use the most complex paradigms all the time. Chain of thought is great, but one of our discoveries has been that in many circumstances, from an evals perspective, chain of thought might not be needed. And you can get high-accuracy evals without chain of thought. And we've published about it as well. That's the other side of the coin, which is how do you surface highly accurate evaluations

Speaker 32:13 at a cheaper price? Exactly, looking at what is the use case, what is the best way to tackle that. Funny enough, last week we released the new Granite, Granite 3.2, Granite reasoning. For that one, you can toggle thinking on and off. So if you don't need that chain of thought reasoning, because it's very expensive, you turn it off. And for the use cases where you actually need it, you can turn it on. But I'm with you, it's a combination of models and evals, and at the end of the day, cost-performance optimization.
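
The exchange above combines two ideas: judging each node of a reasoning chain separately, and paying for expensive reasoning only where it is needed. A minimal sketch of that combination follows. It makes no real model calls: `toy_judge` is a stand-in for an LLM-as-a-judge request, the boolean `use_reasoning` flag only models the idea of a thinking toggle, and the tenfold token cost is an invented number for illustration. None of this is the Granite API or Galileo's evaluation code.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    ok: bool            # did the judge accept the action taken at this node?
    tokens_used: int    # rough cost of producing the verdict


# A judge takes (node_text, use_reasoning) and returns a Verdict.
Judge = Callable[[str, bool], Verdict]


def evaluate_chain(nodes: list[str], judge: Judge,
                   needs_reasoning: Callable[[str], bool]) -> tuple[list[int], int]:
    """Judge every node, enabling reasoning only where the caller's policy asks for it.

    Returns the indices of failed nodes and the total token cost.
    """
    failed, total_tokens = [], 0
    for index, node in enumerate(nodes):
        verdict = judge(node, needs_reasoning(node))
        total_tokens += verdict.tokens_used
        if not verdict.ok:
            failed.append(index)  # candidates for prompt-template regeneration
    return failed, total_tokens


def toy_judge(node: str, use_reasoning: bool) -> Verdict:
    # Stand-in for a model call: rejects nodes containing "ERROR" and
    # pretends that reasoning mode costs ten times as many tokens.
    cost = len(node.split()) * (10 if use_reasoning else 1)
    return Verdict(ok="ERROR" not in node, tokens_used=cost)


chain = ["look up account", "call billing API ERROR timeout", "summarize result"]

selective = evaluate_chain(chain, toy_judge, needs_reasoning=lambda n: "API" in n)
always_on = evaluate_chain(chain, toy_judge, needs_reasoning=lambda n: True)
print("selective reasoning:", selective)
print("reasoning everywhere:", always_on)
```

In this toy setup both policies flag the same node, but the selective policy spends fewer tokens, which is the cost-performance trade-off both speakers describe. Whether a cheaper judge stays accurate on real data is an empirical question this sketch does not answer.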

Conor Bronsdon 32:47 Love it. This has been a fantastic part of the discussion. I'd love to keep diving into agentic AI more broadly as it's obviously kind of the theme of 2025 thus far in AI. What are the misconceptions, Maryam, that you see different businesses having about how we should be approaching agentic AI? Let me tell you a fun fact.

Speaker 33:11 Those two master's degrees that you mentioned, it was actually multi-agent systems twenty years ago. So I did agents and multi-agent systems before deep learning. So when you talk to me, I'm looking at the definition of agents that have reasoning capabilities. I'm in that category of people that don't believe the LLMs of today can do reasoning, because they're not, like, in my mindset, that's very different than

Speaker 33:38 the culture of agents that I grew up with. Right? And I think this is also a misconception in the market. Like, what's commonly known as agents in the market is basically an LLM with function calling and tool calling, which is, don't get me wrong, a great opportunity to bring GenAI to every single corner of the enterprise because now you can connect them together. Right? But the promise of agents is really that autonomous reasoning

Speaker 34:07 and planning and making decisions and taking actions. I don't think we are there yet. We have some sort of preliminary planning in place, but I think that's the opportunity that over the next six to nine months I'm personally super excited about. I feel like once we get comfortable with our technology stack collectively in the market to a point that these models can actually make

Speaker 34:32 sound decisions, then that's the point that it unlocks a lot of use cases for all of us. And I think that's a misconception in the market, that it does it today, but it's not really capable of doing that today. I have a follow-up question, Maryam, for you. This might be a little more general, but in the evolving tool stack that you see today and, you know, the clear increase in capabilities of what we can do with LLMs.

Speaker 35:03 What is that number one thing that you find maybe a gap in the tooling ecosystem or the platform or anything that can enable developers to build effective high quality agents? If there's a gap at all that you see, what is top of mind? Yeah, I would say that there are three areas that are both a gap and an opportunity, right? They're the active areas that they care about. One, back to our optimization,

Speaker 35:31 flexibility and choices. When you talk to developers, I talk to my developers. I surveyed them and I said, how many tools are you using on average on a daily basis to create AI applications? More than half of them said five to 15 tools. Five to 15 tools in a market that is evolving rapidly, where new tools are coming up fast. I said, how much time are you willing to dedicate to learning a new tool to figure out if it's the right tool to integrate into your stack or not? They said not more than two hours.

Speaker 36:05 So they are craving new technologies that are easy to learn and master. Less than two hours, bring it in, integrate. This is a major need. Need for simplicity, need for abstraction, need for integration behind the scenes to address the optionality that they need. And I think that's the area that we should all work on collectively. The second area is innovation. The craving for innovation.

Speaker 36:32 That world where researchers develop something, then next year they publish a paper, and then next year product picks it up, it's all gone. Right? Now we are talking about research: What do you have today? What were you working on last night? Can I bring it into products? The speed of innovation, that's the expectation from developers. Even if they don't have the AI background,

Speaker 36:57 they expect us to have this state-of-the-art technology, as it becomes available to the market, in a form that they can consume. And that's a major gap. Like, it's not an easy thing because they need to go all over the stack. It's not just the application layer. They need to go to the model layer, GPU level, runtime level, optimize everything. And I think our role is to simplify that and tackle those two areas, three

Speaker 37:24 areas, simplification, simplicity, innovation, and optionality and choices. I couldn't agree more. I think I've had the same observation where I think we've kind of reached the point where the excitement for a new model that comes out is starting to kind of saturate and dwindle over time. Right? It's like, Hey, there's a new model. Great. It probably does the same thing. And there's

Speaker 37:50 step function improvements and some academic benchmarks, which the citizen developer or someone who just wants to build something awesome wouldn't really care about. And on the other side, there's tool fatigue and this proliferation of, Hey, here's another tool that can help you build something. But if that tool is not simple, to Maryam's point, I, as a developer, am trying to keep pace with the innovation and really trying to build something awesome, so give me tools that can help me on the way and not serve as impediments.

Speaker 38:22 So I see that as a challenge as well. Yeah. This pacing challenge also

Conor Bronsdon 38:26 speaks to something that Maryam brought up earlier in the conversation, talking about this survey of a thousand developers you did where, what was it, 23 ish percent said, Yeah, I feel proficient right now. The pace of this industry and of the tooling around the industry, everything you need to learn, it's moving so rapidly. There's always a new model coming out. There's always a new tool. It's got some new innovation,

Conor Bronsdon 38:47 and it's making it really hard for folks to feel fully comfortable. I can see a world where this just continues to accelerate, and it becomes harder and harder to keep pace. So I'm curious how you both think developers and their technical leaders should be thinking about this challenge.

Speaker 39:05 My suggestion is always to know your problem. What is the problem that you're trying to solve and establish a point of view from the market because then they are able to differentiate what is noise and what is the thing that is worth their time to spend on. They have to be very careful about their time. They have limited time to spend on different pieces of technology. And the best thing that they can do is understand their problem and have established a point of view to be able to evaluate what is noise, what is worth the time. I was in a hackathon recently

Speaker 39:42 where I saw a demo of essentially taking a research paper, like the Transformers paper, for example, which was the demo. You upload it and you create a podcast out of the paper. And it blew my mind because like, just turning the time back to when I used to do hackathons very, very often, this kind of stuff was unthinkable to be built in ten hours. So the technology and the ingredients are there for us to build something amazing.

Speaker 40:14 The only impediments are really the cost, the platform, and just the simplicity of tooling that can enable me to accelerate and build something very real as opposed to just being impediments. And the noise is really one of the big challenges. My last advice is take advantage of that. I was asking the developers how much AI assisted coding are you using? And basically

Speaker 40:40 a good portion of them, I don't remember the numbers now, maybe it was 49%. It was a sizable share. They said they use it very often. And on average, they said they are saving one to two hours per day on coding. And 4% of the developers that I talked to said they are saving more than four hours a day, if they really know how to use AI for that purpose.

Speaker 41:06 And I'm like, this is massive, like, get on that. Take advantage of AI for the sake of your own productivity.

Conor Bronsdon 41:15 Maryam, this has been a really fantastic conversation. Honestly, I'm looking at the talk track we kind of built in advance, and I feel like we only got through half our questions because you and Atin are diving into, like, so many interesting threads here and having these great back and forths. I'm like, well, I can't move us forward. They're having too great of a conversation here. But I want to close out with a couple of highlights

Conor Bronsdon 41:35 that we didn't have a chance to chat about. So one, do you have a good example of a success story you can share with the watsonx platform? Because you've talked about some really interesting examples here, some challenges with regulated industries, but I also know there's some big successes that have happened. So I'd love to know and maybe inspire our audience with what does success look like? The point that I like to call out is,

Speaker 41:59 and that's a misconception, that GenAI is applicable to every single enterprise use case. It's not. We need to understand what GenAI is capable of addressing and what are the use cases that it unlocks. And then when you look across enterprises, there are thousands of those examples that can come up. So for example, for generative AI, if you look into content grounded question and answering as the number one use case for generative AI, classification,

Speaker 42:28 information extraction, code generation that we were just talking about, summarization, right? It's some of the applications. But perhaps the top one that I like to highlight is content grounded question and answering, and most specifically in customer care. Because now we can basically equip whoever that has been providing historically the answers to our customers

Speaker 42:52 with knowledge that is generated from a body of information that can be quickly verified. And that was the story of last year. That was RAG. Now with agents, we can say, hey, LLM, if you don't find the answer in that body of information, you don't need to say, I don't know. Go fire up a web search API and see what you can find and come back. And we always have a human in the loop to verify that.
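
The flow described here (look in the grounded body of information first, fall back to a web search instead of answering "I don't know", and keep a human in the loop) can be sketched in a few lines. The function name, the three callables, and the result fields are hypothetical placeholders; a real system would plug in its own retriever, search API, and review step.

```python
from typing import Callable, Optional


def answer_with_fallback(
    question: str,
    search_corpus: Callable[[str], Optional[str]],
    search_web: Callable[[str], Optional[str]],
    human_approves: Callable[[str, str], bool],
) -> dict:
    """Try the grounded corpus, then a web search, and gate the result on a human check."""
    answer, source = search_corpus(question), "corpus"
    if answer is None:
        # Nothing in the body of information: go to the web instead of giving up.
        answer, source = search_web(question), "web"
    if answer is None:
        return {"answer": None, "source": None, "status": "not_found"}
    status = "approved" if human_approves(question, answer) else "rejected"
    return {"answer": answer, "source": source, "status": status}


# Toy stand-ins so the sketch runs end to end.
knowledge_base = {"refund window": "30 days"}


def fake_web_search(query: str) -> Optional[str]:
    return f"web result for {query}"


def always_approve(question: str, answer: str) -> bool:
    return True


print(answer_with_fallback("refund window", knowledge_base.get,
                           fake_web_search, always_approve))
print(answer_with_fallback("shipping to Mars", knowledge_base.get,
                           fake_web_search, always_approve))
```

Recording `source` alongside the answer lets the human reviewer see whether a reply came from the verified corpus or from the open web, which matters for the verification step mentioned in the conversation.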

Speaker 43:20 So it's just amazing to think about the potential consequences for every single industry. This very simple example of agentic RAG, or whatever you wanna call it. Right? And because of that, I think I'm super excited. It's not just one example use case for one specific industry. It's impacting every single one of us as consumers, as enterprise businesses. That's the beauty of GenAI. We just have to figure out how to really deliver on the potential of

Speaker 43:54 what they have promised to deliver with our stack and our capabilities,

Conor Bronsdon 43:59 empowering developers to build those. Absolutely. And I think that's why conversations like this one are so important. So Atin, Maryam, thank you so much for joining me today. Maryam, where can folks learn more about your work or the work you're doing at IBM? watsonx.ai. That's my product. Perfect. That is an easy one to remember, and we will, of course, link it, along with everything else we discussed today, in the show notes. Go check out the research we've talked about. If you've got that survey you mentioned, that you sent out to developers, and some of that's public, we'd love to share that as well. Wonderful. And if you're listening and you are checking out the show notes,

Conor Bronsdon 44:30 maybe just go ahead and leave us a review, hit that subscribe button on whatever platform you're listening on. It really helps us bring more guests like Maryam on the show. If you're developing an AI product or leading a team, you can reach out to me or to Atin directly, at Conor Bronsdon or Atin Sanyal. You can find us both on LinkedIn. We'd love having this conversation, so reach out if you'd like to be on the show or have a guest to suggest. Maryam, just thank you so much. This has been so fantastic. It was a ton of fun. Thank you.

Conor Bronsdon 44:56 Thanks all.