
Episodes · S2 E27

Your Key to AI Success is Hiding in Plain Sight | Cohesity's Greg Statton

· Greg Statton, Cohesity · 46 min

AI Agents · RAG & Retrieval · Enterprise AI

Key takeaways

  • Cohesity began as an infinitely scalable distributed file system aimed at backup, then layered on security, file-system access, and now generative AI. Greg Statton frames each phase as a stop on one journey — re-leveraging data that customers already store but rarely touch — noting it is “pretty wild to spend money on something” you hold and hope you never have to use.
  • An early attempt to monetize that data flopped. Cohesity exposed an analytics workbench letting customers run MapReduce queries by writing custom Java mappers and reducers and uploading JAR files to the cluster. Statton admits “literally nobody in the enterprise” wanted to do that, so they paused it and pursued easier-to-adopt steps instead.
  • Cohesity’s machine-learning entry point was inline anomaly detection in the backup stream. By modeling how data changes between backups, the system fingerprints sudden entropy spikes that often signal ransomware encrypting or wiping data. Statton says it has caught malware before customers’ own SecOps tools, alerting the SOC to begin its security procedures.
  • The generative-AI push started as Statton’s personal experiment about three years ago. Leading a team of field experts who kept re-answering the same questions, he saw a “semantic divide” between official FAQs and how newcomers actually phrased things, then built a fix on GPT-3 and Cohesity’s internal docs — effectively a RAG system before the term was common, using an in-memory TF-IDF and cosine-similarity index, with reference links back to source files. The founder and CEO saw it and moved him into R&D to productize it.
  • Statton’s core advice is unglamorous: before chasing leaderboard models or agentic flows, get peers across the org to agree on a data governance model, map where data lives, and decide what AI should never touch. He calls this data readiness, preparedness, and hygiene the hard part everyone skips, because however good the model is, “it’s still going to give you garbage if you give it garbage.”
  • On evaluation, Statton softened his own “turtles all the way down” critique of LLM-as-judge: he says “I don’t know if you should never do this” — it can be valid if you have proper evaluations on the evaluator itself, but it shouldn’t be the only method or use the same model that generated the response. He favors decomposing outputs into claims with an LLM, then using other models plus retrieved context to check each one.

Frequently asked questions

Who is Greg Statton and what does he do at Cohesity?
Greg Statton serves in Cohesity’s office of the CTO as vice president of AI solutions, now working in core R&D. On the episode he notes he is approaching his ten-year anniversary at the company and has worked in nearly every department except finance — including marketing, sales, architect and SE roles, and a global field role. He describes himself as a tinkerer who, despite not holding a PhD in machine learning or AI, has been curious about the space for the last fifteen to twenty years. Conor Bronsdon recorded the conversation on the road at Microsoft Build.
How did Cohesity’s generative-AI work actually begin?
It started as an internal experiment Statton ran about three years ago. He led a team of field experts whom new sales engineers kept asking the same questions, even after being pointed to docs and FAQs — a “semantic divide” between how FAQs were written and how newcomers phrased questions. After OpenAI opened GPT-3 access (he recalls a Reddit post), he built a web UI with an in-memory semantic index using TF-IDF vectorization and cosine similarity, retrieved relevant doc chunks, and passed them to GPT-3 — an early RAG system. Cohesity launched its first generative-AI apps about two years ago.
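
The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not Statton's actual prototype: the `DOCS` chunks, the smoothed IDF formula, and the prompt wording are all assumptions, and the call to the language model is left out. It builds an in-memory TF-IDF index, ranks chunks by cosine similarity to the question, and packs the top hits into a prompt with numbered references.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for chunks of internal documentation.
DOCS = [
    "To replicate a backup to another site, add a replication target to the protection policy.",
    "Archival sends snapshots to long-term retention storage such as a cloud bucket.",
    "Instant recovery clones a snapshot so a virtual machine can boot directly from the backup.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the index are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def build_index(chunks):
    """Return (idf, vectors) for the chunks, all held in memory."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumption): terms in every chunk keep a small weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return idf, [tfidf_vector(toks, idf) for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, chunks, idf, vectors, k=2):
    """Return up to k (chunk, score) pairs, best match first."""
    q = tfidf_vector(tokenize(question), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(chunks[i], s) for s, i in scored[:k] if s > 0.0]

def build_prompt(question, hits):
    """Pack the retrieved chunks into the prompt as numbered references."""
    context = "\n".join(f"[{n}] {text}" for n, (text, _) in enumerate(hits, 1))
    return (
        "Answer using only the context below and cite the bracketed sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

idf, vectors = build_index(DOCS)
question = "How do I copy my backup to a second site?"
hits = retrieve(question, DOCS, idf, vectors)
prompt = build_prompt(question, hits)
print(prompt)
```

TF-IDF matches on shared words rather than meaning, so a question phrased with entirely different vocabulary would still miss; the numbered references give readers a way to verify an answer against the source chunk.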
What does Statton say enterprises should do before building AI on their data?
He argues the first step is non-technical: get peers cross-functionally — marketing, HR, engineering — to agree on a governance model for the data, which he says almost no organization has. From there, map where all the data lives, decide which versions matter, set access controls, and identify data that should never interact with an AI model. He frames this as data readiness, preparedness, and hygiene — the work people skip, because however good the model is, “it’s still going to give you garbage if you give it garbage.”
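
One way to picture the outcome of that governance work is a catalog whose entries carry the agreed decisions, so that an AI pipeline can filter on them before anything is indexed. The sketch below is purely illustrative: `CatalogEntry`, its fields, and the example paths are hypothetical and do not describe any Cohesity feature.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record capturing the governance decisions."""
    path: str
    owner_team: str
    ai_allowed: bool               # False means "never interacts with an AI model"
    reader_groups: FrozenSet[str]  # groups permitted to read this data

def eligible_for_ai(entries: Iterable[CatalogEntry], user_groups: Iterable[str]) -> List[CatalogEntry]:
    """Keep entries that are cleared for AI use and readable by the user."""
    groups = set(user_groups)
    return [e for e in entries if e.ai_allowed and groups & e.reader_groups]

catalog = [
    CatalogEntry("docs/design-v3.md", "engineering", True, frozenset({"engineering"})),
    CatalogEntry("hr/salaries.xlsx", "hr", False, frozenset({"hr"})),
    CatalogEntry("marketing/launch-plan.docx", "marketing", True, frozenset({"marketing"})),
]

visible = eligible_for_ai(catalog, ["engineering"])
print([e.path for e in visible])
```

Filtering before indexing, rather than after generation, keeps excluded data out of the retrieval path entirely.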
Why is Statton skeptical of using one LLM to evaluate another?
He has called the approach “turtles all the way down,” though on the episode he walks it back slightly — saying “I don’t know if you should never do this,” and that it may be valid if you have proper evaluations on the evaluator model itself. His concerns: it shouldn’t be the only method, and it shouldn’t use the same model that generated the response, which he likens to walking into a room of thieves as a cop and asking if they are thieves. He also warns it can overfit toward a model’s own bias. His preferred pattern decomposes a generation into discrete claims and uses other models plus retrieved context to check each one.
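
The claim-checking pattern can be outlined as a small harness. This is a sketch under stated assumptions: `decompose` and `verify` stand for model calls supplied by the caller, the model names are placeholders, and `toy_decompose` and `toy_verify` are simple stand-ins (sentence splitting and word overlap) that only make the example runnable. The one rule enforced in code is the one from the episode: the checker must not be the model that generated the response.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ClaimVerdict:
    claim: str
    supported: bool

def evaluate_response(
    response: str,
    context: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str, str], bool],
    generator_name: str,
    verifier_name: str,
) -> List[ClaimVerdict]:
    """Split a response into claims and check each one against the context."""
    if generator_name == verifier_name:
        raise ValueError("the verifier must not be the model that generated the response")
    return [ClaimVerdict(c, verify(c, context)) for c in decompose(response)]

def support_rate(verdicts: List[ClaimVerdict]) -> float:
    """Fraction of claims judged to be supported by the context."""
    return sum(v.supported for v in verdicts) / len(verdicts) if verdicts else 0.0

def toy_decompose(response: str) -> List[str]:
    """Stand-in for an LLM call: treat each sentence as one claim."""
    return [s.strip() for s in response.split(".") if s.strip()]

def toy_verify(claim: str, context: str) -> bool:
    """Stand-in for a second model: every word of the claim must appear in the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    return set(re.findall(r"\w+", claim.lower())) <= context_words

context = "Backups run nightly. Snapshots are retained for 30 days."
response = "Backups run nightly. Snapshots are retained for 90 days."
verdicts = evaluate_response(
    response, context, toy_decompose, toy_verify,
    generator_name="model-a", verifier_name="model-b",
)
for v in verdicts:
    print(v.supported, "-", v.claim)
print(support_rate(verdicts))
```

Scoring per claim, instead of asking for a single overall grade, shows which statement lacks support in the retrieved context.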
How does Cohesity use backup data for security?
Because Cohesity ingests data repeatedly, it can model how that data changes between backups and build fingerprints of normal behavior. It added anomaly detection inline as data is ingested, flagging when a dataset’s entropy changes dramatically — often a sign that a bad actor is encrypting or wiping data. Statton says this has worked successfully: in some cases the engine caught the malware before the customer’s own SecOps tools did, and the alert from Cohesity let the SOC begin its security procedures. He frames it as augmenting, not replacing, a company’s existing security stack.
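
A rough sketch of the idea follows. In the transcript Statton mentions checking whether data falls "dramatically" outside a linear regression line, so this example fits a line to the entropy of earlier backups and flags a new backup that lands far from the prediction. The threshold `k`, the `min_sigma` floor, and the sample history are assumptions chosen for illustration; this is not Cohesity's detection engine.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def fit_line(ys):
    """Least-squares slope and intercept for points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def is_anomalous(history, new_value, k=3.0, min_sigma=0.05):
    """Flag new_value if it is far from the trend line fitted to history.

    Requires at least two historical values. k and min_sigma are illustrative.
    """
    slope, intercept = fit_line(history)
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    predicted = intercept + slope * len(history)
    return abs(new_value - predicted) > k * max(sigma, min_sigma)

# Entropy per backup for one dataset (hypothetical values, bits per byte).
history = [3.9, 4.0, 4.1, 4.0, 4.1]

plain = b"quarterly report draft " * 200   # repetitive text: low entropy
scrambled = bytes(range(256)) * 16         # uniform byte values: maximum entropy

print(round(shannon_entropy(plain), 2), shannon_entropy(scrambled))
print(is_anomalous(history, shannon_entropy(scrambled)))
print(is_anomalous(history, 4.05))
```

Encrypted data looks close to uniformly random, so its entropy approaches 8 bits per byte, which is why a sudden jump is a useful signal. A real system would track far more than a single number per dataset.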

Chapters

  1. 00:00 Introduction
  2. 00:36 The Role of Gaming in AI Development
  3. 05:43 Personal Gaming Experiences
  4. 08:26 The Intersection of AI and Gaming
  5. 12:53 Importance of Data in Game Development
  6. 19:03 User Testing and QA in Gaming
  7. 25:49 Postmortems and Telemetry
  8. 27:21 Beta Testing and Data Preparedness
  9. 29:18 Traditional AI vs Generative AI
  10. 31:31 Challenges of Implementing AI in Games
  11. 35:57 Leveraging AI for Data Analytics
  12. 39:41 Automated QA and Reinforcement Learning
  13. 42:01 AI for Localization and Sentiment Analysis
  14. 44:21 Future of AI in Gaming

Show notes

What if the most valuable data in your enterprise—the key to your AI future—is sitting dormant in your backups, treated like an insurance policy you hope to never use?

Join Conor Bronsdon with Greg Statton, VP of AI Solutions at Cohesity, for an inside look at how they are turning this passive data into an active asset to power generative AI applications. Greg details Cohesity’s evolution from an infinitely scalable file system built for backups into a data intelligence powerhouse, managing hundreds of exabytes of enterprise data globally. He recounts how early successes in using this data for security and anomaly detection paved the way for more advanced AI applications. This foundational work was crucial in preparing Cohesity to meet the new demands of generative AI.

Greg offers a candid look at the real-world challenges enterprises face, arguing that establishing data hygiene and a cross-functional governance model is the most critical step before building reliable AI applications. He shares the compelling story of how Cohesity's focus on generative AI was sparked by an internal RAG experiment he built to solve a "semantic divide" in team communication, which quickly grew into a company-wide initiative. He also provides essential advice for data professionals, emphasizing the need to focus on solving core business problems.



Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

Company Website: cohesity.com

LinkedIn: Gregory Statton


Check out Galileo

Try Galileo

Agent Leaderboard


Transcript

122 segments

Speaker 0:00 Oh, it's lovely.

Conor Bronsdon 0:02 It's not my opinion of it. I'm why are you pulling the host? Dang it.

Speaker 0:08 Conor, you just, you made it too easy for me. It could've gone either way, but it didn't. Yeah. Yeah. We'll go there. We'll go there.

Conor Bronsdon 0:21 Welcome back to Chain of Thought, everyone. I'm your host, Conor Bronsdon, on the road at Microsoft Build. And today, I am delighted to have Greg Statton joining me. Greg serves in the office of the CTO and as vice president of AI solutions at Cohesity. Greg, welcome to the show. Great to see you. Thanks a lot, Conor. I'm really excited to have this conversation with you. Yeah. I'm excited to chat with you. It's really appreciated that you made time in between your busy travels. You were just in the UK. I know you've been all over the place talking to customers, talking to folks in the AI space.

Conor Bronsdon 0:52 And so we're delighted to be able to dive into the critical challenge that is facing nearly every enterprise looking to build with AI, harnessing the vast, often siloed stores of internal data they have. Cohesity started by tackling the massive scale of enterprise data management and backup and is now at the forefront of helping organizations unlock that data

Conor Bronsdon 1:13 for the AI revolution. Greg, Cohesity manages staggering amounts of data, hundreds of exabytes globally, much of which traditionally sat as a kind of insurance policy in backups. But you've embarked on a fascinating journey to turn that passive data into an active asset, especially for AI. Let's start there. You've mentioned that Cohesity started with building an

Conor Bronsdon 1:42 infinitely scalable distributed file system, but focused initially on those backups we mentioned. Can you walk us through the key evolutionary steps that Cohesity has taken moving from this initial focus towards enabling broader data intelligence and now AI?

Speaker 2:00 Yeah, absolutely. And I think you said a pretty critical word there. Fascinating. It is really a fascinating story, I think, especially looking outside in. But from the inside, it's kind of been a destination where we've been heading towards or a stop along the journey ever since we founded the company. Like you said, we started off building this infinitely scalable distributed file system. And a lot of us came from that pedigree of hyperconvergence.

Speaker 2:27 So now we had this file system. We wanted to bring workloads onto it. And we took a step back. We said, Hey, you know, the world of data management, data protection hadn't been, you know, enhanced or revolutionized in a very long time. So we said, Hey, this is something that companies all have to do. They have to back up and protect that data to ensure the integrity and the resiliency of that data as a last resort.

Speaker 2:51 But if it's stored on an intelligent file system, there's a lot more that we can do with it. And I think that's kind of where we started to coin this phrase of enabling our customers to re-leverage their data for operational efficiencies. So we built this backup suite of tools that ran right on top of the file system. It could connect to every single major enterprise application,

Speaker 3:12 both on prem and in the cloud. And we started off by enabling a really easy means for customers to back that data up and then send it to wherever it needed to be. If they needed to replicate to another site, they needed to archive for long term retention. They wanted another copy in the cloud made that extremely seamless and extremely efficient in the way that we transport and

Speaker 3:33 save that that data. And I think one interesting anecdote around this this time is we built the backup software and we really wanted to jump towards this, you know, enabling customers to gain additional insights from the data. And we started off if you've been a customer since the beginning, you would remember this. But we had this analytics workbench where we enabled customers to go out and

Speaker 4:00 use the clustered file system with all of the CPU and memory in there, plus the distributed data to be able to run MapReduce queries on the data. And us being a whole bunch of nerds, we said, Oh yeah, everyone's going to love to do MapReduce queries on the data. And it's super simple. You just write your own custom mapper and reducer in Java and upload the JAR files to the cluster,

Speaker 4:22 point it at files, and away it goes. Well, shocking, I think, to literally nobody in the enterprise, but shocking to us, was that that was hard to do. So we kind of pushed the pause button on that and focused on kind of the next evolutionary step of backup, then security, then accessing as a file system. And then now kind of in this new phase of generative AI, being able to bring that data to generative AI to help unlock that data's future potential.

Conor Bronsdon 4:53 So I really like that Cohesity and you, Greg, are both talking about how do we releverage this data that we already have, but maybe isn't being used. As you start thinking beyond backup use cases into test automation, security, how did you pave the way for turning passive data into active data?

Speaker 5:16 Yeah, it I mean, it really started at its core. You know, I think traditionally a lot of the data management, data protection companies before us focused on the cheapest way to store data in a proprietary fashion on the cheapest medium possible. And there'd been a lot of advancements in compute and in storage and in memory. So, and we wanted to be able to kind of tap into that and harness that.

Speaker 5:42 So, what we what we did is we said, hey, you know, this data is sitting there and our file system is this snapshot based file system. You know, we can very easily by leveraging the way that we structure our metadata within the file system, instantly create kind of redirect on write clones of that data. So we can go in and say, hey, you have these backups of these hundreds or thousands of VMs or this NetApp filer or or this S3 bucket in in AWS.

Speaker 6:13 How can we instantly reprovision that backup? Not only for recovery. That was kind of the initial phase for this is like, hey, how can we help our customers not only back up quickly, but recover quickly? It's how we do these cloning of it. But then also that allows access to the data. And I think that was kind of a little bit of an aha moment is like, hey, if we can back this up and if we can then reprovision this back out, it becomes extremely useful.

Speaker 6:41 And, you know, companies are dealing with these literal truckloads of data and trying to shift around these raw bits, these ones and zeros. But data in and of itself isn't necessarily extremely useful. You know, I think within IT, our job is to be able to apply context to data, to turn it into information for the business. And then the business can take their understanding

Speaker 7:06 of that information and turn it into knowledge. And so by being able to help in the context collection through traditional metadata, advanced metadata, it allows us to be able to help customers not only find the raw information that's useful to them, but the correct version, the correct point in time, you know, and be able to kind of help enable this governance layer on top of it to more efficiently move that data for use with other applications.

Conor Bronsdon 7:40 Let's kind of unravel this point you made just now about efficient use of data. What do you mean when you say that?

Speaker 7:49 I spent a lot of time in IT earlier on in my career, and we were responsible for caring for and feeding the data that drove the business. And there's lots of it. And I think if you talk to anybody in the enterprise, no matter the size, if you point to any given file anywhere in their environment, chances are there's three, four or five copies of that piece of data, that file, that

Speaker 8:17 object. And there's going to be different permission structures for all of them. There's going to be different information inside of it from from different versions. And so when you're looking forward now in today's world, trying to hop on and leverage AI efficiently in your in your organization, you have some like really tough questions to be able to answer for. And it's not very efficient to back up a truck

Speaker 8:42 load, you know, hundreds of petabytes of data into it virtually or physically and ship it off to like AWS or Azure or GCP. Now, the cloud providers will love that, but it's extremely costly. It takes a lot of time because physics haven't changed. You know, we're still bound by by laws of physics. So the time it takes, then you're extending that time to to value.

Speaker 9:07 So the more that you can do in preparation of that data for your end goal. So again, like, you know, bringing all that data onto a single platform so that you can start munging it, to make up a word maybe or to use a word differently here, munging it. All those words. You can have you got that. Yeah, I love it. So you're going to munge that data. You're going to be able to sift through traditional metadata

Speaker 9:29 and then through Cohesity, we can kind of help create some more advanced metadata on top of that, but it allows the enterprise to have a governance view into this data that they've never been able to easily have before. And it's not like going and buying another piece of software and then creating another copy of this. This data is being reused. Its primary purpose is for backup. And with backup, you hope you never have to use the data again. And if you kind of think about it, it's pretty wild to spend money on something that's holding something very important to you that you hope you never have to use. So that's why we decided, hey, why not when it's not being needed for, you know, a full data center recovery or your boss needs to recover an email, let's start doing more with it and help the enterprise become more efficient with managing their data.

Conor Bronsdon 10:18 One of the key ways that Cohesity is leveraging this backup data on behalf of their customers is with this shift towards security intelligence, detecting malware or ransomware within backup streams using anomaly detection. How has successfully applying machine learning for security influenced the roadmap that you have been building for more advanced AI applications?

Speaker 10:46 Yeah. Yeah. I think it touches a little bit on on what I was talking about before around being able to collect more information about that raw data, applying more context to it. So we launched a suite of of security tools because, again, this is this is data that touches everything in your enterprise. So this is every application, every endpoint at times across all different locations physically and and virtually.

Speaker 11:15 And so while this isn't set up to replace, you know, all of a company's security tool sets, it can help augment and provide richer insights to help them make decisions. Like, for instance, we're backing up this data and this this data changes over time. So, you know, the first time you, you go in and do a backup, it's a full backup. So it's a copy of everything from that point in time. But then going forward,

Speaker 11:42 we're looking at the change in data between time, which provides a really interesting insight, being able to look at and model the change of data over time allows us to create fingerprints of this. And so, one of the first things that we did in the machine learning space was like, hey, we can actually put in some anomaly detection into the backup stream as we're ingesting the data to be able to start helping our customers flag data that could be potentially infected by malware. And this is a very fast in line method of being able to look like, hey, does this data like dramatically fall outside of this, of this linear regression line? And maybe this is something that, that a customer should go look at and can more quickly

Speaker 12:27 alert upstream, you know, SecOps tools or security teams or SOCs that, hey, this application now, the entropy has changed dramatically. And when there's a larger entropy change, it more often means that there's some bad actors in there encrypting all of the data or wiping out the data. And so we've seen that be used very successfully by our customers.

Speaker 12:53 There's been a couple of cases where our anomaly detection engine, because of when the backup was running, actually caught the malware before the SecOps tools did. And the SOC was alerted. And so they received the alert from Cohesity and they were able to start implementing their security procedures to help safeguard the company. But again, this is all now about collecting

Speaker 13:17 more metadata, more meta information about this data and how it's changing over time. So this this this then now provides a lot more context to help organizations or or AI turn that information into knowledge.

Conor Bronsdon 13:32 Let's drill down there. As orgs are looking to, as you put it, turn this information into knowledge, great phrase, what's the next step look like? What what's the next thing that you think should be built? I'm I'm so curious to understand more about what the roadmap looks like for you.

Speaker 13:49 Yeah. So right now, today, customers go out and buy a Cohesity. Hopefully, they buy a lot of Cohesity, but they'll buy some Cohesity, and they're oftentimes, you know, optimizing their backups, and they're fully integrating the security suite and they're starting to remove some of these extra applications and fees and contracts that they have because now they can consolidate

Speaker 14:12 on there. It was about two years ago when we launched our first generative AI applications. Hey, let's kind of tap into this and provide some retrieval-augmented generation, or RAG, pipelines to this data. You know, we focus on the information retrieval pipeline, creating a semantic index of that data. Our customers can then open this up to other lines of business in their organization,

Speaker 14:36 and they can use their own data to apply context to this. But then if you go back a little bit further in the conversation we had, this is additional meta information. Like I can't even call it metadata. It's not like a file size or ownership. We're now capturing semantic information. We're capturing topic and theme analysis of this data. This is yet more information, applying it as context.

Speaker 15:00 And so, you know, as we kind of look further forward, one of the key areas that we want to be able to solve for and what we're building now is how can it help companies either build micro SaaS AI solutions for their industry or for their line of business through very simple API access to data? How can it help empower ML teams? I think like you probably know if you if you're new joining

Speaker 15:30 an ML, AI or data science program within an organization and you need data for your for your models, there's not really any great cataloging of this. You go find the person with the most tenure in that organization and ask them where they can find this data. And that's madness. And so I think as we're looking forward is like, hey, how can we make it just stupid simple

Speaker 15:51 to find the exact piece of information and data you are looking for to help you solve your business problem? And so think of it like a global data catalog that encompasses traditional metadata, kind of augmented traditional metadata with security enhancements. And then now a new world of semantic metadata and that change over time. And this can empower lines of businesses. This can empower security teams to be able to do like deeper forensic analysis.

Speaker 16:21 And I think, again, it's using that single data fabric where all that data's already been collapsed into.

Conor Bronsdon 16:27 Yeah. I I think it's really interesting to to think about this idea of where data flows and where you have these reuse opportunities. And I I like that everything you're describing thinks kind of through this stage one of, like, okay. Let's leverage the data here, and then, okay. Here's how we evolve it to next. And my understanding of that is that, essentially,

Conor Bronsdon 16:56 Cohesity's use of GenAI started by tackling an internal problem, actually. Yeah. Using GPT-3 and Cohesity data, you essentially created an early RAG system before it was widely termed. Could you share more about that initial experiment and how the aha moment that you experienced helped you realize the potential for leveraging backup data this way?

Speaker 17:22 Yeah. No. That is a really fun story. So I'm a tinkerer. I always love tinkering with new things. And while, you know, I don't hold a PhD in machine learning or AI, I've always been curious, the last fifteen, twenty years, in this space. And this was probably about three years ago now. I was running a team of global field experts. So these were deep technical experts in our product that helped our

Speaker 17:52 our sales teams win deals with large customers and then help those large customers be successful. And because this team was very knowledgeable, all of the new sales engineers or new people coming into the company would always ask the folks on my team questions over and over again. And oftentimes it's the same question. And what was hard for them while they always wanted to say yes and they always wanted to help,

Speaker 18:17 it was tough to then constantly context switch, answer a question they probably answered 10 times already and then go forward. And when I started questioning a handful of the folks on the team, I said, Well, did you tell the person to read the documentation? Yeah, Greg, we told them the documentation. They still ask the question. Yeah, I still ask the question.

Speaker 18:34 Did you tell them to read the frequently asked questions that we constantly spend time updating? Yes, Greg, told them to read the FAQ. Did they still have a question? Yeah, they still have a question. And then it dawned on me. I was like, hey, everybody's going to ask a question slightly differently. And the way that they're asking the question is the way that they understand the information that they know about today,

Speaker 18:57 trying to fill in gaps. And so there's a semantic divide between what we think are frequently asked questions and what a new person actually has. And it was right around this time when I saw, I think, probably a Reddit post of OpenAI providing access to GPT-3. I was like, Well, this is pretty interesting. I hadn't really spent too much time with NLP because words are hard at times.

Speaker 19:22 And I said, well, this is interesting. So I signed up and I got a bunch of free credits. I was like, oh, this is sweet. Let me play. With the model, I could send it text and it could send me new text back. Oh, this is interesting. Transformers are really cool. And I already knew that coming into it. But what I started doing is like, well, hey, if I send it some text and then I also pack in some more tokens of like some additional context,

Speaker 19:50 I can help frame the response from the model, even though the model wasn't trained on this data. So I quickly whipped together a little web UI where I created, like, a semantic index. I just loaded it in memory using TF-IDF vectorization and cosine similarity to find paragraphs of text or chunks of text from our internal docs that were semantically relevant to the

Speaker 20:19 user's question. Pass it all to GPT-3 and it answered. And it was an aha moment for me. I said, well, this is great. So then we kind of hosted our little internal prototype after running on my laptop and started opening up to some SEs. They were like, well, this is great. Like, you know, it's not 100% accurate all the time, but like, this is great. And then I started providing reference links back to the files that we used to answer the question. They said, this is great. I can get my answer. I can go verify it with the docs.

Speaker 20:51 And it was at that point in time, our founder and our new CEO caught wind of what I built. They said, can you show it to us? Yeah, of course. Showed him. I said, we're going to move you into our R and D organization and we need to build this into the product. I said, well, yeah, it makes total sense. Cause all the data we were using was being stored on Cohesity to begin with. If, if we had this problem, if I had this problem, my team had this problem,

Speaker 21:17 more more than likely, there's a lot more people that are suffering the same problems. And so we kind of started to unlock this knowledge discovery problem, to solve internal problems that that we're also now helping our external customers with.

Conor Bronsdon 21:32 I don't think I realized that you were on the SE side of things at that point when you joined Cohesity.

Speaker 21:38 That's such an interesting journey. I have had several jobs, you know, the way I like to describe myself, because it's it's my ten year anniversary coming up at Cohesity, and I think I've worked in almost every single department except for finance. And I think we're a better company for that decision for the company, not put me in the finance department ever.

Speaker 21:59 But I was I've been on the marketing side. I've been on the the sales side, both in an architect role, an SE role, a global resource, now in core R and D.

Conor Bronsdon 22:11 Love it. Yeah, it's really cool to look at your career journey. I think folks who feel pigeonholed, maybe go check out Greg's LinkedIn, which we'll certainly link in the show notes because it's very inspiring to see how you've, you know, continued to expand your technical skills and then also apply them in different domains, whether it's, like, starting as an application developer to now being a VP of AI solutions today and doing core R&D. So, very, very cool to see. So, obviously, mastering retrieval is so critical

Conor Bronsdon 22:43 before even getting to the generation part because as you mentioned earlier, you know, like, it's great to have something that works most of the time, but the more accurate you can get it, the better. And particularly in large complex organizations that have lots of data to unlock and to leverage, there's huge opportunities here. So what's your thought process around

Conor Bronsdon 23:05 how organizations should be approaching retrieval for their AI solutions today?

Speaker 23:09 A lot of companies will come and ask questions like, Well, Greg, you know, we want to start leveraging Gen AI internally. We're reading all the blogs and they're saying that I'm going to save, you know, 10x my investment in just 90 days or I can create this new application with just five lines of Python code and it scales to infinity. You know, how do I get that tomorrow? And I said, well, let's hold on.

Speaker 23:35 Put a pin in that for a second. I said, there's a lot of hype around this, but there's also a ton of truth to what people are talking about. There's a ton of value to be gained. But I feel like a lot of people skip over the hard part. It's less fun. It's less exciting. It's less sexy. But I think those who have been in the world of machine learning and data science know that it starts with data. And so I'll talk to a lot of folks and say, here's what you need to do first. This is a nontechnical thing,

Speaker 24:11 but get to your peers cross-functionally within the organization and all agree on a governance model. So you need to first figure out where all your data lives. Now that you've got this mapped out globally, you say, all right, what version of this data is going to be important to us to be able to get to our end state? Now that you've got that identified, it's like, well, of this data,

Speaker 24:39 there's probably some data that some people shouldn't be able to see. There's some data that people should see. There's probably some data that you just don't want ever interacting with an AI model, internally or externally. It just could be way too sensitive for you right now. So you need to be able to then go and say, all right, here's the data that we want to be able to use. Here are the access controls we want to be able to put on top of that data.
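The access-control step described here can be pictured as a filter that runs before any retrieval happens. The following is a minimal Python sketch under stated assumptions: each document carries a set of groups allowed to read it and a flag for data that should never reach a model. The `Document` class, its fields, and `retrievable_for` are hypothetical names chosen for illustration, not anything Cohesity has described.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    """Hypothetical record: a piece of content plus its governance metadata."""
    doc_id: str
    text: str
    allowed_groups: frozenset = field(default_factory=frozenset)  # who may read it
    ai_excluded: bool = False  # True means no AI model may ever see it


def retrievable_for(user_groups, documents):
    """Keep only documents this user may see AND that AI is allowed to touch."""
    user_groups = set(user_groups)
    return [
        doc for doc in documents
        if not doc.ai_excluded and user_groups & doc.allowed_groups
    ]


corpus = [
    Document("faq-1", "How to restore a backup", frozenset({"support", "engineering"})),
    Document("hr-7", "Salary bands", frozenset({"hr"})),
    Document("legal-3", "Pending litigation notes", frozenset({"legal"}), ai_excluded=True),
]

# A support user sees only the FAQ; the legal document is excluded for everyone.
visible_to_support = [doc.doc_id for doc in retrievable_for({"support"}, corpus)]
visible_to_legal = [doc.doc_id for doc in retrievable_for({"legal"}, corpus)]
```

Applying the filter before embedding or ranking means a sensitive document never enters the candidate set, which is one plausible way to honor a governance model agreed across departments.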

Speaker 25:05 And this is kind of a stepped phase: data readiness, data preparedness, data hygiene. It's something that I think we as an organization or an industry have put aside for far too long because it's hard. It starts with consolidating the data into one place, then applying that governance layer that you've hopefully talked through with all of your peers cross-functionally,

Speaker 25:28 which, no shocker, nobody has. Definitely. Yeah. No, there's not a single person who says yes when I ask, "Have you talked to your marketing, your HR, your engineering? Do you guys have a governance model?" Start with that. Start with a governance model around the data. Well, Greg, I hear we can just ask you that, since you've been in all those departments. I can. And I just say, give me the access to the data.

Speaker 25:51 But it is by far less exciting than saying, well, I want to go test out the top-of-the-leaderboard language model, or I want to play with this new agentic flow. You can spend all of your time doing that, but it's still going to give you garbage if you give it garbage. And so that adage of garbage in, garbage out is still totally true. But once you've got that foundation,

Speaker 26:17 then there are some great steps that we spend a lot of time thinking about. How do I extract the raw data, raw text, image, video, whatever, out of that payload? How do I embed this? We had some serious challenges around just simple embedding of this data at massive scale.

Speaker 26:40 How can I rerank this data to ensure that I'm getting the best blend of precision at K and recall across my data estate to help answer this question? And then at the very end, say, well, what's going to be the best language model for this? But there's a lot to get to before you get to play with the LLM on the other end.
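For readers unfamiliar with the two retrieval metrics named here, a small Python sketch may help. It assumes a ranked list of document IDs from a retriever and a hand-labelled set of relevant IDs; the function names and the sample IDs are illustrative only. Precision at K is computed with K as the denominator, which is one common convention.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical retriever output, best match first, and the labelled ground truth.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p_at_3 = precision_at_k(ranked, relevant, 3)  # 1 of the top 3 is relevant
r_at_3 = recall_at_k(ranked, relevant, 3)     # 1 of 3 relevant docs was found
p_at_5 = precision_at_k(ranked, relevant, 5)  # 2 of the top 5 are relevant
r_at_5 = recall_at_k(ranked, relevant, 5)     # 2 of 3 relevant docs were found
```

Raising K from 3 to 5 lifts recall while precision moves only slightly in this toy case; a reranker aims to push relevant documents toward the top so that both stay high at small K.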

Conor Bronsdon 27:03 I feel like one of the themes of this show, and anyone who's listened to a few episodes will probably notice this, is this discussion of the magic bullet versus the actual infrastructure work that needs to be done. Yes. Because usually, AI is marketed as this magic bullet that's gonna solve your problems. You just apply it and bang, you're off to the races. And there's a little bit of that. There is some magic there. No question. But anyone who's played D&D knows that if you're a wizard trying to harness the power of the universe,

Conor Bronsdon 27:34 you have a lot of studying to do. You've got a lot of work to do. And if you just decide to go the sorcerer route, or make some pact with some entity as a warlock or something like that, that comes with risks. And to apply this to AI here, if I think, hey, let's just throw an LLM at it and things will figure themselves out, there are risks to that, and you are not gonna get the accuracy you want all the time. You need to test. You need to evaluate. You need to understand your data pipelines. You need to have the infrastructure in place and do the work in order to get that fully constructed wizard spell that you want.

Conor Bronsdon 28:06 And then, yes, it is magic, and you can apply it, but it comes with all the upside and all the downside if you don't actually think through your approach to data. And so I appreciate you talking about how Cohesity and you have thought through this approach, because so often enterprises have data spread across various sources and various formats, and unifying that access, or providing a consistent way to query and retrieve relevant information for AI regardless of where or how it was originally stored, I can imagine is quite challenging.

Speaker 28:38 It really is. No, I was chuckling as you were saying that because, yeah, I think, me being in the industry on the vendor side, we have to take a lot of the blame here. But our enterprise customers are starting to believe that it's magic. And it's funny, whenever I give a talk, I always ask the audience, and it's mostly IT folks, though now we're getting some of the data folks. I'll say,

Speaker 29:03 can anybody give me a simple definition of artificial intelligence, machine learning, or generative AI? Just the simplest. And then it's usually dead silence. I'm like, hey, it's okay. Just tell me what you think it is. And people will try to give me these complex definitions, trying to describe neural nets or deep learning in their own words. I was like, no, no, no, simpler.

Speaker 29:27 I was like, it's not magic. It's statistics at scale. It's all math. It's using information from the past to predict a future event. And that future event could be the next most probable token or word in a sequence, or now tokens, since we're doing multiple-token prediction. Or it can be a forecasting model, linear regression. Linear regression still works great today for certain tasks. And I think we're in a world now where people are also throwing LLMs at everything.
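To make "using information from the past to predict a future event" concrete, here is a small Python sketch of single-feature linear regression fitted by ordinary least squares. The data (monthly backup volume in terabytes) and the names `fit_line`, `months`, and `backup_tb` are invented for illustration; nothing here comes from Cohesity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented history: backup volume grows by 2 TB per month.
months = [1, 2, 3, 4, 5]
backup_tb = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, backup_tb)

# Use the fitted line to forecast the next, unseen month.
forecast_month_6 = intercept + slope * 6
```

For a steady trend like this one, a few lines of arithmetic give a usable forecast without any language model involved, which is the kind of task the remark about linear regression points to.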

Speaker 29:57 They say, you know, I'm just going to put an LLM on this data. It'll figure it out. Or I'm going to put an LLM in my life sciences work, or in the hands of my doctors, and it'll figure it out. And there are a couple of major problems with this too. We had talked about the data problem on the other side, but there are probably certain problems that you don't want to use an LLM for, or sometimes you don't even want to use AI for. And you have to understand, as a product owner in the enterprise or

Speaker 30:30 creating products, you have to understand the end use and what a sufficient F1 score is going to be. Is it okay that it's potentially going to get it wrong 15% of the time? Or do you want it 2% of the time? And so being able to disclose those evaluation scores, as well as understand the use case and what's going to be acceptable, is kind of also something that we're not necessarily talking about.

Conor Bronsdon 30:54 It's a great point to take that product owner mindset of what do we actually need to deliver here, because it can really vary. What's acceptable on an internal document retrieval use case is extremely different from what a financial services provider needs in order to put a customer service bot into action.
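The point about a "sufficient" F1 score and a use-case-specific error budget can be sketched in a few lines of Python. The counts and the helper `within_error_budget` are made up for illustration; the 15% and 2% thresholds are the figures mentioned in the conversation, applied here to two hypothetical use cases.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def within_error_budget(wrong_answers, total_answers, max_error_rate):
    """True when the observed error rate fits the budget for this use case."""
    return wrong_answers / total_answers <= max_error_rate


# Invented evaluation counts for one system.
f1 = f1_score(true_pos=80, false_pos=10, false_neg=20)

# The same system, wrong on 8 of 100 answers, judged against two budgets.
ok_for_internal_search = within_error_budget(8, 100, max_error_rate=0.15)
ok_for_customer_bot = within_error_budget(8, 100, max_error_rate=0.02)
```

The same measured quality passes one bar and fails the other, so the threshold is a product decision that has to be made, and disclosed, per use case.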

Conor Bronsdon 31:17 And, you know, we've obviously worked on a lot of this at Galileo and have dealt with it with a lot of our enterprise customers. It's a very complicated, customizable, deep field. And I know that you're dealing with that with a lot of your customers. And I've heard you critique this common approach of using one model to evaluate another as "turtles all the way down." LLM as a judge, I think, has a lot of pros, but there are definitely some cons.

Conor Bronsdon 31:46 Why do you view this approach as insufficient, especially for high stakes enterprise use cases?

Speaker 31:51 I think it's tough. Although, you know, I've been thinking a lot about this, because I say it a lot: it's turtles all the way down, you should never do this. I don't know if you should never do this. I think it goes back to what I was talking about with getting evaluations for this. Do we have proper evaluations on the evaluator model? How do we know that it's giving us a good result? There may be some use cases where this is completely valid, but I think we initially started doing this pretty heavily, if I look at the birth of some of the open source projects around evaluations

Speaker 32:22 and using an LLM, because it was hard. The cognitive load on humans to evaluate five pages of text is a lot, especially when you look at the opposite end of the spectrum. We collect eval metrics to help retrain AI models all the time: either I put a shirt in my shopping cart or I don't. I buy something or I don't. I give a thumbs up or I don't. That's a very low cognitive load

Speaker 32:53 that needs to be applied to get good feedback or good evaluations. But if I word-vomit up a 10,000-word response to a simple question, it takes an expert in that area. It takes somebody reading through it. And an LLM might be good at it, but I think it's also disingenuous for us to throw away twenty years of NLP evaluations.

Speaker 33:18 There's probably a good mix of both. And I think there are uses for LLMs in this. Think about it. LLMs are great at summarization and first-draft generation. And if I start to use the LLM for what it's good at in these pipelines, you start to get some great results. Recently, I had this notion or idea, and I'm sure it's been done out there,

Speaker 33:45 but a lot of people are looking at evals with LLMs or RAG and saying the answer is either wholly good or bad. And again, that's because it's an easy path, and as humans we're really lazy. It's simple. Yeah. Yeah, it's simple. But when we're having a conversation, an individual will question a single statement or a single thought. Humans, we decompose

Speaker 34:13 whatever is being said to us into facts or claims or assertions. And we try to evaluate internally: do I believe this? Do I have evidence for this? So I've been doing a lot of work taking the generations of an LLM and decomposing them, using an LLM, into claims, because, again, identifying parts of speech or phrases and extracting them is an NLP task that LLMs can do pretty well. And then I can use some other models to classify

Speaker 34:44 a claim or assertion along with the retrieved context to say, hey, is this valid? Do I support this? Do I not support this, or do I not have enough information? So I think there are great uses for LLMs in evals,

Conor Bronsdon 34:57 but it can't be exclusively LLMs. It can't be exclusively,

Speaker 35:01 and it can't be the model that you use to generate the response.
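The claim-level evaluation described in this exchange (decompose a response into claims, then label each one against the retrieved context as supported, not supported, or lacking information) can be outlined in Python. This is only a toy sketch: a regex sentence splitter stands in for the LLM that would do the decomposition, and a word-overlap heuristic with a crude negation check stands in for a separate classifier model. All names, the threshold, and the sample text are assumptions for illustration.

```python
import re

NEGATIONS = {"not", "no", "never"}


def tokens(text):
    """Lowercase word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_claims(response):
    """Stand-in for LLM decomposition: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def classify_claim(claim, context_sentences, threshold=0.7):
    """Stand-in for a classifier model: overlap plus a naive negation check."""
    claim_tokens = set(tokens(claim))
    content = claim_tokens - NEGATIONS
    if not content:
        return "not_enough_info"
    claim_negated = bool(claim_tokens & NEGATIONS)
    for sentence in context_sentences:
        sentence_tokens = set(tokens(sentence))
        overlap = len(content & sentence_tokens) / len(content)
        if overlap >= threshold:
            sentence_negated = bool(sentence_tokens & NEGATIONS)
            return "supported" if sentence_negated == claim_negated else "not_supported"
    return "not_enough_info"


def evaluate_response(response, context_sentences):
    """Label every claim separately instead of scoring the answer as a whole."""
    return [(c, classify_claim(c, context_sentences)) for c in split_claims(response)]


context = [
    "Backups run every night at 2 am.",
    "Snapshots are not deleted automatically.",
]
answer = "Backups run every night. Snapshots are deleted automatically. The cluster has 12 nodes."
labels = [label for _, label in evaluate_response(answer, context)]
```

Scoring per claim shows which statement in a long answer is unsupported, and the classifier here is deliberately a different component from whatever generated the answer, in line with the point that the judge should not be the generating model.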

Conor Bronsdon 35:04 Oh, yeah. I see this a lot. Where I think a lot of folks are having success when we work with enterprise customers today is, hey, we're applying multiple judges that are different LLMs, or maybe weighting the responses, and then using those to flag things that you then go have humans look at. Because, yeah, at a certain scale, we can't have humans look at everything.
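The multi-judge pattern mentioned here (several different judges, weighted, with low or conflicting scores routed to a person) can be sketched in Python. The judge names, weights, scores, and thresholds below are all invented; this is an outline of the idea, not a description of Galileo's platform.

```python
def weighted_score(judge_scores, weights):
    """Weighted average of per-judge scores, each score in the range 0 to 1."""
    total_weight = sum(weights[name] for name in judge_scores)
    weighted_sum = sum(score * weights[name] for name, score in judge_scores.items())
    return weighted_sum / total_weight


def needs_human_review(judge_scores, weights, min_score=0.7, max_spread=0.3):
    """Flag a response when judges score it low or disagree with each other."""
    spread = max(judge_scores.values()) - min(judge_scores.values())
    return weighted_score(judge_scores, weights) < min_score or spread > max_spread


# Hypothetical judges backed by different models, trusted to different degrees.
weights = {"judge_a": 0.5, "judge_b": 0.3, "judge_c": 0.2}

agreed_good = {"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.9}
disagreement = {"judge_a": 0.9, "judge_b": 0.3, "judge_c": 0.8}
agreed_poor = {"judge_a": 0.5, "judge_b": 0.5, "judge_c": 0.5}

review_queue = [
    name
    for name, scores in [
        ("agreed_good", agreed_good),
        ("disagreement", disagreement),
        ("agreed_poor", agreed_poor),
    ]
    if needs_human_review(scores, weights)
]
```

Only the flagged responses go to people, which keeps human validation in the loop without asking anyone to read everything.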

Conor Bronsdon 35:24 That's just where we are with automation today. But, like, it doesn't mean humans shouldn't be involved in the process and that human validation shouldn't occur. We call it continuous learning through human feedback, which is kind of how we've integrated in our platform. Yep. And that is

Speaker 35:38 huge. Being able to get human feedback into the process. And you're right, humans are not going to be able to get to everything. It's the same with synthetic data generation. If you start with something that humans have created, it's a good foundation. If you then, you know, randomize your distribution of responses and start spot checking.

Conor Bronsdon 35:57 And that can be reinforced into the model to then say, oh, like, let's improve how we approach this. Yeah. And being able to code in reward functions for correctly identifying

Speaker 36:07 these are great practices. But I've seen some people say, well, I used GPT-four or five to generate it, so I'll just use GPT-four or five to say, is it good or not? And that's just the inmates running the asylum. It's like walking into a room of thieves as a cop and asking, "Are you a thief?" The thief's gonna say, "No, I'm not a thief. Of course I'm not."

Conor Bronsdon 36:30 One of my favorite prompting techniques, I don't even know if you'd call it a technique, is, if I'm just writing some stupid copy and I use GPT-4 for it first, to go to Claude and be like, "GPT-4 came up with this. And, honestly, I think it's kinda crap. I really need you to make it punchier." And Claude's gonna be like, "Yeah. I'll do it. I got you." And I think it's the same comparison, where you want to

Conor Bronsdon 36:55 almost pit, I don't want to say pit them, but it's adversarial.

Speaker 36:59 Yeah. No, it's somewhat adversarial. And I think that using adversarial techniques is becoming more and more popular in this, especially if you're going to then use reinforcement learning to retrain. You don't want to overfit towards, you know, an OpenAI bias, because then everything it says is going to be perfect. You're going to have F1 scores of one across the board, and you know that's not going to be right.

Conor Bronsdon 37:27 Yeah. And I think, honestly, we maybe don't make these comparisons enough between how you think of fully human organizations versus human plus what I think of as async digital employees, a.k.a. LLMs. Yeah. Yeah. Yeah. And, I mean, it's the same way if I have a coding intern who I'm working with as it would be using Cursor to vibe code: hey, they're very enthusiastic. They'll go do things. You have to check their work. You don't have time to check all their work, but you have to get the feedback. You have to let them learn. You have to give them rules and guardrails to approach it.

Conor Bronsdon 38:05 And this can change depending on the sophistication of different AI systems: how much feedback they need, how much can be handled by other async AI employees. Just like you would for training any team member, you can't ignore that work and just assume that the team member has come to you fully trained, fully coached. And I think that folks who are

Conor Bronsdon 38:30 great managers of people are going to have skill sets that are valuable for managing LLM systems. At least that's my opinion.

Speaker 38:40 I was literally just going to say, I think all of us have had experiences with great managers and terrible managers. And if you think of it the same way, and for those of us who are people leaders out there, there are times where we feel like we're doing really, really well. And a lot of that is feedback. It's being able to look at the work and provide great feedback or areas for improvement. But it's our job to be able to inspect,

Speaker 39:13 flag if something doesn't look right and then help coach and guide people or autonomous

Conor Bronsdon 39:21 async employees. I kind of like that, AI async employee. That's how I've been thinking about it in my head, because I was trying to frame what an agent is to me when it's accurately and well applied. And to me, it's probably a junior employee. I know there are some folks who are getting really incredible results, and they're saying, oh, this is PhD-level research happening here. And in some cases, I think they're getting specific results, but I don't think it's consistent all the time yet. There's a lot of hype around agentic AI systems. And while there's a ton of potential,

Conor Bronsdon 39:54 I know a lot of us are expressing caution about teams jumping in too quickly without robust evaluation, without robust testing, without observability. Yeah. Because it does feel like you're letting a horde of junior employees loose, and they can do a lot of great things, and they can also make mistakes that are crucial.

Speaker 40:11 Yeah. I mean, we're still working on trying to get evals right for a simple RAG pipeline. And now you open it up to an agent that could have access to fifteen, twenty different tools that it's calling. Get it right early and then you're going to be able to. I mean, it's just like we were talking about with data. It's garbage in, garbage out. You can build these really complex and beautiful agentic systems.

Speaker 40:35 But if you can't trust it, how useful is it?

Conor Bronsdon 40:38 I feel like I have to shout out Galileo's own agentic evaluations and the reliability platform we're building for agents today and say, hey, check it out at galileo.ai. You can sign up to try it out. Yeah. There's more information on there. I won't spend too much time on it. But, yeah, there are a lot of foundational steps, to your point, around data preparation and data lineage that need to occur.

Conor Bronsdon 40:59 And as we come to the close of this conversation, I'd love to get your thoughts, Greg, on where you see the needs for data professionals and engineers in the next six months to a year. What do you think they should be focusing on? Is it data lineage? Is it evaluations? Is it the rigor they're applying? Where should their heads be?

Speaker 41:22 Yes. Everywhere and everything all at once. No, but in all seriousness, I think there are kind of two different camps. And I'm very fortunate to have been able to take the career path that I have, and to have had the trust of my leadership to let me go explore. But I think it's for one very specific reason. We'll get to that in a second. But I think if you're sitting there today as a data professional or an AI/ML professional,

Speaker 41:54 it's being able to have a better grasp of the data. Demand more from yourselves, your organization, and the industry to help ease access to the correct data more rapidly. A lot of these folks want to increase performance in their models. You need more data; more data equals more better. But it's really hard to find that data today. So I think as an industry, we need to step that up. But I think on the other side of the coin,

Speaker 42:26 and this is the tough thing, I think, for a lot of us to start grappling with, is that you're building a tool to solve a problem. Try to understand the problem that you're solving for. If you can, spend time in that particular industry or that particular job role or function, whether it be actually doing the job, interviewing tons of people that work in that function, or, I think more importantly, bringing those people into the fold so you both can learn from each other on this.

Speaker 43:00 I think if we can get more people thinking about the business problem that they're solving for, we're going to really rapidly increase the pace of innovation and problem solving.

Conor Bronsdon 43:10 I love that. I think a great indicator of empowered product and R&D teams everywhere is that they think obsessively about the problem they're solving. They talk to their users. They talk to their customers, and they bring that feedback in. And absolutely, we have to apply it to AI. And I think it's really easy when we're using this incredible technology to just get excited, as you said, like, oh, I wanna just build the newest thing. Let's try the newest one. And there's nothing wrong with doing some of that, but to really get into production with customers takes effort, takes work.

Conor Bronsdon 43:39 And I definitely recommend everyone who is listening to check out Greg's LinkedIn for incredible insights he shares and cohesity.com for everything Cohesity does. There's so much opportunity in the AI data space, and Cohesity is definitely leading the way. Greg, thank you so much for joining me today. It's been a ton of fun. Thank you so much. It's been a blast chatting with you.

Conor Bronsdon 43:59 Yeah. Absolutely agreed. I feel like we could go on for so much longer, especially once we start getting into soccer, though I really don't wanna talk about how good LAFC is right now because my Sounders are getting killed. So we're gonna actually skip that one and hold out for another episode where, hopefully, we'll have a win to show for it. For another MLS after dark conversation.

Conor Bronsdon 44:19 God. Yeah. After dark indeed, four zero.

Speaker 44:23 Oh, it's lovely.

Conor Bronsdon 44:25 It's not my opinion of it. Are you bullying the host? Dang it.

Speaker 44:31 Oh, Conor, you made it too easy for me. It could've gone either way, but it didn't. Yeah. Yeah. We'll go there. We'll go back.

Conor Bronsdon 44:39 Anyways, Greg, thank you so much, man. This is a ton of fun. We'll link everything you've talked about in the show notes. Any parting words for our audience? Always be curious.

Speaker 44:49 You know, I think always ask yourself, well, how does it do that? Why does it do that? Should it do that? Can it do something else? Being curious is what life's all about.

Conor Bronsdon 45:02 I love that. That is a fantastic approach to life, and I can very much see it in your career and your mindset and how you've explored. It makes things a lot more fun, and it gives you a lot of opportunities. So thank you so much, Greg, for sharing your insights and wisdom with us. And that's all for this episode of Chain of Thought, everyone. Don't forget to subscribe wherever you are listening, wherever you get your podcasts. We're on YouTube as well. And you can check out our YouTube for so many more deep dives into building with AI,

Conor Bronsdon 45:32 our webinars, and much more. So be sure to subscribe. Greg, thanks again. It's been a ton of fun. Thanks, Conor.