Agentic infra is the problem you're probably not thinking about | Human in the Loop Episode 14
By Scale AI
Summary
## Key takeaways

- **Scaling agents outpaces building them**: Scaling agents is harder than building them because increasing traffic and scope explodes the surface area, exposing unknown problems that, unlike in traditional ML, haven't been smoothed over by millions of GPU hours of training. [33:50], [35:12]
- **95% of GenAI pilots fail, partly due to hype**: MIT found that 95% of GenAI pilots fail in enterprises, but much of this stems from a large denominator of cheap experiments and a delayed response to production work; enterprises need years to connect data sources and ship even basic chatbots. [00:42], [01:39]
- **AI work slop creates a cleanup burden**: AI-generated content that looks good but lacks substance creates an illusion of progress, forcing colleagues to clean up long, useless documents instead of saving time. [06:28], [06:44]
- **Context rot misdirects agents**: Context rot can be a bigger problem than hallucinations because feeding agents poor or excessive data actively guides them away from the desired output, as when an internal index yields worse results than a web search. [36:46], [38:29]
- **Agentic infra glues messy problems together**: Agentic infrastructure acts as the glue between GenAI, humans, classic ML, and databases, solving messy real-world problems like reducing churn by dividing tasks and allocating each piece to the component optimized for it. [23:02], [24:35]
- **Capture behavioral data from agents**: How users interact with agents reveals uncaptured processes and procedures that make a firm unique, such as how a bank trains its employees; storing that data makes AI more powerful. [26:20], [27:26]
Topics Covered
- Why do 95% of GenAI pilots fail in enterprises?
- Does AI work slop create an illusion of progress?
- Does context rot plague multi-agent systems?
- Can glue infrastructure augment humans for hard problems?
- Does behavioral data unlock tacit enterprise knowledge?
Full Transcript
Scaling agents is harder than building them.
Do you agree or disagree?
Three, two, one, vote.
>> Ooh, a little mixed bag.
>> All right. Why, Alger? You were the odd one out.
What do I think? Building is really hard.
>> Welcome to Human in the Loop, a series where we discuss what it takes to build and deploy real world systems for the enterprise.
Hey everyone, welcome back to Human in the Loop. I'm Felix Sue.
>> I'm Aler Lare.
>> And I'm Mir Panda.
>> Today we're going to be talking about the state of agents and the challenges that come with them. But first, let's check in with the news.
>> Did you guys see that MIT found that 95% of GenAI pilots fail in enterprises?
Is that a big deal for enterprises?
What do you think?
>> I saw that. It was a little bit of a clickbaity headline, but I feel like that study has an interesting common-denominator effect: it's so easy to try a bunch of new things that you end up with a large denominator, and because you're trying a bunch of new things, some of them may fail and some may not. That's my take on that piece. Realistically speaking, with enterprises I like to think about things in phases. It takes a long time for an enterprise to get everything together and release things to production, so you're really seeing a bit of a delayed response to some of the initial pilots you're trying: connecting things to your data, trying them out.
So a lot of the things that you're going to release after two years of work are going to be "let's try connecting to this data source," "let's try building a chatbot," right?
They're not going to be your most creative, inspired, and informed ideas on what's next, the ones you only get after you've learned through trial. So I think we're pretty early.
I think in another year or two you're going to see a lot of agents come out. We're going to talk about that.
That's going to be more impactful.
But for now, yeah, that study definitely has a lot of common-denominator problems. I think the thing that's interesting to me there is that people thought GenAI was going to give you this amazing value overnight, and that part doesn't seem to be true. It's something I see in various use cases, because it's not like models are getting amazing at the very custom, specific things that matter for specific companies or organizations.
There is some extra layer that needs to exist that bridges the gap between where the models are getting better and what people actually need.
>> Yeah.
>> And I think there are also operational changes that need to happen to bridge the gap, in addition to the technology just being useful, and that takes longer to achieve.
>> It's non-trivial.
>> Yeah. And if I may ask, I think both of you are leading our forward-deployed efforts now.
I think hopefully it's easy to see now that some of these things are harder than they seem. I think customers have this assumption that it just works.
It's magic: if I can answer, then the AI can answer, right?
But there's a ton of wiring and plumbing that needs to go on underneath. What has your experience been so far?
What have your engineers been telling you is hard about these problems?
>> Yeah, I think because the open-source technology has evolved so quickly and there are so many compelling demos, the FOMO effect is real: it feels like there's something out there that we should be building, and people can easily build prototypes. But where we've seen difficulty is trying to get them to production.
And when it comes to getting things to production, it's really about getting your prototype to work with your production data systems or your production compute and runtime environments.
And those parts are complex and bespoke for each organization.
And organizations don't really hire for those specific things.
That's not their core competency.
So that's exactly where a lot of our engineers are spending their time: coming in and bringing these prototypes to production.
>> There's also a big gap between a prototype delivering on functionality and a production system being reliable enough for an enterprise solution.
I think the reliability piece is maybe what we're going to get into in a little bit, but the oversight, the management, and building agentic systems that are deterministic enough to solve real enterprise use cases is kind of the hard part.
>> Yeah. And I think there's actually a technical nuance here, which is that when you try a generic LLM, a lot of the difficulties feel smoothed out, because what happens is they do training over large amounts of data, right?
It's smoothed out because you have thousands, probably millions, of GPU hours being spent, and a ton of data, to smooth those edges out.
So you have an expectation that it's super smooth. But when you connect it to live data that hasn't been trained on, it's not going to be that smooth, because you're doing live adjustments, and so there are gaps in your knowledge. We did an engagement with one customer, and the last-mile delivery was all about "oh, I thought you had access to this data. Why is it not saying that?"
"Why is ChatGPT able to do it on last year's articles?"
Well, this year's articles are being fetched with some queries.
Maybe the query has some gaps in it, or maybe there's nondeterminism in writing the query, and so you encounter these things that you expect to be smooth, but in reality connecting live data has gaps in it, right? And it's our job to make it smooth, but that last-mile delivery is tough: educating people on the ways you need to do things properly in order to get to a production level.
So I think that scares a lot of people, which is why this 95% of pilots failing is really that feeling at the end that you get a little panicky that it's not working the way you expect, and if you don't grit and grind and try to solve that problem, a lot of people exit out and say maybe it's not worth doing at all.
So, in response to this study, BetterUp Labs and the Stanford Social Media Lab released a study on something they're terming AI work slop, which is AI-generated content that looks good but lacks substance.
And so the issue is that it creates this illusion of progress.
So rather than saving time, which is what a lot of our GenAI solutions are supposed to do, it just creates more work for colleagues to clean up these large documents and this content that's not actually useful.
Is this something you've seen?
Is this something we should be concerned about?
What are your takes on this?
>> That's so interesting. But yeah, go ahead.
>> I was going to say, I think this is something that we are actively trying to fight right now. The interesting piece about this article is that if you think about what is actually consumable by an AI system, it's not 100 pages of documents.
You actually want to think about compressing the information that you're sending to an AI system into something that can fit into a reasonable context for that system to consume.
That's probably the same thing that humans need.
And for some reason we've just moved in the opposite direction.
If we actually want these systems to be really effective, we need to compress the information a little bit more, not explode it into a bunch of generic nonsense.
>> Yeah. I think it's really easy to use AI to summarize things and hand them off, but what happens in that situation is that everyone just plays a lot of telephone.
There's a lot of telephone and compression, telephone and compression, and what happens is you lose a lot of the meat of what you're trying to move towards, right?
Because you're just doing a lot of synthesis.
And when you synthesize something meaningful into something else, and then someone synthesizes from that, and someone else from that, you lose a lot of that substance.
And so I think it's really, really important for us, and we're going to talk about this later, to always check back in with humans.
You need to reduce your error rates.
You can't just have AI talk to AI and synthesize AI stuff.
A lot of that becomes jumbled-up garbage.
So realistically, I think you have to build AI that's very targeted in what it's trying to solve, and not use it for expansion.
I think expansion is noise and reduction is, as was just said, compressing into something meaningful. So a lot of the tools that people use effectively work this way. Take code generation, for example: yes, there's expansion in generating code, and then there's compression in making sure it passes your tests and passes your PR reviews, right? You need a check-in point where you can say, "this is useful," and not just keep generating things.
>> I think that's why it's called slop: it's just generated stuff that's completely unchecked, unverified, and just put out in the open.
And I think when it comes to actually using these systems for load-bearing, real use cases, you need that feedback loop to make sure that what is being output by your AI systems is verifiable and actually useful.
>> Yeah. A good example of slop is, let's say one of our engineers or product managers says, "oh, let me just generate the PRD." The problem is that the AI goes and makes unfounded assumptions about how to produce a reasonable design, not based on that product manager's or software engineer's understanding of what you're trying to do.
There are 10,000 ways to solve a problem, but you have to solve it based on what makes sense for your business, what makes sense in the context of all the other technology that you've built.
Maybe you say, "we're actually going to make a suboptimal decision because we don't want to cause a migration problem," or something like that. That context lost in translation is what causes slop issues.
Yeah.
>> Because the reviewers don't know it was AI generated and not based on real information from people. So you just have to be a little bit careful.
And having worked with this a lot, I actually think I can tell when something is AI generated. In the early days when I was working at Scale, I made this kind of dumb comment: "aren't these AI things basically Turing complete now?" Sorry, "don't they pass a Turing test?" And I got this kind of snarky answer: "oh no, it's not."
And I realized this person, who had been at Scale before me, had seen way more AI responses than I had.
And now that I've seen that amount, I'm like, "wow."
Yes, you really can tell: this is Gemini, this is OpenAI. You can just read it.
>> And so honestly, having that filter is also helpful in determining whether something is slop or not.
>> Great. Well, that's all for the news.
Let's go ahead and get into the main topic. Felix, what are we talking about today?
>> Yeah, so we wanted to talk a lot about agents, because I think that's the direction a lot of people are moving towards.
I remember Jason, our CEO, said he was talking to a customer about connecting to data and all this sort of stuff, and they were tuning in and out for about 20 minutes. Then he said, "yeah, we built these agentic things," and they were like, "agents? Why didn't you just say that from the beginning? Agents are what we're interested in." There's kind of a big gap between context engineering, which normal engineers understand, and agentic engineering, which is more of an applied AI or ML focus.
Right. So I'm curious what your take is on the current state of agents.
There are some things like context engineering and data practices that we want to talk about. So maybe, Alger, we start with context engineering and context rot.
What is your opinion on some of the problems that we face today with this?
>> Yeah. I think this is kind of what we've been speaking through already, but context engineering seems to be the hot problem. And I don't think it's really a solved problem right now.
It is already difficult to context-engineer a prompt for a single agent executing on a single task.
I think where it gets really challenging is multi-agent systems, where you're trying to figure out what the correct context is to give each of the sub-agents so that they have enough context about how the system is working as a whole, but are also able to stay focused on the individual task they're executing on.
Yeah, I don't have an answer here, but I'm curious if we have any opinions there.
>> Yeah, I guess the reason context engineering exists is because we know that agents don't perform well with infinite information,
and so you have to figure out what information to give the agent and how to package it so that the agent can give you a meaningful output.
And that part is context engineering, and that's the challenge, because figuring out what the output needs to be, and what good-quality output is, is so subjective. Even in our GenAI business, when we work with the frontier labs, the first question we ask them is "what does quality mean for you?" We come to a shared definition of quality, and everything else about generating data is downstream of that. So to me, context engineering is a problem that has to exist, because these systems by definition cannot handle infinite information.
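To make the packaging idea concrete, here's a minimal sketch of context packing under a fixed token budget. It's an illustration rather than anyone's actual implementation; the relevance scores, the rough four-characters-per-token estimate, and the budget are all assumptions.

```python
# Minimal sketch of context packing: rank candidate snippets and fit them into
# a fixed token budget before calling the model. The relevance scores and the
# 4-characters-per-token estimate are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def pack_context(snippets: list[dict], budget_tokens: int = 4000) -> str:
    """Select the highest-relevance snippets that fit within the token budget."""
    ranked = sorted(snippets, key=lambda s: s["relevance"], reverse=True)
    packed, used = [], 0
    for snippet in ranked:
        cost = estimate_tokens(snippet["text"])
        if used + cost > budget_tokens:
            continue  # skip anything that would blow the budget
        packed.append(snippet["text"])
        used += cost
    return "\n\n".join(packed)

# Example: three candidate snippets with hypothetical relevance scores.
context = pack_context([
    {"text": "Q3 churn rose 4% in the enterprise tier.", "relevance": 0.92},
    {"text": "Office relocation announcement from 2019.", "relevance": 0.11},
    {"text": "Support ticket volume doubled after the pricing change.", "relevance": 0.78},
])
```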
>> Yeah. I think it's actually pretty interesting: when you try to do context engineering, you create problems for yourself too. Take multi-agent systems, which you mentioned. The reason you build a multi-agent system in most scenarios is because you're trying to solve a context problem; otherwise you would just give everything to one agent, and if that agent did deep research and so on, you'd ask, "why don't I just have one agent do the whole thing?" But maybe that research is just one step of a larger workflow with a bunch of research steps, where some research needs to be discarded and some needs to be considered, so maybe you should have a higher-level orchestrator. But now you have a new problem. Say I have one orchestrator and, to give a concrete example, two deep research agents: one does research about one thing, one does research about another. You hand off a message to each of them, and they go off and start working independently. But because they are agentic, they start going a little past the boundary of what they were asked. They can now overlap in what they're doing, but they don't know about each other, right? So they overlap, and they both send overlapping information back to the orchestrator.
Now the orchestrator gets confused, right?
And so I'm curious how you guys think about these types of problems. What's Scale's approach to solving these things?
>> I think it's totally an unsolved problem, and it's usually solved in a customized way for each use case. Though with that specific example, at least, if you have two agents doing deep research, that's still a read operation.
They're both just reading in parallel.
And yes, they could be reading the same information and maybe giving overlapping output, but the risk is much lower compared to two agents that are writing, where there's some dependency between what one agent is writing and what another agent is writing. I think that's where it starts to get complicated:
you need to think more about whether there's some oversight or orchestration layer that they need to talk back to and agree with before they can execute on those actions.
Yeah, that part is where I think it gets complicated.
>> Yeah. I think at least for deep research, or agents that run for a longer time, the common architecture I'm seeing comes back to first principles: an agent needs to look at the problem, understand it, make a plan for how it's going to tackle it, and then execute the plan. As the plan is executed it learns more information, keeps track of it, maintains a to-do list of what it needs to do next, and continues until it finally has an answer that satisfies the initial request. To do all those things, you also need some infra piece that you can connect to, which you can treat as a scratch space: where you can maintain and update your to-do list, maintain and update your plan, and share your plan with other agents so that they know what work you have done so far. That infra layer becomes very critical as the use case becomes more horizontal.
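A minimal sketch of that plan, execute, and update loop, with the to-do list and notes kept in a scratch space. Here the scratch space is just an in-memory dict, and `make_plan` and `run_step` are stand-ins for model and tool calls; in a real system the scratch space would live in the shared infra layer described above so other agents, or a resumed run, can read it.

```python
# Minimal sketch of the plan -> execute -> update loop described above.
# `make_plan` and `run_step` stand in for model and tool calls; the scratch
# space would normally live in shared storage rather than a local dict.

def make_plan(request: str) -> list[str]:
    # Stand-in planner: in a real system this would be a model call.
    return [f"research: {request}", f"synthesize findings for: {request}"]

def run_step(step: str, scratch: dict) -> str:
    # Stand-in executor: in a real system this would call tools or a model.
    return f"result of ({step})"

def run_agent(request: str) -> dict:
    scratch = {"request": request, "todo": make_plan(request), "done": [], "notes": []}
    while scratch["todo"]:
        step = scratch["todo"].pop(0)      # take the next pending task
        result = run_step(step, scratch)   # execute it
        scratch["notes"].append(result)    # record what was learned
        scratch["done"].append(step)       # keep an auditable trail
        # A real agent might append new steps to scratch["todo"] here,
        # based on what the result revealed.
    return scratch

state = run_agent("why did enterprise churn spike last quarter?")
```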
>> Yeah, it's funny: I think the article we're going to have to rediscuss in a couple of months is "94 or 95% of multi-agent systems are useless, and you should go back to a single agent." But this is a thing that happens: people overthink it, "oh, I tried this once and it just doesn't work," and that's just not how you develop. There are experimental aspects to it.
It's not easy. You have to design your way around it.
There's something pretty interesting about multi-agent systems where a lot of it is a product decision that has nothing to do with AI: if you have a multi-agent system, do you show the user everything that's happening in the system, or do you only show the user what's happening between them and the orchestrator? I always say there are two ways to do it: you show them everything, or you show them nothing of what's happening in the sub-agents. But say you show them everything, while I'm only able to talk to the orchestrator. If a sub-agent didn't talk back to the orchestrator or send it some information, then the orchestrator is going to be confused. It'll be like, "what are you talking about?" I'll say, "hey, you looked up this article,"
and it'll say, "I didn't look up this article," because the sub-agent did the lookup and never told it, right?
And so now it's confused. So then, what if I don't show the user all the information?
Then some things that are happening might get lost in translation.
It might look up some things and I'll be like, "how did you have access to that?"
and now I'm confused.
So there's so much product massaging, good AI practices, context engineering: you have to make it work.
You don't just try it. That's not the way these things work.
And that's why you see studies like this.
People try things and then they get panicked. This is like AI and ML five years ago. We kind of forget that traditional machine learning is just pure experimentation.
You get PhDs to go in a room for two years and work a research problem.
They run like 50 experiments.
They spend millions of dollars, and one of those paths works.
Do you know what I mean?
We kind of forgot that that's how it works, you know? So yeah, I think we don't want to overthink these things. It takes time, it takes effort, and you've got to get the right people.
>> Yeah. I think what's interesting about what you said is that it's never a one-size-fits-all solution.
You have to iterate.
You have to try it out.
And so that's where the pieces you build around measuring performance become more and more important, because the stronger and more robust your layer for evaluating your system is, the more things you can try reliably.
And I think that's why infra becomes such a critical piece for agents.
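A minimal sketch of what such an evaluation layer can look like: replay a fixed set of cases through the agent and score each output, so every change to prompts or orchestration is compared against the same baseline. The `agent` and `passes` functions here are placeholders, not a real harness.

```python
# Minimal sketch of an evaluation layer: replay a fixed set of cases through
# the agent and score each output, so every change to prompts or orchestration
# can be compared against the same baseline. `agent` and `passes` are placeholders.

def agent(prompt: str) -> str:
    return f"draft answer for: {prompt}"  # placeholder for the real system

def passes(output: str, expected_keyword: str) -> bool:
    return expected_keyword.lower() in output.lower()  # placeholder check

def evaluate(cases: list[dict]) -> float:
    results = [passes(agent(c["prompt"]), c["expect"]) for c in cases]
    return sum(results) / len(results)

cases = [
    {"prompt": "Summarize the churn report", "expect": "churn"},
    {"prompt": "List open anomalies in billing", "expect": "billing"},
]
print(f"pass rate: {evaluate(cases):.0%}")
```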
>> Yeah. I want to talk a little bit about that agentic infrastructure piece as well. Everybody at this company knows I'm very passionate about this subject, but my main thing is that the world's hardest problems take a lot of time. Deep research is not that big a problem for me; it's a big problem for agents, and we focus on it a lot, but realistically, "I need to look some documents up"?
Okay, that's a small problem for me.
A big problem for me is delivering on a customer account.
There are a lot of human handoffs there driving things forward.
There's a lot of dead time between people not understanding, say, the difference between two documents.
Where is this document? Where's that document?
There's so much stuff there. And if you have an agent driving that forward, I think that is super, super interesting.
And there are so many open-ended problems that are not being explored.
For example, we have a customer who wants to decide whether to go somewhere to do oil drilling, or whether to manufacture a certain drug, and what the process is to do that sort of stuff, right? It's a six-week-long process. There are a ton of handoffs.
So I think it's crucial for us to build software and infrastructure that orchestrate this sort of stuff, right?
I mean, today, if you just take OpenAI agents, download them, try them, and throw them on a server, that thing is not going to be able to orchestrate for long or anything; it will die as soon as it has to escalate to a human. What happens then? You have to save your whole state and revive it. There's so much complexity in that, and I think it's a super underappreciated area of software and tech development that people aren't really seeing. I'm curious what your thoughts are on that, and where you think Scale is going to take its business in the future.
>> I feel like the name of this podcast is kind of relevant here, because you really do need to think about humans in the loop as a first-class citizen when designing an agent. And I don't know that we need to fully replace all of these clunky handoffs between one team and another.
But there is something to think about: multiple people have different context, and you might need multiple people to do approvals for a particular action that an agent is taking, or there's some layer of human orchestration that needs to live on top of just the agentic orchestration that we probably need to think about as well.
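A minimal sketch of that save-state-and-resume pattern: when a step needs human approval, the run persists its state and stops, and a later run resumes from the checkpoint once the approval lands. The JSON file and the `needs_approval` rule are illustrative stand-ins for real workflow infrastructure.

```python
# Minimal sketch of escalate-to-human with checkpointing: when a step needs
# approval, persist the run state and stop; resume from the checkpoint once a
# human responds. The JSON file is a stand-in for real workflow infrastructure.
import json
from pathlib import Path

CHECKPOINT = Path("run_state.json")

def needs_approval(step: str) -> bool:
    return step.startswith("write:")  # assumption: only write actions need sign-off

def run(steps: list[str]) -> None:
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"next": 0}
    for i in range(state["next"], len(steps)):
        step = steps[i]
        if needs_approval(step) and not state.get(f"approved:{i}"):
            state["next"] = i
            CHECKPOINT.write_text(json.dumps(state))  # save state and hand off to a human
            print(f"waiting for human approval on: {step}")
            return
        print(f"executing: {step}")
    CHECKPOINT.unlink(missing_ok=True)                # run finished, clear state

run(["read: pull vendor statements", "write: flag clause 4 for legal review"])
```

Resuming is just re-running after a human, or an approval service, marks the pending step as approved in the checkpoint file.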
>> Yeah, I think the most important and difficult problems that you mentioned do not exist in isolation; they are going to be messy problems. They're going to have some component that requires the new shiny GenAI piece.
There's going to be something that will require a human in the loop.
There's going to be something that will require classic ML.
There will be something that will require a database.
And so the reason the infrastructure and the orchestration are so important is because that is the glue between all of these systems. And the more you invest in glue that allows seamless communication between these pieces, the more likely you are to solve real-world problems. Thinking about what kind of system should solve what kind of problem is the entry point for orchestration: you look at a messy problem, like "how do I reduce churn?" or "how do I find anomalies?", you divide it into smaller pieces, you allocate each piece to the component that is optimized to solve it, you take the results, and then you bring them back together. If you can build that glue across those various systems, then you're more and more likely to solve these problems. So that's how I think about these things.
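A minimal sketch of that glue: a messy task is split into sub-tasks, and each sub-task is dispatched to whichever component suits it, with the results coming back to one place. All four handlers are placeholders standing in for real ML models, LLM calls, databases, and human review queues.

```python
# Minimal sketch of the "glue" idea: break a messy problem into sub-tasks and
# route each one to the component best suited to it. All four handlers are
# placeholders standing in for real ML models, LLM calls, databases, and queues.

def classic_ml(task: str) -> str:   return f"[ml score] {task}"
def gen_ai(task: str) -> str:       return f"[llm draft] {task}"
def database(task: str) -> str:     return f"[sql result] {task}"
def human_queue(task: str) -> str:  return f"[queued for review] {task}"

ROUTES = {"score": classic_ml, "draft": gen_ai, "lookup": database, "review": human_queue}

def orchestrate(subtasks: list[tuple[str, str]]) -> list[str]:
    # Each sub-task is (kind, description); results come back to one place.
    return [ROUTES[kind](desc) for kind, desc in subtasks]

results = orchestrate([
    ("lookup", "accounts with usage drop > 30% this quarter"),
    ("score",  "churn probability for those accounts"),
    ("draft",  "tailored retention email per at-risk account"),
    ("review", "final approval of outbound emails"),
])
```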
>> Yeah, I like the way you put that.
That glue is so important, because if you don't have a really good orchestration layer that can reach into all these systems, which may take a long time, may pause for human escalation, and so on, then you're basically reducing yourself to "I ask this thing and it responds in 30 seconds." If we really want AI to be as transformational as it can be, how many things do you think we can solve with something that just responds in 30 seconds? The space of problems that solves is pretty limited. So the better the glue gets, the better the infrastructure gets, the more you can do and the more powerful you'll see AI become. One reason I believe in agents so much is that in a business, there are a few decisions that require creativity and opinion, and then a bunch that don't. I look at a document and think, "based on this paragraph, I should send it to this person," or vendor statements come in and I need to flag this, flag that, send it to a lawyer, send it here. These are pretty standardized operations, right? So if we can get agents in the loop on these standardized operations, and able to do harder and harder, longer-running things,
you will see AI enter another phase where people start doing really hard, business-critical things.
Even earlier this week, we were in a business review, and Jason was talking about humans in the loop, and I said, remember, I recommended we should only take deals that have humans in the loop, because those are the ones that are most business critical.
If you feel like you don't need a human in the loop, it's probably not important enough for you.
And so I think this is a huge opportunity. I think human-in-the-loop type things are going to be really critical for our business in the future.
And so, yeah, it's important that we invest in it.
>> Where does data fit into this conversation?
>> I'll give some opinions here.
>> Okay. So my opinion on data is that there are two major types, right?
One is enterprise data connected to LLMs.
That's kind of a solved thing.
I think we don't have to talk too much about that. Even enterprises that are late to adopt AI kind of know how to do it.
There's a more interesting kind of data, which is behavioral data.
Okay. So when I talk to enterprises, I always say there are two types of data: the data that you've stored, and the data that's in your head.
And the data that's in your head is more like what your company was built on. Why do consulting companies exist?
Because they built their brand on a way of doing something, and they created excellence from that, right?
But what is their process?
You don't know. Why is one bank different from another bank?
Because they train their employees a certain way. They have a certain procedure, right?
But nobody knows.
No one writes it down: "this is the procedure,
you follow this exactly," right?
So the way that we can get that data is to build AI agents that work alongside these people, and you get a lot of information about how the users use these agents. What kinds of decisions do they want to make? What are they unhappy about when the agents go off the beaten path of what they're comfortable with, working at their company, doing things their way? Storing that data is going to make those AI agents really, really powerful. So I feel like we definitely need to capture that hidden data, which is not captured yet.
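A minimal sketch of capturing that behavioral data: log each interaction with the agent (what was asked, what the agent did, how the user reacted) as a structured event that can later be mined for those unwritten procedures. The field names and the JSONL file are illustrative, not a prescribed schema.

```python
# Minimal sketch of capturing behavioral data: record each agent interaction as
# a structured event (request, action taken, user reaction) so the unwritten
# procedures can later be mined from real usage. Field names are illustrative.
import json
import time

EVENT_LOG = "agent_events.jsonl"

def log_event(user: str, request: str, agent_action: str, user_reaction: str) -> None:
    event = {
        "ts": time.time(),
        "user": user,
        "request": request,
        "agent_action": agent_action,
        "user_reaction": user_reaction,  # e.g. accepted, edited, overrode, rejected
    }
    with open(EVENT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

log_event(
    user="analyst_42",
    request="draft the quarterly risk memo",
    agent_action="generated draft from last quarter's template",
    user_reaction="edited: replaced template section 3 with desk-specific wording",
)
```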
>> And traditionally, when I think of data, I think of who the end user of that data is. In classic distributed systems, we've built mechanisms that allow us to handle an insane amount of data,
because ultimately you use that data to get some insights. And we've also built compute systems that help us understand and analyze that data.
And so what we've learned is that there are some use cases where the end user of data is computer systems, and some use cases where the end user of data is AI or humans.
And the challenge that we're seeing in a lot of these organizations is that they have obviously been maintaining these mountains and mountains of historical data.
But you need systems that are able to understand it, get insights, and then leverage them. So being able to marry these two use cases becomes super critical: you want to build classic computer systems that can get statistical information from large amounts of data, and then glue that statistical information with semantic information for a human to consume.
And that is the shape of the problem we are seeing over and over again in various real-world problems, as you mentioned. So building something that is able to marry traditional computer systems and new AI systems is, in my opinion, the next challenge. Yeah.
>> Yeah. There's a good example of this: take inventory management.
A good traditional machine learning, statistical problem is inflow and outflow optimization. Obviously you have a lot of data about what's going in and what's coming out.
So you can build traditional machine learning to optimize what to buy next, right? That's not a generative problem, right?
It would probably be worse if you made it one.
>> Okay, but what about outlier events: it's really hot today, or there's a heat wave coming next week, or there's a sports game coming? You're probably going to stock water, you're going to stock drinks, right?
So you want to have some overrides based on a semantic understanding of events that are happening, which is new information with a small sample size, so it doesn't fit well into traditional statistical machine learning models, but it still needs to fit into the system somehow, right? And that could be fit in as overrides. So I definitely think there's a lot of opportunity to augment traditional machine learning models and statistical things with the generative aspects of the new AI today.
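A minimal sketch of that pattern: a statistical baseline forecast adjusted by an override derived from semantic understanding of upcoming events. The baseline numbers and the event-to-multiplier rules are placeholders; in practice the baseline would come from a classic model and the multipliers from an LLM reading event descriptions.

```python
# Minimal sketch of layering semantic overrides on a statistical baseline:
# the baseline forecast would come from a classic model, and the multipliers
# would come from an LLM reading event descriptions. Both are placeholders here.

def baseline_forecast(sku: str) -> float:
    # Stand-in for a traditional statistical / ML demand forecast.
    return {"bottled_water": 1200.0, "soda": 800.0}.get(sku, 100.0)

def event_multiplier(sku: str, events: list[str]) -> float:
    # Stand-in for an LLM judging how upcoming events affect this SKU.
    multiplier = 1.0
    for event in events:
        if "heat wave" in event and sku == "bottled_water":
            multiplier *= 1.5
        if "sports game" in event and sku in ("bottled_water", "soda"):
            multiplier *= 1.3
    return multiplier

def adjusted_order(sku: str, events: list[str]) -> float:
    return baseline_forecast(sku) * event_multiplier(sku, events)

print(adjusted_order("bottled_water", ["heat wave expected next week", "sports game on Saturday"]))
```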
>> Yeah.
>> Just to wrap up this discussion: if you each had to give a piece of advice or a takeaway for an enterprise exec who's facing some of these challenges, what are some of the things you've talked to customers about, or advice you've given to customers facing these issues?
The big one that I keep repeating to a bunch of execs is: think about augmenting your star performers as opposed to replacing them.
>> That is the biggest unlock you can get immediately.
>> I think that is the way to think about AI systems.
>> So that comes up over and over again.
>> Yeah.
>> I guess that's kind of...
>> Yeah, I feel like I would have said the same: start with a process that you already know can be improved, and start with the MVP of what can improve that process. I don't think you need to boil the ocean when you're trying to adopt an entirely new technology and workflow.
>> Yeah.
>> Yeah. There are two ways I give recommendations: the visionary way or the tactical way.
If you go tactical, break off pieces of the problem that you feel are measurable and have good outcomes, because you don't want to get into a rut where you try too many experimental, theoretical things that are too advanced without even having a thesis or hypothesis that they'll work. For example, I liked what we did with the telco: if we can reduce churn, that's a measurable problem, you can get numbers from it. So it's easy to see whether there's benefit and whether you want to do further advancement, and you learn the limitations along the way, right?
So I would encourage people not to overthink things unless they have a reason to, like they've already run experiments and genuinely know better. If it's your first attempt, you don't want to run and then get scared. So that's one piece of advice I would give.
And one last piece of advice: I think building AI is so difficult that you shouldn't be spending any of your time doing all this plumbing stuff. The plumbing just wastes so much of your time. Okay, you build an AI agent, now you've got to deploy it, now you've got to manage it, you've got to do secrets, you've got to do the glue, you've got to do the human escalation. None of that stuff really matters. In this sprint of AI, you should be focusing on one thing and one thing only: how do you model your business logic to the highest quality possible?
That's the only thing you should be doing. If you're doing anything else, you're basically wasting your time on cycles that don't matter.
So I do encourage people to keep that in the back of their minds.
And if they ever feel like they're not advancing as fast as they should be, they should always take a look at that process.
>> Great. All right. So to close us out, we love to end on some hot takes here at Scale AI.
So none of you have seen these before.
Is that right?
>> That's right. That is correct.
>> You didn't get my version of this. So what I'm going to do is read out loud a hot take that is sourced from the internet.
And then I'm going to have you vote.
Do you agree with this take? Thumbs up.
Or do you disagree with this take?
Thumbs down. And I'm going to say three, two, one, and then have you vote. Are you guys ready for your first hot take?
>> Ready. Is this going to be slop?
>> Okay.
>> Well, there's no AI. There's no AI here.
>> Okay. All right.
>> All right. The very first hot take.
Scaling agents is harder than building them.
Scaling agents is harder than building them.
Do you agree or disagree?
Three, two, one, vote.
>> A little mixed bag.
>> All right. Why, Alger? You were the odd one out.
What do I think? Building agents is really hard.
>> What makes it hard?
>> Uh, I think coming up with the right... I mean, a lot of the reasons that we talked about today: context engineering, orchestration of how agents talk to each other, deciding when to escalate to a human.
I don't even think that's a solved problem yet. I guess scaling would be the next problem, but...
>> Well, I think building a good agent is really hard.
>> It didn't say "good."
>> That's true.
>> Yeah, that's true.
>> It could be any agent.
>> Yeah, that's true. Building just any agent is not that hard.
>> Yeah, I think...
>> You were like this, right?
>> I was medium.
>> Yeah, you're on the fence.
>> The reason I'm on the fence is because I think scaling agents is really hard: you encounter a lot of unknown problems once you start to increase the amount of traffic you're going to get and increase the scope of what an agent can do,
because the surface area explodes. And I made this comment earlier: unlike traditional machine learning, where you have millions of GPU hours to smooth things over, you just don't have that. What you have is somebody doing context engineering, bringing data to it, and maybe fine-tuning bits and pieces, but you haven't smoothed those things over.
And so you're assuming that the assumptions you made while you were building are going to somehow still apply.
And that's not always the case. But the reason I'm in the middle is because, now that I think about it, the whole reason that was hard in the first place is because it's hard to build agents.
That's the reason it's hard. Like I said when I was talking to Sam, and I might have made this comment before in another episode or something,
I think there are really two ways to improve an agent at this point: you context-engineer it, or you train it, retrain the model.
But at the end of the day, you're assembling Lego blocks to try to make this thing work.
And you might have some gaps in between and all that.
So I think building an agent from the ground up is really, really hard.
And as soon as you scale it, you go back to why building an agent is really, really hard. So...
>> Yeah, it's kind of a...
>> Exactly. As soon as you get to a milestone... I see it very much as step-function style:
you work really hard, you get it to a milestone, you launch it, you say, "oh, let's do something more with it," and then boom, you've got to redo the building step again.
And it's just as steep as it was before, you know?
So, yeah, both are very hard.
>> Yeah.
>> Good. Your second hot take is context rot is a bigger problem for agents than hallucinations.
Context rot is a bigger problem for agents than hallucinations.
All right. Three, two, one, vote.
Some opposite votes here. Mi, let's start with you this time. What's going through your head on this one?
>> So, the main reason I voted this way, that I disagree with this, is because I think there are ways to measure context rot, and it is really hard to measure hallucination.
So I think both are hard problems, but at least you know one exists; it's still hard to know whether the other one exists and when it's happening.
And so, for that reason, I'm saying context rot is the harder one.
But um yeah, that's my take.
>> Or you're saying hallucinations is the harder one?
>> Yeah, hallucinations one. Okay.
>> I got confused.
>> Yeah.
Yeah, I think if you put it like that, it's actually kind of an interesting way to think about it: one is a more solvable problem, and the other is less solvable and more about control.
Yeah, if you put it that way, it's pretty interesting to think about.
The reason I say context rot is more of a problem is just because I think we spend so much time dealing with poor context and bad information. Context engineering is the way you guide the model, but it's also the way you misguide the model, right?
You can bring way too much data into the context window, or you can bring the wrong data into the context window, and all of a sudden the model is doing something poor because you're actively guiding it away from what you want. I'll give an example. We were building an agent that has access to the index of a digital media company. And yes, it has access to that company's content, but sometimes you'll get worse responses from that agent than you would if you just searched the web. It's interesting, because you're asking the agent to do a query on the back end, and it could query the wrong things. You're telling the agent, "look at this data, we got it for you, this is the golden data,"
but in reality, maybe the web search was a more natural way to find that document, and that was the right one.
So feeding the AI what almost feels like deliberately wrong information, by accident, is a very common problem,
and something we wrestle with a lot. But yeah, I think hallucinations are harder to solve, for sure.
Yeah.
>> All right, we got a minute left on the clock.
Uh, so rapid fire one.
The problems that agents tackle today in enterprises are not the truly hard problems. So the problems agents tackle today in enterprises are not the truly hard problems. Agree or disagree?
Three, two one.
Ooh, another set of opposite votes here.
Any one of you can start.
I think we're just scratching the surface of the problems that we're addressing.
I am sure that there are some really hard problems that agents are working on in enterprises.
I think we're still at the phase where we're experimenting, doing pilots, and trying to tackle the low-hanging fruit.
>> But yeah,
>> I agree. I mean, we spoke to a bunch of customers who are trying to do some really ambitious things, like building "Cursor for X." It could be CAD design, or simulating something, but Cursor for that. They're trying to do something really interesting; it's just really hard.
>> Yeah. You know, I think I made my opinion pretty clear before.
I think we want to do bigger things: harder, handoff-heavy, long-running stuff.
A lot of agents these days are research, read-based agents and not write-based agents, and they're not on critical paths a lot of the time, because people haven't fully embraced putting them on critical-path things. So yeah, I definitely think there's opportunity to improve.
Thanks for listening to Human in the Loop. To stay in the loop for more AI related content, make sure to like, comment, and subscribe.