
Building Decision Agents with LLMs & Machine Learning Models

By IBM Technology

Summary

## Key takeaways

- **LLMs unfit for decision agents**: Large language models are famous for inconsistency, where they might do the same thing every day and suddenly do something different, and they are notoriously black box, very bad at describing why they did something. You need transparency to explain decisions like why someone didn't get the job or the loan. [00:37], [00:58]
- **Decision platforms ensure ruthless consistency**: Decision platforms give ruthless consistency: if you use one, it's going to make the same decision the same way every time with complete control over exactly how the decision is made. Every customer gets it made the same way. [01:56], [02:14]
- **Stateless, side-effect-free agents reusable**: Decision agents need to be stateless and side effect-free, responding just to data given at the moment without remembering state, which is managed by a workflow agent. This allows reuse of the same eligibility agent across loan origination, letters, call centers, or marketing campaigns. [06:27], [07:28]
- **Embed ML predictions in decisions**: Deploy machine learning models like fraud prediction, credit risk, and payoff risk as endpoints consumed by the origination decision agent to make decisions more precise using historical data. LLMs are not good at building predictive models out of structured historical data. [14:45], [15:06]
- **LLMs enhance data ingestion, explanations**: Use LLMs for ingesting documents like boat brochures into structured data for decision agents and for turning detailed decision logs into human-readable explanations for call center reps or customers. This makes interaction easier both in and out. [18:36], [19:31]
- **A/B testing via rule versions**: Decision agents learn through A/B or champion-challenger testing by coding multiple rule versions in the repository, running comparisons to see which works better, then updating based on logs without random behavior changes. [22:12], [23:01]

Topics Covered

  • LLMs Fail Decision Agents
  • Decision Platforms Ensure Consistency
  • Build Stateless Decision Agents
  • Embed ML Predictions in Decisions
  • Decision Agents Learn via A/B Testing

Full Transcript

So decision agents are an essential component in agentic AI if you're going to solve large, complex problems. The challenge is that if you have a complex decision that you need to make autonomously, and if you're building agentic AI, you're going to need decisions to be made autonomously. Ah, but these decisions are not a great fit for large language models. And large language models, of course, are the sort of key technology in agentic AI, but they're not a good fit for decision agents. So you need to build decision agents in your agentic framework, but you need to use a technology other than large language models.

So why aren't large language models a good fit? Well, let's think about some of the things they are famous for. They're famous for inconsistency. They might do the same thing every day and then suddenly one day do something different. Well, that's not great. If you're trying to make a decision, you really need people to be treated the same and not vary day to day, minute by minute, just because the LLM feels like doing something different. They are notoriously black box. They are very bad at describing why they did something. And it turns out often you need to explain to people why you made a certain decision, why they didn't get the job, why they didn't get the loan. And so you need some transparency in all of this. And large language models are not good at that, even when you ask them to explain themselves. They have a little bit of a reputation for lying about how they decided what they decided. And then

the final one is that in many business decisions, there's history. You have a database of information that tells you what you should do, what might be fraudulent, what might be problematic. You need to be able to process that data and turn it into analytic insight. And large language models are not very good at that either. So for all of these reasons, you're just not going to use a large language model to build a decision agent.

So what you can do is you're going to use a decision platform or a business rules management system for this. And let's just sort of reiterate some of the sort of key value propositions for this kind of technology. This is well-established

automation technology that's been in use for a long time. And so we know what the benefits are. So

let's think about what some of those benefits might be. So what are our requirements for a decision that we're going to get out of using one of these platforms? So first and foremost, we're going to get consistency. So if I use one of these platforms, it's going to make the same decision the same way every time. It's going to give me complete control over exactly how the decision is

made, and once I've defined it, every customer who gets that decision made against them is going to get it made the same way. So I get ruthless consistency. The second thing they're really good at is transparency. Not only do I have a formal definition of how this works and what the steps are, what the rules are that I'm following, I can explain that to someone. I can show that to someone, and I can log it. So I can have a complete transparent log of exactly how this decision was made. I have this transparency that I need for a decision. The third thing they give me is agility. Now agility is important because the way I make a decision is subject to change without notice. Competitors change their behavior. The market changes. The regulations change. There's a

court case. There's all sorts of drivers changing the behavior of the way you make a decision. And

if you can't do that quickly, if you have to wait for there to be new data or new documents, or you have to retrain something, that's going to take too long. So you have to be able to respond more actively, more quickly. The other thing about decisions is that there's a tremendous amount of domain knowledge in them. So programmers often find it really, really hard to correctly build decision agents. And so, you really want to be able to engage people who have the domain knowledge. That means you're going to need some kind of low-code environment. You're going to need some way to engage, sort of, experts in managing the behavior of a decision agent while still being able to manage it as a programmatic component in your agentic AI framework. So you need some kind of low-code environment. And then lastly, you need this way of embedding analytics that we were talking about, where I can take analytic insight, I can turn historical data into analytic insight, and I can embed that analytic insight in my decision so that I can make the decision more precise, more accurate, more analytically precise. Now, all of

these are sort of classic benefits of using a decision platform. But let's just reiterate why an LLM is a tough call in these things. If I have an LLM, well, it's not really consistent. It's hard to make an LLM do the same thing. This is a feature, not a bug. That variation, that randomness is part of what makes them so powerful, but it's very hard for them to be consistent. Uh, they're definitely not transparent, right? They're very opaque about how they did things. Even attempts to get them to explain themselves are problematic. And if I go to a customer or a regulator and say, hey, I have this black box that's been explained by this other black box, that doesn't really induce confidence.

They are, actually, they can be hard to change. Their behavior is, uh, you know, easy to get set up. You don't have to, like, code it. You just, you know, provide information to it. But it's then hard to change without, you know, presenting new data to it, retraining it with new data. You can't just, like, tell it to stop doing X, stop doing Y. You've seen this if you've watched some of the news around attempts to block particular agents or make agents behave in a certain way: if you try and, like, just code something in quickly, you get very, very strange behavior. Um, they're quite complex. They require quite high-level AI skills to build them and manage them. And as I said, they're just no good at structured data. They're not good at building predictive models out of historical data, and

using that historical data to improve the precision of your decision-making. They're good at reading documents and text. They're not good at structured data. So we're not going to use a large language model. We're going to have to use something else. So what are we going to use instead? What technology can we use to do it?

So let's go back to the scenario I talked about in the previous video, where I talked about a bank that needed to lend money. Can't write bank today.

Needed to lend money to a person. So I wanna lend you money. And to do that, I have an agentic AI framework that manages that whole complexity. And as part of that, I have two agents, two decision agents. I identified one as an eligibility agent to say, are you eligible for a loan? And then another one which was to say, can I actually lend you the money? Which is sort of, uh, what banks call origination. Uh, you want to borrow this amount of money for this actual thing? Can I lend it to you? If so, what's the rate? What's the price of this? So I have these two decision agents. Now, we've been building these kinds of autonomous agents using decision technologies for a really long time. So,

um, there's a couple of things that need to be true of a decision agent. First of all, they need to be stateless and side effect-free. So what does stateless mean? It means that you want them just to respond to whatever data they're given at the moment they're given the data. Here's the data, here's the decision. Here's the data, here's the decision. Don't remember the state. That's why we had, if you remember, a workflow agent whose job it was to remember the state. So the workflow tracks the state, and it gathers the data for us that we need, and it passes that back and forth to these agents. So it says, okay, at this point in the process, I've got this set of data about this person, about this application, about this loan. Are they eligible? Yes or no. And you get an answer back. And similarly with the origination decision. So they're managing the state. They're

managing all of that. And that, uh, scales better. It, uh, keeps the decision agents simpler. Makes it much easier to check that you're not using things like personal information or health information inside the decision when you don't need to. So it's just a much cleaner interface.

But why side effect-free? Why is it important that your decision agents don't do anything, they just make decisions? Well, you want to be able to reuse them. Let's think about eligibility. Well, I might be using it in the context of a workflow for originating a loan. That's one use case for it. But I might have other processes, other workflows that do other things, that send you letters or that, you know, um, tell a call center rep or put you into a marketing campaign and so on. So I still need to know if you're eligible, but I'm going to do something completely different if you are eligible. And so by separating that, by not having the action be part of the decision agent, I get to reuse it in lots of different circumstances. So I have these stateless, side effect-free agents. Okay.
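The stateless, side-effect-free shape described here can be pictured as a pure function. This is a minimal sketch; the field names and thresholds are illustrative assumptions, not rules from the video:

```python
# Minimal sketch of a stateless, side-effect-free eligibility decision.
# Field names (age, annual_income) and thresholds are invented for
# illustration.

def decide_eligibility(applicant: dict) -> dict:
    """Same input, same output: no stored state and no actions taken.
    The calling workflow agent owns both."""
    reasons = []
    if applicant.get("age", 0) < 18:
        reasons.append("applicant under 18")
    if applicant.get("annual_income", 0) < 20_000:
        reasons.append("income below minimum")
    # Return the decision plus a rule trace for the transparent log;
    # the workflow decides what to *do* with the answer.
    return {"eligible": not reasons, "reasons": reasons}
```

Because the function only decides and never acts, the same agent can back loan origination, letter generation, a call center, or a marketing campaign.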

So how do I build one of these? What does that look like? What technology do I need to build a stateless, side effect-free decision agent that has these characteristics? Well, we use what's called a business rules management system or decision platform. So decision platforms are software stacks designed to build, you know, historically speaking, decision services that can then be wrapped into decision agents.

So what does a decision platform have? Well, it has a number of software components. First and foremost, it's got a couple of editors. It's got typically like an IDE or a technical editor and a low-code editor in which you can write logic, business rules, decision logic. So you can lay out the actual rules, the logic that has to be followed to make a particular decision. And those two editors generally are then linked to a single repository. Now,

this might be something like Git, but it might also be a more managed repository, specialized for business rules and decision technology and available to your low-code editor, so that you can have version control and branching and all those things. This varies by platform, but they all have the concept of a repository in which you can do branching, versioning, and all the

kinds of repository things you need to do to make sure you have a current version of the rules. And

you can do development work and have multiple people working and all that good stuff. Now, once

it's in this repository, and because it's a decision platform focused only on decision-making logic that is stateless and side effect-free, you can do a lot more testing and validation of the logic, so you can validate that the logic is correct. So you can often have a set of tools that look at the rules that are in the repository and validate them. Is the logic complete? Does one rule contradict the other? Are you missing a criterion that you're not checking? Do you have overlapping ranges, all that kind of stuff? And it's much easier to check that in the context of a decision platform, because the logic is written in a more declarative, less programmatic way, and it's managed as a set of assets that can be checked. So you typically have a set of validation tools so that the logic you write is more robust.

And then obviously you're going to need to test it. Now, testing tools can be as simple as the kind of pass-in-a-JSON-object, see-if-you-get-the-result-you're-expecting UI that you would use with Swagger or something like that, but they can also be a lot more sophisticated. Some of the decision platforms have very robust test suites, where you can load up very large numbers of test transactions, run them through, check the results against expected results, confirm you've passed all the tests and so on, and do all of this in a low-code way so that your non-programmers who are providing their domain expertise can also test it to make sure they haven't broken anything. Now, when it comes to decisions, testing is

necessary but not sufficient. Because within those decisions, within those business rules, there are going to be thresholds, places where you make choices as a business or as an organization as to what that threshold should be. There's not a hard "this is a good threshold, that's a bad threshold," but it's going to make a difference. So take loans. How much am I willing to lend you to buy a boat? Well, there isn't a right answer and a wrong answer in the sense that I can't write a test case for it. But the business could change that threshold, and it has an impact. I need to be able to track what that impact is. And so generally, we have some kind of impact tool that takes a bunch of historical data and loads it in and then runs a set of simulations. So very

similar to a test engine, but with a different perspective. Instead of saying this broke, this didn't break, it says, here's the difference: if you make that change to that rule, the results look like this, and if you make this change to this rule, the results look like that. So you can see what the impact of a change is going to be before you make it. With a lot of these tools, you have to deploy and put a test version out before you can do these things, but several actually allow you to do things like testing and simulation on rules you haven't deployed yet that are just in your repository, and manage all of that essentially under the covers so that you can do it inside your development environment. So they provide a lot of tools to make sure you have the

rules correct before you deploy them. Now, once you have them correct, obviously you do, in fact, need to deploy them. So you've got a deployment engine that deploys them as a service. So now I've got my rules service deployed, my decision service deployed. And it's going to execute those rules.

It's got the code and the engine that it needs to execute those rules. So when you pass in data, it's going to give you an answer. Now in this case obviously I'm going to expose it as an agent. So

I've probably got some kind of MCP (Model Context Protocol) server that exposes these decision services as tools, you know, and then those can be wrapped into an agent and exposed in my agent framework. So what are these agents going to do?
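As a rough illustration of the wiring, here is a minimal tool registry in the spirit of exposing decision services to an agent framework. The registry shape is a stand-in, not the actual MCP wire format, and the embedded rule is invented:

```python
# Hedged sketch: wrapping decision services as named tools an agent can
# call. In practice this would sit behind an MCP server; this registry is
# only a stand-in for that wiring.

from typing import Callable

TOOL_REGISTRY: dict[str, Callable[[dict], dict]] = {}

def tool(name: str):
    """Register a decision service under a tool name."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TOOL_REGISTRY[name] = fn
        return fn
    return register

@tool("check_eligibility")
def check_eligibility(data: dict) -> dict:
    # Stand-in for a call to the deployed decision service; the
    # credit_score rule is invented for illustration.
    return {"eligible": data.get("credit_score", 0) >= 620}

def invoke_tool(name: str, payload: dict) -> dict:
    """The agent passes a data packet in and gets a structured answer back."""
    return TOOL_REGISTRY[name](payload)
```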

The origination agent is going to say, here's my data packet. It's going to come in to my decision service, and I'm going to get a response back, which then, you know, goes back to my agent. So I

can quickly, uh, package up my rules as decision services. I can reuse rules and reuse logic and package it up in multiple services, deploy those services, wrap them as agents using MCP. And now

I've got a whole series of decision agents that I'm managing from this repository. The technology

is really good at doing things like: I've made a rule change, update the engine; handling in-flight transactions so that an in-flight transaction doesn't get broken if you change the rules. All of

that kind of, uh, constant update is all handled very effectively. So what this lets me do is it lets me build these rules, build these decision agents in a very robust way, and then deploy them as a service that I can then use to support my agentic framework by exposing them as agents. So

yeah, this handles, if you like, most of what's going on in an agent. If you think about these agents, this is all very prescriptive. So this is really describing how I write rules. How do I describe the rules, the logic that prescribes how this decision is made. But many decisions have a

probabilistic component too. So, you know, um, probabilistic, I probably spelt that wrong, but probabilistic elements too. So if it's likely that this is James, we'll do one thing, and if it's not likely that it's James, if it's someone impersonating James, we'll do something different. If it is likely that this is a legitimate transaction, we'll do certain

things. So these are probabilistic elements that are typically built using predictive analytics,

machine learning from my historical data. So in an agentic framework, what does that look like? Well

Typically what I'm going to do is I'm also going to deploy these machine learning components.
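A deployed scoring component can be pictured as nothing more than the formula the offline training produced. In this sketch the model is a payoff-risk predictor (one of the predictions discussed in the talk), but the logistic-regression weights and feature names are invented for illustration:

```python
# Sketch of a deployed scoring endpoint: training happens offline, and
# what ships is just the fitted formula. Weights and feature names are
# invented, not from any real model.

import math

# Coefficients a hypothetical logistic-regression training run produced.
WEIGHTS = {"intercept": -2.0, "rate_delta": 1.4, "tenure_years": -0.1}

def score_payoff_risk(features: dict) -> float:
    """Score one JSON-like feature object; no historical data is
    analyzed at request time, only the formula is evaluated."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["rate_delta"] * features["rate_delta"]
         + WEIGHTS["tenure_years"] * features["tenure_years"])
    return 1.0 / (1.0 + math.exp(-z))  # probability of early payoff
```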

So I might have a prediction, for instance, of fraud. How likely is it that this person is the person who's applying for this loan? I might have another one around credit risk. How likely are they to pay us back? And I might have a third one, which is payoff risk. How likely are they to pay us off early? And all of these agents are used by my origination agent as part of the

origination decision. So I need to be able to consume this. So how do I build those agents? Well,

I'm gonna use a machine learning platform to do that. I'm gonna use machine learning technology to do it. And generally, with machine learning, you're gonna do some kind of analysis. And this might

be, um, you know, supervised in the sense that there is a human user who is directing it, or it might be unsupervised, where you're really just using the algorithm and letting it see what it finds out about your data, which, of course, means you've got

to have data. So generally for machine learning, you have a lot of data so you have multiple databases that have to be sort of combined and merged and managed. And you're going to do something called feature engineering. So you're going to engineer a set of features. And features

are, you know, predictive characteristics of one kind or another, things that seem interesting. They

can be very simple. If you have a date of birth, they can come up with an age. They might classify something: I'm going to say which age range are you in, less than 20, 20 to 30, 30 to 40 and so on? Because the range seems important. But they can get quite sophisticated. They can say things like, how often have you been more than 30 days late in the last 180 days on a payment for a

bill? Well, that has to be calculated from all this data. So there's a lot of work to not just merge

this data, but calculate these features from it. And then I'm going to feed that data and my features that I've created into my analysis, run these machine learning algorithms, neural networks, regression models, decision tree analytics, all sorts of different analytic techniques to see if

I can find patterns or classifications or make predictions based on the historical data that I've got. It can be supervised, where I'm telling it what I'm looking for: can you tell me which features will predict that this person will pay off the loan early? And if it's unsupervised, I'm more looking for things like: is there anything unusual in here? What counts as an unusual pattern of data? Because that might be indicative of a new kind of fraud, for instance. And so the supervised ones are generally driven by a data scientist, by a machine learning engineer. The unsupervised ones, you know, are generally being kicked off and allowed to do their own piece. And then I'm going to go ahead and deploy these as, um, as endpoints that can be consumed by these agents. Now, we used to do

a lot of analytics in batch. So we would run these kinds of analyses and then update the database with a bunch of scores. Today we're much more likely to deploy them as individual endpoints, individual REST endpoints that I can pass a JSON object to to score and get a result back. And obviously once I do that, once I have an endpoint, I can use MCP again, and I can deploy those as tools that I can

make available to my analytic agent. I now have analytic agents talking to deployed endpoints. And

those endpoints run essentially an algorithm that's been built from my historical data. So

they're not analyzing the historical data at runtime. What they're doing is they're using the results of that analysis to say, okay, here's a formula that takes this data and calculates a payoff risk for this customer. So I can see how likely it is this customer is going to pay it off early and use that as part of my pricing. So I have all these analytic agents, they're deployed

into my, you know, into my agentic AI framework. And then my decision service is going to consume the results of those, those predictions, those probabilistic models as part of how it makes the business decision to originate you or not. Now, these are two types of technologies, decision platforms and machine learning platforms, and these are quite separate from large language models. But that doesn't mean they can't be enhanced with large language models. And there's

two areas in particular where we see a lot of work. One of them is this idea of a large language model for ingestion. If I've got documents, if I've got brochures, if I've got a recorded conversation, it doesn't matter how I've recorded a bunch of data, large language models are really good at extracting the data I need from that. So if I've got an origination decision and it needs to know, for instance, details of the boat you want to borrow money about, and I've got a brochure about that boat, then I can ingest that using a large language model and feed it directly into my origination agent as input data. So this gives me tremendous opportunity for making it much, much easier to supply the data I need. Often these decision agents need a lot of data. And so being able to consume documents and turn them into data is very

effective. The other place we've seen really good, um, uses is in explaining results. If you

think about it, when I invoke this decision agent, one of the things it's going to do is it's going to log how it made the decision. It's going to create essentially a detailed log of how it made the decision. How much detail goes in that is something that's up to you, but you can look at quite precisely how the decision was made. Which rules fired? How was the decision made? Now, that looks great for you. It's great for long-term improvement, great for understanding how your engine worked, how your decision agent behaved. It's not necessarily great for explaining it to a human being, a call center rep or a customer. So one of the other use cases for LLMs is to take this log data and turn it into an explanation. So now I can explain how that decision was made. And I can ingest textual data that you give me. So I can use these LLMs to make it

easier to interact with my decision agent, both in and out. Now there's one last step that I wanted to add, which is how do I make these things learn if I want them to learn, if I want them to get

better over time? What does that look like? How do I get my results to, like, you know, have an upward trajectory? Well, there's a couple of things to say about that. It really varies depending on the kind of agent you have. Many of the analytic agents will learn on their own behalf. The unsupervised ones in particular, they'll take new data and continually sort of, you know, update themselves as new data comes in. So as you run them, they make predictions, they make scores, and new data results in new scores. And so they constantly change their algorithms. Typically you have some guardrails on that, so they can't change too much without telling someone. But you allow them to essentially run experiments on their own data and experiment internally so that they evolve as predictors. So those kinds of agents, agents built on unsupervised analytic techniques, are inherently learning. But other kinds of analytic agents, uh, don't learn quite the same way. So you typically then have some data scientist who is

looking at, you know, doing new analysis with new data and proposing a new model. So they might do this every month, every quarter, every week: they review the data up until yesterday. They see what data has changed since the last time they built the model. They rerun the model and see if the algorithm is different or noticeably different. If it is, they typically will deploy a new endpoint. And that gets version controlled just like any other code. So any analytic agent can learn. It might learn automatically, but it might also learn because the data scientists are responsible for keeping it up to date over time.

But what about decision agents? Decision agents don't really learn, right? The whole point of a decision agent is that it's concrete, right?

That it's got this hardened definition of how it behaves. And so you don't really want to have it, like, randomly changing its behavior. So there's a couple of things you can do. First of all, in the rule repository, you can code multiple versions. So you can put in: here's the old version of the rules, here's the new version of the rules. And then put a rule in that says some people get one, some people get the other. And I get to experiment to see which one works better.
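That old-version/new-version split can be sketched as deterministic routing inside one decision function. The hashing scheme, the 10% share, and the score thresholds are all illustrative choices, not rules from the video:

```python
# Sketch of champion/challenger routing inside one decision agent.
# The split percentage, hashing scheme, and thresholds are invented.

import hashlib

CHALLENGER_SHARE = 10  # percent of traffic that gets the new rule version

def pick_variant(customer_id: str) -> str:
    """Deterministic bucketing: the same customer always gets the same
    variant, so the agent stays consistent while the experiment runs."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < CHALLENGER_SHARE else "champion"

def decide(application: dict) -> dict:
    variant = pick_variant(application["customer_id"])
    if variant == "challenger":
        approved = application["credit_score"] >= 600  # new rule version
    else:
        approved = application["credit_score"] >= 640  # current rule version
    # Record the variant in the decision log so outcomes can be compared.
    return {"approved": approved, "variant": variant}
```

Hashing the customer ID keeps the agent consistent per customer while still running the experiment, and logging the variant is what lets someone later compare outcomes.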

So I can run what's called A/B or champion-challenger testing by writing rules in my decision agent so that it looks like one agent to the outside world, but it's got these two versions that it's running comparisons for. So I can learn, and then I can have somebody look at the log, see which one works better, and, you know, close the loop, adding more rules back into the rule repository.

The other thing I can do is I can start to think about the overall agent and how the overall agent works. And this starts to get more involved. Because if you think about it, if I want to improve my origination agent, well, what does it mean to make better origination decisions? What that means is I lend money to people who pay me back. But they don't

pay me back at once, right? The whole point of a loan is you might pay me back over many years. And so I can't really tell how good you're going to be at paying it back until some time passes. So I can't do a real-time feedback loop, because it's nonsense, right? The idea that I'm going to find out in real time whether this was a good loan decision is just silly, right? So I have to get a log of how I made the decision, and log which scores and predictions I used and what version everything was, and store that in my log. And then I need to wait some period of time, and then somebody needs to come back and look at all this data and say, well, given this log data and the versions of the analytics that I used and the results I got out of this origination decision as processed through my workflow, and actually do that analysis work. So that requires a process and structure that you can follow. You can do it with agentic AI, but you have to be a little bit

more thoughtful about how you would do it. It's not enough just to rely on the individual agents to learn about their bit of the problem. Someone has to own the framework as a whole, the solution as a whole, and systematically learn from how well that works.
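Closing that loop is an offline job: once enough time has passed, join the decision logs against observed outcomes and compare the rule versions. A minimal sketch, with invented field names:

```python
# Offline feedback loop: months later, join decision logs with observed
# outcomes and compare rule variants. Field names are illustrative.

from collections import defaultdict

def compare_variants(decision_logs: list) -> dict:
    """Repayment rate per rule variant, computed from logs that recorded
    which variant (and which model versions) each decision used."""
    totals = defaultdict(lambda: {"loans": 0, "repaid": 0})
    for entry in decision_logs:
        if not entry["approved"]:
            continue  # no loan was made, so there is no outcome to observe
        stats = totals[entry["variant"]]
        stats["loans"] += 1
        stats["repaid"] += entry["repaid"]  # 1 if repaid, known much later
    return {v: s["repaid"] / s["loans"] for v, s in totals.items()}
```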
