
⚡️Monty: the ultrafast Python interpreter by Agents for Agents — Samuel Colvin, Pydantic

By Latent Space

Summary

Topics Covered

  • Type Safety Critical for AI Tool Chaining
  • Monty Bridges Safe Tool Calling and Sandboxes
  • Pyodide Fails Outside Browser Security
  • LLMs 100x Faster on Known APIs
  • Agent State Enables Self-Optimization

Full Transcript

>> Welcome back to the Latent Space live studio. We are here with Samuel Colvin from Pydantic, who's recently launched Monty. Welcome.

>> Thank you so much for having me. Yeah,

it's great to be back.

>> How have you been since the last time we saw you? How's Pydantic and Pydantic AI?

>> When did we last speak? Like about a year ago, I'm guessing.

>> Yeah, exactly. You'd just launched it.

>> Well, I don't want to indulge in all the cliches of it's been crazy, yada yada yada.

>> Brag a bit.

>> I mean, everything has changed. Not us so much, I mean, sure, we've had a great time, but everything has changed in a year. If we had spoken this time last year, I basically was slightly laughing at anyone who said that they weren't reading all their code, and now here I am building this, if you want to be pejorative, slop fork of Python, mostly with AI. I think it's incredibly powerful. I'm proud of it. But this is a crazy world, where you can sit down over Christmas and write 30,000 lines of Rust that is actually powerful and useful. Yeah, loads of stuff has changed.

>> And Pydantic Logfire, your observability platform: what does that mean for an observability platform? What does that mean for an AI native observability platform?

>> So we're weird, because I guess we mostly measure ourselves against the other AI observability platforms: Braintrust, LangSmith, Langfuse, those guys. But we're actually full observability, full OpenTelemetry. You send us logs, metrics, traces. Our pricing is a bit cheaper than other general observability, rather than 50x more expensive. But we do lots of the AI stuff: the evals, prompt playground, LLM traces, all of that. And I suppose people mean two things by AI observability: observability for AI, and AI for observability. We do a bit of both, in particular because Logfire lets you write arbitrary SQL to query your data. I don't think we've done anything particularly special in our MCP server, but because the AI can just go and write SQL, it's way more powerful than on many platforms. We basically get the AI SRE experience without having to do anything, because the AI just goes and writes SQL, and it can find a bug, or show you the five slowest endpoints by P95, or go do some random investigation on some attribute, which isn't possible if you didn't happen to make this weird, esoteric decision back in 2023 to allow arbitrary SQL from your users.

>> Yeah, that seems like a nightmare if you allow that. So you started Monty over Christmas. What's the inspiration? What's the origin story?

>> So I actually had a very early version of this that I had done a couple of years ago and had completely abandoned. And then I spoke to maybe four different people at Anthropic, and each of them independently... I did my standard thing about how important type safety is, because I think type safety is important for humans, but it's critical for AIs. I say this to people all the time, and some of them nod and some of them don't nod. But these four people I spoke to at Anthropic each independently said, "Oh yeah, type safety is super important if you're chaining tool calling, or if you're using code for tool calling." And when the fourth of these people said it to me, I was like: they're obviously thinking about something. I find Anthropic hilarious, because they're the most secretive company, and yet everyone gets excited about whatever's going on inside Anthropic, and they all hint at you about whatever it is that's going on at the moment. It's hard to resist, you know. But it also means you're speaking to the builders, not the marketers, right? The marketers won't say anything.

>> Yeah.

>> And so I kind of started building it, and then sure enough, in December, they came out with a programmatic tool calling piece, and then there was another one on using code to call MCP servers. Cloudflare kind of invented code mode, or at least invented the term, and have pushed it hard. And then I spoke to an investor who had been looking at the sandboxing space, and he said his guess was that 70% of sandbox invocations are basically this: tool calling, or glorified tool calling, whether it be to render a chart or do a calculation. Stuff that doesn't actually need full computer use, but where writing code is very, very powerful.

So Monty attempts to slot in between simple tool calling, which is very safe and relatively easy to implement and doesn't require external infrastructure, but isn't that expressive for the LLM, and sandboxes, which are much more expressive, you can do much more powerful things, but they require... well, the cost is actually the least of it, right? It's the complexity of setup, which is not too bad if you're a startup and you can just go and put your credit card into Modal or E2B or Daytona or whoever. But if you're a massive financial institution and you need to self-host everything, those are not an option for you, and so "just add code mode to my agent" is basically not possible. Obviously that's also true for people who can't afford it, but from a customer point of view for us, the power is on the enterprise side, where "just use a sandbox" is not a particularly easy solution.

And then the other massive win is latency, because if you can have a Python interpreter that runs inside the same process, in a hot loop we can go from code to execution result in under a microsecond, like 800 nanoseconds. In reality it's single-digit microseconds to run code, or single-digit microseconds to run the next step of a REPL, or single-digit microseconds to call a function on the host. Whereas creating a Daytona sandbox, for me, was taking one second. Obviously it's not as bad as that for resuming, but these are big differences in terms of time.
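The in-process versus out-of-process gap is easy to feel for yourself. This sketch times repeatedly executing a snippet inside the current interpreter against booting a fresh CPython process per run, as a stand-in for a cold sandbox; it doesn't use Monty, and the absolute numbers will vary by machine, but the orders-of-magnitude gap is the point.

```python
import subprocess
import sys
import time

CODE = "result = sum(range(100))"

# In-process: compile once, then execute repeatedly in the same interpreter.
compiled = compile(CODE, "<agent>", "exec")
start = time.perf_counter()
for _ in range(1000):
    namespace = {}
    exec(compiled, namespace)
in_process = (time.perf_counter() - start) / 1000

# Out-of-process: boot a fresh CPython for each run, like a cold sandbox.
start = time.perf_counter()
for _ in range(3):
    subprocess.run([sys.executable, "-c", CODE], check=True)
out_of_process = (time.perf_counter() - start) / 3

print(f"in-process:    {in_process * 1e6:10.1f} µs per run")
print(f"fresh process: {out_of_process * 1e6:10.1f} µs per run")
```

Process creation alone typically costs tens of milliseconds, while the in-process loop is microseconds, which is the gap Sam is describing between a same-process interpreter and spinning up sandboxes.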

>> Yeah. I think this is a good point to bring up the screen, because I notice you have a really nice comparison table. Obviously it's a Python interpreter, so it can run Python, right? I don't know what else you would demo there.

>> Yeah, I can share my screen now, and I can show that table and talk through it. I'll be blunt about the fact that there are trade-offs.

>> I want to see what the design space is, right? Because this is all a design question.

>> So, this is the GitHub. Obviously 6,000 stars and still growing, which is incredible for an open source project, but you're very good at that. You have a well established reputation in open source these days.

>> We're generally better at getting downloads than stars, but yeah, it's definitely struck a level of interest with people.

>> You track downloads for this?

>> Yeah, I mean, it's nothing yet. I think we were at 27,000 downloads last week, which is obviously a nice number, but for those of us who are maintaining Pydantic, it's a tiny number.

>> Anyway, so to me this is a logical progression from type safety to just overall safety of code execution, I guess. But this is a crowded space. The first thing that someone who's familiar with this mentions is Pyodide, right? Because if you want a WebAssembly sort of lightweight Python interpreter, you'll go to Pyodide.

>> Yes, and indeed we did that. We made the classic mistake of using Pyodide. To be clear, I'm a massive fan of Pyodide. Hood, who maintains it, is a good friend of mine. I have nothing but respect for it as a project. It turns out it's a very bad way of running Python when you're not in the browser. We maintained a thing called mcp-run-python, and one of the impetuses for us to build this was that people just kept reporting security vulnerabilities with mcp-run-python, and solving them got harder and harder. Because sure, Pyodide runs Python in WebAssembly, but WebAssembly is not inherently isolated. So you can't run it with Node: if you run it with Node, you can import the js module and go and run arbitrary JavaScript code to access the host. So you have to run it inside Deno.

Now with Deno, you can go and set restrictions on what file system and what networking it's allowed. But then you have to expose a bunch of files, because if you want it to be able to download packages, it has to be able to write to the node_modules directory. Even if you don't allow that, Deno does not have any way of controlling memory. So even if someone can't run arbitrary code, they can OOM your machine as often as they like. The other problem we had was someone pointed out that although you can't escape the Deno sandbox, you can run arbitrary code within that sandbox and basically taint the server. So on every single invocation you basically need to kill the Deno sandbox and create a new one, which means your latency is actually worse than a full-on sandbox. I got 2.8 seconds here for running basically 1 + 1 in Pyodide. You might be able to improve that slightly, and you can work around some of the security problems, but fundamentally it's pretty heavyweight. And of course, even if you get all of those sorted, you need to install Deno, then you need to download Pyodide. I think I have the numbers here: Deno is 50 megabytes, the Pyodide package is 12 megabytes. It's not trivial to just go and get it running. One of the nice things about Monty is that, because it's ultimately just a single Rust binary, you can install it with PyPI or with npm, and there are actually PRs to add support for Dart and Kotlin at the moment. It should be very, very simple to install. Ultimately, you can run it anywhere you can run Rust.

>> Amazing.

>> What trade-offs do you make?

>> So the biggest downside of Monty is that it is not full CPython. We are implementing it all ourselves, and I think I have some comparisons up here of what you can and can't do. We have a few standard library modules, like asyncio and dataclasses, that we support little bits of, but there's no support for third-party libraries just being installed. There never will be, directly, as in we'll never speak the CPython ABI so you can install Pydantic or install NumPy or something. You can't yet even define classes in Monty; we may at some point make that work. We don't support match statements yet, but we probably will. We can't just go and use the standard library, so we have to work out which bits of the standard library we want to manually go and implement.

>> This sounds like an enormous task, and also kind of self-sacrificial: this is a Pydantic runtime that doesn't allow Pydantic.

>> I hear you. What's super powerful, I think, is that we have a very trivial example here. This is a kind of joke example of running the standard agentic loop within Monty. But the point is you can call, in this case, call_llm, which is an external function on the host that you can just go and call. So you can make a network request by registering a fetch method. You can do some Pydantic validation of some JSON data by basically having an external function call. You can do all of that stuff. It's just that those external packages don't run within the runtime.
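The "external function on the host" pattern can be sketched in plain Python. To be clear about the hedge: `exec()` is not a security boundary at all, and this is not Monty's API; Monty gets its isolation from being a separate Rust interpreter. The `fetch` and `call_llm` names follow the conversation, but the bodies here are stubs. The sketch only illustrates the plumbing: the host registers a curated set of functions, and generated code can call those and nothing else it wasn't given.

```python
def fetch(url: str) -> str:
    # Stand-in for a real HTTP call the host would perform and police.
    return f"<html>response from {url}</html>"

def call_llm(prompt: str) -> str:
    # Stand-in for a model call made on the host.
    return f"summary of: {prompt[:30]}"

HOST_FUNCTIONS = {"fetch": fetch, "call_llm": call_llm}

def run_agent_code(source: str) -> dict:
    # Curated namespace: a few builtins plus the registered host functions.
    namespace = {"__builtins__": {"len": len, "print": print}, **HOST_FUNCTIONS}
    exec(source, namespace)
    return namespace

ns = run_agent_code(
    "page = fetch('https://example.com/pricing')\n"
    "answer = call_llm(page)\n"
)
print(ns["answer"])
```

Every call the generated code makes goes through a host-controlled function, which is what lets the host decide which domains, timeouts, and budgets to allow.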

>> The other thing to say is that I wanted to implement the 20 or so built-in functions available in Python. I basically said to the LLM: go and implement all the standard library functions that you need to use regularly. We did a bit of planning, and it set off and ran for two hours, and what would have taken an experienced developer weeks to do was done in a couple of hours.

My take is that there are four things where, if you can cover all four, LLMs are not 3x or 5x faster, they're like 100x faster. First, the internal implementation is well known to the model, as in it knows how to, in this case, implement a heap or implement a bytecode interpreter. Second, the external API is well known to the model: you don't need to explain how Python should work or what the external interface should be, it just knows it in its soul, in its weights. Third, unit testing is really simple. You're just asking: does it match Python? So it can almost Ralph itself into passing, right? You just say the traceback needs to be byte-for-byte identical. And lastly, you don't have to bikeshed at all about what the interface needs to be. You don't bikeshed about what the error message should be when you add a string to an integer, or what happens with a given sequence of arguments. It's all just defined by Python, so there's no need for the humans to go and argue. It just works. If you cover all

argue. It just worked. If you cover all four of those things, that's why everyone is like basically cloning Reddit right now in Rust. You know, that's the kind of meme

Rust. You know, that's the kind of meme right now because those four rules all apply really well. There are loads of, you know, the LLMs know what they need to do internally. They know what they need to do externally. The unit tests are easy and there's no bike shedding

about the API. If you can do those four things like tasks get like I mean yeah I would say ballpark 100x faster and so you know I think there's a there's a PR

on Monty right now which I profess I haven't actually read yet that implements like 50 functions in the math module I don't know anything about this this PR

I haven't read it yet but the point is someone has set off an LLM and it's added 800 tests or maybe that's the the full test

existing and I need to look at it. I'm

not saying I know that much about the implementation.

>> You have a nice... what is that orange thing at the top?

>> So, it gets inserted here. Basically, this started because one of the OpenAI co-founders created an issue on Pydantic, and we just closed it and said it was wrong. And so we have this thing that injects itself, tries to summarize the person, and gives them a brutal score of one to five on how important they are. So when Dario next creates an issue, we're not just going to say that's dumb and close it.

>> Anyway, I think there's more scope for AI reviews. I have worked on an AI review tool, and there are many, many others out there. I wonder if you find those helpful at all.

>> So in Pydantic AI the guys are using Devin quite a lot, and actually finding it useful. We tried quite a few. It's not that nothing else worked, but having gone through a few different options, Devin seemed to be the best of the off-the-shelf ones. I think we actually now have our own one, trained on basically all of our reviews, that is running alongside it.

>> But yeah, if you look at, for example, the datetime implementation, which is here somewhere... I think datetime is already merged. The datetime implementation is an awful lot of code, but the point is that the AI knows exactly what it's doing, and so it can implement it pretty quickly. I've reviewed this a few times; I need to get on and merge it. It's a big PR, 4,000-plus lines. But you can go and look at the tests, and ultimately you can be like: yep, sure enough, it's basically going to go through and test all of the stuff. And of course the point is that every single one of these tests runs with both CPython and Monty and checks that the output is exactly the same. So these are the kinds of tasks that are much easier. I agree with you that implementing the full standard library would be a hellish task, but I don't think that's necessary. I think we need to do three or four more modules, and then you just say to the AI: you can only use the following. It's amazingly good at doing that. And for an enormous number of tasks, it just makes an enormous difference, in terms of what's possible and how efficiently the models can perform the task, to have this code mode available.
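The differential-testing style described (run the same cases through CPython and the reimplementation, and demand identical results) can be shown in miniature. This sketch is not Monty's harness; it compares a hand-written `my_sum` against CPython's builtin `sum`, and by capturing exception types it also enforces that errors must match, which is the "traceback must be identical" idea.

```python
def my_sum(iterable, start=0):
    # Deliberately simple reimplementation of the builtin sum().
    total = start
    for item in iterable:
        total = total + item
    return total

def outcome(fn, *args):
    """Capture a result or the exception type, so errors must match too."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("err", type(exc).__name__)

# The last case raises TypeError under both implementations.
CASES = [([],), ([1, 2, 3],), ([1.5, 2.5], 10), (["a", "b"],)]
for case in CASES:
    assert outcome(sum, *case) == outcome(my_sum, *case), case
print("reimplementation matches CPython on all cases")
```

Because the reference implementation already exists, there is nothing to argue about: the test oracle is just "whatever CPython does", which is why an LLM can iterate on this kind of task largely unattended.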

>> Yeah, amazing. I think it's interesting how little moat software generally has, because Pyodide's been around for, I don't know, maybe 10 years, maybe less than that, and Redis has been around forever, and here we are trying to rewrite all these things on a whim. And we actually can; this is absolutely crazy. While we're on this topic: you use Devin, and I'm sure you use Claude Code. Any coding tips? If people were to attempt the same kind of thing, implement a full runtime, what advice would you have?

>> So I use Claude Code, Codex, and Gemini. I use mostly Claude Code.

>> Gemini CLI, or Gemini?

>> Gemini CLI.

If you thought of each of them as a superhero, Claude Code is your cliched Captain America: pretty competent, does most things right, a bit overconfident, but whatever, fine. Codex is the neurotic, geeky, kind of Q-in-Bond type, very specific about little things that don't particularly matter. And Gemini is the Joker: this unhinged lunatic that will occasionally do an incredible job, but half the time just delete all of your files. I almost never run Gemini with the capacity to do anything. Actually, I can show you now.

>> Yes, please. I don't meet enough Gemini CLI users. Obviously they're working hard on it, but they just don't have the market share.

>> So for example, this is my very brutal way of doing reviews. I literally pipe the diff into Gemini, and Gemini by default is not allowed to run or edit any files, and it will just go off and review a particular branch and write me out a report. Then I basically point Claude Code at that report and say: implement the following things. I'll edit it if there's stuff that's wrong. I do something fairly similar with Codex, which works fairly well. Look, you could do lots of cleaner things with this; I'm sure there are more elegant ways of doing it. I really want to start using Code Puppy; I just haven't got around to it yet.

>> I've never heard of Code Puppy.

>> Code Puppy is a coding agent built by an amazing guy called Michael Pfaffenberger at Walmart, built with Pydantic AI. It hasn't got very many stars, hasn't got lots of hype around it, but it has got enormous adoption amongst developers and non-developers to basically automate bits of their job, which is one of the reasons I find it really fascinating. I won't say more about it, because I'm not sure how much I'm allowed to say publicly, but I've heard really good things about it and I want to try using it. Whereas OpenCode: I find OpenCode's willingness to get in the way of my scroll and be a full TUI just a killer for me. I'd honestly rather have Gemini CLI.

become a full user interface like that that that's too much, right? Yeah. I I I hear that a lot. So someone said to me which makes sense like the single biggest private codebase in the world is

Google and supposedly Gemini is trained on it. So it makes sense that in the

on it. So it makes sense that in the like very technical end of the spectrum, right? Like Monty is not your average

right? Like Monty is not your average like build me a build me a web app. Like

it's there are some very technical like what's the best way of implementing a bite code interpreter? How can I make the strct a few bytes smaller? Whatever

this might be, right? That stuff really credible that the things that Gemini knows how to do that other models don't.

And so that that's I think that's my reason for trying it. Sometimes it comes up with something magic. Obviously, I

would need to run each of them three or four times to like do a scientific test of which one's best. I haven't done that. I just use use a mixture of them.

that. I just use use a mixture of them.

>> I think it's a reasonable story to tell about Gemini, but I've asked actual Google people this, and they use an internal thing called Jetski. The simple reason is they have internal versions of everything else, so if they released the internal version to the external world, it would just make no sense. It would just reference Borg, and nobody has Borg.

>> Interesting. So you think it's not actually trained on much of it?

>> It's not trained on it, but it's probably informed by it, which is close enough. Trained on is a very specific thing: the model would spit out names of internal Google tools, which would be meaningless to you because you don't have them. So what's the point, right?

>> Yeah. But it makes sense that it's informed by the same people who worked on that codebase, and therefore it has those technical proficiencies, which are hard to articulate.

>> The other thing I will say about it, in comparison to both Codex and Claude Code, is that it is so fast to review. I don't know for sure, but my inclination is that they basically wrap up the entire diff of your pull request, if you're trying to review a branch, and just make one request of it. The model sits there and churns for a minute or two and returns you a report. Whereas Codex is going to go through and agentically investigate all of the changes and try to link them up: how does this relate to that? Maybe it does a marginally better job, but one can honestly take half an hour, and one is going to be done in like 90 seconds. So quite often that Gemini first review will find things that Claude Code and I have done wrong, and fix them more quickly.

>> Yeah, I think we need to standardize this. I think it's getting a little sloppy how people are incentivized toward more and more autonomy, but actually sometimes you just want a quick answer, and there's sometimes no opt-out button for that, so the tools want to take over more and more things.

>> I use Opus 4.6 for the most part, but I have honestly had tasks where I'm like: I could do this faster than that. Sure, I have a much better mental model of the codebase than it does, even after it's read the CLAUDE.md, so I'm not saying it's a fair fight in other ways. And of course there are many situations where I would rather it churned for 30 minutes when I could do it in 10. But there are genuinely situations where I could have done the change faster. And what I hear, who knows whether this is true, is that for Opus 4.6, or the most recent Claudes, they basically took the feedback that Codex was more precise, and made it think harder, made it investigate for longer. Fine. To your point, there are places where that autonomy is worthwhile, but there are places where I need to get something done and would rather move a bit more quickly.

>> Well, I think that is the state of Monty. Where's it going next? Is there a commercial angle to this at all, or what are you thinking?

>> Yeah. Would it be interesting for me to show you a quick example? I'll show you what we're doing in this example, because I think it will hopefully give people a better sense of what's going on.

>> So fundamentally, what we're trying to do here is scrape the prices of LLM models from their websites. As we all know, AI companies have a very well-developed sense of irony, and so, having made their millions out of scraping the internet, they then make it incredibly hard for us to get data back from their websites. We maintain a library called genai-prices, where we have the prices of all of the models, and at the moment that's basically handwritten, as in when a new model comes out, we go and read the page. So what would it look like for us to go and download that stuff? So this is a Pydantic AI agent.

In this particular mode we're actually running it with a manually implemented loop. In general you wouldn't need to do that; in general you would just use the code mode feature in Pydantic AI, but I wanted to do some slightly different things. The interesting thing here is that we take Playwright, and we have an open_page function, so it has access to the browser, and it will return me an instance of this Page type. Coming back to your question about how I register stuff inside Monty: this Page dataclass here has some attributes and a bunch of methods, like go_to and click and fill and everything. It's using real Playwright internally, but you can expose the dataclass inside Monty, and so the Monty code gets access to run all of these functions. It's also worth saying that Monty has the ty type checker built into it. So before it will run any code, it's running type checking with these type stubs; the code shouldn't run unless the LLM has got the typing right. And then we have this BeautifulSoup function, which gives us back a basic thing that allows us to query the DOM, and that gives us this Tag type, where again we do the same thing: we have a bunch of functions that we can register with the LLM. So ultimately, let me run this example, and it should give you some idea of what we're doing. Does that make sense, swyx? Or is there anything there I should explain better?

>> I'm following so far.
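The typed Page surface being described might look roughly like this. Everything here is a hypothetical sketch: the method names (go_to, click, fill) follow the conversation, but the bodies are stubs rather than real Playwright calls. The interesting part is deriving the signatures shown to the model directly from the class, so the prompt and the implementation can't drift apart.

```python
import inspect
from dataclasses import dataclass

@dataclass
class Page:
    url: str

    def go_to(self, url: str) -> None:
        self.url = url  # the real version would drive the browser

    def click(self, selector: str) -> None:
        pass  # stub; real version wraps Playwright's click

    def fill(self, selector: str, value: str) -> None:
        pass  # stub; real version wraps Playwright's fill

    def content(self) -> str:
        return f"<html>contents of {self.url}</html>"

def stub_signatures(cls) -> str:
    """Render public method signatures, ready for a system prompt."""
    lines = [f"class {cls.__name__}:"]
    for name, fn in inspect.getmembers(cls, inspect.isfunction):
        if not name.startswith("_"):
            lines.append(f"    def {name}{inspect.signature(fn)}: ...")
    return "\n".join(lines)

print(stub_signatures(Page))
```

A type checker running over the generated code against stubs like these can reject a call with the wrong argument types before any code executes, which is the check-before-run behavior described.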

>> It'll probably be more interesting to look at what's going on inside Logfire.

>> Here we are. So you see here we have the system prompt, where we've basically told it what code it can use, and there are not that many instructions, right? Most of those things we will implement some subset of, so we haven't wasted thousands of tokens explaining what it can and can't do. Then we've given it the type hints: these are the functions you can call, here's open_page, here are the docstrings for the functions you're allowed to call, yada yada yada. It will then go and write the code that it needs. So here you see the code written by the LLM. It's going to get the pricing from Claude's docs, turn that into a BeautifulSoup Tag, then inspect that tag and use it to investigate, returning some HTML about the tables, and it'll do a bunch of investigation. And eventually, if the demo gods are with me (this is taking particularly long this time, but sometimes it doesn't), it'll come back and give me a summary of the prices of the models. You can see it's run into a bunch of errors here.

In particular, it thought that asyncio didn't need to be imported, because it assumed a REPL. It turns out the models are very strongly trained to assume a REPL wherever they're writing Python code. So the big thing we're waiting for, to have code mode supported in Pydantic AI, is that we know we need a REPL, and we're just working through the final bits to solve it. But the point is that because it hadn't imported asyncio, the type checking failed and told it: you need to go and do this. The next time it ran code, it got something else wrong, but eventually, once it had got it right, it was able to go and run the code. And if you look at the full agent run here, you'll see that it wrote a number of different blocks of code, but ultimately came back and was able to extract the prices pretty successfully.

successfully. Fully agentic web scraper.
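As a rough illustration of the kind of scraper the model converges on: the real generated code fetches the live docs through a host-provided function and uses BeautifulSoup, but the core "parse the pricing table out of the HTML" step can be sketched with only the stdlib. Everything below (the HTML snippet, the numbers, the class name) is invented for illustration, not the actual generated code.

```python
from html.parser import HTMLParser

# A hypothetical fragment of a pricing page; the real agent fetches
# the live docs through a host-provided HTTP function.
PRICING_HTML = """
<table>
  <tr><th>Model</th><th>Input $/MTok</th><th>Output $/MTok</th></tr>
  <tr><td>opus</td><td>15.00</td><td>75.00</td></tr>
  <tr><td>sonnet</td><td>3.00</td><td>15.00</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = TableExtractor()
parser.feed(PRICING_HTML)
header, *body = parser.rows
prices = {row[0]: {"input": float(row[1]), "output": float(row[2])}
          for row in body}
print(prices["sonnet"]["input"])  # 3.0
```

In the real run the model iterates on this kind of extraction several times, inspecting the tags it gets back until the table structure is clear.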

But look, of course, you could do this by basically setting up Claude Code or your coding agent of choice. And that's fine if you're running it locally and you're either going to watch it or you're going to, like, YOLO it with --dangerously-skip-permissions. But if

you're running this kind of thing in the cloud and you're going to have ultimately untrusted people prompting the model, that is effectively the same as letting an untrusted person write the code. And so you need this level of control, where you can say: this is exactly what you're allowed to do, these are the exact functions you can call. If you allow it, for example, to load a web page or to make a GET request, you can be very specific, because all of those function calls are

going through the host, about exactly what domains it can and can't connect to. You can control how long it can execute for, how much memory it can use.
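Those host-mediated controls can be pictured like this. This is not Monty's actual API, just a sketch of the host side: sandboxed code can only reach the network through a function the host implements, so the host is where the allow-list lives (the function name, exception, and domains here are all invented).

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"docs.claude.com"}   # hypothetical allow-list

class PolicyError(Exception):
    """Raised when sandboxed code asks for something outside policy."""

def http_get(url: str) -> str:
    """Host-side implementation of a sandboxed function: the interpreter
    never opens sockets itself, so every request funnels through here,
    where the host enforces its domain allow-list."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        raise PolicyError(f"domain not allowed: {host!r}")
    # ... perform the real request here (omitted in this sketch) ...
    return f"<html>contents of {url}</html>"

print(http_get("https://docs.claude.com/pricing")[:6])  # <html>
try:
    http_get("https://evil.example.com/")
except PolicyError as e:
    print(e)  # domain not allowed: 'evil.example.com'
```

Timeouts and memory caps work the same way in spirit: the host owns the execution loop, so limits are enforced outside the untrusted code rather than by it.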

And so, you know, the ideal is we'll get to a situation before too long where you can basically go and run this, let any old code run, and the worst you're going to get is an error.
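That's also what the type check earlier in the demo was doing with the missing asyncio import: failing before execution rather than at runtime. A toy version of that kind of pre-run check can be written with the stdlib `ast` module; Monty's real checker is a full type checker, and this sketch only flags top-level names that are used but never bound.

```python
import ast
import builtins

def unbound_names(source: str) -> set[str]:
    """Very rough sketch: report names that are used but never
    imported, assigned, or defined (and aren't builtins)."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                               ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
    used = {
        n.id for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    return used - defined

# The model assumed a REPL and skipped the import:
generated = "result = asyncio.run(main())"
print(sorted(unbound_names(generated)))  # ['asyncio', 'main']
```

Feeding that error message back to the model is what drives the retry loop: the next block of generated code includes the import.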

Yeah. Beautiful. But what I was going to say, in answer to your question: I don't think I have an example of this here, but you see this took 102 seconds and cost 30 cents to run. If you ask the LLM, in its response, as well as the information about the prices, to please also return the optimal Python code to run this task again, and then you give it that code when it runs again, it gets it right first time, because effectively it's just going to go and run that script. And so I think that there is an enormous opportunity

for effectively a new layer of state within applications. You could think of it as memory, but it's often a lot more than memory; it's the current state of agent optimization. Some of which will be code, some of which will be model choice and model settings, some of which will be system prompt. And that is the big drive we have now in Logfire: basically moving beyond evals to the agent improving over time, whether that be completely autonomously or with a user looking at the changes and clicking accept. And so

Monty itself doesn't make sense to have a hosted version of, because the whole point is you don't need to use a hosted service. But I think there's an enormous space for services on top of it, where you're like: yeah, of course you can go and use it in your language of choice, but by the way, we have the service that will make it that much more performant. I think

a lot of people, a lot of observability people, eventually land on this sort of self-improving agents ideal. The problem is it requires a lot more scope than you currently have, including access to my codebase, which I don't know how you're going to handle, but it's going to be interesting.

Yeah, I agree with you to some extent. I

mean, one of the things we're doing now in Pydantic AI, we're about to introduce (I think there's a PR out for this, so I think this is public) serializable agents. So basically you can define an agent entirely in a TOML file: everything from the model to the system prompt to all of the capabilities, whether they be code mode or compaction or registering an MCP server.
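Purely to illustrate the shape of that idea (the actual schema in the PR may differ entirely, so every field name below is invented), such a file might look like:

```toml
# Hypothetical sketch only: field names are invented, not the real schema.
[agent]
model = "anthropic:claude-sonnet-4-5"
system_prompt = "You are a pricing research assistant."

[agent.capabilities]
code_mode = true        # run model-written Python in a sandbox
compaction = true       # summarize long histories

[[agent.mcp_servers]]
url = "https://mcp.example.com/pricing"
```

The appeal is that the whole agent definition becomes data: something an optimization loop can rewrite, diff, and version.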

There are lots of advantages to that, but one of them is you now have one file where you can go and run that optimization, and in theory you can get to the point where that one file can basically run untrusted, as in it only has the capabilities that you have registered with it, and Monty is running all of the arbitrary code that's being executed in that process.

>> I think this is a good idea. I'm not sure about the TOML, though, because the main idea is that TOML, like any specification language, any markup language, is just a DSL, right? And you would have to reinvent a DSL: you're going to invent branching and variables and looping, and then you just have a half, poorly implemented version of a general language.

>> I agree with you, and I think if we get to that point, we resort to Python, or we resort to Monty, for some of that stuff.

>> I think Monty is it. You're already there.

>> Okay, I hear you on that. I mean, I think the point is that if you think about what agents are, there are exceptions of course, but there's an awful lot of them that are somewhat formulaic. You have a system prompt, you have a model, you have an output schema, you have some MCP servers that you register, you have some settings like: is it durable, compaction, yada yada yada. There are 30 or so of these capabilities, and it's effectively a choice of which ones do I switch on. And sure, if I want to go and register arbitrary tools, that's a whole different thing and this breaks down. But often those tools, more and more, at least in enterprise, are being packed up and put behind MCP servers. And so again, it comes down to: the settings are the link to the MCP server and how we're going to do, or

>> Oh yeah, auth is another thing. But yeah, I do think that this can work as a standard. I think you have a shot at defining the standard, and serializable agents is a really good name for it. So, I'm excited to see that spread. Excellent. I think that's going to be it for our pod. You're going to come speak more at AIE in London, which we're all very excited about. I do think this is kind of a big celebration of everything in sort of European AI, particularly London. And yeah, I'm excited to see you there.

>> Yeah, really looking forward to it. I don't know when this podcast is going out, but we're also super excited for our conference next week. But I presume this will go out; I think we're sold out anyway, so I think we're good. But yeah, really looking forward to London. That's going to be great.

>> Presumably your conference is going to be out on your YouTube and we'll send people there.

>> Yeah. So we're co-hosting it with Prefect. So some stuff will be on our YouTube, some will be on Prefect's, but there will be links to all of it; you'll be able to find it relatively easily, and it will all be recorded. So yeah, we're so lucky to have people like Guido and Sebastián speaking, and Armin; we have an amazing lineup of speakers. So I'm super excited for it. It'll be a great day.

>> Excellent. Well, thank you for your time, and congrats on finding a new, I guess, angle for open source to penetrate the sort of runtime agent conversation. I do think

this is, in a way that I guess Deno and Bun revolutionized Node.js pre-AI, you're kind of doing this in the AI-native era and building exactly what agents need. I do see that there's a gap in Python for that. So you guys

>> I spoke to my friend David Soria Parra at Anthropic, showed him this, and he was like, yeah, this is exactly what everyone needs.

>> The question is how are you going to make money out of it, and I was like, yep. I think that's, bluntly, the biggest question on Monty: how do we make money from it? But hey, I think we're doing well in Oxford.

>> Bun didn't have to. Let's just put it that way, right? [laughter]

>> Yeah, I've got a funny story on that, but I won't include it in the post.

>> Not for public consumption. Okay. All right. Well, I'll stop the recording there and I'll see you in London.

>> Cool. Great. Thank you so much.

[music]
