State-Of-The-Art Prompting For AI Agents
By Y Combinator
Summary
## Key takeaways

* **Metaprompting leverages LLMs to dynamically generate and refine their own prompts**, allowing for continuous self-improvement and adaptation based on feedback or new examples, akin to unit testing in software development. (6:51, 29:47)
* **"Evals" (evaluations) are the true "crown jewels" of AI companies, not the prompts themselves**, as they provide critical insights into why a prompt performs a certain way and are essential for iterative improvement. (14:18, 14:55)
* **Founders of AI companies must adopt a "Forward Deployed Engineer" (FDE) model**, acting as technical ethnographers who deeply understand specific customer workflows and rapidly iterate on AI solutions directly with clients. (17:25, 23:18)
* **It is crucial to provide LLMs with an "escape hatch" in their prompt design**, allowing them to explicitly state when they lack sufficient information rather than hallucinating responses, and using this feedback for prompt debugging. (10:48, 12:10)
* **Different LLM models exhibit distinct "personalities" or behavioral traits**, particularly in how they adhere to rubrics or handle exceptions, which necessitates tailored prompting strategies for optimal results. (26:13, 27:26)

## Smart Chapters

* **Intro: The Evolving Role of Prompting** (0:00) The discussion introduces metaprompting as a powerful tool, likening prompt engineering to early coding and human management, and sets the stage for exploring state-of-the-art AI agent prompting.
* **Parahelp’s Prompt Example: A Deep Dive** (0:58) A detailed breakdown of a six-page prompt from Parahelp, an AI customer support company, showcasing how role, task, planning, and output structure are defined for a vertical AI agent.
* **Different Types of Prompts: System, Developer, and User** (4:59) An explanation of the emerging architecture of prompt types, differentiating between a general system prompt, customer-specific developer prompts, and end-user prompts.
* **Metaprompting: LLMs Improving Themselves** (6:51) Discussion on metaprompting as a technique where one prompt dynamically generates better versions of itself, often used to improve classifiers or refine existing prompts with new examples.
* **Using Examples to Enhance LLM Reasoning** (7:58) Exploration of how feeding LLMs hard, expert-level examples (e.g., complex code bugs) significantly improves their ability to reason through and solve complicated tasks, acting like test-driven development.
* **Tricks for Longer Prompts: Escape Hatches and Debugging** (12:10) Strategies for managing extensive prompts, including giving LLMs an "escape hatch" to prevent hallucinations by reporting insufficient information, and using thinking traces or debug info for refinement.
* **Findings on Evals: The True Crown Jewel** (14:18) Emphasis on evaluations ("evals") as the most critical data asset for AI companies, providing the underlying rationale for prompt design and enabling continuous improvement.
* **Every Founder Has Become a Forward Deployed Engineer (FDE)** (17:25) An argument that modern AI founders must embody the "Forward Deployed Engineer" role, deeply understanding customer workflows and rapidly building tailored software solutions.
* **Vertical AI Agents Closing Big Deals with the FDE Model** (23:18) Examples of vertical AI agent companies leveraging the FDE model to win significant enterprise contracts by delivering highly impressive, customized demos and rapid iteration.
* **The Personalities of Different LLMs** (26:13) Observation that various LLM models exhibit distinct "personalities," with some being more human-steerable (e.g., Claude) and others requiring more explicit steering (e.g., Llama 4).
* **Lessons from Rubrics: LLM Interpretation** (27:26) Insights gained from giving LLMs rubrics, revealing that models like o3 are rigid, while Gemini 2.5 Pro demonstrates flexibility in applying rubrics and reasoning about exceptions.
* **Kaizen and the Art of Communication in Prompting** (29:47) Connecting prompt engineering to the Kaizen principle of continuous improvement, where the people doing the work (LLMs) are best at improving the process, and emphasizing effective communication with AI.
* **Outro** (31:00) Concluding remarks.

## Key quotes

* "Metaprompting is turning out to be a very, very powerful tool that everyone's using now. It kind of actually feels like coding in, you know, 1995, like the tools are not all the way there... But personally, it also kind of feels like learning how to manage a person where it's like how do I actually communicate the things that they need to know in order to make a good decision." (0:00)
* "One reason that Parahelp was willing to open source the prompt is they told me that they actually don't consider the prompts to be the crown jewels; like the evals are the crown jewels because without the evals, you don't know why the prompt was written the way that it was, and it's very hard to improve it." (14:55)
* "You actually have to give the LLMs a real escape hatch. You need to tell it if you do not have enough information to say yes or no or make a determination, don't just make it up. Stop and ask me." (10:48)
* "I think you've put this in a really interesting way before Gary where you're sort of saying that every founder's become a forward deployed engineer." (17:25)
* "o3 was very rigid actually, like it really sticks to the rubric, it heavily penalizes for anything that doesn't fit like the rubric that you've given it, whereas Gemini 2.5 Pro was actually quite good at being flexible in that it would apply the rubric but it could also sort of almost reason through why someone might be like an exception..." (28:13)

## Stories and anecdotes

* **Parahelp's Open-Source Prompt**: Parahelp, an AI customer support company powering services for Perplexity, Replit, and Bolt, graciously open-sourced their detailed, six-page prompt.
This was a rare insight, as such prompts are typically considered proprietary "crown jewels," but Parahelp viewed their "evals" as the true intellectual property, underscoring the importance of evaluation data over the prompt itself. (0:58)
* **Palantir's Forward Deployed Engineer Model**: Palantir pioneered the "Forward Deployed Engineer" (FDE) concept by sending engineers, not just salespeople, to work directly with clients like FBI agents. This approach allowed them to quickly build and demonstrate functional software based on immediate feedback, drastically shortening sales cycles from months or years to days and securing multi-million dollar contracts by showing tangible, customized solutions. (17:25)
* **Vertical AI Agents Closing Big Deals**: Companies like Giger ML (voice support) and Happy Robot (AI voice agents for logistics) exemplify the FDE model's success. Their founders, often technical engineers, directly engage enterprise clients, rapidly develop and demonstrate highly customized, impressive AI solutions, and close six and seven-figure deals by outperforming incumbents with superior technology and quick iteration.
(23:18)

## Mentioned Resources

* Parahelp: AI customer support company (0:58)
* Perplexity: AI company whose customer support is powered by Parahelp (1:20)
* Replit: AI company whose customer support is powered by Parahelp (1:20)
* Bolt: AI company whose customer support is powered by Parahelp (1:20)
* Y Combinator: Startup accelerator, channel hosting the video (6:51)
* Tropier: YC startup helping with in-depth understanding and debugging of multi-stage workflows (6:51)
* Ducky: YC company using Tropier's services (7:05)
* Jasberry: Company building automatic bug finding in code (7:58)
* Gemini Pro / Gemini 2.5 Pro: Google's large language model (12:56, 28:13)
* ChatGPT: OpenAI's large language model (13:58)
* Eric Bacon: YC's Head of Data (14:05)
* Palantir: Company that originated the "Forward Deployed Engineer" concept (17:48)
* Facebook (now Meta): Mentioned as a top software startup (18:22)
* Google: Mentioned as a top software startup (18:22)
* Peter Thiel: Co-founder of Palantir (18:22)
* Alex Karp: Co-founder of Palantir (18:22)
* Stephen Cohen: Co-founder of Palantir (18:22)
* Joe Lonsdale: Co-founder of Palantir (18:22)
* Nathan Gettings: Co-founder of Palantir (18:22)
* Palantir Foundry: Palantir's core data visualization and data mining suites (21:12)
* Salesforce: Incumbent CRM company (22:04, 25:35)
* Oracle: Incumbent database company (22:04)
* Booz Allen: Consulting company (22:04)
* Giger ML: YC company doing customer support and voice support (24:06)
* Zepto: Company that Giger ML closed a deal with (24:23)
* Happy Robot: Company building AI voice agents for logistics brokers (25:52)
* Claude: LLM known for being more human-steerable (26:36)
* Llama 4: LLM that needs more steering (26:45)
* Benchmark: Investor (29:00)
* Thrive: Investor (29:00)
* Kaizen: Japanese manufacturing technique for continuous improvement (30:05)
Topics Covered
- Meta-prompting enables LLMs to dynamically improve their own prompts.
- Give LLMs an 'escape hatch' to prevent confident hallucinations.
- Evals, not prompts, are AI companies' true crown jewels.
- Founders must be 'forward-deployed engineers' to win.
- Prompting feels like managing a person, not coding.
Full Transcript
Metaprompting is turning out to be a
very very powerful tool that everyone's
using now. It kind of actually feels
like coding in you know 1995 like the
tools are not all the way there. We're
you know in this new frontier. But
personally it also kind of feels like
learning how to manage a person where
it's like how do I actually communicate
uh you know the things that they need to
know in order to make a good decision.
[Music]
Welcome back to another episode of the
Light Cone. Today we're pulling back the
curtain on what is actually happening
inside the best AI startups when it
comes to prompt engineering. We surveyed
more than a dozen companies and got
their take right from the frontier of
building this stuff, the practical tips.
Jared, why don't we start with an
example from one of your best AI
startups? I managed to get an example
from a company called Parahelp. Parahelp
does AI customer support. There are a
bunch of companies who are doing
this, but Parahelp is doing it really
really well. They're actually powering
the customer support for Perplexity and
Replit and Bolt and a bunch of other
like top AI companies now. So, if
you go and you like email a customer
support ticket into Perplexity, what's
actually responding is like their AI
agent. The cool thing is that the
Parahelp guys very graciously agreed to
show us the actual prompt that is
powering this agent um and to put it on
screen on YouTube for the entire world
to see. Um it's like relatively hard to
get these prompts for vertical AI agents
because they're kind of like the crown
jewels of the IP of these companies and
so very grateful to the Parahelp guys
for agreeing to basically like open
source this prompt. Diana, can you walk
us through this very detailed prompt?
It's super interesting and it's very
rare to get a chance to see this in
action. So the interesting thing about
this prompt is actually first it's
really long. It's very detailed in this
document you can see is like six pages
long just scrolling through it. The big
thing that a lot of the best prompts
start with is this concept of uh setting
up the role of the LLM. You're a manager
of a customer service agent and it
breaks down into bullet points what it
needs to do. Then the big thing is
telling it the task, which is to approve
or reject a tool call because it's
orchestrating agent calls from all these
other ones. And then it gives it a bit
of the high-level plan. It breaks it down
step by step. You see steps one, two
three, four, five. And then it gives
some of the important things to keep in
mind that it should not kind of go weird
into calling different kinds of tools.
It tells them how to structure the
output because a lot of things with
agents is you need them to integrate
with other agents. So almost like gluing
the API call. So it is important to
specify that it's going to give certain
uh output of accepting or rejecting and
in this format. Then this is sort of the
high-level section, and one thing that the
best prompts do they break it down sort
of in this markdown type of style uh
formatting. So you have sort of the
heading here and then later on it goes
into more details on how to do the
planning, and you see this is a sub-bullet
part of it. As part of the plan there are
actually three big sections: how to plan,
how to create each of the steps in the
plan, and then a high-level example of
the plan. One big
thing about the best prompts is they
outline how to reason about the task and
then a big thing is giving it
an example and this is what it does. And
one thing that's interesting about this
it looks more like programming than
writing English because it has this uh
XML tag kind of format to specify sort
of the plan. We found that it makes it a
lot easier for LLMs to follow because a
lot of LLMs were post-trained with RLHF on
XML-type input, and it turns
out to produce better results. Yeah. One
thing I'm surprised that isn't in here
or maybe this is just the version that
they released. What I almost expect is
there to be a section where it describes
a particular scenario and uh actually
gives example output for that scenario.
That's in like the next stage of the
pipeline. Yeah. Oh, really? Okay. Yeah.
Because it's customer specific, right?
Because like every customer has their
own like flavor of how to respond to
these support tickets. And so their
challenge like a lot of these agent
companies is like how do you build a
general purpose product when every
customer like wants you know has like
slightly different workflows and like
preferences. It's a really interesting
thing that I see the vertical AI agent
companies talking about a lot which is
like how do you have enough flexibility
to make special purpose logic without
turning into a consulting company where
you're building like a new prompt
for every customer. I actually think
this like concept of like forking and
merging prompts across customers and
which part of the prompt is customer
specific versus like companywide is like
a like a really interesting thing that
the world is only just beginning to
explore. Yeah, that's a very good point
Jared. So this is the concept of defining
the prompt in the system prompt. Then
there's a developer prompt, and then
there's a user prompt. What this means
is the system prompt is basically
almost like defining sort of the
high-level API of how your company operates.
In this case the example of Parahelp is
very much a system prompt. There's
nothing specific about the customer. And
then as they add specific instances of
that API and calling it, then they stuff
all that into the developer
prompt, which is not shown here, and
that adds all the context of, let's say,
working with Perplexity; there are certain
ways you handle RAG questions there, as
opposed to working with Bolt, which is very
different, right? And then I don't think
Parahelp has a user prompt because their
product is not consumed directly by an
end user, but an end-user prompt could be
more like Replit or v0, right, where
users need to type something like, generate me a
site that has these buttons, this
and that. That all goes in the user
prompt. So that's sort of the
architecture that's sort of emerging.
And to your point about avoiding
becoming a consulting company, I think
um there's so many startup opportunities
and building the tooling around all of
this stuff like for example like um
anyone who's done prompt engineering
knows that the examples and worked
examples are really important to
improving the quality of the output. And
so then if you take Parahelp as an
example, they really want good worked
examples that are specific to each
company. And so you can imagine that as
they scale, you almost want that done
automatically. Like in your dream world
what you want is just like a an agent
itself that can pluck out the best
examples from like the customer data set
and then software that just like ingests
that straight into like wherever it
should belong in the pipeline without
you having to manually go out and plug
that all and ingest it in all of
yourself. That's probably a great segue
into metaprompting, which is one of the
things we want to talk about because
that's a consistent theme that
keeps coming up when we talk to our AI
startups. Yeah, Tropier is uh one of the
startups I'm working with in the current
YC batch and they've really helped
people like YC company Ducky do really
in-depth understanding and debugging of
the prompts and the return values from a
multi-stage workflow. And one of the
things they figured out is prompt
folding. So you know basically one
prompt can dynamically generate better
versions of itself. So a good example of
that is a classifier prompt that
generates a specialized prompt based on
the previous query. And so you can
actually go in take uh the existing
prompt that you have and actually feed
it more examples where maybe the prompt
failed or didn't quite do what you
wanted and you can actually instead of
you having to go and rewrite the prompt
you just put it into um you know the raw
LLM and say help me make this prompt
better. And because it knows itself so
well, strangely um metaprompting is
turning out to be a very very powerful
tool that everyone's using now. And the
next step after uh you do sort of prompt
folding if the task is very complex
there's this concept of uh using
examples, and this is what Jasberry does.
It's one of the companies I'm working with
this batch; they basically build
automatic bug finding in code which is a
lot harder and the way they do it is
they feed a bunch of really hard
examples that only expert programmers
could do let's say if you want to find
an N+1 query, it's actually hard
today for even the best LLMs to find
those and the way to do those is they
find parts of the code then they add
those into the prompt, a metaprompt
that's like, hey, this is an example of an
N+1 type of error, and then that works
it out and I think this pattern of
sometimes when it's too hard to even
kind of write prose around it, let's
just give it an example, turns out
to work really well because it helps LLMs
reason around complicated tasks and
steer it better because you can't quite
kind of put exact parameters and
it's almost like um unit testing
programming in a sense like test-driven
development is sort of the LLM version
of that. Yeah. Another thing that Tropier
sort of talks about is, you know, the
the model really wants to actually help
you so much that if you just tell it
give me back output in this particular
format even if it doesn't quite have the
information it needs it'll actually just
tell you what it thinks you want to hear
and it's literally a hallucination. So
one thing they discovered is that you
actually have to give the LLMs a real
escape hatch. You need to tell it if you
do not have enough information to say
yes or no or make a determination, don't
just make it up. Stop and ask me. And
that's a very different way to think
about it. That's actually something we
learned at some of the internal work
that we've done with agents at YC where
Jared came up with a really inventive
way to give the LLM an escape hatch. Did
you want to talk about that? Yeah. So
the Tropier approach is one way to give
the LLM an escape hatch. We came up with
a different way which is in the response
format to give it the ability to have
part of the response be essentially a
complaint to you the developer that like
you have given it confusing or
underspecified information and it
doesn't know what to do. And then the
nice thing about that is that we just
run your LLM in production with
real user data, and then you can go back
and you can look at the outputs that it
has given you in that like output
parameter. Um we we call it debug info
internally. So like we have this like
debug info parameter where it's
basically reporting to us things that we
need to fix about it and it literally
ends up being like a to-do list that you
the agent developer has to do. It's like
really kind of mind-blowing stuff. Yeah.
Yeah, I mean just even for hobbyists or
people who are interested in playing
around with this for personal projects.
Like a very simple way to get started
with meta prompting is to follow the
same structure of the prompt: give it
a role, and make the role be, like, you
know, you're an expert prompt engineer who
gives really detailed, great
critiques and advice on how to um
improve prompts and give it the prompt
that you had in mind, and it will spit
you back a much more expanded, better
prompt, and so you can just keep running
that loop for a while. Works
surprisingly well. I think it's a common
pattern sometimes for companies when
they need to get responses from
LLMs in their product a lot
quicker. They do the metaprompting with
a bigger, beefier model, any of the, I
don't know, hundreds of billions of
parameters plus models, like, I guess,
Claude 4 or 3.7, or o3, and they
do this metaprompting, and then they
have a very good working prompt that
they then use with the distilled model. So
they use it on, for example, 4o, and
it ends up working pretty well
specifically sometimes for uh voice AI
agents companies because uh latency is
very important to uh get this whole
Turing test to pass, because if you have
too much pause before the agent
responds I think humans can detect
something is off. So they use a faster
model but with a bigger better prompt
that was refined from the bigger models.
So that's like a common pattern as well.
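Both ideas, the metaprompting loop and the refine-with-a-big-model-then-distill pattern, can be sketched together. `call_big_model` is a stub standing in for a call to a large model (e.g. Claude or o3); the refinement it returns here is canned so the sketch stays self-contained.

```python
# Sketch of the metaprompting loop: ask a big model, acting as an expert
# prompt engineer, to critique and expand a prompt, then hand the refined
# prompt to a smaller/faster model (useful when latency matters, e.g. for
# voice agents).

METAPROMPT = (
    "You are an expert prompt engineer who gives detailed critiques and "
    "advice on how to improve prompts. Rewrite the prompt below to be "
    "clearer and more complete, and return only the improved prompt.\n\n"
)

def call_big_model(prompt: str) -> str:
    # Placeholder for a large-model API call; a real model would return
    # its own rewritten version of the prompt it was given.
    body = prompt.removeprefix(METAPROMPT)
    return body + "\n# (refined)"

def refine_prompt(prompt: str, rounds: int = 3) -> str:
    """Run the improvement loop a few times offline, then use the result
    verbatim as the production prompt for a distilled/faster model."""
    for _ in range(rounds):
        prompt = call_big_model(METAPROMPT + prompt)
    return prompt
```

The point of the loop is that the expensive model runs offline, once per refinement, while the fast model serves every production request with the already-improved prompt.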
Another again less sophisticated maybe
but um like as the prompt gets longer
and longer like it becomes a a large
working doc um one thing I found useful
is as you're using it if you just note
down in a Google doc things that you're
seeing, just, um, the outputs not being how
you want, or ways that you can think
of to improve it. You can just write
those in note form and then give Gemini
Pro like your notes plus the original
prompt and ask it to suggest a bunch of
edits to the prompt um to incorporate
these in well and it does that quite
well. The other trick is uh in uh Gemini
2.5 Pro if you look at the thinking
traces as it is parsing through an
evaluation, you could actually learn a
lot about all those misses as well.
We've done that internally as well, right?
Yeah, this is critical because if you're
just using Gemini via the API until
recently, you did not get the thinking
traces and like the thinking traces are
like the critical debug information to
like understand like what's wrong with
your prompt. They just added it to the
API. So you can now actually like pipe
that back into your developer tools and
workflows. Yeah, I think it's an
underrated um consequence of Gemini Pro
having such long context windows is you
can effectively use it like a REPL.
Go sort of like one by one like put your
prompt on like one example then
literally watch the reasoning trace in
real time to figure out like how you can
steer it in the direction you want.
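The REPL-style workflow can be sketched as a loop over one eval example at a time, keeping the reasoning trace next to each miss. `run_model` is a stub; with a real provider the trace would come from the API's thinking output, which, as noted, Gemini only recently exposed.

```python
# Sketch of REPL-style prompt debugging: run one eval example at a time,
# capture the model's reasoning trace alongside its answer, and collect
# the misses for later prompt edits.

def run_model(prompt: str, example: str) -> tuple[str, str]:
    # Placeholder returning (answer, reasoning_trace); a real call would
    # hit an LLM API with the prompt applied to this one example.
    answer = "reject" if "refund" in example else "accept"
    trace = f"Saw keywords in: {example!r}; applied the rubric literally."
    return answer, trace

def debug_one_by_one(prompt, examples, expected):
    misses = []
    for ex, want in zip(examples, expected):
        got, trace = run_model(prompt, ex)
        if got != want:
            # The trace is the critical debug information: it shows *why*
            # the prompt steered the model wrong, not just that it missed.
            misses.append({"example": ex, "want": want,
                           "got": got, "trace": trace})
    return misses
```

Running one example at a time, instead of a whole batch, is what makes this feel like a REPL: you read the reasoning trace, tweak the prompt, and rerun immediately.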
Jared and the software team at YC have
actually built, um, you know, various
forms of workbenches that allow us to
like do debug and things like that. But
to your point like sometimes it's better
just to use
gemini.google.com directly and then drag
and drop you know literally JSON files
and uh you know you don't have to do it
in some sort of special container like
it you know seems to be totally
something that works even directly in
you know ChatGPT itself. Yeah, this is
all stuff. Um, I would give a shout out
to YC's head of data, Eric Bacon, who's
um, helped us a lot with all of this
metaprompting and using Gemini 2.5 Pro
as effectively a REPL. What about
evals? I mean, we've uh, talked about
evals for going on a year now. Um, what
are some of the things that founders are
discovering? Even though we've been
saying this for a year or more now
Gary, I think it's still the case that
like evals are the true crown jewel like
data asset for all of these companies.
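Concretely, an eval can be as simple as codified scenarios with expected outcomes scored against the agent. The tractor-warranty case below is illustrative (it echoes an example later in this discussion), and `agent` is a stub in place of a real prompted LLM.

```python
# Sketch of a minimal eval suite: each case codifies a real workflow
# decision (learned by sitting with the actual user) into an input plus
# the outcome that user considers correct.

EVALS = [
    {"input": "Invoice: tractor sold 11 months ago, drivetrain failure.",
     "expected": "honor_warranty"},
    {"input": "Invoice: tractor sold 4 years ago, cosmetic scratch.",
     "expected": "deny_warranty"},
]

def agent(text: str) -> str:
    # Placeholder decision logic; a real agent would be an LLM call
    # driven by the prompt under test.
    return "honor_warranty" if "11 months" in text else "deny_warranty"

def run_evals(agent_fn, evals):
    """Score the agent against the eval set. Failures tell you *why* the
    prompt needs to change, which is why the evals, not the prompt text,
    are the durable asset."""
    results = [agent_fn(case["input"]) == case["expected"] for case in evals]
    return sum(results) / len(results)
```

Each time you edit the prompt, you rerun the suite; the eval set keeps growing as you observe more real customer decisions, while the prompt is just whatever currently passes it.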
Like one reason that Parahelp was
willing to open source the prompt is
they told me that they actually don't
consider the prompts to be the crown
jewels like the evals are the crown
jewels because without the evals you
don't know why the prompt was written
the way that it was. Um and it's very
hard to improve it. Yeah. And I think
in abstraction you can think about, you
know, YC funds a lot of companies,
especially in vertical AI and SaaS, and
then you can't get the eval unless you're
sitting literally side by side with
people who are doing X, Y, or Z knowledge
work. You know, you need to sit next to
the tractor sales regional manager and
understand, well, you know, this person
cares, you know, this is how they get
promoted. This is what they care about.
This is that person's reward function.
And then you know what you're doing is
taking these in-person interactions
sitting next to someone in Nebraska and
then going back to your computer and
codifying it into uh very specific evals
like this particular user wants this
outcome: you know, after this
invoice comes in, we have to decide
whether we're going to honor the
warranty on this tractor. Like, just to
take one example, that's the true
value, right? Like, everyone's really
worried about, um, are we just wrappers, and
you know what is going to happen to
startups and I think this is literally
where the rubber meets the road where um
if you you know if you are out there in
particular places understanding that
user better than anyone else and having
the software actually work for those
people, that's the moat. That is like
such a perfect depiction of like what is
the core competency required of founders
today? Like literally like the thing
that you just said like that's your job
as a founder of a company like this is
to be really good at that thing and like
maniacally obsessed with like the
details of the regional tractor sales
manager workflow. Yeah. And then the
wild thing is it's very hard to do. Like,
you know, have you even been to
Nebraska? You know, the classic view is
that uh the best founders in the world
they're you know sort of really great
cracked engineers and technologists and
uh just really brilliant and then at the
same time they have to understand some
part of the world that very few people
understand and then there's this little
sliver that is you know uh the founder
of a multi-billion dollar startup you
know I think of Ryan Peterson from
Flexport, you know, really really great
person who understands how software is
built, but then also I think he was the
third biggest uh importer of medical hot
tubs for an entire year like you know a
decade ago. So you know the weirder that
is the more of the world that you've
seen that nobody else who's a
technologist has seen uh the greater the
opportunity actually. I think you've put
this in a really interesting way before
Gary where you're sort of saying that
every founder's become a forward
deployed engineer. That's like a term
that traces back to Palantir, and since
you were early at Palantir, maybe tell
us a little bit about how did forward
deployed engineer become a thing at
Palantir, and what can founders
learn from it now? I mean, I think the
whole thesis of Palantir at some level
was that um if you look at Meta back
then it was called Facebook or Google or
any of the top software startups that
everyone sort of knew back then. One of
the key recognitions that Peter Thiel and
Alex Karp and Stephen Cohen and Joe
Lonsdale, Nathan Gettings, like the
original founders of Palantir had was
that uh go into anywhere in the Fortune
500, go into any government agency in
the world, including the United States
and nobody who understands computer
science and technology at
the highest possible level would
ever even be in that room. And so
Palantir's sort of really really big
idea that they discovered very early was
that uh the problems that those places
face they're actually multi-billion
dollar sometimes trillion dollar
problems and yet uh this was well before
AI became a thing you know I mean people
were sort of talking about machine
learning but you know back then they
called it data mining you know the world
is awash in data, these, you know, giant
databases of people and things and
transactions and we have no idea what to
do with it. That's what Palantir was
and still is. That, um, you can go and
find the world's best technologists who
know how to write software to actually
make sense of the world. You know, you
have these petabytes of data and you don't
know how you find the needle in the
haystack. Um, and you know the wild
thing is, going on something like 20,
22 years later, it's only become more
true that we have more and more data and
we have less and less of an
understanding of what's going on and uh
it's no mistake that actually now that
we have LLMs, it is
becoming much more tractable and then
the forward deployed engineer title was
specifically how do you sit next to
literally the FBI agent who's um
investigating domestic terrorism. How do
you sit right next to them in their
actual office and see what does the case
coming in look like? What are all the
steps? Uh when you actually need to go
to the federal prosecutor, what are the
things that they're sending? Is it I
mean what's funny is like literally it's
like word documents and Excel
spreadsheets, right? And um what you do
as a forward deployed engineer is take
these sort of you know file cabinet and
fax machine things that people have to
do and then convert it into really clean
software. So you know the classic view
is that it should be as easy to actually
do an investigation at a three-letter
agency as going and taking a photo of
your lunch on Instagram and posting it
to all your friends. Like that's you
know kind of the funniest part of it.
And so, I think it's no mistake today
that forward deployed engineers who came up
through that system at Palantir now
they're turning out to be some of the
best founders at YC actually. Yeah. I
mean, Palantir produced an
incredible number of startup founders,
cuz yeah, like the training to be a forward
deployed engineer, that's exactly the
right training to be a founder of these
companies. Now the other interesting
thing about Palantir is, like, other
companies would send like a salesperson
to go and sit with the FBI agent, and
like Palantir sent engineers to go and
do that. I think Palantir was probably
the first company to really like
institutionalize that and scale that as
a process, right? Yeah. I mean, I think
what happened there, the reason why they
were able to get these sort of seven and
eight and now nine figure contracts very
consistently is that uh instead of
sending someone who's like hair and
teeth and they're in there and you know
let's go to the, uh,
steakhouse. You know, it's all like
relationship. And you'd have one meeting,
uh they would really like the
salesperson and then through sheer force
of personality you'd try to get them to
give you a seven-figure contract and
like the time scales on this would be
you know 6 weeks 10 weeks 12 weeks like
5 years I don't know it's like and the
software would never work uh whereas if
you put an engineer in there and you
give them, uh, you know, Palantir Foundry,
which is what they now call sort of
their core uh data viz and data mining
suites instead of the next meeting being
reviewing 50 pages of you know sort of
sales documentation or a contract or a
spec or anything like that. It's
literally like, "Okay, we built it." And
then you're getting like real live
feedback within days. And I mean, that's
honestly the biggest opportunity for
startup founders. If startup founders
can do that, and that's what forward
deployed engineers are used to doing,
that's how you can beat a Salesforce or
an Oracle or a Booz Allen, or literally
any company out there that has a big
office and big fancy salespeople with
big strong handshakes. How does a really
good engineer with a weak handshake go
in there and beat them? You show them
something they've never seen before, and
you make them feel super heard. You have
to be super empathetic about it. You
actually have to be a great designer and
product person, and then you come back
and you can just blow them away. The
software is so powerful that the second
you see something that makes you feel
seen, you want to buy it on the spot. Is
a good way of thinking about it that
founders should think of themselves as
the forward deployed engineers of their
own company?
Absolutely. Yeah, you definitely can't
farm this out. Literally the founders
themselves, they're technical, they have
to be the great product people. They
have to be the ethnographer. They have
to be the designer. You want the person
in the second meeting to see the demo
you put together based on the stuff you
heard, and you want them to say, "Wow,
I've never seen anything like that. Take
my money." I think the incredible thing
about this model, and why we're seeing a
lot of the vertical AI agents take off,
is precisely this: they can have these
meetings with the end buyer and champion
at these big enterprises, take that
context, stuff it basically into the
prompt, and then quickly come back for a
meeting maybe just the next day. With
Palantir it would have taken a bit
longer and a team of engineers; here it
can be just the two founders who go in,
and then they close these six- and
seven-figure deals with large
enterprises, which we've seen, and which
has never been done before. It's only
possible with this new model of forward
deployed engineer plus AI, and it's just
accelerating. It just reminds me of a
company I mentioned before on the
podcast, GigaML, who also do customer
support, and especially a lot of voice
support. It's just a classic case of two
extremely talented software engineers,
not natural salespeople, who forced
themselves to be essentially forward
deployed engineers, and they closed a
huge deal with Zepto and a couple of
other companies they can't announce yet.
But do they physically go on site, like
the Palantir model? Yes, they did all of
that. Once they closed the deal, they
went on site and sat there with all the
customer support people, figuring out
how to keep tuning and getting the
software, the LLM, to work even better.
But before
that, even to win the deal, what they
found is that they can win just by
having the most impressive demo. In
their case, they've innovated a bit on
the RAG pipeline so that their voice
responses are both accurate and very low
latency, which is a technically
challenging thing to do. I just feel
like before the current LLM rise, you
couldn't necessarily differentiate
enough in the demo phase of sales to
beat out incumbents. You can't really
beat Salesforce by having a slightly
better CRM with a better UI. But now,
because the technology evolves so fast
and it's so hard to get that last 5 to
10% correct, if you're a forward
deployed engineer you can go in, do the
first meeting, tweak it so that it works
really well for that customer, go back
with the demo, and just get that "oh
wow, we've not seen anyone else pull
this off before" experience and close
huge deals. And
that was the exact same case with Happy
Robot, who have sold seven-figure
contracts to the top three largest
logistics brokers in the world. They
build AI voice agents for that. They're
the ones doing the forward deployed
engineer model, talking to the CIOs of
these companies and quickly shipping a
lot of product with very quick
turnaround. It's been incredible to see
that take off right now. It started with
six-figure deals and now they're closing
seven-figure deals, which is crazy, just
a couple of months later. So that's the
kind of stuff you can do with, I mean,
unbelievably smart prompt engineering.
Well, one of the
things that's kind of interesting about
each model is that they each seem to
have their own personality, and one of
the things founders are really realizing
is that you're going to go to different
models for different things. One thing
that's widely known is that Claude is
the happier, more human-steerable model,
while Llama 4 is one that needs a lot
more steering. It's almost like talking
to a developer, and part of that could
be an artifact of not having had as much
RLHF done on top of it. So it's a bit
rougher to work with, but you can
actually steer it very well if you're
good at doing a lot of prompting, almost
doing a bit of RLHF yourself, though
it's a bit harder to work with. Well,
one of the things
we've been using LLMs for internally is
actually helping founders figure out who
they should take money from. In that
case, sometimes you need a very
straightforward rubric, a zero to 100
scale: zero being never, ever take their
money, and 100 being take their money
right away, because they help you so
much that you'd be crazy not to. Harj,
we've been working on some scoring
rubrics around that using prompts. What
are some of the things we've learned?
So, it's certainly best practice to give
LLMs rubrics, especially if you want a
numerical score as the output. You want
to give it a rubric to help it
understand how to think through what's
an 80 versus a 90. But these rubrics are
never perfect.
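A rubric-scoring prompt like the one described might be sketched as follows. This is a minimal illustration, not YC's actual prompt: the rubric text, the JSON output shape, and the "exception" escape hatch are all assumptions added for the example.

```python
# Hypothetical sketch of a 0-100 scoring rubric prompt, as discussed
# above. The rubric, JSON shape, and "exception" escape hatch are
# illustrative assumptions, not the actual prompt from the episode.

def build_rubric_prompt(rubric: str, candidate: str) -> str:
    """Compose a scoring prompt that asks for a 0-100 score plus an
    explicit exception field, so the model can flag cases the rubric
    doesn't cleanly cover instead of forcing a misleading number."""
    return (
        "You are scoring an investor on a scale of 0 to 100.\n"
        "0 = never take their money; 100 = take it right away.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Investor notes:\n{candidate}\n\n"
        "Respond with JSON only:\n"
        '{"score": <0-100>, "reason": "<one line>", '
        '"exception": "<why the rubric may not apply, or null>"}'
    )

rubric = (
    "- Immaculate process, fast replies, never ghosts: 90-100\n"
    "- Strong track record but slow or overwhelmed: 60-89\n"
    "- Ghosts founders or has a predatory reputation: 0-30"
)
prompt = build_rubric_prompt(
    rubric, "Great track record; replies take weeks."
)
print(prompt)
```

The same prompt string can then be sent unchanged to two different models (say, o3 and Gemini 2.5 Pro) to compare how rigidly each one applies the rubric.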
There are almost always exceptions, and
you tried it with o3 versus Gemini 2.5.
What we found really interesting is that
you can give the same rubric to two
different models, and in our specific
case, o3 was very rigid: it really
sticks to the rubric and heavily
penalizes anything that doesn't fit the
rubric you've given it. Gemini 2.5 Pro,
on the other hand, was actually quite
good at being flexible, in that it would
apply the rubric but could also reason
through why someone might be an
exception, or why you might want to push
something up more positively or
negatively than the rubric might
suggest. I just thought that was really
interesting, because it's just like when
you're training a person: you give them
a rubric and you want them to use it as
a guide, but there are always edge cases
where you need to think a little more
deeply. And it was interesting that the
models themselves handle that
differently, which means they sort of
have different personalities, right? o3
felt a little more like the soldier:
okay, check, check, check, check. And
Gemini 2.5 Pro felt a little more like a
high-agency employee: "Okay, I think
this makes sense, but this might be an
exception in this case," which was just
really interesting to see. Yeah, it's
funny to see that for
investors. Sometimes you have investors
like a Benchmark or a Thrive, and it's
"Yeah, take their money right away."
Their process is immaculate, they never
ghost anyone, they answer their emails
faster than most founders; it's very
impressive. And then there are plenty of
investors who are just overwhelmed, and
maybe they're just not that good at
managing their time. So they might be
really great investors, and their track
record bears that out, but they're slow
to get back, they seem overwhelmed all
the time, and they accidentally,
probably not intentionally, ghost
people. And this is legitimately exactly
what an LLM is for. The debug info on
some of these is very interesting to
see: maybe it's a 91 instead of an 89.
We'll see. I guess one of the
things that's been really surprising to
me, as we ourselves are playing with it
and we spend maybe 80 to 90% of our time
with founders who are all the way out on
the edge, is that on the one hand, the
analogy we even use to discuss this is
that it's kind of like coding. It
actually feels like coding in 1995: the
tools are not all the way there, there's
a lot of stuff that's unspecified, and
we're in this new frontier. But
personally, it also feels like learning
how to manage a person: how do I
actually communicate the things they
need to know in order to make a good
decision? And how do I make sure they
know how I'm going to evaluate and score
them? And not only that, there's this
aspect of kaizen, the manufacturing
principle that created really, really
good cars for Japan in the '90s. That
principle says the people who are the
absolute best at improving the process
are the people actually doing it. That's
literally why Japanese cars got so good
in the '90s. And that's metaprompting to
me. So, I don't know. It's a brave new
world. We're in this new moment. With
that, we're out of time, but we can't
wait to see what kind of prompts you
guys come up with. And we'll see you
next time.
[Music]