Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)
By AI Engineer
Summary
Topics Covered
- Start with Quality Engineering Loop
- BM25 Beats Vectors for Keywords
- Relevance Proxy Fails Complex Ranking
- Fan Out Queries for Agents
- Gracefully Degrade Stochastic Systems
Full Transcript
I'll just give you all a little bit of context. My co-founder and I, and a lot of our team, were actually working on Google Search, and then we left and started Pi Labs. I loved the Exa talk; we're all nerds for information retrieval and search, and this is going to be a little bit of that. I'm going to go through a whole bunch of ways you can actually shore up and improve your RAG systems.

One thing I personally struggle with sometimes is that a lot of the talk out there is either too deep in the weeds, all specific techniques, you can do RL this way, you can tune the model that way, which doesn't help me orient in the space: what are all these things, and where do I hang them? Or you get the complete opposite, a whole bunch of buzzwords and hype: RAG is dead, no, RAG is not dead, it's agents, wait, what? So a lot of what I'll do today is what I call plain English: setting up a framework centered around, if you're trying to shore up the quality of your system, how do you do that, where do all the things you hear about day in, day out fit, and how do you approach it, with a lot of examples. One thing I always love, and we always did at Google and always do at Pi Labs, is just look at things: look at cases, look at queries, see what's working and what's not. That's really the essence of what we used to call quality engineering at Google.

If you want the slides, there are about 50 of them, and I set a challenge for myself to get through 50 slides in 19 minutes. You can catch the slides here if you want; I'll flash this toward the end as well, at pi.ai-talk, which should point to the slides we're going through. As I mentioned: plain English, no hype, no buzz, no debates.
So, how to think about techniques. Before we get into the weeds, why does this even matter? The way we always think about it: always start with outcomes. You're always trying to solve some product problem. Generally, the best way to visualize it is this: you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful but evals absolutely are. Say you're trying to launch a CRM agent: you have a launch bar, a place where you feel comfortable actually putting it out into the world. Techniques fit somewhere in between. You have that end metric, you're trying to come up with different ways to shore up the quality, and those ways are the techniques. This is your own personal benchmark: you start with some of the easy bars you want to hit, and then there are medium benchmarks and hard benchmarks. These are query sets you're setting up. Depending on what you want to reach, and on what time frame, you end up trying different things.

This is what we call the quality engineering loop. You baseline yourself: okay, you want a CRM agent, this is the easy query set, and here is your quality using the simplest approach you can try. Then do a loss analysis:
okay, what's broken? There were a lot of eval talks this week about exactly that. And then comes what we call quality engineering. The reason I stress this is that techniques fit in this last bucket, and one of the biggest problems is that people sometimes start there, and it doesn't make any sense. You say: do I need BM25, or do I need vector retrieval? I don't know: what are you trying to do, what do your queries look like, and where are things failing? Many times you don't actually need these things, and you end up implementing them anyway, which doesn't make a lot of sense. So usually the thing I say is what I call complexity-adjusted impact, or, you know, stay lazy: always look at what's broken; if it's not broken, don't fix it, and if it is broken, do fix it.

We'll go through a lot of techniques today, and this is a good way to think about them. It's just a catalog of stuff. The two most important columns are the ones on the right, difficulty and impact, and if something is easy, go ahead and try it. Most times BM25 is pretty easy; you should absolutely try it, and it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? I don't know, let's take a look. That is actually really, really hard. Harvey gave a talk: they build custom embeddings, but they have a really hard problem space where relevance embeddings just don't do enough for them, and they're willing to put in all that work and effort.
All right, queries and examples. First technique: in-memory retrieval. The easiest thing: take all your documents and shove them all into the LLM. This is the whole "is RAG dead, is RAG not dead, context windows" debate. Well, context windows are pretty easy, so you should definitely start there. One example is NotebookLM, a very nice product: you put in five documents and just ask questions about them. You don't need any RAG; just shove the whole thing in. And here is where it breaks: maybe things don't fit in memory, or maybe you stuff the context window so full that the documents aren't attended to properly by the LLM. So this is where you start to diagnose: okay, that's what's happening, I have too many documents; or that's what's happening, the documents aren't being attended to properly; here are the five things that are breaking. Okay, great, let's move to the next one.
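In code, this first step is roughly the following (a minimal sketch; `llm` is a hypothetical stand-in for whatever completion client you already use, not NotebookLM's actual implementation):

```python
from typing import Callable

# In-memory retrieval: no retrieval step at all, just stuff every
# document into one prompt and let the model attend to all of it.
def answer_by_stuffing(question: str, docs: list[str],
                       llm: Callable[[str], str]) -> str:
    context = "\n\n---\n\n".join(docs)  # concatenate the whole corpus
    prompt = (f"Answer using only the documents below.\n\n"
              f"Documents:\n{context}\n\nQuestion: {question}")
    return llm(prompt)  # breaks once the corpus outgrows the context window
```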
So now you try something very simple: can I retrieve just based on terms? That's BM25. What is BM25? It's basically four things: the query terms, the frequency of those query terms in a document, the length of the document, and how rare a given term is across the corpus. It's a very nice thing: it actually works pretty well, and it's very easy to try.
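Those four ingredients look like this in a from-scratch sketch (the Okapi BM25 formula; in practice you would reach for an off-the-shelf implementation such as the rank_bm25 package):

```python
import math
from collections import Counter

# BM25 (Okapi variant): query terms, their frequency in a document,
# document length normalization, and term rarity (IDF).
def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # docs containing each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)  # rarity
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [["iphone", "battery", "life", "tips"], ["android", "camera", "review"]]
print(bm25_scores(["iphone", "battery"], docs))  # first doc scores highest
```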
But it has a problem: when queries don't have the nature of a keyword-based search, as the Exa talk was saying, it doesn't work. This is where you bring in something like relevance embeddings. Relevance embeddings are pretty interesting because now you're in vector space, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for exact keyword matching, and it's actually pretty easy to know when each one works. I went to ChatGPT and asked, "Hey, give me a bunch of queries: ones that work for standard term matching and ones that work for relevance embeddings." And you can see exactly what's going on. If your query stream looks like "iPhone battery life," you don't need vector search. But if your queries look like "How long does an iPhone last before I need to charge it again?", then you absolutely need something like vector search. This is where you need to be tuned to what every technique gives you before you go and invest in it. When you do your loss analysis and see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area.
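A minimal relevance-embedding sketch, assuming the sentence-transformers package (the model name below is one common public default, not a recommendation from the talk):

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder retrieval: embed documents once, embed the query, rank by
# cosine similarity in vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "iPhone battery life and charging tips",
    "How to set up parental controls on Android",
]
doc_vecs = model.encode(docs, convert_to_tensor=True)

# A natural-language query with little keyword overlap with the relevant doc.
query = "how long does an iPhone last before I need to charge it again"
query_vec = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, doc_vecs)[0]  # cosine similarity per doc
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```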
All right: now you've done BM25, and you've done vector search because your query sets look exactly like that, and now you have a combined candidate set. This is where rerankers help quite a bit. When people say rerankers, they're usually referring to cross-encoders, which is a specific architecture. If you remember, the architecture for relevance embeddings was: you get a vector for the query, you get a vector for the document, and you just measure the distance. Cross-encoders are more sophisticated: they take both the query and the document together and give you a score while attending to both at the same time, and that's why they're much more powerful. But they're also pretty expensive, and that is a failure state as well: you can't run them over all your documents. So you end up with this two-stage setup where you retrieve a lot of candidates cheaply and then rank a smaller set with a technique like that.
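The two-stage setup looks roughly like this (again assuming sentence-transformers; the checkpoint name is a commonly used public model, and the candidate list stands in for the top results from BM25 plus vector search):

```python
from sentence_transformers import CrossEncoder

# Cross-encoder reranking: score (query, document) pairs jointly, attending
# to both at once. Too expensive for the whole corpus, so it only sees the
# candidates that cheap retrieval already produced.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how long does an iPhone last before I need to charge it again"
candidates = [  # e.g. the top-100 from BM25 + vector search, trimmed here
    "iPhone battery life and charging tips",
    "Best iPhone cases of 2024",
    "How to extend smartphone battery health",
]
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```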
It is really powerful and you should use it, but it fails in certain cases, and when you hit those cases you move to the next thing. So where does it fail? It's still relevance, and there's a big problem with standard embeddings and standard rerankers: they only measure semantic similarity. These are all proxy metrics in the end. Your application is your application, your set of information needs is your set of information needs, and you try to proxy that with relevance, but relevance is not ranking. This is something we learned in Google Search over the last 15 or 20 years: what brings the magic of Google Search is that it looks at a lot of things other than relevance. The talk from Harvey and LanceDB was really interesting here; he gave the example of a query that carries so much semantics specific to the legal domain that it's impossible to capture it with relevance alone. What does a word like "regime" mean? What does "material" mean? They have very specific meanings as legal terms. And there are things very specific to the domain that need to be retrieved, like laws and regulations.

This is where you get to building things like custom embeddings. You say: fetching on relevance alone is not enough for me, so I need to model my own domain in its own vector space, and then I can actually fetch these things. Again, go back to ChatGPT: is this even worth doing? I asked it for a list of things that would fail in a standard relevance search in the legal domain, and you start to see it: words like "moot" don't mean the same thing, words like "material" don't mean the same thing. When your vocabulary is that specific and that far off, you will not get good results. So how do you catch that? Again: you need evals, you need query sets. You need to look at the cases that are breaking and conclude that they're breaking because the vocabulary is out of distribution for a standard relevance model. That's how you decide. Don't agonize over "should I do it, should I not do it": what are your queries telling you, what is your data telling you? Then go try it, or don't.
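One way this is commonly done is fine-tuning a general embedding model on in-domain pairs. A rough sketch, assuming the sentence-transformers training API and (query, relevant passage) pairs you have collected; the pairs below are toy stand-ins, not real legal training data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Adapt a general embedding model to a vertical vocabulary by training on
# in-domain query/passage pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_pairs = [
    InputExample(texts=["is the claim moot",
                        "the appeal was dismissed as moot after settlement"]),
    InputExample(texts=["material breach of contract",
                        "a breach so substantial it defeats the contract's purpose"]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=2)

# In-batch negatives: each query is pulled toward its own passage and pushed
# away from the other passages in the batch.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("legal-domain-embeddings")
```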
There's also an example from shopping. Embeddings are very interesting because they help you a lot with retrieval and recall, but you still need good ranking; and if relevance alone doesn't work for retrieval, it probably doesn't work for ranking either. This is an example I pulled from Perplexity. I was trying to break it today, and it didn't take much. I asked for cheap gifts for my son, and then followed up with "but I have a budget of 50 bucks or more," because when I said cheap, it started giving me things around $10. Well, cheap for me is more like $50; it didn't know that, which is fine, so I told it. But when I said $50 or more, it still gave me $15 and $40 items, both of which are actually below $50. This is interesting because, in standard information-retrieval terms, this is a signal, a price signal, and it's not being caught: it's not being translated into the query, and it's definitely not being translated into the ranking.

So now you have to think: I have ranking, and I need the ranking to see the semantics of my corpus and my queries. And when you think of your corpus and your queries, again, it's not just relevance. Relevance helps you with natural language, but there are things like price signals and merchant signals; if you're doing podcasts, how many times an episode has been listened to is a very important signal that has nothing to do with relevance. In many, many applications you'll see that, for example, more popular things tend to rank more highly. And as a talk mentioned, there's the PageRank algorithm. PageRank is not about relevance; it's about prominence: how many things outside my document point to it. That has nothing to do with relevance and everything to do with the structure of the web corpus. It's a signal about the shape of the data, not a signal about relevance.
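Catching a signal like that usually means pulling it out of the query explicitly rather than hoping a relevance model notices it. A hypothetical sketch (the names and regex are made up for illustration):

```python
import re
from dataclasses import dataclass

# Translate a price signal from query text into a hard constraint applied
# before ranking, instead of relying on semantic similarity to enforce it.
@dataclass
class Product:
    title: str
    price: float

def parse_min_budget(query: str) -> float | None:
    m = re.search(r"\$?(\d+)\s*(?:bucks|dollars)?\s*or more", query.lower())
    return float(m.group(1)) if m else None

def filter_by_budget(query: str, candidates: list[Product]) -> list[Product]:
    min_price = parse_min_budget(query)
    if min_price is None:
        return candidates            # no price constraint detected
    return [p for p in candidates if p.price >= min_price]

gifts = [Product("toy robot", 15.0), Product("lego set", 40.0),
         Product("telescope", 89.0)]
print(filter_by_budget("gifts for my son, $50 or more", gifts))
# -> only the telescope survives the price signal
```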
The best way to think about it: you have horizontal semantics and you have vertical semantics. If you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing email, and you're trying to hit a very complex bar that goes way beyond natural language, understand that relevance will be a very tiny part of that semantic universe, and the harder you push, the more you're going to hit this wall. All right, this breaks again. Things keep breaking, I'm sorry; at sufficient complexity, things will keep breaking.

The thing that breaks even with custom semantics is user preference. Even when you get all of this right, you say: I'm doing relevance, I'm doing price signals and merchant signals, I'm doing everything, I now know the shopping domain. Well, no, you don't, because now users are using your product. They're clicking on things you thought they wouldn't click on, and they're not clicking on things you thought they would. This is where you need to bring in the click signal and the thumbs-up, thumbs-down signal. These things get very complex, so we're not going to talk about how to implement them; in this case, for example, you'd have to build a click-through prediction signal and then combine it with all your other signals. So if you look at your ranking function, it's saying: I want it to be relevant, I want this semi-structured price signal and the query understanding that goes with it, plus I want the user preference, and then you take all these signals and add them up, and that becomes your ranking score. It becomes a very balanced function. And this is how you go from "oh, it's just relevance" to "no, it's not just relevance" to "it's my semantics and my user preferences all rolled up into one."
I'll mention two more things. First: calling the wrong queries. This happens a lot because it gets into more orchestration, and you're trying to do complex things, especially now that you have agents and you're telling them to use a certain tool. There's an impedance mismatch between what the search engine expects, say you've tuned it to expect keyword queries, or even more complex queries, and what the LLM produces, because you cannot describe all of that to the LLM; it's reasoning about your application and making queries by itself. This is a big problem. So one thing we've seen many companies do, and we did this at Google too, is take more control of the actual orchestration: you take the big query and make n smaller queries out of it. I took a screenshot from AI Mode in Google, and it's very brief, you have to catch it before the animation goes away, but you can see it's actually making 15 queries, 20 queries. This is what we call fan-out: take a very complex thing, figure out all the sub-queries inside it, and fan them out.

Now you might think: hey, why isn't the LLM doing this? The LLM is kind of doing it, but it doesn't know about your tool; it doesn't know enough about your search engine. I love MCP, but I'm not a big believer that you can teach the LLM, just through prompting, what to expect from the search backend on the other end. This is why people still ask: should the agent be autonomous, or do I need workflows? It's very complicated, and it will take a while to be solved, because it's unclear where the boundary is. Should the search engine handle more complex things so the LLM can throw anything its way, or the other way around, where the LLM has more information about what the search engine can support so it can tailor its queries? Right now you need control, because the quality is still not there.

So it looks like this: you have an assistant input and you turn it into narrow queries. For example, "Was David working on this?" has very specific semantics, and it becomes narrower queries, something like "Slack threads David." It's very hard to know, without knowing enough about your application, that those are the queries that matter and not the one on the left-hand side. If you send the thing on the left-hand side to a search engine, it will absolutely tip over unless it understands your domain. And this is where you need to calibrate the boundary.
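A minimal fan-out sketch; `llm` and `search` are hypothetical stand-ins for your completion client and search backend, and the prompt is only illustrative of the decomposition step:

```python
import json
from typing import Callable

# Query fan-out: have the LLM decompose one complex request into several
# narrow, backend-friendly queries, then issue them all and merge results.
FANOUT_PROMPT = """Decompose the user request into short keyword queries for
an internal search engine (people, Slack threads, documents).
Return a JSON list of strings only.

Request: {request}"""

def fan_out(request: str, llm: Callable[[str], str],
            search: Callable[[str], list[str]]) -> list[str]:
    raw = llm(FANOUT_PROMPT.format(request=request))
    subqueries = json.loads(raw)            # e.g. ["Slack threads David", ...]
    results = []
    for q in subqueries:                    # issue each narrow query
        results.extend(search(q))
    return results                          # merged candidates to rerank
```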
Okay. So now you're asking all the right queries; are you asking them of all the right backends? This is another place where it all fails, and here's one technique we call supplementary retrieval. It's something you notice clients do quite a bit: they don't call search enough, and sometimes people try to over-optimize. When you're trying to get high recall, you should always be searching more. Just search more; this is similar to what we said about in-memory retrieval, just give it more things. It never fails to give more things. In the talk description we mentioned a query that was really hard to handle, and you'd think: we're in Google Search, it's a very simple Middle Eastern dish, and it stumped an organization of 6,000 people. What's so hard about this query? What's hard is the ambiguous intent: you need to reach out to a lot of backends to actually understand enough about it. You might be asking about food, at which point I want to show you restaurants; you might be asking for pictures, at which point I want to show you images. What Google ended up doing is query all the backends and put the whole thing in, and I'd recommend this as a great technique to increase recall even more: just call more things, and don't try to be skimpy unless you're running into real cost overload.
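A sketch of querying every backend in parallel and merging; the backend names and search callables are hypothetical placeholders for your own services, and the dish query is only an example:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Supplementary retrieval: when intent is ambiguous, don't pick one backend,
# query them all and let a later stage decide what to surface.
def supplementary_retrieval(
    query: str,
    backends: dict[str, Callable[[str], list[str]]],
) -> dict[str, list[str]]:
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
        return {name: f.result() for name, f in futures.items()}

results = supplementary_retrieval("shakshuka", {
    "web":    lambda q: [f"web result for {q}"],
    "images": lambda q: [f"image result for {q}"],
    "local":  lambda q: [f"restaurant serving {q}"],
})
print(results)
```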
And that's the last one: you're running into cost overloads. GPUs are melting. I tried to generate an image for this, but then I realized there's actually a pretty good image that is real: somebody took a server rack and threw it off a roof. I didn't need ChatGPT to generate it; apparently it was an advertisement, a pretty expensive one. All right. This happens a lot when you get to a certain scale: you have all these backends, you're making all these queries, and it's getting very, very complex. Google's there, Perplexity's there; Sam Altman keeps complaining about GPUs melting. This is the part where you need to start doing distillation, and distillation is interesting because to do it you have to learn how to fine-tune models, and it gets a little complex: you have to hold the quality bar constant while you decrease the size of the model. The reason you can do that is the joke in that graph: "hey, hire me, I know everything"; "actually, I'm firing you, you're overqualified." A very large language model is mostly overqualified for the task you want to do, because what you really want is just one thing. Take Perplexity: they're doing question answering, and they're really fast; when you use Perplexity in certain contexts it's really, really fast, which is amazing, because they trained one model to do one very specific thing: be really good at question answering. This is very hard, so I wouldn't do it unless latency becomes a really important thing for your users. Like: the thing takes 10 seconds, users churn; if I can make it two seconds, users don't churn. Actually, that's a great place to be, because then you can use this technique and bring everything down.
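The shape of that distillation loop, as a rough sketch; `big_llm`, `fine_tune`, and `evaluate` are hypothetical stand-ins for your teacher model, training job, and eval harness:

```python
from typing import Callable

# Distillation: train a small, fast student on the big model's outputs over
# your real query distribution, and ship it only if it still clears the bar.
def distill(queries: list[str],
            big_llm: Callable[[str], str],
            fine_tune: Callable[[list[tuple[str, str]]], Callable[[str], str]],
            evaluate: Callable[[Callable[[str], str]], float],
            quality_bar: float) -> Callable[[str], str]:
    # 1. Teacher labels: run the expensive model over representative queries.
    train_data = [(q, big_llm(q)) for q in queries]
    # 2. Student: fine-tune a small model on those (query, answer) pairs.
    small_llm = fine_tune(train_data)
    # 3. Hold the quality bar constant while shrinking the model.
    score = evaluate(small_llm)
    if score < quality_bar:
        raise RuntimeError(f"student at {score:.2f}, below bar {quality_bar}")
    return small_llm
```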
All right. You've done everything you can, and things are still failing. This is all of us. Okay, what do you do? We have a bunch of engineers here; what do you do when everything fails? Yes: you blame the product manager. It's the last trick in the book: when everything fails, make sure it's not your fault. But there's something really important here. Quality engineering will never get you to 100%. Things will always fail; these are stochastic systems. So then you have to punt the problem upwards. It's kind of a joke, but it's not a joke: the design of the product matters a lot to how magical it can seem, because if you try to be more magical than your product surface can absorb, you'll run into a bunch of problems.

I'll use a very simple example; a more complex one would be human-in-the-loop customer support, where the bot can handle some cases on its own but then needs to punt to a human. This is basically UX design: when do you trust the machine to do what the machine needs to do, and when does a human need to be in the loop? The simpler example is from Google Shopping. There are cases where Google has a lot of great data, so what we call the fidelity of the understanding is really high, and then it shows what we call a high-promise UI: I'll show you things, you can click on them, there are reviews, there are filters, because I understand this really well. And there are things Google doesn't understand at all, mostly web documents, bags of words. What's really interesting is that the UI changes: if you understand more, you show a more filterable, high-promise UI; if you don't understand enough, you degrade the experience, but you degrade it to something that is still workable. I'll show you ten things, you choose. Or: oh no, I know exactly what you want, I'll show you one thing. This is really, really important, and it has to be part of every design: there's only so much engineering you can do until you have to actually change your product to accommodate this stochastic nature. So: gracefully degrade, gracefully upgrade, depending on the level of your understanding.
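One way to express that adaptive surface, as a minimal sketch; the thresholds and surface names are hypothetical, the point is only that the UI's promise tracks the system's fidelity:

```python
from enum import Enum

# Graceful degradation: pick the UI "promise level" from the system's
# confidence in its own understanding of the query.
class Surface(Enum):
    ONE_ANSWER = "show the single best result"          # high promise
    RICH_LIST = "show filterable results with facets"   # medium promise
    PLAIN_LIST = "show ten links, let the user choose"  # graceful fallback

def choose_surface(understanding: float) -> Surface:
    if understanding > 0.9:
        return Surface.ONE_ANSWER
    if understanding > 0.6:
        return Surface.RICH_LIST
    return Surface.PLAIN_LIST

print(choose_surface(0.95))  # high fidelity: act magical
print(choose_surface(0.4))   # low fidelity: degrade, but stay workable
```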
Again, I'll flash these two slides at the end: always remember what you're doing, because you can absolutely get pulled into theoretical debates, context windows versus RAG, this versus that, agents versus whatever. Everything is empirical in this domain. When you're doing this sort of thing, it's: I have my evals, I'm trying to step up level by level, and I have a toolbox at my disposal. Everything is empirical. So again: baseline, analyze your losses, then look at your toolbox and ask, are there easy things here I can do? If not, are there at least medium things? If not, should I hire more people and do some really, really hard things? But always remember the choice is on you, and you should be principled, because this can be an absolute waste of time if you do it too far ahead of the curve.

All right, the slides are here, I think. Oh, I achieved it: 30 seconds left. If you want the slides, they're here again, and reach out to us; we're always happy to talk. I was very happy with the Exa talk, because it's always nice to find friends who are nerds about information retrieval. We are too, so reach out, and we're happy to talk about RAG challenges and some of the models we're building. All right, thank you so much.