RAG Evaluation (Answer Hallucinations) | LangSmith Evaluations - Part 13
By LangChain
Summary
Topics Covered
- Evaluate answers against retrieved documents for hallucinations.
- Use custom criteria for LLM-based evaluation.
- Define scoring scales for groundedness evaluation.
- Map prediction and reference to answer and context.
Full Transcript
Hi, this is Lance from LangChain. This is the 13th part in our LangSmith evaluation series, and we've been talking about RAG evaluation. In the last video we saw how to compare my LLM-generated answer to a reference answer, and we dove into that in some detail. Now let's talk about some of the other types of evaluation we can do, in particular hallucination evaluation. This is basically an evaluation of our answer relative to the retrieved documents. You'll recall from before that we built a RAG chain: we have a retriever, here's our chain, and the chain returns both the answer and the contexts. That's the setup we just talked through, and reference answer evaluation is what we covered last time.
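To make that setup concrete, here is a minimal sketch of a RAG function that returns both the answer and the retrieved context so they can be compared later. The retriever, prompt wording, and model name are placeholder assumptions, not the exact chain from the video.

```python
# Minimal sketch: a RAG function that returns both the generated answer and the
# retrieved context, so a downstream evaluator can grade one against the other.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder model choice
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def rag_bot(question: str) -> dict:
    # `retriever` is assumed to be any LangChain retriever built elsewhere
    # (e.g. a vector-store retriever); it is not defined in this sketch.
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm).invoke({"context": context, "question": question}).content
    # Return both fields so the run logs the answer *and* the retrieved contexts.
    return {"answer": answer, "contexts": context}
```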
Now let's talk about what I'm going to call type two: answer hallucination. Here we can reuse a lot of the same thinking and a lot of the same components. Before, we used a LangChain string evaluator, because fundamentally we're doing string comparisons, and I previously showed that we can use the Chain-of-Thought QA (CoT QA) evaluator for answer evaluations. In this case we're going to change things up a little, because we now want to compare our answer against a different reference: the retrieved documents. This is kind of an internal comparison; if something is present in the answer that's not in the documents, we want to know about it and penalize it. A common way that happens is hallucination. So all we're doing is this: our RAG chain returns both the answer and the context, we pass those into our evaluator, the answer becomes the prediction, and the context becomes the reference. We're simply comparing the answer against the retrieved documents; that's really all that's happening. Now, instead of using the CoT QA evaluator, I'm going to use the criteria evaluator. It's another option you can use, and you'll see it's kind of nice because it lets us supply custom criteria to the grading process. That's the information flow we just talked through.
Now here's really the crux of it. This will look very familiar to what we saw before: it's just another LangChain string evaluator, only with a slightly different name, labeled score string. The key point is that I can pass the evaluator a particular criterion to evaluate those strings on. It's still LLM-as-judge, just like before, but in this case there's a custom criteria field I can pass in, and that's where I add all the rules and logic I want the evaluator to follow. Here I'm going to ask: is the assistant's answer grounded in the ground-truth documentation? I tell it what a score of zero means, what a score of five means, and so on: five means the answer contains some information that's not in the documents (some hallucination), zero means it's all hallucination, and ten means it's perfectly grounded. The normalize option just lets me scale the scores to between zero and one as a convenience: the grader returns a score from 0 to 10, and I divide by 10 to produce a score from 0 to 1.
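As a rough standalone illustration of that grader, here is a sketch using LangChain's labeled_score_string evaluator with a custom groundedness criterion and normalization. The criterion wording is paraphrased from the video, and the judge model and example strings are assumptions.

```python
# Sketch: the labeled_score_string (LLM-as-judge) evaluator with a custom
# groundedness criterion; the grader rates 0-10 and normalize_by=10 maps it to 0-1.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

criteria = {
    "hallucination": (
        "Is the Assistant's answer grounded in the ground-truth documentation? "
        "A score of 0 means the answer is entirely hallucinated. "
        "A score of 5 means the answer contains some information that is not in the documents. "
        "A score of 10 means the answer is fully grounded in the documents."
    )
}

evaluator = load_evaluator(
    "labeled_score_string",
    criteria=criteria,
    normalize_by=10,  # divide the 0-10 rating by 10 to get a 0-1 score
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
)

result = evaluator.evaluate_strings(
    prediction="LangSmith lets you log traces and run evaluations.",        # the RAG answer
    reference="LangSmith supports tracing and dataset-based evaluation.",   # retrieved docs
    input="What does LangSmith do?",
)
print(result["score"], result["reasoning"])
```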
Now, as we saw before, this is where I hook up my run and dataset outputs to the inputs of the evaluator's prompt, and this is the key part, so let's look at it. In this particular case my prediction is going to be the answer, just like before; the reference is now going to be my run's context, i.e., the retrieved documents; and the input is just the example input, which isn't very important for this eval, but we'll keep it in there. The key point is this: my prediction is my answer, and my reference is the retrieved documents. That's all that's happening.
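Put together, the wiring looks roughly like this with the langsmith SDK's LangChainStringEvaluator: the config carries the criterion and normalization, and prepare_data maps the run outputs and dataset example onto prediction, reference, and input. The key names ("answer", "contexts", "question") are assumptions about how the chain and dataset are structured.

```python
# Sketch: map the run's answer/contexts and the example's question onto the
# evaluator's prediction/reference/input fields.
from langsmith.evaluation import LangChainStringEvaluator

answer_hallucination_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        # Same groundedness criterion as above, abbreviated here.
        "criteria": {
            "hallucination": (
                "Is the Assistant's answer grounded in the ground-truth documentation? "
                "0 = entirely hallucinated, 10 = fully grounded in the documents."
            )
        },
        "normalize_by": 10,
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],    # the generated answer
        "reference": run.outputs["contexts"],   # the retrieved documents
        "input": example.inputs["question"],    # the dataset question (not central here)
    },
)
```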
Now that I've defined that, I can kick it off. I'll add an experiment prefix here to note that it's hallucination grading, and that's really it.
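Kicking that off might look like the following; the dataset name is hypothetical, and rag_bot and answer_hallucination_evaluator come from the sketches above.

```python
# Sketch: run the experiment over a LangSmith dataset with an experiment prefix
# so the results are easy to find in the UI.
from langsmith.evaluation import evaluate

results = evaluate(
    lambda inputs: rag_bot(inputs["question"]),     # run the RAG chain per example
    data="RAG_eval_dataset",                        # hypothetical dataset name
    evaluators=[answer_hallucination_evaluator],    # the groundedness grader above
    experiment_prefix="rag-answer-hallucination",   # label this experiment
)
```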
So I've just kicked off the evaluation, and now that it has run I can go over to my dataset and look at the results. Here again is the hallucination prefix we added, and I can open it up to see the scoring of all my runs. Again, this is looking at the answer relative to the retrieved documents, so it's effectively a hallucination score. The results are mixed: in one case it gives a score of 0.5, in another a perfect score. I can look at the bad case, for example, and open up that run. Scrolling down, we can see the prompt, which contains all the retrieved documents and the answer, followed by the LLM's reasoning: the assistant provided a detailed, comprehensive response, but the grader found some things it didn't like and gave it a two.
everything up you can see all we did was we basically hooked up our chain uh answer and context into our evaluator reference and prediction and then we just let our evaluator run we give it
instructions here in this criteria field and it's all logged to Langs Smith we can see the scores here um so anyway it's pretty nice and uh definitely encourage you to look
into using um this criteria evaluation with labeled score string