
Complete DSPy Tutorial - Master LLM Prompt Programming in 8 amazing examples!

By Neural Breakdown with AVB

Summary

## Key takeaways

- **Prompt Programming Beats Engineering**: Prompt engineering writes complex textual prompts that become hacky and brittle as LLMs change, while DSPy does prompt programming by defining data flow through modules in a text transformation graph, decoupling the program from the underlying model. [00:46], [02:12]
- **Signatures Specify What, Not How**: Signatures define input/output behavior with field descriptions and docstrings, telling the LM what to do rather than specifying the exact prompt, as seen in the QA example where descriptions shape the generated prompt. [05:01], [05:49]
- **Chain-of-Thought Fixes Complex QA**: Basic QA fails on "Who provided the assist for the 2014 World Cup final goal?", predicting Mario Götze instead of André Schürrle, but ChainOfThought adds step-by-step reasoning to extract the correct answer. [06:30], [07:38]
- **BootstrapFewShot Boosts Accuracy by 21 Points**: Zero-shot ChainOfThought scores 37.1% on referee-decision classification, but DSPy's BootstrapFewShot optimizer selects effective few-shot examples from the training set, raising accuracy to 58.1%. [21:24], [23:17]
- **Multihop Loops Retrieve Key Facts**: A single RAG query fails to fetch the André Schürrle passage for his birthdate, but multihop generates iterative queries like "date of birth of André Schürrle" over 3 hops to retrieve it and answer correctly. [16:34], [19:29]
- **Finetune T5 to 70% with a GPT Teacher**: GPT-3.5 RAG scores 83% on the trivia quiz and T5-small starts at 0%, but DSPy's BootstrapFinetune transfers knowledge in about 6 minutes, reaching 70% with T5-base (220M parameters vs. GPT's 175B). [29:06], [33:11]

Topics Covered

  • Prompt Programming Trumps Engineering
  • Signatures Tell What, Modules Tell How
  • Bootstrap Few-Shot Boosts Accuracy by 21 Points
  • Distill GPT to T5 at 1/1000th Size

Full Transcript

In this video we're going to learn about a new LLM framework called DSPy, which has been making quite a bit of buzz lately. DSPy is a Python library that allows you to do prompt programming with LLMs. I'll be showing you eight examples, starting from DSPy's most basic building blocks, moving to some more advanced and interesting examples, and ending with a rather complex LLM system that just uses those simple building blocks covered in this video. Each of these examples introduces something special about DSPy, and I made them as simple to understand and as impactful as possible, so just follow along, and I hope you'll be ready to build your own LLM apps with DSPy by the end of this video. Welcome to Neural Breakdown. Let's go try DSPy.

So, it's important to understand why we should even learn about DSPy and prompt programming, and notice I said prompt programming and not the more common term, prompt engineering.

In prompt engineering, we write complex textual prompts to LLMs to make them complete a given task. Quite often we assign a role to the LLM, describe the task, maybe add some representative examples, and then tell it what output format we desire. In most LLM applications we also do prompt chaining, where multiple prompts are used sequentially to achieve a more complex multi-step task, and depending on the complexity of these chains, the task of writing the prompts themselves gets more and more hacky and cumbersome. The prompts can also return different outputs when the underlying LLMs change, meaning overtuned prompts may become obsolete with the latest LLM release, introducing maintenance overhead, inconsistent performance, and some technical debt into your application. Many experts claim that as the underlying LLMs get more powerful, the need for excessive prompt engineering will slowly decrease. As developers, we shouldn't need to bother with writing these ultra-specific prompts or finding the most optimal few-shot examples; we want general skill sets that can be applied to all types of domains and all types of language models. With DSPy we take a small step in that direction: away from over-engineering individual prompts, and toward defining how the data flows through different modules within our LLM program.

DSPy imagines every LLM application as a text transformation graph. Each node in this graph receives one or more text inputs, applies some text transformation, and passes the result on to the next node for further processing, until at the very end we receive our desired output. DSPy completely detaches the underlying model from the program that you design, meaning you can teach both powerful models like ChatGPT or Mixtral and much smaller local models like T5 or GPT-1 to perform the same task using the same LLM program. I hope that gives you some intuition about the problem we're trying to solve with DSPy.

Let's get started with the tutorial with example one; things are going to get even clearer from here. The first example is a simple one where we pass a prompt to the LLM and receive a response. We start off by importing DSPy, and then we create this OpenAI object, which basically wraps the OpenAI API call within a DSPy object. Then we tell DSPy to use this particular language model object, and we say that this LLM function is going to input a question and output an answer. Pretty straightforward, right? The way we call this LLM after that is by passing in a question; for example, I just passed "Who scored the final goal in the football World Cup finals in 2014?" It returns a Prediction object, and if you look at the answer on the Prediction object, it's the right answer. Then if we call turbo.inspect_history() (this turbo is the OpenAI object using the GPT-3.5 Turbo model), it returns the last prompt that was run by our LLM. You can see it starts the prompt by saying "Given the fields question, produce the fields answer," which is basically a translation of the shorthand we wrote down, then it says "Follow the following format," then it asks the question, and the green part is the answer from the LLM. Although this shorthand notation is nice and cute, it's also limiting, because we may want to attach more description to our inputs and outputs.
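To make the inspect_history output concrete, here is a rough pure-Python reconstruction of the template DSPy builds from the `"question -> answer"` shorthand. The exact wording is approximate (based on what the video shows on screen), and this is only an illustration, not DSPy's actual implementation:

```python
# Rough reconstruction of the prompt DSPy builds from the shorthand
# "question -> answer". The wording approximates what
# turbo.inspect_history() printed in the video.

def build_prompt(question: str) -> str:
    lines = [
        "Given the fields `question`, produce the fields `answer`.",
        "",
        "---",
        "",
        "Follow the following format.",
        "",
        "Question: ${question}",
        "Answer: ${answer}",
        "",
        "---",
        "",
        f"Question: {question}",
        "Answer:",  # the LLM's completion (the green part) starts here
    ]
    return "\n".join(lines)

prompt = build_prompt(
    "Who scored the final goal in the football World Cup finals in 2014?"
)
print(prompt)
```

The prompt ends right after "Answer:", so whatever the model generates next is parsed as the answer field.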

For this, DSPy uses the concept of a signature. A signature is a specification of the input/output behavior of a DSPy module. Signatures allow you to tell the language model what it needs to do, rather than specify how we should ask the language model to do it. Defining a signature is pretty straightforward: here we have a class called QA which inherits from dspy.Signature, and inside we declare the question as an input field and the answer as an output field. Then we again define the dspy.Predict class and pass the signature into it. If we run the same question again, we get our answer back, and if you inspect the history, it's the same prompt we ran in the previous example. But with signatures we can do more than just that. For example, we can attach a description to each of our fields: for the output field I'm saying the output should be between one and five words, and for the question we're saying it's the user's question. We can also attach a docstring, which is going to be important. If we run this again we get the same answer, but look at the prompt now: the text we put in the docstring is now the beginning of the actual prompt that the LLM sees, and instead of the generic template-style description, the question is now described as the user's question, and the answer as "often between one and five words."

In the next example we're going to talk about basic chain of thought. Suppose in example one we asked a more complicated question, for example, "Who provided the assist for the final goal in the football World Cup finals in 2014?" The answer is obviously not Mario Götze, it's actually André Schürrle, but because we're just running this simple prompt where the LLM has to directly output the answer, it's failing at the job. One way to solve this is example two: basic chain of thought.

In chain of thought, instead of the plain input-to-output prediction, we ask the language model to think step by step in order to solve the problem. Let's see how that works in action. To do chain of thought, instead of writing dspy.Predict we just write dspy.ChainOfThought(QA), where QA is the signature we wrote in the previous example. If we run this, we will indeed see that the answer predicted by the LM is the right one, André Schürrle. If you look at the prompt now, it has added an additional reasoning step between the question and the answer: it says "Reasoning: let's think step by step in order to produce the answer." And if you look at the LLM's generation: we know the winning goal was scored by Mario Götze, the assist for that goal was provided by André Schürrle, and therefore the answer is André Schürrle. That's the power of chain of thought: by asking the LLM to output some thought before arriving at the answer, we often make it reach down into its millions, billions, trillions of weights to extract knowledge that it can't just extract on the fly. This is a good opportunity to introduce the term "module" into our vocabulary: just as signatures tell us what the LLM should input and output, modules define how it should do it.

We have already seen two modules so far, Predict and ChainOfThought, which employ different algorithms to transform the input text and arrive at the answer. We can even define custom modules of our own very simply. Say we have this multi-step question: "What is the capital of the birth state of the person who provided the assist in the football World Cup finals?" If we let our ChainOfThought module answer this, the rationale it produces says we first need to identify the person who provided the assist, then we need to determine the birth state of that person, and then it just randomly says Munich. Munich is not the right answer. But what if we wrote a custom module for this? Here in the __init__ function it creates that same ChainOfThought object, and inside the forward function it just calls that ChainOfThought object and passes in the question. If we input the multi-step question we again get back Munich, obviously, because we haven't really done anything special yet. But because we have it set up like this, we can now add more stuff inside the module. So here I've written a class called DoubleChainOfThought, where we have two chains of thought: the first one inputs a question and outputs a step-by-step thought, and the next one inputs a question and a thought and outputs a one-word answer. Obviously you could write full signatures for this, but for simplicity I've just gone with the shorthand notation. Inside the forward function we first call the cot1 object, passing in the question, and then we take the step-by-step thought it generates and pass it to the cot2 object along with the original question. If we run this, it turns out we get a different answer: the player who provided the assist for Mario Götze was André Schürrle, André Schürrle was born in the state of Rhineland-Palatinate, therefore the capital is Mainz. If we inspect the history and take the last two prompts, we see that the first prompt takes the question and outputs the step-by-step thought, where it discovers that the person we need to find out more about is André Schürrle, and the second one takes that thought and determines the correct answer. That is the power of being able to stack multiple smaller modules inside your much larger LLM program. The whole concept of a module is inspired by deep learning frameworks like PyTorch: just like in PyTorch, we define what the data graph looks like, and inside the forward function we define the logic that transforms the input to get our output. In fact, the documentation says that DSPy is the PyTorch for foundation models.
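The wiring of DoubleChainOfThought can be sketched without DSPy at all. Below, two stub functions stand in for the real ChainOfThought predictors (which would call an LLM); what matters is how forward() chains them, passing the first stage's thought into the second:

```python
# Stdlib-only sketch of the DoubleChainOfThought idea: two stubbed
# "LM calls" wired together the way the module's forward() wires them.

def cot1(question: str) -> str:
    # Stub: a real LM would produce a step-by-step thought here.
    return ("The assist for Mario Götze's goal came from André Schürrle, "
            "who was born in Rhineland-Palatinate.")

def cot2(question: str, thought: str) -> str:
    # Stub: a real LM would extract a one-word answer from the thought.
    return "Mainz" if "Rhineland-Palatinate" in thought else "unknown"

class DoubleChainOfThought:
    def forward(self, question: str) -> str:
        thought = cot1(question)          # stage 1: question -> thought
        return cot2(question, thought)    # stage 2: question + thought -> answer

answer = DoubleChainOfThought().forward(
    "What is the capital of the birth state of the person who provided "
    "the assist in the 2014 World Cup final?"
)
print(answer)  # → Mainz
```

Swapping the stubs for real dspy.ChainOfThought objects gives the module from the video; the composition pattern is identical.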

In example three we're looking at typed predictors. So far we have only been able to output strings from the LLM, but what if we want to output a float or a boolean or something else? The way DSPy suggests we do this is via the pydantic library. For example, maybe in addition to the answer we also want the LLM to output a confidence score for how sure it is about its answer. To do that, we define a pydantic BaseModel where the answer is a string and the confidence is a float between 0 and 1. Inside our signature we mention the input field as usual, but for the output field we refer to the BaseModel class we created. Then we define our prediction object, but instead of dspy.ChainOfThought we use TypedChainOfThought. If we run that, the confidence it returns is indeed 0.9 and its type is float. If you look at the history of how this was generated, you'll see that it embeds the JSON schema in which the output should be formatted: it says "you must use this format" and produces the JSON schema from our pydantic class structure, which allows the model to generate JSON in the right format.

Now let's look at a slightly more complex predictor, maybe a JSON list: we want the LLM to generate a list of the countries and years of FIFA World Cup winners from 2002 to the present. As usual we write our BaseModel, which contains the country and the year, and inside the dspy.Signature for the answer we mention a list of these objects. Once again we ask our question, and we get the answer in the right format, as we were hoping. Do keep in mind that this typed predictor stuff isn't guaranteed to always work, because the JSON the LLM returns might still be incorrect. We aren't really doing anything to force the LLM to output a particular schema; we're just nudging it, saying "please, you must follow this format," and the LLM is obliging by following the format; it doesn't explicitly have to.
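The validation side of a typed predictor can be shown with the standard library alone: parse the LM's raw text as JSON and check it against the expected types (pydantic does this for real in DSPy; this hand-rolled version just illustrates why malformed output fails):

```python
import json

# Hand-rolled sketch of what a typed predictor must do with the LM's
# raw text: parse it as JSON and validate the expected schema
# (answer: str, confidence: float in [0, 1]).

def parse_answer(raw: str) -> dict:
    data = json.loads(raw)                      # raises on invalid JSON
    if not isinstance(data.get("answer"), str):
        raise ValueError("answer must be a string")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be a float between 0 and 1")
    return data

good = parse_answer('{"answer": "Mario Götze", "confidence": 0.9}')
print(good["confidence"])  # → 0.9

try:
    parse_answer('{"answer": "Mario Götze", "confidence": "high"}')
except ValueError as e:
    print("rejected:", e)
```

The second call fails exactly the way an LLM run fails when the model ignores the nudged schema, which is why typed predictors aren't guaranteed to work.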

Quite often we want to make an LLM call where we pass in external information that might not be readily available to the LLM. This is called retrieval-augmented generation: given an input prompt, the LLM defines a query, we take this query and search a database, the database returns some passages or information, and finally we call another LLM module with this newly added context and ask it to generate the answer to our original question. Let's see how the RAG framework can work in DSPy. As usual we import dspy and create our GPT Turbo model, and we also initialize a dspy.ColBERTv2 object. This ColBERTv2 object is basically a retrieval server that holds the abstracts of many Wikipedia articles up to the year 2017, and it's available to use for free. You can use other databases as well, like Milvus or ChromaDB or other local databases, but since ColBERTv2 is free to use and already comes with a lot of data, we're just going to stick with it. To use the ColBERTv2 retrieval model, we have another DSPy module called dspy.Retrieve, and here we're saying: retrieve the three most relevant passages for the query string "André Schürrle." We see the top-k passages, and the first one is actually about the German professional footballer we're looking for. And obviously, if I change the query to something else, say "iPhone," it updates with entries about the iPhone.

Now, how can we use this to create an LLM app? Let's say we're asking, "What is the date of birth of the player who provided the assist?" If you ask this question with a normal chain of thought, it doesn't give the right answer, but we can see the information is actually present in our retrieval database. So is there a way we could retrieve these passages and extract the date of birth from them? To do that we can create this RAG module. The RAG module has just two things, the Retrieve and the ChainOfThought, and if you look at the signature used by the ChainOfThought, it inputs the context and the question and outputs the answer. The context obviously comes from the top passages returned by the retrieval algorithm, and in the forward function we pass the question into Retrieve and then use whatever it returns to predict the answer. It turns out that doing it like this doesn't work: we get a wrong date of birth. If we look at turbo.inspect_history(), we can see that the André Schürrle passage we wanted the model to retrieve was not actually retrieved, and the reason is that the input to self.retrieve is the question itself, whereas ideally the query should be "André Schürrle" and not the whole question. So in example five we're going to look at a different algorithm to get this done.
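The failure mode just described can be reproduced with a toy keyword retriever and made-up passages: the question never mentions Schürrle by name, so overlap-based retrieval pulls the World Cup final passage instead of his biography, and the needed birthdate never enters the context:

```python
# Toy single-shot retrieval showing why querying with the raw question
# fails. Passages are made up; scoring is naive word overlap.

PASSAGES = [
    "2014 FIFA World Cup Final: Mario Götze scored the winning goal, "
    "assisted by André Schürrle.",
    "André Schürrle (born 6 November 1990) is a German former footballer.",
    "The iPhone is a smartphone made by Apple.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Score passages by how many query words they share.
    q = set(query.lower().split())
    scored = sorted(PASSAGES,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

question = ("What is the date of birth of the player who provided the "
            "assist in the 2014 World Cup final?")
top = retrieve(question)
print(top[0])            # the final's passage, not Schürrle's biography
print("born" in top[0])  # → False: the needed fact was never retrieved
```

The question shares many words with the World Cup passage and almost none with the biography, so a single retrieval step using the raw question cannot surface the birthdate.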

By the way, if you're enjoying this video, the best way to support the channel is to hit the like button, drop a comment, and subscribe, because if you like this video you're probably going to love my next one. We also have a Patreon where I post channel updates as well as the slides, write-ups, animations, and code from all of my videos. A huge thanks to all of our supporters this month across Patreon and YouTube; your support means a lot. Thank you. Let's get back to the video.

The LLM program we're instead going to build is called multihop reasoning. If this is a RAG, multihop reasoning adds a loop inside it: after we receive a question, we ask the LLM to output a query, and we send that query to our database. The database returns some passages, which we include in the prompt next time to output a different query, and then we again ask the database for more information. We basically run this in a loop until we get our answer. Let's see how this is implemented in DSPy. First off we do our setup, and then we have the generate-query signature, which takes in the context and the question and outputs a query. The generate-answer part is the same as before. Then we write our multihop module, which takes as input the passages per hop, used for retrieval, and max_hops, the maximum number of times our program will loop to find the correct answer. Looking at the __init__ function a little more closely, we have a list of ChainOfThought modules whose signature is the generate-query signature, taking the context and a question and outputting a query. In the forward function we loop over self.max_hops: we generate a query given the current context and the question, extract that query, find the passages matching it, and run a deduplication step so that we don't have duplicate passages inside our context. Finally, after this has run three times, we generate the answer. If we run our question through this one, we indeed see that we get the correct answer, 6 November 1990. If you look at the contexts it retrieved, it pulled in the entry about the 2014 World Cup final and also the André Schürrle passage. And if you look at the inspect history, you can see the queries the model was generating: the first one was "date of birth of the player who provided the assist for the final goal," which returned context about the World Cup final, and here, using chain of thought, it determined that the person who provided the assist was André Schürrle and that we need his date of birth. So it searched for "date of birth of André Schürrle," and this is when it retrieved the André Schürrle passage and was able to determine the answer.
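The multihop loop just described can be sketched end to end with stubs: a stubbed query generator (standing in for the ChainOfThought call) that switches to asking about Schürrle once his name appears in the context, a toy retriever, and the deduplication step. Passages and the query-switching rule are invented for the demo:

```python
# Stdlib sketch of the multihop loop: each hop generates a new query
# from the context gathered so far, retrieves, and deduplicates.

PASSAGES = {
    "2014 world cup final": "Mario Götze scored, assisted by André Schürrle.",
    "andré schürrle": "André Schürrle was born on 6 November 1990.",
}

def retrieve(query: str) -> str:
    # Return the passage key sharing the most words with the query.
    q = set(query.lower().split())
    return max(PASSAGES, key=lambda key: len(q & set(key.split())))

def generate_query(context: list[str], question: str) -> str:
    # Stub LM: once the assist provider appears in context, ask about him.
    if any("Schürrle" in c for c in context):
        return "date of birth of André Schürrle"
    return "assist 2014 world cup final"

def multihop(question: str, max_hops: int = 3) -> list[str]:
    context: list[str] = []
    for _ in range(max_hops):
        passage = PASSAGES[retrieve(generate_query(context, question))]
        if passage not in context:   # deduplication step
            context.append(passage)
    return context

ctx = multihop("What is the date of birth of the player who assisted?")
print(ctx)
```

Hop 1 fetches the final's passage, hop 2 uses that context to query Schürrle directly, and hop 3's duplicate retrieval is discarded, so the final context holds exactly the two passages needed to answer.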

Next, we're going to look at a classification problem and how we can use few-shot prompting to improve LLM accuracy. Going with the theme of today's examples, we have a dataset that describes situations in football and what the referee decision should be. I basically generated this dataset with Claude, and I'm going to see if GPT can do well on it. First we create the dataset objects, which is a list of dspy.Example objects; those examples just contain the question and the answer. Then we create our signature, which has the event and the answer, and a basic predict model that just contains a chain of thought. Then we run this Evaluate function, which takes a test set, again a list of these dspy.Example objects, and a metric; in this case I'm using the answer-exact-match metric, which returns 1 if the prediction and the ground truth are the same and 0 otherwise. If we run this Evaluate program on the predict model, we see the accuracy is about 37.1%. This is called zero-shot classification, since the LLM does not receive any hint about the expected input/output pairs. But if we begin to add some examples to the prompt, we'll soon see the LLM's performance go up; this is called few-shot prompting.
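An exact-match metric plus an evaluate loop is simple enough to sketch in plain Python. The classifier here is a stub standing in for the ChainOfThought predictor, and the three referee events are invented; the metric and the accuracy computation are the real logic:

```python
# Minimal sketch of an exact-match metric and an evaluate loop over
# dspy.Example-style records, here represented as plain dicts.

def answer_exact_match(example: dict, prediction: str) -> int:
    # Normalize case/whitespace before comparing; return 1 or 0.
    return int(example["answer"].strip().lower() == prediction.strip().lower())

def predict(event: str) -> str:
    # Stub: a real LM would classify the referee decision here.
    return "yellow card" if "reckless" in event else "no foul"

testset = [
    {"event": "A reckless tackle from behind.", "answer": "Yellow card"},
    {"event": "A fair shoulder-to-shoulder challenge.", "answer": "No foul"},
    {"event": "The ball struck the defender's arm.", "answer": "Handball"},
]

score = sum(answer_exact_match(ex, predict(ex["event"])) for ex in testset)
print(f"accuracy: {score / len(testset):.1%}")  # 2 of 3 correct
```

DSPy's Evaluate does the same sum-over-metric loop, just with real predictions and a larger test set.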

So the plan is: we have this big dataset, and we're simply going to add some of these examples to the prompt and hope our model does better. But the question is still how to determine which examples to pass in. We surely can't pass all 100 examples; maybe we only want to pass 10, but how do we decide which 10? Normally, if you're doing prompt engineering, you handpick 10 examples you think will work well and call it a day. But with DSPy we can run experiments where DSPy will bootstrap different combinations of prompts, test the LLM's accuracy, and optimize for the combination of examples that most increases overall accuracy. So what if we try to bootstrap using DSPy's built-in teleprompt module? This teleprompt module is basically like an optimizer.

We use the BootstrapFewShot optimizer, which inserts some of the examples from our training set into the prompt and then checks whether that improves the predictions. BootstrapFewShot is an optimizer that takes a program, a training set, and a metric, and returns a new optimized program. For BootstrapFewShot we pass in our train set, which is again a list of examples the algorithm can choose from to insert into our prompt, and we run the compile function. It takes about three minutes to run, and it returns a new predictor optimized for the current task. If we run Evaluate on the test set again, we see that it now gets 36 out of 62 right, an accuracy of about 58.1%, which is higher than the 37.1% we were getting earlier. This is also a good opportunity to talk about DSPy's Assert and Suggest framework: whenever we want to add constraints to our LM's output, we can use Assert or Suggest.
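The core idea behind BootstrapFewShot, running the program over the training set and keeping only the traces the metric verifies as demos, can be sketched in a few lines. This is a heavily simplified illustration with a stub program and stub metric, not DSPy's actual optimizer:

```python
# Simplified sketch of the BootstrapFewShot idea: run the current
# program over the training set, keep traces where the metric passes,
# and attach up to k of them as few-shot demos.

def metric(example: dict, prediction: str) -> bool:
    return example["answer"].lower() == prediction.lower()

def program(event: str) -> str:
    # Stub zero-shot predictor: right on some inputs, wrong on others.
    return "yellow card" if "tackle" in event else "play on"

trainset = [
    {"event": "a late tackle", "answer": "yellow card"},
    {"event": "handling the ball", "answer": "handball"},
    {"event": "a studs-up tackle", "answer": "yellow card"},
]

def bootstrap_few_shot(trainset: list, k: int = 2) -> list:
    demos = []
    for example in trainset:
        prediction = program(example["event"])
        if metric(example, prediction):   # keep only verified traces
            demos.append({**example, "prediction": prediction})
        if len(demos) == k:
            break
    return demos

demos = bootstrap_few_shot(trainset)
print(len(demos))  # → 2
```

The second training example is skipped because the program gets it wrong; only metric-verified demonstrations make it into the compiled prompt.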

For example, in this case we want the model's output to always be one of the allowed categories, so we can add a Suggest statement to nudge the LLM to predict within the six categories if it outputs something else. Every time the LLM outputs something that's not one of those six categories, we throw a custom error message, along with its previous output, and ask the LLM to output the right prediction. The main difference between Suggest and Assert is what happens after some retries if the condition is still failing: with Assert we halt the program right there, while with Suggest we let the program flow to the next step. Assert literally stops the program in its tracks; Suggest just nudges the LLM toward the right output, and if it doesn't comply, the downstream tasks continue anyway. To use Suggest, we add one additional line inside our predict-model function saying the output answer should be in the following list; in case it's not, the error message is thrown back to the LLM. Let's see the difference when we run it without Suggest: without Suggest the output is "offside," which is not one of our six categories, but if we run the Suggest-enabled predict model, the answer is "VAR," which is indeed present in our category list. If you look at the inspect history, we see that the model originally did predict "offside," but then we informed it that the past answer was "offside" and that the instruction is the answer can only be one of the following: yellow card, red card, etc. This results in a change in the output to "VAR."
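The retry behavior of Suggest can be sketched as a loop around a stubbed LM. The category list here is assumed (the video only shows a few of the six), and the stub LM is scripted to return "offside" first and "VAR" once it receives feedback, mirroring the run in the video:

```python
# Sketch of the Suggest retry loop: if the output is outside the
# allowed categories, re-prompt with an error message and the previous
# output; after max_retries, continue anyway (Assert would halt).

# Assumed category list; the video names yellow card, red card, VAR.
CATEGORIES = {"yellow card", "red card", "penalty", "free kick",
              "VAR", "no foul"}

def stub_lm(event, feedback=None):
    # Scripted stub: wrong answer first, corrected answer on retry.
    return "VAR" if feedback else "offside"

def predict_with_suggest(event, max_retries=2):
    feedback = None
    for _ in range(max_retries + 1):
        answer = stub_lm(event, feedback)
        if answer in CATEGORIES:
            return answer
        feedback = (f"Your past answer was {answer!r}; the answer must be "
                    f"one of {sorted(CATEGORIES)}.")
    return answer  # Suggest continues here; Assert would raise instead

print(predict_with_suggest("the goal was under review"))  # → VAR
```

The only difference between the two constructs is the last line: Assert raises when retries are exhausted, Suggest returns whatever it has and lets the downstream steps run.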

Okay, it's time for example seven. In example seven our goal is to generate a new question-answering dataset, but there's a caveat: we only want to generate question-answer quizzes from information we have access to in our ColBERTv2 database. The algorithm goes like this: we generate a random list of search terms, we query our ColBERTv2 database using these terms, and from the passages we receive, we ask an LLM to generate quizzes out of those passages only, so that every question in the quiz can be answered by looking at the database and no external knowledge is required. Let's see how we build this. I have the list of search terms right here, and our signature is a pretty simple one: the docstring says to generate a very simple trivia question from the given information. The information is going to come from the retrievals, obviously, and as output we have the question and the answer. In the forward function we take the query, which comes from the search-term list, pass it into our retrieval module to get back the passages relevant to the query, and then loop over each of those passages, calling our ChainOfThought module to output a question and an answer from each one. We keep a list that appends all of the contexts, questions, and answers, and if we run the whole thing we end up with a dataset of questions and answers, along with the context retrieved from the database from which each question was generated.
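The generation loop can be sketched with stubs for the retriever and the quiz-writing LM; the search term, passage, and Q/A pair below are invented placeholders, and csv is used the way the video's trivia.csv is built:

```python
import csv, io

# Sketch of the quiz-generation loop: for each search term, retrieve
# passages (stubbed), ask a stub LM for a question/answer pair per
# passage, and append rows that later become trivia.csv.

def retrieve(term):
    # Stub retriever: canned passages per search term.
    return {"Arsenal": ["The Emirates Stadium is the home of Arsenal F.C."]
            }.get(term, [])

def generate_quiz(passage):
    # Stub LM: a real ChainOfThought would write the trivia question.
    return ("Home of which football club is the Emirates Stadium?",
            "Arsenal F.C.")

rows = []
for term in ["Arsenal"]:
    for passage in retrieve(term):
        question, answer = generate_quiz(passage)
        rows.append({"context": passage, "question": question,
                     "answer": answer})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["context", "question", "answer"])
writer.writeheader()
writer.writerows(rows)
print(len(rows))  # → 1
```

Because every row carries the retrieved context alongside the question and answer, each quiz question is answerable from the database by construction.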

A lot of times we want to train a much smaller model on a task that has already been learned by a much larger LLM, taking the large model and transferring its knowledge to a much smaller one. That's what we're going to do now. For our final example, we want to train a T5 model to (a) learn how to query the database given a question, and (b) given the passages returned by the database, learn how to generate the correct output. Let's see how. First off, we have our trivia.csv, which we generated in the previous file. It has a lot of question-answer pairs, but we don't know how to generate the query that must be sent to the database to retrieve the relevant passages used to produce these answers. For that reason, we're not going to train T5 directly on these input-output pairs. Instead, we'll first build a RAG pipeline using GPT and then use that to train our T5 model. That way, the T5 model is not going to learn how to generate answers from questions; it's going to learn how to generate queries from questions, and answers from the passages returned by the database. Let's see that in action.

Here we have a generate-search-query signature whose task is to input a question and write a simple search query, and a generate-answer signature that takes a context and a question and outputs an answer. The RAG module inputs a question, generates a query based on it, the retrieval client returns some informative passages, and we pass those, along with the question itself, into generate-answer. We then compare that answer with the ground-truth answer. Let's see how plain GPT-3.5 does on the quiz: to no one's surprise, it does pretty well, scoring 83.3% by getting 25 of the 30 questions on the test set. Given the question, it's able to find the right query to send to our database, and given the context the database returns, it's able to generate the correct answers. Now, how can we transfer this behavior to a much smaller T5 model? Before fine-tuning, let's see how much a T5 model actually scores: it scores 0%, and the reason is that it's not able to generate the queries correctly; and because the queries aren't right, the retrieved context isn't right, and it can't generate the answers correctly either. So now let's take the RAG run by GPT-3.5 Turbo and use it to fine-tune our T5-small model. Notice that GPT-3.5 has 175 billion parameters whereas T5-small has just 60 million, but let's see how the fine-tuning goes. To fine-tune, we use the dspy.teleprompt.BootstrapFinetune class.

We create our training data, initialize the GPT RAG module, and finally, when doing the fine-tuning, we use T5-small as our target. Here the teacher model is the GPT RAG, which is basically the RAG module controlled by GPT-3.5 Turbo, and the student model is controlled by T5-small. It takes about six or seven minutes to run the 10 epochs, and as you can see, the evaluation loss is indeed going down as the epochs progress. At the end of the epochs we get back this rag_t5_finetuned, which is basically an optimized model that has learned from the GPT-3.5 teacher. Let's see if this fine-tuned model can output some sensible answers. Here I've passed in the question, "Home of which football club is the Emirates Stadium?" The query that the T5 model generated is "Emirates Stadium," and if you look at the retrieved context, it actually retrieves a passage about the Emirates Stadium that does say it's the home of Arsenal Football Club, which helps the second part of our pipeline generate the answer: Arsenal Football Club.

To remind you how our RAG pipeline worked, it has these two modules: the first one generates a query, and the second one generates an answer from the context retrieved by that query. Here are a couple more examples; for this one we're asking where UEFA Euro 2008 took place, and it correctly retrieves the right context to output Austria and Switzerland. In fact, if we run our evaluation function on the new fine-tuned T5 model, the performance is actually pretty good: it gets 50% of the answers correct, scoring 15 out of 30, which is much higher than the zero it was scoring earlier. Here are some of the correct answers this T5 model was able to find, and that's quite encouraging. Now, obviously the original GPT-3.5 Turbo model scored 83%, and we're still not there with just 50% accuracy. So, following the same algorithm, I fine-tuned T5-base, which has 220 million parameters, about four times larger than T5-small but still a lot smaller than GPT-3.5. If we run the training again, you'll see that the evaluation loss indeed ends at 0.07; the evaluation loss for the T5-small model ended at 0.15 after 10 epochs, meaning the larger model is definitely able to extract answers much better. And indeed, if we run Evaluate, we see that it gets 21 out of 30 right, a 70% accuracy. Here are some of the queries and answers the T5-base model generated, and that's quite encouraging, because we have succeeded in fine-tuning a model about 1,000 times smaller than the original GPT-3.5 model and still achieved a solid performance boost with only five to six lines of code and five to six minutes of training.
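The kind of training data this teacher-student setup collects can be sketched with stubs: run the GPT-driven RAG (stubbed below) over the trivia questions and record one (input, output) pair per module call, one for query generation and one for answer generation. This is an illustration of the idea, not BootstrapFinetune's actual trace format; all stub outputs are placeholders:

```python
# Sketch of the data a teacher-driven fine-tune collects: for each
# question, record (question -> query) and (context + question ->
# answer) pairs, which the T5 student is then trained on.

def teacher_generate_query(question):
    # Stub for the GPT-3.5 query generator.
    return "Emirates Stadium"

def teacher_generate_answer(context, question):
    # Stub for the GPT-3.5 answer generator.
    return "Arsenal Football Club"

def retrieve(query):
    # Stub retriever with a canned passage.
    return ["The Emirates Stadium is the home of Arsenal Football Club."]

def collect_traces(questions):
    pairs = []
    for q in questions:
        query = teacher_generate_query(q)
        context = retrieve(query)
        answer = teacher_generate_answer(context, q)
        pairs.append({"input": q, "output": query})                    # stage 1
        pairs.append({"input": f"{context[0]} {q}", "output": answer})  # stage 2
    return pairs

traces = collect_traces(
    ["Home of which football club is the Emirates Stadium?"]
)
print(len(traces))  # → 2
```

Because the student trains on both stages separately, it learns query writing and answer extraction rather than memorizing question-to-answer mappings.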

So that's the end of this DSPy tutorial. We started with the basics of DSPy, like signatures; then we talked about modules, the different types of modules like Predict and ChainOfThought, and how to write your own modules and your own text transformation graphs; then we covered RAG, retrieving from an external database, and multihop reasoning; we talked about typed predictors and DSPy assertions, and how to do few-shot prompting and optimize our few-shot prompts; and finally we talked about fine-tuning, where we took a compiled LLM program and transferred that knowledge into a small local T5 model with pretty decent accuracy. I genuinely think these DSPy folks are onto something. It's a really cool idea to create LLM programs that are just about the flow of data, and I think it's a great idea to decouple the programming portion from the modules portion. Let me know in the comments if you have any questions and whether you're building anything with DSPy. If you enjoyed this video, please hit the like button and share it with your friends. I'll see you next time. Bye!
