
LLM Engineer's Handbook: From theory to production | TDE Workshop

By The Data Entrepreneurs

Summary

## Key takeaways

- **LLM Twin Architecture**: The high-level architecture splits into a data collection pipeline storing raw data in a MongoDB NoSQL data warehouse, a feature pipeline creating fine-tuning and RAG data in a logical feature store (vector DB plus data registry), a training pipeline producing a fine-tuned LLM in the model registry, and an inference pipeline exposed as a chatbot. [05:03], [07:21]
- **Store Raw Data Practice**: Storing data in raw format in the data warehouse allows easy experimentation and recomputation of features when anything changes, since internet sources keep moving and continued access to the data isn't guaranteed. [09:25], [09:58]
- **Chunking's High Impact**: Cleaning and chunking in the pre-retrieval stage of the RAG ingestion pipeline often have the most impact when optimizing RAG systems, more than the "sexier" parts. [13:52], [14:15]
- **Separate LLM and Business Services**: Deploy the GPU-bound LLM microservice that hosts the model separately from the CPU/IO-bound business microservice that handles RAG and monitoring logic, so they can scale independently and avoid complications. [26:06], [27:51]
- **Production Essentials**: For production, use a model registry for versioning models, a data registry in the feature store for versioning data, an offline store such as S3 for batch training access, and an online store such as a Qdrant vector DB for low-latency retrieval. [41:48], [44:06]
- **LLM-as-Judge Evaluation**: To evaluate generative LLM outputs without ground truth, use an LLM as judge with custom metrics, e.g. style scores from 1 to 5 based on expected traits such as avoiding generic text and complicated words. [57:32], [58:22]

Topics Covered

  • Store Raw Data for Iteration
  • Chunking Trumps Embeddings in RAG
  • Decouple LLM from Business Logic
  • Pipeline Triggers Enable Continuous Training
  • Production Needs Four Data Pillars

Full Transcript

Okay, welcome everyone. For those who don't know me, I'm Shaw, this is The Data Entrepreneurs, and today I'm super excited because I'm joined by Paul, and he's going to be talking about something that a lot of entrepreneurs are very much interested in these days, which is building systems with LLMs. Paul's actually done another workshop with us, and last time he did a workshop with us he kind of broke our streaming system and the way we host events, and I'm happy to say that he's done that yet again. We were originally going to do this event on Google Meet, but we kind of blew past the 100-person limit of a Google Meet call, so that's the reason why we're streaming live on YouTube and LinkedIn. I don't want to hold up the talk, so I'll just hand it over to you, Paul, and then we can get into it.

Yeah, hello, hello everyone, I'm

excited to be here. So as Shaw said, today I will dig into a presentation based on my latest book, which will show you, at a very high-level overview, what it takes to take an LLM to production: how to build an LLM system, how to design it, some challenges, and everything around it. So yeah, let's dig into it.

First, let me tell you a few words about myself. I'm from Romania, maybe well known as the land of vampires and good internet. It's already night here, but I promise I will behave. I have seven-plus years of experience in AI, ML, MLOps, or however you want to call it, and I'm doing content on LLMs and MLOps, and soon recsys, which I will add on top of my content. For the last two years or so you can find me on LinkedIn, where I post daily, and on my Decoding ML ecosystem; I will attach links at the end. I also started doing consulting in the LLM and MLOps space in the past months, and it works great. So yeah, let's dig into it.

As I said, this presentation is based on my latest book, the LLM Engineer's Handbook, which I released about one month ago. I'm one of the co-authors, together with Maxime Labonne, and it was actually endorsed by the CTO of Hugging Face and the CTO of ZenML. Throughout the book we show you how to build one big project instead of multiple smaller ones, because we thought that by doing so we can really highlight what it takes to take a proof of concept to production, which is similar to what you will be doing in your job, and really tackle all the aspects it takes. We cover things such as custom data collection (not using static datasets, which are really easy to work with but which you only use in toy projects), RAG, LLM fine-tuning, LLM optimization, LLMOps, and more. So yeah, now let's start and dig into the technicals.

Right, so let's first understand the LLM twin architecture. First, what is an LLM twin? It's a term that I coined, and it's an LLM into which you want to inject your style and voice; more specifically, for this use case, it's an LLM into which you inject your style and voice so that it writes posts and blog posts like yourself. Now, I'm not doing this yet; I think it's still a proof of concept, but at some point, when this technology matures, for sure we will be able to do that while producing really good content. You can of course do it today, but usually that content is trash, or very neutral and not engaging. So yeah, let's dig into the technicals, as I said.

The high-level architecture is split into the data collection pipeline, which takes in raw data, processes it, and puts it in a data warehouse. Next we move to the feature pipeline, which takes this raw data from the data warehouse and processes it to create fine-tuning data and RAG data, because usually in an LLM system you need data for two things: for fine-tuning your LLM and for RAG. We take this raw data from a MongoDB data warehouse (we will highlight this a little bit later); in our particular case we have three types of data: articles, posts, and code. The feature pipeline processes it and puts it into a logical feature store. Why is it, in our case, a logical feature store and not just a feature store? Because we are kind of creating what a feature store should have out of a vector DB and a data registry. There are other technologies, like Hopsworks and Tecton, that offer a feature store off the shelf, but I think that in the LLM world, at least at this point, you don't really have everything you need in these technologies. Probably in a few months or even earlier we will, but when we wrote the book we didn't.

Then, in the training pipeline, we take the fine-tuning data from the logical feature store and train the model using various techniques, which we will touch on later; the output is a fine-tuned LLM, which we store in the model registry. The last step is the inference pipeline, which in our case is actually a chatbot. So this is a really high-level overview of what's going on; I will start zooming in on

each component. That's why I went really fast over everything: now we dig into the details of each. So let's look at the data collection pipeline, which looks like this. As I said, we mostly crawl three categories of data: articles, repositories, and social media posts. For example, articles we can take from Medium, Substack, or custom blogs; repositories from GitHub or, rather, git-like repositories; and social media posts from LinkedIn in our use case, but you can extend it to X or Instagram or whatever. We had to build an ETL-like pipeline for each, because each source needs to be crawled differently: each site has a different HTML structure that needs to be parsed, cleaned, structured, and normalized before being stored in the NoSQL data warehouse.

We used MongoDB as a NoSQL data warehouse to store our raw data. This is not really the standard data warehouse that you see in books, because a data warehouse usually needs to have analytics properties, to be able to analyze your data and compute all kinds of statistics; but those data warehouses are usually heavier, and we don't need that in this book.

Because our data is very unstructured, text-like documents, using a NoSQL database allowed us to really easily spin everything up and store everything in a production-like setup, because MongoDB scales up really well. And in some future step you could actually take this data and put it in a real tool that offers analytics, like BigQuery or something like that.

And why are we taking the data and storing it in raw format? This is actually really good practice, because storing your data in its raw format allows you to experiment and iterate on the next parts of your system. For example, if you want to change some part of your feature engineering, change some part of the system, or add some extra stuff, you have all your raw data, which you can access really easily, and you can recompute everything from scratch. If you rely only on the ETL pipelines that crawl the data, remember that the internet is a moving system, and you are never guaranteed to have access to that data again, or at least not in the same format.

Okay, some tools to do crawling: Selenium, if you want to go more low level and have a lot of control; plain HTTP requests, where you directly make GET requests on some pages and parse the HTML, which is often good enough (you can parse the structure with BeautifulSoup); also LangChain or LlamaIndex and these kinds of tools, which offer some high-level methods to crawl and get data from the internet; and Firecrawl, which I used in the past weeks and really liked for crawling all kinds of stuff. I think it's a really plug-and-play crawling tool that is really powerful.
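As a minimal sketch of one of these source-specific ETL crawlers, here is a toy extractor built only on the Python standard library; the real pipeline would use Selenium, BeautifulSoup, or Firecrawl, and the sample HTML and function names here are made up for illustration:

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Toy extractor: collects text inside <p> tags, mimicking the
    parse/clean/normalize step each source-specific crawler performs."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

def extract_paragraphs(raw_html: str) -> list[str]:
    parser = ArticleTextExtractor()
    parser.feed(raw_html)
    return [p.strip() for p in parser.paragraphs if p.strip()]

sample = "<html><body><h1>Title</h1><p>First para.</p><p>Second para.</p></body></html>"
print(extract_paragraphs(sample))  # ['First para.', 'Second para.']
```

Each source (Medium, GitHub, LinkedIn) would get its own extractor like this, all writing normalized documents to the same raw-data warehouse.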

So now let's move to the RAG part of the system. Let's step back a little: after the data collection pipeline we have the feature pipeline, which is actually a RAG feature pipeline, because here we want to prepare the data and load it into a vector DB for RAG, but also prepare it for fine-tuning. That's why I first want to go really fast over what a RAG system looks like. We have the RAG ingestion pipeline, which consists of the part where you need to clean, chunk, and embed your data. Many times it also has an extract part, where you need to take the data from somewhere, but because we took the extraction part and reformulated it as the data collection pipeline, in our case the extraction part is just reading from the data warehouse, so it's really easy; here we focus on cleaning, chunking, and embedding. After embedding, we also want to load the data into the vector DB, obviously.

The other part of the system is the retrieval and generation pipeline. These two parts actually run as different processes: the RAG ingestion pipeline is usually a batch pipeline or a streaming pipeline that does these things constantly, because you constantly want to grab new data and update the vector DB, while the retrieval and generation pipeline usually runs when you need it, for example when a user makes a query. Then you take the user query, embed it, call the vector DB to find relevant chunks, create the prompt based on the user query and chunks, and generate the final answer. So this is a very standard RAG system, and every RAG system has these main components at its roots.
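To make that flow concrete, here is a toy end-to-end sketch. A bag-of-words "embedding" and cosine similarity stand in for a real embedding model and vector DB, and the final LLM call is omitted; every name here is illustrative, not the book's actual code:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion side: clean, chunk, embed, load into the (here: in-memory) vector DB.
chunks = ["mlops is about automating pipelines",
          "vector databases store embeddings"]
vector_db = [(c, embed(c)) for c in chunks]

# Retrieval + generation side: embed the query, find relevant chunks,
# build the prompt, and (in a real system) send it to the LLM.
query = "what are vector databases"
q_vec = embed(query)
best_chunk = max(vector_db, key=lambda item: cosine(q_vec, item[1]))[0]
prompt = f"Context: {best_chunk}\n\nQuestion: {query}\nAnswer:"
print(best_chunk)  # vector databases store embeddings
```

The two halves run as separate processes in production: ingestion on a batch or streaming schedule, retrieval on demand per user query.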

For example, if you want to optimize a RAG system, which is also known as advanced RAG, there are mainly a few areas to look at. At the pre-retrieval stage, which usually lives in the RAG ingestion pipeline, it's about how you clean and how you chunk the data. This is very important; many times these are the most impactful parts you can optimize. They're not the sexiest parts, but sometimes they are the parts with the most impact. Also, on the retrieval and generation side, there is the user query, the user input: is it verbose enough, does it have grammatical errors, do you need some specific entities to be in it? These are the types of things that can really do magic in how you query the vector DB.

Next, we can optimize the retrieval step: on the RAG ingestion pipeline side, for example, which embedding model we use; and on the retrieval side, how we query the vector DB, for example how we can reduce the search space to increase our probability of a good hit in the vector DB. The last part is the post-retrieval step, where, after we retrieve our potentially relevant chunks, we want to reduce bias and reduce noise; and sometimes we don't want a huge context, due to latency or cost issues, so we want to reduce it and keep only what's essential. Then we pass that context to the LLM, and we are back to the standard pipeline.

Okay, so let's see what the RAG feature pipeline, aka the RAG ingestion pipeline, looks like in our LLM twin use case (we always come back to our LLM twin use case). As I said in the beginning, we consume raw data from the data warehouse and we want to create two snapshots: one is the cleaned data, which will be used for fine-tuning, and the other is the chunked and embedded data, which will be used for RAG.

In our particular use case, because we wanted to keep it simple, for cleaning we used a very sneaky but simple method where we just kept all ASCII characters, because we know that only ASCII characters can safely be thrown into the embedding model and processed. If you want to create a real product, you want to be more careful about this, because sometimes italics, bold, or emojis can carry a lot of signal, and you want to process them. For chunking, we used the standard approach where you look for paragraphs and then chunk them based on the context window of the embedding model. Again, you can go really deep into how you do this and spend a lot of time optimizing your chunking; we will dig more into the advanced parts when we look at our retrieval client. The cleaned data we basically take as the output of the cleaning step and throw into the vector DB as is, because you can actually use the vector DB as a NoSQL database: most vector DBs have support for querying just by ID, and you can put any text into the metadata, so in the end it's a NoSQL database. Then we do the standard chunking and embedding and put the vectors plus the metadata into the vector DB as usual.
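A minimal sketch of the two steps just described, keep-ASCII cleaning plus paragraph-based chunking; the word budget here is a crude stand-in for the embedding model's real token limit, and the numbers are made up:

```python
def clean(text: str) -> str:
    # The deliberately simple cleaning step: keep only ASCII characters.
    return "".join(ch for ch in text if ch.isascii())

def chunk_by_paragraphs(text: str, max_words: int = 8) -> list[str]:
    # Split on paragraphs, then pack paragraphs into chunks that fit the
    # embedding model's context budget (word count as a crude token proxy).
    chunks, current = [], []
    for para in text.split("\n\n"):
        words = para.split()
        if current and len(" ".join(current).split()) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = clean("First paragraph here.\n\nSecond one follows.\n\nThird paragraph ends it.")
print(chunk_by_paragraphs(doc))
```

As the talk notes, a real product would handle formatting and emojis more carefully and use a proper tokenizer for the length budget.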

Okay, so now let's move to the other side of the spectrum, the retrieval pipeline. The ingestion pipeline is where we put stuff into the vector DB; here is where we query the vector DB, and here things look a little bit more complicated, or more interesting, depending on how you want to see it, versus the naive RAG system. So let's look at how we can optimize the pre-retrieval part of the system, or rather how we optimized the pre-retrieval part of our LLM system.

One quite easy method you can adopt is query expansion, where you take the query from the user and use an LLM to generate different perspectives on the same input, which will have different words or a slightly different meaning, or try to see the truth from different angles; that's basically what it does. The second method is to use self-query. This is really powerful when you have entities in your database that you want to filter on. For example, we used the author entity to filter down parts of the vector DB: you take the author, and when you do a search, you try to find context just based on that author. You can always extend this to more practical use cases, like tags, which is a very standard use case, or any entity that's used within your application. Again, you use an LLM to extract these entities, which we then use in the retrieval part. Other things you can further optimize, as I said, are chunking, or just using an LLM to make sure the query is verbose enough and doesn't have grammatical errors. Grammatical errors are not the biggest mistake to have when you embed the query, but it's something you can play around with.
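A toy sketch of both pre-retrieval tricks: simple string logic stands in here for the LLM that would do the rephrasing and entity extraction in the real system, and the regex and templates are illustrative assumptions, not the book's implementation:

```python
import re

def self_query(user_query: str) -> dict:
    # Toy self-query step: extract an "author" entity to use later as a
    # vector-DB filter. A production system would use an LLM for this.
    match = re.search(r"by ([A-Z][a-z]+(?: [A-Z][a-z]+)?)", user_query)
    return {"query": user_query, "author": match.group(1) if match else None}

def expand_query(user_query: str) -> list[str]:
    # Toy query expansion: template-based rephrasings; a real system would
    # ask an LLM for semantically different perspectives on the same input.
    return [user_query,
            f"Explain: {user_query}",
            f"Give examples of: {user_query}"]

parsed = self_query("show me posts about RAG by Paul Iusztin")
print(parsed["author"])  # Paul Iusztin
```

Each expanded query would be embedded and searched separately, while the extracted entity narrows the search space.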

Now, on the retrieval part, we used a simple filtered vector search that reduces our search space to only the documents we know we care about, so we reduce the chance of grabbing faulty chunks or faulty documents. Other things you could think about: hybrid search, especially in systems that still use keywords; combining keyword search with semantic search is actually really powerful. You could also think about routing: maybe you have multiple collections, and for specific queries you care only about one or two collections, not all of them. In our particular use case we have the articles, repositories, and posts collections, and maybe for some queries we care only about the articles.

In the post-retrieval part we use the classic rerank, which is actually really powerful, because LLMs are usually biased towards the first part and the last part of the context. Also, with reranking you can, for example, search at the retrieval step for a larger set, let's say 20 chunks, and use the reranker to re-sort them and take only the first five. Like this you filter out the noise, filter out the bias, and also reduce your context, all with one algorithm that is intuitive to understand and powerful. Instead of using the standard reranking models, I've seen that it's quite popular to use another LLM as a reranker; I don't have a strong opinion on which one is better. Again, I think it's a very experimental process.
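The filter-then-rerank idea can be sketched like this; term overlap stands in for both the real vector similarity and the real reranking model, and the documents are made up for illustration:

```python
def filtered_search(db, query_terms, author, top_n=4):
    # Pre-filter by the self-query entity, then score by term overlap
    # (a stand-in for a real vector similarity search over a wider set).
    candidates = [d for d in db if d["author"] == author]
    scored = sorted(candidates,
                    key=lambda d: len(query_terms & set(d["text"].split())),
                    reverse=True)
    return scored[:top_n]

def rerank(query_terms, docs, keep=2):
    # Stand-in reranker: re-score more carefully (overlap density here) and
    # keep only the best few, trimming noise and shrinking the LLM context.
    def density(d):
        words = d["text"].split()
        return len(query_terms & set(words)) / (len(words) or 1)
    return sorted(docs, key=density, reverse=True)[:keep]

db = [
    {"author": "paul", "text": "rag systems need chunking"},
    {"author": "paul", "text": "chunking and embedding matter for rag quality and retrieval"},
    {"author": "alex", "text": "rag chunking"},
]
q = {"rag", "chunking"}
hits = rerank(q, filtered_search(db, q, "paul"))
print(hits[0]["text"])  # rag systems need chunking
```

In production the first stage would pull a wide set (say 20 chunks) from the vector DB and the reranker would keep the top five.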

So that covers the RAG inference pipeline. We have the data collection, the RAG feature pipeline, and the RAG inference pipeline; now let's understand how we could deploy these things, for example the inference pipeline, which is actually our chatbot in the end. If you think about it, we have the query, all of these advanced things going on around it, then you have the context, you create a prompt, you throw this prompt at an LLM, which is either an API like OpenAI or Claude or whatever, and you get an answer. Now let's see how we can deploy all of this without using those APIs.

We have two parts. First, the LLM microservice, which in most basic applications is, as I said, replaced by those APIs; but in our particular use case we wanted to build everything from scratch, so no APIs. A standard approach is to use systems like SageMaker or Bedrock to deploy this, or Modal-like serverless systems, or actually use open source tools like text generation inference (TGI) from Hugging Face or vLLM, take them and put them on a Kubernetes cluster. So there are various approaches, but I think that in the beginning I wouldn't go down this open source route, because it can get complicated; I think the easiest way is to just choose Bedrock or Modal or another serverless solution, build out your system, iterate fast, and then think about other optimizations. In the book we picked SageMaker because it's a very good middle ground between building things yourself and serverless. Serverless hides everything away: you just write a few lines of code and it works like magic, and because it's a book, we did not want that. With SageMaker you still don't have to manage everything yourself, but you have some more control: over what quantization method you pick (you load, for example, an LLM from the Hugging Face model registry and can choose how to quantize it), over the tokens, over the compute that you run on. You can play around with all these parameters, which gives you a sense of what it takes to have that LLM in production. But the end result

is a REST API, similar to OpenAI's, that you can query and play around with. On the other side, you have the business microservice, which has all this advanced RAG logic. Usually you want to divide the LLM microservice and the business microservice into two different services. Of course, this is not mandatory, but personally I think it's useful. First of all, these LLM serving tools usually don't give you a lot of stuff besides hosting your LLM; if you want to add some extra functionality, it gets quite complicated, because they are really embedded into the LLM ecosystem. The second reason is that the LLM microservice is really GPU-dependent, while the business microservice is usually CPU- and IO-dependent.

Let's take this example. Usually an LLM service takes in client requests and directly processes them with the LLM, which puts the data on the GPU, processes it, gets a response, and sends it away. You can do some dynamic batching to process multiple requests at the same time and that kind of thing, but it's still GPU-bound. Meanwhile, the business microservice, where you have stuff like RAG and monitoring, is mostly CPU- and IO-bound: you take a client request, you do some API calls to other services, which are IO-bound; you do an API call to the vector DB, which is IO-bound; then you aggregate the data, do some string concatenations and that kind of stuff, which is CPU-bound. You don't really depend on models here; it's mostly a standard service, a Python server like we are used to from the software engineering world. And because you decoupled them, you can also scale them differently, with horizontal scaling, and you have more control over how you do this.

One thing I missed: the business microservice is also where you hook in the prompt monitoring pipeline. You take your prompt, which you compute here, you get the answer from the LLM microservice, and you send the whole chain to the prompt monitoring service, or however you want to call it. You can see the business microservice as the place that aggregates your business logic; that's why it's called like this. So, as I said, you have your RAG and prompt monitoring, and in the end you can expose it as a REST API using frameworks like FastAPI. But you're not bound to do this: you can expose it over gRPC, or directly put it in a client application, or whatever makes sense for you and your application.
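Here is a toy sketch of the split: the GPU-bound LLM service is stubbed out, and the business service does the CPU/IO work of prompt building and prompt monitoring. In production these would be separate processes behind their own REST APIs (e.g. a SageMaker endpoint and a FastAPI app); everything below is illustrative:

```python
import json

def llm_microservice(prompt: str) -> str:
    # Stand-in for the GPU-bound LLM service behind its own REST API;
    # a stub here so the sketch is runnable without a model.
    return f"[generated answer for {len(prompt)} prompt chars]"

monitoring_log = []

def business_microservice(user_query: str, context: str) -> str:
    # CPU/IO-bound work: build the prompt, call the LLM service over the
    # network, and ship the full prompt/answer pair to prompt monitoring.
    prompt = f"Context: {context}\n\nQuestion: {user_query}\nAnswer:"
    answer = llm_microservice(prompt)
    monitoring_log.append({"prompt": prompt, "answer": answer})
    return answer

answer = business_microservice("what is an llm twin",
                               "an llm twin mimics your writing style")
print(json.dumps(monitoring_log[0])[:40])
```

Because the two functions live in separate services, each can be scaled horizontally on the hardware it actually needs, GPUs for one, cheap CPU instances for the other.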

Now, here we deployed only the inference pipeline, the chatbot, the final product that you interact with as a user. What's left is to look at how we deploy all the ML pipelines: we have the data collection pipeline, the feature pipeline, the training pipeline; how do we deploy them? Usually people stop here and leave the other pipelines untouched; they run them manually and take the model from the model registry, and everything that happens before the model registry is a mess and hard to reproduce.

A really good strategy is to start by versioning your code, having all your code in GitHub. When it comes to code, it's actually the standard software engineering pipeline, where you want to take it and deploy it. In our book we use ZenML to orchestrate all our pipelines, and they provide the option to register the code in a ZenML code repository. Out of this we dockerize all our code and put it in an ECR Docker registry, which is the AWS version of a Docker registry. From here we can take the code, packed with all the dependencies (system dependencies, Python dependencies, and everything we need) in a Docker container, and ZenML lets us take these Docker containers and ship them to SageMaker, which is the compute. So basically we have all the pipelines packed in a Docker container, and we told ZenML: hey, here is our Docker container with our ML pipelines, and we want you to put them on SageMaker. Using ZenML you can do this through the UI, or you can even automate it further using infrastructure as code like Terraform; at least that's it from my knowledge, but I know they keep adding tons of features in this MLOps zone.

So this is the code and the compute. Again using ZenML, we hooked up S3 as artifact storage, and we independently wrote custom code to connect to Qdrant and MongoDB; Qdrant is our vector DB and MongoDB is our data warehouse. So now we have the code packed as a Docker container and put on SageMaker; everything runs perfectly, and we can scale this easily, as each pipeline can be scaled independently on different EC2 VMs on AWS. We have S3 as blob storage, where you can put files, images, and everything you can think of, as long as they're not tables, and we have Qdrant and MongoDB as storage.

And how do we execute these things? Everything is waiting for us to be executed through ZenML, and we have different options. The most obvious one is the ZenML dashboard, where you just go in and with a few clicks choose which pipeline to run. You can do it through the CLI, which offers tons of other integrations, or a pipeline can be triggered by another pipeline or by an event; for example, in our use case the data collection pipeline, after it ends, triggers the next pipeline, and so on and so forth, which I will detail a little later when we talk about how you can do continuous training with this trick quite easily. So, as I said: for storage we have these three guys, for compute we have SageMaker, and for triggers we have the dashboard, the CLI, or other events. (Sorry, Qdrant, that I mistyped your name.)

Okay, so the last part of the system is the automation: how you can actually automate all of this. So far we have how we gather the data and how we train. I know I haven't touched the training part, but that's Maxime's part of the book, and I don't have the visuals for it and didn't want to use his; I will treat the training pipeline as an abstraction, and from the MLOps point of view that's enough.

So we have all these pipelines that create our model, which ends up in our model registry, and that create our features and training data, which live in our logical feature store. How can we take this and operationalize it, do the Ops part of things? For example, when you change the code, how can you reflect all these changes in your production system? That's actually what MLOps is all about; and not only the code, but also the model: if the data changes and you want to create a new model, how can you do this automatically, without going to your computer, running scripts manually, and not being sure what's going on?

Okay, so the first part is to build the CI/CD pipeline. Again, this is actually classic software engineering and DevOps. A standard process is that you work with GitHub or a similar solution: you have the main branch and you create a feature branch. You can also have a staging environment and make this more complicated, but for the sake of the example we have a production branch and a feature branch. You work on your feature branch, and when you want to merge into your production branch, you first create a pull request, which automatically triggers linting, formatting, and testing. Linting and formatting check your code and how you format it: linting can actually catch bugs that are detectable at the syntax level, formatting checks how you write your code, and the tests are your unit, integration, and system tests, which you have to write to check that your code works well. If all this passes, you can merge your branch into main, and the continuous deployment pipeline triggers, which works as we spoke about before: we have the ZenML code, we build a Docker image out of it, and we put it in an ECR Docker registry. And that is enough, because from here we have two options: either we decide to trigger the pipelines manually, but this time from the ZenML dashboard, so with one click we can trigger our whole system; or we put another trigger in our continuous deployment pipeline to trigger our pipelines automatically: pull in the latest data, trigger all our ML pipelines, create a new model, and push that model to production. If you are here, it's actually quite easy to move forward.

Next, continuous training. This is the other part of the system, where we want to do continuous training, which basically means that with one click, or on a schedule, or on a REST API trigger, or whatever trigger you want, you create your next production model candidate in one move. With ZenML or another orchestrator this is actually quite easy; maybe it sounds daunting, but it's quite straightforward. You have the trigger that starts your data collection pipeline, then your data collection pipeline moves your data into MongoDB. And because we detached all these components through data warehouses, feature stores, vector DBs, model registries, and artifact stores, and every component has a clear interface and is not coupled to the others in the code, it's easy to just chain the triggers. Because we are using an orchestrator, the data collection pipeline can trigger the next one, the RAG feature pipeline, which knows to take the latest batch of raw data from MongoDB; or we could even version it and say: hey, I want this specific version from MongoDB, take it and create features out of it. Then this feature pipeline fires another trigger, for example to start the dataset generation pipeline, which we haven't talked about in this presentation because I only have 45 minutes and can't touch everything, but you get the idea. If you make your pipelines modular and detach them through all these elements, then you basically just create a DAG, a sequence of components that run sequentially, each one triggering the next, and the final output is the LLM weights, which are stored in the model registry.

On the deployment part, where you want to take the model and deploy it to production, you can adopt various strategies: check whether the new model's evaluation metrics are better than the previous model's, and then adopt techniques like A/B testing or canary deployment before actually pushing your model to production. This usually depends on the maturity of your system.
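The chained-trigger idea can be sketched as a toy DAG where each stage feeds the next and the final artifact lands in a model registry; the pipeline bodies below are placeholders, and an orchestrator like ZenML would wire the real triggers and storage:

```python
# Toy continuous-training DAG: one trigger runs the whole chain, and the
# final output is a versioned model in the registry.
model_registry = {}

def data_collection():
    return ["raw post 1", "raw post 2"]      # -> data warehouse (MongoDB)

def feature_pipeline(raw):
    return [r.upper() for r in raw]          # -> feature store

def training_pipeline(features):
    return {"weights": f"model trained on {len(features)} samples"}

def run_dag(version: str):
    # One trigger (click, schedule, or REST call) chains every stage.
    raw = data_collection()
    features = feature_pipeline(raw)
    model = training_pipeline(features)
    model_registry[version] = model
    return model

run_dag("v2")
print(sorted(model_registry))  # ['v2']
```

Because each stage talks only to shared stores rather than to the other stages' code, any stage can be re-run or swapped without touching the rest, which is exactly what makes the one-click retrain possible.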

And yeah, I think that one thing I want to touch on, because I see that I have two more minutes before the Q&A, and I forgot to highlight it at the beginning of the presentation, is that for MLOps, if you're just at the beginning, what you have to understand is that you need to use a feature store and a model registry, under one form or another. So the model registry: never store your models on your disk, on your GitHub, or whatever. Always, always store your models in the model registry, where the models are versioned, tracked, and can easily be shared with everyone from your team, from your company, or publicly, right?

And with a feature store, again, one of the main features is that it has a data registry, which we labeled here as a logical feature store because we built this data registry with ZenML artifacts, which were backed by S3 artifacts. So with the data registry, similar to the model registry, you can store your data, version your data, and share your data, with all the perks that come out of it. By using these two, you can easily have lineage, and you can always understand that a model was created with that version of the data and, if you use GitHub, with that version of the code, and through the metadata store you can also track your configuration and other things. And again, we label this as a logical feature store because, along with this data registry, we also needed the vector DB features for RAG. And another thing that you need for

production is to have two types of data access. One is offline, which is usually optimized for training, which in our case was the S3 artifact store, where we can easily store tons of data and access it in batch mode. The other one is an online store, which is optimized for low latency, which in our case is the Qdrant vector DB, where usually we request one item but need it really fast, as fast as possible. So to go to production you need these four elements: a model registry, a data registry, an online store, and an offline store, right? So these are things that you really need to consider when you design the system. So yeah, I think that's it.
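The two access patterns can be shown with toy in-memory stand-ins. These classes are illustrative only, not the real S3 or Qdrant APIs; they just contrast batch reads of versioned datasets with single-item lookups:

```python
# Toy stand-ins for the two data-access patterns: an offline store
# (S3-like, batch access for training) and an online store (Qdrant-like,
# single-item low-latency lookup for inference).

class OfflineStore:
    """Batch-oriented: write whole versioned datasets, read them back in bulk."""
    def __init__(self):
        self._datasets = {}
    def put_dataset(self, name, version, rows):
        self._datasets[(name, version)] = list(rows)
    def get_dataset(self, name, version):
        return self._datasets[(name, version)]  # whole dataset at once

class OnlineStore:
    """Latency-oriented: fetch one item by id, as fast as possible."""
    def __init__(self):
        self._items = {}
    def upsert(self, item_id, payload):
        self._items[item_id] = payload
    def get(self, item_id):
        return self._items[item_id]  # single-item lookup

offline = OfflineStore()
offline.put_dataset("instruct", "v1", [{"q": "...", "a": "..."}] * 3)
training_batch = offline.get_dataset("instruct", "v1")  # training reads in batch

online = OnlineStore()
online.upsert("chunk-42", {"text": "some chunk", "embedding": [0.1, 0.2]})
hit = online.get("chunk-42")  # inference fetches one item fast
print(len(training_batch), hit["text"])
```

Versioned dataset keys on the offline side give you the lineage mentioned above; the online side trades that richness for lookup speed.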

I'm in time, so you can find more on my socials, I like LinkedIn, and on my Decoding ML newsletter, which is also a blog now, where I talk in depth every week about these types of things, around LLMs, MLOps, LLMOps, and recommender systems, and more in my book. So if you enjoyed this, in our book we go a lot more in depth, also on the fine-tuning parts of this system: as I said, supervised fine-tuning, preference alignment, LLM inference optimization, how to generate a custom instruction dataset, a preference alignment dataset, and all this stuff around fine-tuning. You can buy it on Amazon and also on Packt's site, and the repository is open source, so you can play around with all of this. If you don't want to buy the book, the README is quite comprehensive and we tried to make it an independent experience. We also have the LLM Twin course, which is inspired by this book and is, I don't know, 20% of the experience of the book, where we touch again on parts of what I've talked about today. So yeah, I guess it's Q&A time.

Well, let's thank Paul for a wonderful presentation. I always learn so much whenever you come give a workshop. So let's do the Q&A. For this I'm gonna move some things around. We got a bunch of questions in the chat, so I think we can just dive right into it. Let's see... actually, I think Jorge... okay, so Jorge asks: what are your thoughts about using graph RAG or agentic RAG in this use case? How do you know what RAG method is the best

to use? From my experience, graph RAG sounds really promising. I'll be honest, I don't have a ton of experience with it yet. I started reading into it and it sounds amazing, but one problem that I've seen with it is that it usually has a lot of complexity and latency issues. So my approach would usually be to create a first iteration of the system as naive as possible and focus a lot on the evaluation of the system. It's really important to quantify the results and expectations of your system. Then, if you understand what you expect from it, you can start thinking about optimizing, because if you don't know what you're optimizing, you actually just follow the hype, or whatever you think is more interesting and cool. So yeah, that would be my approach. Yeah, just start as simple as possible, get the system actually working, and then you can worry about doing more sophisticated techniques. Good question. Yeah, sorry, go to the next one. Yeah, we got a lot of questions, so let's keep it rolling. Anas asked: is

there a real use case for RAG and LLM engineering for individuals, or is it primarily centered toward B2B? So I guess, maybe, like, ICs... the question... You cut out for a second. Yeah, so I guess the question is, like, for individual contributors, or, you know, just people working on their own locally, are there good use cases for RAG and LLM engineering? What do you think? Oh yeah, totally. Well, I think that one amazing use case is the LLM Twin, where if you're in content creation, and it's not necessarily business to business, it has tons of value. Another use case is research: if, for example, you want to find any kind of information, you just throw in a lot of documents and you start talking to these documents. From my experience, this is the most exciting part of it, and actually there are open source tools, so you don't need to build this kind of stuff yourself; they let you throw in some documents and chat with them. Yeah, another one could be: if you want to read the LLM Engineer's Handbook but you don't want to read the whole textbook, you can just throw it into one of these RAG systems and then chat with it and have, like, a Paul and Maxime guiding you through your development. You like your use case! Yeah, exactly. And here you can have two options: one is to create an overview of the system, like what I've created now for you guys, because maybe in the beginning you don't know what questions to ask, so you just want to explore it, and based on that you can start asking questions and digging into what you think is more

interesting. Yeah, that's good. Let's see, we got another one from Miguel, who asks: I'm trying to implement an ETL pipeline for job matching, any ideas or suggestions on how to tackle this? I personally think that this is more like a recommender system problem. So you have, like, a user and a company, and you want to see which has a higher matching score with the other. I've actually seen, and played around with, using LLMs as ranking systems, and most recommender systems are split into two parts: the retrieval part, where for a user you want to grab some candidates, which is more coarse, and then this ranking step, where you want to filter down and keep only what's most important, which is really similar to a RAG system in the end. And I think that you can use this ranking part, for example, if you want it just for yourself and you already have the candidates; you can dig into how to use LLMs for ranking in recommender systems to find more ideas on how to approach this. Yeah, that's cool, good question. We got another one from l i ID, I think, I butchered that pronunciation, so I apologize, but the question is how

do you choose the right chunking strategy for an LLM application? This is, I think, a very big question: the right chunking strategy, yeah, that's right, for your LLM application. Okay, yeah, this is a hard question. Personally, I think it's really dependent on your data, because by having domain knowledge you know how it makes more sense to chunk your data, right? So okay, of course I can tell you the standard: you can take a document and look at sections, split it by chapters, split it by sub-chapters, by sections, even work around tables, images, and all these areas, and go even down to paragraphs, sentences, and all of that. Of course that's important to consider. Other options are semantic chunking, which kind of looks at parts that are correlated with each other. But if you have the domain knowledge, like if it's financial or, I don't know, medical, all of this, I think you have a much better intuition on how to pick these delimiters than just trying to generalize it. So yeah, other options are these graph-RAG-like systems, where you try to keep the relationships between the chapter, the section, and all this, so you kind of keep this narrowing-down relationship, and when you find a small part of a paragraph, or a paragraph, you can grab the context of the bigger part of the system. Or there's also this contextual retrieval implemented by Anthropic, or something like that, which creates this context and adds it directly to the chunk. Yeah, this is more an art than a straight answer, I guess; it's so vague
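The structure-first approach described here (split by sections, then fall back to paragraphs when a section runs long) can be sketched in a few lines. The delimiters below ("## " headings, blank-line paragraphs) are assumptions; in practice, as the answer says, you'd pick them with domain knowledge of your documents:

```python
# Toy structure-aware chunker: split by section headings first, then fall
# back to paragraph splits when a section exceeds max_chars.

def chunk_document(text, max_chars=200):
    chunks = []
    # First pass: split on markdown-style section headings.
    sections = [s for s in text.split("\n## ") if s.strip()]
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Second pass: split long sections into paragraphs.
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append(para.strip())
    return chunks

doc = (
    "## Intro\nShort section.\n"
    "## Details\n" + "First paragraph. " * 10 + "\n\n" + "Second paragraph. " * 10
)
for c in chunk_document(doc):
    print(len(c), c[:30])
```

A graph-RAG or contextual-retrieval variant would additionally keep each chunk's link back to its parent section, so the bigger context can be fetched at query time.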

usually. Yeah, endless approaches and strategies, and it sounds very much dependent on the specific use case. Let's see, we got another... Exactly, so it's really hard to give a broad-brush answer, a general answer, to the question. So we got another one from Jorge: what methods and frameworks are you using to evaluate the LLM Twin's outputs? So, what methods and frameworks?

So this was done actually by Maxime in our specific book, and we haven't used any frameworks, which many times is fine. We actually used an LLM as a judge. So when it comes to evaluating LLMs, you kind of have two big options... actually, three big options. One is using heuristics, which, if you can't extract the output of the LLM in a very structured and deterministic way, is not a good metric. For example, if you use an LLM for classification, or text extraction, entity extraction, then you kind of know what you expect, so the answers are black and white; in this case you can use classic heuristics, and they're perfectly fine. But if you move away into a more, like, generation space, where two pieces of text are both correct but just formulated differently, here you usually have two options. One is these embedding-based scores, which I think are called semantic metrics, where basically you embed the ground truth

and your answer, and use cosine distance or others to see the semantic difference. And the third option is using an LLM as a judge. When you're thinking about LLM-as-a-judge, the nice thing about it is that you can also have a ground truth, where you put an LLM to compare the two, but you can also not have a ground truth, where you ask an LLM to dig into the text for specifics. For example, in our LLM Twin course we created a custom style metric, where we asked the judge... we created multiple scores from one to five, I recall, where we ask it: hey, if the text is super general and contains super complicated words and it's not like a blog post, then give it a score of one, and as it gets closer to the expected value, give it a score up to five, which was the highest. And there you don't need a ground truth, you just need to explain, for each threshold, what you expect from the output, which I think is really powerful, because many times you don't have ground truth; this is, like, a big problem in AI. So judges are super handy in this

case. Yeah, for sure, eval is a huge space, but I guess we're lucky in that we can use other LLMs to evaluate our LLM systems. We're running short on time, but... oh, sorry, go ahead. Yeah, one last thing: you can use tools like Opik, which actually have these LLM-as-a-judge prompts very well tested, so they basically come with out-of-the-box, well-tested prompts for various scenarios. Okay, yeah, that's a good call.
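The ground-truth-free style metric described above can be sketched as a rubric prompt plus a judge call. Here `call_llm` is a stub standing in for a real LLM API; the rubric wording and the proxy scoring inside the stub are assumptions, included only so the example runs:

```python
# Sketch of an LLM-as-a-judge style metric without ground truth: describe
# each score threshold in the prompt and ask the judge for a 1-5 rating.

STYLE_RUBRIC = """Rate the following text from 1 to 5 for writing style:
1 = super general, uses complicated words, reads nothing like a blog post
3 = somewhat specific, mostly plain language
5 = concrete, simple words, reads exactly like a good blog post
Answer with a single digit.

Text: {text}"""

def call_llm(prompt):
    # Stub: a real implementation would call an LLM API here. This fake
    # judge rewards shorter average word length as a proxy for "avoids
    # complicated words", purely so the example is runnable.
    text = prompt.rsplit("Text: ", 1)[1]
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return "5" if avg_len < 5 else "2"

def style_score(text):
    reply = call_llm(STYLE_RUBRIC.format(text=text))
    return int(reply.strip())

print(style_score("We ship small posts fast."))
print(style_score("Utilizing multidimensional paradigms facilitates synergistic outcomes."))
```

The key design point is that the rubric text itself carries the expectations for each threshold, so no reference answer is needed; tools like Opik supply pre-tested prompts of this shape.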

So I do want to take two more questions, and then we'll wrap up. We'll go a little bit over, so we won't get offended if you have to drop, but I just want to get at least two more questions. So let's see... no, it's fine. Okay, cool. So the next question asks: you talked about the feature store and continuous training; how do you think about these for large language models? What are features, and how can we train them? Well, yeah. Right, so when it comes

to NLP and large language models, I think the term "feature" is a little bit in a gray area. Usually a feature is what you feed directly to a model, but in our use case you usually take the text, tokenize it, and put the tokens into the model. So if you go by the book, the features are actually the tokens, like those token IDs that you feed into a model, and then you have embeddings associated with them that go into the Transformer architecture. But I think it's easier to think a little bit higher level and consider the text as the feature. Just at a human level it's more intuitive to see it that way, because usually you take the tokenizer, which is model dependent, and attach it to the model; like this, the features are more independent of the model. So you can take the text, which is stored and versioned and all that in the feature store, and you can take different tokenizers with their models and try different experiments. I think it's more intuitive and more flexible to do it this way. I hope that responds to your question. It's interesting: as machine learning evolves, I guess our terminology, and what we call our technologies, may have to

evolve as well. So let's do this last question. Yeah, it's really hard to go by the book sometimes; maybe that will annoy some people, but yeah. So let's do this one last question, I think it's a good one. It's on security. So for context, Steve was asking about testing for vulnerabilities as outlined by the OWASP AI Exchange. He asked: could you elaborate on how security testing is integrated into the pipeline on different deploy environments, or is it managed separately? Yeah, I'll be honest with you: security is my Achilles' heel, so I don't really know. On this security stuff, I usually pass it to other people, because, yeah, give it to the other guy.

So let's do another question then. We got one more from Basel: what's your preferred observability platform for reinforcement learning fine-tuning, or, I guess, maybe just fine-tuning in general? So I think that fine-tuning and observability are two different concepts, though reinforcement learning from human feedback can indeed have some connections between them. For example, again, I played a lot with Opik in the past months, and Opik lets you gather your feedback scores while you're monitoring your inference pipeline. So with an observability platform or tool, you gather these scores and aggregate them into a future dataset; that's what you're doing at the observability step. But while you're fine-tuning, you actually need an experiment tracker, for example Comet ML, Weights & Biases, MLflow, Neptune. And that's where you take the data you created with the observability tool, take a version of it, fine-tune your model, and track your metrics, like loss or other evaluation metrics you need during training. Yeah, that's how I would do it. Yeah, I think those observability platforms you mentioned are super helpful, like Weights & Biases and all

the... I mean, the countless others that are out there. It's super cool that with MLOps, you know, you don't have to code everything from scratch anymore; you can use these ready-to-go platforms. Yeah, otherwise it would be a nightmare. Exactly. So let's wrap up; we went a little bit over. I dropped Paul's socials in the chat, so I'm sure you can reach out to him if you have any questions that we weren't able to answer here. I just want to thank Paul again for such a wonderful workshop, and thank you guys in the audience for participating. You'll receive a follow-up email with the recording of this event, and then you'll get a link to all the upcoming events we have with The Data Entrepreneurs. So with that, thanks again, and we'll see you in the next one. Take care. Thank you for having me here, bye. Thanks, Paul.
