
Chroma Vector Database: Retrieval for LLMs (Hammad Bashir + Liquan Pei)

By CMU Database Group

Summary

Topics Covered

  • Language Models Hallucinate Without Retrieval
  • Vector Workloads Demand Real-Time Updates
  • Closure Clustering Fixes Inverted Index Recall
  • Hybrid Index Separates Storage and Compute

Full Transcript

The Databases for Machine Learning and Machine Learning for Databases seminar series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is

made possible by Google and by contributions from viewers like you. Thank you.

Hi guys, welcome. It's the last talk of the semester. We're excited today to

finish things off with the guys from Chroma. Hammad and Liquan are from the founding team of Chroma, and it's another vector database; they're here to talk about what makes them special and what makes them different. So as always, if you have

any questions for the Chroma guys as they're giving the talk, please unmute yourself and say who you are, and feel free to do this at any time; that way they're not talking to themselves on their own Zoom. So with that, Hammad and Liquan, thank you so much for being here. The

floor is yours, go for it.

Awesome, thanks Andy, and thanks to the course staff for putting this on. This is definitely something I wish existed when I was in school. It's cool to hear from everyone and it's been really fun; I've been following along with the course throughout the semester and watching everyone's talks. So hopefully today we're going to talk about Chroma, and hopefully it won't be too much that you've already seen in other talks. We'll keep it interesting and talk a little bit about what we think is our unique perspective on vector databases, and specifically our unique perspective on what's needed to build an end-to-end retrieval system for large language models.

So before I start, I'll talk a little bit about Chroma. The project itself was launched in February of this year. It's been a tremendous year for AI; I know it's like the peak of the hype cycle right now, but we've had roughly 850,000 downloads per month just on PyPI alone, and we're seeing like a million machines running Chroma per month. We've been used in over 12,000 GitHub projects, and it's been really cool to engage with the open source community; there are like 80, you know, full-time contributors that are helping us a lot with small fixes and giving us feedback, and a core team of five, soon to be six or seven people.

And when you think about retrieval, and when you think about vector databases in the context of language models, it's important to think somewhat about language models themselves. So I'll start

with the basics, right: what is a language model? Naively, we can look at it this way: we have some query or some question, and we get out some answer. This is a really naive perspective of a language model, but hopefully one that makes sense to everyone by now. You've probably used things like ChatGPT or plenty of the open source models: you give some question and you get some answer. And really what this looks like is, we want to ask the language model, "Hey, what is relational algebra?", and perhaps the language model has been trained on Wikipedia, so it regurgitates to us the definition that we might see if we searched for relational algebra on Wikipedia, which is a theory that uses algebraic structures for modeling data and defining queries on it with well-founded semantics.

But what does the language model actually do when we ask it this question? First off, it takes your text and tokenizes it. There are various tokenization schemes, but all of them basically amount to this: you have some fixed set of tokens, some N potential tokens that you can turn your text into, and quite literally we just get an array of integers; that's what we turn the text into. We pass that into the language model, and it tells us, for all possible tokens that we've trained the model on, the probability that we might generate each one next. We do this repeatedly, in an autoregressive fashion, generating the entire string. And this is important to understand, because if you do something this way, you're going to be quite prone to potentially generating the wrong piece of information. So a huge problem with language models is that they, quote, "store knowledge in their parameters"; it's this parametrically trained system.
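The tokenize-then-sample loop just described can be sketched with a toy stand-in for the model. Everything here is hypothetical: the six-word vocabulary, the bigram table playing the role of a Transformer, and the helper names are made up for illustration, not any real tokenizer or model.

```python
import numpy as np

# Toy illustration: tokenize -> model -> next-token distribution ->
# pick a token -> repeat (autoregressive decoding). The "model" is a
# hypothetical stand-in (a fixed bigram table), not a real Transformer.
VOCAB = ["<eos>", "relational", "algebra", "is", "a", "theory"]
TOK = {w: i for i, w in enumerate(VOCAB)}

def tokenize(text):
    # Real tokenizers (BPE, WordPiece, ...) are more involved; the output
    # is still just an array of integers.
    return [TOK[w] for w in text.split()]

def next_token_probs(tokens):
    # Stand-in for a language model: a probability for every token in the
    # fixed vocabulary, conditioned on the prefix.
    table = {TOK["relational"]: "algebra", TOK["algebra"]: "is",
             TOK["is"]: "a", TOK["a"]: "theory", TOK["theory"]: "<eos>"}
    probs = np.full(len(VOCAB), 1e-6)
    probs[TOK[table.get(tokens[-1], "<eos>")]] = 1.0
    return probs / probs.sum()

def generate(prompt, max_tokens=10):
    tokens = tokenize(prompt)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        nxt = int(np.argmax(probs))  # naive greedy: always the most likely token
        if VOCAB[nxt] == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(VOCAB[t] for t in tokens)

print(generate("relational algebra"))  # relational algebra is a theory
```

The model never "knows" anything here; it just keeps emitting the most probable next token, which is exactly why a real model can confidently emit a wrong continuation.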

We have a potentially deep nested sequence of Transformer encoders or decoders, and we are creating a language model that learns the probability of certain strings. But there are some problems with that, right? We can't update them in real time; we can't update them deterministically; we can't provide provenance for their knowledge, so how do we get a language model to tell us where it got this information from? They also have a tendency to hallucinate, because they're just modeling language: they're predicting the next most likely token. People argue about whether or not these things are learning world models; we can debate that separately (please email me if you're curious about these things, I find them fascinating), but some people will talk about how they have a tendency to hallucinate because they are just predicting the next token: we are always going to blindly sample the next most likely token in the most naive

decoding scheme for a language model.

So a technique that many people have been using is what's called retrieval-augmented generation, and the idea is pretty simple: you combine parametric and non-parametric memory. This comes from a paper by Patrick Lewis and team at FAIR, I think about two years ago, and the idea is pretty simple: as opposed to just taking some text and giving it to a language model, can we take some text and give it to a language model in addition to some documents that might help ground the language model's generation, to generate some output y? More concretely, what this looks like is: we have some corpus of documents, we figure out some scheme to help us select some subset of those documents, we literally attach them to the input text for the language model, rerun our generation scheme, and get some output text y. And there are a couple of different decoding schemes for this. In the most naive decoding scheme, what we might do is generate the entire sequence full-shot, with all of the documents attached to the prompt; so if you're asking the language model "what is relational algebra", perhaps you might give it some sequences that seem relevant to answering that question, and then you just have it one-shot generate. In another approach, you might actually attach one document at a time and do many runs of the language model in parallel, and then compute the marginal probabilities across all of those. In another scheme, you might even do this lookup per token. The point I'm trying to elucidate here is that there are many different small variations on the same strategy, and developers often struggle with this: which one do I use? Which one will work best? How do I reason about what's better for my specific use case? How do I try them in real time, or online, and evaluate the differences?

These are all problems that Chroma hopes to solve one day.

Specifically, for the way we actually do this retrieval step, this problem of "I have some set of documents and I want to get the right subset", the technique that most people have been using is dense vector similarity search. The basic idea, and I'm sure you're familiar with it, is that we use some sort of dense encoding model, usually a learned model, and we map text passages to vectors; then we perform similarity search, which is usually just raw k-nearest-neighbor search for a given query. Similarity can be any metric; in this case, for the paper I'm citing here, it was the inner product (they actually trained the embedding model on the inner product). And what you quickly see is that, compared to a scalar ranking scheme like BM25, the accuracy on many retrieval datasets instantly jumps quite a bit. For example, on Natural Questions, which is a retrieval benchmark that I think highlights spans of Wikipedia based on human annotations, the learned embedding model performs a lot better really quickly, even after being exposed to only a thousand training samples, and after we expose it to all of the training samples it performs about 20% better. If you're curious you can read more in the paper, but the point I'm trying to make is that by using dense similarity search you can get much more accurate and contextual documents for the problem you're trying to solve at any given moment.

What this actually looks like, concretely, is: you input some documents D1 through Dn, and then you chunk these into small chunks, because the embedding model has some fixed window within which it can take a piece of text and turn it into an actual embedding. Then we take each chunk, tokenize it, and embed it. And then, because, if you remember from the previous slide, it's sometimes useful to combine traditional BM25 ranking with vector ranking, you might also want to index the chunks in a scalar fashion: perhaps you want to build a full-text search index over the documents, or perhaps you want to build some other sort of ranking index over the scalar components of your document. Then we take each chunk and index its vector component, and afterwards you might query these using the vector index, and maybe apply some re-ranking using a learned re-ranking model. This is generally the pipeline that people follow, from raw document all the way through to retrieved results: it's a combination of many different pieces, each with their own heuristic knobs and relevant strategies, and tuning these is where developers spend a huge proportion of their time and run into a huge number of potential pitfalls.
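As a concrete toy instance of that pipeline, here is a minimal sketch: fixed-window chunking, a hypothetical hash-seeded embedding in place of a real model, and exact inner-product kNN in place of a real vector index; the BM25 index and learned re-ranker from the diagram are left out:

```python
import numpy as np

# chunk -> embed -> index -> query, end to end, with toy components.
def chunk(doc, window=40):
    # The embedding model has a fixed window, so split the document.
    return [doc[i:i + window] for i in range(0, len(doc), window)]

def embed(text, dim=16):
    # Hypothetical stand-in for a learned dense encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class BruteForceIndex:
    def __init__(self):
        self.vectors, self.chunks = [], []

    def add(self, doc):
        for c in chunk(doc):
            self.chunks.append(c)
            self.vectors.append(embed(c))

    def query(self, text, k=3):
        q = embed(text)
        sims = np.array(self.vectors) @ q   # inner-product similarity
        top = np.argsort(-sims)[:k]         # exact kNN: no pruning at all
        return [self.chunks[i] for i in top]

idx = BruteForceIndex()
idx.add("Relational algebra is a theory that uses algebraic structures "
        "for modeling data and defining queries on it.")
print(idx.query("what is relational algebra", k=2))
```

Even in this toy, the knobs are visible: the chunk window, the similarity metric, k, and (in a real system) the choice of embedding model and re-ranker all interact, which is exactly the tuning burden being described.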

And so our perspective on this is that there should be tools in the market that let people not have to think about these individual pieces in isolation, not build each system in isolation, but instead end-to-end systems that help you do all of these together in an easy way, and iterate on the system as a whole slowly over time, as opposed to having to take each problem, decompose it into its own respective system, and put them together manually.

So there are a couple of use cases for this retrieval workflow that have been emerging over the last year or so. The first is dialogue: we've all used ChatGPT by now, and you might have some additional data that you put into the context. You want to have some dialogue with the model, but you also want to give the model access to personalized or private data, and so this data corpus is something that is personal to the specific user in mind, or personal to the specific team using the product. Another use case is agents. This is more new and emerging, but one we're pretty excited about. The idea is you have either autonomous or somewhat managed agents that interact with external services and systems. As a small example, perhaps you have some sort of system that manages your DevOps for you: you have an agent that understands your AWS setup, and you can just say, "Hey, I want to set up CI/CD for this workflow", and it goes and does that, because it knows how to look at your system, and then it stores the skills it learns in a vector database. This pattern of storing learned knowledge, learned experience, and learned skills in a vector database is one that has recently started to become studied; a paper we've been excited about, which happened to use Chroma, is called Voyager, where they basically train a Minecraft agent to play the game of Minecraft, and it stores its learned skills inside of a vector database. And another one is autocompletion, or predictive AI, which is, I think, the most

relatable example for this audience: imagine something like GitHub Copilot. You're writing code, and we're able to predictively fetch the right pieces of documents, the right pieces of code, in order to help ground a language model's generation based on what you're doing. So if you're writing some code, it might be useful to fetch code similar to the code you're writing, to give the language model a few-shot prediction of what code it should suggest you write.

These use cases entail a very specific workload shape, and the argument I'll make is that this workload shape is rather different from what vector indices were originally created for, and from what vector search research traditionally focused on, which was very large scale, very high throughput datasets; usually it was one index, you built it offline, and you updated it in batch. This workload is rather different. We have datasets where the QPS is relatively marginal: we're not shooting for one index with very high queries per second; most of our customers are okay with 100 to 1,000 QPS on a given index. But the data is also very real-time; it's not updated in batch, and so the data lag they're willing to tolerate is somewhere on the order of 100 milliseconds. And their data scale is fundamentally relatively moderate: on the order of one to 100 million vectors. We're not dealing with datasets of many billions of embeddings; a personalized dataset for a chat dialogue bot doesn't have many billions of embeddings, it has somewhere on the order of one to 100 million. But these embeddings might be frequently updated and deleted. Take, for example, document editing software: it needs to frequently update and delete pieces of documents as users are editing them, so that an agent working with you to suggest what you may or may not want to write is able to respond to the updates and deletes in the document. But I think

the biggest difference to me is the number of indexes in any given data system. You might actually end up having an index per user, and so you might have on the order of 100 million indices for a reasonable workload at many companies. And you want a very high recall target out of these, because you're conditioning a language model's generation on them, and it's been shown that language models are quite poor at ignoring data they don't understand to be related to the task they're trying to perform. So it's very important, and we can't really tolerate much deviation in recall. One thing that I think is also pretty different is that, because of the curse of dimensionality, the vectors we're dealing with are quite large. If you go look at a lot of the literature on vector indices, the dimensionalities of the vectors tend to be quite small, 96 to 120 dimensions, and those are most of the benchmarks you'll see in most of the papers that have been put out in the vector database world. But the vectors we deal with are much larger than that, and this makes a really big difference in the number of distance comparisons you end up doing when searching vector indices. The dimensionality we deal with is on the order of somewhere between 768 and 4,096 dimensions, which is much, much larger than the dense vectors you saw from vector search in older workloads. And usually what

people do, right, is use some sort of approximate nearest neighbors index to deal with the fact that you can't exhaustively search the whole solution space. Broadly, there are two classes of algorithms in use. The first is inverted indices: the idea here is you create some centroids over your data and assign data points to a given centroid. Usually this is used in conjunction with some sort of compression scheme, usually product quantization, so you're compressing the individual points and then assigning them to a given centroid. All of these algorithms are basically trying to reduce the number of distance comparisons you do by pruning the search space that is eligible to you. The other commonly used algorithm class is the proximity graph, which I'm sure you've seen from other people in industry: we create a graph expressing neighbor relationships. These often approximate a Delaunay or relative neighborhood graph; those are sort of the inspiration, and the mathematical intuition you can have there is that the Delaunay triangulation is the dual of the Voronoi diagram, and the Voronoi diagram obviously is a good tool for dividing up space. The relative neighborhood graph is closely related to the Delaunay triangulation of any set of points, and so that also gives us some intuition for why a proximity graph, where any point is connected to nodes in its local region, might be useful as a way to divide up space and search through it.

So I'm going to talk a little bit more about inverted indices, which are a commonly used index type in industry. What I'd like to do is walk through inverted indices and walk through commonly used proximity graphs, and talk about why they're not the best fit for the

workload we're targeting, and then I'll talk a little bit about what exactly we're doing and how it's somewhat different.

Traditionally, when people use inverted indices, they do so with product quantization. The classic workflow is: you take your chunks of documents that we discussed earlier, which are of dimensionality D, run a clustering algorithm, usually k-means, and end up with some list of centroids. Then we quantize our chunks: they're D-dimensional, and we do some sort of quantization; perhaps we divide them into six subspaces and get a dimensionality reduction factor of six. Then we take all of our chunks and assign them to the closest centroids. At query time, we find the closest centroids: you have some query, you find your closest centroids, and we compute the distance to the assigned centroids; that's how we know which posting lists to actually search. Then we re-rank, because the quantization step reduces the precision of our comparisons for each posting list by quite a bit, so we recompute the full-precision distance for each final list. So this is a very commonly used approach.
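The steps just described, cluster, assign, probe, re-rank, can be sketched as follows. This is a minimal illustration (a toy k-means, and the product-quantization compression step is omitted), not a production IVF implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=10):
    # Toy k-means: random init, then alternate assign/update.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids

def build_ivf(X, k):
    centroids = kmeans(X, k)
    assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    lists = {j: np.where(assign == j)[0] for j in range(k)}  # posting lists
    return centroids, lists

def ivf_query(q, X, centroids, lists, topk=5, nprobe=2):
    # Probe only the posting lists of the nprobe nearest centroids.
    order = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[j] for j in order])
    dists = ((X[cand] - q) ** 2).sum(-1)  # full-precision re-rank
    return cand[np.argsort(dists)[:topk]]

X = rng.normal(size=(1000, 32))
centroids, lists = build_ivf(X, k=16)
hits = ivf_query(X[0], X, centroids, lists)
```

The recall failure mode discussed next is visible in the shape of `ivf_query`: if the true neighbors fall in a posting list that isn't among the `nprobe` probed lists, they can never be returned, no matter how good the re-rank is.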

It ends up suffering quite a bit on recall, though; you don't get the answers as accurately as you'd want. So what people end up doing is they way over-query their index, by a factor of 100 to sometimes 1,000x, and then re-rank that list. And if you actually measure the number of distance comparisons you end up doing there, it begs the question of whether there are things we could be doing that might be slightly better, and other people have found such techniques.

"When you say 'way over-query', you just mean like LIMIT a thousand?" Yes; like, if they want 10, they'll do LIMIT a thousand, sometimes 100 to 1,000x the number they actually want to retrieve. They just set k quite high. "Yeah, thanks."

And so the other vector index that's commonly used is a proximity graph algorithm called HNSW, or Hierarchical

Navigable Small Worlds. This is work from Yury Malkov and Dmitry Yashunin, and the idea is really simple and really elegant: can you actually just build a traditional proximity graph? So you have some graph, and, without getting too much into the details, what really matters for our conversation today is that it is a graph, and graphs are often poorly stored on disk. Because it is a graph, what they do is, inspired by skip lists, they build layers on this graph: the top layer of the graph has the longest-range connections, and the bottom of the graph has smaller and smaller connections. The way you can think of this is that the diameter, or the radius, of your graph is shrinking as you go down; the number of hops from any one node to the furthest node is decreasing. And the other trick they employ, in addition to this hierarchy of graphs, which takes the search from polynomial time to logarithmic time, is a heuristic I think is really important to pay attention to: oftentimes, as you're building proximity graphs, you end up with points that sit on the edge between two clusters, and so they have a heuristic that encourages cross-cluster connections. This sort of problem comes up a lot when you're partitioning space, and we'll talk a little bit later about how you can apply the same heuristic to inverted indices and end up with something a little bit better than the traditional inverted index approach. But the key thing for our conversation, even if you don't understand how HNSW works at all, is that it's a graph, and graphs are not that great to store on disk: the access pattern is really random, and people have done

work, for example on DiskANN, where you can build the graph in such a way that it's more amenable to disk, but at the end of the day these things are really hard to reason about. The fundamental access pattern of a graph is very random; the graph is actually designed to have long-range connections that move all over the data, and if you just looked at the page accesses, if you divided your graph into some number of pages and looked at the page accesses across the graph, you'd end up touching a very large percentage of the graph, in terms of the number of pages you touch, when you do a query. So what people do is store the entire graph in memory, and that quickly becomes a bottleneck, both from a cost perspective and just from a usability perspective: how do you think about deploying something when everything you have to touch will always live in memory? And then another problem with HNSW is that, because of its design, deletes tend to degrade the graph, and the commonly used solution is to just rebuild the HNSW index entirely; so let's say 20% of the graph gets deleted, we'll just entirely rebuild it. Inverted indices also have their own set of problems. The k-means clustering step often poorly partitions

the space, so you suffer from low recall. It's also very hard to think about how many centroids to search, and it's often used in conjunction with PQ, which brings the over-querying problem I mentioned, and this is very hard for people to reason about in terms of tuning: every query might need a different number of probes, a different number of centroid lists that you actually want to search. And also, if you have real-time data, which is quite important to us, your data can drift quite a bit after the centroid step: if you are indexing data over time, the centroids have to be a representation of the distribution of the data at all times, but you are only building these centroids a piece at a time, and that's going to lead to drift in your actual centroid set, which makes the inverted index have poorer recall over time.

So there's some interesting work done at Microsoft, about two years ago, where they made an algorithm called SPANN. They took the idea: how can we combine these two approaches? We store the centroids in a graph in memory, and we keep the posting lists on disk. So we do our centroid search, but we don't do brute force over the centroids; we store them in an approximate nearest neighbors index, and then we store the actual posting lists on disk. What's interesting here is that they keep a very high centroid count: 10 to 20% of the data is actually made into centroids, so if you had a million-scale dataset you'd have 100,000 centroids. This saves you a lot of memory and lets you push a lot of the workload to disk, but also, now you have a very fast layer in memory and you can keep your lists very small. And when you can keep your lists very small, because you have a very high centroid count, it's very easy to reason about the disk access pattern, because you know exactly how many lists are going to be accessed; you can plan your I/Os really well and you can fetch these

posting lists off disk quite effectively.

Then there are the other tricks they employ; you'll recall I mentioned points on a boundary. What you can do is assign points to multiple centroids; they call this closure clustering. The idea is really simple: when you assign a point to its closest centroid, we will also attach the point to other centroids, but only if its distance to that centroid is less than the distance to the closest centroid plus some epsilon. That is, we attach a point to multiple centroids if we feel like it might be on the boundary between multiple centroid clusters. And then we also follow a relative neighborhood graph rule for centroid assignment, because we don't want to overpopulate all of our centroid lists if a boundary point is overly replicated. What we do is we say: we're going to skip assigning a point to a cluster if the distance between that point and the cluster is greater than the distance between the two clusters. So in this diagram, we have an orange point being assigned to potentially this bottom-left cluster and this bottom-right cluster, and we don't actually assign it to the bottom-left cluster, because the distance between the two clusters, the one it's already been assigned to and this one, is less than the distance between the point and the cluster itself. That means the clusters probably overlap, and so at query time, when we go to look up a point in this region, we're probably going to end up looking in both lists anyway, so it doesn't make any sense to store the point twice. It's a really simple rule.
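Both rules, assignment to additional centroids within an epsilon of the closest, and the RNG-style skip when two clusters already overlap, can be sketched together. `closure_assign` and the multiplicative epsilon form used here are illustrative, not SPANN's exact formulation:

```python
import numpy as np

# SPANN-style closure clustering sketch. A point is always assigned to
# its closest centroid, and additionally to any centroid within
# (1 + eps) of the closest distance (it may sit on a boundary), UNLESS
# the relative-neighborhood-graph rule says the two clusters already
# overlap: skip centroid c if dist(point, c) > dist(closest_centroid, c),
# since a query near the point would probe both lists anyway.
def closure_assign(point, centroids, eps=0.1):
    d = np.linalg.norm(centroids - point, axis=1)
    closest = int(np.argmin(d))
    assigned = [closest]
    for j in range(len(centroids)):
        if j == closest:
            continue
        if d[j] > (1 + eps) * d[closest]:
            continue  # not near enough to the boundary
        if d[j] > np.linalg.norm(centroids[closest] - centroids[j]):
            continue  # RNG rule: clusters overlap, storing twice is redundant
        assigned.append(j)
    return assigned

centroids = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
print(closure_assign(np.array([0.51, 0.0]), centroids))  # boundary point -> two lists
print(closure_assign(np.array([0.0, 0.1]), centroids))   # interior point -> one list
```

The same epsilon-style distance test is reused at query time, as described next, to decide how many centroid lists a given query actually needs to probe.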

If you do the quick analysis of how many lookups it saves you, it's quite a bit. Then the last thing they do is observe that, when you query, not all queries need to visit the same number of centroids. So they apply the same pruning rule we applied for assigning points to centroids, but at query time. The observation is that different queries only need to access a different number of centroids, so you can actually prune the centroids when you're doing the query using the same distance rule: we won't query a centroid unless we find that it is within some epsilon of the centroid closest to the query point. And these three tricks make a huge difference in the recall performance you get out of inverted indices. So what you'll see on the bottom here is a graph where the blue line is traditional inverted indices; I forget what dataset this is, to be frank, but trust me, the results hold. You have a blue line that is just the result of traditional inverted indices, and then what we do is assign points to another four centroids, then eight centroids, then ten centroids, and really quickly you can see a very large jump in recall performance from this really simple trick of dealing with points on the boundary.

Another small trick we've been playing with is called ADSampling, and the observation here is that the bulk of the

time spent searching is spent rejecting candidates based on distance comparisons. When you actually look at the time spent, the majority goes to rejecting candidates; we have graphs here of distance comparison operations in HNSW and IVF, and you can see the bulk of the time in these DCOs, these distance operations, is spent rejecting candidates. And the idea is that when you're actually searching your graph, all you care about is: is the point I'm looking at further away than the top of the stack of the closest points I've seen? Is it worth adding to my top-k list? So what you can do, instead of having to do the full distance comparison, is compare just a subset of the dimensions. Intuitively, if you had two points in Euclidean space that are really far apart, you don't need to do the math for every dimension; at some point, the math for some subset of dimensions meets some threshold where you can prune that point from your search entirely. What they do in this paper is use a result in high-dimensional analysis known as the Johnson-Lindenstrauss lemma, which basically says that if you take a point and apply some random orthogonal transformation to it, you can bound the norm of that point, the distance to that point, within some epsilon. And they come up with a hypothesis-testing procedure based on this bound from the Johnson-Lindenstrauss lemma that lets them, when doing their search, flexibly compare only a subset of the dimensions for any given point, and this ends up speeding the search up quite a bit.
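A simplified sketch of that test: rotate everything once with a random orthogonal matrix (which preserves distances exactly), then accumulate the squared distance over a growing prefix of dimensions and prune as soon as a scaled-up partial estimate clearly exceeds the threshold. The slack factor `(1 + eps0/sqrt(m))` here is a stand-in for the paper's exact hypothesis test, not its precise form:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(dim):
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def adsampling_reject(q, x, threshold_sq, step=32, eps0=2.0):
    """True if x is, with high probability, farther than the threshold from q."""
    dim, partial = len(q), 0.0
    for m in range(step, dim + 1, step):
        partial += float(((q[m - step:m] - x[m - step:m]) ** 2).sum())
        est = partial * dim / m  # scale the m-dim partial up to all dims
        if est > threshold_sq * (1 + eps0 / np.sqrt(m)) ** 2:
            return True  # prune without touching the remaining dimensions
    return False  # survived: 'partial' now holds the exact squared distance

dim = 768
T = random_orthogonal(dim)
q = T @ rng.normal(size=dim)
far = T @ (10.0 * rng.normal(size=dim))       # obviously-bad candidate
near = q + T @ (0.01 * rng.normal(size=dim))  # genuinely close candidate
threshold_sq = 4.0 * float(((q - near) ** 2).sum())
print(adsampling_reject(q, far, threshold_sq))   # far point is pruned early
print(adsampling_reject(q, near, threshold_sq))  # near point is fully compared
```

At 768 to 4,096 dimensions, rejecting the far candidate after the first 32 dimensions instead of all of them is exactly where the speedup comes from.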

So our index combines these two approaches. We divide the data into real-time and historical segments; we index real-time data into an HNSW graph, and we compact historical data into inverted indices, using the few tricks from the SPANN paper I just mentioned: closure clustering, the relative neighborhood graph rule, and flexible query pruning. And when we query, we use ADSampling to reduce the overall distance computations required for any one query. We store the centroids in HNSW, and we store our posting lists separately. Using an inverted index approach, while increasing the latency of our queries a little bit, allows us to separate the storage and compute layers of our architecture, and this leads to a lot of benefits, both for developer experience and for general operational flexibility.

And so now I'll pass it off to Liquan to talk a little bit more about the core architecture and how the index actually lives inside of the full distributed architecture.
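Schematically, the hybrid read path looks like this: search the fresh in-memory segment and the compacted IVF segment independently, then merge by distance. This is an illustrative sketch of the design just described, with exact search standing in for the HNSW graph; it is not Chroma's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)

class RealtimeSegment:
    """Fresh writes, held in memory (stands in for the HNSW graph)."""
    def __init__(self, dim):
        self.X = np.empty((0, dim))

    def add(self, v):
        self.X = np.vstack([self.X, v])

    def search(self, q, k):
        if len(self.X) == 0:
            return []
        d = ((self.X - q) ** 2).sum(-1)
        return [(float(d[i]), ("rt", int(i))) for i in np.argsort(d)[:k]]

class HistoricalSegment:
    """Compacted data in an IVF layout: probe the nprobe closest lists."""
    def __init__(self, X, k_centroids=8):
        idx = rng.choice(len(X), k_centroids, replace=False)
        self.X, self.centroids = X, X[idx]
        assign = np.argmin(((X[:, None] - self.centroids[None]) ** 2).sum(-1), axis=1)
        self.lists = {j: np.where(assign == j)[0] for j in range(k_centroids)}

    def search(self, q, k, nprobe=2):
        order = np.argsort(((self.centroids - q) ** 2).sum(-1))[:nprobe]
        cand = np.concatenate([self.lists[j] for j in order])
        d = ((self.X[cand] - q) ** 2).sum(-1)
        return [(float(d[i]), ("hist", int(cand[i]))) for i in np.argsort(d)[:k]]

def query(q, realtime, historical, k=5):
    merged = realtime.search(q, k) + historical.search(q, k)
    return sorted(merged)[:k]  # merge the two segments' top-k by distance

hist = HistoricalSegment(rng.normal(size=(500, 16)))
rt = RealtimeSegment(16)
rt.add(hist.X[0] + 0.001)  # a fresh write landing near an old point
res = query(hist.X[0], rt, hist)
```

Because writes only ever touch the small real-time segment and reads fan out to both, the compacted segment can live behind a storage layer while the hot segment stays in memory, which is the storage/compute separation being claimed.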

Yes, cool. Let me share my screen.

"I think before we go to that part, a quick question, Hammad. That was excellent so far, this is really great. Could you shed a little light on how much you go into quantization? I know there was a slide you presented a few slides back that said in general people do quantization. How aggressively do you use that in your overall scheme? Or if you're going to answer that question later, you could defer it too."

Yeah, that's a good question. We actually don't use quantization at all, and the reason is that it's really hard to reason about the performance of your system when you use quantization. What people end up doing is tuning the quantization scheme, or the over-query scheme, to deal with the recall loss you face from quantization. The problem is, if we were just running our own system, I think that would be easy for us to reason about, but because we have a lot of developers who are being exposed to these concepts for the first time, we've been in search of a solution that doesn't leverage quantization. That's why we have this other set of tricks that lets us avoid having to quantize the data, at the expense of some latency, obviously. Thank

you. — A quick question: Martin Prammer from the University of Wisconsin–Madison. You mentioned a lot about using distance metrics — have you looked into any of the alternative approaches where you train a separate model to do your distance calculations, and then rely on that distance-measuring model instead of the more traditional approaches? — Yeah, that's something we've been interested in, but to be honest we haven't spent that much time investigating it. I'm personally really curious about how those approaches perform, but we haven't used them ourselves. — Okay, thank you. — Oh, I think we can get started. Thank
you, Hammad, for the introduction of the problem, the workload, and the index we're building. I think that's one of the cornerstones of a vector database, and of the retrieval architecture for large language models. Another cornerstone is really the distributed protocols we build to ensure the safety and liveness properties of the system, and we'll talk about a few problems and the solutions we propose to achieve safety and liveness.

Before that, I'll quickly describe Chroma's architecture, starting with a single node. Single-node Chroma is pretty simple: it's a server containing a frontend, a metadata index, and a vector index. When people send nearest-neighbor queries, they first go to the metadata index, then query the vectors, then return the results. A couple of highlights of the single-node architecture: currently we're using an HNSW index for the vectors, we use SQLite as our metadata index, and we also store the system-level catalog — we call it the system DB — in SQLite.

That's the single-node version. In the distributed version, we basically decompose those components into separate services. For example, we have a coordinator that holds the system catalog, and we introduce frontend servers, which act as proxies, routing queries to the log and the worker nodes depending on whether it's a read or a write. We introduce a distributed log as the database of record: writes go to the distributed log first, and then the index nodes and query nodes tail the distributed log and build the views and the indexes incrementally.
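As a toy illustration of that flow — writes land in the log, an index node tails the log and compacts full segments out to object storage — here's a sketch with entirely hypothetical names (this is not Chroma's actual code):

```python
# Stand-ins: a list for the distributed log, a dict for S3-style object storage.
log = []
object_store = {}
FLUSH_THRESHOLD = 3      # records per segment, kept tiny for the demo

class IndexNode:
    def __init__(self):
        self.cursor = 0      # how far into the log we have consumed
        self.buffer = []     # the active (real-time) segment
        self.flushed = 0     # number of historical segments written out

    def tail(self):
        # Consume new log entries and build the index incrementally,
        # flushing a historical segment whenever the buffer fills up.
        while self.cursor < len(log):
            self.buffer.append(log[self.cursor])
            self.cursor += 1
            if len(self.buffer) >= FLUSH_THRESHOLD:
                object_store[f"segment-{self.flushed}"] = list(self.buffer)
                self.buffer.clear()
                self.flushed += 1

node = IndexNode()
for i in range(7):
    log.append({"id": i, "embedding": [float(i)]})   # writes hit the log first
node.tail()
print(len(object_store), "historical segments,",
      len(node.buffer), "records still in the real-time segment")
```

Seven writes with a flush threshold of three leave two compacted segments in object storage and one record in the in-memory segment.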

And next I'll talk about the read and write paths in Chroma. On the write path, writes go to the log first; the index nodes tail those logs, and once a segment reaches a certain size threshold it's flushed to object storage, like S3. On the read path, the frontend servers figure out the assignment of the active segments and the historical segments, and route the query to the right nodes.

We use a hashing scheme to reduce global coordination. The idea is that distributed logs like Apache Pulsar have a concept of topics, and based on the topic ID we assign each topic to a node; the historical segments also have IDs, and based on those IDs we assign them to different nodes.

Here I want to call out a few highlights of the system. Things are very different today compared to ten years ago: there's a rich ecosystem we can leverage to build distributed systems. We can use object storage, we can use distributed logs, and we can leverage Kubernetes to manage membership and do notifications. So that's our first principle: leverage the ecosystem where possible. However, we also want to build the critical capabilities ourselves — the routing-information propagation protocols as well as the topic-reconfiguration protocols — to ensure the system has the safety and liveness properties we need. And we lean heavily on hashing in the current system to reduce global coordination. Next, I'll talk about one problem we are
solving, which we call routing-information propagation. On our write path we have active segments: the index node tails the log, ingests the data, and builds indexes, and once a segment reaches a certain threshold it becomes a historical segment. You can think of all the historical segments as being generated from the active segment. It's important that during segment generation and splits, the frontend servers always have the most up-to-date information, so that a query covers all the data for its collection — we don't want missing data or a partial query during an active-segment split or a node addition or removal.

The protocol we're building is a monotonic protocol to ensure that no query returns incorrect data, because this is fundamentally an asynchronous system. What the indexing node does is: as it tails the data and flushes it to disk, it registers the new segment membership with the coordinator, and the coordinator broadcasts the new term and epoch to the frontend servers. The term is a cluster-wide value indicating a new assignment of the member list, and it is only incremented in response to node additions or removals. The epoch is a per-collection concept: each time we split or generate a historical segment, the epoch is incremented. Because the system is asynchronous, the frontend servers may receive the updated terms or epochs in a delayed fashion, and similarly for the workers.
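The term/epoch bookkeeping can be sketched like this (illustrative names, not Chroma's API). The staleness comparison at the bottom is what the reconciliation step, described next, runs on each query:

```python
class Coordinator:
    """Tracks a cluster-wide term and per-collection epochs."""
    def __init__(self):
        self.term = 0       # bumps only on node additions/removals
        self.epochs = {}    # bumps when a collection generates a new segment

    def node_membership_changed(self):
        self.term += 1

    def segment_generated(self, collection):
        self.epochs[collection] = self.epochs.get(collection, 0) + 1

def is_stale(query_term, query_epoch, worker_term, worker_epoch):
    """Reconciliation check a worker runs per query: if the frontend's view
    is older than the worker's, the frontend must re-pull from the
    coordinator before the query can be answered completely."""
    return (query_term, query_epoch) < (worker_term, worker_epoch)

coord = Coordinator()
coord.segment_generated("docs")     # a historical segment was split off
coord.node_membership_changed()     # a node joined the cluster

# The frontend cached (term=0, epoch=0); the worker observed (1, 1).
print("frontend stale:", is_stale(0, 0, coord.term, coord.epochs["docs"]))
```

Because both counters only ever increase, a simple ordered comparison is enough to detect a lagging frontend — that monotonicity is the point of the protocol.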

So how do we make sure the system always carries the right information? We basically have pull, push, and reconciliation routines. Pull means that when a new frontend server starts, it pulls the coordinator for the latest term and epoch for the relevant collections. Push means that when there's a new membership, or a new segment is generated, we push the updates to the existing frontend servers and workers. Reconciliation means that every query carries the term and epoch observed by the frontend, and the worker checks whether the term and epoch in the query are smaller than, larger than, or equal to the term and epoch on the worker; if they're smaller, the frontend is lagging behind and needs to query the coordinator to fetch the most up-to-date information. With pull, push, and reconciliation, we can guarantee that a query always covers the complete data for the collection, even though everything propagates asynchronously.

That was the first problem. The second problem we're facing, as Hammad mentioned, is that we're
dealing with a large number of collections. Even though we're using Pulsar, it has its own limits on the number of topics, which means we can't simply assign one collection per topic: we have to do some data multiplexing, assigning multiple collections to a set of topics. The challenge is making sure that as the system grows — as the number of collections grows — we can reconfigure the mapping of collections over topics. This is, in general, a pretty challenging problem. During such a reconfiguration the order of writes is very important: you don't want deleted data to show up again because of reordering, so you have to guarantee the order of writes even while changing the number of topics.

We designed a protocol, inspired by the Naiad paper, to ensure that during a topic reconfiguration, the data observed by the indexing workers is ordered consistently across the old and new topic configurations. This is critical to ensure the correctness of
the data, and for the user experience.

Here's a high-level view of the protocol. It really solves two sub-problems. The first is the reconfiguration decision: we use a coordinator to make the decision, and once the coordinator decides we have to do a topic reconfiguration, it has to be executed — this is similar to the prepare stage in a two-phase-commit protocol. The coordinator then broadcasts the new topic list to the producers, and when a producer receives the new topic list and configuration, it communicates with the consumers by writing a message called fin to the topics, indicating that it will produce no more messages under the previous configuration. The consumers buffer the new-configuration messages, and once a consumer has received the fin messages, it knows which collections will no longer receive messages under the previous configuration; it can then deliver the buffered messages to the index workers — but only after it has delivered all the messages from the previous configuration for that collection.
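The consumer-side rule — buffer new-configuration messages until a fin has arrived from every producer of that collection — can be sketched as a toy model under the talk's assumptions (fixed producer membership during the change; names are illustrative, not Chroma's code):

```python
from collections import defaultdict

class Consumer:
    def __init__(self, producers_per_collection):
        # Assumption from the talk: the consumer knows how many producers
        # write to each collection, and membership is frozen mid-change.
        self.expected = producers_per_collection      # e.g. {"c1": 1}
        self.fins = defaultdict(set)       # producers done with the old config
        self.buffered = defaultdict(list)  # new-config messages on hold
        self.delivered = []                # ordered output to index workers

    def on_message(self, collection, producer, config, payload):
        if payload == "fin":
            # Producer promises: no more old-config writes for this collection.
            self.fins[collection].add(producer)
            self._maybe_release(collection)
        elif config == "old":
            self.delivered.append((collection, payload))
        else:
            # New-config message: hold until every old-config writer finishes.
            self.buffered[collection].append((collection, payload))
            self._maybe_release(collection)

    def _maybe_release(self, collection):
        if len(self.fins[collection]) == self.expected[collection]:
            self.delivered.extend(self.buffered.pop(collection, []))

c = Consumer({"c1": 1})
c.on_message("c1", "p1", "old", "write-1")
c.on_message("c1", "p1", "new", "write-2")   # buffered: fin not yet seen
c.on_message("c1", "p1", "fin", "fin")       # releases the buffered write
print(c.delivered)
```

The old-configuration write is delivered first, and the new-configuration write only after the fin arrives, so the index workers never observe a reordering across the configuration change.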

Say two producers are responsible for two collections. In configuration zero we have one topic: both producers produce to the old topic, and the consumer consumes messages from that topic. Now we want to change the configuration to use two topics. The protocol works like this: the coordinator makes the decision and broadcasts the configuration change to all the producers, and when a producer receives it, it sends the fin marker to the old topic. Say producer one is responsible for collection one: once the consumer receives the fin from producer one, it knows there will be no more messages for collection one under configuration zero — this is very important — and then it can expose collection one's data to the index worker. Once it receives the fin message from producer two, it can likewise expose the data for collection two.

This doesn't work unconditionally — we do need some assumptions. The assumption we make is that the consumer knows how many producers are producing to each collection, and that this cannot change during the configuration change. The way we achieve that is to disallow node additions and removals during the reconfiguration stage; we can also use a static-membership mechanism so that, in case of node crashes, we recover the node into the same state it had before. With that, we can perform a configuration change without blocking the producers or the consumers. So this is the protocol we've developed, and we're going to verify it with
without blocking to the uh to the consumers so this is like a prod call with developer and we are going to verify with

tr+ so our our philosophy is basically like we think that because of distribut system all about reaching consensus and

lesses without with minimum I would say like synchronization between systems so we need to make sure that those Protocols are right from day one so that we don't run into

situations that we have to do patch and Pat again of the system of the protocol that is not predesigned so we will make sure that those information those

Protocols are designed are verified on day one with t plus so that we make sure that the protocols are very solid and

then we'll be implementing them.

The other thing we're doing, focused on correctness, is that even for single-node Chroma we put a lot of effort into property-based testing and model-based testing. The idea is very simple and comes from the lightweight-formal-verification line of work. We have a model for the metadata and a model for the data, with reference implementations — for example, an in-memory map for the metadata and a brute-force in-memory index for the data. Then we create a state machine that randomly performs operations on both the reference implementation and the real implementation, and we compare the results within a certain bound. This has been quite helpful in improving our quality — we fixed quite a few bugs with this property-based and model-based checking — and I think it also serves as a foundation as we build the distributed system, because we don't want the system to degrade in quality or develop correctness issues. It's a solid foundation for us to move on from.
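That model-based testing loop — random operations applied to both a trivially-correct reference model and the implementation under test, with results compared — can be sketched like this (the "implementation" here is a toy stand-in, not Chroma's metadata index):

```python
import random

class ReferenceStore:                 # in-memory map: obviously correct
    def __init__(self): self.data = {}
    def upsert(self, k, v): self.data[k] = v
    def delete(self, k): self.data.pop(k, None)
    def get(self, k): return self.data.get(k)

class StoreUnderTest:                 # stands in for the real implementation
    def __init__(self): self.rows = []
    def upsert(self, k, v):
        self.delete(k)
        self.rows.append((k, v))
    def delete(self, k):
        self.rows = [(rk, rv) for rk, rv in self.rows if rk != k]
    def get(self, k):
        for rk, rv in self.rows:
            if rk == k:
                return rv
        return None

random.seed(0)
ref, sut = ReferenceStore(), StoreUnderTest()
for _ in range(1000):
    op = random.choice(["upsert", "delete", "get"])
    k = random.randrange(10)
    if op == "upsert":
        v = random.random()
        ref.upsert(k, v)
        sut.upsert(k, v)
    elif op == "delete":
        ref.delete(k)
        sut.delete(k)
    else:
        # Any divergence between model and implementation fails immediately.
        assert ref.get(k) == sut.get(k), f"divergence at key {k}"
print("1000 random operations, no divergence")
```

In practice a property-based testing framework generates and shrinks these operation sequences automatically; the essential structure is the same.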

Next up: the roadmap. As Hammad mentioned, we want to be an end-to-end retrieval system for language models, and I think vector search is only part of that — vector search by itself is not sufficient. If you think about building an application with language models, there are three kinds of components to think about. The first is the logic — the chain of prompts. The second is the data, including retrieval as well as your structured data. And the third, which is very important, is feedback and evaluation, so that you can find the right strategies for chunking, embedding, and ranking and end up with a production-grade language-model application.

So Chroma's roadmap is to expand our API to allow user-defined functions for chunking, embedding, indexing, and re-ranking, and to allow more sophisticated retrieval strategies to be pushed down closer to the data. Right now Chroma focuses on the vector-search part, but later we also want to include APIs for chunking, embedding, storage, and re-ranking. In the following slides I'll talk about a few examples; the first example is retrieval evaluation.

You have some query, a retrieval result, and a generation result, and there can be feedback on those results. We can save that feedback directly into Chroma, where it's useful in many ways: it can be used to fine-tune the embeddings at different layers and steps. For example, the feedback can be used to fine-tune the embedding model, it can be fed into adapter matrices, and we can build re-ranking and filtering mechanisms on top of it. Based on that feedback, people can build better end-to-end systems.

Another thing Chroma is aiming to support with the pipeline API is letting users — or the system, automatically — generate the best chunking strategy. Chunking is very important but often overlooked: it is literally what tells the system which information to embed. There are many heuristic chunking strategies based on document structure, but the optimal chunking strategy depends a lot on the task and the dataset, and Chroma is experimenting with using language models and learned approaches to perform optimal chunking at ingestion time.

Finally, Chroma can also include a re-ranking model at the end of the pipeline. Relevance really depends on the specific query or task, as well as the underlying dataset, so after we retrieve the data, we can let users experiment with different ranking models to produce the most relevant results. And that's all — thank
you! — I will clap on behalf of the audience. We have time for a couple of questions if the audience wants to go for it. — Connecting back to some of the earlier discussion: Hammad, you mentioned that the vector sizes you see sometimes go up to 4K, driven by the workload. Could you and Liquan say a bit more about the diversity of the workloads you see? Obviously people get those vectors by querying some LLM or some other method — is that what's driving it, or is something else going on with the diversity of vector sizes? — Yeah, I think it's mostly coming from the large variety of models we're seeing in the open-source
landscape, and also from multimodal models. Nowadays we're seeing video embedding models, for example — the 4K embedding models I mentioned are specifically video embedding models — and those by their nature have a much higher dimensionality, plus a temporal aspect that can make them difficult to reason about. So the first driver is multimodal embeddings, and the second is that there's a huge resurgence of open-source embedding models. People pull them off the shelf, and these are often trained in ways where you take a language model, remove some part of it, and fine-tune the end — or it's an embedding model trained specifically for this purpose — and the nature of the architecture leads to a much higher dimensionality.

That said, we have seen that you can reduce the dimensionality quite a bit and get the same performance, so I do think there's a lot of headroom in dimensionality reduction when you have a specific task and domain in mind. That's very much in line with the kinds of problems we want to solve: it doesn't make sense to have an embedding model that is generically good at all of human language if all you're trying to do is code retrieval; it makes sense to fine-tune a model down to a much smaller dimension and focus it on that narrow task. So I think that's where things are going — we'll have foundation embedding models trained on a wide corpus of data, and people will fine-tune them to their specific task, doing dimensionality reduction in the process. It's almost silly how simple that is: you really can just add a linear layer to the end of your embedding model, fine-tune it down to a smaller dimension, and get surprisingly good results with simple binary classification as your training data. So that's what we see a lot of people do.

Any other questions from the audience? — Oh, I think you're muted. — Sorry. Two more questions. One is: how big do some of these databases get, in terms of the number
of vectors, in some of the larger applications that you see? I'll wait for that answer before I ask the next one. — Sure. I think the biggest datasets we've seen are right under a billion data points — that's the biggest dataset I've personally seen indexed into Chroma, but it's very much a tail case. The large majority of our data is on the order of several hundred thousand to several million vectors. The key part, though, is that they're updated quite frequently, both in terms of additions and in terms of mutations — updates and deletes — and that leads to some challenges in the underlying vector-index algorithm you use. But yeah, several million vectors seems to be where a lot of these things converge, and I think that falls out of the fact that when you chunk a reasonable corpus of documents — say, 100 to 1,000 documents — and process them, you end up somewhere in that regime. For most of the use cases we're seeing — chat-your-documents, where you have some internal knowledge base you want to chat over, or code retrieval, or augmenting a language model with specifics about a new tool you're developing so it can assist users with that task — those corpora tend to be not trivially small, where you could brute-force, but also not insanely large, where we'd have to start figuring out how to do billion-scale vector search on a single
node. — Great. The follow-up question: do you run into issues where someone has a long-standing application — not just creating these vector databases for a short time but really building on them like a database — and they switch the model they're using for embedding, say from GPT-3.5 to 4? Do the users take care of migrating the database to the new embedding model, or do you provide tools for that? — Yeah, this is something we've been thinking about a lot. We don't have explicit tools to help with that today, but I think there are three flavors of problems that all align with this. The first is that the actual language model drifts over time without you knowing: if you're using an API service provider, the behavior of the language model might change by itself without you even knowing, and that can be strange to monitor for. Monitoring tools are something the ecosystem is working on, and perhaps they might make sense for Chroma, though I'm not entirely sure. There's actually a really good paper from Databricks where they measure, over three months, how the performance of a language model on a fixed benchmark of tasks changes over time. So that's the first class of problems.

The second class is what you mentioned: you change the embedding model yourself, and you have to manage these different versions of the same task workflow. The way we think about it is that people are going to do that both for experimentation and for valid production use cases — maybe it makes sense to A/B test two embedding models online and compare which performs better, or you fine-tuned something and want to run an experiment. In that case we do think the system should handle it for you — that's sort of an experimentation platform that could live inside Chroma — but of course you can also do it on your own with the same data model. The way it ends up manifesting is that the fundamental data structure in Chroma is a collection, and you just create multiple collections; the idea is, how can we allow those to share indices to start, and then, if you mutate or change them over time, do some sort of copy-on-write scheme? That's one potential approach: you don't need to duplicate all the data, just the data that matters for your specific variation.

The third class of problems is people changing other strategies — not the embedding model, but the chunking strategy, or the ranking strategy, or simply updating a document. A huge problem we see is: I took this document, broke it into ten chunks, and then updated the source document — what's the best way to update the individual chunks without redoing all the processing and wasting a bunch of compute on very large documents? How to diff-and-embed is an active problem. So those are the three areas where we see drift — from the embedding-model or language-model providers, or from the person using the database — potentially causing issues, and I do think it's necessary that whatever system emerges to solve these problems, hopefully Chroma, makes those things
easy for you. — Yeah, thank you. I just loved how many connections you made to the papers — fantastic, loved it. Awesome
talk. All right, any other questions from the audience? — I wanted to go back to the distributed protocol you're running to do the reconfiguration. Is this something specific to Chroma, or is it just a limitation in Kafka — that when you do this reconfiguration, you need to update a bunch of consumers feeding from it in a sort of atomic manner? — Yeah, this is a limitation of Kafka. I think there was a proposal about topic expansion in Kafka, and probably in Pulsar as well, but this is in general a pretty hard problem, in the sense that you need to coordinate the producers, the brokers, and the consumers. So we have to build on top of those systems' limitations. At a high
level the intuition is that uh assuming that you are dealing with even time processing with out of other like messages and the message can arrive kind

of like in in a fashion that can be delayed but you want to make sure that for example you want to compute the number of like the spend on an hourly

basis and then the hourly basis can really arrive late maybe in 24 hours late you basically need to keep all the windows open before you can um send the right answer to down stream right uh

this is how this is basically Paradigm of Flink or like spark streaming but in N the idea is that you always output the correct answer to the downstream and the

way to do it is to figure out uh what they call our Frontiers which is the latest of the lower bound of event time stamp that will happen in the future and

once you know the like the frontier then everything be before this time stamp is kind of like you already know the not not going to change everything you send

out the down will not will be re workable the idea is that this is basic highight idea and the way we do it is to make sure that we know all the producers

are producing this this collection and we know that all we need to keep track of how many producers already make the configuration change and then

the producer once they are absolutely sure that they have receive all the kind of acknowledgement from producers that we are no longer producing the data

before them I can safely sending data to the downstream without violating any like ordering guarantees yeah that is I mean I mean how expensive would it stop

the-world pause be to do this reconfiguration? — I don't think there's any big stop-the-world step. We'll probably use two code paths, one for the normal case and one for the reconfiguration case. The only thing that needs two-stage consensus is making sure everyone knows we're executing a configuration change, after which we switch to the configuration-change code path — similar to two-phase commit. After that, everything is asynchronous, as long as you make sure there's no node addition, removal, or crash. — Got it, okay, that's the idea. — And then, at the beginning of the talk you went through a bunch of different optimizations — and we appreciate you putting the citations to the papers in — but what you didn't show is: is it possible to say, across all those optimizations,
which one gave the most bang for the buck, in terms of engineering or even compute time? If someone were going to implement all of them, which should they start with first? — Yeah, I think the biggest one, if you're doing inverted indices, is what the SPANN paper calls closure clustering, where you assign a point to multiple centroids based on a boundary condition. That changes inverted indices from something whose performance is very hard to reason about into something that works almost as well as HNSW in most cases. It was quite surprising to us when we implemented it — it's a very simple idea: points on the boundary just get assigned to multiple centroids, a very simple heuristic, but it changes the recall performance by 20 to 30% in some cases. So it's quite a large and drastic jump.
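That boundary heuristic fits in a few lines. Here's a hedged sketch of the assignment rule (the `eps` replication factor is an illustrative choice, and real systems apply this at much larger scale):

```python
import math

def closure_assign(points, centroids, eps=0.2):
    """SPANN-style closure clustering sketch: replicate a point into the
    posting list of every centroid whose distance is within (1 + eps) of the
    distance to its closest centroid, so boundary points appear in several
    lists and recall no longer collapses near partition edges."""
    posting = {i: [] for i in range(len(centroids))}
    for pid, p in enumerate(points):
        dists = [math.dist(p, c) for c in centroids]
        best = min(dists)
        for cid, dv in enumerate(dists):
            if dv <= (1.0 + eps) * best:
                posting[cid].append(pid)
    return posting

centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.1)]   # last point is on the boundary
lists = closure_assign(points, centroids)
print(lists)
```

The interior points land in exactly one posting list each, while the boundary point is replicated into both — which is what rescues queries that land near a partition edge.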
