
Fast LLM Serving with vLLM and PagedAttention

By Anyscale

Summary

Topics Covered

  • Single GPU Serves Less Than One Request Per Second
  • PagedAttention Becomes Industry Standard in Serving
  • Paging Memory Helps Even Without Attention

Full Transcript

Hello everyone, thank you all for joining. My name is Woosuk Kwon, and I'm a PhD student at UC Berkeley working with Ion. Today I'm more than happy to present a recent project, vLLM: fast LLM serving with PagedAttention. This project is co-led by me and Zhuohan here, and advised by Ion Stoica.

As we all know, we are in the era of LLMs. The most innovative applications today, like ChatGPT and Copilot, are powered by LLMs. Moreover, the whole field is still rapidly growing: every week we see new models and new breakthroughs that continue to expand the potential of LLMs.

Certainly, LLMs are opening up a plethora of exciting opportunities. As a result, we've seen a surge in applications and startups leveraging LLMs across various fields, from chatting and programming to business operations. An essential part of these applications, and our focus in this talk today, is serving these LLMs. Given that these applications all heavily utilize LLMs, their performance and cost largely depend on the speed and cost of serving LLMs. Therefore, how to serve LLMs fast and cost-effectively is becoming a tremendously important problem these days.

However, we observe that serving LLMs is surprisingly slow and expensive, even on the best available hardware. Due to the large model size, LLMs often run on high-end GPUs like the NVIDIA A100. Despite that, a single GPU with previous systems could only serve a handful of requests per second, even less than one request per second for a 13-billion-parameter model with moderate input sizes. This means you will need a ton of GPUs if you actually build production-scale services using LLMs. Obviously, this has been a critical pain point for many large and small companies. To understand the problem, let's recap

the inference process of LLMs. First, the user provides a prompt consisting of multiple tokens. The prompt goes through the model, and then we get the next token of the prompt. In the next step, we feed this new token back to the model and get the second output, and this process is repeated until the sequence either reaches a predefined maximum length, say 2,000 tokens, or generates a certain token for stopping. This is basically how the inference process of LLMs works.

In the inference process, LLMs have a unique component which is often called the KV cache in the literature. When processing a new token, the model actually needs not only the representation of the current token, which is "though" in this particular example, but also the representations of all the previous tokens. So these states of the previous tokens should be kept in memory, and they are called the KV cache. The same goes for the next step, except that the input token of the previous step, which was "though" in this example, is newly appended to the KV cache. As such, the KV cache dynamically grows, and it also shrinks when the sequence finishes.

Here, the key insight in our project is that efficient management of the KV cache is crucial for high-throughput LLM serving. Let me give you an example. Let's say we run a 13B LLM on a 40-gigabyte GPU. The model parameters take roughly 26 gigabytes of memory. In addition, a small fraction of memory is reserved for workspace. The rest of the memory, roughly 12 gigabytes in this example, can be used for the KV cache.
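The arithmetic behind that example can be sketched as follows. Note the per-token KV cache size assumes an fp16, 13B-like model shape (40 layers, hidden size 5120); those shape numbers are my assumption, not figures from the talk:

```python
# Back-of-the-envelope memory math for the 13B-model-on-a-40GB-GPU example.
GPU_GB = 40
params = 13e9
bytes_per_param = 2                              # fp16
weight_gb = params * bytes_per_param / 1e9       # 26.0 GB of weights

# Assumed 13B-like shape: 40 layers, hidden size 5120, fp16 KV states.
layers, hidden = 40, 5120
kv_bytes_per_token = 2 * layers * hidden * bytes_per_param   # K and V per layer

workspace_gb = 2                                 # small reserved workspace
cache_gb = GPU_GB - weight_gb - workspace_gb     # what's left for the KV cache
max_cached_tokens = int(cache_gb * 1e9 / kv_bytes_per_token)

print(f"weights: {weight_gb:.0f} GB")                          # 26 GB
print(f"KV cache per token: {kv_bytes_per_token/1e6:.1f} MB")  # ~0.8 MB
print(f"tokens that fit in the cache: ~{max_cached_tokens}")   # ~14,600
```

So only on the order of ten thousand token states fit in the cache at once, which is why every wasted slot directly costs batch capacity.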

Our finding is that previous systems use the KV cache inefficiently and thus significantly limit the number of requests that can be batched together. In contrast, our solution, vLLM, manages the KV cache much more efficiently, and thereby allows a much larger batch size with the same amount of memory. Importantly, the increase in batch size translates to an increase in throughput, and hence reduces the cost per request. So what are the memory inefficiencies in

the previous systems? This is a snapshot of the KV cache when using a previous system, where we find three types of memory waste. The first is internal fragmentation, which means slots are allocated for a sequence but never used. This happens because we don't know in advance how many tokens the model will generate. The second is reservation, which means slots that are not used at the moment but will be used by the sequence in the future. Here, the three slots in the middle are reserved: they don't store any token at the current step, but they will store output tokens in subsequent steps. This is another kind of waste. Finally, there is external fragmentation, because different requests A and B may have different sequence lengths. According to our profiling results, the memory waste was significant: only 20 to 40 percent of the KV cache space was actually utilized to store the token states; the rest of the memory was simply wasted for these three reasons. So how does vLLM solve this? Our solution is basically to employ the

old idea of virtual memory and paging from operating systems. As we all know, the OS uses paging to reduce fragmentation, and it uses virtual memory for efficient and elegant space multiplexing between processes. What vLLM does is basically the same: it uses a similar kind of idea to resolve the fragmentation in the KV cache space and to enable efficient space sharing between requests. Specifically, this is enabled by our new technique, PagedAttention. Let's dive into it.

To begin with, we partition the KV cache into an array of KV blocks, just like paging. A KV block is a fixed-size chunk of memory that can store token states from left to right. In this particular example, the block size is 4, which means we can store four tokens in a block. On top of this, we introduce PagedAttention, our new implementation of the attention mechanism. We find that the fundamental limitation of the previous systems is that they require all the KV states for a sequence to be stored in contiguous memory space. This is a useful convention in traditional DNN workloads, where the input and output shapes are static; however, it turns out to be highly inefficient for LLM inference, where the sequence lengths are dynamic and unknown a priori. PagedAttention directly addresses this limitation: as shown in the animation, PagedAttention operates on KV states stored in non-contiguous blocks, and it efficiently fetches blocks located at arbitrary positions in the KV cache space. Furthermore, we virtualize the KV cache into logical and physical KV blocks. Let's see how it works with an example request.

Request A's prompt is "Alan Turing is a computer scientist". In the logical view, the tokens are stored in a natural fashion: they are stored in consecutive blocks, and the order is preserved. In the physical view, on the other hand, the tokens of the same sequence may not be stored in adjacent blocks, and the order of the blocks can be arbitrary. The mapping from the logical blocks to the physical blocks is stored in a new data structure called the block table, which is analogous to a page table.

Let's continue the example. Say the model has generated the next token, "and". The new token is first appended to the last logical block, and using the block table, we also append the new token state to the corresponding physical block. The same goes for the next generated token, "mathematician". Finally, if the last block is full, then we allocate a new physical block and store the token in the first slot of that block. Importantly, this means blocks are allocated on demand, as opposed to being pre-allocated like in the previous systems.
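A minimal sketch of this block-table bookkeeping (illustrative data structures, not vLLM's actual code) could look like:

```python
# Minimal sketch of a block table: each sequence keeps an ordered list of
# logical blocks, each mapped to a physical block allocated on demand from
# anywhere in the pool -- no contiguity required.

BLOCK_SIZE = 4   # tokens per KV block, as in the talk's example

class BlockPool:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def alloc(self):
        return self.free.pop()   # any free block will do

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []    # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = BlockPool(num_physical_blocks=8)
seq = Sequence(pool)
for _ in "Alan Turing is a computer scientist and".split():
    seq.append_token()

print(seq.num_tokens)         # 7 tokens stored
print(len(seq.block_table))   # 2 physical blocks: ceil(7 / 4)
```

Because the pool hands out any free block, a sequence's physical blocks can be scattered arbitrarily while its logical view stays ordered.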

So basically, this is how vLLM manages the KV cache. So far we've covered how it works for a single request, a single sequence; however, I believe you can imagine how it works for multiple requests processed simultaneously, since the underlying principle is the same as virtual memory in the OS.

Let's quickly analyze the memory efficiency of vLLM. First, vLLM has minimal internal fragmentation. This is because internal fragmentation only happens at the last block of a sequence, which means the number of wasted token slots per sequence is bounded by the block size. In practice we use a block size of 16 or 32, which means we store 16 or 32 token states per block, and this is orders of magnitude smaller than a typical sequence length; therefore the internal fragmentation is very small. Second, vLLM does not have external fragmentation, since all blocks have the same size. Putting these together, vLLM only wastes about four percent of the KV cache space. In other words, it improves the memory utilization by three to five times compared to the previous systems. This leads to a significant increase in batch size and eventually in serving throughput.
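The internal-fragmentation bound can be checked with one line of arithmetic: only the last block of a sequence can hold empty slots, so at most block_size - 1 token slots are wasted per sequence.

```python
# Worst-case internal fragmentation: all empty slots live in the last
# block, so waste per sequence is at most block_size - 1 token slots.

def max_waste_fraction(block_size, seq_len):
    return (block_size - 1) / seq_len

for block_size in (16, 32):   # the block sizes used in practice
    frac = max_waste_fraction(block_size, seq_len=1000)
    print(f"block size {block_size}: at most {frac:.1%} of a 1000-token sequence wasted")
```

For a 1,000-token sequence, that is at most 1.5% (block size 16) or 3.1% (block size 32), which is why internal fragmentation stays small.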

Okay, from now on Zhuohan will take over the rest of the talk.

Okay, hello everyone, this is Zhuohan. Woosuk just talked about how the paged memory, or the dynamic block mapping, reduces memory waste and improves the serving throughput. Another big advantage of the dynamic block mapping we have in vLLM is that it enables sharing. For example, parallel sampling, shown here on the slide, is a very popular sampling method for LLMs which generates multiple outputs from the same prompt. As shown on the slide, we have one input prompt, "The future of cloud computing is", and we feed this one input prompt into an LLM and generate multiple sample outputs from it. In such a case, the computation and the memory for the prompt can be saved by sharing it between the different parallel output samples. This scenario might be pretty rare for a plain chatbot, but it can be very helpful for cases like LLM-based programming assistants, like GitHub Copilot, which we all use for coding: you can generate multiple candidates and select the one you want.

vLLM can naturally support this decoding scenario using its block tables. We use the same prompt as on the previous slide, "The future of cloud computing is", and here we generate two parallel samples. For the prompt part, because these two samples share the same prompt, vLLM will only do the computation once and store only one copy of the KV cache in the physical blocks, as shown on the slide. Because sequences A and B have the same prompt, vLLM will only keep one copy, but both A and B still have their own logical blocks. Since each physical block can be mapped to by multiple logical blocks, we naturally keep a reference count for each physical block; in this case, these two physical blocks have a reference count of two.

As the next step, the two samples begin to generate their outputs, and because they are different samples, they will sample different outputs: say sequence A generates the word "bright" and sequence B generates the word "intertwined". Both of them need to write the newly generated KV cache for these new words into the physical KV blocks, and now we face an issue. When sequence A wants to write to the physical KV block, it checks the physical KV block's reference count and sees that it is two. It then reduces the reference count from two to one and does a copy-on-write: it copies the physical block to a new block, and then writes the KV cache of the word "bright" into the new block. For sequence B, when it checks that physical block, the reference count is now one, so it directly writes the newly generated word "intertwined" into the physical KV block, and the generation continues with these newly generated blocks. In this case we can see that, except for the very last KV block of the prompt, all of the other KV blocks of the prompt can be shared across both sequences. This can greatly reduce the memory usage for the prompt part, especially when the prompt is very long, and it can also increase the throughput because of the memory reduction.
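The reference-count and copy-on-write step just described can be sketched like this (toy bookkeeping only; token "states" are plain strings, and the generated words are the ones from the slide example):

```python
# Sketch of copy-on-write for a shared prompt block: two samples share
# the prompt's last physical block (ref count 2); whichever sample
# writes first copies the block instead of mutating the shared one.

class BlockManager:
    def __init__(self):
        self.blocks = {}      # physical block id -> list of token states
        self.refcount = {}    # physical block id -> number of logical mappings
        self.next_id = 0

    def alloc(self, contents=()):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = list(contents)
        self.refcount[bid] = 1
        return bid

    def append(self, block_table, token):
        """Append a token state to a sequence's last block, copy-on-write."""
        last = block_table[-1]
        if self.refcount[last] > 1:            # shared: copy before writing
            self.refcount[last] -= 1
            last = self.alloc(self.blocks[last])
            block_table[-1] = last
        self.blocks[last].append(token)

mgr = BlockManager()
prompt_block = mgr.alloc(["cloud", "computing", "is"])   # block size 4, one slot free
mgr.refcount[prompt_block] = 2                           # shared by samples A and B
table_a, table_b = [prompt_block], [prompt_block]

mgr.append(table_a, "bright")        # A sees refcount 2 -> copies, then writes
mgr.append(table_b, "intertwined")   # B sees refcount 1 -> writes in place

print(mgr.blocks[table_a[-1]])       # ['cloud', 'computing', 'is', 'bright']
print(mgr.blocks[table_b[-1]])       # ['cloud', 'computing', 'is', 'intertwined']
print(table_a[-1] != table_b[-1])    # True: the shared block was forked
```

Only the one partially filled block gets duplicated; every fully shared prompt block keeps a single physical copy.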

But vLLM does not only support parallel decoding; it also supports even more complicated decoding methods, like beam search. For people who are not very familiar with it, beam search is a decoding algorithm that is very popular in machine translation for finding the most accurate translation, and it can also be used with LLMs, for example if you use an LLM for translation. Beam search is very similar to parallel sampling in the sense that the prompt part is also shared by the different beams. However, beam search can be more dynamic and more complex compared to parallel sampling, because beams can diverge from one another. For example, here the new beam 0 and beam 1 both come from the original beam 0, and the new beam 2 comes from the original beam 2. In this case, this part of the KV cache can be shared across beam 0 and beam 1, and the original beam 1's KV cache can simply be deleted. Then at the next time step, the beam search result looks like this: beam 0 comes from the original beam 0, and beam 1 and beam 2 both come from the original beam 2. In this case, this part can also be shared by the two beams, beam 1 and beam 2. So you can see that the sharing pattern is very dynamic and changes over time, and vLLM supports this dynamically via its block-based memory management, the paged memory and the block table mapping. For people who are familiar with operating systems, this is very similar to a fork tree: when you keep calling fork in your operating system processes, the way the OS manages memory is very similar to this fork-tree style. This whole sharing scenario is efficiently supported by vLLM with the dynamic block mapping and the copy-on-write mechanism.

Okay, to briefly summarize: how do PagedAttention and vLLM benefit LLM serving? First, as Woosuk discussed, we reduce memory fragmentation with paging. This reduces memory waste, so we can fit more requests in the same batch and improve the throughput. Second, because we can do more complicated sharing, we can further reduce the memory usage, which further boosts the throughput.

Next, let's dive a little bit into the system architecture and implementation of vLLM. We built vLLM as an end-to-end LLM serving engine that includes a frontend, a distributed model executor, and a scheduler. We have a centralized engine that manages the block table, and at each iteration it sends memory commands to the GPU workers; the cache engine in each GPU worker actually allocates the physical blocks and then executes the model shards. More specifically, vLLM is mostly a Python project, but we do have some CUDA code for the PagedAttention kernel. We build on top of the popular LLM stack, like Hugging Face and PyTorch; for distributed execution we use Megatron-LM-style model parallelism, and we use Ray for cluster management and control-plane communication.

Here are the performance results from when we first announced vLLM. We compared ourselves with plain Hugging Face model.generate and with another Hugging Face inference solution, Hugging Face Text Generation Inference (TGI). Compared with the most naive way of using LLMs, which most of you might have used, basically downloading a model from Hugging Face and calling model.generate, we achieve 24x higher throughput. And even compared with TGI, which did not have PagedAttention back then, we achieve up to 3.5x higher throughput.

vLLM is an open-source project; it's available at this link, and as we said, it's mostly a Python project, so it's actually very easy to use: you can just pip install it and start using it. We have documentation, and we have gotten nearly 7K stars for this open-source project. In the past several months we have been working intensively on model support, and we now support a very large set of LLMs; basically, I think 95 percent of the most popular LLMs are supported by vLLM right now, so you can pick your favorite one and start serving it with vLLM.

Our first success story for vLLM actually came before our open-source launch. We launched vLLM on June 20th this year, but about two months before that, we had already started serving the famous Vicuna model in the Chatbot Arena demo with vLLM. It has been deployed since then, about five months now, as the backend for LLaMA-based models, including Vicuna, Alpaca, Koala, and many more. This demo receives an average traffic of 40K conversations per day, and with vLLM we could reduce the total number of GPUs used for serving by 50 percent; even with half of the GPUs, we can serve 2x to 3x more requests per second. And this gain is achieved while serving only part of the models with vLLM; if you use vLLM to serve all of your models, the gain can be even greater. As our project keeps growing, we have gotten

many, many adopters. On the open-source side, first we have FastChat, as we just discussed, and also SkyPilot. SkyPilot is another project we developed here; we used SkyPilot during vLLM's development, and you can also easily launch a vLLM server with SkyPilot. We also have many other open-source LLM frameworks, like OpenChat, that are all using vLLM for inference. And we are talking to many companies; some companies we haven't even talked to have simply started using vLLM on their own. You might say, okay, talk is cheap, show me the code: this is code we can find on GitHub showing that these companies are using vLLM. So we want to show you that vLLM is actually a very mature project, and many people are using vLLM for their inference serving workloads.

Beyond the vLLM open-source project, from a more macro perspective, the PagedAttention algorithm has become an industry standard. Other inference frameworks built by others, like Fireworks AI, our baseline Hugging Face TGI, and the recent NVIDIA TensorRT-LLM, are all using PagedAttention in their serving engines for acceleration.

Lastly, let me briefly dive into how you can get started with vLLM. The vLLM API is actually very simple. First, we show the case where you want to use vLLM for offline batch inference: you just import the LLM class from vLLM, initialize the LLM, and feed in the prompts, and that's it. vLLM will automatically batch the execution of all the prompts and give you the results. You can fit as many prompts as you like into the prompt list, and vLLM will automatically handle the batching for you, pick the right batch size, and manage the memory with PagedAttention.
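In code, the offline batch-inference usage described above looks roughly like this (a sketch mirroring vLLM's documented API at the time; the model name is just an illustrative choice, and running it requires vLLM installed on a GPU machine):

```python
from vllm import LLM, SamplingParams

prompts = ["The future of cloud computing is"]
# n=2 requests two parallel samples per prompt, which exercises the
# prompt-sharing path described earlier in the talk.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2)

llm = LLM(model="facebook/opt-125m")              # any supported HF model name
outputs = llm.generate(prompts, sampling_params)  # batching handled internally

for output in outputs:
    for sample in output.outputs:
        print(output.prompt, "->", sample.text)
```

The caller never manages batch sizes or KV memory; the engine schedules and pages everything behind `generate`.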

Second, since we also target serving, we provide an OpenAI-compatible server as a demo server. You can launch the OpenAI-compatible server with a single command, and then you can query the API as if you were querying the OpenAI API.

So let me briefly summarize. In conclusion, vLLM can improve the memory efficiency of LLM serving by 2.5x to 5x by reducing fragmentation and enabling sharing with PagedAttention, and vLLM can outperform the state of the art by 1.7x to 4x in terms of serving throughput. We are building vLLM as a vibrant open-source project that supports many models and decoding methods, and vLLM is getting widely adopted. Here we showed our GitHub link; you can check out our code and also our blog, which is a very good starting point to learn about vLLM and the PagedAttention technology behind it. We also recently published our paper at the SOSP conference, so if you want to dive deeper into the PagedAttention work, please take a look at the paper. Finally, if you have any questions, you can join this Discord channel and directly talk to us. Very happy to be here, and thanks for listening.

Okay, so the question is about the many recent neural networks that are more recurrent-based, RNN-style neural networks: can the PagedAttention technique still apply to those kinds of neural networks? I believe the answer is yes, because even with a recurrent neural network you still need to manage the memory, and you still don't know what the output length will be. If you don't do paging of the memory, you still need to reserve a big chunk of memory for every request, and that will still waste a lot of memory. So I think paging will also help on the RNN side. But obviously, while paging will help, PagedAttention itself will not help if the model is not using attention.

Okay, the next question is how we identified the memory problem.

I think the project started by just building an LLM serving system from scratch, because back then there was nothing available on GitHub, in the open-source world. The way we are different, I think, is that we focused on high-throughput serving rather than reducing latency, because that is more effective in reducing the cost per request. When you target high-throughput serving, you basically want to increase the batch size, and that's when we found the problem: when you start implementing an inference server, once you reach that point you'll find this issue, that the way all the existing methods manage the memory is not elegant enough and wastes a lot of memory. So it came from hands-on experience.

Oh yes, that's a very good question, so let me repeat it. The question is about

whether there are any cache misses or something like that. I think the answer is no. The biggest reason is the attention computation's access pattern: to generate a new word, the model needs to look at all of its previous words, so at every step all of the previous words need to be looked at once. Because of that computation pattern, there are no misses.

Okay, next question: you've mentioned a comparison with

Hugging Face TGI, and you showed that TGI is now incorporating PagedAttention. So is there going to be any benefit to using vLLM directly over just using the PagedAttention mechanism?

Okay, so the question is: what is the current benefit of using vLLM, as TGI has already incorporated the PagedAttention algorithm? One thing we'd like to highlight is that TGI does a pretty good job of reducing fragmentation using PagedAttention, but they didn't do as much on memory sharing; that part is actually more complex to implement. So we still have a performance benefit for the complex decoding methods like beam search. Also, on the engineering side, I think we did a pretty good job on the optimizations around PagedAttention, so we believe we still have higher performance than TGI. And we are on an Apache 2.0 license.

[Applause]

Okay, so the question is: we talked about memory sharing for beam search and parallel decoding; what about some other cases, like sharing a prefix across different requests? That's a great question. We actually ran that experiment in the paper, and it works. There are some small engineering issues, so we haven't merged it into the main branch yet, but we will eventually.

Okay, I think this is maybe the last question.

Great work. You talked a lot about throughput; what about latency?

Okay, so the question is: we talked a lot about throughput, so what about latency? At a very high level, we focus on throughput because throughput is what actually reduces the cost; if you reduce latency at the cost of reducing throughput, that's actually not good if you are serving a very big service for everyone. More specifically, I think we can categorize latency-related techniques into two categories. One category is pure optimization, which reduces latency and also increases throughput; we will definitely incorporate these kinds of techniques, for example kernel fusion, into vLLM in the future. The other category is work that trades off throughput for latency, like speculative decoding; that's a very important and very impactful technique, and we are thinking about it, but we will be more cautious in that space. We'll implement the first set of techniques first and then think about the second. Cool, thanks everyone. Thank you, everyone.
