vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Simon Mo, vLLM
By PyTorch
Summary
Topics Covered
- vLLM: Open Ecosystem Engine
- Day-Zero Model Support Unlocks
- Bit-Exact Accuracy Matches Training
- Hybrid Allocator Handles KV Hybrids
- llm-d Enables Reproducible Clusters
Full Transcript
All right, hello folks, thank you for coming. I'm Simon, one of the leads of the vLLM project. Today I'll talk a little bit about vLLM: what we have been up to, what is happening, some recent developments, and the way we're thinking about the future of inference from the current point of view. All right, let's get started.

So vLLM started as a research project but quickly became the engine of choice for large language model inference. Our goal is to build the fastest and easiest-to-use open source LLM inference and serving engine. As I mentioned in the keynote today, we really see vLLM as an ecosystem where models, accelerators, and frameworks all work together.

From the model vendor's point of view, vLLM is a trusted place and the engine of choice to implement a model in the most efficient way. From the accelerator point of view, as soon as a vendor adds kernels for new models, any optimization benefits both the existing and future set of 200-plus model architectures in vLLM. And on the framework side, vLLM has been the go-to default rollout engine for a lot of RL frameworks today, as well as being used for reward modeling, synthetic data generation, and a lot more.
Just to reiterate and dive a little deeper, the reason we're here today is that vLLM started on PyTorch, and we have a close relationship co-developing vLLM and PyTorch together. On one side, vLLM helps battle-test a lot of PyTorch features such as torch.compile and FlexAttention. On the other side, we leverage almost every extent of the PyTorch library itself, such as plugging in our own accelerator memory allocators, adding new accelerator plugins, and, in a way, abusing torch.compile to achieve the best inference speed possible.

The other reason we're here is that vLLM is a PyTorch Foundation project: we work directly with all the members of the PyTorch Foundation to make sure vLLM is developed and openly governed by all the members, together ensuring the longevity and shared governance of the project.
If you look at vLLM's progress so far, we're fortunate that as of last week we have crossed 60,000 stars, with around 800 PRs merged every month. To manage a project at this scale, we cannot do it without a lot of industry partners such as Meta, Red Hat, Anyscale, Hugging Face, and many more companies that focus directly on the open source upstream vLLM, supporting different models, supporting different hardware, and making sure you get the best optimization on the newest frontier models.

Let's look at the code a little bit; this will help anchor the conversation about the newer features we have recently been introducing in vLLM.
So there are two ways to use vLLM. One is to use it as almost a drop-in replacement for Hugging Face Transformers inference, where instead of running large padded batches, we offer continuous batching through the LLM interface. The LLM interface lets you initialize a large language model and run directly through your list of prompts; that list can be millions of entries if it needs to be.
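As a minimal sketch of the offline batching interface just described (the model name below is only a placeholder, and argument defaults may differ across vLLM versions):

```python
# Minimal offline-inference sketch (illustrative; the model name is a placeholder).
from vllm import LLM, SamplingParams

# Initialize the engine once; vLLM handles continuous batching internally.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Write a haiku about GPUs.",
]  # this list can be arbitrarily long; vLLM schedules it for you

sampling = SamplingParams(temperature=0.8, max_tokens=128)

for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```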
The other way is to use it as a server. You can use it as a drop-in replacement for an OpenAI server, or anything that speaks the OpenAI protocol, so applications designed to talk to the OpenAI protocol today can talk to vLLM, and that unlocks all the open source language models for you. And here is a piece of news so new that we didn't get to put it on the slide: as of this morning, we just merged support for the Anthropic API. So now you can point Claude Code and related applications tied to the Anthropic protocol directly at an open source model as well, and we're going to see more and more usage of this down the line.
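For the server path, here is a hedged client-side sketch: it assumes a server was started separately with vLLM serve and speaks the standard OpenAI protocol; the newly merged Anthropic-protocol endpoint is not shown since its exact surface may still be settling.

```python
# Client-side sketch against an OpenAI-compatible vLLM server.
# Assumes a server was started separately, e.g.:  vllm serve Qwen/Qwen2.5-7B-Instruct
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the OpenAI SDK at vLLM
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello from an OpenAI-protocol client!"}],
)
print(resp.choices[0].message.content)
```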
Additionally, as I mentioned, these are the two primary ways of using vLLM directly, but as we build the inference system we also support breaking it apart, so that a lot of frameworks directly leverage vLLM's internals, or its lower-level abstractions and constructs, to integrate into their existing frameworks and applications.
Just to talk a little more about the models we're serving: a lot of the time we work directly with the people who are making the models, training the models, and open-sourcing them. In a lot of cases you see what we call day-zero support, and that term really just means that the moment the model weights become public on the Hugging Face model hub, or anywhere else, you are able to download and run them with vLLM. Of course this doesn't magically happen the moment the model is released and we get to work; it's a long relationship we're building with all the model vendors, across China, the US, and around the world, to make sure this happens.

For models where we don't have a direct relationship, or where the vendor doesn't have the bandwidth to work on this, the great news is that we have worked with Hugging Face to support what we call the Transformers backend. With one flag in vLLM serve, this lets you plug in models that do not yet have native support in vLLM and run them directly with the Transformers implementation. It comes with a great benefit: the same piece of code, the same model definition, can be used across pre-training, RL, and inference; it's the same Hugging Face model definition code.
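A hedged sketch of the Transformers backend fallback described above; the model name is a placeholder and the model_impl argument name is my reading of the current flag, so verify it against your vLLM version.

```python
# Sketch: fall back to the Hugging Face Transformers implementation for a model
# that does not yet have a native vLLM definition. The `model_impl` argument
# (equivalently, a flag on `vllm serve`) is the switch described in the talk;
# treat the exact name as an assumption to check against your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/brand-new-model",  # placeholder for a model without native support
    model_impl="transformers",         # force the Transformers backend
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```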
We also have strong vision-language model support; we recently got a lot of headlines with our support for the DeepSeek-OCR system as well as Qwen3-VL. We're seeing that our architecture makes it quite easy to add new modalities, and this is just the start. So today you can use vLLM for OCR tasks and vision understanding, but later on we're quite excited to start supporting multimodal output, such as image generation in the vision-transformer style.
Speaking of model support, we prioritize accuracy quite a lot. If you ask how much time we actually spent supporting gpt-oss, the model from OpenAI: of the 30 days we spent with that model, 20 or more were spent making sure we get exactly the same bits of log-probability output directly out of the model, so that you are actually getting the model you intend to get, not some version with a silent corruption bug or accuracy issue.
With that, just today we announced support for batch-invariant mode. This has been in the making in collaboration with Thinking Machines Lab as well as Meta, so that regardless of how many requests are in the batch, and regardless of the batch structure and ordering, you get exactly the same output every time. This has been used heavily to verify accuracy, especially to make sure the model you're serving is exactly what you're getting from the training framework.
Accuracy is one part; the other thing we prioritize is performance. There are a lot of ways to tell you about this, but one way I like, instead of showing a performance bar chart, is to show how much we actually invest. On the left is a nightly performance dashboard we run together with a partner developer team: every single commit goes through an extensive set of tests to make sure there's no performance regression and things only get better. On the right is how much this costs, in terms of funding and infrastructure, to provide that performance guarantee.
And to speak a little more on PyTorch and vLLM, one thing I don't think we mentioned is that there are large production use cases running on nightly vLLM on top of nightly PyTorch. This really ensures that the most cutting-edge path, where you can get the most performance possible, is being validated and being run. That's why you see a lot of GitHub issues and PRs popping up on vLLM saying that current main is broken on a PyTorch nightly that hasn't been released and won't be released for three months: three months before you get the new PyTorch release, we're already making sure it works.
Of course, this is just the first of a series of talks; there's a lot more about vLLM during the PyTorch conference. Now I'm going to spend a few slides on the new technical insights we're developing to account for the new trends we're seeing today. I'll first give a quick summary of two techniques and an interface, and then dive into the work I've been doing on distributed inference.
The first technique is what we call the hybrid allocator. If you look at the trend in models going forward, we started with dense transformers like Llama and the dense Qwen models, then gradually moved toward mixture-of-experts models like Mixtral, and later toward larger ones like DeepSeek and Llama 4 Scout and Maverick. But that's what's happening on the FFN side. The other side is attention. On the attention side we're seeing quite a lot of innovation beyond the typical multi-head attention: we've seen MLA, multi-head latent attention, from DeepSeek, and people are also innovating on improving the quadratic behavior of the attention mechanism itself.

That's why we started building a system called the hybrid memory allocator. The hybrid memory allocator lets you efficiently manage the KV cache for models with hybrid memory requirements, such as a model that mixes sliding-window and full attention, or sliding-window and full attention together with Mamba or state-space model blocks, and it allows these kinds of intricate designs to be well managed and efficiently served.
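To make the hybrid-allocation idea concrete, here is a deliberately simplified toy calculation, not vLLM's actual allocator: layers are grouped by attention type, and each group only retains the KV blocks its attention pattern can still read.

```python
# Toy illustration of the hybrid-KV-cache idea (NOT vLLM's implementation):
# full-attention layers keep every block, sliding-window layers only keep the
# blocks inside their window, so a shared pool is used much more efficiently.
BLOCK_SIZE = 16

def blocks_needed(num_tokens: int, attn_type: str, window: int = 1024) -> int:
    """Return how many KV blocks one layer of this type must retain."""
    if attn_type == "full":
        live_tokens = num_tokens
    elif attn_type == "sliding_window":
        live_tokens = min(num_tokens, window)
    else:  # e.g. mamba / state-space layers keep a constant-size state, not KV blocks
        return 0
    return -(-live_tokens // BLOCK_SIZE)  # ceiling division

layer_types = ["full", "sliding_window", "sliding_window", "mamba"]
seq_len = 100_000
hybrid = sum(blocks_needed(seq_len, t) for t in layer_types)
naive = len(layer_types) * blocks_needed(seq_len, "full")
print(f"hybrid-aware blocks: {hybrid}, naive full-attention blocks: {naive}")
```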
The second part, also related to KV cache, is the emerging need for KV cache transfer and storage. As contexts get bigger and bigger, you have a growing need to manage the KV cache and move it around. Take the example of a one-million-token context window: the memory required to hold that KV cache is quite large, and recomputation is sometimes slower than just caching it in CPU memory or remote storage and bringing it back.
This is why, in collaboration with the team at LMCache and others, we built an interface called the KV connector. In vLLM we can now support offloading the KV cache, compressing the KV cache, and transferring the KV cache in multiple ways; for example, you can store the KV cache in accelerated remote storage, or move it directly to CPU RAM for later use.
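A hedged sketch of wiring up a KV connector through the offline API; the KVTransferConfig class, its field names, and the connector identifier below are assumptions about the current interface and should be checked against the vLLM and LMCache documentation.

```python
# Sketch: attach a KV connector so finished KV blocks can be offloaded or shared
# instead of recomputed. Class and field names are assumptions to verify.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # e.g. the LMCache-backed connector
        kv_role="kv_both",                  # this instance both saves and loads KV
    ),
)
```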
One more bit on a new innovation we're seeing, again related to KV cache and attention: as attention gets more and more efficient, you typically see a straggler effect when serving at large scale. What does that mean? If every single GPU is running its own requests, sometimes one request with a huge number of input tokens slows everybody else down. This is why, in collaboration with the team behind Kimi K2, arguably the world's first modern one-trillion-parameter open source transformer, we support what they open-sourced as decode context parallelism. Decode context parallelism lets you shard and stripe the input context across different ranks; it's a new parallelization degree alongside tensor parallelism, expert parallelism, and many more, and it really resolves the straggler problem. If you use it to serve DeepSeek-V3.1 with decode context parallelism, you immediately get a lot more KV cache space as well as better throughput.
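As a conceptual illustration only, not the actual kernel-level implementation, striping one long request's context across decode ranks can be pictured like this: each rank owns an interleaved slice of the KV cache, computes partial attention over it, and the partial results are combined exactly.

```python
# Toy sketch of decode context parallelism (conceptual, not vLLM's kernel):
# stripe a long context across ranks so no single GPU holds, or attends over,
# the whole thing; partial attention outputs are then combined across ranks.
def stripe_context(num_tokens: int, num_ranks: int, block: int = 256):
    """Assign contiguous blocks of the context to ranks round-robin."""
    assignment = {r: [] for r in range(num_ranks)}
    for i, start in enumerate(range(0, num_tokens, block)):
        assignment[i % num_ranks].append((start, min(start + block, num_tokens)))
    return assignment

shards = stripe_context(num_tokens=1_000_000, num_ranks=8)
print({rank: len(blocks) for rank, blocks in shards.items()})
# Each rank attends over roughly 1/8 of the KV cache; a log-sum-exp style combine
# of the per-rank partial attention results reproduces the full-attention output.
```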
All right, just to recap the section so far: vLLM makes sure we deliver timely, accurate, and optimized model support. We also have wide hardware coverage, from the latest generation of NVIDIA chips to AMD, Intel, and a lot more, as well as flexible APIs with a variety of framework integrations, deep PyTorch integration, and frontier innovation across many techniques.
Now I'm going to dive a little deeper into distributed inference. That has been a core focus of ours over the last few months: vLLM can do the job of serving a single model on a single group of GPUs really well, but what do you do to scale it up? This is where different ways of thinking about distributed serving come in. One way to think about it is from the engine out: you think about different parallelisms, fault tolerance, elasticity; vLLM does all of that.
Another way of thinking about it is from the cluster end. From the cluster management point of view, especially for someone who has been managing thousands of GPUs or more, you actually need to worry about routing, caching, and operations. This is where llm-d comes in as a community project, focused on gluing together the cluster operator perspective and the inference engine perspective.
The way to look at it is this: much of the Kubernetes world is designed for microservices, processing requests that are fast, uniform, and cheap. LLM inference, on the other hand, is orders of magnitude more expensive, far less uniform, and requires a lot more care and consideration from the cluster's point of view. llm-d helps by integrating deeply with Kubernetes primitives: when requests come in, they go through the inference gateway, which can intelligently choose and route to different vLLM instances on different hardware.
This also provides quite a lot of reproducible samples. One challenge we had, if you had asked me a year ago, is that you really could not scale vLLM in a reproducible manner: every cluster behaves differently, everyone has different requirements and network topologies, and sometimes when we say we can get, say, 2,000 tokens per second per GPU on a given setup, someone else cannot reproduce it. By building on Kubernetes primitives, llm-d gives the community well-lit paths that are directly reproducible and tweakable, so you can bring them into your own setup. This includes intelligent inference scheduling, P/D disaggregation, KV cache management, and wide expert parallelism. I'll quickly illustrate a few of them.
For intelligent inference scheduling, the framing is a smarter way of routing and load-balancing requests. When you deploy vLLM in a distributed manner today, one option is to just put it behind a Kubernetes Service and do round-robin load balancing, and that is typically pretty bad, because requests are so non-uniform that you don't really get a well-balanced result. llm-d supports prefix-aware routing: on the left, you can see that a request is matched against the right prefix tree, so you always get a lower time to first token when the prefix has already been processed. On the other side there is load-aware routing, where the gateway scrapes and intelligently analyzes vLLM-specific metrics to route to the right pod.
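A hypothetical, self-contained sketch of how an endpoint picker could combine the two signals; this is an illustration of the trade-off, not llm-d's actual scorer or API.

```python
# Toy endpoint picker combining prefix-affinity and load signals
# (hypothetical illustration, not llm-d's actual scorer).
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    cached_prefix_tokens: int    # how much of this request's prefix the pod already has
    kv_cache_utilization: float  # 0.0 (idle) .. 1.0 (full), scraped from vLLM metrics

def pick_pod(pods: list[Pod], prompt_tokens: int, prefix_weight: float = 0.7) -> Pod:
    """Higher prefix reuse is good; higher KV-cache utilization is bad."""
    def score(p: Pod) -> float:
        prefix_hit = p.cached_prefix_tokens / max(prompt_tokens, 1)
        return prefix_weight * prefix_hit - (1 - prefix_weight) * p.kv_cache_utilization
    return max(pods, key=score)

pods = [Pod("pod-a", 4096, 0.9), Pod("pod-b", 0, 0.2)]
print(pick_pod(pods, prompt_tokens=8192).name)
```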
On P/D disaggregation: llm-d, along with the vLLM integration itself, provides a way to efficiently transfer the KV cache from GPU to GPU, leveraging the NVIDIA NIXL library, with counterparts for AMD and TPU, so the KV cache moves directly from the prefill instance to the decode instance. By separating prefill and decode, you can run different parallelization strategies, deployment options, and scaling policies across the two.
And finally, there's wide expert parallelism. The way you typically serve a one-trillion-parameter model today is not to find a node with a trillion parameters' worth of memory; rather, you scale out into a much bigger pool, which lets you efficiently manage a large batch of requests and distribute it across multiple ranks. llm-d gives you a way to manage this easily, again on top of Kubernetes, so it's more reproducible and easier to use.
Along with this, we are leveraging optimizations from the DeepSeek team, such as the expert parallel load balancer, where you can move experts within the model, shuffle them around, and replicate them to achieve great utilization, as well as dual-batch overlap. That is again work inspired by what DeepSeek has been doing: overlapping communication and computation so that in this large distributed setup the network, the memory bandwidth, and the compute are all well utilized.
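A toy sketch of the expert-load-balancing idea, as a hypothetical illustration rather than the real EPLB algorithm: spare expert slots are handed to the hottest experts so routed tokens spread more evenly across ranks.

```python
# Toy expert-load-balancing sketch (hypothetical, not the real EPLB algorithm):
# give extra replicas to the most-loaded experts so routed tokens spread out.
from collections import Counter

def plan_replicas(expert_load: dict[int, int], spare_slots: int) -> Counter:
    """Start with one replica per expert, then hand spare slots to the hottest ones."""
    replicas = Counter({e: 1 for e in expert_load})
    for _ in range(spare_slots):
        # the expert with the highest load-per-replica gets the next slot
        hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

load = {0: 900, 1: 120, 2: 100, 3: 80}    # tokens routed to each expert
print(plan_replicas(load, spare_slots=4))  # most spare slots go to expert 0
```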
And that's pretty much it. I've covered vLLM a little bit, our recent developments, and distributed inference, and the roadmap is very much doubling down on all of that. I have a few minutes left, so now I can take questions. Thank you so much.
Um, I don't think we have a mic runner, so just shout and I'll try to repeat the question. Yeah.
>> All right. So the question is if you want to integrate with vLLM as a different accelerator. Yeah.
>> Previously it was completely PyTorch-based, from what I looked at. Has that changed? Do we still need to rely on it, or can we bring our own runtime?
>> Yeah. So the question is about the experience of adding a new accelerator and how strongly tied it is to PyTorch. vLLM does require PyTorch, but what we offer now is a plugin mechanism where you can swap out what we call the model runner and the model execution path. Within vLLM, PyTorch is the framework you're using, but when you're adding your own hardware, you don't really need your hardware runtime and software stack to work with every single PyTorch operator. Rather, you can take the PyTorch graph, go through the torch.compile Inductor pass, and lower it to whatever framework or language you like. This is one of the options; later you'll hear about the vLLM TPU story. For vLLM on TPU, they initially started with torch_xla, but as of a week ago the version they announced actually lowers the PyTorch model definition to JAX, and JAX runs it at higher utilization on TPU. So for any new hardware you can do pretty much exactly the same thing: lower a PyTorch model definition into whatever language, framework, and compiler of your choice. Thank you.
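To illustrate the lowering path described in this answer, here is a generic PyTorch sketch of a custom torch.compile backend; a real hardware plugin would hand the captured FX graph to its own compiler instead of just running it eagerly, and this is not vLLM's plugin API itself.

```python
# Generic sketch of the "take the PyTorch graph and lower it yourself" path.
# A custom torch.compile backend receives an FX GraphModule; a hardware vendor
# would translate that graph to its own compiler/runtime instead of returning
# the eager forward as done here.
import torch

def my_hardware_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()   # inspect the captured ops you would need to lower
    return gm.forward          # placeholder: run eagerly instead of lowering

@torch.compile(backend=my_hardware_backend)
def toy_model(x):
    return torch.nn.functional.relu(x @ x.T)

print(toy_model(torch.randn(4, 4)))
```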
>> Yes.
>> So my question is mostly about the optimization part, and you mentioned that you also do serving. Do you shard the models, and if so, how do you ensure that the tokens handled by whichever GPU are each processed correctly?
>> Yeah. So the question is: across different parallelization paradigms, how do we make sure there is correctness, with tokens routed to the right rank, as well as performance, so everything is well served. That is the job of vLLM, and we have a lot of options for scaling out in different ways. Initially vLLM only had tensor parallelism: when a token comes in, it is broadcast to all ranks, and then internally, across the intermediate and hidden dimensions, you do all sorts of collectives to keep the computation running. Then there's pipeline parallelism, where you shard the model layer by layer across different nodes. Later we added context parallelism, as I mentioned, which shards the tokens across different ranks and makes sure the attention adds up in the end with zero accuracy loss. And then of course expert parallelism, and a lot more coming down the pipeline, so we can separate and shard the model in different ways. Correctness is never sacrificed or traded off; rather, the distributed approach is a way to make sure you can run at larger scale with greater efficiency.
>> Cool. Yes.
How do prefix-aware routing and load-aware routing work together?
>> Great question: how do prefix-aware routing and load-aware routing work together? Sometimes they don't; they're a choice. And sometimes they do. Here's what happens under the hood. Prefix-aware routing requires expensive computation and reconstruction of the prefix tree, so it essentially requires you to understand every string and every token that comes through. Load-aware routing relies on statistical averages of things like KV cache utilization. In a lot of cases you choose one or the other depending on P or D, that is, whether you're working on prefill or on decode. But in some cases, depending on your production environment, you might actually mix the signals of the two. Fundamentally, the endpoint picker in the llm-d construct, or the load balancer itself, is what has to choose the host to send the request to, and it's a matter of how you compose, choose, or select the right algorithm to get the tokens delivered. Cool, I think we have one more minute. Over there, yes.
Ah, just to confirm, because I have trouble hearing: this is about the numerics and batch invariance, right, or training-inference consistency? Okay, yes. So the question is to share more about training-inference consistency. The way we typically approach this is: given the same number of GPUs, the same distributed setup, and of course the same model weights, we want to make sure that in the end you get exactly the same output, under the condition that we use the same kernels and the same distribution strategy. But this is fundamentally difficult, because in attention, especially in an inference engine like vLLM, we do what is called paged attention and continuous batching, and that is different from the training paradigm, where you pad all the requests together and put them in one batch. So if you put them in one batch and run it, we want to make sure that in a continuous batching scenario we get exactly the same output as well. That actually requires a lot of tricks with kernels and with tuning the distributed computation. Again, for that I highly recommend the blog from Thinking Machines Lab on batch invariance; it's from Horace.
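As a hedged sketch of how one might sanity-check batch invariance with the offline API (the model name is a placeholder, and a bit-exact match depends on running with the batch-invariant kernels discussed above):

```python
# Sketch: empirically checking batch invariance under greedy decoding.
# Whether the two outputs match exactly depends on the batch-invariant
# kernels described in the talk; this only shows the comparison itself.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
greedy = SamplingParams(temperature=0.0, max_tokens=64)

probe = "Explain KV caching in two sentences."
alone = llm.generate([probe], greedy)[0].outputs[0].text
batched = llm.generate([probe] + [f"Filler prompt {i}" for i in range(31)], greedy)[0].outputs[0].text

print("identical:", alone == batched)
```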
Horace might be here today as well. All right, cool. I think we're out of time. I'm happy to take questions off the stage. Thank you so much.