Matrix: Reliable Framework for Data-Centric Experimentation at Scale | Ray Summit 2025
By Anyscale
Summary
## Key takeaways

- **Data Wall in AI Training**: We have hit a wall with human-generated data, having consumed nearly all available content, yet AI capabilities continue to demand trillions of data points for models as smart as or smarter than humans. [02:44], [03:39]
- **Synthetic Data Revolution**: Synthetic data from AI models, such as the data generated by ChatGPT interactions in one year, rivals the cleaned Common Crawl dataset, unlocking vast quantities for training, especially in scarce areas like reasoning and agentic datasets. [04:34], [05:38]
- **Matrix Enables Single-Command Scaling**: Matrix allows researchers to start Ray clusters, deploy open-weight models, and launch thousands of stateful containers with one command, integrating with Ray and vLLM for efficient distributed compute. [11:26], [12:10]
- **Automated Checkpoint Evaluations**: Using Matrix's jobs API, researchers submit multiple fine-tuned checkpoints with specified benchmarks like GPQA Diamond, running 24 concurrent instances to automate evaluation and track status without manual management. [13:18], [13:58]
- **Multi-Agent Collaboration Traces**: Matrix generated 4.5 million trajectories to study how LLMs persuade, assert, and correct each other in multi-agent setups, efficiently utilizing GPU resources for downstream AI capabilities. [20:40], [21:02]
- **Natural Reasoning Dataset Creation**: Matrix enabled generation of a 2.8 million example reasoning dataset from web documents via prompting, improving fine-tuned models' reasoning without relying on benchmarks, released on Hugging Face. [21:12], [21:34]
Topics Covered
- AI models hit wall on human data?
- Synthetic data unlocks AI scaling?
- Quality trumps data quantity?
- Matrix scales inference efficiently?
- Matrix powers multi-agent reasoning?
Full Transcript
Good afternoon, everybody. I am super excited to talk to you about the work we've been doing at Meta over the last year or so on building this framework for data-centric experimentation.
The system is called Matrix.
It's powered by Ray, and it has been used to power some of our AI model training.
I'm excited to talk about it. My name is Rama Ragwendra, and I drive our data effort at FAIR, the fundamental AI research lab at Meta.
For those of you who might not be familiar, this is the research lab at Meta where we work on essentially powering the next generation of research models.
Let me give you a quick agenda of what we are going to cover today.
I will start by giving you background and motivation on how this started and what our journey of data-centric research at FAIR at Meta has been. Then our engineering lead Dong will talk about the system we've been building, Matrix: he'll give some examples of how to use the API and some of the features it covers, as well as some downstream model training that has been done to build AI capabilities such as multi-agent collaboration and reasoning.
So to get started with some background: as we all know, AI models have been demonstrating jaw-dropping capabilities over the last few years, but let's remember these models are only as good as the data that they consume.
This figure summarizes the journey of AI training data over the last five-ish years, and there are two things I want you to remember from this talk. First, on the left side of the graph you will see an exponential rise.
This steep increase in AI capabilities has been accompanied by a corresponding increase in the size of the data used to train these models.
We started with millions of data points, then billions, and now trillions of data points going into training these models. So the last few years have definitely been about scaling the data going into training.
So that's the first takeaway.
The second, if you look at the right side of the graph, is where we are today. I think there's a general consensus that we have essentially hit a wall on how much value we can get out of the human-generated data used to train these models. If we want to continue to scale AI capabilities, we need to answer the question: where is the data going to come from? You've heard of AGI; we talk about superintelligence. Essentially, a few years from now, we are talking about AI that is going to be as smart as or smarter than a human being, while at the same time we have consumed about all the content there is to consume.
So this team essentially tries to answer the question of where the data that is going to power these AI models will come from.
So with that, let me answer the question I just asked.
Obviously, the last few years have been about consuming the data that is publicly available on the internet, cleaning it, and using it to train our models.
A lot of you are familiar with Common Crawl and the various datasets that have come out of it.
If we want to continue beyond the human-generated data that is about to run out, we're talking about unlocking new sources. There is data behind paywalls and data available to purchase, which is still human-generated data.
For a company like Meta, with over three billion users on our products, we also have the opportunity to use product data, within the constraints of what we are allowed to use. But finally, and the reason we are all here, is the topic of using AI models to generate data that can then be used to train other AI models, also known as synthetic data. Synthetic data is seen as the promised land for getting us over this data wall, because you can actually generate large amounts of data for training purposes.
One estimate said that just from the interaction and usage of ChatGPT, enough data is generated in one year to equal the amount of data that can be extracted from Common Crawl after cleaning. That's a huge amount of data, and we need to unlock it.
The other point is that there are areas such as reasoning or agentic datasets where we simply do not have the data to train these large models.
So we need to think about other ways of generating this data.
That's roughly how we think about the scaling paradigm and how we are going to increase data quantity.
I would be remiss if I didn't mention that data quality is just about as important.
We need to make sure that the data going into these models is of high quality, based on whatever parameters define that for your purposes.
You need to make sure that you are not feeding in information that actually hurts model performance.
And you need to make sure that you can extract more value out of less data in order to break this exponential scaling. So that's another thing to take into account as you're building your data experimentation.
I want to put these pieces together and give you a picture of what it takes to build an experimentation framework before we deep dive into Matrix itself.
Roughly, the steps are: find high-quality data, and we just talked about where that's going to come from. You might be using AI models to generate this high-quality data.
These models might be LLMs or diffusion models.
It could be game engines.
That is not necessarily synthetic in the sense that AI is generating it; maybe it's physics-based rendering. Or it could be the other sources I just talked about, which are still in the realm of human-generated data.
You might need to do data augmentation, in the sense that some signals necessary to train your models are missing. An easy-to-relate example: you have a video and need to add captions to it, or if it's in a different language you need to translate it. So there's a lot of augmentation you might need to add.
And finally, data curation: increasing the quality of the data going into your models, or selecting the specialized subset of data relevant to your model training. Then you take this dataset and feed it into your training framework.
Data loading can be static.
It can also be generated on the fly if it's an RL-based learning system, and Matrix itself works with any training framework.
So we're not making any assumptions about which training framework is going to be used.
Finally, you need to run your ablations.
You need evaluations and benchmarks against which you can test model performance, to make sure the datasets you're creating are actually improving your downstream models.
And then you ship these either as a research output, a publication, or an open-source model, or they go into training one of the production models. So this is what the system enables.
With that background, before I hand it over to Dong for a deep dive, I want to mention that this is work done by a large team.
Essentially, the goal of the team is to foster this kind of data-centric, data-first thinking: how can we treat data as a first-class citizen in order to build powerful models?
A lot of this work has gone into models and products coming out of Meta.
With that, over to Dong.
Thanks, Rama. My name is Dong.
I'm one of the research engineers on the FAIR data foundation team.
As Rama said, I will go over Matrix, explain some of the API examples that researchers can use, and then the underlying features and services supporting those APIs for research.
I will also go over a couple of research use cases within FAIR that use Matrix.
Matrix is research infrastructure.
At FAIR, researchers use the cloud to run their research experiments, such as AWS and CoreWeave. On top of those clouds there's a common layer called Slurm, which is common research infrastructure for getting resources scheduled. Matrix builds on top of Slurm: on the Slurm cluster, it helps the researcher start a Ray cluster. The Ray cluster is not a hosted service.
Researchers, individually or as a team, can start many different Matrix instances within the cluster, and each instance gets its own Ray cluster using those resources.
On top of those Ray resources we build online services, for example language model inference, containers to run software, video encoding for embeddings, and a proxy to external API models.
These are the Ray Serve based services.
We're also running a lot of different use cases on Ray, similar to the earlier talks.
On top of those services, we offer an easy-to-use API for researchers, supporting many different use cases.
On the API side: the cluster API allows a researcher to start a Ray cluster easily using one command or a simple Python API. We also recently added an HTTP service so that users can manage their cluster with curl,
including adding workers. A major part of Matrix is LLM inference.
Similar to the Ray cluster management, you can use one command to deploy most open-weight models from Hugging Face, both from the command line and from Python, once you have the instance running.
There is then a very simple API to run a lot of the inference.
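Since the deployed models are served through vLLM, which exposes an OpenAI-compatible chat-completions endpoint, a client-side sketch might look like the following. The endpoint URL and model name in the comments are illustrative assumptions, not Matrix's actual defaults:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat-completions payload,
    the request format that vLLM's server accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical usage: POST this JSON body to a deployed instance, e.g.
# http://<head-node>:8000/v1/chat/completions (URL is illustrative).
payload = build_chat_request(
    "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    "Summarize Ray Serve in one line.",
)
body = json.dumps(payload)
```

In practice the same payload works for one prompt or for millions; the framework's job is scheduling those requests across replicas.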
We do a lot of optimizations in this layer to make sure the inference is efficient, which I will cover later.
Containers are also important, for example for running arbitrary code if you are doing code generation. The interfaces are similar; we also call these applications, the application type of container. You can deploy a thousand containers with one command. Once you have containers, you can allocate a container, run an arbitrary Docker image on it, and then, once the container is up, run commands on that instance. The jobs API is something higher-level than the language model or container services.
Jobs are more user-facing: a user submits a function with resource requirements.
The job manager then makes sure it acquires the resources, runs the jobs, and returns the status and the logs.
One example is checkpoint evaluation.
Researchers do a lot of fine-tuning on foundation models.
For each fine-tuned checkpoint, a separate evaluation benchmark needs to be run.
There are so many checkpoints and so many benchmarks
that it's very hard to manage manually. Using the jobs API, you can submit multiple checkpoints.
You can specify which benchmarks to run and how many instances to run concurrently.
Once you submit this, you can check back later and use the command to figure out the status of the running benchmarks.
For example, here we say we are interested in GPQA Diamond, and we run 24 instances to get a stable estimate; here are the 24 runs.
In this way, we abstract the management of checkpoints and benchmarks away, using Matrix to automate it.
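The checkpoint-times-benchmark fan-out that the jobs API automates can be sketched in plain Python. The function names and the dummy scorer below are hypothetical stand-ins, not Matrix's real jobs API; a real job would launch inference against the checkpoint and score the benchmark:

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_benchmark(checkpoint: str, benchmark: str, seed: int) -> float:
    # Stand-in for one eval run; returns a fake score in [0.4, 0.6].
    rng = random.Random(hash((checkpoint, benchmark, seed)))
    return rng.uniform(0.4, 0.6)

def evaluate(checkpoints, benchmarks, instances=24):
    """Fan out every (checkpoint, benchmark) pair across `instances`
    concurrent runs and average the scores for a stable estimate."""
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        for ckpt, bench in product(checkpoints, benchmarks):
            scores = list(pool.map(
                lambda s: run_benchmark(ckpt, bench, s), range(instances)))
            results[(ckpt, bench)] = statistics.mean(scores)
    return results

scores = evaluate(["ckpt-100", "ckpt-200"], ["gpqa_diamond"])
print(scores)
```

The value of a job manager is precisely that this bookkeeping (resource acquisition, retries, status, logs) stops being the researcher's problem.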
So those are examples of the API.
Now I'll go over some of the services supporting those APIs. To deploy a Ray cluster, researchers can independently run many instances of the Ray cluster on the same Slurm cluster. Since there are many instances, we need to make sure they don't conflict with each other, so the system manages port assignment and temporary directories so that there are no conflicts.
We also use Slurm checkpointing to make sure you can recover if resources are lost.
You can also continuously add workers to your Ray cluster, growing it when you have more capacity. The Grafana dashboard is very important for us for debugging and performance.
Later we'll show some examples.
Matrix is very easy to set up as well.
Inference is very important for Matrix, because we use language models to do a lot of the data generation.
When we looked at the other available frameworks, including LiteLLM and SageMaker, Matrix had some advantages: it runs on Slurm, which researchers are familiar with; it can run vLLM itself instead of just acting as a proxy; it supports both HTTP and gRPC; and because it runs on top of Ray Serve, it can do autoscaling. We also open-sourced the framework in April, so all these nice features run on an open-source stack. We did a lot of optimizations; two of the more important ones are load balancing and gRPC, because the Ray head node's network can get pretty congested when load balancing across hundreds of model replicas.
We build a local cache so that the client can hit the workers directly to do load balancing, instead of going through the head node.
All of this also uses gRPC, so we can avoid some of the overhead of HTTP.
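The client-side replica cache idea can be sketched as follows: the client periodically refreshes its replica list from the head node, then round-robins requests directly to workers, keeping the head off the hot path. This is an illustrative sketch of the idea, not Matrix's implementation:

```python
import itertools
import threading

class ReplicaCache:
    """Client-side cache of model replica addresses with round-robin
    selection (illustrative; a real client would refresh the list
    from the head node on a timer or on errors)."""

    def __init__(self, replicas):
        self._lock = threading.Lock()
        self._cycle = itertools.cycle(list(replicas))

    def refresh(self, replicas):
        # Swap in a fresh replica list fetched from the head node.
        with self._lock:
            self._cycle = itertools.cycle(list(replicas))

    def next_replica(self) -> str:
        # Pick the next worker to contact directly (e.g. via gRPC).
        with self._lock:
            return next(self._cycle)

cache = ReplicaCache(["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"])
picks = [cache.next_replica() for _ in range(6)]
print(picks)
```

With hundreds of replicas, this removes the single proxy or head node from the request path entirely.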
Here is a brief comparison with another popular LLM inference system on Slurm, called llm-swarm.
We both use the same vLLM backend, but llm-swarm does load balancing through an nginx proxy.
When we scale the number of GPU nodes and the number of replicas,
that single proxy becomes a bottleneck, so you can see that Matrix scales better with more model replicas.
As I said, we use the Ray dashboard for a lot of the performance tuning, looking at the metrics to see where the bottleneck is.
This is one of the Ray Serve dashboards, where you can directly observe the QPS to check whether the benchmarks can fully utilize the GPUs.
For custom metrics, our framework can publish metrics to the Ray dashboard.
Here is a view showing the pending tasks across the distributed system, where you can see how many tasks are being queued.
This is the same metric for a different component, where you can see a potential bottleneck of the system where the pending tasks spike.
We can look at this and figure out that maybe we should scale this component.
With all these optimizations, we aim to maintain a consistently high request rate to all the model backends, because the GPUs are the bottleneck.
We want to fully utilize them.
As you can see here, we can maintain a pretty smooth 12k requests across all the model replicas over a long period.
Now let's move to containers.
Containers have become more and more important as researchers work on code generation, search, and tool use for agents. Using Matrix, we can deploy thousands of containers concurrently.
One of the differences with containers is that they are stateful:
you want to acquire a container and then continuously run commands on it.
You don't want to be routed to a different container.
For this, we build custom infrastructure: we have Ray Serve, a registry, and the container actors.
Those are Ray actors.
The main point is that the registry maps an ID to an actor,
so once you have acquired a container by its ID, requests using that ID are guaranteed to end up on the same app instance.
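The sticky ID-to-actor routing can be sketched with plain Python objects standing in for Ray actors. The class and method names are hypothetical; in Matrix the actors are Ray actors behind Ray Serve:

```python
import uuid

class ContainerActor:
    """Stand-in for a Ray actor that owns one running container."""
    def __init__(self):
        self.history = []
    def run(self, command: str) -> str:
        self.history.append(command)   # state persists across calls
        return f"ran: {command}"

class ContainerRegistry:
    """Maps container IDs to actors, so repeated requests with the
    same ID always reach the same stateful container (sketch of the
    routing idea, not the real implementation)."""
    def __init__(self):
        self._actors = {}
    def acquire(self) -> str:
        cid = uuid.uuid4().hex
        self._actors[cid] = ContainerActor()
        return cid
    def run(self, cid: str, command: str) -> str:
        return self._actors[cid].run(command)  # sticky: same ID, same actor

reg = ContainerRegistry()
cid = reg.acquire()
reg.run(cid, "pip install -e .")
print(reg.run(cid, "pytest -q"))
```

The key property is that the second command sees the side effects of the first, which is exactly what stateless request routing cannot give you.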
Here is an overview of the offline data curation we also do: examples include offline inference, media generation using physics engines, video embedding and encoding, and deduplication. Here's a simple diagram of semantic task clustering, where we take the last layer of Llama 8B, project it to 30 dimensions, run DBSCAN on a subset to build the clustering, and then apply that clustering to the whole dataset.
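The subset-then-apply pipeline above can be sketched with NumPy. Random vectors stand in for the Llama last-layer embeddings, and a minimal k-means is swapped in for DBSCAN to keep the sketch dependency-free; the shapes and the 30-dimension projection follow the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for last-layer embeddings of an 8B model (1000 docs, dim 4096).
embeddings = rng.normal(size=(1000, 4096))

# 1) Project to a low dimension (30, as in the talk) via random projection.
proj = rng.normal(size=(4096, 30)) / np.sqrt(30)
low = embeddings @ proj

# 2) Cluster a small subset. The talk uses DBSCAN; a few Lloyd (k-means)
#    iterations are used here instead, purely for self-containment.
subset = low[rng.choice(len(low), size=200, replace=False)]
k = 8
centroids = subset[rng.choice(len(subset), size=k, replace=False)]
for _ in range(10):
    dist = np.linalg.norm(subset[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    for j in range(k):
        if (labels == j).any():
            centroids[j] = subset[labels == j].mean(axis=0)

# 3) Apply the fitted clustering to the whole set by nearest centroid.
dist_all = np.linalg.norm(low[:, None, :] - centroids[None, :, :], axis=2)
all_labels = dist_all.argmin(axis=1)
print(all_labels.shape)
```

Fitting on a subset and assigning the rest by nearest centroid is what makes this tractable at web-corpus scale.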
This is one of the pipelines we use for text diversification using embeddings. Matrix is built to support research.
Here is one of the research projects using Matrix: multi-agent collaboration, where we use Matrix to generate a lot of the traces.
Using these traces, you can study how persuasive and assertive the language models are, whether they agree with each other, and whether they can correct each other's mistakes.
We used Matrix to generate 4.5 million trajectories, using a lot of GPU resources efficiently.
The last use case is natural reasoning.
It's a 2.8 million example reasoning dataset released to Hugging Face earlier this year. It's not based on benchmarks; instead, it looks directly at web docs such as DCLM, uses prompting, and runs Matrix efficiently to generate the reasoning questions and the reasoning traces.
This has been shown to improve the reasoning capabilities of downstream fine-tuned models.
More work is in progress: we'll upgrade as Ray and vLLM release new versions, we'll look at multimodality, and we're running multi-agent data synthesis for benchmark and tool-agent use cases.
Matrix is open source; feel free to check it out and also contribute.
Thank you.
Any quick questions? We are slightly out of time.
>> Question: you said that you are using sessions?
>> Yes. We try to do it similarly for the language models, using some tricks, but for the container ID we want the flexibility so that we can manage it.
>> This use case will not scale in production.
Is it only for research?
>> It scales up to our scale.
We are basically running containers, and thousands of them is manageable in our case.
If you really wanted to remove the single registry, you could of course scale that as well.
>> What is the use case for containers?
>> As I said, we do code generation. Once the language model generates arbitrary code, you want to validate the code and measure its performance; if it's ML, you want to measure the resulting model's performance. All of this needs to run in a secure container, and the containers can have different components.
Similarly, if you want to do agentic data generation, you need an environment, which is very easy if you have a container.
Cool. Okay. Yeah.
>> So what do you see?
>> Yeah, very good question.
I think FAIR infra teams are exploring different ways to run the resources, whether it's Slurm or Kubernetes. It's always a debate. When we talk to researchers, they are more familiar with Slurm, so that's why we're currently still using Slurm, but as more options show up, it's possible we'll look at different ones. Oh, sorry.
>> Does Matrix abstract that away?
>> Yeah. Right now the training side is mostly still using Slurm.
We added some capabilities supporting fine-tuning on Ray, but the large-scale foundation models are still using Slurm, because there's very little overhead there. Cool, thank you. Feel free to chat offline. Thanks.