Building a Virtual Cell: An AI Platform for Engineering Cell State (Oct. 20 Seminar)
By the Department of Biomedical Informatics, Columbia University
Summary
Topics Covered
- AlphaFold templates in silico cell experiments
- Virtual cells enable therapeutic transitions
- Transformers model cell set heterogeneity
- AI agents curate billion-cell repositories
- Agents iteratively design perturbation experiments
Full Transcript
Thank you for joining the seminar today.
I'm excited to see you all, and I'm also very excited to introduce Yusuf Roohani with the talk "Building a Virtual Cell: An AI Platform for Engineering Cell State." Yusuf Roohani is a machine learning group leader at the Arc Institute and a visiting scholar in the Department of Computer Science at Stanford University. His research explores how artificial intelligence can guide experimental design in biological discovery.
Sorry, we had a Wi-Fi issue. Okay.
All right. So, as Andre mentioned, I lead a machine learning group at the Arc Institute. For those of you who may not be aware, the Arc Institute is a nonprofit biomedical research organization based in Palo Alto, right off the Stanford campus. My group is working on an effort we call the Virtual Cell Initiative. Our goal is to use artificial intelligence to build better models of cell state, and to use these models to predict how cell state may change under different interventions — in particular, experimental conditions and perturbations that we haven't actually tested in the lab but whose outcomes we can predict in silico.
All right, so just a brief background on myself. I've been in this space for about a decade now. I started working on ML for biomedical research and drug discovery broadly in 2016, at a startup in Cambridge, Massachusetts, modeling network biology in cancer cells and trying to understand resistance mechanisms to chemotherapy. I then moved on to GSK, where we continued trying to build better models of cellular phenotypes, in this case working with high-content screening data — image-based screens across millions of drugs. I
then moved over to Stanford, where I was advised by Jure and Steve as part of the Stanford AI Lab. (That's my son.) Here we were moving more into the transcriptomic space, thinking about how to create better models of cellular transcriptomic state and how it changes under interventions — in particular, combinatorial interventions — and how we can model that using a variety of approaches, ranging from graph neural networks to large language models. Since last year, I've been leading a machine learning group at the Arc. Here we are essentially trying to take those ideas to the next level by scaling up both the data generation and the model development, and trying to see how far we can push these ideas: what does it take to make models that are actually useful within the lab? It's been great to think about not just the model development side but also data generation and how we can guide it.
All right, so moving over to the science. I'm sure I don't need to convince anyone here that AI has transformed scientific discovery. My favorite example is AlphaFold. AlphaFold takes in a protein sequence and predicts the three-dimensional structure of that protein — essentially, it's learning a mapping from sequence to structure. By learning this representation, it can then predict how other proteins would fold, even ones it may not have seen at training time. So over here you can see that AlphaFold is able to map out a very large space of possible protein structures, and many of these, shown here in blue, it hadn't actually seen in its training set. If you think about it, experimentally uncovering each of these structures takes a lot of time and effort; it's not an easy process. So in a way, what AlphaFold is doing here is performing this experiment for us in silico. And this template — being able to perform in silico experiments, map out a large space of possible hypotheses, and then search for the ones that are most interesting — is one we find not just in protein structure prediction but in scientific discovery broadly. One way to think about scientific discovery is that it's a search problem over a space of potential hypotheses.
And so what we wanted to do was take these ideas from the molecular level up one level, to the cellular scale. Can we move beyond modeling individual molecules and individual proteins, and instead begin to understand how these proteins and molecules interact with each other within a cell — a much more complex and noisier system — and of course build up from there to higher-level phenotypes at the tissue level, the organism level, and so on? And why would this be useful? Once you're modeling phenotypes at the level of the cell, it's easier to begin to ask questions that have therapeutic relevance. For instance, if you have a space over neurons, and you can identify different forms of diseased neurons — degenerative neurons, hyperexcitable neurons — alongside more functional, healthy neurons, then with a working virtual cell model you should be able to identify transitions that would take these neurons out of diseased states and move them closer to healthy states. That's really our goal here: can we use a model that understands cell state intimately well to perform in silico experimentation, and by doing so accelerate the discovery process, because you have more likely hypotheses that you can then drill down on and test in the lab?
Another aspect that becomes apparent from this description is that it's very important to be able to connect these models back to experiments in the lab. That's something we've taken very seriously. We wrote a perspective about how we think the AI virtual cell should be built, published in Cell last year, and we described the idea that these models cannot sit in isolation: they cannot be developed without consideration of the data used to build them, and without being employed to generate the next set of data to further empower them. So really it's a closed loop: you have virtual cell models that help you understand cell biology in silico and propose hypotheses, which you then test experimentally in the lab and feed back into the model to strengthen it and advance its understanding. This is our vision. In the rest of this talk, you'll see that we never really think about the virtual cell as just a predictive model in isolation. We also
think about the data that goes into it and how we can more strategically generate the right data to power and strengthen these approaches downstream. And that segues nicely into the structure of today's talk. I'll be talking about the work in my group, which focuses on these four broad themes. On the left here, what you see is, as I mentioned, how we're leveraging AI to build better models, and also to create better evaluations that connect back to what is most relevant to the biologists, the clinicians, and other people using our models — and also how we can leverage AI to generate data to power these models. I'll describe some work where we use AI agents to curate vast biological data repositories that can then be used to train these models. And I'll also describe, if we have time, some work we did using agents directly to design genetic perturbation experiments in order to optimize toward phenotypes of interest.
So over the past year we've put out papers in each of these areas. Just briefly touching on each of them: I'll talk first about State, a transformer-based model that leverages data from sets of cells to better predict how those cells will respond to perturbation, using single-cell transcriptomics data. I'll also describe the Virtual Cell Challenge, which we launched at the Arc Institute three months ago; it now has over a thousand teams competing to create better models of cell state that can predict how cells will behave under intervention. I'll also talk about scBaseCount, an agent we developed to curate single-cell transcriptomic datasets; with it we've created the largest existing single-cell data repository — over three times as large as the next largest repository of single-cell data. And lastly, I'll talk about BioDiscoveryAgent, another agent we developed to guide and design perturbation experiments.
All right, I'll pause there. If there are any questions right now, I can answer them. Also, feel free to stop me in the middle if you have questions or if anything is unclear later.
>> Hi, can you slow down a little bit?
>> Okay.
>> Sorry.
>> Oh, sorry. I didn't realize — I thought it was only on my screen.
Okay, great. Let's keep going. All right.
Okay. So, let's first talk about State. What really is the goal here? Before talking about virtual cell models, or cell state models, whatever you want to call them — what really is the goal? Why do we even call this the virtual cell? Where does this come from? This is not a concept that we created, and it's not even very new. It's something that has existed in the literature for a very long time. For those of you who may have heard of the field called systems biology, it essentially had this very same idea: can we create mathematical models of cell behavior, train them on experimental data that we generate in the lab, and then use that to predict outcomes for new experiments that we've not done before?
In the past, the way this was approached was by developing systems of differential equations. There's seminal work from Oryon's group where they mapped out the reaction kinetics between different enzymes and proteins within the cell and created a system of differential equations. The goal then was to tune different parameters or change the input settings for this model in order to predict what would happen were we to perform a different experiment. There was also great work from Markus Covert's lab at Stanford, which created what they called the first whole-cell model: they tried to map every single interaction within the cell across DNA, RNA, proteins, and various other cellular components, resulting in, I think, several thousand differential equations, which their solver would solve simultaneously to predict what would happen to the cell under a different input or condition. These models are very useful — they definitely informed a lot of our understanding of the biology of those specific cells — but, as you can imagine, they don't translate and they don't generalize very well beyond the specific cell or the specific context that they were trained on.
So we wanted to explore how we can leverage more modern approaches to machine learning and artificial intelligence to enhance this process, and leverage the large amounts of data we're generating right now to learn these relationships from a more unbiased perspective. And here comes in the foundation model approach. As many of you are aware, there have been a lot of developments in AI research, and I think one big cause of this has been the foundation model approach to machine learning. What this means is that you no longer design your model from scratch for a specific task; instead, you build what are called foundation models, which are trained unsupervised on large amounts of unlabeled data — this could be text, images, or speech. Then, after the model is trained, you apply it to various downstream tasks, usually through some form of fine-tuning or other adaptation. Many of these tasks — answering questions, captioning images — may not be directly trained for during pre-training, but you can leverage the same general representation downstream for a variety of different uses. And so we wanted to bring this same approach to modeling data from the cell: can we learn general-purpose representations of cell behavior and then apply them to various downstream tasks to assess how well the model is working?
All right. Of course, there are many different tasks that would be of interest. What we were really interested in was being able to predict the effect of perturbations. The specific formalization we looked at was: given a distribution of unperturbed cells, if you apply some perturbation to those cells, what is the distribution of perturbed cells going to look like? Essentially, we wanted to develop models that are able to learn this effect — can we learn the distribution of perturbed cells given unperturbed cells and an applied perturbation?
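In symbols (notation assumed here for illustration, not taken from the slides), the task is to learn the conditional distribution of perturbed cell states:

```latex
% Learn a model p_theta of the perturbed-cell distribution, given the
% unperturbed population X_0, a perturbation \pi, and a cellular context c:
\hat{p}_{\theta}\!\left(X_{\mathrm{p}} \mid X_{0},\, \pi,\, c\right)
\;\approx\; p\!\left(X_{\mathrm{p}} \mid X_{0},\, \pi,\, c\right)
```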
Of course, there are various other things that might be of interest when you're studying cell behavior, but for us this seemed very central and very important for really understanding the regulatory structure of a cell — how it functions and how different genes interact with each other. Yes?
>> Sorry, just to check: are X_p and X_0 population-level?
>> Yes.
>> Or individual cell level?
>> Yeah, that's a good question. Over here it's kind of loosely defined, but you can think of it as a—
>> Can you repeat the question, please?
>> All right. So, how are we thinking about X_0 and X_p? Of course, there are many ways of thinking about cell state. In our research we focus a lot on transcriptomics, largely because we have a lot of data at the single-cell level for transcriptomics. But of course there's also a lot of exciting data in imaging, proteomics, and cell viability, which can also be explored in the future. We looked in particular at a few datasets that were very interesting. As I mentioned, scBaseCount, which has over 500 million cells' worth of data from 70 different tissues. We also made use of data from the Tahoe-100 dataset, which had about 50 different cell lines, and of course the CELLxGENE corpus, which has data from various different tissues and species.
In terms of perturbing cell state, we looked primarily at genetic perturbations. In particular, we were very interested in using Perturb-seq data: again, large cell numbers, a large number of perturbations, and easily scalable. So this was a very relevant data type for teaching the model some causal structure about what happens under different interventions and how the transcriptomic profile is impacted.
All right. So what is the machine learning task, more specifically? Say you have a given cellular context. You have a certain number of perturbations that you've already tested, and you know what the cell state is in response to those interventions. You're then interested in predicting what's going to happen for a new, unseen test perturbation — we call this the perturbation generalization condition. This has been studied quite extensively in the past. What we were more interested in is a slightly different variant of this task: not just looking across perturbations, but looking across contexts. For us this was more tractable and also something that can be quite biologically useful. For instance, say you have T cell populations from 50 different donors, you know how those T cell populations react to various perturbations, and you know how the state of those T cells or their activity varies. Now you get data from a new patient. We want to build a model that could predict, for this new patient — given the transcriptomic information or other variables that we have — what would happen under the same set of perturbations. That's the setting we thought would be more useful, and it is in fact more tractable from a machine learning perspective, so that's what we focus on in our model.
All right. Again, this is not new; it has also been studied in the past. A lot of existing models will take one cell at a time from each of these two distributions — your input cell distribution and your output distribution — and try to learn a model that predicts the perturbation effect, conditioned on the specific donor or cell type, however you're defining context. But what we found is that a lot of these models do not perform very well, and will often not outperform simple baselines. There have been a lot of papers within the past two years — within the past year, actually — that have evaluated existing models and found that they often fail when compared to linear baselines or to simply predicting the average effect. We believe this is largely driven by the fact that existing models are not meaningfully accounting for the strong heterogeneity that exists within these populations. We think it's important to model the heterogeneity both within a specific cellular population and across different experiments. So in our model we focus specifically on modeling those effects, allowing the model to learn — really, to disentangle — the true perturbation effect, and we found that this helps us quite a bit in performing better than existing approaches.
What I mean by that is: if you have an unperturbed population and a perturbed population, you can model the perturbed state as composed of three different factors. You have the true effect — the true effect caused by the perturbation. Then you have the heterogeneity within the basal population: say you have cells in different phases of the cell cycle; each of those might respond slightly differently to the same perturbation, and we want to model this unannotated variation within the source population. And then there's the noise between different experiments: if you're combining data from various perturbation or single-cell experiments, you might have different sequencing depths, or the experiments might have been performed under different conditions — so how do we model that effect independently as well? We create a model that accounts for all three factors: the heterogeneity in the basal (source) distribution, the true perturbation effect, and the technical noise across experiments.
>> Yes.
>> Sorry — is this a qualitative equation or a technical equation?
>> It's still fairly qualitative but we do map it back to specific parts of the model in the next few slides.
>> Yeah, I'll wait.
>> Okay. Yeah, I'm happy to answer questions after that. Yeah.
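One hedged way to write the decomposition just described (the symbols here are illustrative, not the paper's notation):

```latex
% Perturbed expression = true perturbation effect + basal heterogeneity + technical noise
X_{\mathrm{p}} \;=\; \underbrace{\Delta_{\pi}}_{\text{true perturbation effect}}
\;+\; \underbrace{h(X_{0})}_{\text{within-population heterogeneity (e.g., cell cycle)}}
\;+\; \underbrace{\varepsilon}_{\text{cross-experiment technical noise}}
```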
>> So what we did, to resolve the within-dataset heterogeneity — the first two terms in this equation — was to train a transformer over sets of cells, as opposed to individual cells. What this lets our model do, by looking not at individual cells one at a time but at large samples of cells from the distribution, is intrinsically begin to learn the variation between cells within the population and how to model it. The way we do that is we first stratify the input population by known covariates: if we know what cell types are in the population, what perturbations are applied, and what experimental batches are present, we stratify by each of those covariates. Then we pair up the unperturbed and perturbed populations with matching covariates and use a maximum mean discrepancy loss — a distributional loss — to align the predicted perturbation effect with the true perturbation effect. This doesn't force the model to learn a one-to-one map between cells; it allows the model to leverage the full distribution of cells to learn an effect, and at the same time it begins to capture some of the underlying heterogeneity within the basal cell population.
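As a rough sketch of the kind of distributional objective being described (the actual State loss, kernel choice, and stratification details may differ), an RBF-kernel maximum mean discrepancy between a predicted set of perturbed cells and the matched observed set could look like this:

```python
# Sketch of a set-level MMD objective in the spirit described above (the real
# State loss and covariate handling may differ). x and y are sets of cells.
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between predicted cells x (n, d) and observed cells y (m, d)."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

def set_loss(pred_by_group: dict, obs_by_group: dict) -> torch.Tensor:
    """Sum MMD over groups that share covariates (cell type, batch, perturbation)."""
    return sum(rbf_mmd(pred_by_group[g], obs_by_group[g]) for g in pred_by_group)
```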
More concretely, we developed this transformer, as I mentioned, where every token within the input sequence is an individual cell — as opposed to individual genes, which is what you'll see in a lot of existing models — and you predict how each of those tokens varies following the transformations within the model. This gives you a sense of what the predicted output expression would be for that distribution of cells. Then, for the epsilon term, we wanted to be able to model different datasets. When we're training our model on multiple datasets, we make use of an embedding model. I won't go into details here, but we optimize the model to focus on true perturbation effects and to reduce the effect of noise between different experiments, and this allows us to train more effectively across different settings.
So what does this look like? Here you can see how the model's performance varies as you increase the cell set size. On the y-axis is the validation loss, which tells you how effectively the model is able to predict the perturbation effect you're interested in. On the x-axis is the amount of compute you've used — how many operations have been performed during training. As you can see, as you increase the cell set size (as the green gets darker), the validation loss keeps decreasing, up until a certain point around a set size of 256, at which it starts to plateau; beyond that, the model doesn't get much better for this specific dataset. The core idea here is that even though you might see the same number of cells and perform the same amount of computation, if you're able to see multiple cells at the same time during a forward pass, the model has a better understanding of the underlying distribution and it more quickly learns to predict the perturbation effect accurately.
>> All right — your question?
>> So, you answered my other question—
>> Yes.
>> —a few slides ago. But I have a new question.
>> Okay, sorry. So am I right in thinking that what you're doing here is: you're not changing the input and output of the model as compared to some of the prior approaches — you're using this stratification and maximum mean discrepancy to restructure the loss, so that the model gets a better learning signal without making it take in a whole population of a certain size as input. Is that correct?
>> Yes, exactly. A lot of existing transformer-based approaches in this space take individual cells one at a time and try to model effects, but what we're doing is giving the model visibility into a large set of cells — almost like a large sample of the population. Yeah.
>> But the model can still operate — at inference time, it could still take a single cell as input?
>> Yes it could.
>> Okay. Yeah
>> it could. Yes. Yeah.
>> So in one of the previous slides you said that you treat cells as tokens, right?
>> Yes. >> Do you feed in a vector of the entire RNA expression profile, or do you select certain RNA expression features — and if so, how do you do that?
>> Yeah, that's a great question. We have two versions of our model: one that works directly on gene expression, and one that works in the embedding space I mentioned earlier. In the version that works on the embedding space, we embed the RNA profile of the cell into a lower-dimensional embedding — I think it's 256- or 512-dimensional — and that is fed in as the token. In the case of gene expression, at the moment we take the 2,000 most highly variable genes and pass them through an MLP that creates a lower-dimensional representation, which is then fed into the model as a token. So it doesn't have gene-level dimensions, but you can map back to gene space if you really want to. Any other questions? Okay, I'll keep moving.
We could also ask: what if you did something simpler without using self-attention? Say you just fed the average through the model — using the pseudobulk expression as input. That's what you see as the solid gray line: it does quite well, but it's still not able to outperform the model that sees a large set size as input. You could also provide the same cells as input but without any self-attention — not making use of the transformer's attention mechanism — and that doesn't do as well. So really, both the cell-set component and the self-attention help the model learn aspects of the underlying cells' heterogeneity and of the perturbation effect that are otherwise lost when you remove those components from the model.
Yes.
>> What about a set of pseudobulks versus— >> Oh, you mean like randomly sampled pseudobulks?
>> For example, each of your single cells.
>> Yeah.
>> So, we've tried that in a different model. We haven't tried it here. I think the challenge with doing that is that it changes the distribution the model is used to looking at, and then you're restricted to always predicting pseudobulks. For some use cases maybe that's fine. We haven't tried it in this setting — it might actually work. But yeah.
All right — can I keep going a little? Okay. All right.
If you look at the attention heads, it's quite interesting. I don't think it's very informative to put too much weight on attention heads, because you can pull out any story from an attention head, but what's nice is that you do see different patterns. Some of the attention heads are just flat vertical lines, which shows that in those cases the model is learning a more uniform, population-level response across different cells — meaning it doesn't vary — and then there are other examples where there is variation across different cells, so it does learn more cell-specific responses. We also compared our approach theoretically to optimal transport, because essentially what we're learning is very similar to what optimal transport does: it tries to learn a mapping function that minimizes a predefined cost — except in our model we're not explicitly defining the cost. So we have fewer assumptions and we're technically easier to scale, at least in our current implementation. And we found that, theoretically, within our space of solutions we are also able to recover optimal transport, which is quite exciting.
All right. So we then spent a lot of time thinking about model evaluation. Another challenge with a lot of the existing perturbation models is that they haven't developed metrics that look across a range of measures: many of them focus only on Pearson correlation. But it's also very interesting to understand how the model is doing on differential expression, and how it is doing at detecting things that biologists care about. So as a first step, we developed an evaluation procedure that we thought resembled a common setting in experimental discovery. First, we looked at three different datasets: chemical perturbations, signaling perturbations, and genetic perturbations. We set up the evaluation such that the model has access to a number of different cell contexts or cell lines at training time, and it predicts the effect in a held-out cell line at test time. To make the task a little more tractable, we allowed the model to see 30% of the data from that held-out context during training. And this is not an uncommon setting in academia or industry: very often you will run a perturbation screen very deeply in specific cell lines and a shallower screen in other cell lines, so this recapitulates that setting. The goal of the model is then to predict what would happen in the 70% held-out test set. In
addition to that, we incorporate baselines to make sure the model isn't just learning to average, or to copy over effects it has seen at training time. We have two specific baselines that are, in our opinion, the strongest. The first is the perturbation-mean baseline: for the training contexts, you take the average of the effect and paste it over to the test context — simply learning to average what you've seen at training and pasting that onto the test set. Or you can learn from the 30% of the held-out context that you've seen during training and copy that over — we call this the context-mean baseline.
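As a sketch of these two mean baselines (definitions inferred from the talk; the paper's exact formulations may differ):

```python
# Sketch of the two mean baselines described above; inputs are pseudobulk
# expression profiles of shape (n_genes,). Definitions inferred from the talk.
import numpy as np

def perturbation_mean_baseline(train_effects: dict, control_test: np.ndarray,
                               pert: str) -> np.ndarray:
    """Average the effect of `pert` over training contexts, paste onto the test control."""
    effects = [ctx[pert] for ctx in train_effects.values() if pert in ctx]
    return control_test + np.mean(effects, axis=0)

def context_mean_baseline(seen_test_perturbed: dict) -> np.ndarray:
    """Predict the average perturbed profile from the 30% of the test context seen
    at training time, regardless of which perturbation is being asked about."""
    return np.mean(list(seen_test_perturbed.values()), axis=0)
```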
We also developed a comprehensive evaluation suite, and this is now being used by over a thousand teams from across the world as part of the Virtual Cell Challenge. So we're really excited to see how well it has been holding up to a lot of testing and a lot of different users. The goal for this evaluation suite was: how well does State recapitulate a real Perturb-seq experiment? In our mind, when you run a Perturb-seq experiment, the core things you look at are expression counts, differential expression, and overall effect sizes, and so we designed metrics to look at each of these. They are, to a large extent, independent of each other — of course there's some overlap in what they're measuring — but together they give you a more interpretable understanding of how the model is working. So we have a number of different metrics here; I'll walk through some of them in more detail rather than just listing them. First, let's talk about some of the expression-based metrics. One metric in particular that's been very effective for us in determining how well the model is working is something we call the perturbation discrimination score.
The idea here is: how well does the model learn to discriminate between the effects of different perturbations? For instance, say you have a perturbation prediction that your model makes, shown here in red, and this is the ground-truth expression based on the ground-truth data. You want to know how similar your prediction is to this ground truth as compared to the other true perturbation states. The blue points correspond to other true perturbation states, the green is the real state for that specific perturbation, and the red is our prediction of what that state would be. You want the red and the green to be as close as possible. So we use a rank-based metric to measure how similar they are: we number these perturbations by how similar they are to the prediction — the most similar gets a rank of zero, the next a rank of one, then two, and so forth — and your score is the rank that the correct ground-truth value received, divided by the total number of comparisons you make. The ideal model would always be closest to the ground truth for every perturbation, as compared to the other perturbations in your test set.
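A sketch of this rank-based score as described (the distance metric and normalization here are assumptions):

```python
# Sketch of the perturbation discrimination score described above; the distance
# and normalization are assumptions. Lower is better (0.0 = prediction is
# closest to its own ground truth).
import numpy as np

def discrimination_score(pred: np.ndarray, true_profiles: dict, target: str) -> float:
    """pred: predicted profile for `target`; true_profiles: {perturbation: profile}."""
    dists = {name: np.linalg.norm(pred - prof) for name, prof in true_profiles.items()}
    ranked = sorted(dists, key=dists.get)            # most similar first
    return ranked.index(target) / max(len(true_profiles) - 1, 1)
```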
Any questions? Okay. So this was a very informative metric, and we applied it to a couple of different datasets. In blue here is the prediction made by State. We compared this to the three baselines I mentioned earlier — the perturbation-mean model, the context mean, and a linear model. None of them perform very well, but they still perform better than the existing deep learning baselines, including scVI (a variational autoencoder), CPA (also autoencoder-based), and scGPT (a transformer-based foundation model).
>> All right. We then moved on to looking at differential expression. For those of you who haven't worked in transcriptomics, differential expression is the bread and butter of working with these datasets. The goal is to look at each individual gene and identify where we see a significant change in expression following a specific condition, such as treatment with a drug or, in our case, a genetic perturbation or other intervention. We looked at three different aspects of differential expression. Whenever people talk about differential expression, they'll usually show you this volcano plot: every dot here is a specific gene, on the x-axis is the log-fold change of that gene following perturbation, and on the y-axis is the significance of that change in expression. So first we wanted to see how well we're doing at detecting the significance of the change in expression. On the y-axis here is the predicted significance from a given model — in this case State or one of the baselines — and on the x-axis is the true significance. You can see there's pretty good concordance in the case of State compared to the other models. You can also look at this across different perturbations, and if you do a precision-recall curve, again we see that State does pretty well compared to the same baselines. We also wanted to look at log-fold change: if the ground truth says a gene goes up by a twofold change, does State (or another predictive model) also predict a twofold change, or does it predict something else? Here again we see that State does quite well at capturing the true log-fold change as compared to other baselines.
Here are some more examples — I can skip this. So how do we use this in practice? I've shown you that State does a good job of detecting these cell-type-specific effects, but how do we actually leverage this for real discovery? We looked at one specific drug for melanoma called trametinib. Over here you can see the performance of State compared to two baselines — again, the context mean and the perturbation mean — and one of these dots corresponds to the perturbation caused by trametinib. What we did was look at a specific cell line, C32, which is a BRAF-mutant melanoma cell line; trametinib is specifically designed for BRAF-mutant melanoma. We wanted to see: is the model actually able to detect gene expression changes that are specific to this BRAF-mutant cell line? So how do you read this plot? Here we have five different cell lines that were kept in our test set. Each column corresponds to an individual gene. The green corresponds to the ground truth: these are all the genes that are differentially expressed in the ground truth, and you have this for five different cell lines. Then here are the predictions made by State. You can see that State is able to detect a lot of these differentially expressed genes across different cell types. The nice thing is, it also detects genes that are agnostic to cell type: some of these genes show up in all five cell types, and State is able to detect them across the various different contexts.
>> Yes.
>> You were asked if you could speak a little bit slower, so that it can be transcribed.
>> Okay, sure, no problem. All right, I'll try to talk a bit slower.
Yeah. So, this was great, because what we're seeing is that for this FDA-approved drug for BRAF-mutant melanoma, the model is actually detecting effects in BRAF-mutant cell lines that are specific to that cell line. And this goes back to our initial motivation: if we do build these models and they really are donor-specific, we could hopefully leverage them for designing more personalized treatments and personalized therapeutics.
Some more examples of the model's predictive use: here we're looking at State's predicted values for cell survival, and we compare them to the actual values. Again, we see that the performance of our model is better than using different kinds of mean baselines.
All right, I'll pause there. Any questions before I move on?
Okay, cool. All right, I'll go through this quickly as well. Really, the vision for the virtual cell is not just to predict effects within the specific dataset or experiment you're performing, but to be able to leverage these models across different contexts and settings. What we were excited about is: can we take models like State, bring them brand-new cell types, cell states, and donors, and predict what the effect of a perturbation or intervention would be? So we built our own foundation model as part of State to learn this general-purpose representation across different contexts. We compared it to existing foundation models like UCE, scGPT, and Transcriptformer, and we found that our model was very effective at separating different perturbations. On the y-axis here is the model's ability to separate perturbations in the true data, and on the x-axis what you're seeing is that if you actually use that embedding downstream as part of the State model, you see better performance with the State embedding than with other embeddings. That was exciting for us. We then applied this downstream — I'll skip this — to a setting where we pre-trained our State model on the Tahoe dataset, which had 50 different contexts and over 1,000 perturbations. We pre-trained on that large chemical perturbation dataset and then fine-tuned on various downstream datasets, including some genetic perturbation datasets, and we found that by using our embedding model and the State architecture overall, we were able to make predictions in the zero-shot setting that were quite effective for certain metrics — not for all metrics. In particular, we were very good at predicting the overall effect size of a perturbation; we didn't do as well at predicting differential expression, which is understandable because it's a much harder task. But it's very encouraging to see that, using these models that account for heterogeneity across datasets and within cell populations, we're slowly able to piece together the signal from across these different datasets and build general-purpose perturbation models that can be applied to different settings.
All right. So the next thing we wanted to focus on is: how do we bring this discussion around virtual cells and predicting perturbation effects to the community — how do we get community input, and how do we have a larger discussion around the best way to evaluate these models, the right task, and how we should best be generating data? In this spirit, we launched the Virtual Cell Challenge a few months ago. The idea behind it was: if the community is really excited about the potential for artificial intelligence to impact our ability to model cell behavior and understand cell state, how do we formalize this task in the same way that, perhaps, CASP did for the protein structure prediction problem? We wanted to start that conversation with the community: we proposed our formulation of the task, and now we're getting a lot of feedback on how the community thinks the task can be adapted and improved. Again, the task we focused on is the context generalization task in perturbation prediction. We also developed a purpose-built, very deeply sequenced single-cell Perturb-seq dataset just for this challenge, and it's been very well received — we've been told by many people that it is possibly one of the highest-quality perturbation datasets they've used. So we're really excited about how we can take this forward: similar to how AlphaFold used CASP to improve over the years and become a really powerful model, we want to use the Virtual Cell Challenge as a launching pad for really strong models that can begin to do very well on this task.
So yeah, the competition has been running for a while now, and we're coming close to the final end date, which is about a month away. We've had about 3,000 registered participants, and over a thousand teams have made submissions already — I think the latest number from this morning was [inaudible]. So it's been great to see the involvement from the community, and it's been great to receive feedback on what's working and how we can improve this and steer it toward the most impact, scientifically and more broadly.
I also just want to make a small point that the data generation for this challenge was a very long and involved process. We wanted to be very strategic in the perturbations we selected, and we also wanted to sequence very deeply and have very high cell coverage. The way we did that was to first run a low-depth screen over 2,500 perturbations in human stem cells. We then looked at the Perturb-seq data for that cell line, grouped different perturbations by their effect sizes, clustered them by phenotypic diversity, and carefully picked perturbations that had maximal overlap with existing datasets, so that different teams could train on existing data. Finally, we ended up with a very large, high-quality dataset of 300 perturbations, each with over a thousand cells per perturbation. For context, most existing datasets only have between 30 and maybe at most 100 cells per perturbation, so this is almost an order of magnitude more. We also sequenced it very deeply, at about 50,000 reads per cell, so you get very precise gene-level expression. It's been great, and we're excited to see the outcome — how this impacts the field and how it accelerates development toward better models of predicting cell behavior.
All right, I'll pause. Any questions so far? Are we good?
Okay, cool. So, in the spirit of clean, large datasets, we also wanted to think more deeply about how we can leverage AI not just for developing and evaluating models, but also for curating large public data repositories and cleaning them up. In this spirit, we developed what we call scBaseCount. The motivation here was: almost every paper published using single-cell data — actually, every paper published using single-cell data — has to submit the raw sequencing reads to SRA. So SRA has an enormous amount of data, but it's all raw reads from your sequencer. We wanted to see: can we go directly into SRA, reprocess all the single-cell data that's in there, and essentially create our own very large single-cell dataset? We estimated that there must be over 500 million cells' worth of 10x data in SRA, and the largest existing repository of single-cell data sits at about 120 million. The challenge, of course, is that the metadata is a big mess. It's very difficult to process manually, and that's one of the main reasons nobody had done it. So we created an agent to do this process for us. We designed a hierarchical agent workflow to look through every 10x record within SRA, process the metadata, and then feed that into a recounting pipeline, which does all the sequence alignment, counts the gene expression, and gives us a final cell-by-gene count matrix for every dataset. The goal was to enable better development of AI models downstream. So here's the overall pipeline.
downstream. So here's the overall pipeline. So we have SR agent which
pipeline. So we have SR agent which automatically keeps looking for data sets in SRRA. It finds them and then it has other agents that will assign different metadata components um to that
specific data set. Um so there's a number of different agents here. I'm
just kind of skimming over this but it's a a hierarchical workflow. And then when the data set is ready with associated metadata, it's fed over to our next flow pipeline for performing the recounting um you know aligning the reads and then
getting the final gene expression count matrix. And so one of the benefits of
One of the benefits of doing this, of course, is that we uniformly process all the data using the exact same read alignment and the exact same gene sets. So it's a really nice, large dataset in which every cell has the exact same set of genes with expression counts. That makes it very easy to work with, and you can read out not just protein-coding genes but also non-coding genes, as well as intronic and exonic expression. So it's a really valuable dataset. To give a sense of scale: right now scBaseCount is at over 500 million cells.
It's over four times as large as the largest single-cell data repository, CELLxGENE. It's also the most diverse: we have over 27 different species, including both plant and animal species. The largest, of course, are human and mouse — about 300 million human cells and about 150 million mouse cells. It's also the largest repository that is uniformly processed, so we actually see reduced batch effects. We're also the largest repository of non-coding gene expression, so you can see how these non-coding genes vary across various cellular contexts. And it's also, to our knowledge, the largest biological data repository that has been curated by an AI agent. The fact that it's curated by an agent gives us a lot of power to keep growing the repository, and in fact it does keep growing: we just send the agent off to look through SRA again, identify new datasets, and process them, so it's a continually expanding repository of data. Here on the left, what you see are the various cell types and tissues represented within this dataset. We see an overabundance of T cells and immune cells, but other cell types are also represented, such as epithelial and endothelial cells. We've done a lot of analysis — we have an updated preprint that should be out soon as well.
out soon as well um so we did some downstream analysis as well what is the benefit of having a a large uniformly processed uh data repository we see that the batch effects are significantly lower especially if
you compare um if you try to see how much of the signal within this data set can be explained by technical ical factors such as sample ID or single cell versus single nuclei or the library prep
chemistry that in each of these cases in terms of base count the signal is lower than if you compare it to something like cell by gene. And then we see that in terms of more biologically meaningful um
uh var variables such as tissue a lot more of that signal can be explained in the case of base count as compared to cell.ene. So it's all kind of very
cell.ene. So it's all kind of very positive indicators that doing this uniform processing helped us get cleaner um cleaner data downstream. Um we also use this to train um an AI model in this case state and we looked at how well are
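One simple way to ask how much signal a covariate explains — illustrative only; the actual analysis in the preprint may use different estimators — is the between-group share of variance of a categorical covariate in a low-dimensional embedding:

```python
# Illustrative estimate of how much variation a covariate (e.g., sample ID,
# chemistry, or tissue) explains in an embedding space; the preprint's
# actual analysis may differ.
import numpy as np

def variance_explained(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """embeddings: (n_cells, n_dims); labels: one categorical covariate per cell."""
    total = embeddings.var(axis=0).sum()
    within = sum((labels == g).mean() * embeddings[labels == g].var(axis=0).sum()
                 for g in np.unique(labels))
    return 1.0 - within / total   # between-group (explained) share of total variance

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 20))
tissue = rng.integers(0, 5, size=1000)
print(f"variance explained by 'tissue': {variance_explained(emb, tissue):.3f}")
```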
we able to distinguish between different phenotypes including cell type disease perturbation. Um and we found that in
perturbation. Um and we found that in all of those cases uh base count was able to do quite well. The differences
were less stark in the case of more like uh more easy to separate phenotypes like cell type. But then when we go down to
cell type. But then when we go down to the de the level of differentially expressed genes, we found that um using base count gave us a significant edge in being able to detect variation uh across
different um uh across different cell cell types and cell states.
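(To make this comparison concrete, here is a minimal Python sketch of one way to estimate how much variation a covariate explains, using scanpy and scikit-learn. The column names and the PC-regression formulation are illustrative assumptions, not necessarily the exact analysis behind these results.)

```python
# Minimal sketch (not the authors' exact pipeline): estimate how much of the
# variation in an expression matrix is explained by a technical covariate
# (e.g. library prep chemistry) versus a biological one (e.g. tissue).
import numpy as np
import scanpy as sc
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

def variance_explained(adata, covariate_key, n_pcs=50):
    """Mean R^2 of a one-hot covariate regressed onto the top principal components.
    Note: normalization/log1p/PCA modify `adata` in place; pass a copy in real use."""
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=n_pcs)
    X = OneHotEncoder(sparse_output=False).fit_transform(adata.obs[[covariate_key]])
    pcs = adata.obsm["X_pca"]
    # R^2 per PC, weighted by each PC's share of variance.
    r2 = np.array([
        LinearRegression().fit(X, pcs[:, i]).score(X, pcs[:, i])
        for i in range(n_pcs)
    ])
    weights = adata.uns["pca"]["variance_ratio"][:n_pcs]
    return float(np.average(r2, weights=weights))

# Example usage (hypothetical column names):
# tech = variance_explained(adata, "library_prep_chemistry")
# bio  = variance_explained(adata, "tissue")
# Cleaner, uniformly processed data => low `tech`, higher `bio`.
```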
All right. Um okay, I'll pause here again if there are any questions. Yes.
>> About a variety of different data.
>> Do you have any understanding of how generalizable results are across cell types or does it kind of depend on the problem?
>> Um, you mean for our data curation effort or for the model that I showed earlier?
And, like, thinking about scaling these experiments to different cell types.
>> Yeah.
>> Yeah, so that's a really good question. I showed some results at the end suggesting there is some meaningful improvement, but I would say these models are still far away from where you can just point at a cell type or a specific tissue and say what the effect of a perturbation is going to be there. Right now, at least within our model, we find that if you have some amount of experimental data for that condition and that cell type, it tends to do much better; it tends to learn the underlying regulatory structure a bit better. So I guess the short answer is that I don't think we're all the way there yet, but there are good signs that we've made progress. And at ARC we think that if we standardize the data generation process, we can probably make a lot more progress. One of the big issues with these perturbation experiments is that everyone does them slightly differently, so what a knockdown of a specific gene means in my data set might be very different from what it means in yours, and that makes prediction very hard for the model, which is another reason. But yeah, I think there's a lot of work that needs to be done. Yes.
So in the slide where you show the distribution of cell types in your data set.
>> Yeah.
You showed that you have a very big concentration of T cells, and this caught my attention because, historically, immune cells often show the donor-specific behavior you were mentioning earlier.
>> Right.
And this can be due to HLA alleles or any number of reasons.
>> Right.
Does your model learn to analyze effects per donor, or is it learning across all donors?
>> Yeah, that's a great question. In our model as it is right now, we've not dived into this too much, but we have another model we've been working on that specifically focuses on this. And in fact, we find that we do have pretty good generalization for T cells, I think precisely because there's such an overabundance of T cells in the scBaseCount data set, so that definitely helps. A lot of studies have focused on cancer, on blood, on immune cells, so it doesn't surprise us that immune cells and T cells are so overrepresented. And for the models we're training on this data, we do find that generalization is better when you're working with immune cells and blood cells, so it definitely is having an impact. Yeah.
Okay. I have about 5 minutes, so I won't go into too much detail for the next one, but I'll do maybe two slides. The last thing we wanted to look at was: can we use these agents to actually tell us what experiment to do? So far we've been using them more passively, in the sense of "here's the data, can you clean it up and help us build this large standardized repository." Now we wanted to see whether we can have the agent design a lab experiment, take the readout from that experiment, update its beliefs, design another one, and do this iteratively. We did this again in the context of knocking down genes and measuring a phenotype. The goal for our model was to identify strong responders. This is another screen in T cells, and we wanted to identify perturbations that lead to an overactivation of a specific pathway. What we do is: the model makes its first guess, saying you should try perturbing these 30 or 60 genes. That experiment is performed, and you get a hit ratio, which tells you how many of the predictions actually caused a strong effect. This information is fed back to the agent, the agent processes it and makes another prediction, and that results in another readout. We keep performing this over multiple steps, and we see that over time the model starts to pick more and more interesting hits, or interesting perturbations. We compared these predictions to simpler approaches, like Bayesian optimization or random baselines, and we found that our model was doing quite well across different data sets. Again, a lot of this was focused on T cells and immune cells, because they were overrepresented in the data we used for evaluating the model. But it's very encouraging to see that the model is doing something better than randomly picking genes, and even better than a Bayesian optimization approach.
When we do a deeper analysis, we find that these agents are able to make use of both the information you provide in the prompt and the experiments they see, which is very encouraging. For instance, here the black line corresponds to a random baseline. The purple line corresponds to not telling the agent what you are doing, so it just sees every experiment without any context: initially it starts off pretty weak, but later it starts to figure things out. But if you tell the agent, "here's the biology I'm studying, this is exactly what I want to do," it starts off strong and keeps it up. So giving the model both the background information on what you're doing and the experimental data produces the strongest results we've seen, which is very encouraging and shows that the model has the ability to reason over different pieces of information.
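(As an illustration of the loop just described, here is a minimal Python sketch of an agent-guided screen with hit-ratio feedback. The `propose` and `run_experiment` callables, the prompt wording, and the batch size are hypothetical stand-ins, not the actual system from the talk; setting `background` to an empty string corresponds to the "no context" ablation.)

```python
# Illustrative sketch only: an iterative "agent-in-the-loop" perturbation screen.
import json
from typing import Callable

def agent_guided_screen(
    propose: Callable[[str], list[str]],                     # e.g. wraps an LLM call, returns gene names
    run_experiment: Callable[[list[str]], dict[str, bool]],  # gene -> strong responder?
    background: str,                                         # biological context in the prompt ("" = no-context ablation)
    candidate_genes: list[str],
    n_rounds: int = 5,
    batch_size: int = 30,
):
    history = []      # accumulating evidence the agent can reason over
    tested = set()
    for round_idx in range(n_rounds):
        prompt = (
            f"{background}\n"
            f"Previously tested genes and whether each was a hit: {json.dumps(history)}\n"
            f"Propose {batch_size} untested genes from {sorted(set(candidate_genes) - tested)} "
            f"most likely to strongly activate the pathway."
        )
        batch = [g for g in propose(prompt) if g not in tested][:batch_size]
        results = run_experiment(batch)                       # wet-lab readout for this round
        hit_ratio = sum(results.values()) / max(len(results), 1)
        history.append({"round": round_idx, "results": results, "hit_ratio": hit_ratio})
        tested.update(batch)
        print(f"round {round_idx}: hit ratio = {hit_ratio:.2f}")
    return history
```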
All right. And of course, a lot of people at the frontier labs are very excited about the potential of using these agents to design experiments. In fact, the co-founder of Anthropic has said that we shouldn't think of AI as just a form of data analysis; instead we should think of it as a virtual biologist that performs all the different actions we do as biologists, which should really accelerate our ability to discover new therapies and make progress in research.

All right, this is the end of my talk. Thanks for listening. To summarize, I walked you through four different projects: I talked about State, the Virtual Cell Challenge, and how we're making use of agents for curating and for generating new data. Thank you so much for your attention. I want to acknowledge everyone in my group, in particular the people with asterisks against their names, who were lead contributors to the State model, as well as other collaborators at Stanford. For any of the papers I presented today, you can follow the QR code to read more detail. I'm happy to take questions now, or you can email them later. Yeah.
>> Oh, sorry. I didn't see you. Yeah. Yeah.
I wonder, and I need to read the paper to get a better sense of some details of the way the State model works, but I recognize you're training in an optimal transport setting. It also seems like there are some similarities to looking at this as a metric learning problem, where the choice of stratification you're doing is akin to batch mining: identifying what things to populate your batch with in order to make the loss learn something meaningful and interesting, as opposed to something that's just noise. Is that a perspective that resonates at all, or something you've thought about, or am I totally off base here?
>> No, that's actually a really cool way to put it. I think when we align the covariates, we make it in some sense more meaningful for the model to learn variation, because we've already corrected for the obvious variation. So yeah, that is an interesting perspective. We don't have any metric-learning-style losses, like a triplet-based loss for instance; we don't have those in the model at the moment. So maybe we're not directly making use of set-to-set variation as a loss. But I'm sure the model is learning some aspects of that, because we also tried an experiment where we artificially put in very heterogeneous sets of cells, like two different cell types, and we actually do see a block pattern in the attention maps, so it's learning to detect that. So yeah, it's an interesting way to frame it. Yeah.
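(For readers unfamiliar with the metric-learning framing raised in this question, here is a small sketch of a triplet-style loss over cell-set embeddings. As the answer notes, State does not use such a loss; this is purely to illustrate what the questioner is alluding to, and the mean pooling and tensor shapes are assumptions.)

```python
# Illustrative only: a triplet-style metric-learning loss over cell-set embeddings.
import torch
import torch.nn.functional as F

def cell_set_triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor/positive/negative: (batch, n_cells, dim) embeddings of cell sets.
    Sets are summarized by mean pooling before computing distances."""
    a = anchor.mean(dim=1)    # (batch, dim) pooled set embedding
    p = positive.mean(dim=1)  # same perturbation/covariate stratum as the anchor
    n = negative.mean(dim=1)  # different stratum (a "mined" negative)
    d_ap = F.pairwise_distance(a, p)
    d_an = F.pairwise_distance(a, n)
    return F.relu(d_ap - d_an + margin).mean()

# Usage with random tensors as stand-ins for learned embeddings:
# loss = cell_set_triplet_loss(torch.randn(8, 64, 128),
#                              torch.randn(8, 64, 128),
#                              torch.randn(8, 64, 128))
```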
>> Thank you.
>> Yeah. Uh yes.
>> On the last slides, where you discussed using AI as a virtual biologist, you briefly went over how you broke the workflow down into multiple agents doing different stages. I'm curious how you approached building that out and deciding what tasks would be done by each agent?
>> Okay.
>> I think you're talking about... you mean this thing?
>> Yeah.
>> Yeah, that's a great question. So, it's funny. This was led by Nick, who's really into agents; he was the first author. There's a really nice tweet, I think I have it in here as a hidden slide. It's this one: it's like, me using LLMs for personal fun projects: "wow, this thing is such a genius, why do we even need humans anymore?" But then when you try to deploy them in the real world, it's "oh my god, why is this thing so unbelievably stupid?" And it's really exactly that. You start off thinking you can just feed this to GPT and it'll figure it out, but there are so many possible mistakes you can make, especially in the way you parse the text and then map specific elements of that text to known vocabularies and ontologies. Because of that, what we ended up doing was to have multiple levels of agents that perform similar tasks but check each other. Here is the full hierarchy: you have an overall supervisor that manages the lower-level agents, and the lower-level agents each do specific things. You have a find-dataset agent whose only job is to find data sets. You have another agent that is focused on the esearch attributes. And every supervisor checks the output of the agents that report to it and makes sure they're not making a mistake. It's kind of funny but also cool how these little agents check each other, and the response you finally get is much more validated. That's how the system keeps getting larger. If you're interested in looking at this in more detail, we used a package called LangGraph; you may have heard of LangChain, it's by the same people, but LangGraph works with a graph representation of agent workflows. Yeah.
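(A conceptual sketch of the "supervisor checks its workers" pattern described above. The actual system is built with LangGraph; this plain-Python version only illustrates the control flow, and all names here are hypothetical.)

```python
# Supervisor/worker pattern: each narrow-task agent's output is validated, and
# the work is retried rather than accepting an unchecked answer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[str], str]        # e.g. wraps an LLM call for one narrow task
    validate: Callable[[str], bool]  # supervisor-side check of the worker's output

def supervise(task: str, workers: list[Agent], max_retries: int = 2) -> dict[str, str]:
    """Run each worker on the task; re-run any worker whose output fails validation."""
    results = {}
    for agent in workers:
        output = agent.run(task)
        for _ in range(max_retries):
            if agent.validate(output):
                break
            output = agent.run(task)   # retry on a failed check
        results[agent.name] = output   # best available answer after retries
    return results

# Example wiring (stand-in functions, not the real agents):
# workers = [
#     Agent("find_dataset", run=find_dataset_llm, validate=is_valid_sra_accession),
#     Agent("map_ontology", run=map_ontology_llm, validate=is_known_ontology_term),
# ]
# curated = supervise("find new 10x scRNA-seq studies in SRA", workers)
```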
>> Yeah.
Uh, you had a question first. Yes.
Sure.
I'm not sure the generalizability is actually that great, that it outperforms the other one.
>> Or you mean, like...
>> From your data sets to the external data, just using the model.
>> Yeah, we used HVGs, and we matched the same gene sets. I should be clear that this is only looking at one metric, though. We had five or six different metrics, and on the others it was kind of mixed, because zero-shot is a difficult task. But this metric, the effect size, which is an aggregate effect across all genes, tends to be robust and consistent across data sets. If you look at individual genes, it's much harder; there we found that the noise between different data sets would overwhelm the predictive performance.
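(A small sketch of the distinction being drawn here: an aggregate effect size per perturbation versus per-gene effects. The exact metric used in the evaluation isn't specified in the talk, so this is one plausible formulation, not theirs.)

```python
# Aggregate vs per-gene perturbation effects on a cells-by-genes count matrix.
import numpy as np

def per_gene_effects(perturbed, control):
    """Per-gene effect: difference in mean log1p expression, shape (n_genes,)."""
    return np.log1p(perturbed).mean(axis=0) - np.log1p(control).mean(axis=0)

def aggregate_effect_size(perturbed, control):
    """One number per perturbation: the magnitude of the mean expression shift."""
    return float(np.linalg.norm(per_gene_effects(perturbed, control)))

# The aggregate scalar tends to be comparable across data sets, while the
# per-gene vector is far more sensitive to dataset-specific noise, e.g.:
# rng = np.random.default_rng(0)
# pert, ctrl = rng.poisson(3, (500, 2000)), rng.poisson(2, (500, 2000))
# print(aggregate_effect_size(pert, ctrl))   # robust scalar summary
# print(per_gene_effects(pert, ctrl)[:5])    # noisier gene-level estimates
```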
Yeah. Yeah.
Yeah.
...the final results from the Virtual Cell Challenge. But I know that for AlphaFold, the Protein Data Bank was founded decades earlier, so it took decades to get a model like AlphaFold 3. How long do you expect it to take to have enough data, or a sufficiently good model, to achieve accuracy as high as AlphaFold's, but for perturbation?
>> Yeah, that's a great question. I definitely think the bottleneck right now is data and data quality. Because of that, I think it will be a longer timeline. If it were just a question of not having the right model, you could probably make the case for a shorter timeline, because it's easier to iterate on model architectures; but I think there's a lot that needs to be done on data. We saw that ourselves when we put out the Virtual Cell Challenge data set: it was just one cell line and fewer than 300 perturbations, but it already showed much stronger and more consistent signal than almost all the public data sets out there. And I think a lot of labs are beginning to realize that we're not going to make meaningful progress in the single-cell space until we seriously clean up the data generation process and standardize the processing steps. scBaseCount is an important step. We're now collaborating with other academic labs to standardize even the way the data itself is submitted to SRA, so there's a lot involved on the data end, but it's great to see the momentum in the community. At ARC we have a billion-cell project, where we want to generate a billion cells' worth of perturbation data within the next couple of years. I think the Chan Zuckerberg Initiative is also trying to generate a lot of cells. So I think that will really accelerate the pace of progress. But still, a decade, probably more than a decade, I would say. Yeah.
Yeah. Yeah.
Uh, yes.
>> Question: are you working on isoform-level quantification for the SRA data sets?
>> Um, for usage, yeah, we do get that information. Yes, we do have isoform-level information. We've not done any deep analysis on it, but that's definitely one of the perks of our approach. Yeah.
>> Same question. I think another important aspect of perturbation experiments is how you define the readout, how the task is set up. Do you have any insight on how to standardize that?
>> Yeah.
>> Exhaustion, or...
>> Yeah.
>> Yeah, that's a good question. Transcriptomics is easy, because you can just pull out whatever you want from gene expression. For us, because we're trying to be more unbiased, we're not focusing on any one specific readout, so at the moment gene expression has been perhaps the most expressive one for us. If we had to go deeper, I would be interested in bringing in some proteomic panels and getting some of that information. Maybe some amount of imaging would also be useful, though it might overlap with proteomics. But yes, it is possible to train your model specifically for a phenotype of interest, and you might see better power to predict effects there. Yeah. Uh, yes.
So I guess a follow-up question to the last one: I'm very impressed with the data set you've curated so far, but you mentioned earlier that it's possible to curate data sets for proteomics and cell viability and all these other things, right? What efforts are currently being made to incorporate those data modalities?
>> Yeah, there are definitely efforts on the imaging front that I've heard of, because imaging tends to be cheaper to generate even than transcriptomics, and pharma companies in particular have a lot of capabilities there. On the proteomic side, I think single-cell readouts tend to be more expensive, and there's more development that needs to happen on the assay and technology side of things, so I'm not really synced in with what's happening there. I don't know if it's happening at the same scale, but there is a lot of interest in adding proteomic panels. The vast majority of interest has been on the transcriptomic side, just because of the easier scalability. So I don't think there's anything at this scale, except maybe in imaging, at least to my knowledge.