
Building a Virtual Cell: An AI Platform for Engineering Cell State (Oct. 20 Seminar)

By Dept. Biomedical Informatics Columbia University

Summary

Topics Covered

  • AlphaFold templates in silico cell experiments
  • Virtual cells enable therapeutic transitions
  • Transformers model cell set heterogeneity
  • AI agents curate billion-cell repositories
  • Agents iteratively design perturbation experiments

Full Transcript

Thank you for joining the seminar today.

I'm excited to see you all, and I'm also very excited to introduce Yusuf Roohani with the talk "Building a Virtual Cell: An AI Platform for Engineering Cell State." Yusuf Roohani is a machine learning group leader at the Arc Institute and a visiting scholar in the Department of Computer Science at Stanford University. His research explores how artificial intelligence can guide experimental design in biological discovery.

Sorry, we had a Wi-Fi issue. Okay.

All right. So, as Andre mentioned, I lead a machine learning group at the Arc Institute. For those of you who may not be aware, the Arc Institute is a nonprofit biomedical research organization based in Palo Alto, right off the Stanford campus. My group is working on an effort we call the Virtual Cell Initiative. Our goal is to use artificial intelligence to create better models of cell state, and to use those models to predict how cell state may change under different interventions, in particular under experimental conditions and perturbations that we haven't actually tested in the lab but whose outcomes we can predict in silico.

All right. Just a brief background on myself. I've been in this space for about a decade now. I started working on ML for biomedical research and drug discovery broadly in 2016, at a startup in Cambridge, Massachusetts, modeling network biology in cancer cells and trying to understand resistance mechanisms to chemotherapy. I then moved on to GSK, where we continued trying to build better models of cellular phenotypes, in that case working with high-content screening data, so image-based screens across millions of drugs. I then moved over to Stanford, where I was advised by Yuri and Steve as part of the Stanford AI Lab. That's my son. Here we were moving more into the transcriptomic space, thinking about how we can create better models of cellular transcriptomic state and how that state changes under interventions, in particular combinatorial interventions, and how we can model that using a variety of approaches ranging from graph neural networks to large language models.

And then since last year I've been leading a machine learning group at the Arc. Here we are essentially trying to take those ideas to the next level by scaling up both the data generation and the model development, and trying to see how far we can push these ideas and what it takes to make models that are actually useful within the lab. So it's been great to think about not just the model development side but also data generation and how we can guide it.

All right, moving over to the science. I'm sure I don't need to convince anyone here that AI has transformed scientific discovery. My favorite example is AlphaFold. AlphaFold takes in a protein sequence and predicts the three-dimensional structure of that protein, so essentially it is learning a mapping from sequence to structure. By learning this representation, it can then predict how other proteins would fold, even ones it may not have seen at training time. Over here you can see that AlphaFold is able to map out a very large space of possible protein structures, and many of these, shown here in blue, it hadn't actually seen in its training set. If you think about it, experimentally uncovering each of these structures takes a lot of time and effort; it's not an easy process. So in a way, what AlphaFold is doing here is performing that experiment for us in silico. And this template, of being able to perform in silico experiments, map out a large space of possible hypotheses, and then search for the ones that are most interesting, is something we find not just in protein structure prediction but in scientific discovery broadly. One way to think about scientific discovery is that it's a search problem over a space of potential hypotheses.

And so what we wanted to do was take these ideas from the molecular level and move them up one level, to the cellular scale. Can we move beyond modeling individual molecules and individual proteins, and begin to understand how these proteins and molecules interact with each other within a cell, which is a much more complex and noisier system, and of course build up from there to higher-level phenotypes at the tissue and organismal level and so on? Why would this be useful? Once you're modeling phenotypes at the level of the cell, it's easier to start querying questions that have therapeutic relevance. For instance, if you have a space of neurons and you can identify different forms of diseased neurons, such as degenerative or hyperexcitable neurons, as well as more functional, healthy neurons, then with a working virtual cell model you should be able to identify transitions that would take these neurons out of the more diseased states and move them closer to healthy states. That's really our goal: can we use a model that understands cell state intimately well to perform in silico experimentation, and by doing so also accelerate the discovery process, because you have more likely hypotheses that you can then drill down on and test in the lab.

Another aspect that becomes apparent through this description is that it's very important to be able to connect these models back to experiments in the lab. That's something we've taken very seriously. We wrote a perspective about how we think the AI virtual cell should be built, which was published in Cell last year. There we described the idea that these models cannot sit in isolation: they cannot be developed without considering the data used to build them, or without being employed to generate the next set of data that further empowers them. So it's really a closed loop, where virtual cell models help you understand cell biology in silico and propose hypotheses, which you then test experimentally in the lab and feed back into the model to further strengthen it and advance its understanding. That's our vision. So in the rest of this talk, you'll see that we never think about the virtual cell, or this vision of the virtual cell, as just a predictive model in isolation. We also think about the data that goes into it, and how we can more strategically generate the right data to power and strengthen these approaches downstream.

That segues nicely into the structure of the talk today. I'll be talking about the work in my group, which focuses on four broad themes. On the left is, as I mentioned, how we're leveraging AI to build better models, and also to create better evaluations that connect back to what is most relevant to the biologists, clinicians, and other people using our models; and also how we can leverage AI to generate data to power these models. I'll describe some work where we use AI agents to curate vast biological data repositories that can then be used to train these models. And I'll also describe, if we have time, some work we did using agents directly to design genetic perturbation experiments, in order to iteratively optimize toward phenotypes of interest.

Over the past year we've put out papers in each of these areas. Just briefly touching on each of them: I'll talk first about State, a transformer-based model that leverages data from sets of cells to better predict how those cells will respond to perturbation; it uses single-cell transcriptomics data. I'll also describe the Virtual Cell Challenge, which we launched at the Arc Institute three months ago; it now has over a thousand teams competing to create better models of cell state that can predict how cells will behave under intervention. I'll also talk about scBaseCount, an agent we developed to curate single-cell transcriptomic data sets; with it we've created the largest existing single-cell data repository, over three times as large as the next largest repository of single-cell data. And lastly, I'll talk about BioDiscoveryAgent, another agent we developed to guide and design perturbation experiments.

All right, so I'll pause there. If there are any questions right now, I can answer them. Also, feel free to stop me in the middle if you have questions, or we can come back to anything later.

>> Can you, um...
>> Hi, can you slow down a little bit?

>> Okay.

>> Or the... sorry.

>> Oh, sorry. I didn't realize. I thought it was only on my screen.

Okay, great. Let's keep going. All right.

Okay. So, let's first talk about State. And really, what is the goal here? Before talking about virtual cell models, or even just cell state models, whatever you want to call them, what really is the goal? Why do we even call this the virtual cell? Where does it come from? This is not a concept that we created, and it's not even very new. It has existed in the literature for a very long time. For those of you who may have heard of the field called systems biology, it had this very same idea: can we create mathematical models of cell behavior, train them on experimental data generated in the lab, and then use them to predict outcomes for new experiments that we've not done before? In the past, the way this was approached was by developing systems of differential equations.

There's seminal work from Oryon's group, where they mapped out the reaction kinetics between different enzymes and proteins within the cell and created a system of differential equations. The goal was then to tune different parameters, or change the input settings of the model, in order to predict what would happen were we to perform a different experiment. There was also great work from Marcus Covert's lab at Stanford, which created what they called the first whole-cell model. They tried to map every single interaction within the cell, across DNA, RNA, proteins, and various other cellular components, resulting in, I think, several thousand differential equations that their solver would solve simultaneously to predict what would happen to the cell under a different input or condition. These models are very useful, and they definitely informed a lot of our understanding of the biology of these specific cells, but as you can imagine, they don't translate and don't generalize very well beyond the specific cell or context they were trained on.

And so we wanted to explore how we can leverage more modern approaches to machine learning and artificial intelligence to enhance this process, and leverage the large amounts of data we're generating right now to learn these relationships from a more unbiased perspective. This is where the foundation model approach comes in. As many of you are aware, there have been a lot of developments in AI research, and I think one big cause of this has been the foundation model approach to machine learning. What this means is that you no longer design your model from scratch for a specific task; instead, you build what are called foundation models, which are trained unsupervised on large amounts of unlabelled data. This could be text, images, or speech. Then, after the model is trained, you apply it to various downstream tasks, usually through some form of fine-tuning or other adaptation. Many of these tasks, such as answering questions or captioning images, may not be directly trained for during pre-training, but you can leverage the same general representation downstream for a variety of different uses. We wanted to bring this same approach to modeling data from the cell: can we learn general-purpose representations of cell behavior and then apply them to various downstream tasks to assess how well the model is working?

All right. Of course there are many different tasks that would be of interest. What we were really interested in was predicting the effect of perturbations. The specific formalization we looked at was: given a distribution of unperturbed cells, if you apply some perturbation to these cells, what is the distribution of perturbed cells going to look like? Essentially, we wanted to develop models that can learn this effect: can we learn the distribution of perturbed cells given the unperturbed cells and an applied perturbation?
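As a rough formalization, using the X_0 / X_p notation that comes up in the question below (writing the conditioning on a context c explicitly is my own shorthand, not something spelled out in the talk):

    % X_0: unperturbed cells, X_p: perturbed cells, p: perturbation, c: cellular context (assumed notation)
    X_0 \sim P_{\mathrm{ctrl}}(\cdot \mid c), \qquad X_p \sim P_{\mathrm{pert}}(\cdot \mid c, p),
    \qquad \text{goal: learn } f_\theta \text{ such that } f_\theta\big(\{X_0\},\, p,\, c\big) \approx P_{\mathrm{pert}}(\cdot \mid c, p)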

Of course, there are various other things that might be of interest when you're studying cell behavior, but for us this seemed very central and very important for really understanding the regulatory structure of a cell, how it functions, and how different genes interact with each other. Yes?

>> Sorry, just a question: are X_p and X_0 at the population level?

>> Yes.

>> Or individual cell level?

>> Yeah, that's a good question. Over here it's kind of loosely defined, but you can think of it as a...
>> Can you repeat the question, please?

>> All right. So how are we thinking about X_0 and X_p? Of course there are many ways of thinking about cell state. In our research we focus a lot on transcriptomics, largely because we have a lot of data at the single-cell level for transcriptomics. But of course there's also a lot of exciting data in imaging, proteomics, and cell viability, which can be explored in the future. In particular, we looked at a few data sets that were very interesting: as I mentioned, scBaseCount, which has over 500 million cells' worth of data from 70 different tissues; data from the Tahoe-100M data set, which had about 50 different cell lines; and of course the CELLxGENE corpus, which also has data from various tissues and species.

All right. In terms of perturbing cell state, we looked primarily at genetic perturbations. In particular, we were very interested in using Perturb-seq data: large cell numbers, large numbers of perturbations, and fairly easy to scale. So this was a very relevant data type for teaching the model some causal structure about what happens under different interventions and how the transcriptomic profile is impacted.

All right. So what is the machine learning task more specifically? Say you have a given cellular context. You have a certain number of perturbations you've already tested, and you know what the cell state is in response to those interventions. You're then interested in predicting what's going to happen for a new, unseen perturbation; this is the perturbation generalization setting, and it has been studied quite extensively in the past. What we were more interested in is a slightly different variant of this task: not just looking across perturbations, but looking across contexts. For us this was a bit more tractable, and also something that can be quite biologically useful. For instance, say you have T cell populations from 50 different donors, and you know how those T cell populations react to various perturbations, how their state or activity varies. Then you get data from a new patient, and we want to build a model that can predict, for this new patient, given the transcriptomic information or other variables we have, what would happen under the same set of perturbations. That's the setting we thought would be more useful, and it is in fact more tractable from a machine learning perspective, so that's what we focus on a lot more in our model.

All right. Again, this is not new; it has also been studied in the past. A lot of existing models take one cell at a time from each of these two distributions, the input cell distribution and the output distribution, and try to learn a model that predicts the perturbation effect conditioned on the specific donor, or the specific cell type, however you're defining context. But what we found is that many of these models do not perform very well and often fail to outperform simple baselines. There have been a lot of papers within the past year or two evaluating existing models, and they found that these models often fail when compared to linear baselines or to simply predicting the average effect. We believe this is largely driven by the fact that existing models are not meaningfully accounting for the strong heterogeneity that exists within these populations. We think it's important to model the heterogeneity both within a specific cellular population and across different experiments. So in our model we focus on modeling those effects specifically, and on allowing the model to learn, and really disentangle, the true perturbation effect, and we found that this helps us quite a bit in performing better than existing approaches.

What I mean by that is: if you have an unperturbed population and a perturbed population, you can model the perturbed state as being composed of three different factors. You have the true effect, that is, the effect actually caused by the perturbation. Then you have the heterogeneity within the basal population: say you have cells in different phases of the cell cycle, each of which might respond slightly differently to the same perturbation, so we want to model this unannotated variation within the source population. And finally there is the noise between different experiments: if you're combining data from various perturbation or single-cell experiments, you might have different sequencing depths, or the experiments might have been performed under different conditions, so how do we model that effect independently as well? We create a model that accounts for all three factors: the heterogeneity in the source distribution, the true perturbation effect, and the technical noise across experiments.
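Written out loosely, the decomposition described above looks something like this (an illustrative sketch; the exact parameterization used in the model is not spelled out in the talk):

    % Delta_p: true perturbation effect, eta: within-population heterogeneity, epsilon: technical noise (assumed symbols)
    X_p \;\approx\; X_0 \;+\; \Delta_p \;+\; \eta \;+\; \epsilon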

>> Yes.

>> Sorry, is this a qualitative equation or a technical equation?

>> It's still fairly qualitative but we do map it back to specific parts of the model in the next few slides.

>> Yeah, I'll wait.

>> Okay. Yeah, I'm happy to answer questions after that. Yeah.

So what we did, to resolve the within-data-set heterogeneity, the first two terms in this equation, was to train a transformer over sets of cells, as opposed to individual cells. By looking not at one cell at a time but at large samples of cells from the distribution, the model intrinsically begins to learn the variation between cells within the population and how to model it. The way we do that is to first stratify the input population by known covariates: if we know what cell types are in the population, what perturbations are applied, and what experimental batches are present, we stratify the population by each of those covariates. Then we pair up the unperturbed and perturbed populations with matching covariates and use a maximum mean discrepancy loss, which is a distributional loss, to align the predicted perturbation effect with the true perturbation effect. This doesn't force the model to learn a one-to-one map between cells; it lets the model leverage the full distribution of cells to learn an effect, and at the same time it begins to capture some of the underlying heterogeneity within the basal cell population.
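For illustration, a minimal version of that distributional loss might look like the following. This is a generic RBF-kernel MMD sketch, not the authors' implementation; the kernel choice, bandwidth, and dimensions are assumptions.

    import torch

    def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
        """Squared maximum mean discrepancy between two sets of cells.
        x: predicted perturbed cells, shape (n, d); y: observed perturbed cells, shape (m, d)."""
        def kernel(a, b):
            d2 = torch.cdist(a, b) ** 2                     # pairwise squared distances
            return torch.exp(-d2 / (2 * bandwidth ** 2))    # RBF kernel (single bandwidth is an assumption)
        return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

    # Toy usage for one covariate-matched pair of cell sets (same cell type, perturbation, batch):
    pred = torch.randn(256, 512)   # predicted perturbed set: 256 cells x 512-dim representation
    obs = torch.randn(256, 512)    # observed perturbed set with matching covariates
    loss = rbf_mmd2(pred, obs)     # quantity to minimize during training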

More implementationally, we developed this transformer, as I mentioned, where every token within the input sequence is an individual cell, as opposed to individual genes, which is what you'll see in a lot of existing models, and you predict how each of those tokens varies following the transformations within the model. This gives you a sense of what the predicted output expression would be for that distribution of cells.
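A minimal sketch of the cell-as-token idea (the dimensions, the MLP encoder, and the perturbation embedding are illustrative assumptions, not the published State architecture):

    import torch
    import torch.nn as nn

    class CellSetTransformer(nn.Module):
        """Each token is one cell; self-attention runs across the cells in a set."""
        def __init__(self, n_genes: int = 2000, d_model: int = 512, n_layers: int = 4, n_perts: int = 5000):
            super().__init__()
            # Project each cell's highly-variable-gene expression vector to a token embedding.
            self.cell_encoder = nn.Sequential(nn.Linear(n_genes, d_model), nn.GELU(), nn.Linear(d_model, d_model))
            self.pert_embed = nn.Embedding(n_perts, d_model)    # one learned embedding per perturbation
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.decoder = nn.Linear(d_model, n_genes)           # map each token back to predicted expression

        def forward(self, cells: torch.Tensor, pert_id: torch.Tensor) -> torch.Tensor:
            # cells: (batch, set_size, n_genes) unperturbed expression; pert_id: (batch,)
            tokens = self.cell_encoder(cells) + self.pert_embed(pert_id)[:, None, :]
            return self.decoder(self.transformer(tokens))        # predicted perturbed expression per cell

    model = CellSetTransformer()
    x0 = torch.randn(2, 256, 2000)                # two sets of 256 unperturbed cells each
    x_pred = model(x0, torch.tensor([3, 17]))     # predicted perturbed sets, same shape as x0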

All right. Then, for the epsilon term, we wanted to be able to model different data sets. When we're training our model on multiple data sets, we make use of an embedding model. I won't go into the details here, but essentially we optimize this embedding to focus on true perturbation effects and to reduce the effect of noise between different experiments, and this allows us to train more effectively across different settings.

So what does this look like? Here you can see how the model's performance varies as you increase the cell set size. On the y-axis is the validation loss, which tells you how effectively the model can predict the perturbation effect you're interested in. On the x-axis is the amount of compute used, that is, how many operations have been performed during training. As you increase the cell set size, as the green gets darker, the validation loss keeps decreasing, up until a set size of around 256, at which point it starts to plateau; beyond that the model doesn't get much better for this specific data set. But the core idea is that even though you might see the same number of cells and perform the same amount of computation, if you're able to see multiple cells at the same time during a forward pass, the model has a better understanding of the underlying distribution and more quickly learns to predict the perturbation effect accurately.

>> All right, any questions?

>> Um, so you answered my other question...
>> Yes.
>> ...with the slides already. But I have a new question.

>> Okay, sorry. So am I right in thinking that what you're doing here is that you're not changing the input and output of the model as compared to some of the prior approaches? You're using this stratification and maximum mean discrepancy to restructure the loss, so that the model gets a better learning signal without making the model take in a whole population of a certain size as input. Is that correct?

>> Yes, exactly. A lot of existing transformer-based approaches in this space take individual cells one at a time and try to model effects, but what we're doing is giving the model visibility into a large set of cells, almost like a large sample of the population. Yeah.

>> But the model can still operate... like, at inference time the model could still take a single cell as input?

>> Yes it could.

>> Okay. Yeah.
>> It could, yes. Yeah.

>> So in one of the previous slides you said that you treat your cells as tokens, right?
>> Yes.
>> Do you feed in a vector of the entire RNA expression profile, or do you select certain RNA expression features, and if so, how do you do that?

>> Yeah, that's a great question. We have two versions of our model: one that works directly on gene expression, and one that works in the embedding space I mentioned earlier. In the version that works on the embedding space, we embed the RNA profile of the cell into a lower-dimensional embedding, I think 256- or 512-dimensional, and that is fed in as the token. In the gene expression case, at the moment we take the 2,000 most highly variable genes and pass them through an MLP that creates a lower-dimensional representation, which is then fed into the model as a token. So the token doesn't have gene-level dimensions, but you can map back to gene level if you really want to. Any other questions? Okay, I'll keep moving.

We could also ask: what if you did something simpler, without using self-attention? Say you just fed the average through the transformer, so you used the pseudobulk expression as input; that's what you see as the solid gray line. It does quite well, but it's still not able to outperform the model that sees a large set size as input. You could also provide the same cells as input but without any self-attention, so without making use of the transformer's attention mechanism, and that doesn't do as well either. So really, both the cell set component and the self-attention help the model learn aspects of the underlying cells' heterogeneity and of the perturbation effect that are otherwise lost when you remove those components from the model.

Yes.

>> A set of pseudobulks versus...
>> Oh, you mean like randomly sampled pseudobulks?
>> For example, instead of each of your single cells.
>> Yeah.

>> So, we've tried that in a different model. We haven't tried it here. I think the challenge with doing that is that it changes the distribution that the model is used to looking at, and then you're restricted to always predicting pseudobulks. For some people and some use cases, maybe that's fine. We haven't tried it in this setting; it might actually work. But yeah.

All right. Uh, can I take it in a little? Okay. All right. So, if you look at the attention heads, it's quite interesting. I mean, it's not very informative to put too much weight on attention heads, because you can pull out any story from an attention head, but what's nice is that you do see different patterns. Some of the attention heads are just flat vertical lines, which shows that in those cases the model is learning a more pseudobulk-like response across different cells, meaning it doesn't vary; and then there are other examples where there is variation across different cells, so it does learn more cell-specific responses. We also compared our approach theoretically to optimal transport, because what we're learning is very similar to what optimal transport does, which is to learn a mapping function that minimizes a cost, where you predefine what the cost is. Except in our model we're not explicitly defining the cost, so we have fewer assumptions, and we're technically easier to scale, at least in our current implementation. And we found that, theoretically, within our space of solutions we are also able to learn optimal transport, which is quite exciting.

All right. We then spent a lot of time thinking about model evaluation. Another challenge with a lot of the existing perturbation models is that they haven't really been evaluated across a range of metrics; many of them focus only on Pearson correlation. But it's also very interesting to understand how the model is doing on differential expression, and how it is doing at detecting the things that biologists care about. So as a first step, we developed an evaluation procedure that we thought resembled a common setting within experimental discovery.

First, we looked at three different data sets: chemical perturbations, signaling perturbations, and genetic perturbations. We set up the evaluation so that the model has access to a number of different cell contexts, or cell lines, at training time, and it has to predict the effect in a held-out cell line at test time. To make the task a bit more tractable, we allowed the model to see 30% of the data from that held-out context at training time. And this is actually not uncommon in academia or industry: very often you will run a perturbation screen very deeply in specific cell lines and a more shallow screen in other cell lines, so this recapitulates that same setting. The goal of the model is then to predict what would happen in the 70% held-out test set.

In addition to that, we incorporate baselines to make sure the model isn't just learning to average, or to copy over effects it has seen at training time. We have two specific baselines that are, in our opinion, the strongest. The first is the perturbation mean baseline: for the training contexts, you simply take the average of the effect and paste it over for the test context, so this is just learning to average what you've seen at training time and pasting that onto the test set. Or you can learn from the 30% of the held-out context that you've seen during training and copy that over; we call this the context mean baseline.
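A rough sketch of those two baselines as I understand them (pandas-based, with hypothetical column names; the actual evaluation code may differ):

    import pandas as pd

    # Hypothetical long-format table: one row per (context, perturbation, gene),
    # with 'effect' holding, e.g., the mean expression change for that gene.

    def perturbation_mean_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
        """Average each perturbation's effect over all training contexts,
        then paste that average into the held-out test context."""
        avg = (train.groupby(["perturbation", "gene"], as_index=False)["effect"].mean()
                    .rename(columns={"effect": "pred"}))
        return test.merge(avg, on=["perturbation", "gene"], how="left")["pred"]

    def context_mean_baseline(seen_30pct: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
        """Average the effects observed in the 30% of the held-out context seen at
        training time, and copy that over to the remaining 70%."""
        avg = seen_30pct.groupby("gene")["effect"].mean()
        return test["gene"].map(avg)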

We also developed a comprehensive evaluation suite, and it is now being used by over a thousand teams from across the world as part of the Virtual Cell Challenge, so we're really excited to see how well it has been holding up to a lot of testing by a lot of different users. The goal for this evaluation suite was to measure how well State recapitulates a real Perturb-seq experiment. In our mind, when you run a Perturb-seq experiment, the core things you look at are expression counts, differential expression, and overall effect sizes, so we designed metrics for each of these. They are to a large extent independent of each other, though of course there's some overlap in what they measure, and together they give you a more interpretable understanding of how the model is working. We have a number of different metrics here; rather than listing them, I'll walk through some of them in more detail. First, let's talk about the expression-based metrics. One metric in particular that's been very effective for us in determining how well the model is working is something we call the perturbation discrimination score.

The idea here is: how well does the model learn to discriminate between the effects of different perturbations? For instance, say you have a prediction your model makes, shown here in red, and say this is the ground truth expression based on the ground truth data. You want to know how similar your prediction is to this ground truth, as compared to the other perturbations' data points. The blue points correspond to other true perturbation states, the green is the real state for that specific perturbation, and the red is our prediction of what that state would be. What you want is for the red and the green to be as close as possible. So we use a rank-based metric to measure how similar they are: we number these perturbations by how similar they are to the prediction, so the most similar gets a rank of zero, the next a rank of one, then two, and so forth. Your score is then essentially the rank that the correct ground truth value received, divided by the total number of comparisons you make. The ideal model would always be closest to the ground truth for every perturbation, as compared to the other perturbations in your test set.
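A small sketch of that rank-based score (the distance measure and normalization are assumptions; the challenge's official metric may differ in detail):

    import numpy as np

    def perturbation_discrimination(pred: np.ndarray, true: np.ndarray, others: np.ndarray) -> float:
        """Rank of the correct ground-truth profile among all candidates, normalized to [0, 1].

        pred:   predicted expression profile for one perturbation, shape (n_genes,)
        true:   observed profile for that same perturbation, shape (n_genes,)
        others: observed profiles for the other perturbations, shape (n_other, n_genes)
        Lower is better; 0 means the prediction is closest to the correct profile.
        """
        candidates = np.vstack([true, others])                 # row 0 is the correct answer
        dists = np.linalg.norm(candidates - pred, axis=1)      # distance from prediction to each candidate
        rank = np.argsort(dists).tolist().index(0)             # where the correct profile lands
        return rank / (len(candidates) - 1)

    # Toy usage:
    rng = np.random.default_rng(0)
    score = perturbation_discrimination(rng.normal(size=50), rng.normal(size=50), rng.normal(size=(10, 50)))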

Any questions? Okay. This turned out to be a very informative metric, and we applied it to a couple of different data sets. In blue here is the prediction made by State, and we compared it to the three baselines I mentioned earlier: the perturbation mean model, the context mean, and a linear model. None of them perform very well, but they still perform better than the existing deep learning baselines, including scVI, which is a variational autoencoder, CPA, which is also autoencoder-based, and scGPT, which is a transformer-based foundation model.

All right. We then moved on to looking at differential expression. For those of you who haven't worked in transcriptomics, differential expression is the bread and butter of working with these data sets. The goal is to look at each individual gene and identify where we see a significant change in expression following a specific condition, such as treatment with a drug, or, in our case, a genetic perturbation or other intervention. We looked at three different aspects of differential expression. Whenever people talk about differential expression, they'll usually show you a volcano plot: every dot is a specific gene, the x-axis is the log fold change of that gene following perturbation, and the y-axis is the significance of that change in expression. First we wanted to see how well we do at detecting the significance of the change in expression. On the y-axis here is the predicted significance from a given model, in this case State or the other baselines, and on the x-axis is the true significance; you can see the concordance is pretty good for State compared to the other models. You can also look at this across different perturbations, and if you draw a precision-recall curve, again State does pretty well compared to the same baselines. We also wanted to look at log fold change: if a gene is going to go up by a twofold change, do State or other predictive models also predict a twofold change, or do they predict something else? Here again we see that State does quite well at capturing the true log fold change compared to the other baselines.
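For reference, the quantities being compared here can be computed along these lines (a generic sketch using a Wilcoxon rank-sum test; the talk does not specify the exact differential expression procedure used):

    import numpy as np
    from scipy import stats

    def differential_expression(ctrl: np.ndarray, pert: np.ndarray):
        """Per-gene log fold change and significance between control and perturbed cells.
        ctrl, pert: expression matrices of shape (n_cells, n_genes)."""
        eps = 1e-9
        log_fc = np.log2(pert.mean(axis=0) + eps) - np.log2(ctrl.mean(axis=0) + eps)
        pvals = np.array([
            stats.mannwhitneyu(pert[:, g], ctrl[:, g], alternative="two-sided").pvalue
            for g in range(ctrl.shape[1])
        ])
        return log_fc, -np.log10(pvals)    # the x- and y-axes of the volcano plot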

All right, here are some more examples; I can skip this. So how do we use this in practice? I've shown you that State does a good job of detecting these cell-type-specific effects, but how do we actually leverage this for real discovery? We looked at one specific drug for melanoma called trametinib. Over here you can see the performance of State compared to two baselines, again the context mean and the perturbation mean, and one of these dots corresponds to the perturbation caused by trametinib. What we did was look at a specific cell line, C32, which is a BRAF-mutant cell line, and trametinib is specifically designed for BRAF-mutant melanoma. We wanted to see whether the model is actually able to detect gene expression changes specific to this BRAF-mutant cell line.

So how do you read this plot? Here we have five different cell lines that were held out in our test set. Each column corresponds to an individual gene. The green corresponds to the ground truth: these are all the genes that are differentially expressed in the ground truth for one cell line, and you have this for each of the five cell lines. And then here are the predictions made by State. You can see that State is able to detect a lot of these differentially expressed genes across different cell types. And the nice thing is it also detects genes that are agnostic to cell type: for instance, some of these genes show up in all five cell types, and State is able to detect them across the various different contexts.

>> Yes.

>> Uh, you were asked if you could speak a little bit slower, so that it can be transcribed.
>> Okay, sure, no problem. All right, I'll try to talk a bit slower.

Yeah. So this was great, because what we're seeing is that for this FDA-approved drug for BRAF-mutant melanoma, the model is actually detecting effects in BRAF-mutant cell lines that are specific to those cell lines. This goes back to our initial motivation: if we do build these models and they really are donor-specific, we could hopefully leverage them for designing more personalized treatments and personalized therapeutics. Here are some more examples of the model's predictive use. Here we're looking at State's predicted values for cell survival, compared to the actual values, and again the performance of our model is better than if you were to use different kinds of mean baselines.

All right, I'll pause there. Any questions before I move on?

Okay, cool. All right, I'll just quickly go through this as well. Really, the vision for the virtual cell is not just to be able to predict effects within the specific data set or experiment you're performing, but to be able to leverage these models across different contexts and different settings. What we were excited about is whether we can take models like State, bring to them brand-new cell types, cell states, or donors, and predict what the effect of a perturbation or intervention would be. So we built our own foundation model as part of State, to learn this general-purpose representation across different contexts, and we compared it to existing foundation models like UCE, scGPT, and Transcriptformer. We found that our model was very effective at separating between different perturbations. On the y-axis here is the model's ability to separate between perturbations in the true data, and on the x-axis you can see that if you actually use the embedding downstream as part of the State model, you get better performance with the State embedding than with the other embeddings. That was exciting for us.

Then we applied this downstream; I'll skip this slide. We applied it in a setting where we pre-trained our State model on the Tahoe data set, which had 50 different contexts and over 1,000 perturbations. So we pre-trained on that large chemical perturbation data set and then fine-tuned on various downstream data sets, including some genetic perturbation data sets. We found that, using our embedding model and the State architecture overall, we were able to make predictions in the zero-shot setting that were quite effective for certain metrics, though not for all. In particular, we were very good at predicting the overall effect size of a perturbation; we didn't do as well at predicting differential expression, which is understandable because it's a much harder task. But it's very encouraging to see that, using these models that account for heterogeneity across data sets and within cell populations, we're slowly able to piece together the signal from across these different data sets and build general-purpose perturbation models that can be applied to different settings.

All right. The next thing we wanted to focus on is how we bring this discussion around virtual cells and predicting perturbation effects to the community: how do we bring community input into this, and how do we have a larger discussion around the best way to evaluate these models, what the right task is, and how we should best be generating data? In this spirit, we launched the Virtual Cell Challenge a few months ago. The idea behind the Virtual Cell Challenge was: if the community is really excited about the potential for artificial intelligence to impact our ability to model cell behavior and understand cell state, how do we formalize this task in the same way that CASP did for the protein structure prediction problem? We wanted to start that conversation with the community, so we proposed our formulation of the task, and now we're getting a lot of feedback from the community on how they think this task can be adapted and improved.

Again, the task we focused on is the context generalization task in perturbation prediction. We also developed a purpose-built, very deeply sequenced single-cell Perturb-seq data set just for this challenge, and it's been very well received by the community; many people have told us it is possibly one of the highest-quality perturbation data sets they've used. So we're really excited about taking this forward, and, similar to how AlphaFold used CASP to improve over the years and become a really powerful model, we want to use the Virtual Cell Challenge as a launching pad for really strong models that begin to do very well on this task.

The competition has been live for a while now, and we're coming close to the final end date, which I think is about a month away. We've had about 3,000 registered participants, and over a thousand teams have made submissions already; I think the latest number from this morning was ,36. So it's been great to see the involvement from the community, and great to receive feedback on what's working and how we can improve this and really steer it toward the most impact, scientifically and more broadly.

I also just want to make a small point that the data generation for this challenge was a very long and involved process. We wanted to be very strategic about the perturbations we selected, and we also wanted to sequence very deeply and have very high cell coverage. The way we did that was to first run a low-depth screen over 2,500 perturbations in human stem cells. We then looked at the Perturb-seq data for that cell line, grouped different perturbations by their effect sizes, clustered them by phenotypic diversity, and carefully picked perturbations that had maximal overlap with existing data sets, so that different teams could train on existing data. We ended up with a very large, high-quality data set of 300 perturbations, each with over a thousand cells per perturbation. For context, most existing data sets have between 30 and maybe at most 100 cells per perturbation, so this is almost an order of magnitude more. We also sequenced it very deeply, at about 50,000 reads per cell, so you get very precise gene-level expression. So it's been great, and we're excited to see the outcome, how this impacts the field, and how it accelerates development toward better models for predicting cell behavior.

All right, I'll pause. Any questions so far? Are we good?

Okay, cool. In the spirit of clean, large data sets, we also wanted to think more deeply about how we can leverage AI not just for developing and evaluating models, but also for better curating public data repositories and cleaning them up. In this spirit we developed what we call scBaseCount. The motivation here was that almost every paper that is published using single-cell data has to submit its raw sequencing reads to SRA, so SRA has an enormous amount of data, but it's all raw reads from the sequencer. We wanted to see whether we could go directly into SRA, reprocess all the single-cell data in there, and essentially create our own very large single-cell data set. We estimated that there must be over 500 million cells' worth of 10x data in SRA, while the largest existing repository of single-cell data is at about 120 million cells. The challenge, of course, is that the metadata is a big mess.

It's very difficult to process manually, and that's one of the main reasons nobody has done this. So what we did was create an agent to do this process for us. We designed a hierarchical agent workflow to look through every 10x record within SRA, process the metadata, and then feed it into a recounting pipeline, which does all the sequence alignment, counts gene expression, and gives us a final cell-by-gene count matrix for every data set; the goal was to enable better development of AI models downstream. Here's the overall pipeline. We have SRAgent, which automatically keeps looking for data sets in SRA. It finds them, and then other agents assign the different metadata components to each data set. There are a number of different agents here; I'm skimming over this, but it's a hierarchical workflow. When a data set is ready with its associated metadata, it's handed over to our Nextflow pipeline, which performs the recounting, aligning the reads and producing the final gene expression count matrix.

One of the benefits of doing this, of course, is that we uniformly process all the data using the exact same read alignment and the exact same gene sets. So it's a really nice, large data set with the same set of genes and expression counts throughout, which makes it very easy to work with. You can also read out not just protein-coding genes but also non-coding genes, and look at intronic as well as exonic expression, so it's a really valuable data set.

To give a sense of scale, scBaseCount is currently at over 500 million cells, over four times as large as the largest single-cell data repository, CELLxGENE. It's also the most diverse: we have over 27 species, including both plant and animal species. The largest, of course, are human and mouse, with about 300 million human cells and about 150 million mouse cells. It is also the largest repository that is uniformly processed, so we actually see reduced batch effects, and we're also the largest repository of non-coding gene expression, so you can see how these non-coding genes vary across different cellular contexts. And it's also, to our knowledge, the largest biological data repository that has been curated by an AI agent. The fact that it's curated by an agent gives us a lot of power to keep growing the repository, and in fact it does keep growing: we just send the agent off to look through SRA again, identify new data sets, and process them, so it's a continually expanding repository of data. On the left, what you see are the various cell types and tissues represented within the data set. We see an overabundance of T cells and immune cells, but other cell types are also represented, such as epithelial and endothelial cells. We've done a lot of analysis, and we have an updated preprint that should be out soon as well.

We also did some downstream analysis: what is the benefit of having a large, uniformly processed data repository? We see that batch effects are significantly lower. If you look at how much of the signal within the data set can be explained by technical factors, such as sample ID, single-cell versus single-nuclei protocol, or library prep chemistry, in each of these cases the signal explained is lower for scBaseCount than for something like CELLxGENE. And for more biologically meaningful variables, such as tissue, more of the signal can be explained in scBaseCount than in CELLxGENE. So these are all positive indicators that the uniform processing helped us get cleaner data downstream.
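As a rough illustration of what "signal explained by a covariate" can mean here (this is one generic way to estimate it, not necessarily the analysis used in the preprint):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    def variance_explained_by(expr: np.ndarray, covariate: np.ndarray, n_pcs: int = 50) -> float:
        """Fraction of variance in a PCA embedding of the expression matrix that a single
        covariate (e.g. sample ID, library chemistry, or tissue) can linearly explain."""
        pcs = PCA(n_components=n_pcs).fit_transform(expr)                   # cells x PCs
        onehot = pd.get_dummies(pd.Series(covariate)).to_numpy(dtype=float) # one column per covariate level
        pred = LinearRegression().fit(onehot, pcs).predict(onehot)          # covariate -> PCs
        ss_res = ((pcs - pred) ** 2).sum()
        ss_tot = ((pcs - pcs.mean(axis=0)) ** 2).sum()
        return 1.0 - ss_res / ss_tot                                        # overall R^2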

we able to distinguish between different phenotypes including cell type disease perturbation. Um and we found that in

perturbation. Um and we found that in all of those cases uh base count was able to do quite well. The differences

were less stark in the case of more like uh more easy to separate phenotypes like cell type. But then when we go down to

cell type. But then when we go down to the de the level of differentially expressed genes, we found that um using base count gave us a significant edge in being able to detect variation uh across

different um uh across different cell cell types and cell states.
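
A common way to run this kind of evaluation, sketched below as an assumption rather than the exact protocol used here, is to train a simple linear probe on frozen cell embeddings and compare how well each training corpus separates the phenotype labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def phenotype_probe_score(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-validated accuracy of a linear probe on frozen cell embeddings."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, embeddings, labels, cv=5).mean()

# Hypothetical comparison: embeddings from a model trained on each corpus,
# evaluated on the same held-out cells and phenotype labels.
# score_basecount = phenotype_probe_score(emb_basecount, cell_type_labels)
# score_other     = phenotype_probe_score(emb_other, cell_type_labels)
```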

All right, I'll pause here again if there are any questions. Yes.

>> About a variety of different data.

>> Do you have any understanding of how generalizable results are across cell types or does it kind of depend on the problem?

>> Um, you mean for our data curation effort or for the model that I showed earlier?

And thinking about scaling these experiments to different cell types, yeah.

>> Yeah.

>> Yeah, that's a really good question. I had some results at the end showing that there is some meaningful improvement, but I would say these models are still far away from a point where you can just point at a cell type or a specific tissue and say what's going to happen there, what the effect of this perturbation is going to be. Right now, at least within our model, we find that if you have some amount of experimental data for that condition and that cell type, it tends to do much better; it tends to learn the underlying regulatory structure a bit better. So the short answer is I don't think we're all the way there yet, but it seems like a good sign that we've made real progress. Also, at Arc we think that if we standardize the data generation process, we can probably make a lot more progress. One of the big issues with these perturbation experiments is that everyone does them slightly differently, so what a knockdown of a specific gene means in my data set might be very different from what it means in yours, and that makes prediction very hard for the model; that's another reason. But yeah, I think there's a lot of work that needs to be done. Yes.

>> So in the slide where you showed the distribution of cell types in your data set, you showed a very big concentration of T cells. This caught my attention because historically, immune cells often have, as you were mentioning earlier, donor-specific behavior, whether from HLA alleles or any number of other reasons. Does your model learn to analyze the effects per donor, or is it learning across all donors?

>> Yeah, that's a great question. In our model as it stands, we haven't dived into this too much, but we have another model we've been working on that focuses specifically on this. And in fact, we find that we have pretty good generalization for T cells, I think precisely because there's such an overabundance of T cells in the scBaseCount data set, so that definitely helps. A lot of studies have focused on cancer, on blood, on immune cells, so it doesn't surprise us that immune cells and T cells are so overrepresented, and for the models we're training on this data, we do find that generalization is better when you're working with immune cells and blood cells. So it is definitely having an impact. Yeah.

Okay, I have about five minutes, so I won't go into too much detail on the next one, but I'll do maybe two slides. The last thing we wanted to look at was: can we use these agents to actually tell us what experiment to do? So far we've been using them somewhat passively: here's the data, can you clean it up, can you help us build this large standardized repository. Now we wanted to see whether we can have the agent design a lab experiment, take the readout from that experiment, update its beliefs, perform another one, and do this iteratively. We did this again within the context of knocking down genes and measuring phenotype.

Here the goal for our model was: can we identify strong responders? This is another screen in T cells, and we wanted to identify perturbations that lead to an overactivation of a specific pathway. What we do is the model makes its first guess, saying you should try perturbing these 30 or these 60 different genes. That experiment is performed, and you get a hit ratio, which tells you how many of the predictions you made actually caused a strong effect. This information is fed back to the agent, the agent processes it and makes another prediction, and that results in another readout. We keep performing this over multiple steps.

And we see that over time, the model starts to pick more and more interesting hits, or interesting perturbations. We compared these predictions to simpler approaches, like Bayesian optimization or random baselines, and we found that our model was doing quite well across different data sets. Again, a lot of this was focused on T cells and immune cells, because they were overrepresented in the data we used for evaluating the model. But it's very encouraging to see that the model is doing something better than essentially picking genes at random, and even better than a Bayesian optimization approach.
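
As a sketch of the loop just described, and not the exact system from the talk, the structure is a simple propose-measure-update cycle: an LLM-backed agent proposes a batch of genes, the lab (here a stubbed function) reports which proposals were hits, and that feedback is appended to the agent's context for the next round. The `propose_genes` and `run_screen` functions below are hypothetical placeholders.

```python
from typing import Callable

def iterative_design(
    propose_genes: Callable[[str, int], list[str]],  # e.g. an LLM-backed agent
    run_screen: Callable[[list[str]], set[str]],      # lab readout: which genes were hits
    n_rounds: int = 5,
    batch_size: int = 30,
) -> list[float]:
    """Closed-loop perturbation design: propose a batch, measure, feed results back."""
    context = "Goal: find knockdowns that over-activate the target pathway.\n"
    hit_ratios = []
    for round_idx in range(n_rounds):
        batch = propose_genes(context, batch_size)   # agent's next guess
        hits = run_screen(batch)                      # experiment is performed
        hit_ratio = len(hits) / len(batch)
        hit_ratios.append(hit_ratio)
        # The readout becomes part of the prompt for the next round.
        context += (
            f"Round {round_idx}: proposed {batch}, "
            f"hits {sorted(hits)}, hit ratio {hit_ratio:.2f}\n"
        )
    return hit_ratios
```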

When we do a deeper analysis, we actually find that these agents are able to make use of both the information you provide in the prompt and the experiments they see, which is very encouraging. For instance, here the black line corresponds to a random baseline. The purple line corresponds to not telling the agent what you're doing: it just sees every experiment without any context, so initially it starts off pretty weak, but later it starts to figure things out. If you instead tell the agent, "Here's the biology I'm studying, this is exactly what I want to do," it starts off strong and keeps that up. So this idea of giving the model both the background information on what you're doing and the experimental data gives the strongest results we've seen, which is very encouraging, and it shows that the model has the ability to reason over different pieces of information.
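
For concreteness, the two prompting conditions compared above might differ only in whether a background preamble is prepended to the experimental history, along these lines (the wording is hypothetical):

```python
# Hypothetical prompts for the two conditions compared above.
background = (
    "You are designing CRISPR knockdown experiments in T cells. "
    "The goal is to find genes whose knockdown over-activates a specific pathway."
)
history = "Round 0: proposed [GENE_A, GENE_B, ...]; hits: [GENE_B]; hit ratio 0.03."

prompt_with_context = background + "\n\n" + history + "\nPropose the next 30 genes."
prompt_without_context = history + "\nPropose the next 30 genes."
```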

All right. And of course, a lot of people at the frontier labs are very excited about the potential of using these agents to design experiments. In fact, the co-founder of Anthropic has said that we shouldn't think of AI as just a form of data analysis; instead, we should think of it as a virtual biologist that performs all the different actions that we do as biologists, and that should really accelerate our ability to discover new therapies and make progress in research.

All right, so this is the end of my talk. Thanks for listening. As a summary, I walked you through four different projects: I talked about State, the Virtual Cell Challenge, and how we're making use of agents for curating and generating new data. Thank you so much for your attention. I want to make sure I acknowledge everyone in my group, in particular the people with asterisks against their names, who were lead contributors to the State model, as well as other collaborators at Stanford. For any of the papers I presented today, you can follow the QR code to read more detail. I'm happy to take questions now, or you can email them later.

>> Oh, sorry. I didn't see you. Yeah. Yeah.

I wonder, and I need to read the paper to get a better sense of the details of how the State model works, but I recognize you're training in an optimal transport setting, and it also seems like there are similarities to viewing this as a metric learning problem, where the choice of stratification you're doing is akin to batch mining: identifying what things to populate your batch with in order to make the loss learn something meaningful and interesting, as opposed to something that's just noise. Is that a perspective that resonates at all, or something you've thought about, or am I totally off base here?

>> No, that's actually a really cool way to put it. I think when we align the covariates, we make it in some sense more meaningful for the model to learn variation, because we've already corrected for the obvious variation. So yes, that's an interesting perspective. We don't have any metric-learning-style losses, like a triplet-based loss, in the model at the moment, so maybe we're not directly making use of set-to-set variation as a kind of loss. But I'm sure it's learning some aspects of that, because we also tried an experiment where we artificially put in very heterogeneous sets of cells, say two different cell types, and we actually see a block pattern in the attention maps, so it's learning to detect that. So yeah, it's an interesting way to frame it. Yeah.
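
For reference, a metric-learning loss of the kind the questioner alludes to, which the speaker notes is not in the model, would look roughly like the standard triplet margin objective. A minimal PyTorch sketch, with hypothetical embedding tensors:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Pull cells from the same state together, push cells from a different state apart."""
    d_pos = F.pairwise_distance(anchor, positive)  # same condition / cell state
    d_neg = F.pairwise_distance(anchor, negative)  # different condition
    return F.relu(d_pos - d_neg + margin).mean()
```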

>> Thank you.

>> Yeah. Uh yes.

>> On the last slides, where you discussed using AI as a virtual biologist, you briefly went over how you broke the workflow down into multiple agents doing different stages. I'm curious how you approached building that out and deciding what tasks were going to be done where?

>> Okay.

>> I think you're talking about... you mean this thing?

>> Yeah.

>> Yeah, that's a great question. It's funny: this was led by Nick, who's really into agents; he was the first author. There's a really nice tweet, I think I have it in here as a hidden slide. It's this one. It's along the lines of: me using LLMs for personal fun projects: wow, this thing is such a genius, why do we even need humans anymore? But then when you try to deploy them in the real world: oh my god, why is this thing so unbelievably stupid?

And it's really exactly this: you start off thinking you can just feed this to GPT and it'll figure it out, but there are so many possible mistakes it can make, especially in how you parse the text and how you map specific elements of that text to known vocabularies and ontologies. Because of that, what we ended up doing was having multiple levels of agents that perform similar tasks but check each other.

Here, for instance, I think we have the full hierarchy: you have an overall supervisor that manages the lower-level agents, and you have these lower-level agents doing specific things. So you have a find-datasets agent whose only job is to find data sets, you have another agent that focuses on the esearch attributes, and every supervisor checks the output of the agents that report to it and makes sure they're not making a mistake. It's kind of funny but also cool how these little agents check each other, and the final response you get is much more validated. That's how the system keeps getting larger. If you're interested in looking at this in more detail, we used a package called LangGraph; if you've heard of LangChain, it's by the same people, I believe, but it's a graph representation of agent workflows. Yeah.
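
A minimal sketch of what such a supervisor-plus-worker graph can look like in LangGraph is below; this is an assumption about the general pattern, not the actual curation code, and the node names and routing logic are hypothetical.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END


class CurationState(TypedDict):
    query: str
    candidate_datasets: list[str]
    approved: bool


def find_datasets(state: CurationState) -> dict:
    # Placeholder: in a real system an LLM-backed tool would search SRA here.
    return {"candidate_datasets": ["SRX0000001", "SRX0000002"]}


def supervisor(state: CurationState) -> dict:
    # Placeholder check: the supervisor validates the worker's output.
    return {"approved": len(state["candidate_datasets"]) > 0}


def route(state: CurationState) -> str:
    # Send the worker back to try again until the supervisor approves its output.
    return END if state["approved"] else "find_datasets"


builder = StateGraph(CurationState)
builder.add_node("find_datasets", find_datasets)
builder.add_node("supervisor", supervisor)
builder.set_entry_point("find_datasets")
builder.add_edge("find_datasets", "supervisor")
builder.add_conditional_edges("supervisor", route)
graph = builder.compile()

result = graph.invoke(
    {"query": "10x scRNA-seq in human T cells", "candidate_datasets": [], "approved": False}
)
```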

>> Yeah.

Uh, you had a question first. Yes.

Sure.

I'm not sure the generalizability is actually that great, that it outperforms the other one.
>> Or you mean, like...
>> Going from your data sets to the external data, just using the model.

>> Yeah, we used highly variable genes (HVGs), and we matched the same gene sets. I should be clear that this is only looking at one metric, though. We had five or six different metrics, and on the others it was more mixed, because it is a difficult task if you go zero-shot. This particular metric, the effect size, is an aggregate effect across all genes, and that tends to be fairly robust and consistent across data sets. But if you look at individual genes, it's much harder, and there we found that the noise between different data sets would overwhelm the predictive performance.
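
One plausible reading of an "aggregate effect size across all genes", sketched here as an assumption rather than the paper's exact definition, is the overall magnitude of the perturbation-versus-control shift summarized over genes, which can then be compared against the equivalent predicted quantity:

```python
import numpy as np

def aggregate_effect_size(perturbed: np.ndarray, control: np.ndarray) -> float:
    """Overall perturbation effect: norm of the mean expression shift across genes."""
    delta = perturbed.mean(axis=0) - control.mean(axis=0)  # per-gene mean shift
    return float(np.linalg.norm(delta))

# Per-gene evaluation is harder: it compares predicted and observed shifts gene by
# gene, where dataset-to-dataset noise can dominate the signal.
# per_gene_corr = np.corrcoef(predicted_delta, observed_delta)[0, 1]
```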

Yeah. Yeah.

Yeah.

...the final results from the Virtual Cell Challenge. But I know that for AlphaFold, the Protein Data Bank had been around for decades, so it took decades to get to a model like AlphaFold 3. How long do you expect it to take to have enough data, or a sufficiently good model, to achieve accuracy as high as AlphaFold's, but for perturbation?

>> Yeah, that's a great question. I definitely think the big bottleneck right now is data and data quality, and because of that, I think it will be a longer timeline. If it were just a question of not having the right model, you could probably make the case for a shorter timeline, because it's easier to iterate on model architectures. But I think there's a lot that needs to be done on data. We saw that ourselves when we put out the Virtual Cell Challenge data set: it was just one cell line and around 300 perturbations or fewer, but it was already showing much stronger and more consistent signal than almost all the public data sets out there. I think a lot of labs are beginning to realize that we're not going to make meaningful progress in the single-cell space until we seriously clean up the data generation process and standardize the processing steps, and scBaseCount is an important step there. We're now collaborating with other academic labs to standardize the way the data itself is submitted to SRA, so there's a lot involved on the data end, but it's great to see the momentum in the community. At Arc we have a billion-cell project, where we want to generate one billion cells' worth of perturbation data within the next couple of years. I think the Chan Zuckerberg Initiative is also trying to generate a lot of cells. So I think that will really accelerate the pace of progress. So definitely a decade, probably more than a decade, I would say. Yeah.

Yeah. Yeah.

Uh, yes.
>> Question: are you working on isoform-level quantification for the SRA data sets?
>> Yeah, we do get that information; we do have isoform-level information. We've not done any deep analysis on it, but that's definitely one of the perks of our approach. Yeah.

>> A similar question: I think another important aspect of perturbation experiments is how you define the readout, how you define the task. Do you have any insight on how to standardize that?

>> Yeah.

Exhaustion or >> Yeah.

>> Yeah.

>> Yeah, that's a good question. Transcriptomics is easy in that sense, because you can pull out whatever you want from the gene expression. For us, because we're trying to be more unbiased, we're not focusing on any one specific readout, so at the moment gene expression has been perhaps the most expressive readout for us. If we had to go deeper, I would be interested in bringing in some proteomic panels and getting some of that information; maybe some amount of imaging would also be useful, though it might overlap with proteomics. But yes, it is possible to train your model specifically for a phenotype of interest, and you might see better power to predict effects there. Yeah. Yes.

So I guess a follow-up question, since it's the last chance: I'm very impressed with the data set you have so far, but you mentioned earlier that it's possible to curate data sets for proteomics, cell viability, and all these other modalities. What efforts are currently being made to incorporate those data modalities?

>> Yeah, there are definitely efforts on the imaging front that I've heard of, because imaging tends to be cheaper to generate even than transcriptomics, and pharma companies in particular have a lot of capabilities there. On the proteomic side, I think single-cell readouts tend to be more expensive, and there's more development that needs to happen on the assay and technology side of things, so I'm not really synced in with what's happening there. I don't know if it's happening at the same scale, but there is a lot of interest in adding proteomic panels. The vast majority of interest has been on the transcriptomic side, just because of the easier scalability. So I don't think there's anything at this scale, except maybe in imaging, at least to my knowledge.
