Better Data is All You Need — Ari Morcos, Datology
By Latent Space
Summary
Topics Covered
- Models Eat Data
- Inductive Biases Fade with Scale
- Self-Supervised Unlocked Massive Data
- Humans Can't Judge Data Value
- Curation Bends Scaling Laws
Full Transcript
[Music] Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.
>> Hello, hello. And we're so excited to be in the studio with Ari Morcos, CEO and co-founder of Datology. Welcome.
>> Thank you so much for having me.
>> Ari, so you first came across my radar... I mean, I guess Datology is a relatively exciting, or well-hyped, startup, at least with the fundraising and the high profile of the people that you hire. I reached out to book this interview after you worked on the Arcee... I don't even know how to pronounce it. Ar-kee?
>> Arcee. Yeah, it's inspired by a real Transformer that was called Arcee.
>> Yeah, the Arcee foundation models. And you guys have been doing a lot of data work. How would you describe Datology today?
>> Yeah. So our mission at Datology is to take on everything around the data side of machine learning, right? Going from "you have a bunch of data sitting in storage" to "you're going to feed it into a model via a data loader." There are a ton of choices you make in that process, ranging from how you're going to filter the data, how you're going to sequence the data, what synthetic data you're going to generate, if any, and how you're going to batch the data. All of those things have a tremendous impact on the performance of the model that you train on that data. One of my favorite catchphrases is: models are what they eat. If you show them great data, they're going to be really high quality. If you show them low-quality data, they're going to be low quality. But this is a frontier research problem. How do you actually do this effectively? How do you do it automatically, at scale? It has to be automatic to be able to process trillions of tokens, billions of images, things like that. And that's our mission at Datology: take that whole process and make it really easy, so that anybody can get access to state-of-the-art data curation without needing to be an expert themselves. And in doing so, help the folks we work with train models much faster to much better performance, and also help them train much smaller models to the same or better performance, which I actually think is some of the most exciting stuff going forward. But fundamentally, that's what we do at Datology: help people curate their data so they can train models faster, better, smaller.
>> So the key words for that: data curation as a service, data efficiency, all those terms. In the pre-chat before we started recording, you mentioned that there's a cool story around how you got into data in the first place, right? You were at GDM, you were at Meta as a research scientist. Describe how that became an interest.
>> My PhD is actually in neuroscience, so I come much more from an empirical science background. I actually spent time trying to teach mice how to count, and then analyzing the activity of thousands of neurons in the brain while the mice did count, trying to understand how that actually happened: what were the neural dynamics that enabled it? And that's actually how I initially got into machine learning, as a means to analyze my neural datasets. I also started my PhD in 2011, so AlexNet came right after that, Atari DQN right after that. Lots of evidence that AI was going to be very, very exciting, which led to me transitioning. But because I had this somewhat different background, being trained as an empirical scientist rather than as a computer scientist, my first real mission when I joined AI was to try to build more of a science of deep learning. Something that I think is still true today in many cases is that deep learning is an empirical science. Most people with computer science backgrounds were trained more in the context of a branch of theory, where everything was provable; the initial pushback to deep learning was actually that you couldn't prove anything in it. But deep learning is at its core an empirical science. We have to run large experiments. We understand the rules for how we design these systems, but the properties that come out of them when we actually train them on tons of data are emergent and unexpected. So I always really wanted to write papers that had two halves: the first half of the paper trying to understand why a representation is desirable or undesirable, why a model is good or bad, and then the second half using that understanding to improve the model. That was always my goal; that was the perfect paper. Rather than just throwing spaghetti against the wall and seeing what stuck, we would really understand why something didn't work and then use that understanding to improve it.
Unfortunately, it turns out that it's not so difficult to do the first half of that, trying to understand the system, but really, really difficult to actually use that understanding to improve the system. A lot of times what would happen is you'd find, hey, here's this property of representations that makes models good. You'd go and optimize for that, and then it turns out it wasn't a causal variable, it was a correlate, and it doesn't actually work. So I maybe wrote 30 papers where we did that first half, and maybe only three or four where we did the second half. And that was always kind of frustrating and dissatisfying to me. Then around 2020, I had several papers that all kind of slapped me in the face at the same time with the same insight, which is that all that really matters is the data. And I had come into all three of these papers very much focused on inductive biases: how do we put better inductive biases into models, either through changing the objective or through changing the architecture? Which is where most of the field was, and still is; a lot of the papers at the big conferences are about architectures and various tweaks to architectures. But I had these multiple papers, all of which made the clear takeaway that the data is the only thing that mattered.
I'll give you one example. There's a paper we had called ConViT, where the idea was to take a vision transformer and initialize it as if it were a convolutional neural network. That way you could start with the inductive bias of convolution, but the model could choose to unlearn it if it wanted to. So the idea was that it was a soft inductive bias, not a hard inductive bias. ConvNets have a hard inductive bias: you can't not be convolutional in a convnet. But in this case, you initialize the transformer that way, and then, if it wants, the model can learn not to be that. The idea was that this would be really helpful for models: give them this inductive bias, but let them learn not to use it if they don't want to.
>> Just to follow up, there's a one-to-one mapping of a convnet to a transformer, and you can map it directly onto the weights?
>> Exactly. You can map it exactly correctly. It turns out, say you have a 3x3 kernel: you can have nine heads, each head corresponds to a different part of that kernel, and then you can initialize it. So it is exact.
>> So it's like a very coarse thing that can then be refined with training.
>> Exactly. And then it can choose to change its weights so that it can undo the weight tying that you imposed on it this way. We actually had a follow-up paper which showed you could take a trained network and instantiate a trained CNN as a ViT as well. So there's a way to do this.
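To make the nine-heads-to-3x3-kernel idea concrete, here is a minimal toy sketch (my own illustration under stated assumptions, not the actual ConViT code): it builds an additive attention bias so that, at initialization, each of nine heads attends to one fixed offset of a 3x3 neighbourhood of patches, which is the "soft" convolutional prior that training is then free to wash out.

```python
# Illustrative sketch, not the ConViT implementation: bias a 9-head attention
# layer so that head h initially attends to the patch at one fixed 3x3 offset,
# mimicking a 3x3 convolution as a *soft* inductive bias.
import numpy as np

def conv_like_attention_bias(grid: int, strength: float = 10.0) -> np.ndarray:
    """Return a (heads, tokens, tokens) additive attention bias.

    grid: patches per side (tokens = grid * grid)
    strength: how sharply each head locks onto its offset at init; the bias is
              additive on the logits, so gradients can unlearn it (soft, not hard).
    """
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 9 heads ~ 3x3 kernel
    tokens = grid * grid
    bias = np.full((len(offsets), tokens, tokens), -strength)
    for h, (dy, dx) in enumerate(offsets):
        for y in range(grid):
            for x in range(grid):
                ty, tx = y + dy, x + dx
                if 0 <= ty < grid and 0 <= tx < grid:
                    bias[h, y * grid + x, ty * grid + tx] = strength
    return bias

# Usage: add this bias to the attention logits before the softmax. At init,
# each interior patch gets a near-one-hot "look at one neighbour" pattern per
# head (edge patches fall back to uniform), i.e. the layer behaves roughly like
# a 3x3 conv, but nothing prevents training from undoing it.
bias = conv_like_attention_bias(grid=14)
attn_init = np.exp(bias) / np.exp(bias).sum(-1, keepdims=True)  # softmax over keys
print(attn_init.shape)  # (9, 196, 196)
```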
It turns out that in the small data regime, and when I say small data here I mean say fewer than 500,000 data points, and this was in the context of image self-supervised learning, this is super helpful. Where this paper has actually been cited is a whole bunch of niche scientific problems where there's very little data, for example volcano prediction, where you have something like 1,500 data points. But the advantage of using this soft inductive bias decays as the data size increases and eventually becomes actively harmful. The threshold at which this changes is around a million data points, so not massive by any stretch by current standards. Basically, once you get past a million data points, that soft inductive bias no longer helps you, and it actually becomes mildly harmful. So I had this paper, and a couple of other papers, that all made the same point: once you get to enough scale, inductive biases don't matter at all. All that really matters is the learned posterior from the data distribution, and that's what defines everything. And then of course the rise of the transformer really showed that starting with models that have fewer inductive biases built into their architecture is the right thing. So we had this combination of factors, which was ultimately very confronting for me, because I had spent the last six years of my career working on inductive biases. And now I'm faced with several different papers, all of which show me that what I've been working on isn't actually that important.
>> The bitter lesson.
>> The bitter lesson indeed. So the bitter lesson was indeed very bitter for me, and that was really my inculcation in it, I suppose. At the end I thought to myself, okay, clearly the bitter lesson is true here, so what should I do in this new world? And it became clear to me that there were really two options that made a ton of sense: either go work on making GPUs go brr, and I'm not a hardware engineer, I don't know how to make GPUs go faster, or work on data. And for a whole bunch of reasons, data has been dramatically underinvested in relative to its impact. Something I've said before and I'll say again is that data is the most underinvested-in area of research relative to its impact, and I don't think it's even close. There are a whole bunch of reasons for this, which we can go into. Some have to do with the culture of machine learning, some with the incentives that have been set up. But data has systematically not been considered. Even if you go and look at the scaling laws work from Kaplan and Chinchilla and all these other things, they all assume IID data, which is insane. We know that all data are not created equal, that garbage in, garbage out is the oldest adage in computer science. And yet all these scaling laws assume all data is created equal. That makes no sense whatsoever. That's what led me to start working on this problem.
And it turns out there's a really cool thing about data research, in addition to it being impactful relative to the investment, which makes it a great research area and an even better company. What I said previously was that with representations, you have this disconnect between the questions which are scientifically interesting, understanding why a representation is good, and the questions that are practically relevant: how do I use this to improve it? What was so frustrating to me early in my career was that those were different questions a lot of the time. The questions that I wanted to ask, which were curiosity-driven and really interesting to me as a scientist, often ended up not being the questions that were practically relevant downstream. But it turns out that with data, this is no longer true. With data, if you can understand what makes a given data point useful, or what makes a given data point uninformative, you can almost always use that insight to make a dataset better and therefore make a model better. So the set of questions which are scientifically interesting and the set of questions which are practically relevant in data research are largely the same questions. That's really rare to find in research, period. It means we can ask the questions which, as scientists, are extremely motivating to us, while having very high confidence that the answers to those questions are going to help us build models that train much faster, train to much better performance, and can train with far fewer parameters. So that's a little bit of a high-level view of how I got into the data problem, and the pain I had to go through to get there in the first place.
>> You mentioned something about the incentives in data not being aligned. Can you unpack that? Because from the outside you have companies like Scale that have obviously become super successful, so people are investing a good amount of money. But what you're basically saying is, Nvidia is worth like four trillion and Scale is not four trillion. So why do you think there's that inefficiency?
>> Okay. So first off, we have to divide the research community from the industrial community, because I think they're very different, and in general data work has been far more consistently valued in industry than it has been in the research community. First and foremost, part of this is that data work has often been considered second-class-citizen sort of work. It's the grunt work. It's the plumbing. It's the stuff that you don't want to work on as a hoity-toity scientist. There were even some tweets going around recently with people saying data cleaning is boring, it's low-value work. Whereas I think what you'd find is that if you talk to the most talented AI researchers and ask them the secret to their success, they'll largely tell you that they look at the data. Ultimately these models are a reflection of the data you showed them. And yes, it can be tedious, it can be challenging, but it is so critical to get right. So first off, there's this general perception that this is lower-quality, or not quality, but lower-prestige work, and that's been there for a long time.
I think part of this had to do with the way research incentives were set up. The dataset was viewed as a given. If you think about research circa, say, 2018: given ImageNet, maximize performance on the val set or the test set, right? The dataset, ImageNet, was given as something you don't change. Even Kaggle had this framework: given the dataset, go and make this better. People might try things like bootstrapping or stuff like that, but generally the assumption was that you're going to improve the model through better modeling, not through improving the dataset. And part of this was that in the supervised learning era, it made sense. We generally weren't compute limited; we were generally very data limited. Data was very scarce. If you want to assemble ImageNet, you have to go to MTurk and get a whole bunch of people to label the dataset. And then there's generally some quality floor, because a human has looked at every data point in the dataset. Even if there are still a lot of errors there, at least it's not going to be as bad as a raw internet scrape.
But then in 2019 the field underwent this pretty massive change: we figured out how to train without labels. One of my more controversial viewpoints, I think, is that the transformer is a great advance to be sure, but it's one of a very large set of equivalently good architectures that we could have found, and there are many, many ways we could have gotten to the same performance without the transformer. But I do not think there's any way we could have gotten to where we are today without self-supervised learning and the ability to train on unlabeled data. That was the real advance, to my mind, that enabled these incredible increases in capabilities.
>> Which is like the masking objective.
>> It's not just masking objectives. The masked language modeling objective is one, but so is next-token prediction. Generally, it's this notion that, hey, instead of having to get an external label from a human, we can ask the model to predict one aspect of a data point from other parts of that data point. And that is really powerful, because think about it: that meant we went from ImageNet, a million data points, to literally trillions of tokens, a millionfold increase in data quantity in a matter of a few years. That's completely unheard of. And that also changed everything, because we went from data being scarce and having a high quality floor to data being absolutely massive. All of our models are basically always underfitting the data, whereas previously we would do 160 epochs on an image dataset and they would generally all be overfitting the data. So now we've moved to this underfitting-the-data regime. There's no more quality floor, and now we have all of these problems with redundancy, with low quality, with low information gain, all these various things that come with massive unlabeled datasets. So the problem also changed pretty dramatically from the 2010s to the 2020s. And I think that's what makes it so exciting as a scientific question: it didn't really make sense to study this prior to 2020, but now it makes tremendous sense, and it is, I think, absolutely critical to solve in order for these models to continue to improve, and also to enable their cost effectiveness, so they don't just stay as something only possible to achieve if you have hundreds and hundreds of millions of dollars. Making the data better can be a massive compute multiplier. It can change the performance per dollar by orders of magnitude. And in many ways that's our whole goal: how do we make that easy and effective for everyone?
>> Totally.
>> And you were at Meta from 2018 to September '23, which spans both Llama 1 and Llama 2. At what point inside Meta did some of these learnings become apparent, like, okay, we should start to spend resources working on this? You mentioned 2020, so I'm wondering if that was it.
>> I think Llama 1 was already a big breakthrough. Yeah, Llama 1 definitely put more effort into data filtering than many others, and that definitely started to change things. But even then, I would say that even when I left Meta, the idea of actually curating the data to figure out what's the high-quality, high-value data was still fairly underappreciated. If you talk to a lot of the folks on the data teams within the big frontier labs, what you'll find is that they've invested really heavily in crawling. Oftentimes they've really worked on getting better crawlers and trying to clean up the source of the data that's coming in, which makes sense. But ultimately, I think what you really need to do is take the perspective of: given everything that the model has seen so far, and given a potential candidate set of data, which data point is going to teach the model the most the next time it sees one? And that's a pretty different framing for how to think about this problem. There has certainly been some great work done, although it's all secretive within the bigger labs, but that's a really hard problem. That's a frontier research problem, and I don't think we know how to solve it yet. I think data curation is also a hard problem to solve, quote unquote, because it's not one where there's a single silver bullet. It's not "just do this one trick and all of a sudden things work." It's rather: here are 50 different things you can do, each of which provides a pretty modest gain on its own, but if you can figure out how to make them combine, you get a really big gain. You have to figure out, first, what are all these different things you want to do, and then second, how do you make them play nice with each other? Because by default they don't play nice with each other.
>> Yeah, I'll make a quick observation on, uh, you mentioned self-supervised learning. I definitely agree that just getting rid of labels altogether, or forming your own labels, is great. And I have a general observation that I think extends to things that are not just learning: self-supervised optimization, self-supervised neural architecture search, self-supervised curation. If you can just automate everything, I think that's the lesson, really. Just get the machines to do it, because we are the rate limiters if we must label everything.
>> Yeah, I think this is very true. It's actually something I think about a lot: are we falling prey to the bitter lesson again here by trying to have human-guided methods of data curation? Probably the best open effort on data curation is DCLM, DataComp-LM. It was led by Ludwig Schmidt, a professor at Stanford, and about 30 students across many different institutions. A really wonderful effort to curate Common Crawl-style datasets.
>> Yeah, we've actually covered DataComp and DCLM on the podcast.
>> Awesome. Great. But DCLM had a really cool study at the end of the paper that I don't think gets nearly as much attention as it should. They had these roughly 30 grad students spend about two years basically trying to design the optimal filtering criteria for these models, and they built a system that's pretty good at it. So then they asked all those students to predict what that system is going to do: given a data point, is the system going to say keep it, or is it going to say reject it? These are nominally the best experts you could ever hire to do this. These are students who have just spent all of their time looking at NLP data for two years. They could not predict what the DCLM classifiers would say above chance. So this comes up a lot: people often ask me, how can you possibly do this without a human in the loop? It just seems impossible; you need a human to actually rate these data. But I think the takeaway from that study, and there's a number of other pieces of evidence that also suggest this, is that obviously we have to be automated, because humans just can't scale to billions of data points, trillions of tokens. It's just not possible. But even if we could, we actually wouldn't want to: humans are not good at this task.
To give an intuition as to why humans aren't good at this task, I think the easiest way to think about it is that the value of a data point is not just a function of that data point itself. It's a function of how that data point relates to every other data point in the training set. So if I have 10,000 copies of slightly variable summaries of Hamlet, I don't need all of those. But if I were to look at any one of those individual summaries, I might say, "Hey, this is really high quality. It's accurate, it tracks all the characters, it's well written, it's clear. But I don't need 10,000 of those." And that's just a task that a human would never be able to do, because a human can't keep the whole dataset in their head, obviously. So even if you could have this scale with humans, you wouldn't want to.
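As an illustration of why a data point's value depends on its neighbors rather than on itself, here is a minimal, hypothetical sketch (my own toy example, not Datology's pipeline) of embedding-based near-duplicate pruning: a document is kept only if it adds something new relative to what has already been kept.

```python
# Illustrative sketch: greedy embedding-based near-duplicate pruning. A document
# is judged relative to what we've already kept, not in isolation -- 10,000
# near-identical Hamlet summaries collapse to roughly one survivor.
import numpy as np

def greedy_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a document only if its max cosine similarity to kept docs is below threshold."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit vectors
    kept: list[int] = []
    for i in range(len(emb)):
        if not kept or np.max(emb[kept] @ emb[i]) < threshold:
            kept.append(i)
    return kept

# Usage with toy vectors standing in for text embeddings:
rng = np.random.default_rng(0)
hamlet = rng.normal(size=8) + rng.normal(scale=0.01, size=(10_000, 8))  # near-duplicates
other = rng.normal(size=(100, 8))                                       # diverse docs
kept = greedy_dedup(np.vstack([hamlet, other]))
print(len(kept))  # ~1 survivor from the 10,000 summaries; most diverse docs kept
```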
>> But so what's the right number between 1 and 10,000?
>> The unsatisfying answer is "it depends," but it's also the right answer. It depends on how complex the concept is. Redundancy is really useful, right? Removing all redundancy is a bad thing. If I removed all redundancy, then I'd only be able to understand, say, a golden retriever in the one situation I've ever seen it in before. I wouldn't be able to generalize, and that would be bad. So some redundancy is good, but I think we all have the intuitive understanding that infinite redundancy is not good; it's bad. So where is the line for different concepts? Well, one example I like to give for this is elephants versus dogs. Elephants are pretty stereotyped. There are two kinds of elephants in the world: Asian elephants and African elephants. They're all gray, they all have floppy ears, they all have a trunk and some tusks, they all have wrinkly skin. African elephants are bigger than Asian elephants, but largely they're all pretty similar. There's not too much variability. So I don't need that much data or that much redundancy to understand the concept of elephants fully and completely. But dogs, on the other hand, are totally different. Dogs are super variable. There are hundreds of breeds, not to mention all the mixes of different dog breeds. They have different shapes, sizes, textures, colors, all of these different things.
The amount of data I need in order to properly understand dogs is going to be a lot higher than the amount of data I need to understand elephants. So this gets at some of the challenge when you're actually trying to do this sort of curation, at least on the filtering side. First off, you don't get a dataset where you're told, "hey, these are a bunch of dogs, these are a bunch of elephants." You just get: here's a bunch of data. So first you have to discover, in an unsupervised way, what the concepts are. Then you have to use something about each concept to infer how complicated or complex it is, and therefore how much data you need to understand it. Figure out: okay, this is a really complicated concept, I should probably keep a lot of redundancy; this is a really simple concept, I don't need that much redundancy. And then make the appropriate choice of what to remove. This is, I think, where a lot of the challenge comes from, but these are the sorts of factors you have to keep in mind when you're trying to design these systems.
>> How do you draw the line of a concept, though? Because then it's like, well, the elephant and the dog, but what about mammals? And then what about... you know what I mean? How should people think about it? Maybe that's why you need Datology, because it's hard.
>> Yeah, no, I think that's right to some extent. Look, it's an empirical question, like all things are. With every dataset you can choose a different level of granularity; ultimately it's a hyperparameter, a knob you can tune for how aggressive you're going to be with respect to creating new concepts versus keeping concepts together. And to your point, it's one of these things where it's why we've run hundreds and hundreds of thousands of experiments to try to figure this out. It requires a lot of experimentation to understand how to do it. One of the challenges we have is that not only do we have to make this work on one dataset, we also have to build a system that can automatically adapt to any arbitrary data distribution and make the appropriate inferences zero-shot on a new data distribution. So we kind of have two sets of questions: first, how do we push the frontier of data curation forward, and second, how do we do out-of-distribution generalization, where we say, hey, we have this great data curation approach, how do we make sure it generalizes to a novel data distribution?
>> I don't know if this is a good time, but I was going to ask for a brief history of datasets. It might be too much, I don't know. I'll just list some off, because we've done a Datasets 101 episode; I think that was one of our earliest episodes by far, because we want people to know the datasets. I think everyone starts at Common Crawl. I think every lab has their own web scrape. Would you say that's true, or do they start from Common Crawl at this point?
>> Yeah, like I said, this is where most of the labs have actually invested most of their time and effort: building better versions of Common Crawl for themselves.
>> Yeah. I'll just name-check some of these; if you have commentary, just chime in. GitHub, the source of code. Maybe Stack Overflow, even though that's cut off these days. I don't know, do people get code from anywhere else?
>> I mean, there are obviously places where you buy code data, but for public code I think those are the most common. Some interesting things about those that I personally find surprising: stars are not a good predictor of whether data is useful for models or not.
>> Not surprising.
>> Like, the most popular repos are not necessarily higher quality, at least with respect to whether they improve a model's coding capabilities. I haven't run that analysis myself, but the StarCoder paper has, and there have been a couple of other papers that have all shown it, which I just consistently found to be a little bit surprising. There are a lot of things that are kind of counterintuitive about data curation.
>> Did they, uh... this shows that I haven't read the paper, but did they find anything that was a sign of a good codebase?
>> There wasn't anything that was super predictive.
>> Oh man.
>> Honestly, in some ways, some of the simple heuristics like length actually ended up being better, but nothing was super discriminative there, which is kind of interesting.
>> Okay, cool. I'm going to keep going. arXiv, which is, you know, GitHub for papers. Books: Books1, Books2, and obviously Books3, which is controversial. I think Anthropic is getting sued over Books3.
>> Yeah, I think a bunch of people are getting sued. Meta has also, over Books3.
>> In some sense, can we just look past it? I don't know. It's like, books are transformative use. I don't know if you have a view on this.
>> Well, I think the recent ruling was interesting, although it was an appellate court ruling, so presumably it's going to go to a higher court afterwards. But what they ruled was that it's fair use so long as you purchase the book. So you can't download Books3 and then use it, because that's piracy; you've stolen the books in the first place. But if you bought a copy of all of those books, then you can train on them and it counts as fair use, which I think is an interesting, and to me pretty reasonable, line. One fun thing about Books3 is that it also has a lot of not-safe-for-work stuff in it, which is kind of interesting if you actually go and look through it.
>> There should be a Stripe one-click checkout for Books3. Just buy Books3, then get a warehouse and have them all shipped there.
>> I wonder what the cost would be. I'm sure somebody has run the numbers. I'll look it up.
>> I don't know if you can comment on this at all, but in the Meta lawsuit, I remember there was an email thread with some of the research scientists inside Meta talking about Books3, and Zuck was like, just do it. This is public, right?
>> Yeah, that was, I think, public and part of the lawsuits.
>> Yeah. Any reflections, comments?
>> All I can say is that when I was at Meta, legal stuff around datasets was certainly very challenging and becoming increasingly challenging, and there were a number of situations where the only person who could approve things was Zuck, because of the scale of the risk. It definitely made publishing at Meta near the end more challenging, around just what we could do with any dataset, because realistically, companies like Meta, OpenAI, and Anthropic are big targets for these lawsuits.
>> Yeah. So my conspiracy theory for what happened to Llama 4 is that the lawyers got to it. The lawyers got to the datasets.
>> And they had to change what they used.
>> They couldn't... yeah. They had their hands tied behind their backs when other labs did not, because Meta had an active lawsuit.
>> I think that's possible. I think probably more of it just has to do with the challenges of continuing to scale and having that be the goal. This is actually a lot of the reason why I got into data and started Datology: the scaling laws were always terrible. What the scaling laws paper showed, yeah, the Kaplan one, was that there's a predictable relationship between performance and compute and data. That's really useful, but it was a bad predictable relationship. Power law scaling is terrible. It means that every time you 10x your data, you get a diminishing marginal return on performance. This is why you had these prognostications that, oh, GPT-N is going to cost a trillion dollars to train. It's because you take that scaling curve and you just naively extrapolate it. And I think that's what we've seen to some extent with the failure of the mega models, right, with GPT-4.5 and Llama 4 and others. There's a challenge with just continuing to do that naively, and you have to figure out how to break it. I think there are a number of theories for how to break it, and I don't think they're mutually exclusive. My bet is that data quality is a massive way to do this. In many ways, the foundational paper for Datology is called "Beyond Neural Scaling Laws," and it was fortunate enough to get a best paper award at NeurIPS. What that paper showed was that if you use your data correctly, you can actually bend the scaling laws themselves. An interesting technical part of this: as I mentioned, what we really care about is how much new information you learn from the next data point, so technically that's the marginal information gain per data point; perplexity is another variant of it, and there's a duality between them. It turns out we were able to prove this in perceptrons, at least, because that's generally what you can prove things in, so at small scale. This work was led by Ben Sorscher, who was a really fantastic grad student I worked with on the paper. What he showed is that there's a direct duality between power law scaling and the fact that the marginal information gain per data point also decays as a power law. That's why you get power law scaling: every successive data point is teaching you less and less and less, following a power law, so performance decays as a power law as well. If instead you can keep that gain flat, then you bend the scaling law, and all of a sudden you learn dramatically faster, because the amount of information you're learning is not decaying with dataset size. Now, that was all in theory what you could accomplish, and we proposed a couple of metrics that got us one step there. But in many ways I would say the whole point of Datology is: how do we realize the potential that was shown in that paper? How do we actually make that a reality? And I think fundamentally, if we want to get scaling to work well, we need to do a better job here.
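A rough sketch of that argument, in my own notation rather than the paper's exact statement:

```latex
% Standard power-law scaling of test loss L with dataset size N:
L(N) \;\approx\; L_\infty + a\,N^{-\alpha}, \qquad \alpha > 0
% so the marginal gain from one more example itself decays as a power law:
-\frac{dL}{dN} \;\propto\; N^{-(\alpha+1)}
% If curation could hold the per-example information gain roughly constant
% instead of letting it decay, the loss would fall far faster in N -- the sense
% in which good pruning "bends" the scaling curve (the paper argues exponential
% scaling is attainable in the idealized perceptron setting).
```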
>> Are you measuring the quality of these open datasets over time? Are the most recent open datasets better than the older ones at a good rate, or just marginally?
>> They do get better, but not relative to the headroom and potential, I would say. Like, Nemotron is actually pretty similar in quality to DCLM. It came out about six months later, and it has more unique tokens; they made a really big deal about it having more unique tokens, but on average the quality is pretty similar. So when we think about what we're able to accomplish at Datology, we usually think along the three axes I mentioned: train faster, train better, train smaller. The first question is train faster: given a certain baseline dataset, how much faster can we achieve the same performance, i.e., in how many fewer tokens? We're now able to get to the same performance as DCLM about 12x faster. So in fewer than 10% of the tokens, we can match what you get from training to convergence.
>> And when you say performance, do you mean like GPQA, or do you mean loss?
>> Yeah, so we typically take the accuracy across 15 standard benchmark tasks that are relevant for a given model size. So your MMLUs, your ARCs, your RACEs, etc.
>> The problem with those is, are you training to the test, right? I'm sure you know this.
>> And that's something we're super careful about, because it's really easy to overfit to these benchmarks, of course, and then end up with models that are really brittle. I think this is something we've seen especially with synthetic data, and synthetic data is a big part of what we do at Datology. We've found that it can drive pretty dramatic gains if you do it correctly, and there are lots of ways to do synthetic data incorrectly. We've seen a number of models that are trained on a lot of synthetic data and end up doing really well on benchmarks but then kind of don't pass vibe checks, and people don't really use them. So we do a lot to try to prevent this. First and foremost, we keep a held-out set of test sets that we only look at very occasionally, and there are also a whole bunch of other evals that we don't evaluate on during development, which the models only get evaluated on afterwards, to really ensure this. But yeah, this is fundamentally how we measure. We look at an average of benchmarks, trying to think about what's fair and reasonable with respect to what we can do. So that's the first thing we typically look at. Then we look at train better.
Of course: under the same compute budget, how much better can you do with a given dataset? We're able to beat the best open datasets by anywhere from four to five points, depending on the specific dataset and eval. Four to five points on average, and those are absolute points. We generally find that in order to get that same performance from training longer on the baseline datasets, you'd have to train on them at least 5 to 10 times longer, because every successive point of accuracy of course gets harder and harder to achieve. And then finally, train smaller. Basically: holding performance constant, what's the smallest parameter-count model that we can get to outperform? We can already get models with fewer than half the parameters that also train faster and also outperform the larger models trained on the uncurated or alternatively curated datasets by a large margin. So this is a big roundabout way of getting to the answer of whether the open datasets have kept up with this improvement. With a fairly small team, we're now about 30, and most of the results I've discussed were achieved with a team of under 20, because we've grown quite a bit in the last couple of months, and with not that much compute by common standards, more than academics, but certainly nowhere close to the frontier labs, we've been able to achieve, I think, pretty dramatic results. I think the reason is that there's so much headroom here. We've already been able to get 10x gains; I think there's at least another 100x behind this still to be had. There's so much stuff we're just not even doing right now that I know makes sense to do, let alone all the things that we are doing that I know we could be doing better, where we're still very suboptimal. I know that the way we do our synthetic data right now could be much better, the way we do our filtering could be much better, the way we do our model-based filtering, our embedding-based filtering, all these different aspects could be much stronger.
So I think there's just so much headroom here. The challenge is that there's not a huge incentive to do this in the open dataset community. The labs, which have the biggest incentives, obviously have strong incentives not to share anything with respect to that. So you're left with the Allen Institute, things like DCLM, Hugging Face, etc., to make progress there. But I do think this is a hard enough problem that it really demands a whole company focused on it. What you see in all the frontier labs is that they have data teams, and if you talk to the folks who work on those data teams, what you'll hear pretty systematically is that they're typically under-resourced relative to the gains they're delivering, and that they're always having to fight for attention. This is just a fundamental thing that I saw at Meta, I saw at DeepMind, and I've heard at all these other places. It was a big part of why I decided to start Datology instead of doing this within Meta. I had the opportunity to start a data team there to try to centralize this. But fundamentally, I think this is such an important problem that it needs to be the end itself, not just the means to the end, which I think is what it is in many of these big groups. You need a large team of really talented people who are really passionate about looking at the data, and there aren't that many people who are that passionate about it, to focus on how to build the best possible datasets for model training. I think it's hard to do this as a data team. I think there's a real benefit to being a data company, and that's a lot of why I started Datology.
>> How do you think the economics of the open-source datasets world evolve? Because you basically have these open-source datasets that are good, but maybe not quite good enough to build production data systems on, and then you have companies like yourselves sitting on top of them. Do you think at some point there's going to be some sort of rupture, like, hey, why are you just taking my open-source dataset and making it better in private for people without contributing back? And do you guys have plans to then open-source other sets? I think there's this open question of whether these things are actually useful in the open, or whether you should just do it in private.
>> Yeah, it's a great question, and one we've thought a lot about. So first off, one thing to note is that while we do work with folks who are training on just open data, in general we really built and designed our product to work with companies that are training on a combination of open-source and proprietary data. That proprietary data could be data they've been collecting as a matter of business for the last decade, or it could be data they've sourced from a data annotator or another data provider. Some folks we work with have all three, right? They're going to use open data, data they've acquired, and data that's part of their business to begin with. And that's a lot of where our focus goes, although of course we're excited about working with lots of folks who are training on more open datasets. I published for a decade, more than that even, so this is very near and dear to my heart, and it's something we've thought a lot about at Datology.
I think one of the challenges of building a startup today, especially a startup for which science is a critical component, which, as I mentioned, is one of the things that really attracted me to starting Datology, is this tension: fundamentally, we have to build a business, and in order to do that, we have to have a moat. You can think about three places our moat could come from. One is science know-how, one is engineering infrastructure and the challenge of just implementing this yourself, and then finally there's a brand moat you can eventually reach. We're very far from a brand moat at this point in our journey. Eventually, I would love to have a brand moat where whenever anyone thinks data and AI, they think Datology, that's where I should go first. I hope we get to that point, but in the meantime we have to rely on the other two moats, the science know-how and the engineering infrastructure. On the open data side, what we've seen is that the engineering infrastructure definitely can be a moat, but the science know-how moat is actually pretty important, and a lot of the evidence we've seen so far has suggested it's meaningful. As an example, with many of the customers we talk to, one of the first things they'll ask is, hey, compare to the best open-source dataset. So if we were giving away everything needed to build that best open-source dataset, some folks would just go there. I think that's where our challenge has been. Now, what we've tried to do, and I think we've done a good job of it, and I'm generally happy with the balance we've struck, is to try, in the blog posts we put out, to give a lot of intuition as to what we're doing and how it works, without necessarily getting to the point of reproducibility. That's, I think, much more open than you see most of the big labs be.
>> Yeah.
>> If you look at the data section of, like, the Gemini tech report, it basically says data quality was the single most important thing for making a great model.
>> One paragraph: "we used algorithms and heuristics." Like, great, you know. So, some people were even pointing out recently that there's been a lot more attention on rephrasing as a method for using synthetic data.
>> Was it the Apple paper?
>> The Apple paper, the Kimi paper has mentioned it, and a bunch of others. And some folks recently pointed out that, hey, in our blog post from November we were talking a lot about that; it's something we do a lot of. The person who first came up with rephrasing was one of our first employees. So we've improved on that pretty dramatically and taken it to new places, but that's something where I think there would have been an incentive to just not even talk about it at all.
>> Sorry, just on that, do you feel like this is a great example of: you were talking about it in the blog post, and then the Kimi paper comes out with a model, and then people are like, oh, rephrasing is important, but you're like, hey, I was telling you that before, I just didn't have a model to show you that it was important. Do you think that's still, even in open science, a limiter for people, that if you don't have a model, people don't care? Same with DeepSeek: a lot of the things in the paper were kind of known, but once you have them applied, people care.
>> I think that's certainly something that happens, and I think it speaks to the same sort of cultural incentives we talked about earlier, where people tend to think about this as ultimately being a means to an end. And I understand why that is, of course; ultimately, when we sell better data, we're selling a better model at the end of it, a more cost-effective model. But I think the fact that people don't care about it as much unless they're really smacked in the face with it is both a tragedy and an opportunity. I would love it if that weren't the case, but given that it is, that's the opportunity we see at Datology to really make an impact here.
>> This might be a little bit of a tangent, but you mentioned synthetic data, you mentioned rephrasing, so I figured now is a good time to go into it. I figured that most of the work of data is filtering, but I see synthetic data as something slightly different. It is in the general domain of improving data quality, but it's different from filtering.
>> Yeah.
>> Am I right to equate synthetic data with rephrasing, or are there other parts to synthetic data in your mind?
>> Yes, I think there are different parts of synthetic data. There are two parts, but let me first comment on the filtering-versus-everything-else point. I used to actually use the words data filtering or data pruning, and that paper I mentioned that was at NeurIPS actually has data pruning in the title: it's about how you beat scaling laws through data pruning. When I started Datology, I deliberately changed the language to data curation rather than data pruning or data filtering, because curation is a lot more than just filtering. Filtering, saying, "Hey, this is a bad data point, we want to get rid of it," is absolutely an important part of what we do. But it's also about rebalancing datasets: upsampling certain data distributionally and downsampling other data. That might not mean filtering; it might just be changing the weighting with which you take it. The order in which you present data can be really impactful: curricula, and we've now seen this with discrete curricula for multi-phase training and things like that. That's not filtering. The way you batch the data can be an important factor. Synthetic data can be an important factor. The way you mix sources. All of these sorts of things go beyond just filtering. So filtering is a very important part of what we do, and it will always be something we care a lot about, but it's much more than that.
that. Okay. So now to the question about synthetic data. I think of at a high
synthetic data. I think of at a high level there are two approaches to synthetic data. And we have focused more
synthetic data. And we have focused more on one of them, the rephrasing one other although I think there is opportunity in the other one. So the first approach is create new data where the knowledge
that's in that data is largely coming from the model that's generating that synthetic data.
>> Oh, that's distillation, then.
>> It's a version of distillation. And I think this version of synthetic data could be construed as distillation in disguise; I think it is a very clear version of that. And when you think about the criticisms around synthetic data and model collapse, I think they largely apply to this version, where you have net new data creation coming out of these models. So that's one.
>> I'll slip one in there: there's also model steganography, where you can sort of hide preferences in a model and distill it down.
>> Absolutely. And now we've seen the recent owl stuff around that.
>> If people search "Anthropic owls," you'll see it.
>> Yeah, exactly.
>> The other way is this rephrasing or rewriting approach. Here the information in the data is actually coming from the data you're conditioning the rephrasing on in the first place, and all the model is doing is reformatting that data or presenting it in a new way that may be easier for a model to learn.
>> Yeah. Cleaning, right?
>> It's cleaning it in some way. It could be cleaning it, it could be making the information more accessible, it could be putting that information in a format that is more representative of what the model is going to be faced with downstream. So one of the things that definitely happens with synthetic data is that we are bringing more post-training-like data into pre-training.
>> Sounds like SFT. And in general, one of my beliefs is that most of what we do in post-training is better done in pre- and mid-training, and earlier on in training in general.
>> It's just the scale, you know, you don't have that scale until now.
>> Yeah, exactly. I think if you assume the paradigm where pre-training is incredibly expensive and something you can only do very rarely, and then post-training is cheap, then it makes sense. But as soon as you break that assumption, and I think DeepSeek already showed that you can get a frontier model for a marginal cost of a couple million dollars, and that's gone down since then because we've gotten better at it and compute has come down in price. I believe that getting to a frontier model should cost a million dollars or less for most organizations, at least in a specialized domain. And when you think about what enterprises need, that's generally what they need. They don't need a model that can do everything. They need a model that can do a constrained set of tasks to very high accuracy for as low an inference cost as possible, and I think that will be under a million dollars very, very soon, and that changes a lot of these dynamics.
But going back to the synthetic data question and those two different types: with net new creation, I think that's where you have a lot of risk. That's where you get the model collapse concerns: I train a generative model on a given data distribution, it overfits the modes and underfits the tails. So if I have it generate a bunch of data, that data is going to be more mode and less tail, and if I do that repeatedly, eventually I get a spike, a delta function.
>> Only mode.
>> Only mode, exactly. It makes sense why that happens. I will note that if you filter the data after each round, that's now information injection, and that can break this cycle and, I think, prevent model collapse.
>> Which is a little bit what RL is.
>> I think you can absolutely view it that way. And a lot of the work suggesting that RL is really just eliciting the capabilities of pre-trained models, like the random-rewards or single-example results, where RL is just aligning to the distribution the model had in the first place, is very much in line with that way of thinking about it.
>> You're distilling from a perfect model, which is the environment or the verifier or whatever, and then you're distilling that into the thing you're training. So yeah, it's amazing. It's beautiful.
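To make the collapse-versus-filtering dynamic concrete, here is a toy, self-contained 1-D sketch (my own illustration, not anything Datology runs): a generator that samples its own fitted distribution a bit too peakily collapses toward the mode over repeated generations, while re-weighting each generation against an external reference signal, one simple form of the "information injection" described above, keeps the spread roughly intact. The `real_logpdf` scorer is a stand-in for whatever outside verifier or quality signal you have.

```python
# Toy sketch of iterative self-training collapse, and of filtering/reweighting
# against an external reference as a mitigation. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def real_logpdf(x):
    # Stand-in for an outside quality/verifier signal: "real" data is N(0, 1).
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def gen_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def run(generations=30, n=2000, sharpen=0.9, inject_information=False):
    data = rng.normal(0.0, 1.0, n)                 # generation 0: real data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()        # "train" the generator
        proposals = rng.normal(mu, sharpen * sigma, 4 * n)   # peaky sampling
        if inject_information:
            # Reweight candidates toward the external reference
            # (self-normalized importance resampling).
            logw = real_logpdf(proposals) - gen_logpdf(proposals, mu, sharpen * sigma)
            w = np.exp(logw - logw.max())
            data = rng.choice(proposals, size=n, p=w / w.sum())
        else:
            data = rng.choice(proposals, size=n)
    return data.std()

print("pure self-loop, final std :", round(run(inject_information=False), 3))  # collapses toward 0
print("with filtering, final std:", round(run(inject_information=True), 3))    # stays close to the real spread
```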
>> But the cool thing about rewriting is that the model doing the rephrasing just needs to know how to rephrase. It doesn't need to know anything about the content itself; it doesn't need to understand it. That means you can use a pretty weak model to do the rephrasing and have it generate data that can teach a model that's much better than the model doing the rephrasing. With the distillation-in-disguise approach, I'm generally quite skeptical that you can get a model that will be better than the teacher generating the synthetic data when you do that net new data creation. It's possible you could, through some sort of heavy rejection sampling on the big model, because you're effectively inserting new information when you say which of the synthetic outputs is good or bad; there's some new supervision coming in there. But I'm generally skeptical of that. Whereas we actually have a blog post coming out in the next week or two about our synthetic data generation, which we call BeyondWeb, and we'll have some cool scientific experiments in there too, to our earlier point of trying to share some of the science in a way that's sustainable for our business. One of the things we show there is that by doing this rephrasing effectively, you can get a model to do much, much better than if you had trained on all of the raw tokens in the first place. So you can break this data wall and get models that are better than either of the models that generated the data. With rephrasing I think this is super possible, because most of the information is coming from the data. It's not coming from the model itself.
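As a rough sketch of what the rephrasing flavor of synthetic data can look like in practice (an illustration only; the actual BeyondWeb pipeline is not public), the key property is that the source document carries the knowledge and the model, which can be fairly weak, only restyles it. `call_llm` below is a hypothetical placeholder for whatever instruction-tuned model you have access to.

```python
# Minimal rephrasing-style synthetic data sketch: knowledge comes from the
# source document, the model only reformats it into several styles.
from typing import Iterator

STYLES = [
    "a clear textbook-style explanation",
    "a question-and-answer pair",
    "a concise set of study notes",
    "a dialogue between a student and a teacher",
]

PROMPT = (
    "Rewrite the passage below as {style}. "
    "Preserve every fact; do not add information that is not in the passage.\n\n"
    "Passage:\n{document}"
)

def call_llm(prompt: str) -> str:
    """Placeholder: swap in any small instruction-tuned model you have access to."""
    raise NotImplementedError

def rephrase_corpus(documents: list[str]) -> Iterator[dict]:
    """Yield several restylings of each source document for the pre-training mix."""
    for doc in documents:
        for style in STYLES:
            yield {
                "source": doc,   # keep provenance for dedup and later filtering
                "style": style,
                "text": call_llm(PROMPT.format(style=style, document=doc)),
            }
```

Rephrasing each document into several styles is also one simple way to push back on the narrow-distribution risk that comes up a bit later in the conversation.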
>> A couple of follow-ups on that, just things I've always wondered. Are textbooks all you need?
>> No, they are not all you need. I think textbooks are great, and there's a lot of really great content in high-quality data points like that. But textbooks are also a very narrow data distribution, and if there's only one thing you should take away from this entire interview about what is good for data quality, it's diversity. I used to do all this work on out-of-distribution generalization, and we had all of these very careful studies where we would say, okay, let's hold out this corner of the data distribution where the model has never seen this combination of things, and see if it can generalize. And then LLMs and the modern way of training models came along and said, hey, what if nothing was out of distribution? What if we just train on everything and everything's now in distribution?
>> And by the way, that is in line with AGI, right? So you might as well.
>> And that's basically what we've done, and it's worked shockingly well, way beyond what I think most people would have expected. I certainly was shocked by it. I made a strong bet that there was no way you could get compositionality just from scaling, and...
>> Well, you can, turns out. It does work when you get big enough. What I was really referencing was the Microsoft Phi papers, right? 1, 2, 3, 4. A lot of them do the rephrasing or rewriting in textbook format, and I feel like that's a little bit of cargo culting: just because you write like Wikipedia or write like textbooks, the models learn better. That's not automatically proven to be the case.
>> I think that's also probably part of the reason why you see a big difference between the benchmark scores of those models and their real-world use.
>> They went to too narrow a distribution.
>> And I think this is the problem with synthetic data fundamentally: you're always going to have some bias there. I think you can do a lot to make it more diverse, and we have put a lot of effort into finding ways to do that. For example, we rephrase into many, many different styles and formats; that's really important to get stuff that's good. But that is the risk, right? You go to way too narrow a distribution, models are always going to be fairly peaky in their output distribution, and that actually results in reduced diversity. That said, there is a takeaway from textbooks-are-all-you-need that I think is correct, which is that repeating higher-quality tokens is almost always better than seeing net new lower-quality tokens. Epoching over higher-quality data is almost always better than getting the same amount of new data of unknown or average quality, average in this case being what you get from an internet dump, or even a reasonably filtered internet dump.
>> It's always better. The modification, or the study I would want to commission out of that, is: instead of having another epoch on high-quality data, if you found high-quality data, go and paraphrase it and then train on that; maybe that'll get additional gains. I don't think I've seen any papers to that effect.
>> The K2 paper actually had an experiment to that effect, where they tried adding multiple epochs, looked at how many rephrasings they did of each, and had some results along those lines.
>> Amazing. And then the other question was more on curriculum. Curriculum learning had a bad rep for a while. How come it's back? What's changed?
>> Yeah, so a bunch of things, and this is really interesting. When I was initially deciding whether to start Datology, raising, and talking to various initial recruits, it was mid-2023, and at the time I was saying curricula are going to be a really important aspect of this, and a lot of people basically just said, no, curricula don't work; we tried this a bunch of times and curricula don't work. Curricula are one of those ideas that I think always had to work, in the sense that it just made too much sense. There are a number of things like this where it might be hard to figure out how to make them work well, but they always had to work. There's actually a really cool paper from Stanford that had a nice way of conceptualizing this. Imagine a graph where each node is a concept or idea you want the model to understand, and the edges are the dependencies between those concepts: if concept A helps you learn concept B, there's an edge from A to B. Now imagine that graph over all concepts in the world and all the edges between them, a huge graph. If that graph is empty, it would mean nothing is helpful for learning anything else, and curricula would make no sense; you should just randomly order things. If that graph were complete, with an edge of equivalent weight between every pair of nodes, it would similarly mean everything is equally useful for learning everything else, and curricula don't work and you shouldn't use them. For any other graph besides those two, curricula make sense. And I think it's pretty obvious that neither of those is the graph of the actual world we live in. Clearly, the world does have dependencies, some very obvious, like the fact that it would be hard for me to do division and multiplication if I don't understand addition and subtraction, and some much more vague. So I have always believed this has to work. The challenge has largely been that if you're fully saturating your data, there's really no advantage to curricula unless you wouldn't be able to learn the thing otherwise. Generally, the idea behind curricula is that they make you much more efficient, but in the supervised learning world we were fully saturating these datasets, so maybe a curriculum would get you there faster, but that wasn't the bottleneck or the limiting factor. So there wasn't a clear incentive to go and do the hard experiments to figure out how to make a good curriculum, because who cares if I can get you to ImageNet performance in 80 epochs instead of 160 epochs? That's nice, but it's not a big deal in the first place.
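A minimal way to see the concept-dependency framing is to write the graph down and derive an ordering from it. The sketch below is my own toy rendering of that picture, not code from the Stanford paper mentioned above: concepts are nodes, an edge A -> B means A helps you learn B, and any graph between the two degenerate extremes (empty, or complete with equal weights) implies that some presentation orders are better than others.

```python
# Toy concept-dependency curriculum: order training examples so that a concept
# is only presented after its prerequisites. Concepts and edges are made up.
from graphlib import TopologicalSorter

# {concept: set of prerequisite concepts}, i.e. edge prereq -> concept
PREREQS = {
    "addition": set(),
    "subtraction": {"addition"},
    "multiplication": {"addition"},
    "division": {"multiplication", "subtraction"},
    "fractions": {"division"},
}

def curriculum_order(prereqs: dict[str, set[str]]) -> list[str]:
    """One valid teaching order: present a concept only after its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())

def order_examples(examples: list[dict], prereqs: dict[str, set[str]]) -> list[dict]:
    """Sort training examples by the curriculum stage of the concept they exercise."""
    stage = {c: i for i, c in enumerate(curriculum_order(prereqs))}
    return sorted(examples, key=lambda ex: stage[ex["concept"]])

examples = [
    {"text": "3/4 + 1/8 = ?", "concept": "fractions"},
    {"text": "7 + 5 = ?", "concept": "addition"},
    {"text": "6 * 9 = ?", "concept": "multiplication"},
]
print([ex["concept"] for ex in order_examples(examples, PREREQS)])
# -> ['addition', 'multiplication', 'fractions']
```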
But now we're in a totally different world where all of our models are underfitting the data. Now this is super important, and getting a curriculum right could literally make the difference between spending ten times as much on a model training run, potentially hundreds of millions of dollars. All of a sudden, curricula make a ton of sense. I think that's why the problem didn't previously seem worth putting a lot of effort into, and now we've seen pretty clearly with discrete curricula that this makes a big impact. Largely, what we talk about when we say mid-training is really just a later phase of your discrete curriculum; that's another way of thinking about it. You could even think of post-training as part of a curriculum. In fact, one of the things I'm really excited about: we've mostly focused on pre- and mid-training at Datology so far, and one of the most consistent asks from every one of our customers has been, can you do more on post-training, can you also help us curate the post-training data? So we're starting to invest pretty heavily there. And one of the things I'm really excited about is viewing this whole thing, from pre-training to mid-training to post-training, holistically as a single process, and then asking questions like, how do we optimize our pre-training data to make post-training more effective? These are really exciting questions, and something you don't see happen even at the big labs, because they have entirely separate teams: there's a pre-training team, a mid-training team, and a post-training team. The mid-training team is a customer of the pre-training team, and the post-training team is a customer of the mid-training and pre-training teams, but it's quite hard to actually have signals propagate through all of those. So I think this is a really exciting area.
>> I'll push you a bit on this. A popular view is that post-training is elicitation of capabilities that you already trained in pre-training. So what dependencies can you have that feed back into pre-training?
>> I'm inclined to agree with that view, and I think that view leads very strongly to the conclusion that you should be trying to optimize your pre-training data to make post-training processes more effective. You should try to figure out: how do I optimize my pre-training data so that the slope of the test-time compute curve, or the slope of the RL curve, is as steep as it can possibly be? Or alternatively, how do I optimize my pre-training data so that the slope of the jailbreaking curve is as shallow as possible? Fundamentally, I think alignment in post-training doesn't really make sense as a long-term solution. If you can easily align a model through post-training, you can easily misalign a model through post-training. If it's easy to put in, it's easy to take out; if it's really hard to put in, it's really hard to take out. That's just a truism of these models. So if you do alignment during pre-training, you'll end up with models that are, I think, largely impossible to misalign without putting a massive amount of data into them, and there are a lot of benefits to that. I think we've also seen evidence for this in the difference between Llama and Qwen with respect to their ability to be post-trained. It's much easier to RL Qwen than it is Llama, and likely that has to do with the fact that Qwen put a lot of synthetic reasoning traces into their training data.
>> Even with wrong examples.
>> Yeah, even with wrong examples, it's still there, which is wild. But I think that pretty clearly shows it's the base model that's doing it, not the rewards you're giving. If you give random rewards and the model still learns, it's probably not the reward signal that's doing it.
>> That's cool.
>> I'm just curious about customer usage. How many people are doing post-training? Obviously nobody today, because you don't have it yet, but when people come to you, are they looking mostly to do post-training on open models, on OpenAI models, or what do they ask for?
>> Yeah. So we usually work with folks who are either training their own models from scratch, or doing continued pre-training on an open model with a bunch of domain-specific data that's unique to their use cases and their business. We typically focus on folks doing training at a significant cost, so typically at least a couple tens of billions of tokens, oftentimes more. The standard small-scale post-training or fine-tuning case we don't focus on as much. That said, a lot of people have asked us consistently: who's actually training their own models? Why not just rely on the open models? And there are a number of reasons why we see people do this. First off, sovereign AI has been a pretty big place where we've seen a lot of demand. Lots of countries want to have models that they own, that are unique to their language and their culture, and that requires them to have really good data curation in order to do effectively.
>> Just to double-click, countries owning models isn't actually a thing that I know about. I'm from Singapore; we have the SEA-LION model, but it's not owned by the country, and I can't name another country that owns a model.
>> Yeah, I think that's actually correct. Largely what you see right now is public-private partnerships where governments are making pretty large grants.
>> TII in the UAE is the closest.
>> Yeah, I think you have those, and you have places where the funding is the country and it becomes a little unclear where it comes from. But usually what you see is countries doing big grants to private companies, or public-private partnerships, to go and build that sort of thing. So that's a big one. We've also seen a lot of larger enterprises that have a lot of their own data and want to do this.
When you think about it, across those three value props, train faster, train better, train smaller, which one matters? Train faster is, in principle, the easiest one to compute. I can say, okay, this model would have cost you $10 million to train, and I get it to you for a million dollars, or for $800,000, or whatever. Great, I saved you a ton of money. In practice, though, nobody wants to train a $10 million model for a million dollars. They want you to already have the model.
>> They already have that. They want to train a $100 million model for $10 million.
>> They want to train better. So train faster usually doesn't matter so much from the perspective of, hey, this model is now a lot cheaper. It matters a lot more from the perspective of iteration speed. When you think about the workflow of most ML engineers, you start a training run and then you sit on your hands until it finishes; you find something else to do, but largely you're waiting, and your iteration is bounded by how long that takes. If you can take something from ten days of training to finishing overnight, your existing team is way more productive and can do far more iterations. So that's where we usually see train faster matter the most. Most people care the most about train better: I can get a better model for the same compute, and we can absolutely deliver that through data. Data is effectively a compute multiplier, because all models are underfitting their data sets. If you can make your model more data efficient, you effectively make your compute more valuable: if you think about compute as injecting a certain number of dollars and getting a certain performance back, then with better data I get more performance back per dollar invested, and now my compute is more valuable. So that's where train better tends to be the most meaningful thing.
But interestingly, for the companies that are most advanced in their AI transformation journey, train smaller is the one that actually means the most. When you think about the total cost of ownership of these models, it's going to be very heavily weighted towards inference; it's all inference. Think about a company that's spending, say, $50 million a year on inference, which in the scheme of things is not very much. If you deploy a model that's twice as big as it needs to be, that's costing you an extra $25 million in year one. The cost to train a model that has fewer than half the parameters but is just as good or even better at your particular use cases is, say, two or three million dollars. That's a no-brainer if you can do it easily. If it's really hard, you're never going to do it; but if you can do it easily and get it right on the first try, it's a no-brainer. And $50 million a year is not going to stay the number, right? We know that all of these products have a tiny fraction of what their eventual user bases will be. We're still very much in the first inning here. Everyone who listens to this podcast is using AI non-stop, but the rest of the world is not yet. So the inference costs are going to skyrocket with these models.
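The back-of-the-envelope math behind the train-smaller argument is worth writing out, using the illustrative numbers from the conversation and the simplifying assumption that serving cost scales roughly linearly with model size:

```python
# Rough train-smaller economics with the conversation's illustrative numbers.
annual_inference_spend = 50e6     # $/year with the oversized model
oversize_factor = 2.0             # model is 2x bigger than it needs to be
retraining_cost = 3e6             # one-time cost to train the right-sized model

right_sized_spend = annual_inference_spend / oversize_factor
annual_savings = annual_inference_spend - right_sized_spend
payback_months = 12 * retraining_cost / annual_savings

print(f"annual savings: ${annual_savings / 1e6:.0f}M")        # $25M
print(f"payback period: {payback_months:.1f} months")         # under 2 months

# And the gap widens as usage grows: if inference volume 10x's, the same
# oversized model burns an extra ~$250M per year.
print(f"excess at 10x usage: ${10 * annual_savings / 1e6:.0f}M / year")
```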
And if you use a general-purpose model that you then constrain, saying, hey, this model knows about everything, but now only do this one thing, that model is going to have a ton of parameters that don't need to be there and that massively increase the cost of serving it. So when you think about the use case of an enterprise, they need a model that's an inch wide and a mile deep: it can do a small handful of things, but it can do them really effectively, to five nines of reliability, for as low a cost as possible. The economics make it so that it really makes a lot of sense to do this yourself, if you can do it easily. The way we think about it is that there were two big barriers: first you have to get training right, and then you have to get data right. On the training side, three years ago this was super hard, but Mosaic was the first to really recognize that there was a huge opportunity in making it easy, and now this has largely been commoditized by things like SageMaker and Together and lots of different folks who help you on the training side. But on the data side, the barrier is just as high as ever. In many ways, that's our mission at Datology: how do we bring that barrier down so that anyone who wants to train a model can do so with the best quality data on their first try? They don't have to spend forty years in the desert; they don't have to get it wrong a hundred times first, which is what will happen if you don't have this experience. Instead, on the first shot, they get a really great model.
>> Yeah, just a follow-up question on train smaller. I fully agree, and I think this is something a lot of people are investing in. You are primarily doing work on the data side, data pruning, which maybe is a bad word now, data curation, whatever. But a lot of people, Jonathan Frankle was on the podcast very early on, were betting on pruning the model itself: you have a working model of a given size and you just lop off any weights below a certain epsilon. Is that confirmed to just be dead?
>> So it's funny, Jonathan actually interned with me when I was at Meta, and we worked on this stuff together. He had the lottery ticket hypothesis, which is a really beautiful paper.
>> Which he now completely disowns.
>> Which he largely disowns. I had this whole idea, when Jonathan and I worked together, that we wanted to create a lottery ticket initialization: an initialization you'd sample from for initializing the weights that would be one of these perfect winning-ticket initializations. But we actually found that the lottery ticket was data dependent, and that's where the fundamental problem came from: as soon as you change the data distribution a little bit, the winning tickets change in a really big way. I don't think pruning is dead.
Parameter pruning still absolutely has a place, but we certainly found it challenging to really realize its potential. One of the big tricks with parameter pruning, to be clear, was that unstructured pruning, where you prune individual weights, viewing all the weights as one big smorgasbord, worked really well: you could remove massive quantities of the weights with unstructured pruning. The problem is that unstructured pruning doesn't really give you a clear compute advantage, because you now need a sparse matrix to reflect it, and there's a pretty huge overhead to sparse matrix multiplies. GPUs are not very good at sparse matrix multiplies, though there's some support for them now.
>> There are some hardware optimizations for that, yeah.
>> There's some hardware, and people have talked about building ASICs that would be really good at unstructured sparsity, but I don't think I've seen one that works super well. If someone did make something that worked really well for models pruned in an unstructured way, that could be effective. Structured pruning, where you just remove a unit, a whole neuron, is really easy to make faster on a GPU, but it just doesn't work nearly as well. So I think there's still potential here; I just don't think it's the panacea that I, and I think many others, had hoped. That said, one thing that's cool about using better data to train smaller models is that it's complementary with any other approach to optimizing inference. Pruning and quantization obviously still have a lot of role to play in helping inference go faster, and that would stack on top of anything we're doing, which I think is kind of cool.
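For readers who haven't run into the unstructured-versus-structured distinction, here is a small generic numpy illustration (magnitude pruning on a single weight matrix, not any specific production recipe): unstructured pruning zeroes individual weights but leaves the matrix shape, and therefore the dense matmul cost, unchanged, while structured pruning actually shrinks the matrix.

```python
# Unstructured vs structured pruning on one dense layer's weights.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))          # one dense layer's weight matrix

# Unstructured: zero out the 90% smallest-magnitude weights. Great compression
# on paper, but W keeps its shape, so a dense GPU matmul does the same work
# unless you pay the overhead of a sparse format and sparse kernels.
threshold = np.quantile(np.abs(W), 0.90)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)
print("unstructured sparsity:", 1 - np.count_nonzero(W_unstructured) / W.size)

# Structured: drop whole output neurons (rows) with the smallest L2 norm.
# The matrix literally shrinks, so every dense kernel gets faster for free,
# but accuracy usually degrades sooner at the same compression ratio.
row_norms = np.linalg.norm(W, axis=1)
keep = np.argsort(row_norms)[int(0.5 * W.shape[0]):]   # keep the top 50% of rows
W_structured = W[np.sort(keep)]
print("structured shape:", W.shape, "->", W_structured.shape)
```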
>> Yeah. Also, I think a kind of grand-challenge, golden question that would be very valuable for you, and just in general, is this idea of: what is the smallest possible model for a given capability? Do you have any insights on that? I did a podcast with Jack Morris, who's out of Cornell, and I think there's some information limit; I think his answer was something like eight bits per parameter, I forget what the conclusion was.
>> Yeah, I'm not sure I would put out a specific number, but I would definitely say far, far smaller than what our current models are trained to be.
We are nowhere close to this. I am generally of the belief that most of the models the vast majority of people will be using in, say, three years will be single-digit-billion parameters or smaller. We've seen this very clearly: look at the Llama series, and exclude Llama 4 if you want, but from Llama 1 through 3 you can see pretty clearly that the 7B variant from the N+1 generation is pretty close to the 70B variant from the prior generation. Maybe not quite there, but there's a very clear trend. We're seeing this with the Qwen models too: some of these small Qwen models are just incredibly performant relative to what state of the art was a year ago. I think it's pretty clear that these models are way too big. I personally would bet against the next frontier being trillion-parameter models; rather, we're going to really optimize inference cost. I also think test-time compute as a paradigm really pushes you towards smaller models, because if your cost of solving a problem is the cost of inference times the number of thinking steps, and you have to do a lot of thinking steps, then minimizing the cost of inference becomes really important. Anything we can do to make the model that's doing each step of thinking a lot faster enables test-time compute to be a lot more effective.
>> Yeah, I think there's another version of this, which is the Andrej Karpathy cognitive-core concept of a model that doesn't know anything but can use tools a lot to figure things out. Another information-theoretic limit that would be very helpful to figure out is: what is the minimal viable model for that? Something like zero on GPQA, 100 on BrowseComp.
>> I really like that idea, and I think it's very possible, because storing knowledge takes a lot of capacity, a lot of parameters, and you don't need it. Actually, one of the first papers I ever wrote was about showing that when you train models on randomized labels, which was a common test at the time, you could prove a model was memorizing: you randomize all the labels so there's no true association left, and the only way to fit them is to memorize. There was an ICLR best paper from 2017 showing that models could memorize all of ImageNet, and people were really surprised. Now this seems obvious, because of course models can memorize the whole internet, but at the time it was like, wait, they can just memorize a million labels? That's wild. And what we found was that if you deleted units from a model that memorized, it was really damaging, but a model that had actually learned a generalizing solution was pretty robust to deleting a lot of units. So it's a very clear demonstration of exactly this concept: the more you memorize, the more capacity you're using.
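A rough, hedged re-creation of that ablation experiment in miniature (a small sklearn stand-in, not the original ImageNet-scale study): train the same network on real labels and on shuffled labels, then zero out random hidden units and see how quickly accuracy on the training set falls apart.

```python
# Toy version of the memorization-vs-generalization ablation described above.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
y_random = rng.permutation(y)                      # destroy the true input-label link

def train(labels):
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000, random_state=0)
    clf.fit(X, labels)
    return clf

def ablated_train_accuracy(clf, labels, frac):
    """Zero out a random fraction of the 256 hidden units, re-measure train accuracy."""
    kill = rng.choice(256, size=int(frac * 256), replace=False)
    saved = clf.coefs_[1][kill].copy()
    clf.coefs_[1][kill] = 0.0                      # remove those units' contribution
    acc = clf.score(X, labels)
    clf.coefs_[1][kill] = saved                    # restore for the next measurement
    return acc

for name, labels in [("real labels", y), ("random labels", y_random)]:
    clf = train(labels)
    print(name, [round(ablated_train_accuracy(clf, labels, f), 2) for f in (0.0, 0.25, 0.5)])
# Expectation (hedged): the random-label run degrades much faster as units are deleted.
```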
>> Dropout regularization.
>> There are a lot of dualities to dropout there, and I think there's an argument to be made that dropout helps prevent memorization and helps models learn more generalizable solutions, and that's part of why it worked well. But yeah, I think it's very possible to do this, and I think we're wasting a ton of capacity in these models on knowledge that is just totally unnecessary for them to have.
>> Before we wrap, because we started with the RC models and then never talked about them: I think the most interesting thing to me was that they started with 23 trillion tokens of data and you helped them get down to 6.6 trillion. This is a 4.5B model that's on par with Gemma 4B and a little worse than Qwen 3, but roughly the same. Any learnings there, experiences, things other model builders should adopt?
>> Yeah, so for that one we started with a combination of DCLM, Nemotron, and FineWeb, basically just concatenated together, about 25 trillion tokens combined, and produced roughly 7 trillion out of that. What was exciting to us was, in general, seeing the speed at which the model learned. It was beating Gemma pretty consistently before the 1-trillion-token mark, which was pretty cool to see, and I think really highlighted how higher-quality data can get you much better performance much more quickly. In terms of general takeaways: it was exciting for us because RC is the first customer we're talking about publicly since starting the company, so obviously that was an exciting moment. But more generally, it's a good showcase of the fact that combining all of these different techniques can give you a really big gain. That's something we've been saying, but it's nice to have a real demonstration of it. This is not a case where synthetic data alone took us there, or filtering alone took us there; it was really about how you combine all of these techniques. One of the things we've consistently found is that when you take these different techniques and try to make them work together, they don't by default; you can make them work together, but it's quite hard to do. So what was exciting for us there was showing that it's possible. Combined with that, people tend to think you can't stack curation. The fact that we started with some of the best curated open data sets and were able to make them dramatically better is a pretty good indication that there's still a ton of headroom left here. We didn't need to go to Common Crawl to get those tokens. We are of course doing work on that, and we think there's a lot we can do to improve there, but just starting from that corpus, we're now making bigger data sets from it; I think we can get up to 15 trillion tokens just starting from that corpus and still have pretty much identical quality, which is pretty neat. And it really stacks: one of the other things we consistently find is that if we apply our curation on top of, say, DCLM and also on top of FineWeb, the gap between DCLM and FineWeb is maintained in the gap between Datology-curated DCLM and Datology-curated FineWeb. They both get a lot better, but Datology-curated DCLM is still better than Datology-curated FineWeb. So there really is a lot we can do here, and the biggest thing I would say is that there's so much still left to do; we're just scratching the surface. We're pretty excited about what these results showed, and we already have better data sets than what RC trained on, because that model was largely trained in May, so we're pretty excited about all the next trainings that will go even bigger.
>> I have a couple more lightning questions. Based on your customer conversations, what data does everyone want? What data does everyone want but is really hard to get?
>> I mean, I think expert data is the pretty obvious one.
>> Domain expertise.
>> That said, I would also note that most people don't know what data they actually should be getting.
>> They just show up with whatever they have.
>> Yeah. Something we've found shockingly frequently is that we talk to folks who have been planning a really expensive training run, millions and millions of dollars. They've been thinking about the architecture they're going to use, they've been thinking about all this stuff, and then they reach out to us and say, "Hey, we realize we need a good data set and we're planning to kick off training in two weeks. Can you help us?" And a lot of it is, hey, you probably should have been thinking about your data set before all the other things; if anything, it's the most important thing. So honestly, the most surprising thing is maybe how often people don't even have a conception of what good data is. And oftentimes, what people think is good data often isn't, which goes to the DCLM point I think we mentioned earlier. It's very counterintuitive and really hard for humans to identify what's high quality and what's low quality.
>> This is a little bit of a recruiting question. What data efficiency question, if somebody had an answer to it, means they should join Datology immediately?
>> The first thing I would say is, if you are one of those people who keeps finding yourself staring at the data, who keeps going back into the data set, if you can tell me what your favorite and least favorite C4 example is, you belong here; you should come join us, and join a bunch of other nerds who love doing that exact same thing. In many ways, that's the single biggest predictor of whether someone is going to be really happy here: how much do you just look at the data in your own work? Because I think you'd be surprised by how many really talented researchers don't do it very often; they just view the data as a given. That's been pretty surprising across the board. That said, there are so many questions on the science side that I'm just super excited about. I mentioned the interactions between pre- and post-training; that's definitely one. One of the things we care a lot about is making our product and curation automatically adapt to novel data distributions. It has to be fully automated, and, we didn't talk about this too much, but one of our challenges is often that if we're working with an enterprise that has a lot of proprietary data, they obviously don't want to give that data to us. So we bring our curation to their data, but that means it has to adapt automatically; we have pretty limited access to go and look at that data. That's actually a really hairy and interesting out-of-distribution generalization problem. It's also really important because there's no golden curation. A curation is only optimal with respect to a given set of downstream use cases or tasks. So we need to be able to say: if the model needs to be able to do tasks X, Y, and Z, how should we use that information to adjust the curation we do, so that we're providing the data that's most relevant for solving those tasks? And that needs to happen automatically. We have a number of ways to do that for a number of our techniques, but it's a very broad and general question that we want to apply to every part of our pipeline, so that the way we do synthetic data differs based on the downstream use cases, and the way we do filtering and every other part changes based on that too. So that's another question we're just really excited about.
And fundamentally, anything about trying to answer the question of how you value data with respect to a target. When I think of Datology and our core competency: I think every company needs to have an unfair advantage, some core competency that they do better than anyone else, and for us at Datology, I want us to be, and I think we already are, the best in the world at valuing data with respect to a downstream use case. In many ways, I think that's the NP-complete problem of AI: if you can do that, you can kind of do anything. And that's the thing we're really focused on. Of course, curation is the very obvious, direct application of that core competency, but when we think about the vision for the company in the long term, it's about all the other ways we can operationalize that same core skill set, and I think there are tons of really interesting things you can do there. But that's the fundamental question we really want to answer, and there are tons of different entry points to that question.
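One deliberately simple way to picture "valuing data with respect to a target" is to score candidate documents against examples of the downstream tasks and turn those scores into sampling weights. Datology's actual methods aren't public, so the sketch below uses TF-IDF cosine similarity purely as the smallest possible stand-in for a learned relevance or quality scorer.

```python
# Toy downstream-aware data valuation: score candidates against target-task
# examples and convert scores into curation/sampling weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target_examples = [
    "Summarize this clinical trial report for a physician.",
    "Extract adverse events and dosages from a patient note.",
]
candidates = [
    "Phase III trial results: endpoints, dosage arms, and adverse events.",
    "Top ten pasta recipes for weeknight dinners.",
    "Pharmacokinetics of the compound across patient cohorts.",
]

vec = TfidfVectorizer().fit(target_examples + candidates)
scores = cosine_similarity(vec.transform(candidates),
                           vec.transform(target_examples)).max(axis=1)

# Turn scores into sampling weights for curation: upsample relevant docs,
# downsample (not necessarily drop) the rest.
weights = scores / scores.sum()
for doc, s, w in zip(candidates, scores, weights):
    print(f"score={s:.2f}  weight={w:.2f}  {doc[:60]}")
```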
But if that's a question that excites you, if you have been working on data somewhere else and you have felt the pain of being a second-class citizen or having the data team be dismissed, and you want to be in a place where literally the only reason the company exists is that data is all we care about, I mean, the name of the company, Datology, the science of data, that's why we're here, then you should absolutely talk to us.
>> Awesome. And just to wrap on some gossip, let's talk about Meta and superintelligence. In the notes, when you talk about science mode and whatnot: you raised a lot of money from very prominent people. You have Yann LeCun as one of your investors, Geoffrey Hinton, Jeff Dean. So when Ari says they have a science mode, believe him. Maybe since you have Yann as an investor this is more of a touchy question, but what do you make of the whole Meta superintelligence team? Yann was on LinkedIn saying, hey, at FAIR we're focused on the next generation of AI, not on this current generation, so my role is the same. But then maybe people might say, well, why didn't you do the current generation ten years ago? What do you make of the whole change, and whether or not you think this is an interesting direction for Meta, especially given the large platform and user base that they have?
>> Well, first, with respect to Yann specifically: Yann is an incredibly talented scientist, of course, but I think his preference has always been to do science rather than to run an organization. He ran FAIR organizationally for a year or two right at the very beginning, but pretty quickly he handed that off to other people. When I was there, it was Joelle Pineau and Antoine Bordes, and then Joelle for most of it, who were really running FAIR, and she was an incredible leader. I respect her deeply and couldn't have asked for a better advocate for science within FAIR.
>> When she left, people were saying this is the end of FAIR.
>> I hope that's not true, but I also had that concern. I think Yann always really wanted to just do the science himself. For most of the time I was at FAIR, he operated with his own group of a couple of postdocs and visiting scientists, and he'd have a couple of students through NYU, and he would do his own research there. So I don't think he was ever, or at least not since the beginning, in a role where he was defining AI strategy for Meta. I don't think that's the role he wanted at any point; I think he really wanted to be doing the research. So I don't think his role is actually changing very significantly, in the sense that he wasn't doing that previously and I don't think it was what he wanted to do. One thing that's pretty cool about it, obviously, is that it showcases the importance of data: Meta is willing to spend quite this much on Scale, or the not-quite-acquisition of Scale that we're seeing today.
>> Alex Wang is not going to underrate data, let's put it that way.
>> Yes, he's certainly not going to underrate the importance of data. And I do think this is an area where the stuff we've done is quite different from what we've seen from the data annotators, which have been more focused on collecting the data versus actually optimizing and curating it. I think there's quite a bit you can do on top of those things, so it definitely draws some attention to that. I will also just say, generally, when Zuck makes a very big bet, it has not proven wise to bet against him; historically that's been the case, and most of the big bets have panned out. The one that's still really up in the air is the metaverse, but I would actually argue that's going to end up paying off in the long run. I think the Ray-Ban glasses are pretty darn cool, and a lot of the foundations of what was in Reality Labs will go into those. Also, FAIR was part of Reality Labs for about a year and a half after one reorg; initially FAIR wasn't, and then it got reorged into Reality Labs. I think when I left, FAIR was officially part of Reality Labs, if I recall correctly, and there was at least a one-and-a-half to two-year period where that was the case. So some of the AI investment that laid the foundations actually came out of that metaverse investment in the first place. That said, we talk about data as being a compute multiplier all the time; talent obviously is a compute multiplier too, and given the amounts they're spending on compute, I think you can make a good argument as to why spending a crazy amount on talent is also worth it. So I'm excited to see what they do, and I hope they put a lot of focus on data.
>> And become customers.
>> Yes.
>> Awesome. Well, thank you so much for chatting and coming by, and for insisting on doing this in person, because you're actually very charismatic in person. So I'm glad you did this.
>> Oh, thank you very much. Thanks for having me, and it was a joy to get to chat in real life.
>> Awesome. Cool.
[Music]