
The Story of Mech Interp

By Neel Nanda

Summary

Key takeaways

- **Ambitious Reverse Engineering Doomed**: The strong dream of fully reverse engineering neural networks to human understanding seems pretty doomed, but we've made useful partial progress with pragmatic tools that achieve goals we care about. [28:49], [38:59]
- **Linearity Surprisingly Works**: Most interpretability findings show stuff is really linear, which did not have to be true but consistently holds, enabling steering and probes that were not obvious a priori. [02:38:57], [09:05:27]
- **Feature Viz Reveals Coherent Neurons**: Optimizing images to maximize neuron activation in InceptionV1 produced coherent psychedelic patterns like dog noses that matched dataset examples, validating that neurons represent interpretable features. [16:25:36], [17:12:36]
- **Whale-Baseball Adversarial Example**: Feature visualizations revealed a neuron detecting both baseball stitching and great white shark teeth, allowing construction of a grey whale with a baseball stuck on that fools the model into classifying it as a great white shark. [24:44:25], [27:06:24]
- **Activation Patching Finds Circuits**: Activation patching on contrast pairs like Shaquille O'Neal vs. Megan Rapinoe reveals key components for factual recall, showing MLPs on final subject tokens store facts that attention heads then extract. [46:58], [52:37]
- **Superposition Compresses More Features**: Models represent more concepts than dimensions via non-orthogonal superposition, enabling performance gains through interference-corrected compression, as shown in toy models and real memorization examples. [01:18:21], [01:19:48]

Topics Covered

  • Full Video

Full Transcript

So maybe the right place to begin is just kind of chatting about what I view as some of the big picture themes that are going to come up again and again throughout everything else I'm saying. So I think probably the biggest thing I've changed my mind about over time is

when I got into the field, I was very excited about this vision of ambitious reverse engineering: this idea that we could take these weird, uninterpretable neural networks and reverse engineer them to something human understandable.

I now basically think that the strong versions of this dream seem pretty doomed, but that we have made way more

useful partial progress than I expected.

And I think that there's just like a bunch of stuff we've learned how to do that is just already useful, even though it doesn't really come close to this initial dream of fully understanding

things. And I'm now a lot more excited about work that is just leaning into the pragmatic vision of: how do we achieve goals that we care about using the kinds of tools that trying to understand these systems has given us? How far can we push this understanding, while accepting that I'm pretty pessimistic on getting all the way to fully understanding things? I don't really think this works, for reasons I'll get into.

And I think maybe another kind of crucial theme is this idea of: where does your information come from? Like, how do you tell if what you're doing is [ __ ] or not? I think that a pretty critical question that every interpretability researcher should regularly ask themselves is: how do I know that what I'm doing is not [ __ ]? How do I know that what I am doing is actually telling me something interesting about real world phenomena? And what empirical observations would be different if I was right or wrong?

And maybe a final big picture theme is the idea of the linear representation hypothesis, where, in my opinion, a consistent theme behind most interpretability findings has been: stuff is just really linear. This did not have to be true, but it seems pretty true.

This could have broken in many ways and largely does not seem to have broken.

And I think that just kind of observing a bunch of the evidence that has accumulated over time that most things are linear can be instructive.

Um, all right. Any questions on these three big picture themes? I think if you don't feel clear on exactly what I mean by each of these three, now is the time to ask. All right.

>> Wait, what do you mean by sources of truth and downstream tasks?

>> Um... >> Like, as distinguished from pragmatic understanding? >> By sources of truth I mean

um how do you ground your research? How

do you tell the difference between a true hypothesis and a false hypothesis?

Um, I didn't actually explain downstream tasks. What I mean by downstream tasks... uh, maybe I should make this into a sub point. Um, there are two key ways historically people have tried to argue that their

interpretability research worked. The

first way was what I think of as: how accurate is your interpretable approximation? You try to reconstruct part of the model in an interpretable way, like reconstructing it with a sparse autoencoder, or you try to make predictions about the model's behavior using the algorithm you think is happening, and you ask: is this legit? What's my error? Um,

downstream tasks is a conceptually different way of grounding your work where you say, what are the things I can do with this interpretability technique

that I could not previously do? One of

my favorite examples of this is the auditing games paper from Anthropic and Sam Marks, where they gave a model a hidden goal by training it on fake documents and doing a bunch of other really galaxy-brain stuff. And then they had researchers compete to see if they could figure out that hidden goal. Um, and the researchers were blinded; they didn't know what they were looking for. And this is a completely different way of figuring out if SAEs are interesting. This is saying: can people find things with SAEs better than they can find them without?

Um, and this is a very different thing from the idea of applied interpretability. Um, yeah, that probably deserves to be another point.

So another question slash theme is how much what you're trying to do is

just pursuing understanding for understanding's sake, as far as you can go, versus

um trying to just do a useful task and if interpretability is the right way to do that task great. If interpretability

is not the right way to do that task, that's also fine, but you're going to do something else um which I think of as applied interpretability

and kind of the middle ground which I think of as using interpretability to understand the broad strokes and

qualitative properties of a model, which you could either be doing because you are pursuing this for understanding's sake, basic science, or because you think this

is useful. Um, but you're not trying to go super deep into the model. Um, and yeah, I'd probably just make this a separate axis: reverse engineering versus trying to figure out the high-level properties and the broad strokes, without stressing about whether you've gotten every single last detail.

And, um, a confusing point that I think is important is that there is a big difference between "I want to look at a real world task in order to figure out if my interpretability technique is doing something real or not, but I don't actually give a [ __ ] about this task" and "I'm doing a project about solving this task, and I'll only do interpretability if it is the best way to solve this task." The first one, downstream tasks as a source of truth and grounding, is just a way you can do an interpretability project. And the second one is more like applied interpretability, or possibly even just machine learning, where interpretability is one of many tools you're considering using. And I think that it is extremely compatible to do basic science with the downstream tasks framing.

Uh how much sense am I making to people?

Great. Um Ara

um, my question was about the linear representation hypothesis. So, it's not super clear to me why this is surprising, in the sense that, like, if it wasn't linear, what would that look like? Like, if we're trying to steer things, it kind of wouldn't steer very clearly. Like, what would that other world look like?

if we're trying to steer things, it kind of wouldn't steer very clearly. Like,

what would that other world look like?

>> I mean, why does steering work? This is not a priori obvious.

Uh, well, I mean, I wasn't massively surprised when I saw the first steering results, because I already had linear-representation-shaped intuitions. But I think something that is probably harder to appreciate is just how much we used to not know about what happened inside these models, like, back 5 or 10 years ago. I think we had some super preliminary results. I think people had looked at things like linear probing and maybe had some results that maybe worked. But the prevailing wisdom, as I understand it, was: man, people spent decades trying to

understand this [ __ ] And that led to the first and possibly second AI winter.

This is dumb. Uh just yolo. Um

make a benchmark. Hill climb it. You're

never going to understand it. This is

fine. Uh, I'm not sure this is a faithful description that would pass people's ideological Turing tests. Um,

And I think that it would have been extremely not obvious to someone of that worldview to think that, like, ah, we can just take the difference between activations on two prompts and then add it, and it will just work. Like, what? Why? This is such a specific way for the inscrutable pile of matrices to work.
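For concreteness, here is a minimal sketch of the kind of contrast-pair steering experiment being described. This is an illustrative reconstruction, not code from any particular paper; the model, layer index, prompts, and steering coefficient are all arbitrary choices.

```python
# Illustrative sketch of steering with a contrast pair (assumed setup: GPT-2
# via Hugging Face transformers; layer, prompts, and coefficient are arbitrary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which residual stream to read/steer; a hyperparameter

def resid(prompt):
    """Residual stream at LAYER, final token position."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]  # shape [d_model]

# The steering vector: difference of activations on two contrasting prompts.
steer = resid("I love this movie") - resid("I hate this movie")

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual stream.
    # Adding `steer` broadcasts over all token positions.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I thought the restaurant was", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```

The surprising empirical fact is that this often shifts generations toward the "love" direction coherently, rather than just breaking the model.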

Um, like I struggle to conceive of interesting ways things could work and

be interpretable in a way that isn't linear. Uh, you could imagine, say, linear probes don't work but nonlinear probes do work, and we'd never find a way for linear probes to work, but it still seems like there's some real information inside these systems. That would be perfectly plausible a priori to me. Um

and like various other things roughly within that vein, but like um yeah, we could just have no [ __ ] clue

what's happening. I don't know how satisfying an answer that is. Uh, but I think it should not be taken for granted that neural networks are weirdly

understandable and weirdly nice. And

this was not obvious to me like before I got into the field that any of this would work.

Um, all right. Any other questions?

Great. So,

yeah. Uh, I guess I've kind of already been talking a bit about like, uh, let's call them, uh, the dark days. Um

where, to be clear, I think that there were interesting hints of what was happening from people who would not have remotely called themselves mech interp. Like, people discovered things called Gabor filters in the early layers of convolutional neural networks. Um, yeah, let me just switch window.

Um, yeah, so people observed, I can't even remember when, that if you look at early layers of a convolutional neural network, they often seem fairly interpretable. And, if you're not aware, some of the earliest kinds of real deep learning people got excited about were image classification networks, like LeNet and AlexNet, that were convolutional neural networks. The way convolutional neural networks work is you have a really simple linear map, like a 3x3 or 5x5 matrix, and you apply it to each 3x3 grid in the image. And you could see things that were identifying edges or corners, where you'd have plus one on the top and minus one on the bottom, which is a reasonable way of detecting an edge.
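In code, that kind of hand-written edge detector is just a small convolution kernel. A toy illustration (the kernel values are the generic +1/-1 pattern described, not weights from any actual network):

```python
# Toy horizontal edge detector: +1 weights on top, -1 on bottom, so the output
# is large where brightness changes sharply in the vertical direction.
import torch
import torch.nn.functional as F

kernel = torch.tensor([[ 1.,  1.,  1.],
                       [ 0.,  0.,  0.],
                       [-1., -1., -1.]]).view(1, 1, 3, 3)

image = torch.zeros(1, 1, 6, 6)
image[:, :, :3, :] = 1.0          # bright top half, dark bottom half

edges = F.conv2d(image, kernel, padding=1)
print(edges[0, 0])                # strongest response along the middle edge
```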

Um but beyond this, it was kind of less clear and like somewhat messier.

Um, I think that I'm not super familiar with the exact details of early interpretability work. Um, but there was this pretty famous paper, note the, like, 25,000 citations, on visualizing and understanding convolutional neural networks, that has some pretty pictures in it somewhere. Um, yeah, I think this might have been one of the first papers

that talks about saliency maps.

Um, I honestly don't remember the exact technique used in this paper, but I think they did things like look at maximum activating dataset examples for neurons, and were like: this seems pretty legit. Look, these cells have patterns, they are not random. And they did something to create visualizations.

and yeah, so maybe skipping ahead a bit. Um,

I think there were some big hints things were going on, but many people in machine learning thought that interpretability was a [ __ ] dead-end pseudoscience. Uh, in fairness, much of the interpretability research was in fact a [ __ ] dead-end pseudoscience. Uh, so, reasonable. Um, but

they were also uh somewhat pessimistic on just the idea

that you could do any of it. Um, and in my opinion, Chris Olah did a lot to push forward this idea of: we can just dig deep and understand these things, and the neurons seem understandable, the weights seem understandable. Um, again, at the time people were very into image networks and convolutional neural networks, and there was this famous paper

from 2017, Feature Visualization. The key idea was: take a neuron in this network (I think they were studying InceptionV1, this 2014 image classification model from Google that Chris's team spent several years going extremely deep on) and basically use gradient descent to optimize images to make the neuron light up a bunch, doing a bunch of tricks to make it not look like a [ __ ] mess, and saying: look, this seems coherent, and this is in fact the thing that the neuron seems to be trying to look for. You get these weird psychedelic things that seem like dog noses, maybe, and I think they also lit up on pictures of dog noses. Or this, which I assume is some kind of texture or some kind of clothing.
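The core optimization loop is roughly the following. This is a bare-bones illustration, not the paper's implementation: the real work used many extra tricks (transformation robustness, frequency-space parameterization) to avoid adversarial noise, and the layer and channel picked here are placeholders.

```python
# Bare-bones feature visualization sketch (layer/channel are placeholders,
# and the regularization tricks from the real paper are omitted).
import torch
from torchvision.models import googlenet  # torchvision's InceptionV1

model = googlenet(weights="DEFAULT").eval()
acts = {}
model.inception4a.register_forward_hook(lambda m, i, o: acts.update(out=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
CHANNEL = 42  # placeholder choice of channel

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, CHANNEL].mean()  # maximize activation, i.e. gradient ascent
    loss.backward()
    opt.step()
# `img` is now a (crude) feature visualization for that channel.
```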

Uh, fun fact about ImageNet, the dataset that many of these early models trained on: it has a ridiculous number of breeds of dog in it. So these models are really good at figuring out breed of dog, and have a lot of dog-related features. Um, and apparently Chris's team learned a great deal about the subtle details of dogs that help you identify their breed. Um,

and I think that this was one of the first papers which was introducing a tool that

you could use to try to understand what a neuron meant. Um, they also observed some signs of polysemanticity... or is it in this paper? Um, I think there was something in this paper about polysemanticity.

Hmm, no, not this one. All right, there was some paper around this time that observed polysemanticity for the first time. I am not going to claim any confidence in which paper that was. Um, but kind of starting

to make a big deal out of sometimes neurons seem great. Sometimes neurons

seem like a mess. We're not really sure why.

um in later layers of this model, more of them seem like a mess.

And yeah, I think that to me a lot of what is exciting about this paper and the kind of line of research it

represents, highlighted here, is: I think that there was evidence that you could break these models down, that not only were there meaningful directions, but that these directions were just literally the neurons, the standard basis of the activation space. And, a priori, it is not obvious that, even if you think there's something linearly represented, it should be aligned with the standard basis. I

think people kind of noticed arguments like, ah well, activation functions are aligned with the standard basis, which means it's more reasonable for neurons to be meaningful than arbitrary directions, so maybe this isn't super surprising.

Um, and I think part of why this matters is, well, first off, it makes your life dramatically easier, because you can just iterate over the neurons in a for loop and do feature visualization, rather than trying to figure out the directions. But

also if you already have a prior that the neuron means something and then you look at the data set examples and they have

a coherent pattern, and the feature visualization also aligns with that pattern, then you're like: well, I'm pretty sure that explanation is good, I guess. While if what you're doing is more like, here's a random direction, does it mean anything? then you should put a bunch of probability on it not meaning anything, and thus the bar of evidence required should be higher. Though I think it took a while before people could reasonably claim that there was enough evidence to have a prior that neurons would often mean things.

Um, all right. That was a lot; hopefully I gave some vibes. I'm going to stop here for questions. Uh, L?

>> Um, yeah. I guess there seems like a big difference between these models where the neurons mean something and, as I'm understanding it, the sparse autoencoder work, where the neurons don't mean things but there's still structure.

>> Okay, >> great question.

Um, that comes later in the talk. Um, anyone else? Cool. All right. So

um, for context, this is a website called Distill, which is like an online journal set up by Chris Olah and some other people who wanted academia to have better standards, and also to not use [ __ ] PDFs and actually let you embed interactive things. Uh, it was a beautiful and noble effort, and it didn't really work, and everyone involved got really burned out, which was very sad. Um, but there is still a bunch of great papers there, including a lot of the early mech interp stuff. Um, one fun paper is the

Building Blocks of Interpretability, which kind of argues the case that (a) there's a bunch of things you could try doing to interpret these models, here are various tools and things you could use, and also that it's really important to have good interfaces and tooling. And, um, remember, this was in the days before vibe coding, when making interfaces was hard. Um, and I think that

hot. Um and I think that um an interesting accident of history is like a bunch of the researchers involved at the time were like pretty good at web

development and liked making a bunch of interfaces. Uh, let's see. Is this the one about polysemanticity? Oh, um, they only mention it once, but they do mention it. Um, though it's a sufficiently obvious observation that it's probably mentioned in other papers. Um,

and another fun one is the activation atlas paper. This really goes all in on the idea that interactive interfaces are important. This is some kind of weird visualization of all of the neurons in this image classifying model in different layers... or, I don't remember exactly what this is visualizing. I think it's neurons, or maybe just taking different points in activation space after some dimensionality reduction,

um and looking at what the model classified them as and doing a feature visualization on them and being like

look, you've got acorny things over here, um, that kind of transition to things that look more plastic-bag-like or fabric-like, or this is turning into what looks like an arm or maybe a leg. Um, anyway, very fun. You might enjoy clicking around with it. Um, but one thing I want to

highlight here is this particularly fun result where they made an interpretable adversarial example: they had a picture of a grey whale, and they stuck a baseball on it, and now the model thinks it's a great white shark.

And, uh, anyone want to guess how they made this? Feel free to just call out and give a guess on how they thought of this, or, making it easier, why it works.

>> Is there a neuron that perhaps corresponds to both, um, baseballs and great white sharks?

>> Um kind of uh warm but not quite. Anyone

else? >> Guess I'm wondering, like, it's the color of the baseball: it's a great white shark and they put a white object in there. So maybe this is pushing in that direction.

Cool.

>> Yeah, something similar. I guess the shark plus white is pushed in the direction of great white shark. >> Cool. So

I believe that what happened is they were doing something like... I can't remember what the exact story was, it was something like they were looking at things that made the model think something was more great-white-shark-y or whale-y, and they observed this kind of teeth-like thing that also seems to look pretty baseball-like coming out of feature

visualizations.

And um I think that um there's basically some

neuron that seems to both detect the stitching on a baseball, red stuff, and a great white shark's teeth, which you know, kind of also like red zigzags on

white. Um, I don't know what else they were sensitive to.

Um, and so they are like, "Ah, well, what's a great white shark?" Ah, it seems like

it's like a kind of whale-like thing plus teeth. Teeth? That's basically a baseball. What if we glue them together?

And um in addition to being hilarious, um I'm highlighting this because I think it

was one of the great early examples of grounding with downstream tasks. The downstream task here is: make an adversarial example that is clearly not a great white shark, but that the model thinks is a great white shark. And is this an interesting task? Meh. But I think it is not trivial, and the fact that they were able to make this kind of prediction, construct the image, and have it work was pretty good validation that the tooling wasn't fooling them. End historical detail.

So um the next um oh yeah any questions on that? Great.

All right. So the next interesting um cluster of things is going beyond studying neurons and activations to studying weights and circuits and

algorithms. Um, and again, the public results up to this time, as I interpret them, were kind of like: yeah, it seems like there's some understandable stuff in the model's activations, like neurons often kind of have this checkable meaning. It's not necessarily perfectly rigorous or perfectly clear, but it sure seems like something's happening here.

Um, and so what Chris's team did is they basically just have this string of tiny blog posts exploring different parts of this hypothesis: that the model's weights just represent interpretable algorithms, and if you understand what all the neurons mean, you can just read off how each set of neurons connects to the next. And

um, I don't know, they had this position piece post making a bunch of claims, like: features are the fundamental units, they seem pretty linear, probably. Oh, they even talk about the idea that maybe a feature is a direction, a combination of neurons, rather than a specific neuron. Um, this idea that features are connected by weights and that we can understand them, which is the one I kind of want to highlight here, because I think it was the interesting one. And then this claim of universality, which was nice but in my opinion somewhat less central, though it would have been an extremely big deal to me if every neural network was interpretable but you had to do a bunch of extra work for each one. Um, and I think the most interesting

examples of work here um one is high frequency detectors. These are a type of neuron they found that basically

detect kind of sharp bits next to blurry bits in images which you might expect from like there's like a figure in the foreground and a kind of blurry background that wasn't in focus or

something. And I believe some neuroscientists have since found evidence of these in the brain, uh, which is kind of funky. And

um, circuits-wise, I think that the curve circuits work was excellent

and still a good contender for one of the most rigorous circuit analyses in a real model that I know of where they basically

identified um a family of neurons early in the model and said, "Hm, these sure seem to be representing curves, and the curves seem to be getting more complicated."

Like, we start with like lines of different angles, and they seem to get combined into more lines, and then maybe they're curving a little bit, and now they're just curves. Um, and

convolutional networks kind of draw information from the local area into a neuron, so this kind of slowly becomes more and more abstract. Checks

out. And they had like a bunch of evidence um that they had in fact understood the algorithm. The one that I found most persuasive was one of them

just wrote a Python program to regenerate the weights based on their human understanding, without referencing the actual weights, and then substituted that in as an artificial neural network, and found that it seemed to recover about half the performance that was lost when you deleted the curve detecting neurons. And to me the interesting historical

context here is I think that this was one of the like best examples that maybe we could just look at the model's weights and

understand it, and there was just loads of structure here. This would be a pain, and there were some things they needed to figure out, like polysemanticity, but the dream of ambitiously reverse engineering the thing felt pretty alive to me at the time. Um, all right, any questions on this? Uh, also, can I get people to put in the chat whether you'd like me to go faster, slower, or about the same speed, since I have no idea how much of this was just obvious review to people or incomprehensible gibberish. Cool. All right. Um, no questions.

Great. All right. So,

now, so yeah, this came out in like 2020. Um, at this point I think it was pretty obvious that language models were the [ __ ], and, I don't know, GPT-3 came out that year and it started to seem much more real that the path to AGI would just in fact be: stack lots of data and lots of computing power and hope for the best.

Um, and so people became way more into language models, and Anthropic got founded at the time, and Chris Olah, who'd been leading much of this research, co-founded that and started an interpretability team there, and also hired me, which was very nice of him. And

I think at the time, I remember that I had initially just been very skeptical of the entire institution of interpretability, and had thought: yeah, it seems like the prevailing wisdom is right that we can't understand these things, and it's all a mess. And then I saw the image circuits work and I was like, oh, this seems pretty cool, but, I don't know, it seems like images are really nice. It's continuous. You can do things like feature visualization. And it seems like there's maybe less stuff than in language. So my intuition was, I don't know, it's pretty not obvious to me that we can extend this to language models. Um, but it also sounded like a cool vision and really interesting to try to help with, and I joined after most of the work on A Mathematical Framework had been done.

Um, but I view this as one of the first papers trying to like really explore what it might look like to reverse engineer a transformer

and a bunch of fundamental ideas like the residual stream is the central object of the

model, not just like a weird thing. I

don't know, people used to draw neural networks where they'd draw each layer as a box, and the residual stream is like some arrows around the side. And this paper was the first place, at least, where I saw people treating the residual stream as the central thing, with layers as these things you add on that are incremental updates. And a bunch of ideas around how you think about attention heads, which were the seemingly weird, annoying new thing about Transformers, and in hindsight actually way nicer than MLPs in a bunch of ways.

And, um, I kind of went in being like, huh, is this really going to work on language models? And then things just seemed really nice. We found induction heads, and induction heads turned out to be such a big deal. They caused a bump in the loss curve.

Um, and we were kind of working with these tiny models, and when we tried to look at MLP layers, occasionally the neurons were interpretable, but the vibe was: this seems a fair bit less interpretable than image models were. This is kind of weird. Not really sure what we need to do here. This seems like a big problem, and, you know, the thing to figure out after attention. Um

Yeah. So, uh, a few months after this, I left Anthropic and worked on, um, grokking. Um, and I assume many of you are kind of familiar with the vibe of the paper, so I'm not going to go into too much detail, but stop me if not. But

grokking was this big mystery. It was super fashionable to work on at the time. It involved a two-layer model on an algorithmic task, and I was like, this is a pathetic model. This is tiny. Surely mech interp can figure out what the hell's going on here. And I was right. Um, and I found

this really pretty modular addition algorithm, and went in not really expecting to see this. Though I did discover after the fact that some parallel researchers had kind of come up with this algorithm a priori but not written it up yet. So it wasn't exactly deeply unexpected, but it's kind of weird; I didn't expect the model to be using a trick like this. And this was a thing we could look at during training, because we now had these nice metrics, and this gave us a bunch of insights, and this all just seemed very exciting to me. And it

seemed like, man, we kept having opportunities for ambitious reverse engineering to fail, and it had just been going pretty well. I think there were some warning signs accumulating; polysemanticity was a big one. I think

another big one was that we definitely had not completely reverse engineered any of these models. I think this was

like the closest, but there were still some weird unexplained mysteries. For

example, the way it worked is it was doing trig identities in some frequencies, and it didn't really matter what frequencies it picked. It seemed to be using five, but for some reason it was also using a frequency, like frequency 30, on the inputs that didn't appear on the outputs.

Why? Who the [ __ ] knows? Never figured

that one out. Um, maybe this was a random decision thing.

Sometimes during training, it would like start using some frequencies and then switch to others. Um,

and I feel like I understood enough of it to be able to say some pretty interesting things. And that kind of

interesting things. And that kind of seemed like the important thing at the time. But in hindsight, or, well, even at the time, it was like: yeah, I haven't completely understood it; maybe I could have if I'd tried harder. Maybe not. Unclear. And I

think with A Mathematical Framework, yeah, there were various things we didn't understand, and the things we were doing just really wouldn't scale. Like, we were doing things like saying, well, each attention head is kind of like a linear transformation you're applying to the residual stream. We're applying one per head, and we're also doing the identity from the residual stream.

But now, if you apply the second layer, you have every pair of heads, so you've gone from like 8 terms to 64. And if you do three layers, you go to 512.
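A quick check of the arithmetic (8 heads per layer, ignoring the identity/residual term for simplicity):

```python
# The blow-up in the number of head-composition paths: H heads per layer
# gives H^L distinct paths through L layers.
H = 8
for L in (1, 2, 3):
    print(L, H ** L)  # 8, 64, 512
```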

And if you look at a real model, which even at the time were massive, or at least people liked working with massive models. That's a separate weird thing, where model sizes have not gotten that much bigger since GPT-3; they're just trained on way more data. Um, and this is a confusing but separate question. Um, at the time we

were like, ah well, GPT-3 was about 175 billion parameters, and it was dense, because obviously all interesting models

are dense, and, yeah, surely it's going to just, you know, keep getting bigger.

That seems reasonable; we can extrapolate out, like, oh, it'll be a trillion in like a year, and then even more beyond this. And then maybe GPT-4.5 was that kind of scale. We don't really know, but also apparently they didn't really want to monetize GPT-4.5. So,

um, yeah, that was another anomaly, but I viewed ambitious reverse engineering as looking reasonably good at this point. Um, any questions on this before I go on? And in my opinion, ambitious reverse engineering has just gone extremely badly since, and we have made, in my opinion, very little progress on specifically the goal of pursuing an agenda that I think can realistically

understand everything important about a model and give us a fair amount of confidence that we're missing nothing.

Um, and so yeah, maybe walking through what happened next. Um, so call that the light days, and then the emerging strands of research. Between A Mathematical Framework and now, there were maybe two big popular waves of research: causal intervention based circuit finding, and dictionary learning and superposition.

I'll maybe just analyze these separately since I think you can kind of think of them as parallel things.

So, the thing that first got me excited about causal intervention based circuit finding was the ROME paper from David Bau and Kevin Meng, where they basically took activation patching, which had kind of already been studied a bit in a prior paper David Bau was involved with, but which I basically hadn't really seen elsewhere, and showed that you could use it on either a pair of prompts or

like a prompt and a prompt plus Gaussian noise to try to figure out which chunks of the model were important for some task. And they got this pretty exciting

task. And they got this pretty exciting and fairly refined vision of: ah well, you take a prompt like "Shaquille O'Neal", ending here, or "Megan Rapinoe", ending here, then "plays the sport of"; what do we patch to make it say the sport that is true for one of them? And, oh look, it seems to live on the final token of the subject, and then it kind of fades off with some sharpness, um,

and then moves to the end.

And I think that this was pretty cool. I think this was like some of the first work I saw that felt

like it was helping understand factual recall.

Um, it was kind of like token specific and it seemed like you could read quite a lot of information from this. They

also did some like layer specific stuff.

Uh, it's probably in the paper, but they got things like: ah, MLPs here seem important, and then attention layers here seem important.

And um in my opinion this is the interesting half of the paper. Uh there

was also another bit that was about fact editing, which they framed as the main contribution, but I think was less exciting and largely just

a form of specialized fine-tuning to kind of insert a fact into the model. Um

>> oh yeah I was just going to ask I I don't think I understand this plot or what exactly this paper is doing. So be

great if you could add a little more detail.

>> Sure. Do you understand what activation patching is? >> I don't think so.

>> Ah, great. All right. So, activation

patching, um, which people may have seen under the myriad other names it has in the literature, like causal mediation analysis, causal tracing, interchange interventions, resampling ablations, random [ __ ] All right. So the

idea is basically we think that a sparse set of model components are

important for some task. Here, the task is answering "Megan Rapinoe plays the sport of" with "soccer". Um, and so we want to do a causal intervention, where we pick a component or set of components, somehow change it, and then observe the effect on the output. And the reasoning goes: if it has a big effect, then it was important; otherwise, it was not. The naive version of this you might do is just replacing the model activation with zeros.

Um, can anyone see what goes wrong if you do this? If you're confident you know the answer, leave it for the others.

>> What was the question again? Sorry.

>> Um, what goes wrong if you try to understand say uh which MLP layer matters for this task

by... maybe let me be more specific. You want to understand the MLP layer that contains the fact that Megan Rapinoe plays soccer.

And a candidate method is that you just replace the output of a certain MLP layer on a certain token with zeros, and you look at how much that changes the final answer.

Does that make sense?

Great. What could go wrong? Wild guesses are encouraged. Um, all right. So, we

have in the chat OOD activation patterns, but like that's not really an explanation.

Uh, it's a factor that could lead to something, but like why would being out of distribution change the model's output if it wasn't important for the

task? Like, if an MLP layer doesn't contain the facts, why does throwing it off distribution matter? Um, all right. Uh, other guesses in the chat: um, redundancy is a valid answer, but not the key factor I'm highlighting here. LayerNorm is a [ __ ] pain, but you can kind of ignore it most of the time, including here. Um

And, yeah, just to clarify, I'm not saying zero out a specific neuron; I'm saying zero out all of the neurons in one layer, or zero out the output of the layer. So what are some things that could go wrong? Great. All right, Alex, you got one. Uh, removing other concepts that

might matter. For example, if you remove the model's ability to speak English, you're going to have a bad time. And if all you're observing is the probability of "soccer", you might see that go down a bunch because the model has no idea what's going on. And this does not mean

that you have found the facts. Um,

another thing that can go wrong is that the model was kind of relying on the activations having a certain mean or a certain norm or something, and by making the layer zeros, you've shifted this in a way that breaks assumptions that later layers were depending on. This kind of effect is why, as you steer with too large a coefficient, everything goes wrong.

And, um, let's see, another factor is you could have just found something that is important for the task for reasonable reasons, but is not where the fact is stored. Like, maybe you found the thing that figures out that you should say soccer rather than football in this context or something. Uh, actually, in this context it probably just says both about 50/50, but maybe more soccer because the internet is American. Um, so

the partial solution to this is called activation patching. The key idea is to set up a pair of prompts that are as close together as possible but differ in some key detail. Um, this is also known as a contrast pair, and I think they are one of the most important insights in interpretability: that you can do this, and it's often really helpful because most things will be similar and controlled between the two, but not everything. In particular, if you take the Shaquille O'Neal one and the Megan Rapinoe one, then you would hope that, well, it's got to speak English in both, it's got to know that it's doing factual recall, it's got to know what a sport is, but it will give a different sport. That doesn't rule out everything; maybe it's still the soccer/football thing.
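As a concrete sketch, here is roughly what activation patching on this contrast pair looks like in code. This is an illustrative reconstruction, not the ROME implementation: the model choice is arbitrary, and it assumes both prompts tokenize to the same length (real implementations align positions carefully).

```python
# Illustrative activation patching on a contrast pair (not the ROME code; model
# choice is arbitrary, and this assumes the prompts tokenize to equal length).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("Megan Rapinoe plays the sport of", return_tensors="pt").input_ids
corrupt = tok("Shaquille O'Neal plays the sport of", return_tensors="pt").input_ids
SOCCER = tok(" soccer").input_ids[0]

def patched_logit(layer, pos, cached):
    """Run the corrupted prompt with the clean activation spliced in at one site."""
    def hook(module, inputs, output):
        out = output[0].clone()
        out[0, pos] = cached              # overwrite one (layer, position) site
        return (out,) + output[1:]
    h = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logit = model(corrupt).logits[0, -1, SOCCER].item()
    h.remove()
    return logit

with torch.no_grad():
    clean_hs = model(clean, output_hidden_states=True).hidden_states

for layer in range(model.config.n_layer):
    for pos in range(min(clean.shape[1], corrupt.shape[1])):
        # hidden_states[layer + 1] is the residual stream after block `layer`.
        score = patched_logit(layer, pos, clean_hs[layer + 1][0, pos])
        # A high "soccer" logit means this (layer, position) carried the key info.
        print(f"layer {layer} pos {pos}: {score:.2f}")
```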

Um, and what happened was: oh, it looks like MLPs here are very important. Oh wait, this graph is not about MLPs. So

maybe I'll just go to the paper um since that's probably making things even more confusing than needed. So the

middle green graph is like: you do a replacement. Here they're doing an interval of 10 layers at once. So

like this actually means layer 10 to 20.

Um but that's not a super important detail.

It just makes the graphs look prettier and is a bit less precise.

Um, also, meta note: every time you see graphs like this in a paper, you should assume they have cherry-picked it to look good and that a randomly chosen example does not look as good, unless they explicitly say otherwise, or they average across a bunch of inputs or give error bars.

Um but you know anytime you see a figure go viral on Twitter assume it is not representative and cherry picked.

Um otherwise you will have incorrect thoughts.

And yeah so to me the exciting thing about this is you can kind of start to piece together um

some notion of what's going on here. um

where we can kind of see here that MLPs seem important. Um, can anyone tell me why it's interesting that MLPs light up on the last token of the subject, rather than, say, the final token of the prompt? Uh, maybe... no, go ahead.

>> Uh, so maybe the attention transfers information to the last subject token, and the MLPs do the job of extracting information about the subject.

>> But, like, how does it know whether we want a sport or a location? Like, it doesn't say "is in downtown...". This is really important information for answering the question. What's going on?

I'm not sure about that.

uh son.

>> So maybe the MLP is kind of a key-value store, and then it's kind of a key for a lot of things associated with the Space Needle, and one of them could be the location, but it could also be other things, and then later on the attention picks the one which is relevant. >> Yep.

Pretty much. So, what turns out to be happening here, what kind of seems to roughly be happening, is it just looks up everything it knows about an entity on the final token of the entity, or a token near the end (it's a bit annoying in general), and then attention heads kind of help extract this. And, uh, honestly, it is kind of cursed.
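To make the key-value picture from the question above concrete, here is a schematic toy version of that reading of an MLP layer (an illustration of the idea, not any real model's weights):

```python
# Schematic "key-value memory" view of an MLP layer: input weights act as keys
# matched against the residual stream; output weights are the values written back.
import torch

d_model, d_mlp = 16, 64
W_in = torch.randn(d_mlp, d_model)   # one "key" per hidden neuron
W_out = torch.randn(d_model, d_mlp)  # one "value" per hidden neuron

def mlp(x):
    match = torch.relu(W_in @ x)     # how strongly each key fires
    return W_out @ match             # weighted sum of the corresponding values

x = torch.randn(d_model)             # e.g. the residual stream on "Jordan"
print(mlp(x).shape)                  # a vector added back into the residual stream
```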

Um, I was involved in a few subsequent papers on this. Um, I had one MATS paper called Summing Up the Facts, where we kind of thought: ah, there will be an attention head which will look at the subject and extract particular attributes.

Uh, and what seemed to be happening was that there were actually three different kinds of attention heads, all on the final token. Um, some of which looked at the subject and extracted some specific things, but also just some general vibes. Um, some that looked at the relationship and kind of just picked up on some countries and

just boosted all countries. Um, there's, you know, maybe a reasonable way of doing things here: you add together a vector which maps to positive logits for all countries, and one that maps to positive logits for all things related to the Colosseum.

It's kind of fine. Um, maybe it's not amazing, but

um it should constructively interfere on Italy cuz that's the only thing in both.

Um, there also seemed to be heads that attend to both of them and extract attributes from both somehow, and kind of give roughly the right answer. I think there was also some other [ __ ] Um, and, oh yeah, there were also some MLP layers that seemed important, and I don't think we figured those out. Um,

and uh, I mostly give this as an example of how the dreams of ambitious reverse engineering hold up less well than one

might have hoped. Um, another example of follow-on work is this project called fact finding, that I did as part of my DeepMind work. Um, yeah, broadly, let's see, so

yeah so basically what we were trying to figure out here was kind of is the intuitive story that you kind of get

from here true? And also, can we reverse engineer what this database looks like, and how the model stores things? Um, and, I think superposition as a concept was fairly widely discussed at this point, like: ah, surely there should be some kind of neat mathematical structure. And we spent like 6 months there and it was a complete bust and we gave up. Um, and

um, specifically, I think we managed to understand a fair amount of what was going on. So we looked at the simplified case of athletes whose names are two tokens, and especially ones where we think you need both tokens to figure out what's happening, like Michael Jordan. Um, there were these token concatenation heads that moved Michael onto the Jordan token. And then there were some MLPs, specifically a band of early layers, that all seem to do some kind of lookup. And

then later MLPS that seemed causally relevant were actually just largely repeating this information that was already present.

uh, which you can test for with things like: you train a linear probe to predict the sport being played on the residual stream at different points, and it works okay by layer four, works pretty well by layer six, and doesn't get much better after. But figuring out what's actually happening in there just didn't really go anywhere.
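A minimal sketch of that probing check, with toy stand-in data (a real run would train on many athletes and evaluate on held-out ones rather than scoring on the training set):

```python
# Toy version of the layer-by-layer sport-probing check.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

athletes = ["Michael Jordan", "Megan Rapinoe", "Tom Brady"]  # toy stand-ins
sports   = ["basketball",     "soccer",        "football"]

def resids(name):
    """Final-token residual stream at every layer: [n_layers + 1, d_model]."""
    ids = tok(name, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return torch.stack([h[0, -1] for h in hs])

X = torch.stack([resids(a) for a in athletes])  # [n_athletes, n_layers+1, d_model]
for layer in range(X.shape[1]):
    probe = LogisticRegression(max_iter=1000).fit(X[:, layer].numpy(), sports)
    print(layer, probe.score(X[:, layer].numpy(), sports))  # accuracy per layer
```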

We had some hypotheses. They seemed really messy; it was pretty distributed across the layers. And I

largely concluded that there is no reason that a database implemented in superposition needs to have a particularly nice explanation.

It's not a very complicated function;

it's just mapping certain combinations of tokens to certain combinations of linear output vectors, and the MLPs want to find some way of conforming themselves to enable this mapping, but the exact details don't really need to be nice. And models clearly know facts about many, many more entities than they have neurons, and there are also some mathematical constructions you can come up with for how you can store way more than one fact per neuron without much error.
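A quick toy illustration of why this is possible at all: in high dimensions you can pack far more nearly-orthogonal directions than you have dimensions, so many stored items can coexist with only small interference.

```python
# 10x more random unit "feature" vectors than dimensions, with pairwise
# interference (dot products) that stays small.
import torch

d, n = 1000, 10000
feats = torch.randn(n, d)
feats /= feats.norm(dim=1, keepdim=True)

dots = feats[:500] @ feats[:500].T        # pairwise dot products (a subsample)
off_diag = dots - torch.eye(500)
print(off_diag.abs().max())               # typically ~0.15: small interference
```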

Um, but yeah, all right, thus ends the tangent about factual recall. Uh, it was less nice than I hoped it might be. Any

questions on that tangent? No. All

right. So,

Um, do people kind of get the idea of activation patching? Um,

great. So, yeah. And this seemed like a really cool technique. Um,

And the indirect object identification paper. Um, nope. Uh, there it is.

Um, they kind of went beyond just looking at things like which layers matter, and were like: well, we'll look at which attention heads matter on some task, and not only that, we'll look at which pairs of attention heads communicate with each other. And because of the linear structure of the residual stream,

You can make it so that one head thinks that an earlier head has changed, but everything else doesn't think that earlier head has changed.

And that seemed really cool. It seemed

like a really refined tool, and it seemed like they learned quite a bit about what was happening. Um, they found this algorithm, and it later turned out that they'd kind of understood the broad strokes but missed a bunch of details

and nuance.

Um and one particularly interesting thing

that kind of shifted is: this kind of approach is not weight focused at all. This is using causal interventions on the activations, and crucially, it is about understanding the model in some domain. In this case, a bunch of prompts with a very specific flavor, a specific grammatical task.

And, yeah, I think this in many ways makes life much easier, because we're trying to understand what a model does on some very narrow slice of the distribution of all possible language or text. And so this simplifies life a lot. And in particular, if some attention head is polysemantic, as a good amount of these ones are, it doesn't really matter, as long as its other uses aren't featuring here specifically.

But if you're dreaming of ambitious reverse engineering, this in some ways seems like maybe a bit of a waste of time, where, yeah, you can figure out what something does on some distribution, but if you don't know what

it does in general, what are you doing?

And, I don't know, there are a bunch of other fun things about this paper.

It's a great paper. Um, one of my favorites is, oh yeah, we kind of observed some weird things for the first time, I think: negative name mover heads. Name mover heads were the things that found the right answer; negative name movers were things that suppressed the right answer, like they kind of systematically figured out the correct answer and made

it go down. Uh, anyone want to give any hypothesis for why on earth models do this or like what was going on?

Preferably people who don't already know the answer. Cool. All right. No guesses. Um,

so... >> Oh, yeah. I guess maybe there's some situations where it puts way too much weight on something, and then it's like the steering vector kind of situation of: oh, we put too much weight on this thing and that will break the model, so I have to make sure there's not too much weight. >> Also, somebody had a comment with a guess.

>> Yeah, pretty reasonable. Um,

let's see. Oh, sorry, my chat has scrolled too far. Um, yeah, something like self-inflicted doubt. Yeah, these are pretty good guesses. Uh, so, um,

um, yeah, there was this delightful paper I supervised from Arthur Conmy and Callum McDougall, called copy

suppression where what we found was that these heads in general were annoying.

Um, people had also observed things like negative induction heads, that seem to do induction and then suppress the right answer. And it turns out that the model just has a general circuit that says: am I trying to say something that occurred earlier in the context? Um, e.g., with "All's fair in love and...", the model kind of wants to say "love" again for some reason.

I don't know; we never figured out why models like repeating themselves, maybe some kind of misplaced induction or copying. And then the head is like: wait, no, you're copying yourself, we should kill this and suppress it. And so it's not that this head in any way understands the IOI task; it's responding to the heads that have figured out the IOI task and saying: you're trying to copy a thing earlier in the context, why are you doing

And in general, this seems like um a pretty reasonable plan, but

uh it is in practice kind of messy. Um

and, um, yeah, I don't know why this would be a thing, but it does it a lot. And one of the themes of this paper was: it seemed like this

head just does this all the time. Is

this the only thing this head does? Have

we found a monosemantic head? To which

the answer is maybe. I'm not entirely sure. Uh, we tried to see how well we could approximate this head using this kind of algorithm of: am I going to say something that occurred earlier in context? And we got to like 77% of the impact across a distribution of a bunch of arbitrary web text. To

my knowledge, this might still be the best anyone's ever done in a language model. Kind of embarrassing, in my opinion. You know, I think people might have done better on neurons by now. I'm

not aware of anyone doing better on heads.

Um, and I would consider this another nail in the coffin for the ambitious reverse engineering idea, where this was simultaneously way nicer than it had any right to be, but also not nice enough for me to be comfortable claiming we have just fully reverse engineered the model and life is glorious.

Um yeah.

>> So, if I understood right, what the head does is basically try to get the model to not copy, because usually some other head has ended up copying that same "love" token. Are there a lot of examples like this? Do we have some intuition for how much of the model is trying to correct itself, or is dedicated to trying to correct itself in this way,

versus how much...? >> Complicated question. One of the reasons it's complicated is, like, say this head is trying to correct itself all of the time; these heads are trying to do name moving some of the time, and maybe they're trying to do other things some of the rest of the time. Um, I mean, it wouldn't be too difficult to look at the direct logit attribution

of like say different heads or layers of the model and look at which ones seem like negative on the correct answer and

and ask: if you edit the input to that layer to remove the bit that would cause the positive logit, like you subtract the unembed of that token until the logit's now zero, or kind of average, or I don't know,

Does that make those layers or heads less likely to do this? Um like I think you could do that in like a throwaway afternoon project especially if you work

with a smaller, simpler model like GPT-2.
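A rough sketch of the direct logit attribution part of that project could look like this (illustrative; it crudely ignores the final LayerNorm's rescaling, which a careful version would account for):

```python
# Rough per-layer direct logit attribution: project what each layer added to
# the residual stream onto the unembedding direction of one token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("All's fair in love and", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states

token_id = tok(" love").input_ids[0]
u = model.lm_head.weight[token_id]        # unembedding direction for " love"

for layer in range(1, len(hs)):
    delta = hs[layer][0, -1] - hs[layer - 1][0, -1]  # what this layer added
    print(layer, (delta @ u).item())      # its direct contribution to the logit
```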

um more modern models have accumulated a bunch of annoying things that can make that kind of investigation harder. Um

which is one of the many reasons fewer people do that kind of really in-depth, let's-do-some-maths-and-add-things-up, break-the-residual-stream-down-into-components style analysis. E.g., decomposing residual streams into the outputs of layers, things like this.

Anyway um uh another thing also relevant to your question is uh backup heads. Another

really [ __ ] cursed thing that I did not expect where it turns out that if

you ablate um the earlier heads even like replacing them by what happens on by like replacing them with their value on a different prompt where you say made

this token Mary. So now John was grammatically correct to go here and If you do this, then it turns out that

these heads will change their behavior in response to compensate.

And this is not redundancy. This is like actively

redundancy. This is like actively selfcorrecting.

Like this is redundancy. You have three heads doing the same thing. So if you delete one, the other two are basically doing as well. Um, this is more like you have

as well. Um, this is more like you have a head that's either not doing the pass or doing it only a tiny bit and then you delete a head in an earlier layer and

now it starts doing it a bunch.

We don't really know why this happens.

It's like a pretty robust phenomena. It

seems to happen all over the place. So,

like I don't know if anyone's checked in the last 2 years whether it still happens in modern models, but I'd be a bit surprised if it didn't. Notably,

this happens even in models that were not trained with techniques like dropouts. Um, also we don't know whether

dropouts. Um, also we don't know whether GVD2 small is trained with dropout or not. It's very annoying. Um, even asking

not. It's very annoying. Um, even asking people at OpenAI has not been particularly fruitful.

The paper implies they didn't. Their

best guess is that they did uh whatever. But people have tried this

uh whatever. But people have tried this on other models. Uh the don't use dropout and um and I think that backup behavior is

important for the causal interventions to understand things because it means the actually no one will tell me why the existence of backup heads is a pretty

big deal. If you want to say even if you

big deal. If you want to say even if you don't care about them, you just want to identify what are the names >> cuz if you go and you patch out certain things, it could keep working because of

some backup.

>> Yeah, pretty much. Uh, as far as I'm aware, it's normally a dampening effect rather than bringing it to zero. Like

you'll make something you'll be like minus one to the logger and actually it's only minus.4 or

something. So if you're really fast about precision, you should be sad. If

you just want to like huristically figure out the important chunks of the model, this is like maybe okay. Um

anyway, so I was very popular. Um couple

of hundred citations. There was this whole college industry of like follow works doing this on different tasks.

Um, I was probably partly responsible for this and encouraging people to do it. It was also just like not all that

it. It was also just like not all that hard to play around with. And in my opinion, this became way too much of a fad and people still really like doing it in like maths applications and stuff

and I don't think it's very interesting.

I think that the like core thing that is limited about this is I think that I think I actually went like unusually in depth. Most of the time people do this

depth. Most of the time people do this and they just conclude these components of the model and what matters for this task and then just stop.

It's like why do I care? I want to understand what is happening and why not just see

um yeah not just like um know that these specific layers matter and like it can be a useful tool in a broader investigation that gives

you a more refined sense of what's happening but yeah and I think the other problem is that layers are just like a pretty coarse

um thing to study. Um

and this means that ED even more matters. It's pretty hard to say that much more than matters, I guess. Uh

guess. Uh so yeah, maybe going to the final thread. I

think that um kind of somewhat in parallels to this people were getting pretty freaked out

about what do we do about this because we didn't really have a better plan than hope that the neurons are interpretable or

um just treat it as like one big chunk.

But this is clearly not a reasonable approach if you want to ambitiously reverse engineer the model.

Um so what happened? Um

what happened? Um there was the softmax linear units paper where we tried to introduce an

alternative activation function that would kind of encourage the model to have monossemantic neurons. This was

kind of predicated on the idea that poly semanticity was some kind of weird bug and we could probably fix it with the right thing. And this initially looked

right thing. And this initially looked like it worked and then after further investigation it turned out that the model had just learned to do um like

superp position under the hood in more cursed ways.

Um and this um yeah I think the point models of superp position paper kind of kicked off

superp position being a big deal the meant community cared about. Um

the rough um yeah I think this is tragically a paper on toy models close

to the name but I think did a lot of useful work building up conceptual foundations around

the ideas of superp position. Um,

specifically the idea that if you have a vector space and you want to represent things as

directions, you can represent a lot more components as directions than you have dimensions by some kind of compression scheme. Like

you chop five things in two dimensional space and you just project onto each one to see if it's present. you'll get some interference, but you can just do some

error correction and get rid of it. And

I think a few key insights from this paper. Um though it took a pretty long

paper. Um though it took a pretty long time before we really checked many of these things on real models. And I'm

still don't feel like we've really checked all of them on real models, which is kind of dissatisfying.

But a few insights I think are probably true.

Um it is valuable for performance to represent more concepts than you have dimensions.

You can do this by having them not be orthogonal because you know if you got like a 100 dimensional space you can only fit 100 axes in there but you can fit a damn

site more than 100 vectors with dot product at most point one with each other um for maths reasons and

um three one brown has a good video on this um that all features are fact mining um And

I think that yeah, it seemed pretty inherent to performance. And this is one of the things that made me pretty pessimistic on intrinsic approaches to

interpretability that try to make better architectures at least to deal with specifically the problem of superp

position and like poly semantic neurons where it just seems pretty important for a model's performance that it represents

more things than it has mentions And maybe if you train a much wider thing, you could get something that's like closish to

monantic.

But um so yeah. Um,

so yeah. Um, I'll get to Jake's question in the chat militarily, but yeah, other insights from his paper that I think are like

true and useful, you can there are two kinds of superp position.

Um, the idea was kind of motivated by neurons and the idea that you can have more concepts computed than you have neurons.

But neurons at least as we define them in mechan come after an activation function. And this means that

function. And this means that we would kind of hope or expect that uh these neurons are

yeah that they're like computing things and it turns out they can compute more concepts than um neurons.

An example of this is facts where you there are just like a bunch of actually lossless compression schemes you can come up with for representing more facts

than you have neurons.

Um this in my opinion was first shown in uh one of my mass papers from where's gi anthropic called finding neurons in a haste. Uh let's see does does this work?

haste. Uh let's see does does this work?

Yes, we have achieved set engine optimization.

Um that just found a bunch of real examples of things like ah the model has

memorized that certain uh certain strings are often followed by certain other strings like space nan is often

followed by grams. And this is a neuron that activates on these six totally different substrings and boosts the

um kind of natural continuations.

Importantly, this is like one neuron doing the same thing.

And it's fine to do this because if you have a few other neurons which activate different combinations of engrams, you

can get a fair amount of insight. you

you can do things like say well if these three activate then probably the like research gate engram is happening and they can kind of ignore all the noise

that comes because each of those three also boosts a few other engrams because those aren't conductively interfered with um I'm conflating engrams and facts because in my opinion these are all just

examples of memorization like given a specific string do specific thing or recall specific speific information and

yeah like memorized texts, engrams, facts, these are all aspects of the same fundamental thing in my opinion or at least entity based facts where you see

Michael Jordan and you recall a bunch of stuff. Um yeah, so going to Jake's

stuff. Um yeah, so going to Jake's question from the chat about what hasn't been shown. I basically think that the

been shown. I basically think that the yeah this is like a pretty long paper. I

think it has like some interesting conceptual insights that I think are cool and worth remembering and I think it has some

yeah a bunch of other stuff that I am less excited about and I'm much less confident transfers like they what do

they find? Basically, I think the

they find? Basically, I think the specific claims in this paper I'm a little skeptical of. Like they found something about these transitions as they very hyperparameters in their toy

model and it goes from being I don't remember what. Yeah, there's like a

remember what. Yeah, there's like a dedicated dimension for some feature.

It's in superp position. It's not there at all. The X and Y are some

at all. The X and Y are some hyperparameters and it's like these really clean sharp transitions.

Uh here's some funky [ __ ] They found that if you have three dimensions, it can often fit four features as a tetrahedrin.

And if you have like a 20 or 50 dimensional space, sometimes it will cluster into like a bunch of different tetrahedra or antipital pair when you've

got two in one dimension, one is this axis, the other is the gation. Um,

sometimes you have this weird [ __ ] the everything bagel. That's right. This is

everything bagel. That's right. This is

the everything bagel that doesn't seem to decompose further. To my knowledge, these have not been shown to occur on um

real models. I basically treat this

real models. I basically treat this paper as a source of intuitions that I found very helpful. And then a bunch of pretty specific claims about that toy model that I think probably don't

generalize.

And this yeah this paper came out it was like pretty unclear what to do about it.

I think that um the I think you can view a lot of at least anthropics interpretability work as

trying to rescue this initial dream of image classifying models on language models. Like neurons, they're meaningful

models. Like neurons, they're meaningful weights. They exist. You could just read

weights. They exist. You could just read them up and see the things between them.

And man, it really seemed like we could get pretty damn close to reverse engineering those. And now we have

engineering those. And now we have language models. And

language models. And in my opinion, there are just always going to be significant chunks of language models that we can't really

claim to understand.

my fuzzy best guess about what is happening in language models based on just like I don't know being involved in

a bunch of research and vibing uh in the field for a while is I think that there

are there are some properties of models that are kind of salient and important and these are often pretty easy to pick up on via a bunch of methods like sparse

autoenccoders often like probes or steering vectors or even just like literally talking to the

model. And um

model. And um then there's a separate cluster of stuff

that's more like um what I think of as like a really long tail of [ __ ] heristics where like maybe the model has some like really

nice coherent algorithms like the II thing or addition circuitry and things like this thing that's like pretty

coherent. to solve tasks.

coherent. to solve tasks.

Uh, and there, uh, you have my t-shirt of the days of the week, uh, represented as beautiful points on a circle. Uh, you

can buy your own at interpret.shop.

Um and uh, not joking. Uh, it's surprisingly easy to make your own merch store. Um,

yeah, there's like some really nice things that seem really important and we can understand them. And this is like actually pretty damn useful if you want

to do something like model biology or like apply this to useful tasks or just understand the broad strokes of what's happening in models. But there is just a

lot of [ __ ] like the error term and false autoenccoders and the backup stuff and weird second order components like

copies suppression and stuff we haven't even got to uh in this like kind of shistics that are like

slightly useful and thus fire here cuz if you got weights why not use them and I just don't feel like the quest to understand as much of that as possible

has gone particularly well. Uh but given that I have been going for coming on 2 hours, I'll probably wrap up there and

continue probably. Yeah, I'll probably in some

probably. Yeah, I'll probably in some time tomorrow um and just move whatever I event other event I have tomorrow to talk about kind

of the dictionary learning saga and then what I think are the most exciting directions now and like what's been happening recently and what I'm actually optimistic about

and kind of what I see as the current burgeoning directions in the field.

Um, but yeah, I know. Happy to take any questions if people have them before wrapping up. Uh, guess if I'm

wrapping up. Uh, guess if I'm pronouncing that right.

>> Yeah, it's I usually say person with the G, so Gerson.

>> Um, yeah, I was curious um what your thoughts are. I guess like there's still

thoughts are. I guess like there's still this circuit thread is still pretty active and I think a lot of a lot of the recent work has been focused. Mhm.

>> Um I guess >> I will discuss diction tree learning tomorrow.

>> Okay. I guess my general question was like what are your thoughts on the general direction of this thread >> and like the work they're focusing on?

But maybe that's working.

>> Okay. Okay.

>> How do the things work?

>> Um but I probably shouldn't try to give an answer that we'll just turn into giving tomorrow's thing. Uh but it's an

tomorrow's thing. Uh but it's an excellent question. Um, but I basically

excellent question. Um, but I basically model anthropic as being almost all in on dictionary learning, at least kind of

Chris Ol's interpretability team with some other parts like Jack Lindsay who has invented a bunch of stuff like

crosslayer transcoders and crosscoders now has a AI psychiatry team that's doing stuff that I consider much more within the domain of like pragmatic interpretability and they did a bunch of

like really cool [ __ ] in the claude sonet 4.5 model collage that came out yesterday kind of investigating at the ebal awareness but yeah I think the question of what do

I think of their agenda basically what do I think of dictionary um though there are other things like Lee Shy at Goodfire doing his like stocastic

private decomposition that's like another flavor that I'm I don't know. I

kind of have in my category of well maybe it'll work. There are a lot of things in the maybe it'll work category.

It's good if people who have conviction about some direction go work on this, but I'm pretty happy to leave it to Lee until he has epic results that will convince me to drop everything and

pivot.

Uh like I did for like a year based on before deciding that was probably a bad idea and I should do something else. Um,

any other questions? Uh, Alex,

>> um, what do you think about the early universality claims and I guess how come they haven't been followed up as much? I

feel like I haven't seen them much in the newer literature.

>> Yeah, my take is basically universality seems like somewhat but not fully true.

uh I don't massively feel like it's that decision relevant for how I would

approach understanding models because of the not fully true lot. Uh I think people people often just kind of take it

as a given I'd say or like it's often kind of assumed that studying certain phenomena in this model is interesting

because uh probably generalizes. I think

nowadays people often study multiple models. I like to hope that transformer

models. I like to hope that transformer lens making that really easy helped make this a normal in the field which I think has been like very good for epistemics

and like research culture in the field but like what yeah so one answer to this is just actually most papers are also universality papers because they just

their findings on a few models but it it's more like highle universality um rather than like super in the weeds.

There have been some works on this um ones I'll highlight uh there was this one paper called like Rosetta neurons

um that uh kind of tried I think tried to look at neurons in different image classifying models that were very correlated and said these are probably

interpretable because they're representing a concept shared between them so probably it's legit and did some interpret interpretability of these and

this is like this is a cool paper. Um

yeah, like I think this is a bunch of neurons that all seem like pretty similar and correlated. I don't know exactly what these show, but I'm sure

it's halfway rigorous. Um, and there was a

halfway rigorous. Um, and there was a fun paper from uh, Wes, one of my alums

on universal neurons in language models.

um kind of doing the same thing on this family of like five GPD2 small and medium models trained with different

random seeds finding yeah just like a bunch of interesting stuff like neurons that really liked letters or possibly words that began with letters

or yeah something about does the previous token have a comma and I I think these would both be just like across a bunch of web data

So this is like pretty monosmatic in my opinion. Um yeah, this one, you know, is

opinion. Um yeah, this one, you know, is a year. Uh where I guess like top of

a year. Uh where I guess like top of it's not a year. Uh oh, no, sorry. This

is uh this isn't activations. This is

the direct logic attribution of the neuron. So like you look at all of the

neuron. So like you look at all of the tokens, it impacts logits and years are negative, non-years are not. So this is

a don't see a year neuron which is important in GP2 because the way tokenizers worked at least back then is it was purely based on which

substrings were frequent. Uh can anyone see why this might make you want to suppress years if you have no special handling for numbers for your just like

which strings are common? Anyone? No.

All right. So basically what goes wrong

is that um the uh string um say 2001 is way more common than an arbitrary

four-digit number. So that is a token.

four-digit number. So that is a token.

Most four-digit numbers are not.

And uh but it's also a number and thus the model will have kind of it it will need to be really careful

about what kind of number it is currently talking about where eg the it kind of just needs a bunch of like spectral handling for like what kind of

number it should be tokenizing right now and sometimes you get unlucky and you have a normal number that just happens to have a year in it. Um, and sometimes trying to like write data by hand can

fall into problems like this cuz you type like 1 2 3 or 1 2 3 4 5 6 and like that's probably a token. And so

uh that's very different from how the model would deal with an arbitrary six-digit number. Um, and so uh why

six-digit number. Um, and so uh why exactly it would want to suppress years is like not nearly obvious to me. I

could see why we'd want to suppress non-year if it's like this is definitely a year. Um, and possibly it wants to

a year. Um, and possibly it wants to sometimes correct for that kind of machinery or sometimes maybe it's saying a date and it could either say a year or

it could say a month and there it wants to like not say a year.

Um, though that one isn't really about number tokenization. That's just like a

number tokenization. That's just like a reasonable thing to do. Um anyway,

[clears throat] uh nowadays many models either just have each digit be its own token, um or have say every number

between 0 and 999 be a token and every other number not be a token and need to be broken down. Uh I think I can't remember which models do which. Uh this

is an extremely important thing to check if you are ever doing research like this uh with numbers. Generally a thing you should just very frequently be doing if

doing any research that is remotely kind of hands-on with the model is print is display things with like boxes

around each token or some other indication of like how it's tokenized like using the transformer ends like uh

to stir tokens or something. Um

but yeah.

All right. So going back to universality.

Uh basically I think I want to say it is a thing. Um you

you know like a thing reviewers will typically complain about is like why did you only look at one mod? That's dumb.

And I think this is a reasonable thing for reviewers to complain about. And um

this also applies for your sprints. Um

if you can show me the or like the more surprising the thing is, the more I care about seeing it on multiple models, I think. Um yeah, I know like one fun

think. Um yeah, I know like one fun example would be the like refusal is mediated by a single direction paper while I'm just doing a core of my maths

papers as of a year plus ago. um where

yeah our like figure one was just we look at a bunch of models and review sure seems to be represented by a single direction guys and I think this was

pretty important for making the case that yes this is just a thing about models that is pretty general maybe doesn't apply to every model like

proprietary frontier lab models probably use interestingly different fine tuning than like open source ones But like yeah

anyway um in universality seems mostly but not entirely true. Occasionally

papers looking for a bit many papers like casually investigate it and papers that don't often get criticized as being

less rigorous but it's not seen as like a major basic science research direction that would be like super important.

and how satisf I can answer that is good. Uh yeah, so this is yeah this is

good. Uh yeah, so this is yeah this is basically why I'm not particularly excited for sprint projects on this. I

just don't expect to learn that much.

Um Ellen >> um well I guess one follow question about universality and I remember that like there had been these like universal

adversarial examples in image models and I guess that's kind of a different thing from here of like this like meta universality about like some concept but I guess I'm curious whether that appears

like universal jailbreaks or stuff like this in LLM context.

>> Yeah. So that is a complicated question.

So I think okay so first off I think we should distinguish between interpretable adversarial examples and gibberish abal examples. Why I would consider the um

examples. Why I would consider the um shock baseball one to be an interpretable one. Um, let's see.

interpretable one. Um, let's see.

Another one of my favorite examples is the multimodal neurons paper when uh actually if I was going to Google it.

What's this is such a badly SEOed paper.

Um and yeah so uh this is a adversarial example for a multimodal model that's basically trained to take images and

text and say is this an appropriate caption uh you know the latest Apple

iPod clearly um and I mean this iPod is like non-trivial already

but Uh yeah, this is a model called [ __ ] which was um yeah the uh it was like one of the it

was like one of the key things used to make deli which was one of the first like big deal image generation models.

Um, oh yeah, sorry for everyone who just knows this history and I'm just patronizing you, but hopefully it's interesting context for people less steams,

but yeah. Uh, anyway, so there's some

but yeah. Uh, anyway, so there's some truffle ones and then there's, you know, the like um probably if I just Google adversarial examples,

you'll see uh yeah, the airliner pig.

Um all right. Um the

interesting Yeah. So there was this delightful paper

Yeah. So there was this delightful paper called like adversarial examples are features not bugs or not bugs they're features. Um which

[clears throat] has inspired many copycat titles. Um,

and one way to think about the findings of this paper is basically saying we think there are real interpretable

concepts that are common to these images that adversarial examples use that models use

that humans can't perceive.

And I admittedly ignore this paper of detail. That could be an incorrect

detail. That could be an incorrect summary of exactly what it says. But it

was kind of like postulate and like kind of maybe postulating the bold hypothesis that like all adversarial examples are actually reasonable interpretable things.

Um, and I think, yeah, another fun thing Distill did is they had a, um, kind of thread of

different researchers responding and doing follow-up work on this.

Um, but so, Amazon examples of language models. Um,

models. Um, one of the first papers

that I saw go probably viral on this was the like GCG paper which um, yeah, no

yeah, there's only a year. It's just the citation cup 2018. Um,

and yeah, they had the like particularly spicy claim that not only could they do

um Yeah. So, they kind of produced these

Yeah. So, they kind of produced these garbage suffixes like example

um that just kind of look like nonsense.

Uh, oh, maybe this is Yeah, that kind of thing.

So, you know, it's a not quite what I call an interpretable adversarial example, but it in fact generates a plan to

destroy humanity, though not necessarily a good one. Um, but this was before you could just in fact tell the model to make you help you make a bioweapon and

it would give you useful answers. Um, as

is apparently the world we currently live in. Um, but yeah, and so this paper

live in. Um, but yeah, and so this paper existed. It made a splash. Um, I still

existed. It made a splash. Um, I still don't really understand to wonder if this paper was legit or not. I gather a fair amount of people could replicate it

on like a single model, but couldn't really replicate the universal thing.

Um, but also it's like a really fiddly thing. And also, at least on like single

thing. And also, at least on like single models, uh many of the seemingly fancy things done in this paper don't really seem that important. Like GCG stands for

like greedy co um greedy coordinate gradient descent or something. And the idea is like, oh,

something. And the idea is like, oh, you're using gradients to estimate which um tokens are more important are like

better at being a like weird cursed suffix for [ __ ] with the model. And

um it turns out actually yeah um exercise for the audience. Suppose you

have um you're reviewing a paper that's come up with this and you're like h I think this method is way more complicated than it needs. What is a

simplification you could suggest they do that plausibly would also work? Where to

recap, the method is essentially um you use gradients to estimate which token. You start with like a

which token. You start with like a randomly generated suffix and then you you use gradients to estimate um which

token if it replaced the current token in some position um over all positions would be like most important and then

like swap that in and then it keep doing this. They just do like MCMC

this. They just do like MCMC >> for instance or some like you know like handsome target function to for your acceptance of changing a token to another one.

>> So sorry. So what would what exactly would your method be here?

>> Oh it would be like pick a token at random uh pick a like pick a position at random pick a token at random substitute it and like keep it if this according to

some probability activated by like Metropolis Hastings or something. Maybe

not even just do a random mock, but like you could steer the random mock by using Metropolis things or something.

Um, yeah, pretty much. Um, basically

I at least believe there was some private follow-up work. I can't remember if anyone did this like properly publicly where you just replace it with like, well, let's just like randomly

mess with tokens and see which ones work better. do this. Do this like a 100

better. do this. Do this like a 100 times from your starting prompt and then keep the one that's best and then just keep doing this. This is pretty competitive because I don't know,

gradients just aren't very good for tokens.

Um token space just doesn't seem particularly linear. It's weird and

particularly linear. It's weird and cursed and discreet. And yeah, I think I don't know. I don't think you

necessarily need um things like Marco chain multi hollow or propolis Hastings or whatever. You can just literally try

or whatever. You can just literally try a hundred random steps, keep the best one repeat.

Um and uh yeah, this is a uh useful lesson in if you think you have like a really complicated fancy method, maybe it is a lot more complicated than it

needs to be. Um,

I don't know. Um,

uh, yeah, some other fun examples of this, there was an unlearning paper that claimed that it was finding the like

bio, um, yeah, bioteterrorism novice versus bioteterrorism expert steering vector and like subtracting that and that's why it worked.

Um, can anyone see uh the natural ablation on this one?

>> So, it worked to do what?

>> Uh, they wanted to make the model bad at like helping with boweapons. So they

found they found a vector um that was supposed to represent like kind of knowledge of bioweapons and then they like subtracted

it with a big coefficient and they and then the like score on the bioweapons benchmark went down that much.

>> But I guess they made the model stupid in the way that you do by adding a large steering vector.

>> Uh or it's a possibility.

>> So that's one hypothesis. Um yeah. Okay.

Maybe everyone will take like 30 seconds to think on their own about how exactly you might test the hypothesis that that's what they're doing or like what simplifications could you

make to this? Maybe I want people to get a chance to think on their own for a bit. All right. Uh anyone bonus points

bit. All right. Uh anyone bonus points that you haven't spoken before? Uh want

to give a guess? All right. People can

have. Yep. So instead of steering, maybe I would just fine-tune it.

>> Mhm. Yeah. Well, that's that's more like a baseline. I'm trying I'm trying to ask

a baseline. I'm trying I'm trying to ask how would you tell if your method was more complicated than it needed to be?

Uh or like it and you know, you could just fine tune instead. It is an answer to this, but it's more like you didn't need to bother doing the complicated

I think I know the answer just like I'm and hitting a cache that's saying you add a random vector.

>> Uh yeah, pretty much.

>> You replace the vector you believe pass meaning with um a random vector. The

format is unchanged.

Um therefore you were just making the model dumb.

Uh, replace the vector you think matters with a random vector is a extremely good baseline that you should basically just do if you ever think you have an

interesting ve that's giving you interesting results. It's also just

interesting results. It's also just really easy because you can just take whatever the other line of code is and put

rand n like around it or something.

Um maybe like scale it to be the same norm because norm can matter quite a bit. Um but yeah, random random

bit. Um but yeah, random random baselines highly underrated research technique would recommend. They're also really

would recommend. They're also really easy.

um ditto things like h well what if we just like randomly selected a chat prompt to give to the model rather than our like careful handpicked one or what

if rather than looking at the model on this biology data we looked at it on random chat data would our boweapon detector also light up a bear that kind

of thing um all right so I think I have somewhat gone off rails I was answering a question about adversarial examples. Um,

obviously there are many interpretable adversarial examples. Like typically

adversarial examples. Like typically when people say jailbreak they mean interable or at least you can reason

about why it work. I'm not aware of the good interpretability work on these kinds of suffixes. We looked at it a bit

in the refusal paper but didn't really make that much headway. Like there were some components that seemed important but kind of run into the classic problem

of well you found the components that matter but what now? And

yeah um all right so yeah I don't know finding more to say here. I don't know how satisfying an answer that was.

>> I You mean like you found the directions and then what now? What? What do you mean? Like where would you want it to

mean? Like where would you want it to go?

>> Yeah. No, no, it's right. Uh you asked a question ages ago about examples and I'm asking does it feel like that question been answered or is there like more I could

>> Well, I guess yeah, I was I was curious about your last comment of like what do you mean? What now? Where would you want

you mean? What now? Where would you want it to go if you found the examples like you'd want like an interpretable way to stop them or way to understand like what what they're doing in the model or Yeah.

Yeah. So

the thing I guess the main thing I'm okay so like there's the thing that with my ambitious reverse engineering hat on I'd want to

do that be like why why is it the case that this string gets the model to do this? Why is it universal? Um, and you

this? Why is it universal? Um, and you might try things like kind of doing ablations on the string and figure out the simplest form. And you

might think that you can like run an SAPE on it. I don't actually know if anyone has tried literally that probably someone has.

It would be an interesting um thing for someone to play around with. Probably just like a throwaway

with. Probably just like a throwaway afternoon projects. Uh, I recommend not

afternoon projects. Uh, I recommend not trying to replicate GCG because it's a pain in the ass, but you can probably replicate the like random search version

pretty easily.

Um, or just find some examples online or just literally try giving this to a model and see what happens. Um, and just pick one of the models we have SE for

though. Also, you should probably get a

though. Also, you should probably get a model to do a lit review for has someone done this. Um

done this. Um but yeah, I think one answer you might aim for is how is the how is the model

disrupted?

Either are there attention heads that don't do the right thing anymore or whatever. Um another might do thing you

whatever. Um another might do thing you might do is like ah well what what are the tokens where the crucial thing happens as I do patching?

How does that matter which we did here?

Um and another would be um what I think

um I think that you might want to take and yeah you can like see if there's some representations that become interpretable. Uh, one thing you might

interpretable. Uh, one thing you might want to do is try replacing it with the SAPE reconstruction.

And if that works, then that's some evidence that it's injectable.

Um, if it doesn't, that's some evidence that it's going via some like weirdass mechanism.

Uh, one hypothesis put forward has been that adversarial examples sometimes come from superp position gone wrong. like

you have um what? Yeah, that like sometimes

um what? Yeah, that like sometimes um there'll be some features that uh kind of using similar directions because they don't tend to occur at the same

time and maybe um an average example has learned that by making one of those light up you can make the other one light up and that's an optore model.

It's an open question how much the structure of superp position is universal between models like how much you'd expect this kind of overlap to

make sense though you could probably try answering it by I'm sure someone put out a set of essays with similar random seeds um well like a different random seeds

but everything else the same that you could study for this or something and or not what you'd want is two models

with SAES. Probably what you actually want is

SAES. Probably what you actually want is like a crosscoder between two different models and then the crosscoder should have a direction in both models for each concept and then you can look at the

like cosine sim between those directions for different models and see whether that's consistentish or not. Yeah, I think that might get you

not. Yeah, I think that might get you some that might probably tell you something, but it's like less obvious that those kind of adversarial examples, if they

even exist, should work. They should be universal. And yeah, and

universal. And yeah, and overall um I am pretty pessimistic on there

being like a really nice explanation down to the kind of token level of why this stuff works. I think it's plausible

that there's more of like a model biology explanation of like ah well at layer 10 the sapel latent for this is great guys has len up lit up a bunch and

if you get rid of everything other than that that's sufficient for the jailbreak therefore this is the important thing we have no idea how that thing is activated

shrug um yeah the reason I'm not super excited about sprint projects on this is I just I don't think adversarial examples are like that important.

Like I don't really know what it would change for like me or people trying to make um safer models to have a better understanding of where they come from.

I think the jailbreaks are well first of all I'm very pessimistic about the ability to fix adversarial examples because

thus claim the adversarial robust researchers uh that they have tried doing this for like a decade and failed and I think that generally just like

being robust adversaries is really hard even if you can make their lives harder.

I think that um there are things that are more like probe for how someone got the model to do a bad thing no matter what the mean

means were and like I think that's just legitimately useful um and like a perfectly valid direction of applied interpretability

but I feel like those kinds of defenses are better off being agnostic to the form of the attack especially as it's harder to do this kind of attack on like front table

models. They tend to be closed source

models. They tend to be closed source and adversaries won't have access. So

maybe you can do this by just hitting the API a lot. I don't really know.

All right. How satisfying was that answer?

>> I definitely have more questions, but I also like don't want to keep taking.

>> All right. Uh anyone else?

I think someone had the hand up. All

right. I'll take one final question and if no one else has one, I can be from L.

seem like no one else has one.

>> Um I guess just on the last thing I mean I guess you can imagine like attacking probes right with like adversarial examples. So like if you had like

examples. So like if you had like techniques or detecting whether a user is doing something bad those could be adversarily attacked and like that would

I don't know break that defense.

>> Yeah. So I think that to me the interesting question is what not can you break probes but it's what affordances

do you need to break I if you give me full white vodka access yeah I know we basically know that you

can optimize inputs to make numbers go in arbitrary directions people have shown this I was not very surprised um

there are yeah Um, so maybe the sort of like somewhat more aggressive version or like more universal versions of this, but they depend they still depend on like having this kind of privileged

access. The thing that I'm really

access. The thing that I'm really interested in is with like an API or just promising access, the kinds of

things people would have on say clawed if it is deployed with probes, which I think it probably is. Um,

and yeah, if people then were trying to like break into that e even with like arbitrary amounts prompts like maybe you

do an evolutionary algorithm and then we try to figure out how hard it is to break that. I'd be interested in that

break that. I'd be interested in that question.

Um, it's plausibly harder to do this on APIs because you either get rate limited or they tell you or they notice when you do jailbreak and they're like, "Fuck

you. We're going to revoke your API

you. We're going to revoke your API key." Um, I'm not entirely sure. I'm

key." Um, I'm not entirely sure. I'm

sure someone's tried a kind of blackbox random attack on APIs and I'd be kind of curious if it works or not,

but yeah.

Um, >> yeah, that's a question I'm interested in. Um, I generally think that

in. Um, I generally think that regardless of whether anyone is currently using probes, it does not seem like people have been using probes in

production for long. Um, and I think that over time you would Yeah. Like if

you introduce a new defense, it's going to look really good because people have spent ages breaking the old defense safety training. And so this is a

safety training. And so this is a problem and I'm pretty curious whether over time we're going to figure out new classes of jailbreaks that suddenly

started working on models at some point.

Um that presumably correspond to when that model started using new defenses like probes. Um, and no, yeah, I do

like probes. Um, and no, yeah, I do think this is a weak spot in a lot of research into things like monitoring and defenses.

Like, you need to check whether your thing is robust to realistic adversaries.

Otherwise, you're going to get a significantly inflated view of how good your thing is. And maybe that's okay if you're not going to be up against

adversaries or you will just do incident response and you think you have good incident response and if this is an issue you can deal with it or just like

switch back to a older more expensive kind of defense that's hard to break but I don't know papers often don't check this kind of thing especially academic

papers um a lot of reorg to check.

And people often also don't check things like what's the false positive rate on random user traffic because you really

don't want your bioweapons detector going off on 5% of users. um or

yeah like kind of what's the runtime cost or what are the side effects on the model if it's kind of more of like a change to the model. Um and

bunch of questions.

All right, I'll wrap up there.

Thanks all for coming.

Loading...

Loading video analysis...