
Microarchitecture: What Happens Beneath

By Jane Street

Summary

Topics Covered

  • x86 Breaks into Micro-Ops
  • Macrofusion Fuses Compare-Jump
  • Micro-Op Cache Bypasses Decode
  • Renaming Unlocks Loop Parallelism
  • Store Buffer Splits Address-Data

Full Transcript

Hi everyone. Uh, welcome to this tech talk. Really glad everyone could make it. I'm here to introduce Matt Godbolt.

Uh, Matt Godbolt is a C++ developer who loves looking under the hood of compilers, operating systems, and silicon. By day, he writes software for finance, like many of us here. By night, he emulates vintage computers and maintains Compiler Explorer. Well, Compiler Explorer is the official name, but everybody knows it as Godbolt. So naturally, when someone set up the Jane Street internal instance, they named it Godbolt, making Matt the first external speaker who already has an internal domain named after him.

Please join me in welcoming Matt Godbolt.

[applause] Thank you, Jasper. Well, thank you for having me. This is fantastic. I'm looking forward to talking to you about the kind of things that I love digging into. So this is "Microarchitecture: What Happens Beneath," which is not my best title. As Jasper says, Compiler Explorer is why you've probably heard of me before. I love making emulators for old computers. That's what I've been doing for the last year while I've been on a non-compete. My non-compete ended today, which is why I'm able to be here talking to you. But next week I'll be working at HRT. So we'll be neighbors and I'll be up the street. Friendly competition.

I also love trying to reverse engineer what the heck is going on inside your CPU, right? Intel famously doesn't give a lot of information about what's going on inside the chip, and that's for obvious reasons to do with, you know, intellectual property or whatever. But there's a small but dedicated community of folks who are trying to work out what actually is going on inside. This is my contribution to it, where I was trying to work out how the branch target buffer worked inside some of the older chips. We're going to be talking about some old chips today, unfortunately, because I don't have access to the very, very new stuff: I've been on a non-compete for a year and it's very expensive to rent them. But anyway, I've done some research before, and my claim to fame on this is that I was cited in the Meltdown and Spectre papers, which was a nice moment of recognition for this kind of stuff.

You've probably seen a CPU pipeline before in textbooks, and I know I've been speaking to some folks who are now sitting very threateningly in the front row who have a very deep understanding of everything we're going to cover here. But for the rest of you, hopefully you'll forgive me doing a quick introductory reminder of what a CPU pipeline looks like. When I grew up, this is what a CPU pipeline looked like. You fetched an instruction, you decoded it, you executed it, and you wrote it back, in sequence. And then you could do this as a sort of production line, as a pipeline, one step at a time. So one instruction is being fetched while the previous one is being decoded, the one before that is being executed, and the one before that is being written back. And therefore, although it takes four cycles to get through here, we're still doing one useful piece of work every clock cycle.
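To put a rough number on that, here's a back-of-the-envelope model (my own toy arithmetic, not anything from the talk's slides): with a four-stage pipeline, N instructions take roughly N + 3 cycles once the pipeline is full, so throughput approaches one instruction per cycle.

    // Toy model of an ideal 4-stage in-order pipeline: once full, it
    // completes one instruction per cycle, so N instructions need N + 3 cycles.
    constexpr int cyclesFor(int instructions, int stages = 4) {
        return instructions + (stages - 1);
    }
    static_assert(cyclesFor(1) == 4, "a lone instruction still takes 4 cycles");
    static_assert(cyclesFor(100) == 103, "~1 per cycle in steady state");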

That was a big win. Modern CPUs, though, realized that we can do better than just this single step at a time. We can break the system into three broad categories. We've got a front end, where we've got a branch prediction unit. I wanted to talk to you a lot about branch prediction; I actually did a whole bunch of reverse engineering work on it, but I don't have time, so we'll have to talk about it afterwards, or in the pub, or something like that.

But anyway, there's a fetcher and decoder as before, and we have a rename step, which is sort of the gateway into a new world. And then after that, this middle section here is the back end. Things can happen out of order, which sounds counterintuitive, but in programs we very often write code which has lots of little parts that aren't totally dependent on each other. You know, you read a value, you add 10 to it, you store it somewhere else. You read another value, you add 20 to it, you store it somewhere else. Well, those two things could happen at the same time. And so, if you can pick up enough instructions in one go, you can actually do more than one piece of work per clock cycle. That's what this stack here in the execute stage is, and that's what the out-of-order system is doing: some clever dependency tracking, and then everything happens whenever execution units are available and input results are ready. Then finally we come to the retirement stage, and that's where everything gets put back in the order that the program was written in, so that you don't notice the monkey business that's been going on in the middle bit.

Okay, but we're going to be talking specifically about Skylake, which is old, 2019-2020-ish era, but again, my laptop is a Skylake. My desktop machine is a Comet Lake, or a Skylake-X actually; yeah, the laptop's a Comet Lake. So these are representative of the kinds of things that actually happen in a modern CPU, but they have changed in the kind of server-class CPUs that you'll be using in your day-to-day job. Where I remember them, I will try to call out the differences, but again, I couldn't test the things that were on more modern machines.

So, out of order on Skylake is a little bit more complicated than that nice textbook diagram. We have a front end that looks like this. Again, the branch prediction unit we're not going to be talking about. We have two parallel paths: the fetch, predecode, instruction queue, and then the many decoders. Beneath that, we've got a micro-op cache. To the right of that, we've got a queue of micro-operations and the amusingly named LSD. Then we hit the renamer. We're going to talk about all of these things in a second; I just want to give you a big picture of where we're going with all of this. Hmm, what was I going to say about this? There was something I meant to say. It'll come to me. We'll cover it anyway. So, anyway, this is the front end.

And at this point here... oh, that's right: micro-ops. I've written "micro-op" up here. So instructions, specifically on Intel, are broken into smaller operations: micro-operations. Unlike most RISC processors, where you're just adding two numbers together, or reading, or doing a branch, some of the Intel instructions are crazy: you can do some crazy address calculation, which is, you know, one number plus another one times four plus some offset, and that's the address that you're going to read from. Then you read from that address, then you add to it, and then you store back to it, and that's one instruction. So it makes sense for the poor front end here, which deals in terms of those crazy x86 instructions, to break them down into a sensible RISC-like architecture in the back end. But we never get to see that; we just have to infer its presence from all the measurements we do on the outside. So that's what a micro-op is. By the time we get to this point here, we have a stream of micro-operations that are in program order. They get to the renamer and then they reach the back end, which is a little bit smaller and simpler.

And we won't be covering as much of this; if I've got my timings right, we've probably only got five minutes for it at the end. But we've got a scheduler. We've got multiple execution units, not the four that are here; there are actually many more than that. And then there's a write-back stage. Then we have retirement. And then post-retirement, we have this store buffer commit phase. And again, each of these steps is probably a simplification, but this is mostly built out of measurements that we can make. And by "we" I don't mean me (although I have contributed to this in the past) but the community of people that reverse engineer all this kind of stuff.

I'm relying very much on their results in what I'm presenting here. These things are not necessarily found in the Intel manuals; you have to go and dig them out yourself. I have some references at the end, so if you're interested in this and how you can measure it yourself, knock yourself out. It's a really, really interesting rabbit hole to go down. This selection of TLAs, this soup of letters over here, are all the shared data structures that each of the bits of the system can see. We'll talk about those separately, but I wanted to call them out to begin with.

We've got a phys... [cough] excuse me, we have physical register storage. So while you tend to write, or see, programs in assembly that refer to EAX and ECX and EDX, or whatever, the 16 registers that we have access to (16, what a luxurious number; I know we were talking earlier about how we probably could do with more, but it wasn't that long ago we only had eight of them, of which two were used for other purposes), that is small fry compared to the number of register slots that are actually available on the CPU. Behind the scenes it's doing all sorts of allocation and renaming and mapping to take advantage of the extra space it has on the chip. So the physical register file contains all of the actual values of the registers, and there are probably hundreds of those. The RAT is the register alias table, and that's the structure that maps the current EAX to where on earth it currently is on the chip, among other things.

There's a reorder buffer, which stores essentially a ledger of all of the micro-operations that have flowed through the system, and is essentially the record-keeping of "this is the point at which we know an instruction has been issued," and later on we will retire instructions in the order that they came through into this reorder buffer. Then there's a reservation station, which is also called the scheduler in some literature; that's where we actually store the things that have yet to be processed, the actual operations themselves. Like: I need to do an add of this number and that other number, or of this number and a number that we'll get when the read that it depends upon has completed. They live in the reservation station. The branch order buffer is a sort of checkpointing system for branch misprediction. Again, I wanted to talk more about this, but sadly we don't have time. It's fascinating if you sit down and think about what the thing must be doing when it's predicting, you know, 20 branches into the future, and then it gets one of them wrong and doesn't want to throw all of the work away. And then there's a memory order buffer, which is responsible for making sure that loads and stores to memory still make sense even though we've reordered everything.

So we're going to start with the front end. This is going to be the majority of the talk, and I forgot to start my timer, so I'm going to have to rely on looking at the clock behind you to see where I am. [gasps] I'm going to use this example, and it's sort of thematic that I have to do it in C++. Although we did attempt to write this in OCaml earlier, it did not generate the same code. [laughter]

I will leave that to your imagination.

So it's a bit of a made-up thing. It's taking an array of integers, squaring them, and summing them up. So it's just doing a rolling sum of the squares of all the integers, in effect. And it suits my purpose because it compiles to six instructions and exactly 16 bytes of machine code. I'm sure you're reasonably familiar with this, but on the right-hand side we have the text representation, which is the assembly code, and on the left-hand side are the bytes that the CPU actually reads: that's the machine code. It used to catch me out all the time; I used to use the two terms interchangeably, which was foolish. And you don't really need to know, but it's: we read through the array pointer, we square the value, we add it to our running total, we move the array pointer to the next element, we compare it to the end pointer, and then we jump back if we're not at the end yet. Pretty straightforward stuff. 16 bytes' worth.
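For reference, here is a minimal sketch of the kind of C++ that produces a loop like the one being described (the exact source isn't shown in the transcript, so the function name and signature are my guess; with optimizations on, mainstream x86-64 compilers can emit a six-instruction scalar loop of this shape, though exact output depends on compiler and flags):

    // Sum of squares over a range of ints: the loop body compiles to a short
    // mov / imul / add / add rdi,4 / cmp / jne sequence on x86-64.
    int sumOfSquares(const int *begin, const int *end) {
        int total = 0;
        for (const int *p = begin; p != end; ++p)
            total += *p * *p;   // read, square, accumulate
        return total;
    }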

What you'll notice immediately, though, is that each of these is a different length. We've got a two-byte instruction, a three-byte one, there's even a four-byte one here. In general, x86 instructions can be anything from one byte to 15 bytes long. And the encoding isn't sensibly laid out, because it's the sort of thing that happens when you take a design from the 1970s and incrementally add things to it while maintaining backwards compatibility, which means everything harks back to these ancient ways. There are bytes that say, "Hey, the next byte is now interpreted differently, unless it's Tuesday and the moon is rising," and so on. So it's really quite complicated to disassemble x86. Anyone who uses ARM or MIPS or RISC-V or whatever, it's lovely: the instructions are all the same length. So that's very simple, but this is why x86 needs to do this complicated decode.

Another thing I'm going to quickly show you is an instruction trace of a sort. This comes from a somewhat heavily modified tool called uiCA, the microarchitectural code analyzer that Andreas Abel, who's now at Google, has written and open sourced, and this lets us see an individual instruction's journey through the pipeline. What we're seeing is one instruction and all the stages it goes through, from left to right over time. This instruction takes 16 cycles. On the first cycle here it's being predecoded; it's hitting the instruction queue here; it's being issued, dispatched, executed, and then retired, in this particular example. I just want to show an example. We'll have a few more of these up, and I'll explain them a bit more as we hit them, but I don't want it to be surprising when they pop up.

All right, first stage: fetching. This is probably the easiest stage, although I'm going to gloss over a whole bunch of really complicated stuff that it does. We have the predicted instruction pointer. The fetch unit doesn't wait for the program to definitely go somewhere; it's being told where to go by the branch predictor at all times. And that's a really interesting aspect that I had never even considered, because you might pick up a chunk of 16 bytes, and until you've decoded it, which is four or five steps down, you don't even know if there's a branch in there. By that time, you've already pulled up two more 16-byte chunks of memory, and you're like, well, I went the wrong way, right? And that would be an unconditional branch; if it's conditional, you have to actually execute it to work out whether or not you're going to take it. So the branch predictor's got a big job, and it does, by and large, a good job. The fetch unit picks up 16 bytes at a time at a 16-byte boundary, which means that if you branch into the middle of a 16-byte block then you're immediately wasting a bit of bandwidth, because you're missing out on some of the bytes that you could have picked up in one cycle. The other thing to note here is that I'm not going to talk about caches at all, because that's another two-hour talk. But this obviously has to liaise with the TLB to work out where the heck these instructions actually come from, and with the L1, L2, L3 caches to actually fetch the bytes into a 64-byte cache line and then pull 16 bytes of it down. So there's a lot of work hiding here. But the result of this fetch stage is some kind of pipeline of 16-byte chunks, presumably with some kind of address tagged onto them so we know where they came from and it can be checked later on. All right, easy one done.

The x86 instruction set is complicated, and what we want to do is unlock parallelism so we can run more than one instruction at a time. If the most that we could do in a single clock cycle was decode one instruction, then we'd be missing a trick down the line. So what we want to try to do is decode as many instructions as we possibly can out of this block of 16 bytes in one clock cycle. It's broken down into stages. The first stage here is called the predecode stage. The predecoders have a set of somewhat flaky heuristics about what these bytes might mean. They're flaky because we have no idea where the instructions lie in here; all we know is that they don't overlap with each other, and each depends, of course, on having decoded the previous instruction. So how could you possibly do it in one cycle? The only explanation that we (and by "we" in this instance I mean Agner Fog) can think of is that it speculatively tries to decode an instruction at every possible offset, and then it has a sort of filter step where it says, "Well, these ones overlap with the one that came before, so they can't be valid decodes." And again, it's a bit of a heuristic hack. It never gets it wrong, per se, but it sometimes conservatively tags instructions for the later stages of the pipeline as being more complicated than they actually are.

We'll get to that in a sec. So these 16 bytes happen to represent exactly the routine that I showed. The predecoder essentially just breaks it down and says, well, here are the opcode bytes, and so on. Again, a lot of what I'm pointing at on this side is my guesstimate about what's going on; it's not canonical, there's no source for exactly what's flowing here, but hopefully you'll forgive me. The interesting thing is that we've been able to discover where the bytes of each instruction are: the first instruction, which is that mov; the second one, which is the imul; the add; the add to RDI here; and then this last one, which is a compare and a jump. And at this stage in the pipeline, it spots an opportunity. Although we treat them as two separate instructions (we have to, that's what the ISA is), it's so common, you know, every branch, every loop ends in a compare and a jump-if-not-equal or something like that, that Intel decided to say: well, what if we had an instruction which was just a compare-and-jump as one unit? We could make a new instruction, issue it, and ask all the compiler writers to actually emit this instruction. Or we could just spot the pair in the instruction stream and replace it here. And that's what they do. This is called macro instruction fusion, and it's tagged, at least, in this predecode step. So we still don't really know what these bytes are, and again, sometimes the bit masks or whatever it uses are wrong. I'm going to run out of time if I keep doing this. Okay.

So the predecoder is limited in how many instructions it can handle in a cycle, and it does this macro-op fusion. Now, at the moment, and by "the moment" I mean for Skylake (it may have changed, but I don't think this one has): a compare or a test on anything that isn't memory (memory being more complicated), followed by any branch, is treated as a single operation. An add, subtract, increment, decrement, or AND, followed by a branch from a more limited set, is again treated as a single unit.
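To make that concrete, here's a tiny sketch (entirely my own illustration; the enum and struct names are invented, and the real fusion rules have more cases than this) of how a predecoder-style pass might tag a fusible compare-plus-branch pair as one unit:

    // Toy macro-fusion tagging pass: a cmp/test without a memory operand,
    // immediately followed by a conditional branch, is marked as one fused op.
    #include <vector>

    enum class Kind { Cmp, Test, Add, Jcc, Other };
    struct Insn { Kind kind; bool hasMemOperand; };
    struct Tagged { Insn insn; bool fusedWithNextJcc; };

    std::vector<Tagged> tagFusion(const std::vector<Insn>& in) {
        std::vector<Tagged> out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            bool fusible = (in[i].kind == Kind::Cmp || in[i].kind == Kind::Test)
                           && !in[i].hasMemOperand
                           && i + 1 < in.size()
                           && in[i + 1].kind == Kind::Jcc;
            out.push_back({in[i], fusible});
            if (fusible) ++i;   // the jcc is absorbed into the fused pair
        }
        return out;
    }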

It's worth saying, and you probably don't do this, although you write compilers so you might have the latitude: don't put in any length-changing prefixes. These are usually used in the crazy 32-bit modes to use smaller pointers, but they effectively snarl up the predecoder. It just goes, "Whoops, I don't know," because you can make an arbitrarily complicated instruction just by putting random prefixes on it. It throws its hands up, and it takes, I think, a three-cycle penalty. If it sees one, it's just like, "Whoop, I'm out."

And yes, the other thing that it does is notice whether an instruction is simple or complex, and it steers that instruction to the appropriate decoder. Steering is not really the right word, but I'll tell you what I mean in a moment. So we have these partially decoded instructions, which again are just blocks of bytes and presumably some kind of sequence number. I keep putting "and seq" here: I don't think it uses addresses internally to track things, because obviously if you've got a loop you'll see the same address over and over again. It must be some kind of global sequence number that's used for things like, "Hey, this branch was mispredicted; everything after this is now garbage." So I'm just going to say "and sequence number." So we've got partially decoded instructions, and they're passed on to the decoders.

There are four decoder units, and instructions are passed to them in order. The result (and again, this is me making up my own syntax over here) is micro-operations. In this particular example, every instruction is one micro-operation, so each one on that side is one thing on this side here. One of them is like a 32-bit load through RDI, then a multiply, an add, another add, whatever. And then this one is a macro-fused one, for which I've made up a compare-and-jump instruction. Variously, these kinds of things. Okay.

[sighs] There are two kinds of fusion. One is the macro-fusion we've already talked about. So macro-fusion: macro is big, that's how I remember them. It means that two big things, the instructions, have been fused into one internal operation. We've talked about that. Micro-fusion, on the other hand: well, in this stupid syntax, or sorry, this stupid ISA that Intel have come up with, it's so common to have an address operand on the right-hand side, even though that's logically two operations. That is, add rax, [r14] is: go calculate whatever the address from R14 is, go read that from the cache and come back, and then go to the ALU and add it to RAX. Right, that's two distinct operations, practically speaking. But what if we just made our micro-op format a little bit wider and had an optional memory tag? Then, for most instructions, we've got space to say: well, this is both an add and a read. For the simpler cases that can happen, and so a micro-fused instruction will later on actually be executed by two different parts of the chip as two micro-operations, but at this point we can pack it into one micro-op in all the queues and things that are coming up.
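As a rough mental model (my own sketch; the field names are invented and the real micro-op encoding is undocumented), you can think of a fused micro-op slot as an ALU operation with an optional memory operand bolted on, which only gets split into a load uop and an ALU uop when it executes:

    // Hypothetical fused micro-op slot: one queue entry can describe both the
    // ALU part and an optional memory source, e.g. add rax, [r14].
    struct MicroOp {
        enum class Alu { Add, Mul, None } alu = Alu::None;
        int dstReg = -1;        // destination register
        int srcReg = -1;        // register source
        bool hasMemSrc = false; // true for the add-with-memory form
        int memBaseReg = -1;    // base register of the memory operand
    };
    // add rax, [r14] occupies one MicroOp here, but later runs as a load on a
    // memory port plus an add on an ALU port.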

So we've got micro-fusion and macro-fusion. There is also a microcode sequencer. The microcode sequencer is effectively a ROM with a sort of simple interpreter on top of it. Anything that's more complicated than four micro-operations needs to come from that ROM rather than being hardwired into the decoders. And the ROM can either just literally spit out a bunch of micro-ops (here's 15 micro-ops for a complicated instruction), or it can actually have logic in it. It can go: oh, okay, well, while ECX is not equal to zero, keep issuing these micro-ops, right? So it handles things that are complicated, like locked operations, which I'm sure anyone who's done threading has seen, and which take more than four operations; or obviously things like the rep string instructions, which are repeating operations. Those are effectively just the microcode sequencer emitting the read and write operations over and over until you hit the end of the thing that you're copying.
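As a purely illustrative sketch of that idea (this is not real Intel microcode, just my toy expansion of a rep-movs-style byte copy):

    // Toy "microcode sequencer": expand a byte copy of rcx bytes into a flat
    // stream of load/store/pointer-bump uops, one group per iteration.
    #include <cstdint>
    #include <string>
    #include <vector>

    std::vector<std::string> expandRepMovsb(uint64_t rcx) {
        std::vector<std::string> uops;
        while (rcx--) {
            uops.push_back("load  tmp, [rsi]");
            uops.push_back("store [rdi], tmp");
            uops.push_back("add rsi, 1");
            uops.push_back("add rdi, 1");
        }
        return uops;
    }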

And obviously complicated things like rdmsr or cpuid effectively run a little program. You know, it's going to go off and say, hey, what kind of CPU are you? Oh, I can return that. And so that's kind of cool. I actually met somebody who worked at Intel. He said, oh yeah, my neighbor writes microcode. I'm like, what a crazy job that must be, to sit there and write the stuff that's at the other end of this. And then the other things that are done by the microcode sequencer are things like divides. I know, they're also complicated, more than four micro-operations, but it was kind of an eye-opener to me: oh, that's how divides work. It's got all this acceleration circuitry, but that's how they can short-circuit: it's like, well, okay, we've reached the point where there are no more bits coming out, we can stop here. Anyway, I've got a thing about divides. You'll learn that in a second. [cough]

[clears throat] So, we've got four decoders, but not all decoders are created equal. The first decoder is the real decoder: it can do up to four micro-ops, so any instruction that is four micro-ops or fewer can be handled by the first decoder. Anything that's tagged as complex, even if it actually turns out to be a simple operation, gets sent to the first decoder, and the other ones won't touch it. Any instruction that requires a switch to the microcode sequencer has to go through the first decoder as well: for something like a divide, the first step will be, hey, here's the setup code, and now we're going to start reading from the microcode sequencer. Decoders two through four can only do a single micro-operation each. But that's the fused micro-op, so those adds with a memory operand still count as one in this world, even though later on they're going to be broken into two. We good so far? I'm seeing a lot of nods from the front row, which I appreciate very much.

So you can decode four instructions or five micro-ops per cycle. Which means that if you really need to get lots of micro-operations out, you can have a four-micro-op instruction followed by a one-micro-op instruction, and those can come out in one clock cycle, while the other two decoders just have to sit and twiddle their thumbs. Or 3-1-1, 2-1-1, or 1-1-1-1, right? And luckily most things fit into this world, unless you're doing something silly like dividing. The worst case you can possibly hit is a sequence of instructions which are exactly two micro-ops each, because they can only go to the first decoder and all the other ones are idle, and then the next clock cycle you generate another two. So that's the worst case.
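Here's a little toy model of those decoder-group rules (my own simplification, using the limits as described above: decoder 0 takes up to four uops, the others only single-uop instructions, and at most four instructions or five uops per cycle):

    // Count how many instructions the decoder group can take in one cycle,
    // starting at 'start', given each instruction's uop count.
    #include <cstddef>
    #include <vector>

    int decodeOneCycle(const std::vector<int>& uopCounts, std::size_t start) {
        int insns = 0, uops = 0;
        for (std::size_t i = start; i < uopCounts.size() && insns < 4; ++i) {
            int n = uopCounts[i];
            if (insns == 0 && n > 4) return 1;  // handed to the microcode sequencer
            if (insns > 0 && n != 1) break;     // decoders 1-3 only take 1-uop insns
            if (uops + n > 5) break;            // ~5 uops leave the decoders per cycle
            ++insns;
            uops += n;
        }
        return insns;
    }
    // A stream of 2-uop instructions decodes at just one per cycle, the worst
    // case described above: decodeOneCycle({2, 2, 2, 2}, 0) == 1.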

But although I've just banged on about this, and it sounds interesting and complicated, Intel refer to it as the legacy decode pipeline. Not that we have any other choice. Thanks, Intel. We're mostly saved by the next thing we're going to talk about, which is the micro-op cache and the loop stream detector.

I had a sketch of what might be in a micro-op. I've been trying to construct, in my mind, my own emulator-slash-simulator, so that I could exercise the path of where things would need to go. But we don't need to talk about it because I'm not going to have time. Just to show you some pretty rainbow diagrams: these are some other, more complicated instructions that come out as two, three, and four micro-ops, although most of these are still fused in the fused domain. I think there's only two here, and this is only one; I think this is the only one that's actually three micro-ops. But you can kind of see: a push rax is, well, I have to write rax to where the stack pointer points, and then I have to update the stack pointer. Makes sense. xchg rbx, rax is probably temp equals rbx, rbx equals rax, you know, the whole swap thing. It doesn't do the XOR trick, I'm pretty sure. Maybe it does; I mean, who knows? And then this add to a memory location is one of those really complicated memory operations where we have to read the memory, then add to it, and then write it back. And writing back is always two ops, because the address generation is separate from the data part of the store, for reasons we'll touch on later. All right.
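That address/data split is easiest to picture as a store-buffer entry that gets filled in by two separate uops (my own illustration; the field names are invented):

    // A store is tracked as a store-address uop and a store-data uop; both
    // must have run before the store-buffer entry can commit to the cache.
    #include <cstdint>

    struct StoreBufferEntry {
        uint64_t address = 0;      // produced by the store-address uop (AGU)
        uint64_t data = 0;         // produced by the store-data uop
        bool addressReady = false;
        bool dataReady = false;
        bool readyToCommit() const { return addressReady && dataReady; }
    };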

And then, yeah, because I have to put a divide on here: this is a 32-bit divide on Skylake. They've got a lot better; I'm sure Granite Rapids is a lot quicker than this. But when I first sketched this and got it to print out, I was like, oh my golly. Yeah, it's like 100 cycles for the 64-bit version. The interesting points here are that we can see that in the first clock cycle, decoder zero was able to output these three micro-ops, they flow through, and then it switches to the microcode sequencer, there's a two-cycle delay here, and then it carries on doing whatever magical stuff is going on here. A note about these diagrams: they are very empirically derived, from things like looking at port pressure. So we're not 100% sure what the individual operations are doing, and sometimes, you know, when things finish, I'm not totally sure it's right.

So yes, we're saved by the micro-op cache. The micro-op cache sits logically to the side of this whole complicated pipeline we've just been talking about. And we're in one of two modes: either we're filling the cache, or we're reading from it. So unlike the regular L1, L2, L3 caches, it's not like every read goes and asks, "are you there? No? Okay, I'd better go and do it the hard way." You're just in one mode or the other. And it's jumps that determine whether that switch happens.

So yes, in caching mode, as we're running through our program, the micro-ops that the translation engine has been producing (the MITE is the micro-instruction translation engine; again, Intel loves coming up with complicated names for things; it's also called the legacy pipeline) get pushed into this cache; they're just written in as we go. Or, if a jump happens and the branch prediction unit, or whichever part it is, notices that the destination of that jump happens to be in the micro-op cache, then we start streaming from the micro-op cache instead, and that is the happy place you want to be. The result of that is, again, just micro-operations.

The micro-op cache is the weirdest cache you've ever seen, because it has a really difficult job. If you think about a normal cache, like an L1 or L2 or whatever, you've got a one-to-one mapping: this byte is in that cache line, and you just fetch the whole cache line around it, whack it in the cache, and you're done. But if we jump to an address that's not on a cache line boundary, well, we decode that instruction, but we can't go backwards and say, well, what was the instruction before it? You just have to start at that point. And so there are a lot of really weird restrictions on it. This has changed a lot since Skylake, but Skylake has 32 sets by eight ways, and each way (cache line) holds up to six micro-operations. Some of the micro-ops take two slots: if you've got one of those movs with a 64-bit immediate, you need two slots in order to store the 64-bit value.

It is inclusive with the L1i, which is a clue to how it's implemented. And the weird thing here is that no more than three ways can be used by any 32-byte block of code. This, to me, smells like they had hit a real snag, because morally this is very similar to the trace cache that the Pentium 4 used. And, I don't know, you all look too young, [laughter] but the problem with that was that you have as many entries in the cache as there are traces through your code, and so you very quickly blow up as the traces change. So I think they try to limit that by saying: if any block of memory, any small area of the program, needs more than three lines, then we're probably in the case where it's about to start monopolizing the whole cache, and we'd rather throw it away and prevent it from being cached at all.
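Using the Skylake-era numbers above, that restriction gives a quick rule of thumb (my own back-of-the-envelope check, not an official formula):

    // A 32-byte window of code can map to at most 3 uop-cache ways of 6 uops
    // each, so more than 18 uops in one window can't be served from the DSB.
    constexpr int kUopsPerWay = 6;
    constexpr int kMaxWaysPerWindow = 3;
    constexpr bool fitsInUopCacheWindow(int uopsInWindow) {
        return uopsInWindow <= kUopsPerWay * kMaxWaysPerWindow;
    }
    static_assert(fitsInUopCacheWindow(18), "18 uops just fit");
    static_assert(!fitsInUopCacheWindow(19), "19 uops spill out");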

It's worth saying that any branch, even a not-taken branch, ends a cache line, and then it starts caching from the next thing. So it tries to find sensible boundaries. One of the take-homes from this is that if you're generating code and you have more than one entry point into the same 32-byte block, then you might not be using the DSB as effectively as you'd like. And I know it's quite typical for compilers to start with, hey, let's do some setup code, then we jump down to the end of the loop, which does the loop-end test, and then it jumps back to just below the setup code to run around the loop body, and you're like, ah, now I've got two entry points. But you have to think about it; two is probably fine. And, yeah, I forgot to say this before, but this is why we typically align loops on 16-byte boundaries: so that you get the benefit of the fetch system picking up the loop efficiently. Okay. But yes, in newer machines it's more like six ways per 64 bytes, so it's a slightly different thing, more closely tied to the L1 cache. All right.

And it can deliver anywhere between four and six micro-ops per cycle, which means that whether the instructions are complicated or simple, you can get a pretty consistent stream out of it. The literature says six; I could only ever measure four, so, I don't know. All right. Yes: the loop stream detector.

So in between this source of micro-operations, which is coming from either the legacy pipeline or the micro-op cache, there is a queue. Obviously, we love queues; queues buffer. If the renamer or the execution units are backing up a little bit, we've got a bit of space in here. This queue is effectively a big circular buffer of a couple of hundred entries that we're writing micro-ops into in program order. And what the loop stream detector does underneath is it says, "Wait a second. You've jumped to a location that's actually already in this buffer. It may have already popped off the end of it, but we haven't overwritten it yet. Wouldn't it be better if we just held onto that part and kept streaming back that same sequence of micro-ops over and over and over again?" And that's what the loop stream detector does. And that means it can turn off the whole front end: there's no caching that has to be done, there's no decoding. It's just brilliant. But it's relatively small and it has limitations.

It spots loops that fit in the queue entirely. It can deliver up to four micro-operations each cycle, but it can't deliver more than one loop iteration per cycle. So if you have a really small loop, or a loop whose micro-op count isn't a multiple of four, then you might get, say, if it was five micro-ops: four in the first cycle and one in the next, then four and one, and four and one. Except it can unroll: it actually spots the longest repetition it can, up to eight times on Skylake, so it'll effectively unroll that loop in hardware and go, well, now if I take my five micro-ops and do four iterations of them... my math is not good. Five fours are 20, and 20 divides by four. Yes, I think we're good, right? Yeah, something like that.
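For the record, here's that arithmetic worked through as a tiny calculation (my own illustration of the example above):

    // LSD throughput for a loop of uopsPerIter uops, replicated 'unroll' times,
    // when up to 4 uops issue per cycle.
    constexpr double cyclesPerIteration(int uopsPerIter, int unroll) {
        int totalUops = uopsPerIter * unroll;
        int cycles = (totalUops + 3) / 4;   // round up to whole cycles
        return static_cast<double>(cycles) / unroll;
    }
    static_assert(cyclesPerIteration(5, 1) == 2.0, "4 + 1 uops: two cycles");
    static_assert(cyclesPerIteration(5, 4) == 1.25, "unrolled 4x: 20 uops in 5 cycles");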

Then it would just fit, and in every cycle it can stamp out four, which is pretty cool. Which is great. But why was it in parentheses, and why did it have a star? Well, it's disabled on Skylake, due to a nasty bug. And this is where I'm actually going to have to bring up my notes; sorry, you're going to have to see speaker view for a second. Yes, here we go. It was described as a "nightmare-level" bug in the Debian fix: short loops using the AH, BH, CH or DH registers, you know, the high part of the 16-bit register, as well as the corresponding wider register, may result in unpredictable system behavior. I mean, whoa, that is scary stuff. So, props to them. I don't know exactly what it was, and there are people in this audience who probably can tell me, but something about the OCaml runtime or the OCaml compiler liked to do this: in a loop it would use the high eight bits and the low eight bits, maybe for tagging, I don't know. I'm looking over at you two because you're my favorite two to look at, but [laughter] yeah, it was found by some folks in the OCaml community and reported, and then Intel determined it couldn't be fixed, and they just issued a microcode patch which turned the whole thing off. Wow. Okay.

Yes. Yes. Oh, [laughter] we did weird things so you didn't have to, right?

Okay. Renaming. So now we get to the cool part. Now, again, for those of you who have done college-level courses in this kind of out-of-order execution, this is probably old hat, but I'm going to go over it anyway because it's just really interesting. And the penny dropped for me, writing these slides, about how important this whole process is. The renamer has probably the biggest job in the entirety of the front end, because it has about three different jobs. The first thing that happens is that the micro-ops have now reached the point where we can say: this is definitely going to happen. Well, it's definitely speculatively going to happen; it might be undone later on when we discover we shouldn't have come this way. But we're going to write it into the reorder buffer, which just says this instruction happened, or will happen. We also rename the architectural registers to physical registers. So we take the EAXs and the RDIs and whatever, and we use this vast array of on-chip register resources instead; I'll talk about that in a sec. And then it also takes the micro-operations and ships them off to the right execution units to wait, to sit in those schedulers until either the execution unit is ready to accept a new operation (maybe, you know, the multiplier is backed up and there are other instructions ahead of it), or maybe this instruction depends on the result of a previous instruction that hasn't completed yet, and so it'll sit there and wait until all of its dependencies come in. But this is the point at which we fan out into all the various data structures, and then the out-of-order system picks it up on the other side. Amazingly, although it's doing all this complicated work, it can do four micro-ops a cycle. This is staggering to me. Maybe this is the kind of thing that in hardware is a lot easier than it sounds, right? This is the kind of thing you discover.

So the first thing we do is the reorder buffer, right? This is just to get our ledger for in-order completion later on. There is a process called unlamination that happens at this point: sometimes a micro-op that came through is determined to be more complicated than can be kept as a single entry, and so it gets broken up at this point into multiple pieces. Some of the more complicated addressing modes sometimes fall into this category, for operations that could otherwise go to, say, the adder; it's like, well, the adder can't do this. I'm a little vague on that, I'll be honest with you. But unlamination is a thing that can generate extra micro-ops at this point.

There are 224 entries in the reorder buffer. That's the upper bound on how many things can possibly be in flight at once in the system being executed. It's more like 500 on, I'm looking at you two in the front again for this, I think it's about 500 in Granite Rapids and above. So that's a place where they have changed a lot and expanded.

It looks something like this. We have some kind of sequence number for each of the instructions, in order. The information it needs to track is where we're writing to, and what rename happened. We'll get to why we need that information; it's sort of weird that we need both the new and the old version, partly for undo (you know, if we were to talk about the branch stuff), but partly for another thing that I'd never considered before. And then we've got some kind of type information, and something to do with, if it's a branch, where its restore state might be stored. And then the important thing over here is that while all of this is written at rename time, this part over here, the execution and exception status, is going to be written once the instruction has actually finished. So it's going to go through the rest of the pipeline and come back here. In the example I've got up here, this first instruction, this load, has completed, which means it's gone through the system. It's finished, it's done with, and it could be retired at any point; the retirement unit could finish it off. And you can see... I can't remember what I did. Oh yeah, it's got a source down here of P22; there's sort of an implied dependency there. It's not really that important. And again, we don't really know what this looks like; this is the minimum set of information I think it needs.
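Written down as a struct, my guess at that minimum set (all field names invented, inferred only from the behaviour described above) looks something like this:

    // Rough sketch of a reorder-buffer entry: filled in at rename, with the
    // completion/exception status written back when the uop finishes.
    #include <cstdint>

    struct RobEntry {
        uint64_t seq;           // program-order sequence number
        int archDest;           // architectural register written, if any
        int newPhys;            // physical register allocated at rename
        int oldPhys;            // previous mapping, kept for undo and freeing
        bool executed = false;  // set when the uop completes
        bool faulted = false;   // exception status, also set at completion
    };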

The issuing step is where we actually take the live, to-be-calculated micro-op and hand it off to the execution units. The execution units are in two big blocks that have many ports on them. It's weird terminology, but there's an ALU block that has all of the things to do with arithmetic and logic, as you might imagine, and then there's a memory block which only does loads and address generation. They have different numbers of entries; they are presumably balanced by some kind of internal pressure, both the layout and the complexity of those things, and also, you know, the experiments that Intel do in terms of how many of each type of instruction is typically live at once. The issuing step also determines which port within each of these blocks the micro-op will use. These are further subdivided, but this is the point when it's decided where it's ultimately going to go. So even if it's an add, and there are, like, three different locations on the chip where that add could happen, we decide now, at rename time, which of the adders it's going to go to. That's a little bit myopic, because by the time it actually gets ready to run, maybe that's not the best choice of adder anymore; maybe there are some free adders somewhere else. But it's a simplifying assumption. The algorithm has been reverse engineered, with high probability that they got it right, and it's about as simple as you might imagine: it's to do with the balance of how many operations have gone to each port recently, and it tends to pick the higher port numbers if they're more free. And if there's a tie, then of the four micro-ops issued in a cycle, the first and third go to the best place to schedule them and the second and fourth go to the second-best place, so it kind of tries to balance them as well.

Again, it's documented; you can go and read the Python code that does the same thing. At this point also, separately from unlamination (which actually makes two entries in the reorder buffer for an instruction that we notice later on is a bit more complicated), anything that's actually micro-fused, like that add with the RDI memory operand, gets separated into the load component, which is going to go over here into the load unit, and the add part, which depends on the load and is going to go into the ALU component. So the difference between unlaminated micro-ops and micro-fusion is that a micro-fused op only has one slot in the reorder buffer, whereas the unlaminated stuff actually takes up more resources in there. Not that that's a huge problem with 224 of them. All right. Good, good, good, good.

So again, I think really the reason for me putting this up here is to show you what information is being stored in each part, because for the longest time I had the reservation station (which is the scheduler, sorry, the thing that the issue step writes to) and the reorder buffer kind of mixed up in my mind. So this is more a symptom of me explaining it to myself so I could talk to you about it, but hopefully you'll forgive me. It looks something like this. We have a bunch of micro-operations; these are no longer in any particular order. We have maybe three sources, and we've got a destination that it's going to write to, and it's an add, and, well, this is a guess. I have been unable to find anything that tells me whether the values of known quantities are stored in the scheduler, in this micro-operation entry, or whether it's just a reference. The way that I've chosen to show this is that this one has two resolved sources: the instructions that generated these two numbers have already completed. Maybe this one was actually an immediate in the opcode and this one was read from a register. These two have already completed, so I've written the values in here. And for these guys here, I've shown that I'm waiting for a previous instruction: this one, in fact, which is going to write to P24. I'm going to wait for it, and when it's completed, it'll tell me what the result is and I'll put it in here. We don't know if that's actually what happens, as in whether this broadcast action causes these slots in the scheduler to get written to.

There's some circumstantial evidence that maybe it does, because there's a broadcast bus. But it also opens more questions about what happens if you are issuing instructions that have known quantities and then have to read an arbitrary number of physical registers in one clock cycle. If every instruction had three sources, each of which wasn't already in flight, then issuing four of them would mean reading from the register file four times three, twelve times, in one cycle. And that doesn't seem realistic either. And no one has been able to find a limit to how many you can read, unlike Sandy Bridge, which had a very limited number of read ports. So something magic is happening; we don't know what.

Anyhow, let's talk about renaming. First of all, the nuts and bolts of renaming. What renaming is going to do is unlock more out-of-order opportunities down in the rest of the pipeline. We're going to break every dependency that we can, because many times when we say EAX, we just mean the current value of EAX: we're going to do something to it, then we're going to throw it away, then we're going to load something new into EAX, and we just keep doing this process over and over again. Those things are separable from each other. One example of somewhere the values of a register don't depend on an earlier value of that register is a loop, because we're going to read into EAX every time around the loop, do some work with it, store it somewhere or whatever, and then start the loop again at the bottom here. So if we can break that dependency, it means that we can run maybe two iterations of the loop at the same time, or even more. I'll show you what I mean.

We've got pictures coming, for those of you who are furrowing your brows. The way that this works is we have a table that tells us where each register currently is: which of the physical registers (again, this is my made-up syntax, P-whatever) currently contains the most up-to-date value of RAX. So RAX is P39, and so on. And then there is a free list of registers: these are registers that are currently spare, chip real estate going spare, fabulous. Every time we read from a register, we're going to use its current mapping in the RAT. Every time we write to a register and completely replace it, it might as well be a new register, so we're going to pick a new one off the free list, use that instead, and then update the table to say this is where the RAX register now lives.

So, this first instruction: we're reading the RDI register, and we're going to go and fetch memory through it, and then we're writing to EAX. It really doesn't matter what was in EAX before; P39, we don't care about you. So what we're going to do is rename this to be P45, and hopefully you saw that that went that way: that's me popping it off the front of the free list. This is in hardware, remember, by the way; this is pretty impressive. And then P22. And then this imul here is EAX with itself, right? I'm squaring EAX and I want the result in EAX. And I know the opcode doesn't have three operands (or it can, for some multiplies), but it's been rewritten into this nice RISC-y form by this point. So when we read the EAXs, we're going to use P45, and when we write, we're going to use a new one, which will be P46. And that is indeed what happens. And so on and so forth; we can go through the rest of it here. Now we've got this whole rewritten set of micro-operations that are phrased purely in terms of physical registers. And we've encoded the dependencies between instruction values: if you follow P45 in here, it's used here and then it's done with. So that has helped us.

So this is what that loop looks like, or rather a loop iteration and a half; I've got one and a half because that's what fits on the slide. If we didn't have renaming, this is what it would look like. The first iteration of the loop is from here to about here. And you can see that the next loop iteration can't even start, because, well, I can't read through RDI until I've finished adding to RDI from the previous iteration, right? I have to wait for it. And we couldn't add four to it to move it forward until we'd finished reading the memory at RDI, because presumably I still need the old RDI value. Again, I had to hack the code quite badly to make it do this, and I then went through it and manually fixed it up, but you get the idea. If we show you the zoomed-out view, it looks like this. If you see these kinds of diagrams, you can squint and just look at the Rs, really, and look down the column. And this is, I mean, mostly throughput rather than latency, but it's 10 cycles an iteration. Doesn't seem totally unreasonable.

obviously you'd use SIMD if you actually trying to do this for real right you get more per iteration but 10 cycles an in iteration if we rename it on the other

If we rename it, on the other hand, it looks like this. On the very first cycle we've already queued five of our instructions and issued four of them. The first and the fourth can both complete on cycle six, because the add can run concurrently with the load, since they don't depend on each other any more, and so on and so forth. And you can see that every time there's an E, there's usually a D right underneath it: the E means it's finished executing, the result becomes ready, and it can immediately be used by the instruction that depends on it. This is evidence for the broadcasting theory, where as a result completes it just announces "anyone who cares, P45 is now 27" and the consumers go "oh cool, got it" without having to wait another cycle. So that looks pretty cool, and if we zoom out it's about one and a half cycles per iteration. It really makes a difference. It's super cool. All right, I've labored that enough.

Right, let's talk about my favorite instruction: xor eax, eax. You've all seen it, right? Yeah. Why the hell does it do that? Anyone want to shout out?
No? [laughter] Nobody wants to shout out. Sorry, did someone say something?
>> It sets it to zero?
It sets it to zero, yes. But why not just do mov eax, 0? Come on.
>> Smaller instruction.
It is. It's a smaller instruction. But wait, there's more. xor eax, eax: let's go through it first of all. Exclusive-ORing a number with itself always gives zero, right? And because of the slightly odd syntax of x86, this is eax ^= eax, so it ends up as zero.

Fine. The compiler loves to do it. But the cool thing is the CPU knows that you love to do this as well. So it goes: oh, if you're setting a value to zero, I don't need to do anything at all. All I'm going to do is arrange to have a physical register that always holds the value zero, and I'll just rename to it. Okay, rax is now P0, and anything that reads from it later on just gets the value zero. Right, we're done here; it doesn't issue at all. So although it's written to the reorder buffer, it doesn't go to any execution unit. It doesn't take up any resources at all. It's magic.
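You can see the compiler's side of this bargain easily enough. The function name here is hypothetical, but both mainstream compilers really do emit the xor form for it at -O2:

```cpp
// At -O2, GCC and Clang typically compile this to:
//     xor eax, eax
//     ret
// i.e. the dependency-breaking zeroing idiom, not "mov eax, 0".
int always_zero() { return 0; }
```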

We can also do some ones idioms: there are some parallel-compare-type instructions that generate all ones, so presumably there's some other magical always-all-ones register for those. Okay. We can also do move elimination, because if you think about it, if I move RBX into RAX, all I've really done is give two names to the same register. So we just update the RAT and say, well, RAX is now P88, which happens to be where RBX lives at the moment.

And there's no issue, as in it doesn't issue: nothing gets written, there's no actual work to do, no ALU operation needs to happen. But there is a complication, which is surprising. There are limits. Have I got a slide? Oh no, it's here; I'd gone ahead by one. So there are limits, right? Because now we're allowing a single physical register to be aliased by more than one architectural register, which means we suddenly either have to do reference counting to work out when we can free that thing back up again, or we need another trick to determine when everyone has finished with that particular physical register.

What it seems to use is four slots for these aliases. Once one of those slots is in use, because two architectural registers point at the same physical one, then until all of those registers have been overwritten, that P88 is locked: it can't be freed, and that slot can't be used by anything else. So you can do four movs in a row and they're all free, and after that it becomes really expensive to move between registers, until you overwrite both the source and the destination that alias each other. That has gotten better, and I think they now use a bitmask to track this in a much cleverer way, because on Alder Lake and onwards they have another trick.

I know I said I was sticking to Skylake, but this is just too cool not to talk about. Adds and subtracts are so straightforward that the renamer can go: hang on, an add with one? Why don't I just rename it as "rax is P88" and simply remember it's got a one added to it. How cool is that? So again, small increments and adds don't take up any execution resources at all; they just get written down as a housekeeping note inside the RAT. We do have to pay the price at some point, though, for two reasons. One, the range is only about plus or minus 1,024; it's got 11 bits of range. And if it hits that limit, it actually does issue the micro-operation that says "oh, fine, do the add", and then we know where we are. And of course after that it can carry on renaming relative to the new value. So it's pretty cool.
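Conceptually, the RAT entry grows a small immediate field. This is purely a sketch of the reverse-engineered behaviour described above, nothing Intel documents, and the structure is invented for illustration:

```cpp
#include <cstdint>
#include <cstdlib>

// Conceptual RAT entry once the renamer can fold small additions:
// "rax is physical register P88, plus 3".
struct RatEntry {
    int     phys   = -1;   // which physical register holds the base value
    int16_t offset = 0;    // folded-in immediate, roughly 11 bits of range
};

// Returns true if the add could be absorbed into the mapping. False means
// the renamer gives up and issues a real add micro-op to materialise the
// value in a fresh physical register.
bool try_fold_add(RatEntry& e, int32_t imm) {
    int32_t folded = e.offset + imm;
    if (std::abs(folded) > 1024) return false;   // out of range: issue the uop
    e.offset = static_cast<int16_t>(folded);
    return true;
}
```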

Interestingly, if you think about what this really means: when P88 becomes ready, anything that depends on P88 has now been handed the burden of actually doing this damn add. For most instructions that's fine: okay, I'm reading from P88, and there's an adder that runs in parallel, so by the time the result gets to me it's already had that offset applied. But the shift, the logical shift, for whatever reason takes an extra cycle every time. And this is how we discovered this is going on, because it's not in the literature anywhere: if you do mov rax, 10 and then a shift, you'll measure the shift taking one cycle. If you do mov rax, 0 then inc rax and then the shift, the shift now takes two cycles, because it pays for doing that add in the cycle beforehand. We don't know why it does that, but the guess we have is that for things like variable shifts, where you're shifting by the value of another register, the barrel shifter needs a lot of setup time; it's a lot of circuitry to do an arbitrary 64-bit shift. We figure it probably needs its inputs a lot earlier in the cycle, so it hasn't got time to do the add and then set all that up, and rather than penalising only the variable shifts they just said "all shifts". Or it's a mistake; mistakes have been made before, as we've seen. So this is super cool, really interesting, and there's a whole blog post, which I think I've got in the references at the end, that's fascinating and digs into this. Okay, we are fifty-odd minutes into the presentation and I don't think I've lost too many people.

Most people's eyes are still open, or they're awake, or both. This is the summary, this is where we are, and this is the meat of the talk, just getting to this stage. We have a predicted program order, which gives us a sequence of instruction pointers; those are fetched; they're decoded into micro-operations; those micro-operations are renamed; the reorder buffer is written to in program order; and then the micro-operations that are pending execution get written into the scheduler.

Great. Okay. The back end looks really simple in comparison, but it hides the thing we actually care about, which is actually doing the work, right? Everything else is just preamble. So: I've already mentioned the reservation station. The scheduler and the reservation station are two names that Intel uses interchangeably, or certainly Agner does, and that's where I'm getting a lot of my information. Then execution, and then the write back. So it looks something like this. We've gone from that nice pipeline to a big soup of things that can be done at some point.

So we've got these schedulers, reservation stations, whatever we want to call them, with their 39 entries, which just sit there waiting. Sorry, not waiting for either thing: waiting both for all of their operands to be ready and for an execution unit they've been tagged to go to to become free. That's what they're sitting there in these queues for, but again, the queues aren't really in any order; it's just whatever's ready. We've got the reorder buffer, and the write back to the reorder buffer is just to say this instruction is now done, it's completed, there don't have to be any other values recorded. Way back when, in, oh crap, what was the one before Sandy Bridge? I'm forgetting now.

They have such stupid names; I really wish they'd come up with a better naming scheme. Anyway, earlier on the reorder buffer actually contained the physical registers; it was effectively used as the physical register store, and the results of everything were written back into the ROB. That's no longer the case; they live down here now, because once we added vector operations, if you had 512-bit-wide slots up there it would be a lot of waste, unless you're writing AVX-512 all the time, which maybe you are; I know you can't tell me. Maybe you can. No.

So we've got the reservation station here, and various ports attached to it. Broadly, the big picture is: every clock cycle the scheduler decides which of these micro-operations are ready to run and issues them to one of these ports, each of which has several actual logic units behind it. You'd think this one is just an adder and that one's a multiplier or whatever, but they each have different mixes of capabilities, and those units are pipelined themselves. So they have their own pipelines, but they're fixed length, and when the result comes out, five cycles later, three cycles later, whatever, it comes back on this writeback bus, which says: yes, P47 is now 127. Maybe it's written to the physical register file; well, it's definitely written to the physical register file. Maybe it's also snooped on by some of the other entries that are waiting for that result, and they become ready to run. And then the whole process just keeps going. It never stops, right?
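In software terms, that wakeup-and-select behaviour looks roughly like this toy model, with none of the real sizes, port constraints, or age-based selection heuristics:

```cpp
#include <vector>

struct RsEntry {
    int  src1, src2;                        // physical register tags we wait on
    bool ready1 = false, ready2 = false;
    bool issued = false;
};

// One cycle: broadcast the tags that just completed, wake up anything that
// was waiting on them, then pick ready entries to send to execution ports.
void one_cycle(std::vector<RsEntry>& rs,
               const std::vector<int>& completed_tags,
               int free_ports) {
    for (RsEntry& e : rs) {
        for (int tag : completed_tags) {    // "anyone who cares, P45 is ready"
            if (e.src1 == tag) e.ready1 = true;
            if (e.src2 == tag) e.ready2 = true;
        }
    }
    for (RsEntry& e : rs) {                 // select: real hardware is smarter
        if (free_ports == 0) break;         // about which ready uop goes first
        if (!e.issued && e.ready1 && e.ready2) {
            e.issued = true;
            --free_ports;
        }
    }
}
```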

Fabulous. Okay, so looking at the ports we have: this again is Skylake. There's a whole bunch more on the more up-to-date Granite Rapids-type parts, but by and large you can see where all the engineering effort went, down on this poor thing that gets all the work. There are three units that can do integer and vector operations on simple integers; there are two that can do shifts, one that does the permutes, some string operations and so on. The number here is the latency, incidentally, and if you're using those, you're doing it wrong on x86, for what it's worth. So the execution ports over here do all the arithmetic-type stuff, except for one of these kids that doesn't look like the others, which is the store unit; it's a bit weird that the store unit doesn't live with the load/store unit, but whatever.

Over on this side we've got a bunch of load ports that can do loads and can also generate the address for a store, and then one kid that can only do simple addresses. I don't know what happened there; maybe they copy-pasted it twice, didn't have enough room, and edited it down a bit: "well, we can't do the complicated addressing here, we'll just have that kid do the simple ones." Cool. Right, I forget where I was going with this. Never mind, I think you get the idea: there are a lot of things that go to different ports. And typically, if there's a tie, it picks the highest-numbered of the available ports, because the higher ports can do fewer things. So if it can do a vector shift on this one, that frees up ports zero and one to do whatever other weird thing might be coming down the line.

All right, so the address-generation-slash-load unit generates addresses (surprise) and it does the loads. Who'd have thought? What does that really mean? Well, we're going to have to introduce something called the MOB, the memory order buffer. This is probably the most complicated part, and we only have about four minutes to do it, so there are necessarily going to be some simplifications here. Some of the simplification is in my understanding; some of it is that no one has come up with a way of measuring exactly how this works yet; and some of it is, I don't know, user error on my part. So there are three kinds of parts to this: there's a load buffer and a store buffer, and there are some predictors, which are absolutely bonkers. We talked about some of that earlier, and I don't think I have time to go into the depth we did then, but there are predictions all over the place.

So the store buffer holds all of the stores that have not yet gone out to real memory. Remember, everything we're doing is speculative at this point, right up until the instruction retires. We haven't got a cast-iron guarantee that something upstream of it, something that happened earlier in program order, won't go "oh, I just divided by zero" or "oh, I mispredicted a branch" or anything like that. So until it gets to the retirement unit, we can't let the store go out to real memory, right? We have to hold it in a sort of holding pattern. But of course, if I store to memory and then the next instruction reads from memory, I might reasonably expect to see the number I just stored, right? It would seem quite bad if I didn't, so we have to do something about that. This store buffer wears many hats: one of them is holding the stores until they can go out to memory; the other is handling the loads that come after them and might need to read values that haven't yet been committed. It's broadly broken into two parts, with the address calculation and the data calculation handled separately from each other.

This is why the store-data unit lives with the ALUs: where do the values being stored come from? Probably the ALU. And the address is generated by the address-generation unit. The point of separating them is that most of the time, maybe not all the time, or some of the time, let's just go with some of the time, you know the address you're going to store to earlier than you know the data that's going to be stored there. Say you square-root something twenty times in a row, so you've got this huge long dependency chain of square roots, and then you store the result to address 20. I know it's going to store to address 20 pretty early on; I'm going to have to wait 500 cycles before those square roots have all completed. Why do I care about the address? Well, if I've got a load coming along, and it's not to address 20, I can go ahead and do that load. So I need the address more than I care about the data, for the purposes of letting other instructions later in the flow get their results. So in each entry we have the address, we have the data, we have how wide the store is, whether it's completed or not, and whether or not it caused a fault.
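As a data structure, a store buffer entry then carries roughly these fields. Again, this is my sketch of the idea, not the real layout:

```cpp
#include <cstdint>

// One entry in the (conceptual) store buffer. The address and the data
// arrive separately, from different execution ports, at different times.
struct StoreBufferEntry {
    uint64_t address = 0;
    uint64_t data    = 0;
    uint8_t  width   = 0;            // 1, 2, 4 or 8 bytes
    bool     address_known = false;  // the AGU has produced the address
    bool     data_known    = false;  // the ALU has produced the value
    bool     faulted       = false;  // only acted upon at retirement
    bool     senior        = false;  // retired: safe to commit to memory
};
```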

Obviously, if a store causes a fault but it was on a path we never actually get to, because we're going to divide by zero before we reach it, or, Spectre and Meltdown notwithstanding, whatever, then we don't want the fault to take effect; so when it comes to retirement, we check whether it faulted or not. So for stores, we have to wait for both the address calculation and the data operation, then we sit in that buffer serving incoming reads, and eventually we actually commit, and that happens much later, after the instruction retires. The load buffer, on the other hand, holds the loads that are in flight, plus some status about the prediction stuff that I alluded to but am not going to go into.

This is used for tracking some guesses: it makes some intelligent guesses about whether or not a load could ever alias an earlier store, in which case it can issue the load really, really early. The problem with that is if it gets it wrong, it's incredibly expensive: it has to do a whole bunch of state clearing to undo the fact that it let a load happen that was actually dependent on a store before it that hadn't run yet. There isn't an undo mechanism for that, so it just throws everything away. So in each load buffer entry we have some kind of reference to where it sits in the ROB, how wide the load is, and what address it's being served from. And again there's an ordering in here; the "store color" is a reference, that's what it's called in the literature, and there's a patent that talks about store color without going into detail about what it really means. If you look at RISC-V implementations of the same idea, they use a complicated bitmask, but effectively it's some way of saying which stores could potentially have affected this load, as in where it sits in the sequence of stores, so that I've got some way of tying things together and knowing that once all previous stores have been accounted for, I can do this load.

All right. So loading is really, really complicated, which is one of the other reasons it's slow. We talk about L1 cache, hey, it's three or four cycles to get to L1, but it's got to do all of this as well, right? First we do some prediction about the ambiguity; again, out of scope, unfortunately. We have to wait for the address to be calculated; hopefully that's quick. And then we have to check the store buffer. Now, if we get a hit in the store buffer, and there are no intervening stores between the hit we found and us that haven't yet been resolved, then we know that no store earlier than this load can possibly affect the value other than the one we found. And we can go hooray: we don't even have to go to L1. This is like an L0 cache. Here you go, here's your number.

If that's not true, if we miss, as in we don't find a matching address in there, but there are also no earlier stores whose addresses haven't been resolved yet, then we know that nothing still sitting in the store queue waiting to be flushed out to memory can affect this load. So we can just issue the load now, and it will finally go to the L1 and start the whole process of actually reading from memory. Obviously the load eventually completes, comes back, and we send the value off to the program. It's a miracle that any of this stuff works, right? [clears throat] And if neither of those holds, if we're still in this ambiguous state where there are unresolved earlier stores, we just have to wait. We sit in this queue until all of those stores have been resolved, which is why we try to predict whether that's going to happen; sometimes the prediction is worth it.
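Pulling the three cases together as a sketch (ignoring widths, partially overlapping accesses, and all the prediction machinery; PendingStore is a cut-down stand-in for the store buffer entry above):

```cpp
#include <cstdint>
#include <vector>

struct PendingStore {
    uint64_t address = 0, data = 0;
    bool address_known = false, data_known = false;
};

enum class LoadResult { ForwardedFromStore, SentToL1, MustWait };

// Walk the stores older than this load, from youngest to oldest.
LoadResult try_load(const std::vector<PendingStore>& older_stores,
                    uint64_t load_addr, uint64_t& out_value) {
    for (auto it = older_stores.rbegin(); it != older_stores.rend(); ++it) {
        if (!it->address_known)
            return LoadResult::MustWait;      // an unresolved older store might
                                              // alias us, so sit in the queue
        if (it->address == load_addr) {
            if (!it->data_known) return LoadResult::MustWait;
            out_value = it->data;             // hit: the "L0 cache" case
            return LoadResult::ForwardedFromStore;
        }
    }
    return LoadResult::SentToL1;              // no older store can affect us
}
```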

Okay. Yeah, there's a patent number here, which I'm sure you're not allowed to click on because of all the legal things, but let's not go there. All right. The other thing the memory order buffer is involved in is all of those horrible locked instructions you want to do, and fences. Store fences and load fences are operations that cause these buffers to drain out before any further instructions that issue loads can happen, that kind of stuff. And, as I sort of vaguely alluded to, I was going to put just the words "total store order" here, but my confidence in that has been irredeemably shaken by our conversations earlier. Where's Andrew? He was here earlier, I'm sure. Yeah. So there's definitely something that goes on with this, but it happens somewhat later on.

So finally we get to talk about execution. We are now exactly an hour since I started talking to you all, and we're going to talk about execution, which is luckily the most boring stage, because it just does what the old CPUs used to do, right? It's pipelined. There are machine traps. Interestingly, this is where floating-point assists happen, if anyone has ever hit the hideous performance cliff of denormals. To the best of my understanding, what happens is that the floating-point unit can't handle numbers that aren't in the usual one-point-something-times-two-to-the-whatever normalised form. So for very, very small numbers below float min, or whatever the threshold is, there has to be special casing; the hardware can't do it, but it also can't determine that ahead of time. So by the time it gets to this point, the multiplier or the adder or the divider goes "oh crap, what do we do, we've got a number I can't deal with", and it essentially has to flush the pipeline, go back to the microcode sequencer and say "can you just generate me the code that can handle this, please". It's dynamically undoing things, a deoptimization, like a JIT deopt. Yeah, it's crazy.
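If you've never hit that cliff, it's easy to reproduce yourself. A minimal sketch, assuming SSE scalar maths and the usual flush-to-zero / denormals-are-zero control bits; how big the gap is varies a lot by chip:

```cpp
#include <chrono>
#include <cstdio>
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

// Repeatedly works on a subnormal ("denormal") value. With the fast-math
// control bits off, each operation can trigger a microcode assist; with
// them on, subnormals are treated as zero and we stay on the fast path.
static double run_ms() {
    volatile float x = 1e-39f;                 // below FLT_MIN, i.e. subnormal
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10'000'000; ++i) x = x * 0.5f + 1e-39f;  // stays tiny
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::printf("assists possible:  %.1f ms\n", run_ms());
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    std::printf("FTZ/DAZ enabled:   %.1f ms\n", run_ms());
}
```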

Anyway. The execution ports: we hope they're balanced in terms of latencies. If I go back to the earlier slide, it used to be the case that you could sort of see it (all right, I won't do it now), but they tried to make it so there aren't lots of different instruction latencies on a port. Because one of the other things that happens is: if two instructions on an execution port want to complete in the same cycle, say you've got a three-cycle one and a five-cycle one that was issued two cycles earlier, they both want to finish in that same last cycle, and that would delay one of them. So they try to avoid that by choosing sensible latencies, everything's a multiple of two over here, that kind of thing. There are some tricks to it. Gosh, where were we? So that's what I meant by that. And then, yeah, we've talked about this: the write back stage is where the results are broadcast, written to the physical register file (not the permanent register file, that's different), and the instruction is marked complete in the reorder buffer. All right.

Retirement. Retirement is the easy bit. All the retirement system is doing... I mean, yeah, we're all trying to retire, I know, I know, earlier rather than later. But finally everything is done. The reorder buffer is essentially that ledger again of all the instructions flowing through the system, and the retirement system tries to pick up as many completed instructions as it possibly can and say, "Okay, these are now done. They can be freed back off." And that's the other thing: it does the freeing.

So this is where the registers that were renamed are freed back to the system, now that the instruction has made it all the way through. But of course, for the op codes we've just seen, it's not the register that just got written to that we free. It's whatever the architectural register used to map to when it was renamed, right back at the beginning of the pipeline. So when we retire an instruction that completely replaces EAX, and EAX used to be P88 and is now P50, then at the point it retires, P88 becomes free. That's when it gets chained back onto the free list, and that's also why this is simple, provided there's no aliasing and no one else still has hold of P88 at that point.
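The subtle bit, that it's the previous mapping rather than the one just written that gets freed, is easier to see in a sketch. The structure is made up, continuing the earlier renamer example:

```cpp
#include <deque>

struct RobEntry {
    int arch_reg;     // which architectural register this uop wrote
    int new_phys;     // the physical register it was renamed to
    int prev_phys;    // what the RAT pointed at *before* the rename
};

// At rename time we record prev_phys. At retirement, once this uop is the
// oldest and complete, nothing can ever read the previous mapping again,
// so *that* is the register handed back to the free list.
void retire(const RobEntry& e, std::deque<int>& free_list) {
    free_list.push_back(e.prev_phys);   // not e.new_phys!
}
```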

So this is also where exceptions actually get handled. We said earlier that a fault is just a flag until now, and retirement is when the whole thing goes horribly wrong. Hopefully you're not taking too many exceptions in your code, and not exceptions like C++ exceptions; I mean exceptions like machine checks, divide by zero, memory faults. The stores are also marked as senior in that memory order buffer, because they're going to continue to live in there until they actually get flushed out to real memory. And the retire system can do four micro-ops a cycle, which is very rarely a limitation, but you can prove it with very, very pathological code. And then the last stage, which isn't really a stage at this point, is just that the store buffer retires stores, in order, as they become senior; senior means the instruction that created them has completed, so they're okay to be committed. And then some magic happens: something-something total store order, some kind of liaison with the other chips on the die, or the other subsystems. As far as I can tell, this is where the RFO, the read for ownership, actually happens, the thing we talk about when we talk about cache ping-ponging and false sharing. This seems to be when it happens. I'm being very vague on this, honestly; I was trying to work out where it could possibly happen, and I should probably talk to you all and get your opinions afterwards, but this seemed to be it. Okay.

Right. Well, that's it. It's pretty straightforward, right? And to think I was going to talk to you about branch predictors as well. Where's the next slide? There it is: yes, a conclusion. There's not really a conclusion to this; the whole talk is the conclusion. This thing is an amazing edifice that's been built, right? It's amazingly complicated. The advice, and I feel like I'm teaching your grandmother to suck eggs here, you all know this: do things simply and straightforwardly. Don't do divides; integer divides, that is, floating-point divides are fine, I don't care about those. Smaller, tighter loops are good. Let the renamer do its work. Again, I wrote this before I had the conversation with some of your experts, and the call for more architectural registers was really interesting to me because I hadn't thought about the implications, but my small loops seem to work, so I'll stick with those. The renamer is very clever at doing its work; you shouldn't need to worry about not having more registers, with, now, a massive footnote: speak to these people. I'm sure you all have questions. I've got a whole bunch of references here that you can go and look at, and I'll make sure these slides are available to you all so you can go and look through them. Thank you very much indeed.

[applause]
All right, I've got the cube there. Anyone brave enough? I have swag as well: you can come and get stickers before you go, or I have two special coasters, which are not very exciting because they're made of cardboard, but I'm trying to bribe you to talk to me. Somebody speak. Come on.

Oh, there is a question. Hello. All right, I'm going to try not to hit anyone, although I'm sorry, I don't think I signed anything saying I wouldn't cause physical damage.
>> Are the physical registers all equally wide?
Good question. No. The physical register file is broken into different pieces. I thought I had the numbers on one of the slides; maybe I did, but I didn't talk about it. There are different numbers of physical registers for the different types. There are several hundred flags registers, because they're small and pretty much every instruction affects the flags, but almost every other instruction doesn't care what those flags were, so you want to rename them away and make sure there are no false dependencies. For the AVX instructions, there seems to be a batch of full-width 512-bit physical registers, and the renamer seems to want to use those only when it knows it needs them; otherwise it tries to use 256-bit ones. Again, it's trying to save real estate. I'm waving my hands a bit here because I haven't measured this personally, but from reading the literature there's some cleverness in the allocation of those. And then the 64-bit general-purpose registers, there are just a couple of hundred of them, and I'm going to...
>> But they're at least 64 bits wide.
They are at least 64 bits, yes, and there's a set for each type of register. So the general-purpose ones are at least 64 bits wide. Yeah.

Another thing to note: I was very cavalier about saying rax and then showing things with EAX on them and so on. We all know they're the, quote, same register. It used to be the case, again in the Sandy Bridge era, that if you wrote to, say, AX, the register allocator gave you a new rax-like register at that point, and it also emitted an op that ANDed and ORed it with the remaining part of the old version of the register to combine them. That seems not to be the case any more; it just seems to emit them, and magic happens, and if you write to the bottom 16 bits you don't have to do any extra work. But yeah, they're all 64 bits wide, is my understanding.

Good question. Okay, one of you want to try and pass it over rather than... excellent.
>> Do you have much of a sense of how much variety there is in the design between these Intel CPUs and AMD, or even other high-performance CPUs?
Not really, actually. No. My childhood experience was all ARM, but that was very old and not very sophisticated, and since then I've really only been concentrating on the Intel CPUs. That having been said, the thing I know is different in AMD is the branch predictor, which is something I've done my own research on, and that has a very different ethos. But I don't know about these kinds of aspects of it; maybe someone else in here does. Sorry for the non-answer. And there's a question there.

>> I've heard two perspectives on whether the ISA matters: is ARM faster because of all the non-ISA things, or is it faster because of the ISA? What's your take?
So I think, back in those halcyon days when ARM actually was a RISC processor, when I could disassemble it just by looking at the instruction words because they were that straightforward, it probably had an edge against its x86 contemporaries. But I think that ship has sailed. With all the new whiz-bang features ARM can do now, they have just as complicated a set of decode steps and prediction and pipelines and out-of-order machinery and all that kind of stuff. So I don't know that it makes that much difference; it's more of an implementation detail of the front end, how the actual adds and subtracts get their way into the ALU. I guess one argument could be that the x86 ISA is sort of a poor man's compression, so you get a little bit more out of the I-cache, maybe, but we've all seen 15-byte instructions as well, and you're like, I'm not sure that's better. So yeah, another non-answer for you there, I'm afraid.
>> [laughter]

Question at the front. If you want to pass it carefully... here we are. Lovely.
>> How did anyone figure any of this out? What kind of crazy experiments are people running to work all this out?
So yeah, huge shout out: the OG for all of this is Agner Fog, who is a Danish, retired now as I found out last week, a retired Danish anthropologist, which is exactly the kind of person you'd imagine. [laughter] He seems to be more of a social anthropologist than a technology-focused one, or was, so quite why he does this I don't know. [laughter] He did a whole bunch of experiments to reverse engineer these things. The short answer is that there are performance counters, and those performance counters have revealing names, even if Intel doesn't really want you to know what's going on; you can use them and infer things from how they behave.

My own research from earlier was about answering a question; you know the XKCD, someone is wrong on the internet and you have to go deal with it. Someone had made a sweeping statement that backwards branches are always predicted taken, and I'm like: really, always? What if you don't take it? What if it's a conditional branch that happens to go backwards? Anyway, the long and the short of it is yes, it mostly is, and some unconditional ones behave differently from conditional ones, whatever. But the measurements were done with two counters in the Intel documentation about early resteers, and they don't tell you what a resteer is. It's just, oh yeah, the number of times the pipeline was resteered early, and you go, okay; and then late resteers, and then some other thing. Basically my inference, backed up by the evidence I got and the conclusions I drew, was: like I said at the beginning, you don't even know whether there's a branch in a 16-byte block of memory, right? So the BPU and the fetcher are just going to go 16 bytes, 16 bytes, 16 bytes, and as soon as a branch is detected, really early, like in that decode stage, it can send a message saying: whatever you're doing is probably wrong, because nobody told me there was a branch here that should be taken. That's an early resteer: only about three cycles of latency and suddenly you're back on the right path.

So by playing around with that, plus branching forwards and backwards, timing it, and seeing how far apart instructions landed (that's what those color pictures were), you can infer some of the patterns of what's going on. And for the much more complicated stuff, like the micro-operations, there are counters for how many micro-ops have been executed by each stage of the pipeline. So you can run experiments where you're just adding over and over again, and you say, "Oh yeah, the counts are going up on ports 0, 1 and 7, so those are the ones that can do this operation," and so on and so forth. And then you build huge dependency chains, try to do things behind the back of those dependency chains, and see which other counters go up: oh, we hit the point where the renamer couldn't rename any more, things like that. So there's a whole bunch of experiments like that.
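You can get the flavour of those experiments without any counters at all. This is the crudest possible version in plain C++: one long dependent chain versus four independent ones doing the same total work. The ratio between the two times hints at how much overlap the renamer and scheduler are finding for you; numbers vary by machine, and it is no substitute for the real tools:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Build with light optimisation (e.g. -O1) so the compiler doesn't fold or
// vectorise the loops away: the point is to measure the hardware.
static double ms(std::chrono::steady_clock::time_point a,
                 std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main() {
    constexpr int N = 100'000'000;
    volatile uint64_t sink = 0;

    // One long chain: every update needs the previous result, so each
    // iteration costs the full latency of the shift and the adds.
    uint64_t a = 1;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) a += (a >> 1) + i;
    sink = a;
    auto t1 = std::chrono::steady_clock::now();

    // Four independent chains, same total work: renaming lets the core
    // overlap them, so throughput rather than latency becomes the limit.
    uint64_t b0 = 1, b1 = 2, b2 = 3, b3 = 4;
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i += 4) {
        b0 += (b0 >> 1) + i;
        b1 += (b1 >> 1) + i;
        b2 += (b2 >> 1) + i;
        b3 += (b3 >> 1) + i;
    }
    sink = b0 + b1 + b2 + b3;
    auto t3 = std::chrono::steady_clock::now();

    std::printf("dependent chain:    %.1f ms\n", ms(t0, t1));
    std::printf("independent chains: %.1f ms\n", ms(t2, t3));
    (void)sink;
}
```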

Travis Downs probably has the best set of work on this; this is his blog, and he also writes something called uarch-bench, if I'm getting that right, which is a whole suite of microarchitectural benchmarks. Agner Fog has his own tools as well. I've got my own branch of that, which is what I use for the BTB stuff, and which I've now made compile again; it was broken as hell. And then Andreas Abel and, I can't think of the other person on his paper, in what I think was his master's thesis, spent a whole bunch of time working out the algorithms behind the renamer, which ports get allocated at what time, and how long the operations take. They found a whole bunch of things that no one else had ever noticed before, which is just bonkers. And I love the fact that we can experiment on something I'm carrying around with me all the time; you wonder how it works, and you can go and find out. But yes, the short answer is: all those weird counters, and then very careful timings. The sadness about this, and again we talked about this earlier, is that most of these things only work if you've actually got a physical machine in front of you, which obviously I do in the case of this laptop, and I do with my desktop at home. I tried to get a Granite Rapids machine, and it's expensive enough to hire them, and then I get to my Amazon account, I've got it running, I sudo the thing, and it's like: oh, there are only three hardware counters on this. I'm sure there are more than three hardware counters... oh, it's virtual, isn't it? So I couldn't do the measurements I wanted to on those.

And it's the same with magic-trace, right? You need a real, actual, honest-to-goodness computer to run it on in order to get an answer. Any other questions? Thank you, by the way. Oh, you've got one next to you, and then we've got one behind you.
>> Do you have a favorite x86 instruction?
Do I have... oh, no. I'm very simple, actually. I quite like one of the parallel comparison things that I always forget, but when I see it, it makes me chuckle, and I can't think which one it is; it's something like PCL-something. Yeah, good question, and it's the kind of thing I should have a good answer for, but I do not.

>> We'll accept least favorite, too.
Least favorite. Ah, least favorite. I mean... no, I wish I had a pithy and amusing answer for you, but you've put me on the spot. [laughter] What was it: they took one instruction away that used to be used for binary-coded decimal; they didn't bother porting it to 64-bit. And I'm like, come on. You have so much legacy crap, and you didn't bring this one over, and it's got a funny name. I can't remember what it is now, but anyway.

>> Yes: presumably we can use this to make our code fast, and Intel wants their CPUs to look very fast. So why is this all so hidden and impossible to describe? Why don't they just tell us more?
I don't know. I mean, again, it may be that within this building there are people who have access to information I don't. But I was certainly aware, in my time at Google, that if you buy a certain volume of chips from Intel, there's a sliding scale of how friendly they'll be with you [snorts], going from "we can send somebody to come and help you debug your problems" through to "we can fabricate your own specification of chip, and here's the design". The pink books, I think, and the yellow books: those were names for different levels of classified information. And certainly, when I had some very off-the-record conversations with contemporaries around the Spectre and Meltdown times, they were like, "Well, yeah, the pink thing, it talks about..." and I'm like, "What are you talking... what?" And they're like, "Have you not got those?" I'm like, "No." And they're like, "Oh, we shouldn't have said anything." I'm like, "Oh my god, even knowing that they exist."
>> [laughter]
So I think some people do have access to this information. But to answer your actual question: I think they don't want to promise things about how it works, because, what is it, Hyrum's Law, anything that can be relied upon will ultimately end up being relied upon. And of course that's why we still have instructions from the 1970s, although not my friend the binary-coded-decimal adjustment.

So they don't want to paint themselves into a corner, and presumably there's an intellectual property aspect to it as well, even though it gets reverse engineered. The branch predictor in particular, I think, is one such thing, because I do know that AMD's is very different from the Intel one. The folks who reverse engineered the Skylake branch predictor found a really interesting case; again, I was going to do some more slides I could flick to, but I didn't get around to them. Effectively, the predictor does various smooshing and hashing to come up with a fingerprint for a particular branch and the sequence of operations that led to it, so that it can keep a table saying: hey, the last time we were in this state at this branch, we took the branch. That kind of thing is what's going on in a branch predictor, right? These researchers discovered that bit five of the branch address was essentially direct-mapped. If it was clear, you were using one half of the entire branch prediction structure, and if it was set, the other half; effectively the hash didn't mix in bit five. And they used that to build a sandbox, Spectre and Meltdown style: if you're a JIT compiler, you can contrive the code you emit so that branches only ever happen when bit 5 of the address is set, and then make sure your supervisor and all the checks about are-you-in-bounds, are-you-not-in-bounds (I can see you people laughing) sit where bit five is clear. Now you effectively have two branch prediction domains, and you're not subject to Spectre and Meltdown across them. How cool is that? And I'm like, yeah, if they'd told us that, we'd have used it. But then of course they'd be like, oh, now we have to keep it this way; and they probably didn't even know themselves that bit five was leaking out like that. So I think it's a bit of, what is it, Hanlon's razor? Never ascribe... no. No, I don't think Hanlon's razor applies here, which of course is "never ascribe to malice that which is adequately explained by incompetence", because I think they're very, very competent people. I think some very careful decisions have to be made about this kind of stuff. We're obviously in a very sensitive intellectual-property environment ourselves, and trying to toe the line between the things we think are useful to share and the things most people don't need to know but could be a competitive advantage would be my biggest guess there. Plus, just documenting it would be a pain. Any more questions?

Okay, I'm being waved at from the back, as in: we're out of time. So, once again, thank you all for inviting me out here. I've had a wonderful time. [applause]
