27: High Throughput Stock Exchange | Systems Design Interview Questions With Ex-Google SWE

By Jordan has no life

Summary

## Key takeaways - **Exchange Matches Bids and Asks**: The exchange keeps track of bids (how much someone is willing to pay) and asks (how much someone is willing to sell for). A trade executes when a bid price is greater than or equal to an ask price, potentially triggering multiple atomic trades. [03:10], [04:27] - **Matching Engine Runs In-Memory**: To achieve high throughput with millions of orders daily, the matching engine runs everything locally in memory to limit IO, network calls, and disk access, while minimizing locking. [05:19], [05:52] - **UDP Multicast Ensures Fair Delivery**: Clients connect through proxy servers, and market data is delivered fairly using UDP multicast so all proxies receive messages simultaneously, preventing high-frequency trading advantages from timing differences. [07:47], [08:20] - **Retransmitters Handle UDP Drops**: UDP lacks reliability, so retransmitter nodes cache every message with sequence numbers; clients request missed messages like sequence number 4 after receiving 5. [09:26], [09:57] - **State Machine Replication for Backup**: Backup matching engine listens to primary's sequenced output messages and replays them deterministically to maintain identical state; it generates missing atomic trades like messages 2 and 3 from message 1 if primary fails mid-delivery. [13:00], [14:47] - **Partition by Symbol Avoids Intra-Symbol Splits**: Scale by partitioning matching engines per stock symbol like Apple or Google to prevent bottlenecks, but avoid splitting one symbol across nodes since all orders must see the best prices and trade atomically. [15:40], [16:36]

Topics Covered

Bids Cross Asks to Trigger Atomic Trades
In-Memory Engine Powers Extreme Throughput
UDP Multicast Ensures Fair Market Data
State Machine Replication Syncs Backups
Partition Per Symbol Avoids Cross-Locks

Full Transcript

hey everyone welcome back to the channel today we are doing yet another video on systems design we're going to be building out an exchange so this is a disclaimer right now because I can see myself getting sued for this video down

the line God knows why but uh if my current company ever wants to come after me for making this one I'm going to make two things very clear one is that I'm absolutely just completely shamelessly

plagiarizing this video from another Jane Street exchain Building video uh which I found on YouTube um and I'm going to link it so you can see how badly I'm plagiarizing it and two is that I work on absolutely nothing

similar to exchanges in my day-to-day work in fact like 50% of it is adding buttons to Trader goys so with all of that being said please don't sue me I haven't done anything wrong I'm just an

idiot on the internet for fun let's go ahead and get into some systems design all right let's go ahead and get into things as we all know this wouldn't be a Jordan has no life video unless there's

at least one kov and one foot pictures bit uh in the beginning and so here it is what is an change well at uh you know really like broken down to its Essence it's basically one person saying they're

willing to buy or sell something and this case I'm selling my foot pictures for $6 and then other people saying what they would pay for it so in this case Megan Fox might be willing to pay $5 but that's not good enough for me I want six

Karina kov comes along because she's in love with me pays $7 and now all of a sudden we sell now depending on the exchange who knows what price we're actually going to sell at it'll be somewhere between $6 and $7 but the gist

is when she's willing to pay more than I'm asking for uh then we are good to go so let's go ahead and formalize this in some problem requirements number one

is that we want our exchange right so I just had an open outcry exchange where everyone's yelling which still does actually exist however we're going to be building ourselves an electronic exchange so we're going to be matching

buyers and sellers of stock via a computer the next thing that we want to be doing is providing an ordered list of all operations that everyone will eventually agree on right it's not enough for uh you know the exchange to

provide some ordered list of operations and then maybe if it goes down some other backup comes up and provides a different order of operations that would be bad uh it would mean that some trades that people thought happened didn't

actually happen and uh then they would get very very angry and we would have legal issues so let's go ahead and make sure that we agree on The Ordering of things and number three is that in addition to providing a total ordering

over everything we also want to provide an ordered list of all act activities per client right so if I'm uh you know my Trading Company I want to know exactly all of the trades uh and the

orders that I made and the cancels that I made in the relatively correct order that I made them with a sequence number for each great okay so let's go ahead and talk

about the first thing here which is going to be the limit order book so in the last uh video that I made on this topic I'm very heavily plagiarizing off myself I spent a lot of time uh trying

to dive into the algorithm that you would actually use to do this and the truth is that in of itself is a video and it's not really a video for a systems design series either I'm just going to explain the highlevel overview

or algorithm of how the limit order book works and then from there I think that should be enough to get into this thing further so all that exchange is really doing is keeping track of bids and asks where a bid is how much I'm willing to

pay and an ask is how much I'm willing to sell a thing for so if we're talking about GameStop stock I've got some bids roaring Kitty is willing to pay $100 for five units of GameStop stock my wife's

boyfriend is willing to pay $90 for 10 units of GameStop stock on the other hand on the ask side maybe Citadel is either shorting or already owns some

GameStop stock and they want $120 for their 100 units of it so $120 per unit basically Melvin Capital who as we know got brutalized in the short squeeze

wants $130 for their 10 units of it so hopefully this is starting to make sense now as you can see as things are we're not going to be making any trades but all of a sudden maybe roaring Kitty decides actually you know what I'm

willing to buy five units for $30 what will actually happen and let's say he doesn't just want five units but now he wants 110 units he'll trade with both

Citadel and now all of a sudden he actually wants even more than that he'll also trade with Melvin because he wanted 110 units and both of those prices are going to work for him so the gist is that we're going to see one or more

trades executed when there's a bid in an ad ask where the bid price is greater than or equal to an Ask price and it's important to note that more than one trade can be executed at once because when we have something like that we need

to ensure atomicity right it shouldn't be the case that I only hear about one of those trades and not the others so this is going to be important when it comes to fault tolerance okay so let's go ahead and

talk about our our matching engine performance or rather a high level overview of it right because the matching engine is the thing that is implementing this limit order book right here and so we want to be able to make

sure that this thing is as fast as possible there are many many millions if not billions of orders being placed every single day on one of these exchanges and if they're taking a while

to get through or there's low throughput you're just going to have a bunch of orders coming back behind them and eventually everything is going to stop to a crawl nothing is going to get executed so we want to be as fast as

humanly possible here so what can we do to do that well the first thing that we want to do is limit as much IO as we can which basically means any additional Network calls and any additional calls

to dis which basically means we're going to be running everything locally in memory now of course that doesn't mean that it's all going to be single-threaded we have seen like certain databases which will do this

ideally we want to be avoiding locking as much as humanly possible but there are going to be some situations here where we simply can't avoid locking and we'll talk about those a little bit later cool another thing to note is that

our matching engine is going to be deterministic right based on everything that I described in the above algorithm it doesn't really matter exactly when the message comes in just more so that

if a given message comes in right so let's say we have some message M and we have some limit order book State s when we apply M to S we're always going to get S Prime now of course this doesn't

necessarily mean that you know if M and M Prime come to S at the same time uh that you know regardless of the order they're applied that we're going to get the same S Prime that's not the case we are going to have to order these

operations for us to make sure that our state is consistent cool so let's talk about some networking as far as our matching engine is concerned basically the gist is Right we've got this one box which is our matching engine and then we've got a

bunch of people who really really care about what's going on in that matching engine we've got all the clients right so all of the the vanguards and the the robin hoods and any other people that are actually trading with the exchange

on behalf of customers sending these orders we've got a bunch of Banks and clearing companies who care for you know verifying that things look good and then we've just got other Market data vendors maybe people like Bloomberg who are getting this data modifying it a little

bit and then sending it out to other people for money cool and the big thing is that probably all of these people or all these clients are going to be connected to the matching engine through proxy servers but the point is those

proxy servers need to be getting the data delivered to them fairly right because if the data goes to one of the proxy servers well before the others people are going to start complaining about that quite a bit because it's not

like you know you're paying more money to be connected to a certain proxy server it's more so that you know we've got a bunch of them they should be delivering all of the market data at around the same time otherwise if someone gets Market data sooner than

someone else they have more information than them and can make more money trading and that is kind of the the gist behind high frequency trading but the point is we want to make this thing as Fair as possible or we get in trouble so

the way that we do this and this is mostly background here I wouldn't really expect anyone like be able to Rattle this off during a normal Tech interview is we would use something like UDP multicast the reason being that TCP

which is our go-to standard choice for most web applications is going to require us having a bunch of onetoone connections from our actual exchange itself and so if you have a bunch of

onetoone connections sure maybe you can use multiple threads to send messages out at similar times but they're not going to go out at exactly the same time whereas multicast is literally one message that comes out of your computer

and then it basically gets forwarded over that private Network that everyone's connected to and uh supposedly this is the most Fair admittedly I don't know too much about this network stuff I'm not going to claim to either I would recommend

looking this one up more on your own if you can uh I simply don't know so please don't kill me in the comments cool so what happens when we use UDP as opposed to TCP well this much I do know we lose

a few nice features that TCP has for starters we lose flow control we lose an ordered broadcast and we also just lose reliable broadcast so in UDP messages can get just get dropped or they can be

delivered out of order or anything along those lines it's really fair game that you're basically just sending packets willy-nilly over the network and anything can happen so how are we actually going to

deal with the fact that we're using UDP well like I mentioned messages can come out of order and they can be dropped uh the flow control thing basically the only way we're going to be able to deal with that is by having consumers of

these messages that are sufficiently fast the actual out of ordering and the drop messages well we're going to deal with that by using something called retransmits so this is very much

something that gets actually used in you know these actual exchange designs where we're using UDP the reason being that let's say you know I have a matching engine and it sends out messages with

sequence numbers four and five so four and five go out but for some reason our client over here doesn't receive number four it just gets number five and so it's going to say shoot you know what I saw everything up to sequence number

three and now I'm getting a sequence number five which implies that a sequence number four exists well fortunately even though our client dropped this message we've got a few different retransmitter nodes which are

sitting over here solely for the purpose of caching every single message and basically putting it in its own log so then as the client I can go to the retransmitter and say hey I dropped a message give me what I missed and so the

retransmitter is going to allow us to make sure that we have all of these messages and can deliver them back to our client without actually risking losing a bunch of data now is it possible that somehow all of these

messages don't get to the retransmitter uh I suppose so yes but that's why we have a lot of retransmitter otherwise uh you know the increasing uh there's an increasing probability of something like that happen

happening okay the next thing that we are going to do is talk about the fault tolerance of our matching engine because right now what we've discussed is everything important is pretty much happening in here and so if I were to go

ahead and take it down via a nuke or via a sweet 360 roundhouse kick uh we're in trouble so what can we do basically like we mentioned this thing's running in memory so even if it

were to come back up we still don't have our state restored we basically are going to need a full-on backup in fact we probably would even want multiple backups but the gist is we're going to have some sort of Zookeeper node right

something that you know we can achieve consensus with where all of the nodes and zookeeper you know constantly doing heartbeats with our primary matching engine and then if they all agree that the primary has gone down then you would

go ahead and uh you know stop the heart beating or and uh continue to basically elect the backup as the new leader cool so what are we going to do now uh we're

going to recreate this state so uh basically how are we going to set up our backup I'm going to present a couple of options we'll talk about what's more feasible than the others so option number one is that you know we've got

our primary we've got our backup and for the backup to get the same exact State as our primary node what we can do is have every single client who's sending orders send them to both the primary and

the backup however this is going to be problematic for a couple of reasons let's imagine that we wanted to go ahead and send uh you know an order to both the primary and the backup what could

happen is if if we have two clients sending orders to them their orders could get interleaved right so what might happen is that this guy's order first hits the backup and then hits the

primary and in that same time this guy's first order hits the primary and then hits the backup and so basically what you might have is that you know on the primary it looks something like client

one's order gets executed first then client two and over here on the backup client twos order gets executed first

then Cent one and now these guys basically disagree uh on you know what the actual state of the thing is and that is bad uh it's unacceptable and

like I mentioned if we disagree on you know who traded first then we're going to have discrepancies down the line and people are going to get very mad because then you know the primary goes down client one thinks he traded first and

then all of a sudden we switch over to the backup and now he seeing he traded after client to he's going to be very upset with that so cool what can we actually do instead instead what we can

do is State machine replication so we've talked about this before but this is always something that's very useful when you've got one node that's basically processing every single thing in uh memory and you know we need to continue

to process it quickly and so we can basically just pop out all of the state that it is processed and then process it on the second note so I'll describe that in more detail basically the gist is we have our backup we're listening to the

output of the primary as opposed to the inputs because the inputs can get there in any order but the output is in a sequenced order right we mentioned that the primary matching engine is

outputting uh events with sequence numbers meaning that on the backup we can apply them in the same order as the sequence numbers we don't even have to uh get those messages in order on the

backup right because they might come to the backup out of order but because they have sequence numbers because we have retransmitter nodes uh the backup is always able to stay in perfect sync with

the primary so what is one Edge case that could happen here well let's imagine we have a client the client puts in an order over here to the primary and that triggers multiple different trades like I mentioned right

so now all of a sudden we've got message one message two and message three let's say that you know the client or rather the primary M matching engine delivers

message one and then it goes down before message two and three are delivered now that's bad because these three events are supposed to be Atomic it doesn't make sense to have message one unless you also have message two and three the

fortunate thing is that all of these messages are determined istic right so when the backup sees message one it's going to realize because it runs the same exact code as the primary and sees that it has to apply these operations

that oh you know what I didn't even receive messages two and three but I know those are about to happen because they have to happen at the same time cool so the backup is going to generate

two and three based off of number one and the gist is that you know for something like this to work we do still need atomicity at the individual message

level so you know message one is itself probably comprised of multiple different internet packets and we need all of these to arrive together for this thing to count so what could we do well we

could have actual sequence numbers within that individual uh you know exchange published message we could have a check sum to make sure the data is valid you know we could have an end marker right here to make sure that

we've received all of it where it says you know the number of actual uh sequences in that message so things like that uh will ensure that we have atomicity at an IND idual message

level cool the next thing that we want to do is talk about partitioning right so up to this point we've had one single matching engine that everything goes through and the problem there is that

that is effectively a bottleneck so what can we do to actually scale it out well we could do some form of partitioning so the question is what do we actually want to partition on well we probably can't

partition within the same symbol right so for example for Apple stocks if all of a sudden we decided like hey you know some orders for Apple stock are going to go on one node and some orders for Apple stock are going to go on another node

that's not going to work because the whole point of the exchange is we want all orders for Apple stock to be able to view one another and potentially trade on one another if the price is right if we're not giving people the best prices

they're not going to come to our exchange and frankly it might be illegal as well so that's not going to work you might be able to actually have the multiple partitions communicating with one another but then like we mentioned

now we're introducing a bunch of different networking calls and that's going to slow things down down a lot and probably be in feasible so the real way that you're probably going to be doing partitioning here is on a per symol basis right so if we're trading Apple

Google and Tesla we can have one matching engine for each of these the problem is some people depending on their trading strategies might actually want to place orders that are Atomic

across these nodes right so maybe I want to buy apple and sell Google at the same time and so I place an order to do just that or maybe I want to cancel my Apple order at the same exact time as I cancel my Google order because if one reaches

before before the other and one counts and works and the other doesn't count and work or maybe I traded one of those stocks and didn't trade the other now I have a bunch of exposure risk to one of the stocks when I wanted to cancel both

so the gist is you know when we partition like this and we have these symbols across multiple partitions the only way to actually make Atomic transactions is to do something like a two-phase commit and that's going to be really slow and so we're probably not

even going to offer that in the first place because it's just going to require everyone else to wait on our distributed transaction and that's not okay the last piece of this thing is going to be returning ordered messages for

clients so the gist is here that basically the matching engine is going to decide on a total ordering that doesn't necessarily mean though that the matching engine is going to execute everything on a single thread it just

means that every single operation is just going to get some number assigned to it it is possible that there are a bunch of concurrent operations that don't conflict with one another and so it doesn't really matter which sequence number we give to which one of those the

point is they don't conflict confict that being said even if they don't directly conflict with one another right there are some messages that we do want ordered relative to one another and typically that's going to be within a

single client so for example if I'm a person who's interested in trading I want to know whether you know the trade that uh I placed happened before or after I sent my Cancel message even if

they weren't directly conflicting with one another so the gist over here is that you know for every single message we're going to send basically an ID with that message we're going to associate a

lock with that relevant client ID so in this case you know this would be like a user 10 lock and then when I try and cancel versus when the exchange is trying to trade uh you know something

that I have basically now all of a sudden we're going to use this lock to determine a monotonically increasing sequence number for my operations assign it in this case and then send it back to

me the client so hopefully the main gist here is just that you know it's not enough to not lock on any two transactions that don't Direct ly conflict with one another if they're within a same client and we have to

provide the client with ordered sequence numbers then we're going to have to do some additional locking there too okay so let's go ahead and get ourselves into a diagram to finish this thing off the gist is I've got a bunch

of clients over here and like I mentioned we don't want to uh directly connect those with the exchange for a bunch of reasons for starters they're probably not on the same private Network so that might complicate multicasting but even additionally we do like having

order gateways as well to perform some sort of rate limiting so you can't just Spam The Exchange and take it down the order Gateway can you know stop you via a rate limiter and then within the

actual you know gateway to exchange private network over here everyone's communicating with UDP so the gist is we've got our matching engine which is really the center of every single thing

that we care about it's publishing to all of these different nodes around it with UDP and then in addition we also have a backup which is basically listening to

all of the primary messages establish ing the same state as the primary based off of them and then zookeeper which is always waiting to hear if the primary went down and if so basically saying

okay the backup can now go ahead and claim itself as the primary then finally we've got a couple of these retransmitter nodes right here which are also listening to the primary matching engine basically all they're going to be

doing is creating a log of sequence number to the actual message that the primary put out and then that way if any of these order gateways or the backup exchange itself becomes a little behind

or drop some messages they can reach out to one of the retransmitter nodes to see if it's there anyways guys I hope you enjoyed this video again for legal reasons I plagiarize the out of

this from Jane Street it has absolutely nothing to do with my actual work I am a front-end button monkey and don't you forget it anyways I hope we all have a great night and I will see you in the

next one

Loading...

Loading video analysis...