What Nobody Tells You About Being a Quant
By The Quant Insider
Summary
Topics Covered
- Communication beats coding in quant interviews
- Active management is forecasting the market's errors
- Factor decomposition makes portfolio risk computable
- Market impact is finance's Heisenberg principle
- The multi-factor model is a breadth machine
Full Transcript
Hi everyone, I've received a lot of requests to do a full walk-through of my experience as a quantitative developer.
And so today, I'm going to share everything I learned in my four years as a quant, and also touch on things I've worked on to give a detailed look into some practical systems that a quant developer could be expected to implement.
The most exciting piece I'll cover is a review of the system design of a trading system you can expect at most big quant firms, which can be great to implement in your personal projects, and in turn a great talking point in interviews.
I've sectioned this video off into chapters, so you can easily skip ahead to the parts you're most interested in.
First, I'll start with the experience I had before I was accepted as a quantitative developer at my firm.
I graduated from the engineering department at the University of Waterloo, and I had completed six internships, all of them in big tech and startups.
Across those roles, I worked as a software developer and as a machine learning engineer.
Despite the extensive internship experience, I didn't have any finance or quant finance background at all.
After I graduated, I applied to hundreds of quant roles of varying types, and I eventually landed mine.
The interview process consisted of six rounds, and I'll cover each one. The
first two rounds were technical coding questions, but they weren't your typical LeetCode questions. They felt
custom-made for the role, and the two questions I got were of varying difficulty.
The first round asked me to build a portfolio management system, something that could track orders, execute orders, and maintain my positions across different stocks.
I found this one fairly easy because I had already built trading systems in the past, and was familiar with the features the interviewer was asking for, like stop losses, limit orders, and so on.
The key thing I took away from that round was the importance of communicating my thought process thoroughly and concisely.
If there's one piece of technical interview advice you take from this video, it's this.
Half of what they're looking for is whether you can work your way through the problem using good technical practices, and the other half is whether you can communicate your thought process clearly.
This is a hugely underrated skill in interviews, and it's one of the main things I, along with my colleagues, look for when we conduct interviews ourselves. If a
candidate can't communicate their thinking behind a simple coding question, we lose confidence that they'll be able to explain the far more complex projects they'd take on if they
were hired as a quant developer, researcher, or trader.
As a quant, part of your job is being able to explain technical concepts to a wide range of people across the firm in a very intuitive and clear way.
When I interview candidates for our firm, I tend to value excellent communication over technical coding skills, especially in the current climate in the tech industry.
The second round was about developing an algorithm around the different sets of combinations that can appear in a 52-card deck, and then modifying the suits and ranks in the deck.
I can't remember the exact question. I tried searching for it online afterward, and couldn't find anything that matched. But from what I recall, it was really testing my ability to apply statistics to programming.
This was by far the hardest question I encountered in the whole process.
And I'll be honest, I didn't get it fully correct.
But I was able to communicate my thought process as best I could, and got probably 70% of the way there.
That actually reinforces the point I made earlier. Your communication will take you a lot farther in the interview process than you might realize.
I truly thought that round would have knocked me out right away.
To my surprise, the interviewer advanced me to the next stage to meet the partners at the firm, who are now my managers.
So, my biggest piece of advice is this.
Whenever you practice technical questions, talk through the problem out loud to yourself first. Write it down in pseudocode, and then actually implement it.
That way, you build the habit of explaining yourself concisely. The third
round was a system design and technical round with two partners at the firm. It
was a little intimidating, almost a good cop, bad cop situation. They went
through my resume in full detail and dove deep into the projects I'd worked on.
Their goal was to see how well I could communicate a technical concept, and also to confirm that I had genuinely done the things on my resume.
And trust me, it comes across very clearly if that isn't true.
I talked about one of the projects from my machine learning engineering role, and a side project I'd built in trading systems. This kind of round will almost certainly
happen in every interview process.
So, I highly recommend being extremely comfortable being questioned about your own work and able to talk about a project for a solid hour. No matter how hard they pushed on the work I'd done,
it was important that I came across as confident and stood my ground with the technical depth to explain my projects in detail.
They asked me to show the system design for one of my projects, so I shared my screen and drew out the systems I'd built, and they questioned each piece.
There were a lot of questions around data engineering, how I'd organized the database, and the specific ML algorithms I'd used, which I believe were some kind
of gradient boosted tree algorithm, if I remember correctly.
The fourth, fifth, and sixth interviews were quite similar. Each one was with a different partner at the firm. It seemed
like they were rotating me around all the partners of each sub team within the quant department to figure out where I'd fit in the company.
I won't dwell on these too much because they didn't feel as difficult. They were
mostly a mix of systems design and behavioral questions, talking about the projects I'd worked on, database questions, and understanding my experience in the quantitative finance space.
They traded global equity, so they did ask me some questions related to that, but they were fairly high level, and they clarified that I wouldn't be penalized for not knowing too much of it.
Though they made it clear it would be expected of me to grasp it if I joined.
The first big concept I want to walk through is the multi-factor trading model, because honestly everything else in this video hangs off it. It's the
standard architecture across a lot of the top firms, and once you understand it, the rest of my job will make a lot more sense.
It's not a new idea, it goes back a few decades, and there's one book I'd point any new hire to before anything else, Active Portfolio Management by Richard Grinold and Ronald Kahn.
I'll link it in the description. What I
want to do here is lay out the whole machine end to end, the way I'd sketch it on a whiteboard for someone on their first week.
Then, for the rest of the video, we'll zoom into the individual pieces I actually worked on and expand them.
One thing to set expectations on first, the multi-factor system at my firm, like at basically every firm, is an enormous code base.
Nobody who joins today understands the entire thing end to end, and I'd argue nobody understands all of it at all. So think of this section as the map I wish someone had drawn for me
on day one.
Let me start with the single idea the whole business rests on.
We get paid to outperform a benchmark. A
client could just buy the index for almost nothing, so our entire reason to exist is the value we add on top of that index. That value add has a name, alpha.
The cleanest way to think about it, if your benchmark is up 10% and you're up 12, that extra 2% that came from you, not from the market rising, is your alpha.
Everything we build is in service of producing more of it.
And here's the mindset shift that trips up most people coming from a pure engineering background, like I did. The
market price already has a consensus view baked into it. Thousands of smart people have already priced in what they collectively think a stock is worth. So,
our job isn't to figure out what a stock is worth in some absolute sense. It's to
find the specific spots where we disagree with the consensus, and to be right about that disagreement more often than the other people trying to do the exact same thing. Active management is
forecasting the market's errors. So, how
do you forecast in a disciplined, repeatable way? You start by realizing that a stock's move on any given day isn't one thing.
It's a stack of things added together.
Picture an oil producer that's up 14% over some period.
A chunk of that is just the whole equity market going up. Another chunk is its country.
Another chunk is its industry.
Oil moved, so oil stocks moved with it.
And only what's left after you strip all of that out is truly specific to that company.
That leftover piece is the part you actually have a view on.
The way we separate those pieces in practice is regression. You regress the stock's return against the things that drive it, and whatever they explain is factor return, while the unexplained
residual is the stock's own idiosyncratic return.
And a quick gut check that surprises people, even for an oil stock that tracks oil tightly, that company-specific residual is usually a big part of its total movement. Stocks are a lot more individual than they look.
That idea, decomposing returns into common drivers plus a specific piece, is the seed of the whole model.
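To make that decomposition concrete, here is a minimal sketch using ordinary least squares in NumPy. The two drivers (a market return and an oil return) and every number in it are made up for illustration, and a production model would use many more factors and a more careful estimation procedure.

```python
import numpy as np

# Hypothetical daily returns for two common drivers (illustrative numbers).
market = np.array([0.010, -0.004, 0.006, 0.002, -0.008, 0.005])
oil = np.array([0.020, -0.010, 0.000, 0.015, -0.012, 0.004])

# Hypothetical returns for one oil producer over the same six days.
stock = np.array([0.025, -0.009, 0.011, 0.010, -0.020, 0.012])

# Design matrix: an intercept column plus one column per driver.
X = np.column_stack([np.ones_like(market), market, oil])

# Least-squares fit: the coefficients are the estimated exposures.
coef, *_ = np.linalg.lstsq(X, stock, rcond=None)

fitted = X @ coef          # intercept plus the part explained by the drivers
residual = stock - fitted  # what is left over: the stock-specific piece

print("exposures (intercept, market, oil):", coef.round(3))
print("share of variance left in the residual:",
      round(residual.var() / stock.var(), 3))
```

The fitted part plus the residual always adds back up to the stock's return, which is the whole point: nothing is lost, it is just split into a common piece and a specific piece.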
So, let me walk the pipeline it grows into. It all starts with data and signals.
Everything downstream is only as good as the data feeding it. So, this
is where it begins. Both traditional
financial data and the alternative data I'll get into later.
Out of that data, you build signals.
Individual, measurable views on a stock.
Classic ones are value, momentum, size, quality.
The more exotic ones come out of alternative data sets.
One thing worth flagging early.
Before any of this, you also have to define your universe.
The set of names you're even willing to trade.
Your instinct is to trade everything for maximum breadth, but you quickly learn some names just aren't worth it.
Illiquid stuff you can't get in and out of without moving the price against yourself.
So, you draw a sensible boundary first.
A raw signal though, isn't something you can trade directly. And turning it into something tradable is the next block, the alpha.
An alpha is a clean, refined forecast of return.
The intuition for the refinement is three things multiplied together. How
volatile the stock is.
How much genuine predictive skill your signal actually has.
And how strongly the signal's firing for that specific name, right now.
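Those three ingredients are usually written as alpha = volatility x skill x score, where skill is the information coefficient (IC) and the score is a standardized signal value. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def refine_alpha(volatility: float, ic: float, score: float) -> float:
    """Forecast of stock-specific return: volatility * IC * score."""
    return volatility * ic * score

# Illustrative inputs: 20% residual volatility, an IC of 0.05,
# and a signal currently one standard deviation above average.
alpha = refine_alpha(volatility=0.20, ic=0.05, score=1.0)
print(f"forecast alpha: {alpha:.2%}")  # forecast alpha: 1.00%
```

Note how a signal with no skill (IC of zero) or a neutral score produces an alpha of zero, no matter how volatile the stock is.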
Each refined signal becomes one factor.
And multi-factor just means you're running many of these at once. Each one
ideally capturing a different, independent edge.
We'll come back later to why independence matters so much, because it's the whole game.
Running right alongside the alphas, not after them, in parallel, is the risk model. And this is the block newcomers consistently underrate.
The alpha tells you what you hope to make. The risk model tells you what it could cost you in volatility.
It uses that same decomposition from before. A stock's return is its exposure to a set of common factors (industries, plus risk indices like size, value, volatility, and momentum), plus its own specific piece.
Now, why go to all this trouble instead of just measuring how every stock moves with every other stock directly?
This is the single best why I can give you. So, let me actually work it through.
To understand a portfolio's risk, you need the covariance between every pair of stocks, how each one co-moves with each other one.
Take a universe of 1,400 names.
The number of pairwise relationships you'd have to estimate is on the order of 980,000.
That's hopeless. You can't estimate a million numbers reliably from finite history. You'd mostly be fitting noise.
So, here's the trick that makes the whole thing possible. Instead of
stock-to-stock, you say every stock is just a bundle of exposures to maybe 65 common factors.
Now, you only need the covariances among the factors, which is around 2,000 numbers instead of nearly a million.
You've collapsed an impossible problem into a tractable one.
That is why the risk model is built on factors rather than raw stocks.
It's the only way the math is even computable.
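The counting argument is easy to check, and the same sketch shows how a full stock-level covariance matrix can be rebuilt from factor pieces. The tiny universe below (6 stocks, 2 factors) and the random inputs are purely illustrative assumptions; the structure, exposures times factor covariance plus a diagonal of specific variances, is the standard form of a factor risk model.

```python
import numpy as np

def pairwise_count(n: int) -> int:
    """Number of distinct pairs among n items."""
    return n * (n - 1) // 2

print(pairwise_count(1400))  # 979300 stock-to-stock covariances
print(pairwise_count(65))    # 2080 factor-to-factor covariances

# Toy factor risk model: V = X F X^T + diag(specific variances).
rng = np.random.default_rng(0)
n_stocks, n_factors = 6, 2
X = rng.normal(size=(n_stocks, n_factors))   # factor exposures per stock
A = rng.normal(size=(n_factors, n_factors))
F = A @ A.T                                  # a valid factor covariance matrix
specific_var = rng.uniform(0.01, 0.04, size=n_stocks)

V = X @ F @ X.T + np.diag(specific_var)      # full stock covariance matrix
print(V.shape)
```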
While I'm on risk, two pieces of intuition I lean on constantly. First,
risk does not add up the way you'd expect. The risk of a portfolio is less than the weighted average of its parts because the stocks don't all move together.
That gap is exactly the benefit of diversification. Spread across enough uncorrelated names, and the specific risk averages away toward nothing.
But, and this is the catch, it only averages away the specific part.
Stocks all tend to rise and fall with the market together, and that shared market risk never diversifies away, no matter how many names you hold. That's
the line between specific risk, which you can dilute, and systematic risk, which you're always carrying.
It's the core insight behind CAPM. The
market won't pay you a premium for risk you could have diversified away for free.
Arbitrage pricing theory took that one step further and said, "There isn't just one systematic driver, but many," which is precisely the multi-factor view we run on.
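A small sketch of the diversification point, under deliberately simple assumptions: an equal-weighted portfolio, every stock with a market beta of 1, and stock-specific returns that are uncorrelated with each other and with the market. The volatility numbers are made up.

```python
import math

def portfolio_risk(n_names: int, market_vol: float, specific_vol: float) -> float:
    """Risk of an equal-weighted portfolio under the assumptions above."""
    systematic_var = market_vol ** 2            # does not shrink with n_names
    specific_var = specific_vol ** 2 / n_names  # averages away as n_names grows
    return math.sqrt(systematic_var + specific_var)

for n in (1, 10, 100, 1000):
    print(n, round(portfolio_risk(n, market_vol=0.15, specific_vol=0.30), 4))
```

As the number of names grows, the total falls toward the market volatility and then stops. That floor is the systematic risk you are always carrying.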
The second piece of intuition is about time.
Risk doesn't add across time, but variance does.
As long as today's return isn't correlated with yesterday's.
Which for most assets, it basically isn't.
So variance piles up linearly with time.
And since risk is the square root of variance, risk grows with the square root of time.
That's why when you turn a monthly volatility number into an annual one, you multiply by the square root of 12, not by 12. Tiny detail, but it's everywhere.
And getting it wrong quietly corrupts every risk number you produce.
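The square-root-of-time rule as a few lines of arithmetic, assuming returns are uncorrelated from month to month. The 5% monthly volatility is an illustrative number.

```python
import math

def annualize_monthly_vol(monthly_vol: float) -> float:
    """Scale a monthly volatility to annual using the square root of time."""
    return monthly_vol * math.sqrt(12)

monthly_vol = 0.05
annual_variance = 12 * monthly_vol ** 2            # variance adds across months
print(round(math.sqrt(annual_variance), 4))        # 0.1732
print(round(annualize_monthly_vol(monthly_vol), 4))  # 0.1732, same thing
print(round(monthly_vol * 12, 4))                  # 0.6, the wrong answer
```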
Once you've got both halves, the alpha saying what you expect to make and the risk model saying what it'll cost, they feed into portfolio construction. This is an optimizer.
And conceptually, it's doing a balancing act. Maximize expected return. Subtract a penalty for the risk you're taking.
Subtract the cost of trading, all while respecting your constraints. Position
limits, staying sector neutral, turnover caps, how much leverage you're allowed.
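A stripped-down sketch of that balancing act. To keep it short it drops the trading-cost term and all constraints, which leaves a closed-form answer; a real optimizer would handle those with a numerical solver. The alphas, covariance matrix, and risk-aversion value are made up.

```python
import numpy as np

alpha = np.array([0.02, 0.01, -0.01])   # expected active return per name
V = np.array([[0.040, 0.006, 0.004],
              [0.006, 0.090, 0.010],
              [0.004, 0.010, 0.160]])   # covariance of active returns
risk_aversion = 2.0

def utility(h: np.ndarray) -> float:
    """Expected return minus a penalty for variance."""
    return float(alpha @ h - risk_aversion * h @ V @ h)

# Setting the gradient (alpha - 2 * risk_aversion * V h) to zero gives:
h_star = np.linalg.solve(V, alpha) / (2 * risk_aversion)

print("target holdings:", h_star.round(4))
print("utility at target:", round(utility(h_star), 6))
```

Even in this toy version you can see the character of the answer: positive alpha gets a long position, negative alpha gets a short, and the size is scaled down by how risky the name is.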
What comes out the other side is the target portfolio.
The actual amount of each name you want to hold. The optimizer just hands you a target though.
Actually getting there is the next block, implementation and trading.
The whole philosophy of this stage fits in one line.
Subtract as little value as possible.
Every trade leaks a bit of your alpha back out, and this is death by a thousand cuts.
So it's worth knowing what those costs are.
There's commission, the per share fee to the broker.
There's the bid-ask spread. Buy at the ask, sell at the bid. And that gap is the cost of a round trip. There's market
impact, the big sneaky one. Buying one
share is cheap, but buying 100,000 shares pushes the price against you as you go.
I always describe market impact as the finance version of the Heisenberg principle. You can't observe the market without disturbing it. And finally,
there's opportunity cost.
The trade you waited on for a better price that just ran away from you. The
way you measure all of this after the fact is implementation shortfall. You
run a hypothetical paper portfolio with zero trading costs and compare it to your real one.
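Here is that paper-versus-real comparison for a single hypothetical buy order. All prices, share counts, and the commission rate are made-up numbers, and the split into execution cost, commission, and opportunity cost is one common way to break the total down.

```python
decision_price = 50.00        # price when the trade was decided
close_price = 50.50           # price at the end of the measurement period
target_shares = 100_000       # what the paper portfolio buys, instantly and free
filled_shares = 80_000        # what actually got executed
avg_fill_price = 50.10        # average price actually paid
commission_per_share = 0.005

paper_pnl = target_shares * (close_price - decision_price)
actual_pnl = (filled_shares * (close_price - avg_fill_price)
              - filled_shares * commission_per_share)
shortfall = paper_pnl - actual_pnl

# The same total, split by where it leaked out.
execution_cost = filled_shares * (avg_fill_price - decision_price)  # spread + impact
commission = filled_shares * commission_per_share
opportunity_cost = (target_shares - filled_shares) * (close_price - decision_price)

print(round(shortfall, 2))
print(round(execution_cost + commission + opportunity_cost, 2))
```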
And the gap is your total cost of implementation.
The last block closes the loop, performance analysis. After the fact, you take what actually happened and decompose it. How much came from the factor bets you intended to make? How much from constraints?
How much was just noise?
The real goal is to separate skill from luck and to find where the skill actually lives, so you double down on what's working.
This block feeds straight back to research because factors decay. An edge that printed money five years ago gets crowded out as everyone else discovers it. You're never
done. You're always refurbishing old factors and hunting new ones. Two
numbers tie the entire machine together and I want you to internalize both. The
first is the information ratio. Your
active return divided by your active risk. It's the report card. How much
value are you adding per unit of risk you choose to take? The second genuinely changed how I think. The fundamental law of active management. Your information
ratio is roughly your skill multiplied by the square root of your breadth.
Where breadth is the number of independent bets you make.
So, there are exactly two ways to get better.
Be more skillful per bet or make more independent bets.
Now, that word independent is doing enormous work. If you're long five stocks and short five, but all the longs are retail and all the shorts are energy, you don't have 10 bets. You have two. A bet on retail and a bet against energy. Real breadth means genuinely distinct decisions across both the names you cover and how often you
independently revisit them. And once
that clicks, the entire point of a multi-factor model snaps into focus.
It's a breadth machine.
It's how you make thousands of small, independent, slightly better than even bets across the whole market every single day.
And the square root tells you that stacking up breadth is how a modest per bet skill compounds into a serious edge.
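The fundamental law in a few lines, with illustrative numbers. The skill input is the information coefficient (IC), and breadth is the number of independent bets per year.

```python
import math

def information_ratio(ic: float, breadth: float) -> float:
    """Fundamental law of active management: IR is roughly IC * sqrt(breadth)."""
    return ic * math.sqrt(breadth)

# The same modest skill, applied to very different numbers of independent bets.
for breadth in (2, 10, 100, 2500):
    print(breadth, round(information_ratio(0.02, breadth), 3))
```

The long-retail, short-energy book from the example sits on the first row, not the second: ten positions, but only two independent bets. And because of the square root, you need four times the breadth to double the information ratio.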
That's the skeleton. Everything else I show you from here hangs off one of these blocks.
So, let's zoom into the very first block of that diagram, data and signals, because that's where I spent a real chunk of my time and it's where the whole machine either stands or falls.
I said everything downstream is only as good as the data feeding it and I want to show you what good data actually takes, starting with a fundamental piece of data engineering in this space,
security matching.
Before I get to matching, it's worth naming the unglamorous work that lives in this block because it's easy to assume the data shows up clean and it never does.
A big part of the job is data scrubbing, cross-checking outliers against other sources, filling gaps, fixing formats, and then reconciling vendors who all describe the same company differently.
One vendor identifies a company by CUSIP, another by ISIN. One revises
its history when figures get restated, another doesn't.
Sorting all of that out so the data is consistent and trustworthy is the price of admission before anything downstream can run.
And the sharpest version of that problem is security matching.
Most quant firms with the budget for it will buy alternative data, a newer type of data that's entered the quant finance space over the past several years.
Examples of alternative data include scraped social media data, supply chain information, news articles, credit card transactions, and broker reports, among many others.
The reason alternative data has exploded in popularity over the past few years is that it's become so hard to extract valuable information from traditional data.
There's been a lot more competition over the past couple of decades, and quant firms are constantly looking for unique pieces of information they can turn into trading factors.
So, whenever a firm buys a data set, the first step to integrating it into the system is a process called security matching, something every data engineer at a quant firm has to do.
It's a pretty tedious process, but it's a necessity for everything downstream.
Security matching is the process of mapping the entities in a data set to the firm's internal identifiers for its trading model.
For example, mapping a URL from a website like apple.com to the official identifiers that represent Apple.
These official identifiers are standardized across the finance industry.
A few to mention are ISIN and CUSIP, and Bloomberg IDs are also commonly accepted.
The reason you need to map these entities is so you can algorithmically trade based on the data set you bought.
The key concept in security matching is point in time. Point in time is a hard requirement because it prevents you from corrupting the data with look-ahead bias.
An easy way to understand it, the identifier for Apple is only valid for the specific time periods during which the company was publicly listed under it, and those official identifiers can
change for several reasons. One example
is mergers and acquisitions. So, you
always have to ask what was true as of that date, not what is true today. This
whole process tends to get pretty straightforward once you've matched a couple of data sets. A couple of things to keep in mind. The data sets you're mapping are often several terabytes in
size and they require quite a bit of sophistication in how you process them.
That ranges from distributed computing, where applicable, to processing the data with fast, efficient transformations using tools like Polars, Spark, or NumPy.
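To make the point-in-time idea concrete, here is a minimal sketch of an "as of" lookup. The mapping table, the vendor key, and the internal IDs are all hypothetical, and at real scale this would be a date-range join in something like Polars or Spark rather than a Python loop.

```python
from datetime import date
from typing import Optional

# Hypothetical mapping table: which internal ID a vendor key pointed to,
# and over which date range that was true (valid_to is exclusive).
MAPPINGS = [
    # (vendor_key, internal_id, valid_from, valid_to)
    ("example.com", "SEC_001", date(2010, 1, 1), date(2018, 6, 1)),
    ("example.com", "SEC_002", date(2018, 6, 1), date(9999, 12, 31)),
]

def match_as_of(vendor_key: str, as_of: date) -> Optional[str]:
    """Return the internal ID that was valid for vendor_key on the as_of date."""
    for key, internal_id, valid_from, valid_to in MAPPINGS:
        if key == vendor_key and valid_from <= as_of < valid_to:
            return internal_id
    return None  # no mapping existed on that date

print(match_as_of("example.com", date(2015, 3, 2)))  # SEC_001
print(match_as_of("example.com", date(2020, 3, 2)))  # SEC_002
print(match_as_of("example.com", date(2005, 3, 2)))  # None
```

The key design choice is that the lookup takes a date and never defaults to "today". Mapping a 2015 record with today's identifier is exactly the look-ahead bias you are trying to avoid.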
Another concept that's very relevant to security matching is data loaders.
When a vendor sells data to a quant firm and then updates their database, the firm needs to make sure it pulls in that updated data.
And the way you do that is with a data loader, which tends to be unique to each vendor.
Some vendors update their data daily, some weekly, some monthly.
It really depends on the type of data they sell.
Typically, what you'd expect is for the data loader to pull in the updated information from the vendor's side, perform the necessary transformations and aggregations to preprocess it, and
then save it down into the firm's internal database.
From that point on, any trading factor built off that data runs again on the fresh data that just came in. So, data
loaders are a production-quality feature. If the data loader doesn't work, the trading factor gets halted.
So, it's critical that both the security matching and the data loaders work flawlessly and that you build in some kind of auditing to catch any discrepancies in the data coming
upstream from the vendor. And that opens up a whole new concept, data auditing. I
couldn't depend on the vendor to provide correct data all the time, so I had to build auditing systems for the different vendors.
Most of the time, this was just statistical metrics reported on certain columns of the data set, or even calculating the coverage of the securities mentioned in the newly refreshed file.
Any difference that crossed a certain threshold would flag me, or whoever else was responsible on the data side, to look into the issue.
This is standard practice in the quant space because if the data is wrong, your trading factor is trading on incorrect data.
So, it's vital to pinpoint the issue as early and as far upstream as possible.
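A minimal sketch of the kind of audit described here: compare a refreshed file against the previous load on security coverage and on one column statistic, and flag anything past a threshold. The function name, the choice of checks, and the threshold values are all illustrative assumptions, not a description of any particular firm's system.

```python
def audit_refresh(prev_ids, new_ids, prev_mean, new_mean,
                  max_coverage_drop=0.05, max_mean_shift=0.25):
    """Return a list of issues found; an empty list means nothing was flagged."""
    issues = []

    # Coverage: what fraction of previously seen securities is still present?
    prev_set, new_set = set(prev_ids), set(new_ids)
    coverage = len(prev_set & new_set) / len(prev_set) if prev_set else 1.0
    if 1.0 - coverage > max_coverage_drop:
        issues.append(f"coverage dropped to {coverage:.0%}")

    # Column statistic: how far did the mean move relative to the last load?
    if prev_mean != 0:
        shift = abs(new_mean - prev_mean) / abs(prev_mean)
        if shift > max_mean_shift:
            issues.append(f"column mean shifted by {shift:.0%}")

    return issues

print(audit_refresh(["A", "B", "C", "D"], ["A", "B", "C"], 10.0, 10.5))
print(audit_refresh(["A", "B", "C", "D"], ["A", "B", "C", "D"], 10.0, 10.5))
```

In practice a check like this would run inside the data loader, before anything is written to the internal database, so a bad file is stopped before a trading factor ever sees it.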
Now, let's move one block to the right on our diagram to where those signals become alphas, the actual trading factors.
On the whiteboard, that's a single tidy box, but in real life, it's where research meets production. And that handoff is most of what I did day-to-day.
So, I want to expand on that block and talk about my experience productionizing trading factors.
I worked alongside plenty of senior researchers who were responsible for the research side of the trading factors, and I worked very closely with them to take all of their research code and
methodology and convert it into production-ready code that could run in our live trading model.
I worked on about six different research projects over my 3 years as a quant dev so far.
I obviously can't talk about the specifics, but I'll walk through the high-level process and what I went through.
The projects I worked on covered very different topics. Some of them leveraged alternative data, others used traditional financial data, and only a very small subset actually used neural network techniques, which I found pretty surprising.
A lot of research in general can be done with regression models or gradient boosted trees. If
there's interest in understanding regression models or gradient boosted trees, let me know in the comments.
That'll be my signal to do a future video explaining them intuitively.
Most of the research code was written in R, and I'd take that and convert it into either Python, Spark, or KDB (kdb+ with its q language), which is a pretty old technology that a lot of hedge funds use. It's extremely optimized for speed, and it's an excellent database.
That said, I used a lot of distributed computing when building these trading factors, along with a lot of vectorization using NumPy.
Since everything is time series based, we needed to enable distributed computation for practically every trading factor we built because the first thing you do is run the code over
historical data. And that can range anywhere from 5 to 15 years.
With distributed computing, a full historical run might take around half a day. It obviously depends on the trading factor and the data being used, but in most cases a full historical run takes about half a day to a full day. Once
everything is set, you're pretty much ready to start running the trading factor on the new data that comes in through the data loader. And because
it's all distributed, it runs a lot faster than it otherwise would. You also
have to keep in mind that these trading factors have a specific time window in which they can be updated. In
production, multi-factor models tend to get updated every day. And if you're trading global equity, for example, you only have a really specific window. Once
the New York market closes, you have a few hours to run the model on the latest data so you can start trading during Japan's market hours.
So, being able to distribute your code and really understanding this concept is very important.
When I wrote the production code, I was given a Word document that contained all of the research methodology for that factor.
That gave me an easy way to understand the thought process behind each step.
And it also made collaboration much easier across the several members who might be involved in a project. It's
essentially a central note that everyone can refer to when trying to understand a trading factor.
It became very clear during my time at the firm that documentation is extremely important.
Not only does it help researchers and developers understand the code better, but in the future it's common for a trading factor to need a second research project on top of it, either to fix issues or to amplify its performance.
This happens because factors always decay in performance, and it's up to the portfolio managers to decide whether a particular factor has some low-hanging fruit worth another
research project to improve it.
These research projects tend to involve several people from the team, usually a researcher, a developer, and a tester, plus a few senior partners who provide oversight, give advice, and get constant
reporting. These small pods are a crucial part of the firm because they require excellent collaboration from each member and a real sense of team spirit.
One thing I noticed is that the portfolio managers paid close attention to particular pods. If they noticed a pod had a great work ethic and really good synchronous workflows, they tended to
keep that pod together for future projects because it meant better project throughput.
Here's a fair question to ask looking at the board so far. Those historical runs over 15 years of data across thousands of names, where does that actually run and how does it finish before the next trading
window?
That's the infrastructure layer sitting underneath the data and alpha blocks we just expanded. So, let me draw it in.
When I joined the firm, everything was done on premises. They had huge servers in the office where everything ran.
It soon became clear to management that they had to migrate to cloud services. The team was growing, the data was getting larger, and the factors were getting more complicated and demanding more compute.
So, one of my biggest tasks at the firm was migrating a lot of our infrastructure to the cloud and that forced me to get very comfortable with AWS and Databricks.
AWS has a lot of services and it can be confusing to navigate. But, the main ones I leveraged were EC2, ECR, and S3.
On the Databricks side, I pretty much used every feature they had at the time.
They've definitely expanded a lot since.
When I migrated everything, I took our on-premises code, moved it into notebooks, and incorporated those into workflows, which essentially trigger the code to run on a specific schedule. But,
the most notable pieces of this section are Apache Spark and Delta Lake, which were the two features that really transformed a lot of our processes.
Let me explain Apache Spark in some detail because it's the engine that made all of this possible at scale. Spark is a distributed in-memory data processing engine.
The whole point of it is to take a computation that would never fit on or finish on a single machine, and spread it across a cluster of machines that work on it in parallel.
The way it's structured, you have a driver, which is the brain. It holds
your program and builds the plan.
And you have executors, which are the workers spread across the cluster that actually crunch the data.
A cluster manager hands out the machines. Your big data set gets split
into chunks called partitions, and each executor works on its own partitions at the same time.
That's where the speed comes from. If
you have 100 partitions and enough executors, you're doing 100 things at once.
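To make the driver, executor, and partition picture concrete, here's a minimal Python sketch. It is a toy and not Spark itself: the function names are made up for illustration, a thread pool stands in for the executors, and Python threads won't give real CPU parallelism. The shape is the same, though: split the data, let each worker handle its own partition, and combine the partial results on the driver.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions chunks."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_sum(partition):
    """The work one 'executor' does on its own partition."""
    return sum(partition)


def driver_total(rows, num_partitions=4, num_executors=4):
    """The 'driver': plans the split, hands partitions out, merges the results."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


total = driver_total(list(range(1, 101)))
print(total)  # 5050
```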
A couple of things make Spark special.
First, it's lazy. When you write transformations, filter this, join that, group by the other, Spark doesn't actually run anything yet. It just
records what you asked for and builds a graph of the steps.
It only executes when you hit an action, like writing the result or counting rows.
That laziness is what lets Spark's query optimizer, Catalyst, look at the entire plan and rewrite it to be as efficient as possible, pushing your filters down to the data source so it reads less,
reordering joins, and so on, before a single byte is processed.
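The lazy part is easy to see in a small Python sketch. This is a toy model, assuming a hypothetical `LazyDataset` class and `load_prices` source; it only records steps and runs them when an action is called. It does not do any of the plan rewriting that Catalyst does.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, plan=()):
        self._source = source  # callable that returns an iterable of rows
        self._plan = plan      # tuple of ("filter" | "map", function) steps

    def filter(self, predicate):
        # Transformation: just extend the plan, touch no data.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def map(self, fn):
        # Transformation: just extend the plan, touch no data.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def _run(self):
        rows = iter(self._source())
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return rows

    def collect(self):
        # Action: this is where the source is actually read.
        return list(self._run())

    def count(self):
        # Action: also triggers a full run of the plan.
        return sum(1 for _ in self._run())


source_reads = []


def load_prices():
    source_reads.append("read")  # track every time the source is touched
    return [{"px": 10.0}, {"px": 12.5}, {"px": 9.0}]


ds = LazyDataset(load_prices).filter(lambda r: r["px"] > 10.0).map(lambda r: r["px"] * 2)
reads_before_action = len(source_reads)  # 0: nothing has run yet
result = ds.collect()                    # [25.0]
reads_after_action = len(source_reads)   # 1: the action triggered the read
```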
Second, it's in-memory. The old
MapReduce model wrote intermediate results to disk between every step.
Spark keeps data in memory across steps, which is why it's often an order of magnitude faster for the kind of multi-step pipelines we ran. And it's
fault tolerant. Because Spark remembers the lineage, the recipe of transformations that built any piece of data, it can just recompute a lost chunk if a machine dies instead of failing the
whole job. The one thing you have to respect with Spark is the shuffle. Some
operations, like joins and group-bys, need data that lives on different machines to be moved around so related rows end up together.
That movement across the network is the expensive part. And most of optimizing
Spark comes down to minimizing and controlling shuffles. So, how would this
actually look if I applied Spark to a large data set? Picture one of the alternative data sets I mentioned, several terabytes sitting in S3 as partitioned files. The flow looks like
this. Spark reads those partitions from S3 spread across the executors. Because
it's lazy, it pushes my date and column filters all the way down so it only pulls what I actually need. Then it runs the transformations in parallel on each partition: cleaning, aggregating,
computing my features. When I need to attach the firm's internal security identifiers, the security matching step from earlier, that's a join.
And because the mapping table is small, Spark broadcasts it out to every executor so the join happens locally with no shuffle. Finally, I partition the output by date and write it back down into a Delta Lake table ready for
the trading factor to consume. A
historical run that would take days on one machine comes down to hours.
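Here's a rough Python sketch of why broadcasting the small mapping table matters. Everything in it is an assumption for illustration: the integer vendor ids, the `SEC_*` internal identifiers, and the way "rows sent over the network" is counted. It compares a shuffle join, where rows of the big table get re-partitioned by key, with a broadcast join, where only copies of the small table travel.

```python
def broadcast_join(big_partitions, small_table):
    """Each executor gets a full copy of the small table and joins locally."""
    rows_sent = len(small_table) * len(big_partitions)  # one copy per executor
    joined = []
    for partition in big_partitions:
        lookup = dict(small_table)  # the local copy on this executor
        for vendor_id, value in partition:
            if vendor_id in lookup:
                joined.append((lookup[vendor_id], value))
    return joined, rows_sent


def shuffle_join(big_partitions, small_table):
    """Both sides are re-partitioned by key so matching rows meet on one executor."""
    n = len(big_partitions)
    rows_sent = 0
    big_buckets = [[] for _ in range(n)]
    for home, partition in enumerate(big_partitions):
        for vendor_id, value in partition:
            dest = vendor_id % n  # toy partitioning function
            if dest != home:
                rows_sent += 1    # this row has to cross the network
            big_buckets[dest].append((vendor_id, value))
    small_buckets = [{} for _ in range(n)]
    for vendor_id, internal_id in small_table:
        small_buckets[vendor_id % n][vendor_id] = internal_id
        rows_sent += 1            # simplification: each mapping row is sent once
    joined = []
    for rows, lookup in zip(big_buckets, small_buckets):
        for vendor_id, value in rows:
            if vendor_id in lookup:
                joined.append((lookup[vendor_id], value))
    return joined, rows_sent


# Three partitions of 1,000 rows each, keyed by a made-up vendor id in 1..4.
big = [[(k % 4 + 1, float(k)) for k in range(p * 1000, (p + 1) * 1000)] for p in range(3)]
mapping = [(1, "SEC_A"), (2, "SEC_B"), (3, "SEC_C")]

b_joined, b_sent = broadcast_join(big, mapping)
s_joined, s_sent = shuffle_join(big, mapping)
print(b_sent, s_sent)
```

Both joins return the same rows. The difference is that the broadcast version moves a handful of mapping rows, while the shuffle version moves a large share of the big table.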
That Spark job had to read from somewhere and write to somewhere. So,
the natural next layer to draw underneath everything is storage. And
the choice of where data lives isn't an afterthought. It's wired into why the
whole system performs the way it does.
So, I'm also going to explain two types of databases that are very commonly used in the quant space and that almost always come up in interviews. I'd say
about 80% of the interviews I encountered for quant developer roles asked me about the specifics of databases, and discussing Parquet data lakes and Kdb has always been a talking
point. So, I'll go through how quants typically use Parquet data lakes and why. Let's start with Parquet and Delta Lake. Parquet is a columnar file format.
A normal row-oriented database stores all of a record's fields together. Parquet flips
that and stores each column together instead. That sounds like a small
detail, but it's huge for the kind of work we do because our queries usually touch a few columns across millions of rows, not whole rows. Storing by column
means you only read the columns you ask for. You get fantastic compression
because similar values sit next to each other, and you can skip entire chunks of a file that couldn't possibly match your filter. Delta Lake is what you get
when you put a transaction layer on top of a pile of Parquet files. On its own, a folder of Parquet files has no concept
of a consistent, all-or-nothing change.
Delta adds a transaction log that gives you ACID guarantees: atomicity, consistency, isolation, and durability.
In practice, that means a write either fully happens or doesn't happen at all.
Readers never see a half-written table, and concurrent jobs don't corrupt each other. It also enforces a schema, so bad
data doesn't silently slip in.
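The all-or-nothing idea can be sketched in a few lines of Python. This is a toy, assuming a hypothetical `ToyDeltaTable` class with in-memory dictionaries standing in for Parquet files; it is not the real Delta protocol. The point it shows is that data files only become visible to readers through a single append to the log, and that a batch failing the schema check never reaches the log at all.

```python
class ToyDeltaTable:
    """In-memory stand-in for data files plus a transaction log."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self._files = {}      # file name -> rows (stands in for Parquet files)
        self._log = []        # each entry lists the files added by one commit

    def commit(self, rows):
        # Validate the whole batch first, so a bad row rejects everything.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match the schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise TypeError(f"{column!r} should be {expected.__name__}")
        name = f"part-{len(self._files):05d}"
        self._files[name] = list(rows)  # data is written but not yet visible
        self._log.append([name])        # one append makes the commit visible
        return len(self._log) - 1       # version number of this commit

    def read(self):
        # Readers only ever see files that the log points to.
        return [row for entry in self._log for name in entry for row in self._files[name]]


table = ToyDeltaTable({"date": str, "ticker": str, "value": float})
table.commit([{"date": "2024-01-02", "ticker": "AAA", "value": 1.5}])

try:
    table.commit([
        {"date": "2024-01-03", "ticker": "AAA", "value": 1.6},
        {"date": "2024-01-03", "ticker": "BBB", "value": "n/a"},  # wrong type
    ])
except TypeError:
    pass  # the whole batch is rejected, including the good row

rows_visible = table.read()  # still just the first commit
```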
This combination is almost perfect for quant data, and especially for time series. You partition the data by date,
so when a factor needs the last 10 years of a few fields, it scans exactly those date partitions and exactly those columns, nothing else. Daily updates are just appends of a new date partition,
which is fast and cheap, and batch computation loves this because the whole historical run is just one big parallel scan. But, the feature I leaned on the
most was versioning. Because Delta keeps a transaction log of every change, you get time travel. You can query the table exactly as it looked on any past date.
That ties directly back to the point-in-time requirement I talked about in security matching.
When I run a factor over history, I need the data as it was known then, not as it's been restated since.
Versioning gives me that essentially for free.
And it makes runs reproducible.
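Here's a small Python sketch of the time travel idea, assuming a hypothetical `VersionedTable` class and made-up earnings numbers. It replays the commit log up to a chosen version or date, which is the same mental model as reading a Delta table by version or by timestamp. It also assumes commits arrive in date order.

```python
from datetime import date


class VersionedTable:
    """Keeps every commit so any earlier state can be rebuilt by replaying the log."""

    def __init__(self):
        self._commits = []  # list of (commit_date, {key: value}) in commit order

    def commit(self, commit_date, updates):
        self._commits.append((commit_date, dict(updates)))
        return len(self._commits) - 1  # version number

    def read(self, version=None, as_of=None):
        if version is not None:
            commits = self._commits[: version + 1]
        elif as_of is not None:
            commits = [c for c in self._commits if c[0] <= as_of]
        else:
            commits = self._commits
        snapshot = {}
        for _, updates in commits:
            snapshot.update(updates)  # later commits overwrite earlier values
        return snapshot


table = VersionedTable()
table.commit(date(2024, 2, 1), {("AAA", "2023Q4", "eps"): 1.10})  # first report
table.commit(date(2024, 5, 1), {("AAA", "2023Q4", "eps"): 1.05})  # vendor restatement

known_in_march = table.read(as_of=date(2024, 3, 1))  # sees 1.10
known_today = table.read()                           # sees 1.05
```

A backtest dated March 2024 should use `known_in_march`, because the restated 1.05 wasn't known yet.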
If a vendor restates a chunk of history, I can cleanly overwrite just those partitions instead of rebuilding the whole table.
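The date-partitioned, column-oriented layout behind all of this can be sketched in Python as well. The `PartitionedColumnStore` class and its fields are assumptions for illustration, with nested dictionaries standing in for Parquet files under date folders. It shows a scan that touches only the requested dates and columns, and a restatement that overwrites a single partition.

```python
class PartitionedColumnStore:
    """Toy layout: one entry per date, and inside it one list per column."""

    def __init__(self):
        self._partitions = {}  # ISO date string -> {column name -> values}

    def write_partition(self, day, columns):
        """Append a new date partition, or overwrite an existing one."""
        if len({len(values) for values in columns.values()}) > 1:
            raise ValueError("all columns in a partition need the same length")
        self._partitions[day] = {name: list(values) for name, values in columns.items()}

    def scan(self, start, end, wanted):
        """Read only the wanted columns from partitions with start <= day <= end."""
        out = {name: [] for name in wanted}
        cells_read = 0
        for day in sorted(self._partitions):
            if start <= day <= end:  # ISO dates sort correctly as strings
                for name in wanted:
                    values = self._partitions[day][name]
                    out[name].extend(values)
                    cells_read += len(values)
        return out, cells_read


store = PartitionedColumnStore()
store.write_partition("2024-01-02", {"close": [10.0, 20.0], "volume": [100, 200]})
store.write_partition("2024-01-03", {"close": [11.0, 21.0], "volume": [110, 210]})
store.write_partition("2024-01-04", {"close": [12.0, 22.0], "volume": [120, 220]})

# Two dates, one column: 4 of the 12 stored cells are read.
before, cells_read = store.scan("2024-01-03", "2024-01-04", ["close"])

# A restatement of one day replaces only that partition.
store.write_partition("2024-01-03", {"close": [11.5, 21.5], "volume": [110, 210]})
after, _ = store.scan("2024-01-03", "2024-01-04", ["close"])
```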
And because everything sits on cheap object storage like S3 and reads in parallel, the read and write throughput scales with the size of your cluster rather than choking on a single machine. The
other database is Kdb and its query language Q. Kdb is a completely
different animal from Parquet and Delta.
It's an in-memory columnar time series database and it is absurdly fast. It was
built from the ground up for exactly the kind of data finance produces, enormous streams of timestamped ticks and quotes.
The language Q is terse and vectorized.
You write tiny expressions that operate on entire columns at once the way NumPy does and it runs extremely close to the metal. The thing Kdb does better than
almost anything else is time-based joins, in particular the as-of join, where for every trade you want the quote
that was in effect at that exact moment.
That operation is everywhere in finance and painfully slow in a normal database.
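For anyone who hasn't seen an as-of join, here's a minimal Python sketch for a single symbol. The tuple layouts and sample numbers are made up, and this is not how Kdb implements it. It assumes the quotes are already sorted by time and uses a binary search to find, for each trade, the latest quote at or before the trade time.

```python
from bisect import bisect_right


def asof_join(trades, quotes):
    """Attach to each trade the most recent quote at or before the trade time.

    trades: list of (time, price)
    quotes: list of (time, bid, ask), sorted by time
    """
    quote_times = [q[0] for q in quotes]
    joined = []
    for trade_time, trade_price in trades:
        # Index of the last quote whose time is <= trade_time, or -1 if none.
        i = bisect_right(quote_times, trade_time) - 1
        quote = quotes[i] if i >= 0 else None
        joined.append((trade_time, trade_price, quote))
    return joined


quotes = [(1, 99.0, 101.0), (5, 100.0, 102.0), (9, 101.0, 103.0)]
trades = [(0, 100.5), (5, 101.0), (7, 101.5)]

result = asof_join(trades, quotes)
# The trade at time 0 has no prevailing quote yet; the trades at 5 and 7
# both pick up the quote from time 5.
```

The binary search keeps each lookup cheap, but it relies on the quotes being time-ordered, which is the kind of layout a time series database maintains for you.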
In Kdb, it's a first-class lightning-fast primitive. That's why a
lot of hedge funds still build their core on it despite it being an old niche technology.
At my firm, Kdb was the backbone of the live trading model and we ran it in combination with HTCondor, which is a job scheduler that farms work out across a grid of machines.
So, Kdb held and served the time series data at speed and HTCondor distributed the actual model computation across the grid on top of it.
The way I'd sum up the two: Parquet and Delta Lake are your cheap, massive, versioned warehouse for research and batch. They scale out and they remember
everything.
Kdb is your high-performance engine for time series and live trading.
It's all about raw speed on timestamped data.
Most firms use both because they're solving two different problems. And that's pretty much everything I wanted to cover for now.
If you take one thing away, let it be the picture we just built. We started
with a single skeleton of the multi-factor machine. And every piece I
worked on was really just one of those blocks cracked open and expanded. The
data block became security matching, the alpha block became the research-to-production pipeline, and underneath all of it sat the infrastructure and the databases that make it run on time.
That's genuinely how it feels on the inside. One big connected system where
every part exists for a reason.
There's a lot more I could dive into, but this should be enough for now, depending on how much of a reaction this video gets.
If there's any positive feedback, that'll be the encouragement I need to keep making more videos that are helpful for you.
Let me know your thoughts in the comments. I'll genuinely decide whether
to keep making these videos based on your reactions. So, if this was helpful,
please let me know.
Tell me what you liked the most and what you'd like to see next.
If you're interested in this kind of content, go ahead and like and subscribe so you can stay updated. I'll try to post a video every week going forward and we'll see how it goes. Thank you so
much and I'll do my best to answer any questions you have. Thanks. Bye.