Stanford CS221 | Autumn 2025 | Lecture 1: Course Overview and AI Foundations
By Stanford Online
Summary
Topics Covered
- Intelligence Requires Four Ingredients
- AI Succeeds Under Resource Constraints
- Alignment Conflicts Developer and Society
- Symbolic Neural Statistical AI Converge
- Tensors Unify All AI Computations
Full Transcript
Hello everyone.
Welcome to CS221.
This is artificial intelligence principles and techniques.
I'm your instructor, Percy Liang.
So if you're new to Stanford, welcome.
If you're returning to Stanford, then welcome back.
So this is, believe it or not, my 14th year at Stanford, and this is my 13th time teaching the class.
However, this year, we've decided to make a number of quite large changes to the course given how much AI has changed.
But at the same time, many of the key ingredients about the foundations are still the same.
And hopefully, this will become clear as I go through the lecture.
So what I want to start with today is asking the question, what is AI?
We're here to learn about AI.
So what is it?
And by now, compared to 14 years ago, I hardly have to motivate why we care about AI.
It's all around us with AI assistants that probably many of you use daily.
Many of you probably have ridden in an autonomous vehicle.
We've seen success stories of game-playing agents that have beaten humans.
More recently, there have been language models that can beat even the most elite competitors-- math and programming folks-- at their own game.
And we've seen AI also make quite a big impact in the sciences, notably biology.
So these are examples.
But what exactly is AI?
And so if we break it down, AI, in case you didn't know, stands for artificial intelligence.
There's an artificial part, which is straightforward.
It means we're building something that runs on a computer or a robot.
And the intelligence thing is a little bit more nuanced.
And this has been a subject of debate for millennia.
And often, people tend to think about it in terms of humans.
But we actually want to maybe seek a definition from general principles.
Because intelligence, I think, is a general, maybe fundamental property of the universe in some sense.
So here's how I would break it down.
So what are the ingredients you need for intelligence?
In other words, if you have an agent that is supposed to be intelligent, what should the agent be able to do?
And here agent could be a computer agent, or it could be a human for example.
So there's four things.
There's perceiving the world, reasoning with information that it has, acting in the world, and learning from experience.
So just as an example, suppose you're an autonomous vehicle. Perception means taking in all the sensor recordings-- visual, LiDAR, whatever you have-- and making sense of them.
Then reasoning about what's going on in the scene.
Is this person crossing the road?
What are these cars likely to do?
And coming up maybe with a plan.
And then there's acting: should you speed up, slow down, stop, turn, and so on?
And finally, learning, which is after you see the experience, next time, you should be better.
It would not be intelligent if you were making the same mistakes over and over again.
So this is fairly intuitive.
And each of these four ingredients, if you were to unpack it, unpacks into a whole subfield's worth of research.
So in this class, many of the methods that we are going to talk about fall into one of these four categories.
So for example, in perception, this is about processing raw inputs from the world and turning it into some sort of representation or understanding.
So this includes things like visual scene understanding, making sense of images and videos, speech recognition, or more generally audio perception, making sense of sounds, natural language understanding, text on the web or in messages, making sense of that.
And then there's reasoning, which is using the knowledge and the percepts that you have to draw inferences about the world.
So we're going to see a number of techniques in this class, starting with things like uniform cost search.
So if you're operating in a deterministic world, how can you find the shortest path? Or value iteration, as we'll see when we talk about Markov decision processes.
This is modeling the fact that the world is unknown and has uncertainty.
And how do you make optimal decisions under uncertainty?
And this is really a kind of elegant mathematical frame around that.
And then in some cases, you're playing against an adversary, in chess, for example.
And there's a principle called minimax, which allows you to play optimally even against a worst-case opponent.
And finally, when we talk about Bayesian networks, we look at cases where we have models of the world that capture all the uncertainty around them.
And we're trying to reason about things probabilistically.
So a lot of the class is about how you reason with the information you have in a "rational way."
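As a tiny preview of the minimax principle, here is a minimal sketch in Python (the toy game tree and its utility values are my own illustration, not from the lecture):

```python
# Minimal minimax sketch: the maximizing player picks the move that
# maximizes the value, assuming the opponent then minimizes it.
def minimax(state, is_max, value, moves):
    """state: game state; value(state): utility at a leaf;
    moves(state): list of successor states (empty at a leaf)."""
    succ = moves(state)
    if not succ:                      # leaf node: return its utility
        return value(state)
    vals = [minimax(s, not is_max, value, moves) for s in succ]
    return max(vals) if is_max else min(vals)

# Hypothetical two-ply game: root -> {a, b}, each with two terminal outcomes.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
utils = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

best = minimax("root", True,
               value=lambda s: utils[s],
               moves=lambda s: tree.get(s, []))
print(best)  # 3: the max player picks "a" (worst case 3) over "b" (worst case 2)
```

Real game-playing agents add depth limits, evaluation functions, and alpha-beta pruning on top of this skeleton.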
And then if you go to act, this is where you actually prove to the world that you're intelligent.
So there's outputting actions that actually affect the world.
So you might generate text.
A chatbot returns some response or returns an image.
You might be generating speech.
Or if you're controlling a robot, the robot actually moves and does things in the world.
So without acting, you don't know whether you have intelligence or not.
And finally, learning.
So this is, as the agent is proceeding through life, it should be learning from experience.
So the simplest example of learning, as we'll see very soon, is gradient descent.
When we look at reinforcement learning, we're going to look at a technique called Q-learning.
And then when we look at Bayesian networks, we're going to explore algorithms like expectation maximization.
So learning is a general principle of updating the agent's beliefs or state based on experience, and it manifests as different algorithms depending on which situation you are in. So: perceive, reason, act, learn.
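As a preview of what gradient descent looks like, here is a minimal NumPy sketch on a toy quadratic objective (the objective, step size, and iteration count are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Gradient descent on f(w) = ||w - target||^2, whose gradient is 2 * (w - target).
target = np.array([1.0, -2.0])   # the minimizer we hope to recover
w = np.zeros(2)                  # initial parameters
lr = 0.1                         # step size (learning rate)
for _ in range(100):
    grad = 2 * (w - target)      # gradient of the objective at w
    w = w - lr * grad            # step opposite the gradient
print(w)                         # converges to approximately [1.0, -2.0]
```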
And the trick is that all of these things have to happen under resource constraints.
By resources, I mean two types.
There's computation.
Your algorithm has to finish in a timely way, especially, for example, in real-time applications.
You're driving a car.
You can't take an hour to think.
You have to act in a split second.
And this also includes amount of memory or communication costs you might have.
And finally, the other type of resource which is important is information.
So an agent that's acting in the world is always going to have limited information, limited experience, and limited inputs that capture what the situation is.
You can't see around corners, for example.
And these constraints prevent the agent from acting as optimally as you would like.
But the name of the game is to deal with a limited amount of information and still be able to do something sensible.
So these resource constraints are going to really make our life challenging.
And that's why we need clever and sophisticated algorithms to actually solve these problems.
So we've talked about what an agent does, but what is it actually trying to do?
And this is, in some sense, the deeper part of the conversation.
And it can be broken up into two questions.
The first is, what does the developer of the AI agent want to achieve?
So one note-- and this is a general comment-- is that an agent either explicitly or implicitly encodes some sort of values or somewhat similarly, goals or objectives or utility functions.
Often, these objectives are stated explicitly.
But even if you don't state them explicitly, there's still some implicit thing the agent is trying to do.
So that's something to keep in mind.
And some of you might have heard of the term "alignment."
Alignment is about how you make sure the agent's values correspond to what the developer actually wants.
So the canonical example is ChatGPT: the developer, OpenAI, wants ChatGPT to be informative, avoid hallucinations, refuse harmful queries, and so on.
So that is a certain value that the developer is trying to instill in the agent.
And whether that alignment procedure is successful, well, that's another question.
And sometimes you'll see that you have examples of misalignment where despite the developer's best intentions, the AI does something that is unintended.
So that's from the developer's perspective.
But then there's a broader question of what do we want the agent to do to society?
And this gets beyond just a single developer deploying a product into the world.
This has to do with social concerns such as protecting people's privacy, thinking about the role of copyright and intellectual property with respect to generative AI, the impact on jobs, inequality, even geopolitical considerations.
So towards the end of the class, I'm going to go a lot more into this, but I think it is worth saying from day one that these are really top concerns.
And this is a really deep sociotechnical problem.
There's a question of who's "we"?
Is it me?
Probably not.
Is it the population of the United States? Is it all of the people in the world?
The problem is that there are fundamental trade-offs between different people's values. Even independent of AI, as you can see in politics, there's tension between people who want different things.
And that's an inevitable part of life.
And then there's unintended consequences where you might think you're putting out an AI that does reasonable things, but it actually has unintended consequences.
And there's many examples of that.
So let's summarize here.
So what is AI?
Well, there's some examples of AI that hopefully everyone's familiar with.
But breaking it down into the four ingredients of intelligence-- perception, reasoning, action, and learning.
All of this has to happen under resource constraints, limited compute, and limited data.
And then we have to think about developer goals.
How are we, as developers, going to build AI to do something?
And then more at the societal level, how do we as a society make sure that AI is actually benefiting society, as opposed to maybe one particular developer only?
So the question is, are we going to understand more about the limitations of AI technology today?
Absolutely.
We're going to go through-- as you'll see, we're going to start at the bottom and build up our understanding of what AI is.
And we can see very quickly when it works and when it doesn't work.
So the next thing I want to talk about is this class.
So this class is called Artificial Intelligence: Principles and Techniques.
So the emphasis is going to be really about these timeless foundations.
For example, gradient descent, or stochastic gradient descent, has been around since at least the '50s, but it's still widely used today.
Of course, the world changes: there are many new examples of AI, and our standards for intelligence might change.
But most of the class is going to be on foundations.
So when I was going back and thinking about what needs to be changed-- as you'll see later, we made some changes.
But many of the things are actually the same.
And one general philosophy, which has been there since the beginning, since I created this class, is learning by doing.
So a lot of the homeworks are going to be very much like building things, coding things up, and seeing things work.
Because at least at this point, AI is a very empirical and engineering-oriented field.
So, maybe unfortunately, without too much theory.
But there is a sense in which you just need to do things to really understand.
So in terms of changes for this year-- this might not be so relevant, but maybe you talked to your friends and had some ideas of what the class was going to be.
The biggest change is that we are going what I call tensor-native, which means that the course will be mostly in NumPy and PyTorch.
So this is obviously the framework that people use.
And again, many of the concepts are the same, but it's just allowing us to be more modernized in that sense.
And of course, tensors are very prevalent in deep learning.
But actually, if you think about all the other techniques we're going to look at, like value iteration and Bayesian network inference, all of those computations can be expressed as tensors.
So tensors are actually this much more universal tool than just deep learning or machine learning.
So it's a good step up, I think.
We had to cut constraint satisfaction problems, which is sad, but that's life.
And instead, we're going to do a deep dive into the societal impact of AI.
And I feel like this is the right trade-off to make, because AI, compared to 10 years ago, is having a much more outsized impact on our daily lives.
And it enters the public discourse.
And talking about that in AI class feels right.
Constraint satisfaction problems-- I know all of you were dying to learn about them and are very disappointed that we cut them.
You can always read up about it.
So finally, just as a tidbit, you might be wondering, what is this program that I'm going through?
So this is what I call an executable lecture.
It's a program where by executing a program, we're delivering the content of the lecture.
And if you think it kind of looks like code, you're right.
It is actually a program underneath all of this.
It's just that it's been rendered for your viewing pleasure.
So it's code.
So we can actually write code, define functions, and step through code.
This will actually be really useful-- not for the first lecture, but for later when we get into the details, because then we can walk through examples in a very crisp way.
Because it's code, lectures have this hierarchical structure, so you can see at any point in time where you are in the flow and which function you're in.
Code is also very precise: certainly more than English, but even more than math.
I think the problem with math is that you can write a lot of symbols, but then people have to understand what the symbols mean.
Whereas if you're writing code and the code actually executes, it has a well-defined semantics.
And of course, at the end of the day, if you're building an actual AI system, you're going to have to write code anyway.
So why don't we just start doing that from day one?
All right, so let me actually show you that on the Lectures page.
This is the GitHub repo.
This is where all the executable lectures live.
And today, we just did welcome.
Next, I'm going to do a brief history of AI, and then we're going to dive deep down into tensors.
We've seen what AI is today.
But I think it's useful to always-- for the same reasons you learn history, in general, to understand a bit of the context of how things develop, because history can also inform us about the future.
So one natural place to start with a history of AI is the Turing test.
So in 1950, this man, by the name of Alan Turing, published this paper.
And this paper, Alan Turing asked, can machines think?
So it's a very simple question.
But more to the point, the question is really, how could you even tell if a machine could think?
Because at that time, computers were hardly a thing.
And so this is really kind of within the realm of philosophy, almost.
And Alan Turing came up with a really clever idea, and it's called the imitation game, or more popularly known as the Turing test.
And the idea is that we set up a game where a machine and a human each try to convince a human judge that they are the actual human, and the judge can't tell which one is the machine and which is the human.
So as the machine, you're basically trying to convince a human that you're a human.
And there's all-- you can complain about the Turing test.
And many people have written extensively about how the Turing test is imperfect.
But I think the significance of this is that it grounds this philosophical question-- what is intelligence, and what does it mean to think?-- in objective measurement.
And this idea of objectivity is actually what has really been a very strong driving force in the development of AI.
If you think about it, benchmarks and measurement are the way the community has made progress.
And it kind of starts with Turing in a sense.
Now, Turing deliberately left the solution open.
He was brilliant, of course, but he didn't know how you could get there.
He actually suggested something like machine learning, or at that time, a lot of folks were thinking about logic-based methods as well.
But he didn't know what the solution was.
But that's the kind of nature of research.
You don't know what the solution is, but you can define a North Star in some sense.
So then after 1950, the history of AI, I'm going to tell in three different stories-- symbolic AI, neural AI, and statistical AI.
So in 1956, John McCarthy, who was at MIT at the time but later founded the Stanford AI Lab, organized a workshop at Dartmouth College and convened the leading thinkers of the day.
So these were the brightest minds.
And the goal was to come together in two months and make a significant advance in AI.
They didn't exactly solve too much, I think, but they did coin the term artificial intelligence, which has stuck with us.
So in general, the '50s, I think, were a time of high optimism.
There were folks like Arthur Samuel, who created a checkers-playing program.
He actually used some rudimentary machine learning at the time.
The checkers program was playing at a strong amateur level, which is not bad for the 1950s.
You have to realize in the '50s, there were hardly computers.
So this is actually pretty good.
Newell and Simon, fairly large giants in the AI space, developed a program called the Logic Theorist that could prove theorems in Russell and Whitehead's Principia Mathematica.
So automated theorem proving was a thing back in the '50s.
So of course, you see a lot of success stories these days about theorem proving.
This actually kind of goes back all the way to the '50s.
They actually had their program come up with a new proof that was more elegant than the human proof.
They actually wrote up a paper and submitted it, but the paper got rejected because it wasn't a new result.
I think the reviewers failed to notice that the third author of the paper was actually the Logic Theorist.
So, like I said, this was a period where people thought they could solve AI in a matter of years.
So sounds a bit familiar here.
But as we know, that didn't happen.
There's this folklore example from machine translation.
This is probably not a real example, but it's, a good example.
So supposedly, people were interested in translating between Russian and English.
This is the height of the Cold War.
So they took an English sentence, "The spirit is willing, but the flesh is weak."
They translated it into Russian-- I don't speak Russian, so I can't say it-- and translated it back.
And it came back as, "The vodka is good, but the meat is rotten."
So obviously, the machine translation maybe didn't work so well.
And in 1966, there was a report that said machine translation isn't really going anywhere.
We should just cut off funding.
And this led to the first AI winter.
So this is a period where funding for AI basically dries up, and people don't try to use the word AI too often.
So what went wrong here?
Well, back then in the '50s, there was just not that much compute.
So a lot of these problems were formulated as search problems, and the search space grew exponentially and outpaced hardware.
And they just had no idea how to deal with it; even the theory of NP-completeness only arrived in the '70s.
So I think the problem was just not very well understood.
And there was also limited information.
So it's one thing if you're doing chess.
But if you're thinking about machine translation, there's just a huge number of words and objects and concepts in the world.
And if you don't have all that knowledge in your system, you just don't have the intelligence.
So a lot of intelligence just has to do with knowledge.
But the silver lining was that despite not having solved AI, there were really useful things that came out of it.
For example, Lisp, which was the world's most advanced programming language at the time.
Probably still is, in some sense.
And many of the ideas in Lisp have only more recently been adopted into more modern programming languages.
Garbage collection-- if you don't know what that is, that means it succeeded, because you don't have to think about allocating and deallocating memory.
Time sharing-- the idea of multiple people using the same computer at once.
This was a radical idea back then, and all these things people had to develop in order to tackle this more ambitious problem of solving AI.
So then, fast forward through the '70s and '80s, folks were interested in knowledge.
So remember, one of the problems with the previous generation of systems was that there was just not enough information in them. So then people said, OK, why don't we build these expert systems, where we go out to a domain expert, and the experts sit down and write all the rules?
So you go to a chemistry expert or biology expert, and they just write down the rules specifying what they know about biology.
And there were a bunch of systems built back then to diagnose blood infections or configure customer orders.
And the good thing is that the knowledge actually helped with both the information gap and the computation gap: because you have more knowledge in the system, you don't have to search wildly through a huge space.
And this was actually the first time that AI actually had a real application that impacted industry and was solving real problems in the world.
But the shortcoming was that these rules really didn't deal well with uncertainty of the real world.
And these rules, furthermore, became too complicated to create and maintain.
And it sort of just resulted in spaghetti code and fell apart.
And so there was a lot of hype and underdelivering on results, which is a good way to get your funding cut off.
And that led to the second AI winter at the end of the '80s.
Also, there were economic recessions and so on.
That didn't help either.
So let's now go back and chart out another trajectory through the history of AI.
So this is neural AI.
So back in the '40s, McCulloch and Pitts-- a neuroscientist and a logician-- basically developed the theory of artificial neural networks.
That was a theory paper.
There was no way of learning the network.
It was merely exploring the properties of how a mathematical model of a neuron could work, and what kind of logical computations it could do.
The first thing you could actually call a learning algorithm comes from Hebb in 1949, with the idea that cells that fire together wire together.
It was just a sort of ad hoc rule.
Then there was the perceptron algorithm, which was a bit more principled.
And there was ADALINE for linear regression.
In 1969, there was a book that was published by Minsky and Papert called Perceptrons, and they showed that these linear models were limited.
They couldn't represent very sophisticated functions, and that book was said to have killed off neural nets research, or at least made people less excited about it.
Minsky, as it happens, actually worked on neural nets research during his PhD, but then he transitioned more into the symbolic AI school of thought.
So in the '70s, things were kind of dead.
In the '80s, things came back.
ConvNets were developed in the '80s, and backpropagation was popularized by Rumelhart, Hinton, and Williams.
Backpropagation has actually been invented and reinvented many times.
But 1986 was, I think, when it got really mainstream in the machine learning community.
And by the end of the '80s, there were some modest successes in using neural networks to do actual real things, like recognizing handwritten digits for the USPS.
But still, neural networks were really hard to train and were generally unpopular in the 2000s.
And then things started changing towards the latter part of the 2000s.
In 2006, there was this paper by Hinton talking about how you can train, actually, deep neural networks because people had really struggled to train deep neural networks back in the day.
And then things kind of snowballed.
In 2009, deep learning started having impact in speech, then in computer vision with AlexNet, and then in language with machine translation and sequence-to-sequence modeling.
There were new optimizers.
The attention mechanism was developed.
And then we have events like AlphaGo, which showed the power of deep learning, combined with reinforcement learning, that could solve problems that people thought were a decade away.
And finally, the transformer architecture came around in 2017, which actually lays a lot of the groundwork for what we'll see a bit later.
So then let's go back to the other thread.
I call it statistical AI, but really, it's more of mathematical ideas that have come from, just in general, maybe even outside AI.
So one thing to notice is that many of the core ideas that we'll talk about in this class actually are not really about AI at all.
For example, if you think about linear regression, which is the simplest, maybe arguably the simplest place to start in machine learning-- this is from Gauss in 1801.
This is definitely before computers, and you had to solve least squares by hand.
Linear classification came in 1936.
Stochastic gradient descent comes from the '50s.
There's uniform cost search, also from the algorithms community in the '50s.
And Markov decision processes come from control theory.
So this is really people who weren't necessarily trying to solve AI in the same way that John McCarthy was trying to solve AI, but nonetheless developing really valuable mathematical tools that then the AI community leaned on.
There was a period, through the '80s and up until the 2000s, of statistical machine learning.
These are techniques rooted in mathematical principles, such as Bayesian networks, support vector machines, variational inference, conditional random fields, and topic models.
These things were very in-vogue in the machine learning community.
This was also a period where neural nets were a really small fraction of the machine learning community, because they didn't have any nice mathematical properties, and they were also hard to train.
But knowing what we know now, I think it's clear what has happened.
I think many of the ideas from statistical AI-- for example, thinking carefully about optimization-- were still useful.
But the model architectures really became much more about the deep learning style.
Now, there's a more recent trend in the last five years.
I guess this is now history in some sense, even though it was only five years ago, which is the era of foundation models.
This started in language with pre-trained language models, with things like ELMo and BERT and Google's T5.
And the general idea is that if you take a large body of raw text and you train a model that essentially learns to predict the next word, then that model has really useful representations of language that can be useful for solving a bunch of other tasks.
So this idea is actually also not new.
It goes back at least 20 years.
But coupling that idea with the modern deep learning framework and executing it led to new and surprising results.
And then there was a period of large language models and foundation models, with folks like OpenAI and Google and Meta and DeepSeek and many others scaling up larger and larger models.
And this is what gives rise to the modern foundation models or generative AI boom.
One note about reasoning: there has been a lot of emphasis on solving more ambitious problems, and it's maybe trivially true that answering hard questions requires thinking.
And so the idea with reasoning models is that language models can produce thoughts, which are tokens that illustrate what it's trying to do before producing an actual response.
And this has led to another series of improvements.
And that's how you get to things like models that can get gold medals at the IMO and the IOI.
Another kind of general trend that has happened is the industrialization of AI.
So AI used to be this really cute niche thing in research where a bunch of daydreamers would come and try various things out, and maybe some things would work.
And most of the time, it didn't.
And now, GPT-4 supposedly has 1.8 trillion parameters and cost hundreds of millions of dollars to train.
xAI built a cluster of 200K H100 GPUs.
And there's hundreds of billions of dollars of investment going into AI.
At the same time, the amount of openness about what is happening in AI has also decreased quite a bit.
So the GPT-4 technical report says that, due to the competitive landscape and safety implications, there's no information about how the model was trained, which is in contrast to even five or six years ago, when companies were publishing openly.
AI has also emerged from research and is now shaping business and policy discussions.
But one thing that's interesting is that a lot of the open research problems around intelligence still remain.
So it's not like we've solved AI.
I think we just made so much progress that we can actually see it having real impact in the world, but there's more research to be done.
So some thoughts here.
So I talked about these three paradigms of thinking about AI.
And throughout history, there has been a lot of fierce debates.
So Minsky and Papert were promoting symbolic AI and deliberately tried to kill off neural nets research in some sense.
And statistical ML in the 2000s really thought neural nets were dead.
People would say, oh, really?
Why are you trying to use neural networks?
At the same time, there are some deeper connections.
So for example, the first paper that really set the groundwork for neural nets was about how to implement logical operations.
And Go is a game that's defined purely in terms of symbols.
But deep learning is actually the key to playing this purely symbolic game, which is kind of weird if you think about it.
And then deep learning, I think, during its first decade was really about perception and understanding the world, but now it has turned much more toward reasoning, which was one of the goals of symbolic AI from the very beginning.
So AI, I think, is just this kind of melting pot.
From the beginning, I think symbolic AI really had the vision of what it meant to create intelligent agents.
Neural AI provided the model architectures and the way you create models.
And then statistical AI provides the rigor.
A lot of the modern language we use to describe optimization and generalization from training to test comes from statistical thinking.
And in this class, we'll see elements of all three traditions.
So the third part of this lecture is going to be on tensors.
And this is-- we're going to roll up our sleeves and see some code and try to build some foundations.
So if you hear the word "tensor," you probably think of something that looks like this.
So tensors are what I think of as atoms of modern machine learning.
So they're used to represent data, model parameters, gradients, intermediate computations like activations of deep neural networks, and basically anything.
So everything that we are probably going to touch can be thought of as some sort of tensor.
And as I mentioned earlier, tensors also show up very broadly in science and engineering.
So even if you're never going to do any AI in the future, this is still probably a useful lecture to learn some of these foundations.
So in this lecture, I'm going to introduce some of these core ideas through NumPy examples.
So let's start with creating tensors.
So tensors are essentially a multi-dimensional array, which generalizes vectors and matrices.
The simplest tensor is a scalar, which is a rank 0 tensor.
So here, np is just NumPy, and np.array(42) creates a 0-dimensional, rank-0 array whose value is just 42.
And every tensor has a shape.
In this case, there are zero dimensions, so the shape is just empty.
So a vector is a rank-1 tensor.
So here's a vector, 1, 2, 3.
And its shape is 3 because it's length 3.
A matrix is a rank-2 tensor.
So you have this matrix, and its shape is (2, 3) because it's a 2-by-3 matrix.
And then finally, a tensor can be any rank.
So here is a rank-3 tensor.
So this has three dimensions.
The first dimension, there's two parts.
The second dimension, there's also two.
And the third dimension, there's three.
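The ranks and shapes described above can be sketched in NumPy like this (the values are arbitrary; only the shapes matter):

```python
import numpy as np

x0 = np.array(42)                           # rank-0 tensor (scalar)
x1 = np.array([1, 2, 3])                    # rank-1 tensor (vector)
x2 = np.array([[1, 2, 3],
               [4, 5, 6]])                  # rank-2 tensor (2-by-3 matrix)
x3 = np.array([[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [10, 11, 12]]])  # rank-3 tensor

x0.shape  # () -- zero dimensions
x1.shape  # (3,)
x2.shape  # (2, 3)
x3.shape  # (2, 2, 3)
```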
So given this tensor, you can extract slices.
So for example, if you index with 1, you take the first element of x-- remember, this is the zeroth element, and this is the first element.
And so y is this 2-by-3 slice.
You can further slice into the zeroth component of that, which will give you row 0 of this piece.
And then you can slice into that and take the second element, which gives you 9.
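Concretely, that chain of slices looks like this in NumPy (using the rank-3 tensor from the example as a stand-in):

```python
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]]])  # shape (2, 2, 3)

y = x[1]       # element 1 along the first dimension: [[7, 8, 9], [10, 11, 12]]
row = y[0]     # row 0 of that slice: [7, 8, 9]
elem = row[2]  # element 2 of that row: 9
```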
So generally, you don't create tensors by writing down all the entries because most tensors are huge.
So you can use these special commands to create tensors.
For example, zeros.
You pass in the shape, and it gives you a tensor of that shape.
Ones gives you the all-ones tensor.
Randn gives you a tensor filled with random numbers from a Gaussian distribution.
And there's the identity matrix.
You can take a vector and populate a matrix where the diagonal is that vector.
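As a sketch, these creation commands look like this (shapes chosen arbitrarily):

```python
import numpy as np

z = np.zeros((2, 3))              # 2-by-3 tensor of zeros
o = np.ones((2, 3))               # 2-by-3 tensor of ones
r = np.random.randn(2, 3)         # entries drawn from a standard Gaussian
i = np.eye(3)                     # 3-by-3 identity matrix
d = np.diag(np.array([1, 2, 3]))  # matrix with [1, 2, 3] on the diagonal
```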
You can also read and write tensors from disk.
So if you're loading datasets, you get a tensor that's loaded from disk.
And then you can create a tensor which is randomly initialized.
And you can do some training.
And then you write out the parameters.
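A minimal sketch of that read/write round trip (the file path here is an arbitrary temporary location, and the "parameters" are just random values):

```python
import os
import tempfile

import numpy as np

w = np.random.randn(3, 3)                         # randomly initialized "parameters"
path = os.path.join(tempfile.mkdtemp(), "w.npy")  # hypothetical save location
np.save(path, w)                                  # write the tensor to disk
w2 = np.load(path)                                # read it back
```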
So let's look at some, hopefully, motivating examples of tensors.
So these are some tensors that show up in machine learning.
So in machine learning, we'll talk about data points.
And typically, data points are represented as vectors of some dimension D. So here, I'm just using np.ones.
It's arbitrary.
You could put zeros.
We're just trying to get a tensor of the right shape here.
So this is a two-dimensional data point.
We often will batch examples together for efficiency, which we'll explain a bit later.
So if you have N examples, each D-dimensional, you can represent the dataset as follows-- here, three examples.
So we have-- the first row is the first data point.
Second row is the second data point.
Third row is third data point.
So in language modeling, this is where you actually get into higher order tensors because each example is a sequence.
Think about a sentence.
It's a sequence of length L. And so if you have a dataset of N examples, each of length L, where each position in the sequence is D-dimensional, then you have an N-by-L-by-D tensor.
So here's three examples.
Each example is actually a matrix, where the dimension of that matrix is the length by the dimensionality of the vector at each position.
So in vision, you actually get more dimensions because there's a height.
Images have height, width, and a number of channels.
So if you have N examples, where each example is an image, then you have an N-by-H-by-W-by-C tensor.
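These data shapes can be sketched as follows (all the sizes here are arbitrary placeholders, and np.ones just fills in dummy values to get tensors of the right shapes):

```python
import numpy as np

N, D = 3, 2                     # 3 examples, each 2-dimensional
x = np.ones(D)                  # a single D-dimensional data point
data = np.ones((N, D))          # dataset: one row per example

L = 4                           # sequence length (language modeling)
seqs = np.ones((N, L, D))       # N sequences, L positions, D dims per position

H, W, C = 8, 8, 3               # image height, width, channels (vision)
images = np.ones((N, H, W, C))  # N images
```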
So that's how to represent data.
In neural networks, you have weight matrices, typically, that transform a Din-dimensional input into a Dout-dimensional output.
And generally, this would be a matrix, which is Din by Dout.
So in general, the parameters of a neural network model are just a collection of tensors.
So for example, if you go on Hugging Face and you look at the DeepSeek V3 model.
So this is a 670 billion-parameter model.
And you look at the parameters.
You'll see that for every layer, there's a bunch of matrices, where each matrix has its shape.
This one's 7,168 by 16,384.
And so, in general, the parameters of a model are just a bunch of tensors.
So if you have a tensor, you can extract parts of it.
So you can just get the first row or get the first column using this notation.
You can transpose it.
Note that these operations do not make a copy of the tensor.
They're just producing different views.
So that means if I have x equal to the transpose of y and I mutate x, then y also changes.
So they're different objects, but they're pointing to the same underlying storage.
So in general, be very careful about mutation, and generally, you don't have to mutate.
But if you do mutate, really be careful.
Otherwise, you might get some bad bugs.
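Here is a minimal illustration of that view-versus-copy behavior (sizes arbitrary):

```python
import numpy as np

y = np.ones((2, 3))
x = y.T        # the transpose is a view, not a copy
x[0, 0] = 100  # mutating the view...
y[0, 0]        # ...changes y too: 100.0
```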
So I'll go over this part fairly quickly, elementwise operations.
So you have any sort of tensor.
I'll just use a rank-1 tensor for now.
You can apply a bunch of standard elementwise operations, such as raising to some power, taking the square root.
You can add tensors of the same size.
You can multiply and divide by scalars.
There are other special operations-- triu and tril take the upper or lower triangular part of a matrix.
So if you have an all-ones 3-by-3 matrix, triu will zero out everything below the diagonal.
If you do tril, it'll zero out everything above the diagonal.
So we'll see later that this is useful for masking parts of the input.
For example, if you're defining transformers with a causal mask, then you have to use these matrices to mask out certain parts of the input.
But right now, I'm just giving you the primitives and the building blocks, and then we can have some fun putting these together later.
You can also take a tensor and just create another tensor that has the same shape but is filled with zeros or ones.
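A quick sketch of these elementwise operations, the triangular masks, and the "like" constructors (values arbitrary):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
x ** 2       # elementwise power: [1, 16, 81]
np.sqrt(x)   # elementwise square root: [1, 2, 3]
x + x        # add tensors of the same shape
2 * x        # multiply by a scalar

m = np.ones((3, 3))
np.triu(m)   # zeros out everything below the diagonal
np.tril(m)   # zeros out everything above the diagonal

np.zeros_like(x)  # same shape as x, filled with zeros
np.ones_like(m)   # same shape as m, filled with ones
```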
So let's talk about matrix multiplication.
So the bread and butter of deep learning.
And generally, the main operation you do with tensors is matrix multiplication.
So hopefully, you guys should be familiar with matrix multiplication.
But let's just make sure we're on the same page here.
So I have a 4-by-6 matrix and a 6-by-3 matrix.
If I multiply these two matrices, what I'm doing is I form a 4-by-3 matrix where each element is an inner product between two six-dimensional vectors.
So just to-- I think everyone knows this.
But for example, this 6 is the inner product between the first row and the first column here.
So remember that we want to pack everything into tensors for efficiency.
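Using all-ones matrices as a stand-in, the shapes work out like this (each inner product sums six ones, giving exactly the 6 mentioned above):

```python
import numpy as np

a = np.ones((4, 6))
b = np.ones((6, 3))
c = a @ b  # (4, 6) @ (6, 3) -> (4, 3)
# each entry of c is the inner product of a row of a (length 6)
# with a column of b (length 6); with all ones, that sum is 6
c[0, 0]    # 6.0
```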
So we can actually batch multiple matrix operations together.
So suppose we have a tensor, which is 2 by 4 by 6.
So effectively, I have two 4-by-6 matrices.
And remember, often, the first dimension is a batch dimension.
So you have two data points or n data points, and you want to often do the same operation to every data point.
So NumPy makes this fairly easy.
You could just write the same thing.
And implicitly, what this is doing is that, for every slice with respect to the first dimension, we're doing the same thing that we did before.
We're doing the matrix multiplication.
So that gives us these two matrices.
Each of which is the result of multiplying one of the input matrices by w.
So in other words, for every slice x[0], x[1], we multiply by w.
And in other words, w is broadcasted to each slice of x because w is a lower rank object, and x is a higher rank object.
So what NumPy does is it sort of copies w multiple times, one for every index into the first dimension.
So this is generally a useful pattern that you'll see in a lot of deep learning code.
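A minimal sketch of this broadcasting pattern (all-ones tensors, arbitrary sizes):

```python
import numpy as np

x = np.ones((2, 4, 6))  # two 4-by-6 matrices; first dimension is the "batch"
w = np.ones((6, 3))     # a single weight matrix
y = x @ w               # w is broadcast to every slice of x

y.shape                 # (2, 4, 3)
# equivalent to stacking x[0] @ w and x[1] @ w
```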
Any questions so far about this?
So the question is, what do I mean by more efficient?
So it will be more efficient in terms of computation, and it's also shorter in terms of code.
So I'll talk about efficiency actually right now.
So the question is, does this broadcasting only work if the first dimension is a batch dimension?
So I'm calling the first dimension the batch dimension just because that's like what it represents usually in machine learning.
NumPy just sees this as a rank-3 tensor.
There's no meaning to whether it's a batch dimension or not.
It's just taking the first dimension and just doing it.
So the question is, if you have a rank-4 tensor like the one we saw for images, what can you multiply on the other side?
So in general, the rule is that you have a rank-n tensor.
You can multiply by a scalar, which in that case is easy to see because the scalar is rank 0.
So it just gets broadcasted to every single element.
You can multiply by a vector, in which case it's also broadcast to each piece, or by a matrix, and so on.
So actually, hold that thought until we talk about einops.
I'll show you a different way of thinking about this, which is far less confusing.
So let's move on to answer the question about efficiency.
So in general, when you're writing code, there's multiple ways to compute the same result.
And different ways will have different computational costs.
In the case of tensors, in general, you want to express your computation using as few tensor operations as possible.
So let's take a very simple example.
So I have two N by N matrices.
If you want to do a matrix multiplication, you could just implement it in Python.
So for every i, j, k, you accumulate c[i, j] += a[i, k] * b[k, j].
So this is also a good refresher of what matrix multiplication is, but that's what it is.
Or you can just do it in NumPy.
So there's this handy utility called timeit that allows you to see how much time these operations take.
And it turns out that if you compare the Python time to the NumPy time, the Python version is a lot slower.
And the reason for this is that-- well, first of all, this is in Python.
Python is not the fastest language.
It's interpreted.
It's not really optimized for performance.
Whereas NumPy is not only written in C (when running on CPU), but it's also been really, really optimized.
And furthermore, if you were running this on GPUs, you would definitely want to express things as tensor operations because that will give you even further acceleration, especially for large matrices.
And just to say a little bit more, once you are able to write things in tensors, then not only can you put it on GPUs, but there's-- if you think about distributed algorithms, there's automatic tools that can actually shard up your tensors across multiple GPUs, and it can go even faster.
Whereas if you're writing sequential code, that's just not a good way to do it.
That does mean, however, sometimes your code is going to be actually potentially less readable because you're doing basically gymnastics on tensors.
So there is a trade-off here.
But often, in machine learning code, if you're scaling up, you want things to be fairly fast.
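A sketch of that comparison (the matrix size n and the use of timeit here are illustrative choices, not the exact setup from lecture):

```python
import timeit

import numpy as np

n = 64
a = np.random.randn(n, n)
b = np.random.randn(n, n)

def matmul_python(a, b):
    # naive triple loop: c[i, j] += a[i, k] * b[k, j]
    c = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                c[i, j] += a[i, k] * b[k, j]
    return c

t_python = timeit.timeit(lambda: matmul_python(a, b), number=1)
t_numpy = timeit.timeit(lambda: a @ b, number=1)
# the pure-Python loop is typically orders of magnitude slower
```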
So now I'm going to talk about einops.
So here's the motivation.
So let's suppose I have this 2-by-2-by-3 tensor and another 2-by-2-by-3 tensor.
Basically, for every index into the first dimension, I want to multiply this matrix by the transpose of this matrix.
So the way I would do that is this.
Oh, actually, I should say another thing.
Whenever you define a tensor, or at least when I define a tensor, I tend to annotate what the dimensions mean.
So for example, the batch dimension.
This indexes positions in a sequence.
And this one indexes the hidden units in a neural net, and so on, just so that you don't get confused.
Because if you have a 4 or 5-rank tensor, it can be really confusing to keep track of where the dimensions are.
So if you wanted to do that operation, you would have to write x @ y.swapaxes(-2, -1), swapping the last two axes.
And that gives you your result, which is going to be a 2-by-2-by-2 tensor.
So if you're wondering what is going on here, what is this minus 2 minus 1, that's exactly the question you should be asking because it can get confusing.
Now, this example isn't that hard.
I can explain to you, but I won't because there's-- I'm going to show you a better way to do this.
So einops is this library, just like NumPy is a library, for manipulating tensors where the dimensions are named.
And it's inspired by, but not identical to, Einstein summation notation.
There's a nice tutorial, which you can go through.
But I'll just give you some of the basics.
So the first function is einsum, which is basically a generalized matrix multiplication with good bookkeeping, is how I refer to it.
So again, here's a simple example.
Suppose I just want to do x matmul y.
So I should get a 3-by-3 matrix.
So I could just multiply the matrices directly-- if they're really just matrices, that's what I would do.
But here's what it would look like in einops notation.
So einsum is going to take two arguments-- the two tensors-- and it's going to take a string.
And that string specifies how I want the matrix multiplication to go.
Since there are two tensors, I have two sequences of essentially named axes-- seq1 hidden, then hidden seq2.
You'll see that these match the names of the dimensions up here as well.
And then on the right side, I basically specify what the names of the axes should be.
And generally, the labels over here are going to be a subset of the labels over here.
So if you remember in matrix multiplication, if you have i, k, and k, j, that goes into i, j.
So basically, the row dimension of the first tensor and the column dimension of the second tensor are used to index the resulting tensor.
And any variable that's not mentioned on the right-hand side is going to be summed out.
So I'm going to take a sum over all possible values of k here.
I'm going to add all those up, and that's going to get accumulated into the appropriate entry.
Let's try a more complicated example.
So here's the motivating example.
So for this, I mean, this is a little bit overkill.
You don't really need this.
But here, I'll show you where this can come in handy.
So imagine I have these two tensors now, and now they share a batch dimension.
So remember, before I was doing this-- and you're wondering, OK, what's happening?
So here's how we would do it using einops.
You would, again, write down the labels of each tensor-- batch, seq1 hidden, batch seq2 hidden.
And then you would just write down the dimensions of the resulting tensor, which is batch seq1, seq2.
So I find this like 100 times easier to read because I know exactly how the different indices correspond to each other.
They're also named instead of having numbers like minus 2 and minus 1 and so on.
So you can be a little bit fancier.
Often, you might have a matrix, but with an unknown number of batching dimensions.
And you can just write dot, dot, dot as well.
So instead of writing batch, I can write dot, dot, dot.
And that's fine too.
So that's einsum.
That's mainly the thing that you'll need.
It might take some getting used to.
And at first, it might seem a little bit kind of cumbersome.
But hopefully, you'll grow to like it.
It's kind of like writing Python with type hints.
Sometimes it's kind of annoying.
You have to write this extra annotation, but it's much, much easier to read with type hints instead of just having f of x, y, and z.
So very quickly, two other operations I want to talk to you about-- einops reduce.
So you can reduce a tensor by performing some operation.
So suppose you have this 2-by-3-by-4 tensor.
So the traditional way of doing things is, basically, I'm going to create a tensor where I'm summing over the last dimension.
So for the first dimension and for the second dimension, which is the row, I'm going to sum over the last dimension, which is the column.
And that's why I get a 2-by-3 matrix of 4's.
So there's a function from einops called reduce, which allows you to take now a single tensor.
And it allows you to basically specify the operation, which could be sum or min or max.
And it allows you to-- basically, again, any dimension that doesn't get mentioned on the output gets sort of processed away.
And here the processing operation is sum.
So it's going to sum all the elements across the rows.
Whereas if this were max, I'm going to take the maximum across the rows and so on and so forth.
And the final thing I'll talk about is rearrange.
So this is a bit more advanced, and you'll see it when you build a transformer, where there's some more gymnastics that you have to do with different attention heads.
So the general idea is that sometimes you have a dimension that actually represents two dimensions, and you want to only operate on one of them.
So here's an example.
So suppose I have a 3-by-8 matrix.
So it's seq by total_hidden.
But secretly, total_hidden is actually a flattened representation of heads times hidden1.
So these are just names of dimensions.
So really, the way to think about it is that each row of this is like a 2-by-4 matrix.
And I want to basically multiply each of those secret 2-by-4 matrices by this 4-by-4 matrix.
And I want to put it back into that and flatten.
So basically, I can use rearrange to unflatten.
So here, what I'm doing is I can use this parentheses notation to say that this thing is actually one dimension.
And over here, I remove the parentheses to denote that those are actually two dimensions.
And the way I'm going to split it is heads gets 2, and hidden1 gets basically the number of dimensions divided by 2.
So concretely, what that looks like is I've broken-- remember, this was a 3-by-8 matrix.
I've broken this into a 3-by-2-by-4 tensor, where the heads hidden1 corresponds to this eight-dimensional vector.
And that's been broken into a 2 by 4.
So heads 2 corresponds to this dimension being 2, and hidden1 is 4 here.
So now I have this tensor.
I can perform this operation on each of these matrices.
So here, I will show you.
The pattern is hidden1, and hidden1 hidden2, mapping to hidden2.
So again, this dot, dot, dot is basically saying, for each index in the sequence and heads dimensions, I'm going to do this matrix multiplication, contracting hidden1 against hidden1 to produce hidden2.
And that results in a tensor of the same shape.
And then finally, I can put this back.
So I use rearrange again, this time grouping heads and hidden2.
This is a 3-by-2-by-4 tensor.
So heads is 2, and hidden2 is 4.
And I can group that into basically a single dimension here.
So this is a way to take a flattened object, unflatten it, do an operation, and put it back.
So just to summarize, we saw that we're using tensors in this class to basically represent everything.
It can be advantageous for computational reasons to express everything in terms of tensor operations.
This will be a little bit like a puzzle.
So you often get these puzzles like, how do you compute this?
And you have to think about how you can structure things into the matrix so you can do the right matrix multiplications.
And to help you with this, einops will make the computations more legible because there, you can annotate the dimensions.
And finally, this does take some practice.
So even though you've seen NumPy, maybe some tensors, I think getting more and more practice with this will just make you better at doing this kind of thing.