Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers
By Stanford Online
Summary
Topics Covered
- Bag-of-Words Learns Semantic Similarity
- Stochastic Gradient Descent Learns Better
- Negative Sampling Approximates Softmax Efficiently
- Co-occurrence Ratios Capture Meaning Components
- Single Vectors Superpose Word Senses
Full Transcript
okay so what are we going to do for today so the main content for today is to um go through sort of more stuff about word vectors
including touching on word senses and then introducing the notion of neural network classifiers um so our biggest goal is that by the end of today's class you should feel like you could
confidently look at one of the word embeddings papers such as the google word2vec paper or the glove paper or sanjeev arora's paper that we'll come to later and feel like yeah i can
understand this i know what they're doing and it makes sense so let's go back to where we were um so this was sort of introducing this model of word2vec and
the idea was that we started with random word vectors and then we have a big corpus of text and we're going to iterate through
each word in the whole corpus and for each position we're going to try and predict what words surround this our center word and we're going to do that
with a probability distribution that's defined in terms of the dot product between the word vectors for the center word and the context words
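As a rough sketch of the probability distribution just described, the snippet below turns dot products between a center vector and every outside vector into probabilities with a softmax. The names `context_probs`, `U` and `v_c`, the toy sizes and the random initialization are assumptions made for illustration and are not taken from the lecture.

```python
import numpy as np

def context_probs(center_vec, outside_vecs):
    """Softmax over dot products: P(o | c) for every word o in the vocabulary."""
    scores = outside_vecs @ center_vec      # one dot product per vocabulary word
    scores = scores - scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                              # toy sizes, chosen for illustration
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # outside vectors, one row per word
v_c = rng.normal(scale=0.1, size=dim)               # vector of the center word

p = context_probs(v_c, U)
print(p)        # one probability per vocabulary word
print(p.sum())  # the probabilities sum to 1
```

With small random vectors every word gets roughly the same probability, and training is what moves the words that actually occur in the context towards higher values.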
um and so that will give a probability estimate of a word appearing in the context of 'into' well actual words did occur in the context of 'into' on this occasion so what we're going to want to
do is sort of make it more likely that 'turning' 'problems' 'banking' and 'crises' will turn up in the context of 'into' and so that's learning updating the word
vectors so they can predict actual surrounding words better um and then the thing that's almost magical is that doing no more than this simple
algorithm this allows us to learn word vectors that capture well word similarity and meaningful directions in a word space
so more precisely right for this model the only parameters of this model are the word vectors so we have outside word vectors and center word vectors for each
word and then we're taking their dot product um to get a probability well we get taking a dot product to get a score of how likely a particular outside word
is to occur with the center word and then we're using the soft max transformation to convert those scores into probabilities as i discussed last time and i kind of come back to at the
end this time a couple of things to note um this model is what we call in nlp a bag of words model so bag of words models are models
that don't actually pay any attention to word order or position it doesn't matter if you're next to the center word or a bit further away on the left or right the probability estimate would be the
same and that seems like a very crude model of language that will offend any linguist and it is a very crude model of language and we'll move on to better
models of language as we go on but even that crude model of language is enough to learn quite a lot of the probability sorry quite a lot about the properties
of words and then the second note is well with this model we want it to give
reasonably high probabilities to the words that do occur in the context of the center word at least if they do so at all often but obviously lots of
different words can occur so we're not talking about probabilities like 0.3 and 0.5 we're more likely going to be talking about probabilities like 0.01
and numbers like that well how do we achieve that and well the way that the word2vec model achieves this and this is the learning phase of
the model is to place words that are similar in meaning close to each other in this high dimensional vector space so again you
can't read this one but if we scroll into this one we see lots of words that are similar in meaning grouped close together in the space so here are
days of the week like tuesday thursday sunday and also christmas over what else do we have we have
samsung and nokia this is a diagram i made quite a few years ago so that's when nokia was still an important maker of cell phones we have various sort of fields like mathematics and
economics over here so we group words that are similar in meaning actually one more note i wanted to make on this i mean again this is a
two-dimensional picture which is all i can show you on a slide um and it's done with the principal components projection that you'll also
use in the assignment um something important to remember but hard to remember is that high dimensional spaces have very different properties to
the two dimensional spaces that we can look at and so in particular a word a vector can be close to many other
things in a high dimensional space but close to them on different dimensions okay so i've mentioned
doing learning so the next question is well how do we um learn good word vectors and this was the bit that i
didn't quite hook up at the end of last class so for a while in the last class i said calculus and we have to work out um the gradient of the loss function with
respect to the parameters that will allow us to make progress um but i didn't sort of altogether put that together so what we're going to do
is um we start off with random word vectors we initialize them to small numbers near zero in each dimension we've defined our
loss function j which we looked at last time and then we're going to use a gradient descent algorithm which is an iterative algorithm that
learns to minimize j of theta by changing theta and so the idea of this algorithm is that from the current values of theta you
calculate the gradient of j of theta and then what you're going to do is make a small step in the direction of the negative gradient so the gradient is pointing upwards and we're taking a
small step in the direction of the negative of the gradient to gradually move down towards the minimum and so one of the parameters of neural
nets that you can fiddle with in your software package is what is the step size so if you take a really really itsy bitsy step it might take you a long time
to minimize the function you do a lot of wasted computation on the other hand if your step size is much too big
well then you can actually diverge and start going to worse places or even if you are going downhill a little bit that what's going to happen is you're then going to end up bouncing back and forth
and it'll take you much longer to get to the minimum okay in this picture i have a beautiful quadratic and it's easy to minimize it something
that you might know about neural networks is that in general they're not convex so you could think that this is just all going to go awry um but the truth is in practice life works out to
be okay but i think i won't get into that more right now and come back to that um in the later class so this is our gradient descent so we
have the current values of the parameters theta we then walk a little bit in the negative direction of the gradient using our
learning rate or step size alpha and that gives us new parameter values where that means that you know these are vectors but for each individual
parameter we are updating it a little bit by working out the partial derivative of j with respect to that parameter so that's the simple gradient descent
algorithm nobody uses it and you shouldn't use it the problem is that our j is a function
of all windows in the corpus remember we're doing this sum over every center word in the entire corpus and we'll often have billions of words in the corpus
so actually working out j of theta or the gradient of j of theta would be extremely extremely expensive because we have to iterate over our entire corpus so you'd wait a very long time before
you made a single gradient update and so optimization would be extremely slow and so basically a hundred percent of the time in neural network land we don't use
gradient descent we instead use what's called stochastic gradient descent and stochastic gradient descent is a very simple modification of this so rather
than working out an estimate of the gradient based on the entire corpus you simply take one center word or a
small batch like 32 center words and you work out an estimate of the gradient based on them now that estimate of the gradient will be
noisy and bad because you've only looked at a small fraction of the corpus rather than the whole corpus but nevertheless you can use that estimate of the
gradient to update your theta parameters in exactly the same way and so this is the algorithm that we can do and so then
if we have a billion word corpus um we can if we do it on each center word we can make a billion updates to the parameters as we pass through the corpus
once rather than only making one more accurate update to the parameters once you've been through the corpus so overall we can learn
several orders of magnitude more quickly and so this is the algorithm that you'll be using everywhere including um you know right right from
the beginning from our assignments um again just an extra comment of more complicated stuff we'll come back to all right
stochastic gradient descent is a sort of performance hack it lets you learn much more quickly it turns out it's not only a performance hack neural nets have some
quite counter-intuitive um properties and actually the fact that stochastic gradient descent is kind of noisy and bounces around as it does its
thing it actually means that in complex networks it learns better um solutions than if you were to run plain gradient descent very slowly so
you can both compute much more quickly and do a better job okay one final note on running stochastic gradients with word vectors this is kind
of an aside but something to note is that if we're doing a stochastic gradient update based on one window then actually in that
window we'll have seen almost none of our parameters because if we have a window of something like five words to either side of the center word we've seen at most 11 distinct word
types so we will have gradient information for those 11 words but the other 100 000 odd words in our vocabulary will have no
gradient update information so this will be a very very sparse gradient update so if you're only thinking math
you can just have your entire gradient and use the equation that i showed before but if you're
thinking systems optimization then you'd want to think well actually i only want to update the parameters for a few words and there
have to be and there are much more efficient ways that i could do that um and so um here's so this is another aside will be useful for the assignment so i will say
it up until now when i presented word vectors i presented them as column vectors and that makes the most sense if you
think about it as a piece of math whereas actually in all common deep learning packages including pytorch that we're using
word vectors are actually represented as row vectors and if you remember back to the representation of matrices in cs107 or something like that um you'll
know that that's then obviously efficient um for representing words because then you can access an entire word vector as a contiguous range of memory
different if you're in fortran anyway so actually our word vectors will be row vectors when you look at those um inside pytorch
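The two asides above, sparse updates and row vectors, can be sketched together. The snippet applies the update theta = theta - alpha * gradient only to the rows of the words seen in one window and leaves every other row alone. The helper name `sparse_sgd_step`, the sizes, the word ids and the random stand-in gradients are hypothetical and only there to make the example runnable; this is not the assignment's code.

```python
import numpy as np

def sparse_sgd_step(embeddings, row_ids, row_grads, lr=0.05):
    """Apply theta <- theta - lr * grad only to the rows listed in row_ids."""
    # np.subtract.at accumulates correctly even when a word id repeats in the window.
    np.subtract.at(embeddings, row_ids, lr * row_grads)
    return embeddings

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 8                             # hypothetical sizes
E = rng.normal(scale=0.01, size=(vocab_size, dim))    # one row per word
before = E.copy()

window_ids = np.array([3, 17, 42, 17])                # word ids seen in one window (17 repeats)
window_grads = rng.normal(size=(len(window_ids), dim))  # stand-in gradients, not a real loss

sparse_sgd_step(E, window_ids, window_grads)

changed = np.flatnonzero(np.any(E != before, axis=1))
print(changed)                       # only rows 3, 17 and 42 moved
print(E[42].flags["C_CONTIGUOUS"])   # a single row is one contiguous block of memory
```

Because each word is a row of a row-major array, reading or updating one word vector touches a single contiguous stretch of memory, which is the efficiency point made above.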
okay now i wanted to say a bit more about the word2vec algorithm family um and also um what you're going to do in homework 2 um so you're still
meant to be working on homework 1 which remember is due next tuesday but really actually with today's content we're starting into homework two and i'll kind
of go through the first part of homework two today and this other stuff you need to know for homework two so i mentioned briefly the idea that we have two
separate vectors for each word type the center vector and the outside vectors and we just average them both at the end they're similar but not identical for multiple reasons including the random
initialization and the stochastic gradient descent um you can implement a word2vec algorithm with just one
vector per word and actually if you do it works slightly better but it makes the algorithm much more complicated and the reason for that is
sometimes you'll have the same word type as the center word and the context word and that means that when you're doing your calculus at that point you've then
got this sort of messy case that just for that word you're getting an x squared term oh sorry a dot product you're getting a dot product of x dot x term which makes it sort of much messier
to work out and so that's why we use this sort of simple optimization of having two vectors per word okay so
for the word2vec model as introduced in the mikolov et al paper in 2013 it wasn't really just one
algorithm it was a family of algorithms so there are two basic model variants one was called the skip gram model which is the one that i've explained to you that
predicts the outside words position independently given the center word in a bag of words style model the other one was called the continuous bag of
words model cbow and in this one you predict the center word from a bag of context words both of these give similar results the
skipgram one is more natural in various ways so it's sort of normally the one that people have um gravitated to in subsequent work um but then as to how you train this
model um what i've presented so far is the naive softmax equation which is a simple but relatively expensive
training method and so that isn't really what they suggest using in their paper in the paper they suggest using a method that's called negative sampling so an
acronym you'll see sometimes is sgns which means skip-gram negative sampling so let me just um say a little bit um about what this is but actually
doing the skip-gram model with negative sampling is part of homework two so you'll get to know this model well so the point is that if you use this naive softmax you know even
though people commonly do use this naive softmax in various neural net models that working out the denominator is pretty expensive and that's because you
have to iterate over every word in the vocabulary and work out these dot products so if you have a hundred thousand word um
vocabulary you have to do a hundred thousand dot products um to work out the denominator and that seems a little bit of a shame and so instead of that the
idea of negative sampling is where instead of using this soft max we're going to train
binary logistic regression models for both the true pair of center word and the context word
versus noise pairs where we keep the true center word and we just randomly sample words from the vocabulary
so as presented in the paper the idea is like this so overall what we want to optimize is still an average of the
loss for each particular center word but for when we're working out the loss for each particular center word we're going to work out um sorry the loss for each particular center word and each
particular window we're going to take the dot product as before of the center word and the outside word and that's sort of the
main quantity but now instead of using that inside the softmax we're going to put it through the logistic function which is often also called the sigmoid
function the name logistic is more precise so that's this function here so the logistic function is a handy function that will map any real number to a probability
between zero and one open interval so basically if the dot product is large the logistic of the dot product will be virtually one
okay so we want this to be large and then what we'd like is on average we'd like the dot product between the center word and words that we just chose
randomly i.e they most likely didn't actually occur in the context of the center word to be small and there's just one little trick
of how this is done which is this sigmoid function is symmetric and so if um we want
this probability to be small we can take the negative of the dot product so we're wanting it to be over here that the dot product of a random word
and the center word is a negative number and so then we're going to take the negation of that and then again once we put that through the sigmoid we'd like a big number
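A small numeric check of the two properties used above: the logistic function maps any real number into the open interval between zero and one, and it is symmetric in the sense that sigmoid(-x) equals 1 - sigmoid(x). The scores fed in are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for score in (-5.0, 0.0, 5.0):
    # The last two columns agree: sigmoid(-x) == 1 - sigmoid(x).
    # A very negative score therefore gives a value near 1 once it is negated.
    print(score, sigmoid(score), sigmoid(-score), 1.0 - sigmoid(score))
```

This is why a noise word whose dot product with the center word is strongly negative contributes a value close to one after negation.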
okay so the way they're presenting things they're actually maximizing this quantity but if i go back to making it a bit more similar to the way we had written things
we've worked with minimizing the negative log likelihood um so it it looks like this so we're taking the
negative log likelihood of this the sigmoid of the dot product um again negative log likelihood we're using the same negated dot product
through the sigmoid and then we're going to work out this quantity for a handful of random
words we take k negative samples um and how likely they are to be sampled depends on their probability and where this loss function is going to be
minimized given this negation by making these dot products large and these dot products um small and negative so
there's just then one other trick that they use actually there's more than one other trick that's used in the word2vec paper to get it to perform well but i'll
only mention one of their other tricks here um when they sample the words they don't simply just sample the words based
on their um probability of occurrence in the corpus or uniformly what they do is they start with what we call the unigram distribution of words so that is
how often words actually occur in our big corpus so if you have a billion word corpus and a particular word occurred 90 times in it you're taking 90 divided by
a billion and so that's the unigram probability of the word but what they then do is that they take that to the three-quarters power and the effect of that three-quarters power which is then
re-normalized to make a probability distribution with z kind of like we saw last time with the soft max by taking the three-quarters power
that has the effect of dampening the difference between common and rare words um so that less frequent words are sampled somewhat more often but still
not nearly as much as they would be if you just use something like a uniform distribution over the vocabulary okay so
that's basically um everything to say about the basics of how we have this very simple
neural network algorithm word2vec and how we can train it and learn word vectors so for the next bit what i want to do is
step back a bit and say well here's an algorithm that i've shown you that works great um what else could we have done and what
can we say about that um and the first thing that you might think about is well here's this funny iterative algorithm to give you
word vectors um you know if we have a lot of words and a corpus it seems like a more obvious thing that we could do
is just look at the counts of how words occur with each other and build a matrix of counts a co-occurrence matrix so here's the idea
of a co-occurrence matrix so i've got a teeny little corpus i like deep learning i like nlp i enjoy flying
and i can define a window size i made my window simply size one to make it easy to fill in my matrix symmetric just like our word2vec
algorithm and so then the counts in these cells are simply how often things co-occur in the window of size one so i like
occurs twice so we get twos in these cells because it's symmetric deep learning occurs once so we get
one here and lots of other things occur zero so we can build up a co-occurrence matrix like this and well these actually give us a representation
of words as co-occurrence vectors so i can take the word i with either a row or a column vector since it's symmetric and say okay my
representation of the word i is this row vector and that is a representation of the word i and i think you can maybe convince
yourself that to the extent that words have similar meaning and usage you'd sort of expect them to have somewhat similar vectors right so if i had the
word you as well on a larger corpus you might expect i and you to have similar vectors because 'i like' 'you like' 'i enjoy' 'you enjoy' um you'd see the same kinds
of possibilities hey chris could you keep looking to answer some questions sure all right so we got some questions from the negative sampling slides um
in particular um what's like can you give some intuition for negative sampling what is the negative sampling doing and why do we uh only take one positive example those are two questions that could be answered in
tandem okay um that's a good question okay i'll try and give more intuition so the idea of negative sampling is to work out something like what the softmax
did in a much more efficient way um so in the soft max well
you wanted to give high probability to the in predicting the context a context word that actually did appear with the center
word um and well the way you do that is by having the dot product between those two words be as big as possible and
but you know it's more than that because in the denominator you're also working out the dot product with every other word in the vocabulary so as well as
wanting the dot product with the actual word that you see in the context to be big you maximize your likelihood by
making the dot products of other words that weren't in the context smaller because that's shrinking your denominator and therefore um you've got a bigger
number coming out and you're maximizing the likelihood so even for the softmax the general thing that you want to do to maximize it is have dot product with words actually in the
context big dot product with words not in the context be small to the extent possible and obviously you have to average this as best you can over all
kinds of different contexts because sometimes different words appear in different contexts obviously so um so
the negative sampling is a way of therefore trying to maximize the same objective now you know you only
have one positive term because you're actually wanting to use the actual data um so you're not wanting to invent data so for working out the entire j we do
do work this quantity out for every center word and every context word so you know we are iterating over the different words in the context window and then
we're moving through positions in the corpus so we're doing different v_c's so you know gradually we do this but for one particular center word and one particular context word we only have one
real piece of data that's positive so that's all we use because we don't know what other words should be counted as positive words now
for the negative words you could just sample one negative word and that would probably work but if you want a sort of a slightly
better more stable sense of okay we'd like to in general have other words have low probability it seems like you might be able to get better more stable results
if you instead say let's have 10 or 15 sampled negative words and indeed that's been found to be true and for the negative words well it's
easy to sample any number of random words you want and at that point it's kind of a probabilistic argument the words that you're sampling might not be actually bad words to
appear in the context they might actually be other words that are in the context but 99.9 percent of the time they will be unlikely words to occur in the context
and so they're good ones to use and yes you only sample 10 or 15 of them but that's enough to make progress because the center word is going to turn up on
other occasions and when it does you'll sample different words over here so that you gradually sample different parts of the space and start to learn
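To make the negative sampling discussion concrete, here is a minimal sketch of the loss for one center word and one true context word, together with the three-quarters-power noise distribution used to draw the negative words. The function names, the made-up counts, the toy sizes and the choice of k are assumptions for illustration, and this is not the homework's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, renormalized to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def neg_sampling_loss(v_c, u_o, u_neg):
    """-log sigmoid(u_o . v_c) - sum over k of log sigmoid(-u_k . v_c)."""
    positive = -np.log(sigmoid(u_o @ v_c))              # true (center, context) pair
    negative = -np.log(sigmoid(-(u_neg @ v_c))).sum()   # k sampled noise pairs
    return positive + negative

rng = np.random.default_rng(0)
counts = np.array([900, 90, 9, 1])           # made-up word counts
p_noise = noise_distribution(counts)
print(counts / counts.sum())                 # raw unigram probabilities
print(p_noise)                               # rare words get a larger share than before

dim, k = 5, 3                                # toy dimensionality and number of negatives
V = rng.normal(scale=0.1, size=(len(counts), dim))   # center vectors, one row per word
U = rng.normal(scale=0.1, size=(len(counts), dim))   # outside vectors, one row per word
center, outside = 0, 1
neg_ids = rng.choice(len(counts), size=k, p=p_noise)  # may occasionally hit a true context word
print(neg_sampling_loss(V[center], U[outside], U[neg_ids]))
```

The loss only needs k + 1 dot products per training pair, instead of one per vocabulary word as in the naive softmax denominator.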
we had this co-occurrence matrix and it gives a representation of words as co-occurrence vectors
and just one more note on that i mean there are actually two ways that people have commonly made these co-occurrence matrices one corresponds to what we've seen already that you use a window around a
word which is similar to word to vec and that allows you to capture some locality and some of the sort of syntactic and semantic proximity that's
more fine-grained the other way these co-occurrence matrices have often been made is that normally documents have some structure whether it's paragraphs or um just
actual web pages sort of sized documents so you can just make your window size a paragraph or a whole web page and count co-occurrence in those and this is
the kind of method that's often been used in information retrieval in methods like latent semantic analysis okay so the question then
is are these kind of count word vectors good things to use well people have used them they're not terrible
but they have certain problems the kind of problems that they have uh well firstly they're huge though very sparse so this is back where i said
before if we had a vocabulary of half a million words then we have a half a million dimensional vector for each word
which is much much bigger than the word vectors that we typically use um and it also means that because we have these very high dimensional
vectors um that we have a lot of sparsity and a lot of randomness so the results that you get tend to be noisier and less robust depending on what
particular stuff was in the corpus and so in general people have found that you can get much better results by working with low dimensional vectors so
then the idea is we can store the most of the important information about the distribution of words in the context of other words in a fixed small number of
dimensions giving a dense vector and in practice the dimensionality of the vectors that are used are normally somewhere between 25 and a thousand
and so at that point we need to use some way to reduce the dimensionality of our count co-occurrence vectors so
if you have a good memory from a linear algebra class you hopefully saw singular value decomposition and
it has various mathematical properties um that i'm not going to talk about here of singular value decomposition giving you an optimal way under a
certain definition of optimality of producing a reduced dimensionality matrix that maximally or sorry pair of matrices that maximally
well lets you recover the original matrix but the idea of the singular value decomposition is you can take any matrix such as our
count matrix and you can decompose that into three matrices u a diagonal matrix sigma and a v
transpose matrix um and this works for any shape now in these matrices some parts of it
are never used because since this matrix is rectangular there's nothing over here and so this part of the v transpose matrix gets ignored but if
you're wanting to get smaller dimensional representations what you do is take advantage of the fact that the singular values inside the
diagonal sigma matrix are ordered from largest down to smallest so what we can do is just delete out more of the matrix
delete out some singular values which effectively means that in this product some of u and some of v is also not used and so
then as a result of that we're getting lower dimensional representations um for our words if we're wanting to have word vectors
which still do as good as possible a job within the given dimensionality of enabling you to recover the original
co-occurrence matrix so from a linear algebra background um this is the obvious thing to use so how does that work
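The reduction described above can be sketched on the tiny corpus from earlier: build the window-size-one co-occurrence matrix, run a singular value decomposition, and keep only the largest singular values. Tokenizing on whitespace without punctuation tokens, keeping k = 2 dimensions, and scaling the kept columns of U by the singular values are choices assumed here for illustration.

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric window of size 1: count each adjacent pair in both directions.
X = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for left, right in zip(s, s[1:]):
        X[index[left], index[right]] += 1
        X[index[right], index[left]] += 1

U, S, Vt = np.linalg.svd(X)        # singular values in S come sorted largest first
k = 2                              # number of dimensions to keep (arbitrary here)
word_vectors = U[:, :k] * S[:k]    # one k-dimensional row per word
X_approx = word_vectors @ Vt[:k]   # rank-k approximation of the count matrix

print(X[index["i"], index["like"]])   # 2.0, since "i like" occurs twice
print(word_vectors.shape)             # (7, 2): seven words, two dimensions each
```

Dropping the smaller singular values is what discards the matching parts of U and V transpose, leaving a short dense vector per word.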
um well if you just build a raw count co-occurrence matrix and run svd on that and try and use
those as word vectors it actually works poorly and it works poorly because if you get into the mathematical assumptions of svd you're expecting to
have these normally distributed errors and what you're getting with word counts looks not at all
like something normally distributed because you have exceedingly common words like 'a' 'the' and 'and' and you have a very large number of rare words so that
doesn't work very well but you actually get something that works a lot better if you scale the counts in the cells so to deal with this problem of extremely
frequent words there are some things we can do we could just take the log of the raw counts we could kind of cap the maximum count
we could throw away the function words and any of these kind of ideas let you then have a co-occurrence matrix that you get more useful word vectors
from running something like svd and indeed these kind of models were explored um in the 1990s and in the
2000s and in particular um doug rohde explored a number of these ideas as to how to improve the co-occurrence matrix in a model that he built that was called
coals and you know actually in his coals model he observed the fact that you could get
the same kind of linear components that have semantic components that we saw yesterday when talking about
analogies so for example this is a figure from his paper and you can see that we seem to have a meaning component
going from a verb to the person who does the verb so drive to driver swim to swimmer teach to teacher marry to priest and that these
vector components are not perfectly but are roughly parallel and roughly the same size and so we have a meaning component there
that we could add on to another word just like we did previously for analogies we could say drive is to driver as marry is to what and we'd add on this
green vector component which is roughly the same as this one and we'd say oh priest so that this space could actually get some word vector
analogies right as well and so that seemed really interesting to us around the time word2vec came out of wanting to understand better what the iterative
updating algorithm of word2vec did and how it related to these more linear algebra based methods that had been explored in the couple of decades previously and
so for the next bit i want to tell you a little bit about the glove algorithm which was an algorithm for word vectors that was made by jeffrey pennington richard socher and me
in 2014 and so the starting point of this was to try to connect together the linear algebra based methods on
co-occurrence matrices like lsa and coals with the models like skip-gram cbow and their other friends which were iterative neural updating algorithms so
on the one hand you know the linear algebra methods actually seemed like they had advantages for fast training and efficient usage of statistics but
although there had been work on capturing word similarities with them by and large the results weren't as good perhaps because of disproportionate importance
given to large counts in the main conversely um the neural models it seems like if you're just doing these gradient
updates on windows you're somehow inefficiently using statistics versus a co-occurrence matrix but on the other hand it's actually easier to scale to a very
large corpus by trading time for space but at that time it seemed like the neural methods just worked better for people
that they generated improved performance on many tasks not just on word similarity and that they could capture complex patterns such as the analogies
that went beyond word similarity and so what we wanted to do was understand a bit more as to what properties do you need to have these
analogies work out as i showed last time and so what we realized was that if you'd like to have these sort of vector subtractions
um and additions work for an analogy the property that you um want is for meaning components so a
meaning component is something like going from male to female queen to king or going from
a word to its agent truck to driver um that those meaning components should be represented as ratios of co-occurrence probabilities
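One way to write down the quantity being described, as a sketch in notation assumed here rather than taken from the slides: let a and b be two words marking the two ends of a meaning component, and let x be any other word whose co-occurrence with them is being checked.

```latex
\[
\frac{P(x \mid a)}{P(x \mid b)}
\begin{cases}
\gg 1 & \text{if } x \text{ co-occurs with } a \text{ but not with } b \\
\ll 1 & \text{if } x \text{ co-occurs with } b \text{ but not with } a \\
\approx 1 & \text{if } x \text{ co-occurs with both or with neither}
\end{cases}
\]
```

Here P(x | a) stands for the probability that x appears in the context of a.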
so here's an example that shows that okay so suppose the meaning component that we want to get out is the spectrum from solid to gas as in
physics well you'd think that you can get at the solid part of it perhaps by saying does the word co-occur
with ice and the word solid occurs with ice so that looks hopeful and gas doesn't occur with ice much so that looks hopeful but the problem is the
word water will also occur a lot with ice and if you just take some other random word like the word random it probably doesn't occur with ice much
in contrast if you look at words co-occurring with steam solid won't occur with steam much but gas will but water will again and random will be
small so to get out the meaning component we want of going from gas to solid what's actually really useful is to look at the ratio of these
co-occurrence probabilities because then we get a spectrum of large to small between solid and gas
whereas for water and a random word it basically cancels out and gives you one um i just wrote these numbers in but if you
count them up in a large corpus it is basically what you get so here are actual co-occurrence probabilities and that for water and my random word which was
fashion here these are approximately one um whereas the ratio of probability of co-occurrence of solid
with ice or steam is about ten and for gas it's about a tenth so how can we capture these ratios of co-occurrence
probabilities as linear meaning components so that in our word vector space we can just add and subtract linear meaning components well
it seems like the way we can achieve that is if we build a log-bilinear model so that the dot product between two word vectors
attempts to approximate the log of the probability of co-occurrence so if you do that you then get this property that the
difference between two vectors its similarity to another word corresponds to the log of the probability ratio shown on the previous
slide so the glove model wanted to try and um unify the thinking between the
co-occurrence matrix models and the neural models by being in some way similar to a neural model but actually calculated on top of a
co-occurrence matrix count so we had an explicit loss function and our explicit loss function is that we wanted the dot product to be
similar to the log of the co-occurrence we actually added in some bias terms here but i'll ignore those for the moment and we wanted to not have very
common words dominate and so we capped the effect of high word counts using this f function that's shown here and then we could
optimize this j function directly on the co-occurrence count matrix so that gave us fast training scalable to huge corpora
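The slide with the loss is not reproduced in this transcript. As a reference point, the objective as published in the GloVe paper can be written as below, where X_ij is the co-occurrence count of words i and j, w_i and w̃_j are the two word vectors, b_i and b̃_j are the bias terms, V is the vocabulary size, and f is the capping function mentioned above.

```latex
% Log-bilinear target: the dot product approximates the log co-occurrence count
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}

% Weighted least-squares objective optimized on the co-occurrence matrix
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```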
um and so this algorithm worked very well so if you run this algorithm and ask what are the nearest words to frog you get frogs
toad and then you get some complicated words but it turns out they are all frogs um until you get down to lizards so litoria is that lovely tree frog there um
and so this actually seemed to work out pretty well how well did it work out um to discuss that a bit more i now want to say something about how do we evaluate word
vectors are we good for up to there for questions we've got some questions uh what do you mean by an inefficient use of statistics as a con for skip gram
well what i mean is that you know for word2vec you're just you know
looking at one center word at a time and generating a few negative samples and so it sort of seems like doing something imprecise there
whereas if you're doing uh optimization algorithm on the whole matrix at once well you actually know everything about the matrix at once
you're not just looking at what words what other words occurred in this one context of the center word you've got the entire vector of co-occurrence
counts for the center word and another word and so therefore you can much more efficiently and less noisily
work out how to minimize your loss okay i'll go on okay so i've sort of said look at these word
vectors they're great and i sort of showed you a few things at the end of the last class which argued hey these are great um you know they work out
these analogies um they show similarity and things like this um we want to make this a bit more precise and indeed for natural language processing as in other areas of machine
learning a big part of what people are doing is working out good ways to evaluate knowledge that things have so how can we really evaluate word
vectors so in general for nlp evaluation people talk about two ways of evaluation intrinsic and extrinsic so an intrinsic
evaluation means that you evaluate directly on the specific or intermediate subtasks that you've been working on so i want a
measure where i can directly score how good my word vectors are and normally intrinsic evaluations are fast to compute they help you to understand
the component you've been working on but often simply trying to optimize that component may or may not have a very big good
effect on the overall system that you're trying to build um so people have also been very interested in extrinsic evaluations so
an extrinsic evaluation is that you take some real task of interest to human beings whether that's web search or machine translation or something like
that and you say your goal is to actually improve performance on that task well that's a real proof that this
is doing something useful so it in some ways it's just clearly better but on the other hand it also has some disadvantages
it takes a lot longer to evaluate on an extrinsic task because it's a much bigger system and sometimes you know when you change things
it's unclear whether the fact that the numbers went down was because you now have worse word vectors or whether it's just somehow the other components of the
system interacted better with your old word vectors and if you change the other components as well things would get better again so
in some ways it can sometimes be muddier to see if you're making progress but i'll touch on both of these methods here um so for intrinsic evaluation of
word vectors one way um which we mentioned last time was this word vector analogy so we could simply
give our models a big collection of word vector analogy problems so we could say man is to woman as king is to what and ask the model to find the word that is
closest using that sort of word analogy computation and hope that what comes out there is queen and so that's something people have done
and have worked out an accuracy score of how often that you are right at this point i should just mention one little trick of these word vector
analogies that everyone uses but not everyone talks about a lot in the first instance i mean there's a little trick which you can
find in the gensim code if you look at it that when it does man is to woman as king is to what
something that could often happen is that actually the word once you do your pluses and your minuses that the word that will actually be closest is
still king so the way people always do this is that they don't allow one of the three input words um in the selection process so
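The exclusion trick described here can be sketched in a few lines. This is a toy illustration only: the three-dimensional vectors are invented so that the effect is visible, the function name `analogy` and its `exclude_inputs` flag are hypothetical, and the code does not reproduce any particular library's implementation.

```python
import math

# Toy vectors invented for illustration; real word vectors are learned
# from a corpus and have hundreds of dimensions.
vectors = {
    "man":   [1.0, -0.3, 0.0],
    "woman": [1.0, 0.3, 0.0],
    "king":  [1.0, -0.3, 2.0],
    "queen": [1.0, 1.2, 1.5],
    "apple": [-1.0, 0.0, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, exclude_inputs=True):
    """Answer 'a is to b as c is to ?' via the nearest word to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        word for word in vectors
        if not (exclude_inputs and word in (a, b, c))
    ]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("man", "woman", "king"))                        # input words excluded
print(analogy("man", "woman", "king", exclude_inputs=False))  # input words allowed
```

With these made-up vectors the unrestricted search returns one of the input words, while the restricted search returns the intended answer, which is why the exclusion is applied in practice.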
you're choosing the nearest word that isn't one of the input words um okay so this um here is showing results from the
glove vectors um so the glove vectors have this strong linear component property just like i showed before um for um
coals so this is for the male female dimension and so because of this you'd expect in a lot of cases that word analogies would work because i
can take the vector difference of man and woman and then if i add that vector difference onto brother i expect to get to sister and king queen and for many of these
examples but of course they may not always work right because if i start from emperor it's sort of on a more of a lean and so it might turn out that i get
countess or duchess coming out instead you can do this for various different relations so different semantic relations so these sort of word vectors actually learn quite a bit of just world
knowledge um so here's the company ceo or this is the company ceo around 2010 to 2014 when the data was taken from word vectors
and they as well as semantic things or pragmatic things like this they also learn syntactic things so here are vectors for positive comparative and
superlative forms of adjectives and you can see those also um move in roughly linear components um so um the word2vec people built a data set of
analogies so you could evaluate different models on the accuracy of their analogies and so here's how you can do this and this gives some numbers so there are semantic
and syntactic analogies i'll just look at the totals okay so what i said before is if you just use unscaled
co-occurrence counts and pass them through an svd things work terribly and you see that there you only get 7.3 but then as i also pointed out if you do some scaling you can actually
get svd of a scaled count matrix to work reasonably well so this svd-l is similar to the coals model and now
we're getting up to 60.1 which actually isn't a bad score right so you can actually do a decent job without a neural network um and then here are the
two variants of the um word2vec model and here are our results from the glove model and of course at the time
2014 we took this as absolute proof that our model was better and our more efficient use of statistics was really working in our favor um with seven years
of retrospect i think that's kind of not really true it turns out i think the main part of why we scored better is that we actually had better data and so
there's a bit of evidence about that on this next slide here so this looks at the semantic syntactic and overall
performance on word analogies of glove models that were trained on different subsets of data so in particular the two
on the left are trained on wikipedia and you can see that training on wikipedia makes you do really well on semantic analogies which maybe makes
sense because wikipedia just tells you a lot of semantic facts i mean that's kind of what encyclopedias do and so one of the big advantages we
actually had was that the glove model was partly trained on wikipedia as well as other texts whereas the word2vec model that was
released was trained exclusively on google news so newswire data and if you only train on a smallish amount of
newswire data you can see that for the semantics it's it's just not as good as even a one-quarter of the size amount of
wikipedia data though if you get a lot of data you can compensate for that so here on the right end you then have common crawl web data and so once
there's a lot of web data so now 42 billion words um you're then starting to get good scores again from the semantic side
um the graph on the right then shows how well do you do as you increase the vector dimension and so what you can see there is you know 25 dimensional vectors
aren't very good they go up to sort of 50 and then 100 and so 100 dimensional vectors already work reasonably well so that's why i used hundred dimensional vectors
when i showed my example in class that is the sweet spot of not too long to load and working reasonably well but you still get significant gains for 200 and
somewhat to 300 so at least back around so 2013 to 15 everyone sort of gravitated to the fact that 300 dimensional vectors is the sweet spot um
so almost frequently if you look through the best known sets of word vectors that include the word2vec vectors and the glove vectors that usually what you get
is 300 dimensional word vectors um that's not the only intrinsic evaluation you can do another intrinsic evaluation you can do
is see how these models model human judgments of word similarity um so psychologists for
several decades have actually taken human judgments of word similarity where literally you're asking people for pairs of words like professor and doctor
to give them a similarity score that's sort of being measured as some continuous quantity giving you a score between say 0 and ten um and so there
are human judgments which are then averaged over multiple human judgments as to how similar different words are so tiger and cat is pretty similar um
computer and internet is pretty similar plane and car is less similar stock and cd aren't very similar at all but stock and jaguar even less similar
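Given human scores like these, one way to check whether a model orders the word pairs the same way is a rank correlation. The sketch below assumes Spearman's rank correlation is the measure (the lecture only says "a correlation coefficient" on the ordering). The human scores and the model similarities are both invented for illustration, and the helper names are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest). Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


pairs = ["tiger/cat", "computer/internet", "plane/car", "stock/cd", "stock/jaguar"]
# Invented human similarity scores on a 0 to 10 scale.
human = [7.4, 7.6, 5.8, 1.3, 0.9]
# Invented cosine similarities from a hypothetical model.
model = [0.62, 0.55, 0.40, 0.15, 0.20]

print(spearman(human, model))
```

A value of 1 would mean the model ranks every pair exactly as the human judges do, so a higher score indicates closer agreement on the ordering rather than on the raw numbers.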
so we could then say for our models do they have the same similarity judgments and in particular we can measure a correlation coefficient
of whether they give the same ordering of similarity judgments and so then we can get data for that and so there are various different data sets of word
similarities and we can score different models as to how well they do on similarities and again you see here that plain svd
works comparatively better here for similarities than it did for analogies you know it's not great but it's now not completely terrible because we no longer need that linear
property but again scaled svds work a lot better word2vec works a bit better than that and we got some of the same kind of
minor advantages from the glove model hey chris sorry to interrupt a lot of the students were asking if you could re-explain the objective function for the glove model and also what log
bilinear means okay uh sure okay here is
here is my here is my um objective function the right if i go so one slide before that right so the property that
we want is that we want the dot product um to represent the log probability of co-occurrence
so um and that then gives me my tricky log bilinear so the bi is that there's sort of the wi and the wj so that there
are sort of two linear things and it's linear in each one of them so this is sort of like having and
rather than having a sort of an ax where you just have something that's linear in x and a is a constant it's bilinear because we have the w i w j and it's
linear in both of them and that's then related to the log of a probability and so that gives us the log bilinear model and so
since we since we'd like these things to be equal what we're doing here if you ignore these two center terms is that we're
wanting to say the difference between these two is as small as possible so we're taking this difference and we're squaring it so it's always positive and
we want that squared term to be as small as possible and you know that's 90 percent of it and you can basically stop there but the
other bit that's in here is a lot of the time when you're building models um rather than simply
having sort of an ax model it seems useful to have a bias term which can move things up and down for
the word in general and so we added into the model a bias term so that there's a bias term for both words so if in general probabilities are high for a
certain word this bias term can model that and for the other word this bias term can then model it okay so now i'll pop back and after
um oh actually i just saw someone said why multiplying by the f of sorry i did skip that last term
um okay the why multiplying by this f of x i j so this last bit was to
scale things depending on the frequency of a word because you want to pay more attention to
words that are more common or word pairs that are more common because you know if you think about it um in word2vec terms you're seeing
if things have a co-occurrence count of 50 versus 3 you want to do a better job at modeling
the co-occurrence of the things that occurred together 50 times and so you want to consider in the count of co-occurrence
but then the argument is that that actually leads you astray when you have extremely common words like function words and so effectively you paid more attention
to words that co-occurred together up until a certain point and then the curve just went flat so it didn't matter if it was an extremely extremely common word
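The weighting just described, rising with the count and then going flat, can be sketched as follows. The cutoff of 100 and the exponent of 0.75 are the values reported in the GloVe paper and are not stated in the lecture, so treat them as assumptions here. The function names `weight` and `pair_loss` are hypothetical, and `pair_loss` computes a single term of the objective rather than a full training loop.

```python
import math

X_MAX = 100.0  # cutoff reported in the GloVe paper, not stated in the lecture
ALPHA = 0.75   # exponent reported in the GloVe paper, not stated in the lecture


def weight(count, x_max=X_MAX, alpha=ALPHA):
    """f(x): grows with the count, then stays flat at 1 once count >= x_max."""
    if count >= x_max:
        return 1.0
    return (count / x_max) ** alpha


def pair_loss(w_i, w_j, b_i, b_j, count):
    """One term of the loss: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(count)
    return weight(count) * error ** 2


# A pair seen 50 times gets more weight than a pair seen 3 times,
# while very frequent pairs all get the same capped weight.
print(weight(3), weight(50), weight(5000), weight(5_000_000))
```

Because the weight is capped, a function word that co-occurs millions of times with everything cannot dominate the sum, while rare pairs with noisy counts contribute less.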
so then um for extrinsic word vector evaluation so at this point you're now wanting to sort of say well
can we embed our word vectors in some end user task and do they help and do different word vectors work
better or worse than other word vectors so this is something that we'll see a lot of later in the class i mean in particular when you get on to doing assignment
three in assignment three you get to build dependency parsers and you can then use word vectors in the dependency parser and see how much they help we
don't actually make you test out different sets of word vectors but you could um here's just one example of this to give you a sense so the task of named entity
recognition is going through a piece of text and identifying mentions of a person name or an organization name like
a company or a location and so if you have good word vectors um do they help you do named entity recognition
better and the answer to that is yes so if one starts off with a model that simply has discrete features so it uses word identity as features you can build
a pretty good named entity model doing that but if you add into it word vectors you get a better representation of the meaning of words and so that you can
have the numbers go up quite a bit and then you can compare different models to see how much gain they give you in terms of this extrinsic task
so skipping ahead this was a question that i was asked after class which was word senses because so far we've had just
one word sorry for one particular string we've got some string house and we're going to say for each of those strings there's a
word vector and if you think about it a bit more that seems like it's very
weird because actually most words um especially common words and especially words that have existed for a long time actually have many meanings which are
very different so how could that be captured if you only have one word vector for the word because you can't actually capture the fact that you've got different meanings for the word
because your meaning for the word is just one point in space one vector and so as an example of that here's the word pike now
it's actually but it is an old germanic word well what kind of meanings does the word pike have um so you can maybe just think for a minute
and think um what meanings the word pike has and it actually turns out you know it has a lot of different meanings so
so perhaps the most basic meaning is um if you did fantasy games or something medieval weapons um a sharp pointed staff there's a pike um but there's a
kind of a fish that has a similar elongated shape that's a pike um it was used for railroad um lines maybe that usage isn't
used much anymore but it certainly still survives in referring to roads so this is like when you have turnpikes we have expressions where pike means the future
like coming down the pike it's a position in diving that divers do a pike those are all noun uses there are also verbal uses so you can pike
somebody with your pike you know different usages might have different currency in australia you can also use pike to mean that you pull out of doing
something like i reckon he's going to pike i don't think that usage is used in america but lots of meanings and actually for words that are commoner if you start thinking words like line or
field i mean they just have even more meanings than this so what are we actually doing with just one vector for a word and well
one way you could go is to say okay up until now what we've done is crazy pike has and other words have all of these different meanings so maybe what we should do
is have different word vectors for the different meanings of pike so we'd have one word vector for the medieval pointy
weapon another word vector for the kind of fish another word vector for the kind of road so that they then be word sense vectors
and you can do that i mean actually we were working on that in the early 2010s actually even before word to vect came out so
this picture is a little bit small to see but what we were doing was for words we were clustering instances of a word hoping that those
clusters so clustering the word tokens hoping those clusters that were similar represented senses and then for the clusters of word tokens we were sort of
treating them like they were separate words and learning a word vector for each and you know basically that actually works so in green we have two
senses for the word bank and so there's one sense for the word bank that's over here where it's close to words like banking finance transaction and laundering and then we have another
sense for the word bank over here where it's close to words like plateau boundary gap territory which is the riverbank sense of the word bank um and
for the word jaguar that's in purple um well jaguar has a number of senses and so we have those as well so this sense down here is um close to hunter so that's the
sort of big game animal sense of um jaguar up the top here is being shown close to luxury and convertibles this is the jaguar car sense
um then jaguar here is near string um keyboard and words like that so jaguar is the name of a kind of keyboard um and
then this final jaguar over here is close to software and microsoft and then if you're old enough you'll remember that there was an old version of mac os that was called jaguar um so that's then
the computer sense so basically this does work and we can learn word vectors for different senses of a word but actually this isn't the majority way
that things have gone in practice and there are kind of a couple of reasons for that i mean one is just simplicity
if you do this it's kind of complex because you first of all have to learn word senses and then start learning word vectors in terms of the word senses
but the other reason is although this model of having word senses um is traditional it's what you see in dictionaries it's commonly what's being
used in natural language processing i mean it tends to be imperfect in its own way because we're trying to take all the uses of the word pike and sort of cut
them up into different senses where the difference is kind of overlapping and it's often not clear which ones to count as distinct so for example here right a
railroad line and a type of road well sort of that's the same sense of pike it's just that they're different forms of transportation and so you know that this could be you know a type of transportation line and cover both of
them so it's always sort of very unclear how you cut word meaning into different senses and indeed if you look at different dictionaries everyone does
it differently so um it actually turns out that in practice you can do
rather well by simply having one word vector per word type and what happens if you do that well what you find
is that what you learn as a word vector is what gets referred to in fancy talk as a
superposition of the word vectors for the different senses of a word um where the word superposition means no more or less
than a weighted sum so the vector that we learned for pike will be a weighted average of the vectors that you would have learned for the medieval
weapon sense plus the fish sense plus the road sense plus whatever other senses that you have where the weighting that's given to these different sense vectors
corresponds to the frequencies of use of the different senses so we end up with the word um the vector for pike um being a
kind of an average vector and so if you say okay you've just added up several
different vectors into an average you might think that that's kind of useless because you know you've lost the real meanings of the word and you've just got
some kind of funny average vector that's in between them but actually it turns out that if you use this average vector
in applications it tends to sort of self-disambiguate because if you say is the word pike similar to the word for
fish well part of this vector represents fish the fish sense of pike and so in those components it'll be kind of similar to
the fish vector and so yes you'll say there's substantial similarity whereas if in another um
piece of text that says you know the men were armed with pikes and lances or pikes and maces or whatever other medieval weapons you remember well actually
some of that meaning is in the pike vector as well and so it'll say yeah there's good similarity with mace and staff and words like that as
well and in fact we can work out which sense of pike is intended by just sort of seeing which components are similar to other words
that are used in the same context and indeed there's actually a much more surprising result than that and this is a result that's um due to sanjeev arora
tengyu ma who is now on our stanford faculty and others in 2018 and that's the following result which i'm not actually going to explain but um
so if you think that the vector for pike is just a sum of the vectors for the different senses well it should be
you'd think that it's just completely impossible to reconstruct the sense vectors from
the vector for um the word type because normally if i say i've got two numbers the sum of them is 17 you just have no information as to what my two numbers
are right you can't resolve it um and even worse if i tell you i've got three numbers and they sum to 17 but it turns out that when we have these
high dimensional vector spaces that things are so sparse in those high dimensional vector spaces that you can
use ideas from sparse coding to actually separate out the different senses providing they're relatively common so they show in their paper that you can
start with the vector of say pike and actually separate out components of that vector that correspond to different senses of the word pike and so here's
an example at the bottom of this slide which is for the word tie it's separated out that vector into five different senses and so there's one
sense it's close to the words trousers blouse waistcoat so this is the sort of clothing sense of tie another sense is close to wires cables wiring
electrical so that's the sort of the tie sense of a tie used in electrical stuff and then we have sort of scoreline goals equalizer so this is
the sporting game sense of tie this one also seems to in a different way evokes sporting game sense of tie and then there's finally this one here
maybe my music is just really bad maybe it's because you get ties and music when you tie notes together i guess so you get these different senses out of it
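The earlier point that a single word vector behaves like a frequency-weighted sum of sense vectors, and still "self-disambiguates" in similarity comparisons, can be illustrated with a toy sketch. Everything below is invented: the four-dimensional sense vectors, the sense frequencies, and the neighbor words. Making the sense vectors orthogonal is an idealization, and the sketch does not attempt the sparse coding recovery of senses described in the 2018 result.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical sense vectors for pike (weapon, fish, road) in a toy 4-d space.
sense_vectors = {
    "pike_weapon": [1.0, 0.0, 0.0, 0.0],
    "pike_fish":   [0.0, 1.0, 0.0, 0.0],
    "pike_road":   [0.0, 0.0, 1.0, 0.0],
}
# Invented relative frequencies of use for each sense.
sense_frequency = {"pike_weapon": 0.3, "pike_fish": 0.5, "pike_road": 0.2}


def superpose(senses, weights):
    """Weighted sum of sense vectors, standing in for the single learned word vector."""
    dim = len(next(iter(senses.values())))
    total = [0.0] * dim
    for name, vec in senses.items():
        for k in range(dim):
            total[k] += weights[name] * vec[k]
    return total


pike = superpose(sense_vectors, sense_frequency)

# Invented vectors for other words, for comparison only.
neighbors = {
    "fish":     [0.0, 1.0, 0.0, 0.1],
    "mace":     [1.0, 0.0, 0.0, 0.1],
    "software": [0.0, 0.0, 0.0, 1.0],
}
for word, vec in neighbors.items():
    print(word, round(cosine(pike, vec), 3))
```

In this toy setup the single pike vector stays noticeably similar to both fish and mace and unrelated to software, with the more frequent sense contributing the larger similarity.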