Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers

By Stanford Online

Summary

Topics Covered

Bag-of-Words Learns Semantic Similarity
Stochastic Gradient Descent Learns Better
Negative Sampling Approximates Softmax Efficiently
Co-occurrence Ratios Capture Meaning Components
Single Vectors Superpose Word Senses

Full Transcript

okay so what are we going to do for today so the main content for today is to um go through sort of more stuff about word vectors

including touching on word senses and then introducing the notion of neural network classifiers um so our biggest goal is that by the end of today's class you should feel like you could

confidently look at one of the word embeddings papers such as the google word2vec paper or the glove paper or sanjeev arora's paper that we'll come to later and feel like yeah i can

understand this i know what they're doing and it makes sense so let's go back to where we were um so this was sort of introducing this model of word2vec and

the idea was that we started with random word vectors and then we have a big corpus of text and we're going to iterate through

each word in the whole corpus and for each position we're going to try and predict what words surround this our center word and we're going to do that

with a probability distribution that's defined in terms of the dot product between the word vectors for the center word and the context words
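
A minimal sketch of that probability distribution, assuming NumPy and toy sizes: each outside word gets a score from its dot product with the center vector, and a softmax turns the scores into probabilities. The names `context_probabilities`, `U`, and `v_c` are illustrative and not taken from the lecture's code.

```python
import numpy as np

def context_probabilities(center_vec, outside_vecs):
    """Softmax over dot products of one center vector with every outside vector."""
    scores = outside_vecs @ center_vec      # one score per vocabulary word
    scores = scores - scores.max()          # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                                # toy sizes, for illustration only
U = rng.normal(scale=0.1, size=(vocab_size, dim))     # outside vectors, one row per word
v_c = rng.normal(scale=0.1, size=dim)                 # vector of the center word
probs = context_probabilities(v_c, U)
```

With a real vocabulary the same call returns one probability per word type, which is why most individual probabilities end up small.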

um and so that will give a probability estimate of a word appearing in the context of into well actual words did occur in the context of into on this occasion so what we're going to want to

do is sort of make it more likely that turning problems banking and crises will turn up in the context of into and so that's learning updating the word

vectors so they can predict actual surrounding words better um and then the thing that's almost magical is that doing no more than this simple

algorithm this allows us to learn word vectors that capture well word similarity and meaningful directions in a word space

so more precisely right for this model the only parameters of this model are the word vectors so we have outside word vectors and center word vectors for each

word and then we're taking their dot product um to get a probability well we get taking a dot product to get a score of how likely a particular outside word

is to occur with the center word and then we're using the soft max transformation to convert those scores into probabilities as i discussed last time and i kind of come back to at the

end this time a couple of things to note um this model is what we call in nlp a bag of words model so bag of words models are models

that don't actually pay any attention to word order or position it doesn't matter if you're next to the center word or a bit further away on the left or right the probability estimate would be the

same and that seems like a very crude model of language that will offend any linguist and it is a very crude model of language and we'll move on to better

models of language as we go on but even that crude model of language is enough to learn quite a lot of the probability sorry quite a lot about the properties

of words and then the second note is well with this model we want it to give

reasonably high probabilities to the words that do occur in the context of the center word at least if they do so at all often but obviously lots of

different words can occur so we're not talking about probabilities like 0.3 and 0.5 we're more likely going to be talking about probabilities like 0.01

and numbers like that well how do we achieve that and well the way that the word2vec model achieves this and this is the learning phase of

the model is to place words that are similar in meaning close to each other in this high dimensional vector space so again you

can't read this one but if we zoom into this one we see lots of words that are similar in meaning grouped close together in the space so here are

days of the week like tuesday thursday sunday and also christmas over what else do we have we have

samsung and nokia this is a diagram i made quite a few years ago so that's when nokia was still an important maker of cell phones we have various sort of fields like mathematics and

economics over here so we group words that are similar in meaning actually one more note i wanted to make on this i mean again this is a

two-dimensional picture which is all i can show you on a slide um and it's done with the principal components projection that you'll also

use in the assignment um something important to remember but hard to remember is that high dimensional spaces have very different properties to

the two dimensional spaces that we can look at and so in particular a word a vector can be close to many other

things in a high dimensional space but close to them on different dimensions okay so i've mentioned

doing learning so the next question is well how do we um learn good word vectors and this was the bit that i

didn't quite hook up at the end of last class so for a while in the last i said calculus and we have to work out um the gradient of the loss function with

respect to the parameters that will allow us to make progress um but i didn't sort of altogether put that together so what we're going to do

is um we start off with random word vectors we initialize them to small numbers near zero in each dimension we've defined our

loss function j which we looked at last time and then we're going to use a gradient descent algorithm which is an iterative algorithm that

learns to minimize j of theta by changing theta and so the idea of this algorithm is that from the current values of theta you

calculate the gradient of j of theta and then what you're going to do is make a small step in the direction of the negative gradient so the gradient is pointing upwards and we're taking a

small step in the direction of the negative of the gradient to gradually move down towards the minimum and so one of the parameters of neural

nets that you can fiddle with in your software package is what is the step size so if you take a really really itsy bitsy step it might take you a long time

to minimize the function you do a lot of wasted computation on the other hand if your step size is much too big

well then you can actually diverge and start going to worse places or even if you are going downhill a little bit that what's going to happen is you're then going to end up bouncing back and forth

and it'll take you much longer to get to the minimum okay in this picture i have a beautiful quadratic and it's easy to minimize it something

that you might know about neural networks is that in general they're not convex so you could think that this is just all going to go awry um but the truth is in practice life works out to

be okay but i think i won't get into that more right now and come back to that um in the later class so this is our gradient descent so we

have the current values of the parameters theta we then walk a little bit in the negative direction of the gradient using our

learning rate or step size alpha and that gives us new parameter values where that means that you know these are vectors but for each individual

parameter we are updating it a little bit by working out the partial derivative of j with respect to that parameter so that's the simple gradient descent

algorithm nobody uses it and you shouldn't use it the problem is that our j is a function

of all windows in the corpus remember we're doing this sum over every center word in the entire corpus and we'll often have billions of words in the corpus

so actually working out j of theta or the gradient of j of theta would be extremely extremely expensive because we have to iterate over our entire corpus so you'd wait a very long time before

you made a single gradient update and so optimization would be extremely slow and so basically a hundred percent of the time in neural network land we don't use

gradient descent we instead use what's called stochastic gradient descent and stochastic gradient descent is a very simple modification of this so rather

than working out an estimate of the gradient based on the entire corpus you simply take one center word or a

small batch like 32 center words and you work out an estimate of the gradient based on them now that estimate of the gradient will be

noisy and bad because you've only looked at a small fraction of the corpus rather than the whole corpus but nevertheless you can use that estimate of the

gradient to update your theta parameters in exactly the same way and so this is the algorithm that we can do and so then

if we have a billion word corpus um we can if we do it on each center word we can make a billion updates to the parameters we pass through the corpus

once rather than only making one more accurate update to the parameters at once you've been through the corpus so overall we can learn

several orders of magnitude more quickly and so this is the algorithm that you'll be using everywhere including um you know right right from

the beginning from our assignments um again just an extra comment of more complicated stuff we'll come back to all right
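
A rough sketch of the stochastic update described above, assuming NumPy. To keep it short and runnable it uses a toy least-squares loss rather than the word2vec loss; the names `loss_grad`, `alpha`, and `batch_size` are illustrative. The only point is the shape of the loop: estimate the gradient from a small batch, then step in the negative gradient direction.

```python
import numpy as np

def loss_grad(theta, X, y):
    """Gradient of the toy loss 0.5 * mean((X @ theta - y) ** 2) with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # toy dataset standing in for the corpus
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                        # noiseless targets, so the minimum is exact

alpha = 0.1                               # step size (learning rate)
batch_size = 32                           # small batch, as mentioned in the lecture
theta = np.zeros(3)
for step in range(500):
    idx = rng.integers(0, len(y), size=batch_size)     # sample a minibatch
    grad_estimate = loss_grad(theta, X[idx], y[idx])   # noisy gradient estimate
    theta = theta - alpha * grad_estimate              # step against the gradient
```

Each update touches only 32 examples, so many updates fit in the time one full-data gradient would take.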

stochastic gradient descent is a sort of performance hack it lets you learn much more quickly it turns out it's not only a performance hack neural nets have some

quite counter-intuitive um properties and actually the fact that stochastic gradient descent is kind of noisy and bounces around as it does its

thing it actually means that in complex networks it learns better um solutions than if you were to run plain gradient descent very slowly so

you can both compute much more quickly and do a better job okay one final note on running stochastic gradients with word vectors this is kind

of an aside but something to note is that if we're doing a stochastic gradient update based on one window then actually in that

window we'll have seen almost none of our parameters because if we have a window of something like five words to either side of the center word we've seen at most 11 distinct word

types so we will have gradient information for those 11 words but the other 100 000 odd words in our vocabulary will have no

gradient update information so this will be a very very sparse gradient update so if you're only thinking math

you can just have your entire gradient and use the equation that i showed before but if you're

thinking systems optimization then you'd want to think well actually i only want to update the parameters for a few words and there

have to be and there are much more efficient ways that i could do that um and so um here's so this is another aside will be useful for the assignment so i will say

it up until now when i presented word vectors i presented them as column vectors and that makes the most sense if you

think about it as a piece of math whereas actually in all common deep learning packages including pytorch that we're using

word vectors are actually represented as row vectors and if you remember back to the representation of matrices in cs107 or something like that um that you'll

know that that's then obviously efficient um for representing words because then you can access an entire word vector as a contiguous range of memory

different if you're in fortran anyway so actually our word vectors will be row vectors when you look at those um inside pytorch
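
A small sketch of both asides, assuming NumPy rather than PyTorch: word vectors stored as rows of one matrix, and a sparse update that touches only the rows for words seen in the current window. The word ids and gradients are made up for illustration.

```python
import numpy as np

vocab_size, dim = 1000, 8                  # toy sizes
rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.01, size=(vocab_size, dim))  # one row per word
before = embeddings.copy()

window_ids = np.array([3, 17, 17, 256])    # hypothetical word ids; 17 occurs twice
window_grads = rng.normal(size=(len(window_ids), dim))       # stand-in gradients
alpha = 0.05

# Update only the rows that appeared in the window. np.subtract.at
# accumulates correctly when the same row index appears more than once.
np.subtract.at(embeddings, window_ids, alpha * window_grads)

changed_rows = np.flatnonzero(np.any(embeddings != before, axis=1))
```

Here `changed_rows` lists only the three distinct ids from the window; every other row is untouched, and each row is a contiguous slice of memory in NumPy's default row-major layout.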

okay now i wanted to say a bit more about the word2vec algorithm family um and also um what you're going to do in homework 2. um so if you're still

meant to be working on homework 1 which remember is due next tuesday that really actually with today's content we're starting into homework two and i'll kind

of go through the first part of homework two today and this other stuff you need to know for homework two so i mentioned briefly the idea that we have two

separate vectors for each word type the center vector and the outside vectors and we just average them both at the end they're similar but not identical for multiple reasons including the random

initialization and the stochastic gradient descent um you can implement a word2vec algorithm with just one

vector per word and actually if you do it works slightly better but it makes the algorithm much more complicated and the reason for that is

sometimes you'll have the same word type as the center word and the context word and that means that when you're doing your calculus at that point you've then

got this sort of messy case that just for that word you're getting an x squared term oh sorry a dot product you're getting a dot product of x dot x term which makes it sort of much messier

to work out and so that's why we use this sort of simple optimization of having two vectors per word okay so

for the word2vec model as introduced in the mikolov et al paper in 2013 it wasn't really just one

algorithm it was a family of algorithms so there are two basic model variants one was called the skip-gram model which is the one that i've explained to you that

predicts outside words position independent given the center word in a bag of words style model the other one was called the continuous bag of

words model cbow and in this one you predict the center word from a bag of context words both of these give similar results the

skipgram one is more natural in various ways so it's sort of normally the one that people have um gravitated to in subsequent work um but then as to how you train this

model um what i've presented so far is the naive softmax equation which is a simple but relatively expensive

training method and so that isn't really what they suggest using in their paper in the paper they suggest using a method that's called negative sampling so an

acronym you'll see sometimes is sgns which means skip-gram negative sampling so let me just um say a little bit um about what this is but actually

doing the skip-gram model with negative sampling is the part of homework two so you'll get to know this model well so the point is that if you use this naive softmax you know even

though people commonly do use this naive softmax in various neural net models that working out the denominator is pretty expensive and that's because you

have to iterate over every word in the vocabulary and work out these dot products so if you have a hundred thousand word um

vocabulary you have to do a hundred thousand dot products um to work out the denominator and that seems a little bit of a shame and so instead of that the

idea of negative sampling is where instead of using this soft max we're going to train

binary logistic regression models for both the true pair of center word and the context word

versus noise pairs where we keep the true center word and we just randomly sample words from the vocabulary

so as presented in the paper the idea is like this so overall what we want to optimize is still an average of the

loss for each particular center word but for when we're working out the loss for each particular center word we're going to work out um sorry the loss for each particular center word and each

particular window we're going to take the dot product as before of the center word and the outside word and that's sort of the

main quantity but now instead of using that inside the softmax we're going to put it through the logistic function which is sometimes also often also called the sigmoid

function the name logistic is more precise so that's this function here so the logistic function is a handy function that will map any real number to a probability

between zero and one open interval so basically if the dot product is large the logistic of the dot product will be virtually one

okay so we want this to be large and then what we'd like is on average we'd like the dot product between the center word and words that we just chose

randomly i.e they most likely didn't actually occur in the context of the center word to be small and there's just one little trick

of how this is done which is this sigmoid function is symmetric and so if um we want

this probability to be small we can take the negative of the dot product so we're wanting it to be over here that the product the dot product of a random word

and the center word is a negative number and so then we're going to take the negation of that and then again once we put that through the sigmoid we'd like a big number
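
The two terms described so far can be sketched as one loss for a single center word, one true outside word, and a few sampled noise words. This assumes NumPy and writes the objective as a negative log likelihood to minimize; the function and variable names are illustrative and the vectors are toy values.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_vec, outside_vec, negative_vecs):
    """Negative-sampling loss for one (center, outside) pair and K noise words."""
    # True pair: we want a large dot product, so sigmoid(dot) close to 1.
    pos_term = -np.log(sigmoid(outside_vec @ center_vec))
    # Noise pairs: negate the dot product, so a negative dot gives a sigmoid close to 1.
    neg_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return pos_term + neg_term

v_c = np.array([1.0, 0.0])
good = neg_sampling_loss(v_c, np.array([2.0, 0.0]), np.array([[-2.0, 0.0], [-1.0, 0.0]]))
bad = neg_sampling_loss(v_c, np.array([-2.0, 0.0]), np.array([[2.0, 0.0], [1.0, 0.0]]))
```

In this toy setup `good` is smaller than `bad`, because the true pair has a large positive dot product and the noise pairs have negative ones. The negation trick relies on the symmetry `sigmoid(-x) == 1 - sigmoid(x)`.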

okay so the way they're presenting things they're actually maximizing this quantity but if i go back to making it a bit more similar to the way we had written things

we've worked with minimizing the negative log likelihood um so it it looks like this so we're taking the

negative log likelihood of this the sigmoid of the dot product um again negative log likelihood we're using the same negated dot product

through the sigmoid and then we're going to work out this quantity for a handful of random

words we take k negative samples um and how likely they are to sample a word depends on their probability and where this loss function is going to be

minimized given this negation by making these dot products large and these dot products um small or negative so

there's just then one other trick that they use actually there's more than one other trick that's used in the word2vec paper to get it to perform well but i'll

only mention one of their other tricks here um when they sample the words they don't simply just sample the words based

on their um probability of occurrence in the corpus or uniformly what they do is they start with what we call the unigram distribution of words so that is

how often words actually occur in our big corpus so if you have a billion word corpus and a particular word occurred 90 times in it you're taking 90 divided by

a billion and so that's the unigram probability of the word but what they then do is that they take that to the three-quarters power and the effect of that three-quarters power which is then

re-normalized to make a probability distribution with z kind of like we saw last time with the soft max by taking the three-quarters power

that has the effect of dampening the difference between common and rare words um so that less frequent words are sampled somewhat more often but still

not nearly as much as they would be if you just use something like a uniform distribution over the vocabulary okay so
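
A short sketch of that sampling distribution, assuming NumPy and invented word counts: raise the unigram probabilities to the three-quarters power, renormalize, and draw noise words from the result.

```python
import numpy as np

# Hypothetical corpus counts for four word types, from very common to rare.
counts = np.array([900_000, 9_000, 90, 10], dtype=float)

unigram = counts / counts.sum()           # plain unigram distribution
smoothed = unigram ** 0.75                # three-quarters power
smoothed /= smoothed.sum()                # renormalize so it sums to one

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)   # ids of 5 noise words
```

With these counts the rarest word's probability goes up and the most common word's goes down, but the rare word still stays far below the 0.25 it would get under a uniform distribution.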

that's basically um everything to say about the basics of how we have this very simple

neural network algorithm word2vec and how we can train it and learn word vectors so for the next bit what i want to do is

step back a bit and say well here's an algorithm that i've shown you that works great um what else could we have done and what

can we say about that um and the first thing that you might think about is well here's this funny iterative algorithm to give you

word vectors um you know if we have a lot of words in a corpus it seems like a more obvious thing that we could do

is just look at the counts of how words occur with each other and build a matrix of counts a co-occurrence matrix so here's the idea

of a co-occurrence matrix so i've got a teeny little corpus i like deep learning i like nlp i enjoy flying

and i can define a window size i made my window simply size one to make it easy to fill in my matrix symmetric just like our word2vec

algorithm and so then the counts in these cells are simply how often things co-occur in the window of size one so i like

occurs twice so we get twos in these cells because it's symmetric deep learning occurs once so we get

one here and lots of other things occur zero so we can build up a co-occurrence matrix like this and well these actually give us a representation

of words as co-occurrence vectors so i can take the word i with either a row or a column vector since it's symmetric and say okay my

representation of the word i is this row vector and that is a representation of the word i and i think you can maybe convince

yourself that to the extent that words have similar meaning and usage you'd sort of expect them to have somewhat similar vectors right so if i had the

word you as well on a larger corpus you might expect i and you to have similar vectors because i like you like i enjoy you enjoy um you'd see the same kinds

of possibilities hey chris could you keep looking to answer some questions sure all right so we got some questions from uh sort of the negative sampling slides um

in particular um what's like can you give some intuition for negative sampling what is the negative sampling doing and why do we uh only take one positive example those are two questions that could be answered in

tandem okay um that's a good question okay i'll try and give more intuition so the idea is to work out something like what the softmax

did in a much more efficient way um so in the soft max well

you wanted to give high probability to the in predicting the context a context word that actually did appear with the center

word um and well the way you do that is by having the dot product between those two words be as big as possible and

part of how but you know you're going to be sort of it's more than that because in the denominator you're also working out the dot product with every other word in the vocabulary so as well as

wanting the dot product with the actual word that you see in the context to be big you maximize your likelihood by

making the dot products of other words that weren't in the context smaller because that's shrinking your denominator and therefore um you've got a bigger

number coming out and you're maximizing the likelihood so even for the softmax the general thing that you want to do to maximize it is have dot product with words actually in the

context big dot product with words not in the context be small to the extent possible and obviously you have to average this as best you can over all

kinds of different contexts because sometimes different words appear in different contexts obviously so um so

the negative sampling is a way of therefore trying to maximize the same objective now you know for you only

you only have one positive term because you're actually wanting to use the actual data um so you're not wanting to invent data so for working out the entire j we do

do work this quantity out for every center word and every context word so you know we are iterating over the different words in the context window and then

we're moving through positions in the corpus so we're doing different v c's so you know gradually we do this but for one particular center word and one particular context word we only have one

real piece of data that's positive so that's all we use because we don't know what other words should be counted as positive words now

for the negative words you could just sample one negative word and that would probably work but if you want a sort of a slightly

better more stable sense of okay we'd like to in general have other words have low probability it seems like you might be able to get better more stable results

if you instead say let's have 10 or 15 sample negative words and indeed that's been found to be true but and for the negative words well it's

easy to sample any number of random words you want and at that point it's kind of a probabilistic argument the words that you're sampling might not be actually bad words to

appear in the context they might actually be other words that are in the context but 99.9 percent of the time they will be unlikely words to occur in the context

and so they're good ones to use and yes you only sample 10 or 15 of them but that's enough to make progress because the center word is going to turn up on

other occasions and when it does you'll sample different words over here so that you gradually sample different parts of the space and start to learn

we had this co-occurrence matrix and it gives a representation of words as co-occurrence vectors
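
A minimal sketch of building that window-based co-occurrence matrix for the three-sentence toy corpus, assuming NumPy and a window of size one. Sentence-final punctuation is left out here as a simplification, and the variable names are illustrative.

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1                                   # one word to either side

vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)), dtype=int)

for sentence in corpus:
    words = sentence.split()
    for pos, word in enumerate(words):
        lo = max(0, pos - window)
        hi = min(len(words), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:               # skip the center word itself
                counts[index[word], index[words[ctx_pos]]] += 1

i_vector = counts[index["i"]]                # row vector representing the word "i"
```

The row `counts[index["i"]]` holds a 2 in the column for "like" and a 1 in the column for "enjoy", and the matrix is symmetric because the window looks both left and right.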

and just one more note on that i mean there are actually two ways that people have commonly made these co-occurrence matrices one corresponds to what we've seen already that you use a window around a

word which is similar to word to vec and that allows you to capture some locality and some of the sort of syntactic and semantic proximity that's

more fine-grained the other way these co-occurrence matrices have often been made is that normally documents have some structure whether it's paragraphs or um just

actual web pages sort of sized documents so you can just make your window size a paragraph or a whole web page and count co-occurrence in those and this is

the kind of method that's often been used in information retrieval in methods like latent semantic analysis okay so the question then

is are these kind of count word vectors good things to use well people have used them they're not terrible

but they have certain problems the kind of problems that they have uh well firstly they're huge though very sparse so this is back where i said

before if we had a vocabulary of half a million words well then we have a half a million dimensional vector for each word

which is much much bigger than the word vectors that we typically use um and it also means that because we have these very high dimensional

vectors um that we have a lot of sparsity and a lot of randomness so the results that you get tend to be noisier and less robust depending on what

particular stuff was in the corpus and so in general people have found that you can get much better results by working with low dimensional vectors so

then the idea is we can store the most of the important information about the distribution of words in the context of other words in a fixed small number of

dimensions giving a dense vector and in practice the dimensionality of the vectors that are used are normally somewhere between 25 and a thousand

and so at that point we need to use two we need to use some way to reduce the dimensionality of our count co occurrence vectors so

if you have a good memory from a linear algebra class you hopefully saw singular value decomposition and

it has various mathematical properties um that i'm not going to talk about here of single singular value projection giving you an optimal way under a

certain definition of optimality of producing a reduced dimensionality matrix that maximally or sorry pair of matrices that maximally

well lets you recover the original matrix but the idea of the singular value decomposition is you can take any matrix such as our

count matrix and you can decompose that into three matrices u a diagonal matrix sigma and a v

transpose matrix um and this works for any shape now in these matrices some parts of it

are never used because since this matrix is rectangular there's nothing over here and so this part of the v transpose matrix gets ignored but if

you're wanting to get smaller dimensional representations what you do is take advantage of the fact that the singular values inside the

diagonal sigma matrix are ordered from largest down to smallest so what we can do is just delete out more of the matrix

delete out some singular values which effectively means that in this product some of u and some of v is also not used and so

then as a result of that we're getting lower dimensional representations um for our words if we're wanting to have word vectors

which still do as good as possible a job within the given dimensionality of enabling you to recover the original

co-occurrence matrix so from a linear algebra background um this is the obvious thing to use so how does that work
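
As a rough sketch of the truncation just described, assuming NumPy and a random symmetric matrix standing in for real co-occurrence counts: keep only the k largest singular values and use the matching columns of U as low-dimensional word vectors. Scaling the columns by the singular values is one common choice and an assumption here, not something stated in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(7, 7)).astype(float)
counts = counts + counts.T                 # symmetric toy "co-occurrence" matrix

# Singular values in S come back ordered from largest to smallest.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2                                      # target dimensionality
word_vectors = U[:, :k] * S[:k]            # one k-dimensional row per word
approx = word_vectors @ Vt[:k, :]          # best rank-k reconstruction of counts

error = np.linalg.norm(counts - approx)    # Frobenius norm of what was discarded
```

The reconstruction error equals the square root of the sum of the squared singular values that were dropped, which is the sense in which the truncation is optimal.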

um well if you just build a raw count co-occurrence matrix and run svd on that and try and use

those as word vectors it actually works poorly and it works poorly because if you get into the mathematical assumptions of svd you're expecting to

have these normally distributed errors and what you're getting with word counts looks not at all

like something normally distributed because you have exceedingly common words like a the and and and you have a very large number of rare words so that

doesn't work very well but you actually get something that works a lot better if you scale the counts in the cells so to deal with this problem of extremely

frequent words there are some things we can do we could just take the log of the raw counts we could kind of cap the maximum count

we could throw away the function words and any of these kind of ideas let you build then have a co-occurrence matrix that you get more useful word vectors

from running something like svd and indeed these kind of models were explored um in the 1990s and in the

2000s and in particular um doug rohde explored a number of these ideas as to how to improve the co-occurrence matrix in a model that he built that was called

coals and you know actually in his coals model he observed the fact that you could get

the same kind of linear components that have semantic components that we saw yesterday when talking about

analogies so for example this is a figure from his paper and you can see that we seem to have a meaning component

going from a verb to the person who does the verb so drive to driver swim to swimmer teach to teacher marry to priest and that these

vector components are not perfectly but are roughly parallel and roughly the same size and so we have a meaning component there

that we could add on to another word just like we did for previously for analogies we could say drive is to driver as marry is to what and we'd add on this

green vector component which is roughly the same as this one and we'd say oh priest so that this space could actually get some word vector

analogies right as well and so that seemed really interesting to us around the time word2vec came out of wanting to understand better what the iterative

updating algorithm of word2vec did and how it related to these more linear algebra based methods that have been explored in the couple of decades previously and

so for the next bit i want to tell you a little bit about the glove algorithm which was an algorithm for word vectors that was made by jeffrey pennington richard socher and me

in 2014 and so the starting point of this was to try to connect together the linear algebra based methods on

co-occurrence matrices like lsa and coals with the models like skip-gram cbow and their other friends which were iterative neural updating algorithms so

on the one hand you know the linear algebra methods actually seemed like they had advantages for fast training and efficient usage of statistics but

although there had been work on capturing word similarities with them by and large the results weren't as good perhaps because of disproportionate importance

given to large counts in the main conversely um the models um the the neural models it seems like if you're just doing these gradient

updates on windows you're somehow inefficiently using statistics versus a co-occurrence matrix but on the other hand it's actually easier to scale to a very

large corpus by trading time for space and but at that time it seemed like the neural methods just worked better for people

that they generated improved performance on many tasks not just on word similarity and that they could capture complex patterns such as the analogies

that went beyond word similarity and so what we wanted to do was understand a bit more as to what do you what properties do you need to have this

analogies work out as i showed last time and so what we realized was that if you'd like to do have these sort of vector subtractions

um and additions work for an analogy the property that you um want is for meaning components so a

meaning component is something like going from male to female king to queen or going from

a verb to its agent truck to driver um that those meaning components should be represented as ratios of co-occurrence probabilities
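
To make "ratios of co-occurrence probabilities" concrete, here is a small sketch assuming NumPy. The counts are invented for illustration and are not measured from any corpus; "ice" and "steam" are just the two target words being compared.

```python
import numpy as np

context_words = ["solid", "gas", "water", "random"]

# Hypothetical counts of each context word appearing near "ice" and near "steam".
counts_with_ice = np.array([80.0, 8.0, 300.0, 2.0])
counts_with_steam = np.array([8.0, 80.0, 300.0, 2.0])
total_ice, total_steam = 10_000.0, 10_000.0   # hypothetical totals for each target

p_given_ice = counts_with_ice / total_ice
p_given_steam = counts_with_steam / total_steam

ratio = p_given_ice / p_given_steam       # large for solid, small for gas, about 1 otherwise
log_ratio = np.log(ratio)                 # positive, negative, and about 0 respectively
```

Neither raw probability separates "solid" from "water", since both co-occur with ice, but the ratio does: words related to both targets or to neither cancel out to about one.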

so here's an example that shows that okay so suppose the meaning component that we want to get out is the spectrum from solid to gas as in

physics well you'd think that you can get at the solid part of it perhaps by saying does the word co-occur

with ice and the word solid occurs with ice so that looks hopeful and gas doesn't occur with ice much so that looks hopeful but the problem is the

word water will also occur a lot with ice and if you just take some other random word like the word random it probably doesn't occur with ice much

in contrast if you look at words co-occurring with steam solid won't occur with steam much but gas will but water will again and random will be

small so to get out the meaning component we want of going from gas to solid what's actually really useful is to look at the ratio of these

co-occurrence probabilities because then we get a spectrum of large to small between solid and gas

whereas for water and a random word it basically cancels out and gives you one um i just wrote these numbers in but if you

count them up in a large corpus it is basically what you get so here are actual co-occurrence probabilities and that for water and my random word which was

fashion here these are approximately one um whereas for the ratio of probability of co-occurrence of solid

with ice or steam is about ten and for gas it's about a tenth so how can we capture these ratios of co-occurrence

probabilities as linear meaning components so that in our word vector space we can just add and subtract linear meaning components well

it seems like the way we can achieve that is if we build a log-bilinear model so that the dot product between two word vectors

attempts to approximate the log of the probability of co-occurrence so if you do that you then get this property that the

difference between two vectors its similarity to another word corresponds to the log of the probability ratio shown on the previous

slide so the glove model wanted to try and um unify the thinking between the

co-occurrence matrix models and the neural models by being in some way similar to a neural model but actually calculated on top of a

co-occurrence matrix count so we had an explicit loss function and our explicit loss function is that we wanted the dot product to be

similar to the log of the co-occurrence we actually added in some bias terms here but i'll ignore those for the moment and we wanted to not have very

common words dominate and so we capped the effect of high word counts using this f function that's shown here and then we could

optimize this j function directly on the co-occurrence count matrix so that gave us fast training scalable to huge corpora
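The relations described above can be written out as follows. This is a sketch in the notation of the GloVe paper, which the transcript does not reproduce, so the symbols are assumptions: X_ij is the co-occurrence count of words i and j, w and w-tilde are the two sets of word vectors, b and b-tilde are the bias terms, and f is the weighting function that caps the effect of very large counts.

```latex
% log-bilinear model: dot product approximates log co-occurrence probability
w_i \cdot w_j = \log P(i \mid j)

% so a vector difference encodes the log of a ratio of co-occurrence probabilities
w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}

% the objective, optimized directly on the co-occurrence count matrix
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

With a = ice and b = steam, the second line gives a large positive value for x = solid, a large negative value for x = gas, and a value near zero for x = water or an unrelated word, which is the pattern described for the ratios earlier.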

um and so this algorithm worked very well so if you run this algorithm and ask what are the nearest words to frog you get frogs

toad and then you get some complicated words but it turns out they are all frogs um until you get down to lizards so litoria is that lovely tree frog there um

and so this actually seemed to work out pretty well how well did it work out um to discuss that a bit more i now want to say something about how do we evaluate word

vectors are we good for up to there for questions we've got some questions uh what do you mean by an inefficient use of statistics as a con for skip gram

well what i mean is that you know for word2vec you're just you know

looking at one center word at a time and generating a few negative samples and so it sort of seems like doing something imprecise there

whereas if you're doing uh optimization algorithm on the whole matrix at once well you actually know everything about the matrix at once

you're not just looking at what words what other words occurred in this one context of the center word you've got the entire vector of co-occurrence

counts for the center word and another word and so therefore you can much more efficiently and less noisily

work out how to minimize your loss okay i'll go on okay so i've sort of said look at these word

vectors they're great and i sort of showed you a few things at the end of the last class which argued hey these are great um you know they work out

these analogies um they show similarity and things like this um we want to make this a bit more precise and indeed for natural language processing as in other areas of machine

learning a big part of what people are doing is working out good ways to evaluate knowledge that things have so how can we really evaluate word

vectors so in general for nlp evaluation people talk about two ways of evaluation intrinsic and extrinsic so an intrinsic

evaluation means that you evaluate directly on the specific or intermediate subtasks that you've been working on so i want a

measure where i can directly score how good my word vectors are and normally intrinsic evaluations are fast to compute they help you to understand

the component you've been working on but often simply trying to optimize that component may or may not have a very big good

effect on the overall system that you're trying to build um so people have also been very interested in extrinsic evaluations so

an extrinsic evaluation is that you take some real task of interest to human beings whether that's web search or machine translation or something like

that and you say your goal is to actually improve performance on that task well that's a real proof that this

is doing something useful so it in some ways it's just clearly better but on the other hand it also has some disadvantages

it takes a lot longer to evaluate on an extrinsic task because it's a much bigger system and sometimes you know when you change things

it's unclear whether the fact that the numbers went down was because you now have worse word vectors or whether it's just somehow the other components of the

system interacted better with your old word vectors and if you change the other components as well things would get better again so

in some ways it can sometimes be muddier to see if you're making progress but i'll touch on both of these methods here um so for intrinsic evaluation of

word vectors one way um which we mentioned last time was this word vector analogy so we could simply

give our models a big collection of word vector analogy problems so we could say man is to woman as king is to what and ask the model to find the word that is

closest using that sort of word analogy computation and hope that what comes out there is queen and so that's something people have done

and have worked out an accuracy score of how often that you are right at this point i should just mention one little trick of these word vector

analogies that everyone uses but not everyone talks about a lot in the first instance i mean there's a little trick which you can

find in the gensim code if you look at it that when it does man is to woman as king is to what

something that could often happen is that actually the word once you do your pluses and your minuses that the word that will actually be closest is

still king so the way people always do this is that they don't allow one of the three input words um in the selection process so

you're choosing the nearest word that isn't one of the input words um okay so um here is showing results from the

glove vectors um so the glove vectors have this strong linear component property just like i showed before um for um

coals so this is for the male female dimension and so because of this you'd expect in a lot of cases that word analogies would work because i

can take the vector difference of man and woman and then if i add that vector difference onto brother i expect to get to sister and king queen and for many of these

examples but of course they may not always work right because if i start from emperor it's sort of on a more of a lean and so it might turn out that i get

countess or duchess coming out instead you can do this for various different relations so different semantic relations so these sort of word vectors actually learn quite a bit of just world

knowledge um so here's the company ceo or this is the company ceo around 2010 to 2014 when the data was taken for the word vectors

and they as well as semantic things or pragmatic things like this they also learn syntactic things so here are vectors for positive comparative and

superlative forms of adjectives and you can see those also um move in roughly linear components um so um the word2vec people built a data set of

analogies so you could evaluate different models on the accuracy of their analogies and so here's how you can do this and this gives some numbers so there are semantic

and syntactic analogies i'll just look at the totals okay so what i said before is if you just use unscaled

co-occurrence counts and pass them through an svd things work terribly and you see that there you only get 7.3 but then as i also pointed out if you do some scaling you can actually

get svd of a scaled count matrix to work reasonably well so this svd-l is similar to the coals model and now

we're getting up to 60.1 which actually isn't a bad score right so you can actually do a decent job without a neural network um and then here are the

two variants of the um word2vec model and here are our results from the glove model and of course at the time

2014 we took this as absolute proof that our model was better and our more efficient use of statistics was really working in our favor um with seven years

of retrospect i think that's kind of not really true it turns out i think the main part of why we scored better is that we actually had better data and so

there's a bit of evidence about that on this next slide here so this looks at the semantic syntactic and overall

performance on word analogies of glove models that were trained on different subsets of data so in particular the two

on the left are trained on wikipedia and you can see that training on wikipedia makes you do really well on semantic analogies which maybe makes

sense because wikipedia just tells you a lot of semantic facts i mean that's kind of what encyclopedias do and so one of the big advantages we

actually had was that the glove model was partly trained on wikipedia as well as other texts whereas the word2vec model that was

released was trained exclusively on google news so newswire data and if you only train on a smallish amount of

newswire data you can see that for the semantics it's just not as good as even a one-quarter of the size amount of

wikipedia data though if you get a lot of data you can compensate for that so here on the right hand end you then have common crawl web data and so once

there's a lot of web data so now 42 billion words um you're then starting to get good scores again on the semantic side

um the graph on the right then shows how well do you do as you increase the vector dimension and so what you can see there is you know 25 dimensional vectors

aren't very good they go up to sort of 50 and then 100 and so 100 dimensional vectors already work reasonably well so that's why i used hundred dimensional vectors

when i showed my example in class that is the sweet spot of not too long to load and working reasonably well but you still get significant gains for 200 and it's

somewhat to 300 so at least back around so 2013 to 15 everyone sort of gravitated to the fact that 300 dimensional vectors is the sweet spot um

so most frequently if you look through the best known sets of word vectors that include the word2vec vectors and the glove vectors that usually what you get

is 300 dimensional word vectors um that's not the only intrinsic evaluation you can do another intrinsic evaluation you can do

is see how these models model human judgments of word similarity um so psychologists for

several decades have actually taken human judgments of word similarity where literally you're asking people for pairs of words like professor and doctor

to give them a similarity score that's sort of being measured as some continuous quantity giving you a score between say 0 and ten um and so there

are human judgments which are then averaged over multiple human judgments as to how similar different words are so tiger and cat is pretty similar um

computer and internet is pretty similar plane and car is less similar stock and cd aren't very similar at all but stock and jaguar are even less similar

so we could then say for our models do they have the same similarity judgments and in particular we can measure a correlation coefficient

of whether they give the same ordering of similarity judgments and so then we can get data for that and so there are various different data sets of word

similarities and we can score different models as to how well they do on similarities and again you see here that plain svd

works comparatively better here for similarities than it did for analogies you know it's not great but it's now not completely terrible because we no longer need that linear

property but again scaled svds work a lot better word2vec works a bit better than that and we got some of the same kind of

minor advantages from the glove model hey chris sorry to interrupt a lot of the students were asking if you could re-explain the objective function for the glove model and also what log

bilinear means okay uh sure okay here is

here is my um objective function but if i go one slide before that right so the property that

we want is that we want the dot product um to represent the log probability of co-occurrence

so um and that then gives me my tricky log-bilinear so the bi is that there's sort of the wi and the wj so that there

are sort of two linear things and it's linear in each one of them so this is sort of like having and

rather than having a sort of an ax where you just have something that's linear in x and a is a constant it's bilinear because we have the w i w j and it's

linear in both of them and that's then related to the log of a probability and so that gives us the log-bilinear model and so

since we'd like these things to be equal what we're doing here if you ignore these two center terms is that we're

wanting to say the difference between these two is as small as possible so we're taking this difference and we're squaring it so it's always positive and

we want that squared term to be as small as possible and you know that's 90 percent of it and you can basically stop there but the

other bit that's in here is a lot of the time when you're building models um rather than simply

having sort of an ax model it seems useful to have a bias term which can move things up and down for

the word in general and so we added into the model a bias term so that there's a bias term for both words so if in general probabilities are high for a

certain word this bias term can model that and for the other word this bias term can model it okay so now i'll pop back and after

um oh actually i just saw someone said why multiplying by the f of sorry i did skip that last term

um okay the why multiplying by this f of x i j so this last bit was to

scale things depending on the frequency of a word because you want to pay more attention to

words that are more common or word pairs that are more common because you know if you think about it um in word2vec terms you're seeing

if things have a co-occurrence count of 50 versus 3 you want to do a better job at modeling

the co-occurrence of the things that occurred together 50 times and so you want to consider in the count of co-occurrence

but then the argument is that that actually leads you astray when you have extremely common words like function words and so effectively you paid more attention

to words that co-occurred together up until a certain point and then the curve just went flat so it didn't matter if it was an extremely extremely common word
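A minimal Python sketch of the pieces just walked through: the dot product plus two bias terms, compared against the log count, squared, and scaled by f. The exact shape of f is not stated in the lecture; the cutoff `x_max = 100` and exponent `alpha = 0.75` are the values reported in the GloVe paper and are treated as assumptions here. The function names, the toy 3x3 count matrix, and the random vectors are hypothetical.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): rises as (x / x_max)**alpha, then stays flat at 1 once x >= x_max."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Sum over pairs of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)**2."""
    observed = X > 0                            # f(0) = 0, so unseen pairs drop out
    log_X = np.log(np.where(observed, X, 1.0))  # avoid log(0) for unseen pairs
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    return float(np.sum(glove_weight(X) * observed * (pred - log_X) ** 2))

# Toy symmetric co-occurrence counts for three words (made-up numbers).
X = np.array([[0.0, 3.0, 50.0],
              [3.0, 0.0, 5000.0],
              [50.0, 5000.0, 0.0]])

rng = np.random.default_rng(0)
dim = 4
W = rng.normal(scale=0.1, size=(3, dim))        # word vectors
W_tilde = rng.normal(scale=0.1, size=(3, dim))  # context vectors
b = np.zeros(3)
b_tilde = np.zeros(3)

weights = glove_weight([3, 50, 100, 5000])
loss = glove_loss(W, W_tilde, b, b_tilde, X)
print("f(3), f(50), f(100), f(5000) =", weights)
print("loss at random initialisation =", loss)
```

A count of 50 gets more weight than a count of 3, but 100 and 5000 get the same weight, which is the flat part of the curve that keeps function words from dominating.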

so then um for extrinsic word vector evaluation so at this point you're now wanting to sort of say well

can we embed our word vectors in some end user task and do they help and do different word vectors work

better or worse than other word vectors so this is something that we'll see a lot of later in the class i mean in particular when you get on to doing assignment

three in assignment three you get to build dependency parsers and you can then use word vectors in the dependency parser and see how much they help we

don't actually make you test out different sets of word vectors but you could um here's just one example of this to give you a sense so the task of named entity

recognition is going through a piece of text and identifying mentions of a person name or an organization name like

a company or a location and so if you have good word vectors um do they help you do named entity recognition

better and the answer to that is yes so if one starts off with a model that simply has discrete features so it uses word identity as features you can build

a pretty good named entity model doing that but if you add into it word vectors you get a better representation of the meaning of words and so that you can

have the numbers go up quite a bit and then you can compare different models to see how much gain they give you in terms of this extrinsic task

so skipping ahead this was a question that i was asked after class which was word senses because so far we've had just

one word sorry for one particular string we've got some string house and we're going to say for each of those strings there's a

word vector and if you think about it a bit more that seems like it's very

weird because actually most words um especially common words and especially words that have existed for a long time actually have many meanings which are

very different so how could that be captured if you only have one word vector for the word because you can't actually capture the fact that you've got different meanings for the word

because your meaning for the word is just one point in space one vector and so as an example of that here's the word pike now

it actually is an old germanic word well what kind of meanings does the word pike have um so you can maybe just think for a minute

and think um what word meanings the word pike has and it actually turns out you know it has a lot of different meanings so

so perhaps the most basic meaning is um if you did fantasy games or something medieval weapons um a sharp pointed staff there's a pike um but there's a

kind of a fish that has a similar elongated shape that's a pike um it was used for railroad um lines maybe that usage isn't

used much anymore but it certainly still survives in referring to roads so this is like when you have turnpikes we have expressions where pike means the future

like coming down the pike it's a position in diving that divers do a pike those are all noun uses there are also verbal uses so you can pike

somebody with your pike you know different usages might have different currency in australia you can also use pike to mean that you pull out of doing

something like i reckon he's going to pike i don't think that usage is used in america but lots of meanings and actually for words that are commoner if you start thinking words like line or

field i mean they just have even more meanings than this so what are we actually doing with just one vector for a word and well

one way you could go is to say okay up until now what we've done is crazy pike has and other words have all of these different meanings so maybe what we should do

is have different word vectors for the different meanings of pike so we'd have one word vector for the medieval pointy

weapon another word vector for the kind of fish another word vector for the kind of road so they'd then be word sense vectors

and you can do that i mean actually we were working on that in the early 2010s actually even before word2vec came out so

this picture is a little bit small to see but what we were doing was for words we were clustering instances of a word hoping that those

clusters so clustering the word tokens hoping those clusters that were similar represented senses and then for the clusters of word tokens we were sort of

treating them like they were separate words and learning a word vector for each and you know basically that actually works so in green we have two

senses for the word bank and so there's one sense for the word bank that's over here where it's close to words like banking finance transaction and laundering and then we have another

sense for the word bank over here where it's close to words like plateau boundary gap territory which is the riverbank sense of the word bank um and

for the word jaguar that's in purple um well jaguar has a number of senses and so we have those as well so this sense down here is um close to hunter so that's the

sort of big game animal sense of um jaguar up the top here is being shown close to luxury and convertibles this is the jaguar car sense

um then jaguar here is near string um keyboard and words like that so jaguar is the name of a kind of keyboard um and

then this final jaguar over here is close to software and microsoft and then if you're old enough you'll remember that there was an old version of mac os so it's called jaguar um so that's then

the computer sense so basically this does work and we can learn word vectors for different senses of a word but actually this isn't the majority way

that things have then gone in practice and there are kind of a couple of reasons for that i mean one is just simplicity

if you do this it's kind of complex because you first of all have to learn word senses and then start learning word vectors in terms of the word senses

but the other reason is although this model of having word senses um is traditional it's what you see in dictionaries it's commonly what's being

used in natural language processing i mean it tends to be imperfect in its own way because we're trying to take all the uses of the word pike and sort of cut

them up into a few different senses where the differences are kind of overlapping and it's often not clear which ones to count as distinct so for example here right a

railroad line and a type of road well sort of that's the same sense of pike it's just that they're different forms of transportation and so you know that this could be you know a type of transportation line and cover both of

them so it's always sort of very unclear how you cut word meaning into different senses and indeed if you look at different dictionaries everyone does

it differently so um it actually turns out that in practice you can do

rather well by simply having one word vector per word type and what happens if you do that well what you find

is that what you learn as a word vector is what gets referred to in fancy talk as a

superposition of the word vectors for the different senses of a word um where the word superposition means no more or less

than a weighted sum so the vector that we learned for pike will be a weighted average of the vectors that you would have learned for the medieval

weapon sense plus the fish sense plus the road sense plus whatever other senses that you have where the weighting that's given to these different sense vectors

corresponds to the frequencies of use of the different senses so we end up with the word um the vector for pike um being a

kind of an average vector and so if you say okay you've just added up several

different vectors into an average you might think that that's kind of useless because you know you've lost the real meanings of the word and you've just got

some kind of funny average vector that's in between them but actually it turns out that if you use this average vector

in applications it tends to sort of self-disambiguate because if you say is the word pike similar to the word for

fish well part of this vector represents fish the fish sense of pike and so in those components it'll be kind of similar to

the fish vector and so yes you'll say there's substantial similarity whereas if in another um

piece of text that says you know the men were armed with pikes and lances or pikes and maces or whatever other medieval weapons you remember well actually

some of that meaning is in the pike vector as well and so it'll say yeah there's good similarity with mace and staff and words like that as

well and in fact we can work out which sense of pike is intended by just sort of seeing which components are similar to other words

that are used in the same context and indeed there's actually a much more surprising result than that and this is a result that's um due to sanjeev arora

tengyu ma who is now on our stanford faculty and others in 2018 and that's the following result which i'm not actually going to explain but um

so if you think that the vector for pike is just a sum of the vectors for the different senses well it should be

you'd think that it's just completely impossible to reconstruct the sense vectors from

the vector for um the word type because normally if i say i've got two numbers the sum of them is 17 you just have no information as to what my two numbers

are right you can't resolve it um and even worse if i tell you i've got three numbers and they sum to 17 but it turns out that when we have these

high dimensional vector spaces that things are so sparse in those high dimensional vector spaces that you can

use ideas from sparse coding to actually separate out the different senses providing they're relatively common so they show in their paper that you can

start with the vector of say pike and actually separate out components of that vector that correspond to different senses of the word pike and so here's

an example at the bottom of this slide which is for the word tie it's separated out that vector into five different senses and so there's one

sense it's close to the words trousers blouse waistcoat so this is the sort of clothing sense of tie another sense is close to wires cables wiring

electrical so that's the sort of the tie sense of a tie used in electrical stuff and then we have sort of scoreline goals equalizer so this is

the sporting game sense of tie this one also seems to in a different way evoke the sporting game sense of tie and then there's finally this one here

maybe my music is just really bad maybe it's because you get ties in music when you tie notes together i guess so you get these different senses out of it
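The weighted-sum picture and the self-disambiguation effect can be illustrated with a small Python sketch. This is not the sparse coding method from the 2018 paper, and it does not use trained embeddings: the sense vectors are random unit vectors standing in for learned ones, and the sense frequencies 0.5, 0.3, and 0.2 are made up for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim = 300  # the commonly used word vector size mentioned in the lecture

# Hypothetical sense vectors; random directions in 300 dimensions are
# nearly orthogonal to each other, which is what makes the effect visible.
senses = {name: unit(rng.normal(size=dim)) for name in ("weapon", "fish", "road")}
freqs = {"weapon": 0.5, "fish": 0.3, "road": 0.2}  # made-up usage frequencies

# One vector for the word type: the frequency-weighted sum of its senses.
pike = sum(freqs[name] * senses[name] for name in senses)

# A random direction standing in for a word unrelated to any sense of pike.
unrelated = unit(rng.normal(size=dim))

sims = {name: cosine(pike, vec) for name, vec in senses.items()}
sims["unrelated"] = cosine(pike, unrelated)
for name, value in sims.items():
    print(f"cosine(pike, {name}) = {value:.2f}")
```

The single pike vector keeps a clearly positive similarity to each of its sense vectors, larger for the more frequent senses, while its similarity to the unrelated direction stays near zero.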
