Unlocking Personalization: A Deep Dive into Modern Recommendation Algorithms | Sarang Gupta
By Data Science Conference
Summary
## Key takeaways
- **Long Tail Enables Niche Discovery**: Online platforms have unlimited inventory allowing niche products unlike limited shelf space in brick-and-mortar stores, and recommendation systems make these long tail items discoverable helping users find obscure content and generating significant revenue for businesses. [07:41], [08:35]
- **Jam Experiment: Paradox of Choice**: In the tasting booth experiment, 60% of shoppers stopped at the booth with 24 jam varieties but only 3% purchased, while 40% stopped at the 6-variety booth but 30% purchased, showing too many choices paralyze decision making. [10:32], [11:12]
- **Explicit vs Implicit Feedback**: Explicit data is direct user feedback like ratings and reviews which is high quality but hard to gather, while implicit data from behaviors like clicks and viewing time is abundant but noisier requiring more cleaning. [19:04], [20:09]
- **Bayesian Average Fixes Popularity Bias**: Simply ranking by average rating favors items with few perfect scores like 5.0 from 10 reviews, but Bayesian average uses a prior from global mean and minimum ratings count to pull low-review items toward the dataset average. [25:04], [26:19]
- **Collaborative Filtering Learns Embeddings**: Collaborative filtering decomposes the sparse user-item rating matrix via matrix factorization into user and item latent feature matrices, automatically learning embeddings without hand-engineered features to capture complex patterns. [33:45], [34:59]
- **Two-Tower Nets Combine Approaches**: Two-tower neural networks have a user tower and item tower where neural nets process user and item features to generate embeddings, then dot product gives interaction probability combining content and collaborative strengths. [35:42], [37:34]
Topics Covered
- Long Tail Unlocks Niche Revenue
- Too Many Choices Paralyze Buyers
- Bayesian Average Fixes Sparse Ratings
- Two-Tower Nets Fuse All Signals
Full Transcript
Great. So, I hope you are all able to see my screen. Welcome to our deep dive into modern recommendation systems. The title of
this tutorial is Unlocking Personalization, and in it we'll explore how AI systems help match users with relevant content,
and we'll do a deep dive into some of the modern recommendation algorithms. We'll cover both the theoretical foundations and the practical implementation of some of these
algorithms, and I hope that by the end of the tutorial you have a better idea of how recommendation systems
work and, if you want to build one, that you can do so for your company or as part of a side project. I think I'm sharing the wrong screen; let me
share the right one. Okay, there we go. My name is Sarang, and I am a lead data scientist at Asana, a collaborative work management tool.
I am based in Vancouver, and today I'll be your tutor. This is the table of contents.
The presentation is split into two parts. There is the motivation section, where we'll talk about a couple of hypotheses and a couple of studies that relate to
recommendation systems, and in particular why recommendation systems are important and why we are even here. Then the second part is the tutorial,
where I'll cover a number of different methodologies. We'll start with the input data, looking at the different types of input data that go into a recommendation
system. We'll then talk about heuristics: how you can use business logic to power really simple recommendations. We'll then talk about three methodologies: content-based filtering,
collaborative filtering, and hybrid recommendation. The tutorial is structured so that we start with basic recommendations and gradually move
on to something more and more complicated. Hybrid recommendations use deep learning and neural nets, so I hope you're still with me at that point in the
tutorial. Great, so let's talk about the motivation for this tutorial. Recommendation systems are ubiquitous; they are everywhere in
our daily lives. Some key examples are listed here. If you go on Amazon, or any other e-commerce website, to shop, you will see a
page similar to this one that shows your recommended items, or, if you purchase something, items similar to what you've already purchased. If you watch TV shows and
movies, this is a screenshot from Netflix, which recommends your top picks based on your past history. It also recommends things like "because you watched
something": things that you might like. I use LinkedIn for jobs and career search: if you're looking for
jobs, it suggests your top job picks based on your profile and on what you've interacted with in the past. Spotify has a Made For You
section, which curates a daily playlist based on your listening history and profiles your taste in music, and so
on. Airbnb shows you recommended vacation rentals based on your past stays and preferences. And finally, if you are into online dating, even Tinder has a
recommendation system, which matches you to relevant profiles based on things you mention in your profile or your past swipe history. Each platform uses a
different approach to personalize content; we'll go into some of these approaches in today's tutorial. In general, recommendation systems are a very
important part of our daily lives: they shape how we discover new content and make choices. So I think it's very important to understand how they work, especially if you are a data scientist
or machine learning engineer. Great, so what is a recommendation system? The definition is simple: it's a type of information
filtering system that is designed to predict or suggest items or content that a user is likely to be interested in, generally prefers, or is likely
to interact with on a given platform. The whole recommendation system can be broken down into three important components. There's a
component for the input data, or the user preferences. It can be explicit or implicit; we'll talk a bit more about this in the tutorial,
but explicit data is when a user gives direct input like ratings, reviews, and likes, and implicit data is when we use the user's observed behavior, like
clicks, viewing time, watch history, and so on. That brings us back to the question: why do we need
recommendation systems? Because you are all here, I guess you already realize the importance of recommendation systems in our daily lives, in terms of
how we make purchases and decisions. But I want to bring this idea forward through two interesting hypotheses and studies
and how they relate to recommendation systems. The first one is something called the long tail. So
essentially the way that we shop has changed over the past decade or two. There used to be traditional brick-and-mortar
stores, where you would go to purchase DVDs, books, and so on. They could hold only limited inventory because the shelf space was
limited, and they could only hold mainstream products: only the products that are
really popular and profitable. Products that cater to a specific interest or a specific audience tended not to be
included on the shelf, because they were not very profitable, they catered to a small audience, and, again, there was limited shelf space.
We've moved from that brick-and-mortar concept to online distribution. With online platforms there's technically
unlimited inventory: there's no limit on how many movies or books a website can hold. This gives the
website the ability to offer niche products to
customers. And recommendation systems help users navigate this vast array of choices, because the inventory has become
unlimited and there are niche products that a company can hold. Great, so
let's talk about the long tail concept. If you look at this chart, which plots items
by popularity, there is the head: items that are high impact and popular. They are fewer in number, but they're mainstream.
These are the things that traditional brick-and-mortar stores were likely to hold. Then there are the long tail
items, which are low impact, niche, many in number, and obscure. Given that we have now transitioned from a brick-and-mortar way of distributing things to
an online one, recommendation systems make these long tail items discoverable, and they help users find things that they wouldn't discover through traditional browsing.
If we only showed really popular items, users would not discover these niche items, which for a business can generate a significant amount
of revenue. This creates opportunities for both consumers and content creators: consumers are able to discover things that they like, and content creators are able
to make niche products targeted at a particular segment of customers. Great, so that was the first
study. The second study, which is really interesting and relates to recommendation systems, is the tasting booth
experiment. It was conducted by two Stanford researchers, Iyengar and Lepper, who set out to investigate
the effect of choice on customer behavior. The experiment took place in an upscale grocery
store near the Stanford University campus in California. They set up two booths, booth one and booth two.
Booth one contained 24 varieties of jam and booth two contained six varieties. They wanted to test the impact of the
number of choices on customer behavior: how does the number of choices a customer has available affect their
behavior? They tried to control all the other factors, for example by placing the booths strategically so that they had similar foot traffic. The only thing that differed between the booths was
the number of samples they had. The findings of the study were really interesting. About 60% of
shoppers stopped at booth one, whereas at booth two, which held fewer samples, only 40% of
shoppers stopped. So there was a lot more foot traffic at booth one. But if you drill a bit deeper
and look at the subsequent purchases, at booth one only 3% of
those 60% of consumers made a purchase, while at booth two, even though it had much less foot traffic, 30% of those
consumers made a purchase. The key insight from this was that too many choices can paralyze decision making among customers, and
this is well depicted by this chart on the left, which plots customer happiness against the number of choices. Customer
satisfaction increases up to a certain point: as long as only a certain number of choices are given to customers, they are satisfied and happy. But if you overwhelm them with
too many choices, customer satisfaction and happiness decrease, and it is stressful for the customer. This is where recommendation systems come
into play. Given that with online distribution there are a lot more choices, recommendation systems can improve customer satisfaction and increase conversion rates by simplifying
decision making. Recommending only a limited, curated set of products to the user helps eliminate this paradox of choice, hopefully steering customers away from the region of
too many choices, where customer satisfaction is really low. Great, so we've talked about
why recommendation systems are important from the perspective of the long tail theory and the tasting booth experiment. Before we
jump into some of the theory for our tutorial, I want to mention a brief history of recommendation systems. The history of
recommendation systems traces back to the late 1970s, and they have evolved significantly over the decades into a key technology that is now used across a variety
of industries and sectors. The first known recommendation system was developed in the
late 1970s by a scientist called Elaine Rich at UT Austin, the University of Texas at
Austin. She developed a system called Grundy, a computer-based librarian designed to recommend books to
students based on their preferences. Grundy was a really simple recommendation system: it worked by asking users questions
and classifying them into certain stereotypes based on their responses. The system then recommended books that matched the users' preferences; based on
what they had filled in on a survey, it classified them into certain categories and gave the same recommendations to all the users within a
category. The early 1990s was when the collaborative filtering approach became really popular. This approach was developed by
Xerox, and it is still widely used today. Xerox developed it for one of the products they were launching,
called Tapestry, which was a document management system. Tapestry introduced collaborative filtering and allowed users to manually rate documents and share their opinions with
others on things they received in their inbox. Tapestry laid the foundations for future automated recommendation
algorithms: it brought collaborative filtering into the picture, which is still a very widely used recommendation system
algorithm. In the late 1990s, Amazon started using collaborative filtering for product recommendations. Their system
analyzed user behavior and preferences to suggest items that other similar users had purchased, and it was hugely successful for Amazon. Their
success in increasing sales through personalized recommendations spurred widespread adoption of recommendation
systems, particularly in e-commerce. In the 2000s, Netflix started using recommendation systems.
Netflix is one of the pioneers of recommendation systems; you often hear about a lot of cool technology that Netflix ships in terms of
the way they recommend movies to users. They launched a famous contest called the Netflix Prize in 2006:
whichever team or research group was able to improve the recommendation algorithm by 10% would receive a
prize of $1 million, which was a great deal back then. The winning solution pioneered hybrid recommendation: it used an ensemble of 107 different
recommendation algorithms and demonstrated how multiple approaches, when blended together, can
enhance recommendation accuracy. And now we are here, in 2010 and beyond. Recommendation systems have
become a ubiquitous part of our daily lives. We see them on YouTube, Spotify, Facebook, and a lot of other online platforms that we interact with. There have been a lot of advances in
recommendation systems, particularly with the advent of deep learning becoming mainstream, and they have become an integral part of our lives. So this is a brief history
of how recommendation systems evolved over the years and how they've become an integral part of how we consume content online and even
offline. So let's jump into the tutorial now; enough theory. In this tutorial we'll cover four important aspects of
recommendation systems. We'll start with the heuristic-based approach, which uses predefined rules and business logic to make straightforward recommendations. Then we'll jump into
content-based recommendation, which recommends items similar to what users have previously liked, by analyzing the features of items
and recommending those that are similar to the items the user has interacted with. We'll then go into collaborative filtering, which is a very widely used and
very popular approach that suggests items based on the preferences of users with similar tastes and behavior. And lastly we'll go into
hybrid recommendation systems, which combine multiple recommendation approaches and leverage the strengths of each method. This is a deep learning approach, and it is
complicated, so hopefully you're still with me at that point. I have structured the tutorial in such a
way that we start with something simple, the heuristic-based approach, and gradually look at more and more complicated approaches, to get you warmed up as we look at
more sophisticated ways of recommending items to users. So let's talk about the input
data. I briefly mentioned this in one of the previous slides, but there are two main types of input data:
explicit data and implicit data. Explicit data is data where some sort of rating is given by the user, or,
more generally, any direct user feedback. Say Netflix is developing a recommendation system: it has thumbs up and thumbs down, users can rate
movies from 1 to 5, and they can give reviews. Similarly, on Amazon a user can rate a product from 1 to 5 and give reviews. This is direct user
feedback. It is really high-quality data, because the user has given their feedback or their preference for the
item. But this data is hard to come by: it is difficult to gather and might not always be available, as users might not spend time rating
items, or it might not even be applicable to the recommendation system that you're developing. The other type of data is implicit data.
This is the behavioral data that we gather from users' interactions with a given product or website. Users do not give any specific
ratings; instead we look at specific actions that a user performs on a given platform. Say we are building a recommendation system for Spotify: it looks at
how many times a user has played a song. If you're building a recommendation system for Netflix, it might look at
which movies you click on and navigate to. If you're building a recommendation system for Amazon or another e-commerce website, you
might look at which items a user clicks on or adds to their cart, and so on. Implicit data is a lot more abundant than explicit data, because you don't require the user to give
direct feedback; you are looking at their interaction with the website. But it is also a lot noisier than explicit data, and it might not
directly express a user's preference, so there is a lot more data cleaning and wrangling that someone needs to do in order
to use it in a recommendation system. Great, so we've talked about our input data. Let's talk about the different methodologies that we'll cover
in our tutorial. I'll start with some basic theory on each of them, which I think will help you understand the code better as
we go through it in the tutorial. Heuristic-based recommendations are the simplest form of recommendation. They use predefined
rules and business logic to make straightforward recommendations based on behavior and patterns. There are different heuristics that you
might use, and they depend on your business logic and your product. Some common heuristic-based recommendations
are recency-based or popularity-based. For example, popularity-based recommendation recommends the most popular items to users.
IMDb has its Top 250 movies list, which is also a very naive, or simplistic, recommendation system. It is basically
popularity-based: it ranks movies by how popular they are. Heuristic-based recommendation systems are really simple to set up.
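To illustrate how little such a heuristic needs, here is a minimal sketch of a popularity-based recommender in Python. It is not code from the talk: the `(user_id, item_id)` interaction format, the function name `most_popular`, and the sample log are all assumptions made for illustration.

```python
from collections import Counter


def most_popular(interactions, k=3):
    """Return the k item ids with the most interactions.

    `interactions` is a list of (user_id, item_id) pairs. This input
    format is an assumption made for this sketch.
    """
    counts = Counter(item for _, item in interactions)
    # Sort by count (descending), then by item id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical interaction log, e.g. views or purchases.
interactions = [
    ("u1", "shrek"), ("u2", "shrek"), ("u3", "shrek"),
    ("u1", "harry_potter"), ("u2", "harry_potter"),
    ("u3", "memento"),
]

top_items = most_popular(interactions, k=2)
print(top_items)
```

Every user gets the same list, which is why this kind of heuristic works for brand-new users but offers no personalization.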
So that's a big pro. They only require metrics like sales or views, which makes them really quick to deploy. They also do not suffer from the cold
start problem: if a new user joins your platform, you do not need to gather any data about them; you can directly recommend the most popular movies or products based on
other users' interactions. The cons are pretty obvious: the approach lacks personalization. It
does not reflect individual users' preferences; you're giving generic recommendations to everyone. This is a very simplistic way to recommend things, but it is still used across a number of
different products, specifically to solve the cold start problem when you don't have data for your user. I want to talk a bit more about popularity-based
recommendations, in terms of how the number of reviews is accounted for. Let's say you're developing a recommendation system for an online e-commerce site, and
you have three items that users have rated. There's item A, which has a rating of five stars but has received only 10
reviews. There's item B, which has a rating of 4.8 and has received 100 reviews. And then there's item C, which has the lowest rating among the three but
has received 1,000 reviews, a lot more than items A and B. The question is: can we simply rank these items by
rating? Should we rank item A above items B and C because it has a rating of 5.0, higher than B and C, even though it has only 10
reviews? The answer is no. Generally we want an item to have at least a certain number of ratings
before we put it at the top. There is a very popular approach for doing this, called the Bayesian average. The IMDb Top 250 list that
we saw in the previous slide does this. It is a Bayesian approach, so it uses a prior, C, which is essentially
the average rating that all your items receive from all your users across your whole data set.
Then you have the minimum number of ratings required, which you also use to set your prior. And finally you have the mean rating for the item that you're interested in
calculating the Bayesian average for, and the number of ratings for that item. What the Bayesian average formula
does is take a weighted average of your
prior and your item's rating, so the rating for your item gets pulled towards the average rating in your data set if there are very few ratings for it.
If, say, item A has very few ratings, its score gets pulled towards the mean of all the ratings in the data set.
So if we apply the Bayesian average, we'll see that item A is pushed to the bottom of the list in terms of
popularity recommendations, and items B and C are boosted. This makes a lot more sense, because we do not want to recommend an item that has
been rated just one time or 10 times: the user who created that content might have rated the item themselves, so a very small number of ratings is not
that reliable. So as you develop a popularity-based recommendation system, you will want to apply the Bayesian average technique. It is used very widely
across the industry, for example in the IMDb Top 250. Great, so the next type of recommendation system that we'll talk about is content-based filtering.
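The Bayesian average discussed above is commonly written as score = (v·R + m·C) / (v + m), where R is the item's mean rating, v its number of ratings, m the minimum number of ratings, and C the mean rating across the data set. The sketch below applies it to the three items from the example. The talk does not give item C's rating, the global mean, or m, so the values 4.5, 3.5, and 100 are assumptions chosen only to make the effect visible.

```python
def bayesian_average(item_mean, item_count, prior_mean, min_ratings):
    """Weighted average of the item's own mean rating and the global prior."""
    return (item_count * item_mean + min_ratings * prior_mean) / (
        item_count + min_ratings
    )


# name -> (mean rating, number of ratings).
# Item C's rating is not stated in the talk; 4.5 is an assumed value.
items = {"A": (5.0, 10), "B": (4.8, 100), "C": (4.5, 1000)}

PRIOR_MEAN = 3.5   # C: assumed mean rating across a larger catalog
MIN_RATINGS = 100  # m: assumed minimum number of ratings

scores = {
    name: bayesian_average(mean, count, PRIOR_MEAN, MIN_RATINGS)
    for name, (mean, count) in items.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)  # ['C', 'B', 'A']
for name in ranking:
    print(name, round(scores[name], 3))
```

With these assumed numbers, item A's ten ratings count for little against the prior, so its score drops to about 3.64 and it falls below B (4.15) and C (about 4.41). A different choice of m or C would move the scores but not the general pull towards the prior.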
Again, it's a step above the popularity-based recommendation system: it takes user preferences into account. It uses
item features to recommend items that
are similar to an item the user has liked. It looks particularly at explicit feedback, and it
looks at the features and characteristics of the items the user has already liked. Let's take the example of a movie recommendation
system. Say we have a number of different movies: Shrek, Harry Potter, The Dark Knight Rises, Memento, and Triplets. We
can place them on two axes. One axis is the genre, whether it's a children's movie or an adult movie, and the other axis is
whether it was a blockbuster or a mainstream movie. We can place the movies in this two-dimensional space because we know
what these movies are: we know their plots, their ratings, and so on. Then, say
a user has watched Shrek and given it a five-star rating. We would recommend Harry Potter to that user, because Harry Potter is similar to Shrek: it's a
children's movie, it was a blockbuster, and so on. So this is a simple content-based filtering system. It places items in an n-dimensional space based on
certain features, and if a user has a high interaction with one item, it recommends items that are similar, or nearest, to that particular
item. A major limitation of this approach is that feature representations need to be hand-engineered, so you need to know what features you want to encode.
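The nearest-item idea, and the hand-engineering it depends on, can be sketched in a few lines of Python. This is an illustration, not the tutorial's code: the two feature axes follow the slide, but the coordinates assigned to each movie and the function name `recommend_similar` are made up for the example.

```python
import math

# Hand-engineered 2-D features: (audience, scale).
#   audience: -1.0 = children's movie, +1.0 = adult movie
#   scale:    -1.0 = niche film,       +1.0 = blockbuster
# The coordinates are illustrative guesses, not values from the talk.
MOVIES = {
    "Shrek": (-1.0, 0.9),
    "Harry Potter": (-0.6, 1.0),
    "The Dark Knight Rises": (0.8, 1.0),
    "Memento": (0.9, -0.7),
    "Triplets": (-0.8, -0.9),
}


def recommend_similar(liked_title, k=1):
    """Return the k movies closest to `liked_title` in feature space."""
    target = MOVIES[liked_title]
    candidates = [title for title in MOVIES if title != liked_title]
    # Euclidean distance between feature vectors; smaller means more similar.
    candidates.sort(key=lambda title: math.dist(MOVIES[title], target))
    return candidates[:k]


print(recommend_similar("Shrek", k=1))
```

Every number in `MOVIES` had to be chosen by hand, which is exactly the limitation being described: the recommendations are only as good as the features someone decided to encode.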
For a given item in this movie recommendation system, for example, you
need to know whether the movie is a blockbuster and whether it's a children's or
an adult movie. Also, the system can only make recommendations based on the existing interests of the user, so it does not
include the component of serendipity. It is not able to expand a user's taste, because it just looks at the movies the user has watched and
recommends similar movies that are close to them on certain characteristics. So, again, it is a very simple
recommendation system: it looks at user preferences and recommends items that are similar to them. You will have seen this in a lot
of systems; it is widely used. For example, Amazon has a section that says people who've bought this have
also bought this, which is basically based on content-based filtering, where it analyzes the features and
characteristics of a product that a user interacts with. Great. The next approach, and the most popular approach, I guess I
would say, in today's recommendation systems, is collaborative filtering. It uses similarities between users and items
simultaneously to provide recommendations. The key component of collaborative filtering is a matrix
which looks like this. It's also called the user-item matrix. The input
to this approach is this matrix, where we have users on the x-axis and items, or movies if you're developing a movie recommendation system, on the
y-axis. The values in the matrix are the interactions or preferences of a user with a given item. Here, if you're developing a
movie recommendation system based on explicit feedback, the values in the matrix are the stars that a given user gives to a given movie; say user
1 rated these three movies. This matrix is very sparse, because generally you have a lot of movies or items in your catalog
and a lot of users, but users tend to interact with a very limited set of items. When I say sparse, I mean that a
lot of values in the matrix are empty. What collaborative filtering does is look at the interactions of a given user and
try to recommend items to that user based on the interests of another user that the model thinks is
similar. Let's look at user 1 and user n. We see that user 1 and user n have given a very high rating to
Shrek and a very low rating to Memento. So the model figures out that user n and user 1 are similar to
each other, and because user 1 has given a very high rating to Harry Potter, it would recommend Harry Potter to user n,
because it thinks that user n and user 1 are similar to each other based on how they've interacted with different products. So one of the big advantages of
collaborative filtering is that embeddings can be learned automatically. Compared to content-based approaches, you do not need to rely on hand-engineered features, so you do not
need to know particular features of your users and movies. You do not need to know whether Shrek and Harry Potter are children's or adult movies, or whether they are
blockbuster or arthouse movies. The collaborative filtering approach automatically figures those latent features out based on
users' interactions with the movies, and it can capture complex patterns and relationships by learning these features automatically from how users interact with
different movies. So how does collaborative filtering work? It uses something called matrix
factorization which decomposes this user item Matrix um again like you know this is the user item Matrix that we looked at in the previous slide into two
matrices which are user cross latent feature and movie cross latent feature and there are multiple approaches to factorize a given Matrix there is the
SVD singular value decomposition there's non- negative matri Matrix factorization and so on uh we'll cover one of the approaches in our tutorial but there are
multiple approaches and it decomposes that Matrix into two Matrix that capture these hidden patterns or these latent features so basically again you are able
to figure figure out certain features or certain characteristics of users and your your items but you do not need to like you know hand engineer them or you
do not need to know these particular Dent features because what this approach does is it automatically filters these like these latent features out based on how users interact with uh with items in
your catalog. This matrix factorization is the key component of collaborative filtering. Once you have the user-by-latent-feature and movie-by-latent-feature matrices, you can do a bunch of things: you can calculate what movies are similar to a given movie, you can figure out what users are similar to a given user, and so on. So it's able to capture these hidden patterns without anyone specifying or hand-coding these specific features for a given movie or a given user. Really cool approach, really simple, and again we'll talk about this in the tutorial a bit
more. Cool, so the last approach that we'll talk about is hybrid recommenders, and in particular we'll talk about one type of hybrid recommender, which is called two-tower neural networks. It's one of the most popular forms of hybrid recommender, and what it does is combine the goodness of all the different approaches that we've talked about. One of the drawbacks of collaborative filtering is that we specify user-item engagement, but we are not able to specify item characteristics and user characteristics. Let's say we have really good user or item features that we want to input into our model. We can do that in content-based approaches, but we cannot do that in collaborative filtering. At the same time, we want to include users' interactions with the items, other similar users, and things like that. We can do that in collaborative filtering, but we cannot do that in content-based filtering. Hybrid recommenders, two-tower neural nets, help us get the best of both worlds. Essentially, in these hybrid recommenders there are two towers, as the
name suggests: there is the query tower, also called the user tower, and then there's the candidate tower, also called the item tower. Within these towers you can feed in your user features and item features. The towers are neural nets, and you can design the neural net architecture however you want. As we train the model, it eventually generates user embeddings and item embeddings, and we can then take a dot product of these embeddings to generate a similarity score, which essentially tells how likely a user is to interact with a given item, or basically a user preference. This is a supervised learning approach, and our source of truth is a given user's interaction with a given product, but at the same time we can also incorporate the user characteristics and the characteristics of our items through these neural networks, these separate towers. So again, a very powerful approach. It's also used very widely in recommendation systems that have a lot of user data or item data, item metadata or user metadata, that could be really useful in informing recommendations. Great, okay, so that's enough theory. Let's jump into a tutorial.
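As a recap of the two-tower idea before the tutorial, here is a minimal sketch of the scoring path only, assuming NumPy. Nothing in it is trained: the weights are random, and the feature sizes and helper names (`init_tower`, `tower`) are hypothetical, not taken from the talk's notebooks. Each tower maps its features to an embedding of the same size, and the dot product of a user embedding and an item embedding, passed through a sigmoid, is read as an interaction probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, emb_dim):
    """Randomly initialise a one-hidden-layer tower (untrained, illustrative only)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, emb_dim)),
        "b2": np.zeros(emb_dim),
    }

def tower(params, features):
    """Map a batch of feature vectors to embeddings: ReLU hidden layer, linear output."""
    hidden = np.maximum(0.0, features @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 5 user features, 7 item features, shared 4-d embedding space.
user_tower = init_tower(in_dim=5, hidden_dim=8, emb_dim=4)
item_tower = init_tower(in_dim=7, hidden_dim=8, emb_dim=4)

user_features = rng.normal(size=(3, 5))  # 3 users
item_features = rng.normal(size=(6, 7))  # 6 candidate items

user_emb = tower(user_tower, user_features)  # shape (3, 4)
item_emb = tower(item_tower, item_features)  # shape (6, 4)

# Dot product of every user embedding with every item embedding, squashed to (0, 1).
scores = sigmoid(user_emb @ item_emb.T)      # shape (3, 6)

# Candidate item indices for each user, highest score first.
ranking = np.argsort(-scores, axis=1)
```

In a real system the weights would be fitted against observed interactions, for example with a binary cross-entropy loss on interacted versus non-interacted pairs. If each tower were only an embedding lookup on the user or item ID, the same dot-product score would essentially be the matrix factorization model described earlier.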
In this tutorial we will work with the MovieLens data set and try out these different approaches on it. The MovieLens data set is basically the Titanic data set of recommendation systems. If you've played around with the Titanic data set, it is the standard data set that's used to teach machine learning classification algorithms, and similarly MovieLens is the very standard data set for recommendation systems. MovieLens was developed for research purposes, it's non-commercial, it was developed at the University of Minnesota, and there are different versions of it based on the number of ratings, or number of rows, that are available. We'll look in particular at the MovieLens data set that has 1 million ratings. The data sets with 10 million or more ratings would require more compute, and 100K I felt was too small a data set for us to experiment with these different methodologies, particularly neural nets. So we'll be looking at this data set.
You can access this data set through the link mentioned here at the bottom, and if I click on this link it takes me to this page, which is the page for MovieLens. There are different data sets available, the types I mentioned on the previous slide, and in particular we'll be looking at this one, so if you want to download it, you can. This data set has about 1 million ratings from 6,000 users on 4,000 movies. It was released in 2003, so it's a bit outdated and the movies in it are quite old, but it's really good for learning and educational purposes. So if you'd like, just download this zip
file. For this tutorial, all the material is also available at this GitHub link. If you click on it, it will direct you to a public GitHub repository, which looks like this, and you can follow along with all the notebooks that we'll go through right now. The data set is in there too, so if you don't want to download it directly from the MovieLens website, you can download it from this GitHub page. I recommend that you download the CSV files. The original files contain raw data that's not processed yet, so I processed that data, and you can find it in the CSV files. Great, so that brings us to the start of
our tutorial, and before I jump into it I want to see if there are any questions so far. Great, okay, so: can you share the link? Yes, let me send the link in the chat. This is the link to the GitHub. Great, any other questions? Oh cool, there's a question from Belal that says: is there any recommender system which does synchronicity? I'm not totally sure what you mean by synchronicity here. Belal, would you clarify
that? Okay, so if I'm understanding the question correctly, synchronicity probably means that it's able to use both content-based features and user engagement. We talked about the two-tower neural networks, which do that. Okay, so Belal says that synchronicity means the simultaneous occurrence of events which appear significantly related but have no discernible causal connection. Yeah, that's a good point, Belal. Hopefully two-tower neural networks are able to handle it. Let's say in collaborative filtering a user ended up rating two movies that appear to be related to each other, but maybe it's an anomaly. Because two-tower neural networks are really powerful and encode a bunch of other features, hopefully they help deal with that problem as well. Hopefully that answers your question. I'm not sure, but feel free to ask a follow-up question if it does not. Okay, how about serendipity? Yeah,
great point. For folks who don't know what serendipity is: recommendation systems tend to focus on things that users have previously interacted with. Let's say I watch a lot of thriller and action movies; I would be recommended a lot of movies that are very thriller- and action-oriented. It is oftentimes useful to give users the ability to explore new content and broaden their taste, because sometimes users might like things that they've not previously interacted with, or things that are not similar to the taste they've indicated through their previous interactions. So yes, we include serendipity in recommendation systems. One of the ways to include serendipity is basically
to do something hacky or heuristic-based: you can infuse some things that are really popular into a given user's recommendations, alongside what is based on their previous interactions. Again, say I interact with thriller or action movies and we want to introduce that component of serendipity. What we could do is give, let's say, the top three movies that are not thriller or action. Say I'm recommended a movie that is a comedy but is really highly rated, and then we look at the user's interaction with that new component. If the user gives that comedy movie a very high rating, this would be incorporated into our recommendation system, and hopefully in the next iterations of the recommendations they're presented, movies related to comedy show up. If the user did not like it, it would eventually be eliminated. Okay, great, there's another question from Belal: I did my master's thesis based on POI recommender systems
which use serendipity. My result was good based only on serendipity, not with novelty. Also, I used only those users for which data existed, not new users. It was a statistical algorithm, not AI. How to do it for those users which are new? Hmm, so I guess your question is around how to use serendipity for users who are new. Okay, so serendipity for new users. Again, I think it would be really interesting to look at your master's thesis, but from my understanding and from my experience working in industry, I think for new
users what we generally try to do is give things that are popular. But again, you can tweak your popularity-based recommendations in a certain fashion: you can segment your popular recommendations. Instead of going through all your popular recommendations, which in a movie recommendation system might all be thrillers, you can give popular recommendations from different genres and see what users interact with. There's obviously a bias component here, where movies that are up at the top are interacted with more. But as your user gets accustomed to your platform, they become an existing user, so you can eventually use the same approach that you used with your existing users to introduce that component of serendipity. If a user tends to interact a lot more with comedy movies, but you want to introduce thrillers to that user, you could try to sneak that into your recommendations for the user. That's one of the ways to do it. There might be some things that are more scientific, but in my industry experience that's one of the things we've used previously, and it has tended to work really well in terms of figuring out what a user might be interested in. Great, okay, so I want to jump into the tutorial now.
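The serendipity heuristic described in the answers above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not code from the talk's repository: the function name `inject_serendipity`, the catalog layout, and the ratings in the sample catalog are all hypothetical. It appends up to `k` highly rated items from genres the user has not engaged with to an existing recommendation list.

```python
def inject_serendipity(personalized, catalog, user_genres, k=3):
    """Append up to k highly rated items from genres the user has not engaged with.

    personalized: list of item ids already recommended to the user
    catalog: dict of item_id -> {"genres": set of str, "rating": float}
    user_genres: set of genres the user usually interacts with
    """
    already = set(personalized)
    outside = [
        item_id
        for item_id, meta in catalog.items()
        if item_id not in already and not (meta["genres"] & user_genres)
    ]
    # Highest rated first; ties are broken by item id so the result is deterministic.
    outside.sort(key=lambda item_id: (-catalog[item_id]["rating"], item_id))
    return list(personalized) + outside[:k]


# Hypothetical catalog; the ratings are made up for the example.
catalog = {
    "Heat": {"genres": {"Action", "Thriller"}, "rating": 4.1},
    "Speed": {"genres": {"Action"}, "rating": 3.6},
    "Groundhog Day": {"genres": {"Comedy"}, "rating": 4.0},
    "Toy Story": {"genres": {"Animation", "Comedy"}, "rating": 4.2},
    "Clueless": {"genres": {"Comedy"}, "rating": 3.5},
}

recs = inject_serendipity(["Heat", "Speed"], catalog, {"Action", "Thriller"}, k=2)
print(recs)  # ['Heat', 'Speed', 'Toy Story', 'Groundhog Day']
```

Whether the injected items stay in later rounds would then depend on how the user rates them, as described above. For a brand-new user with no history, `user_genres` would be empty and the same function simply returns the top-rated items overall.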
Yeah, great, so it looks like the link was incorrect. Thanks for catching that, Jean-Baptiste. So the link's in the chat, feel free to look at it again. I'll quickly walk through what this repository looks like. There's this data-1m folder, which has the data from the MovieLens website. I suggest that you download the CSV files, again because those are pre-processed; the .dat files are not. And there are a bunch of IPython notebooks here. All notebooks come in two categories: there are notebooks with _t in the name and notebooks without the _t. In this tutorial we'll go through the notebooks that have _t in the name. I've created a version of each notebook that has some fill-in-the-blanks, which I think will make things easier as we walk through it, but all the solutions
are available. Oops, I realized that I'm not sharing my screen, and I apologize for that. Okay, so this is what the repository looks like. There are two versions here. The _t version is the tutorial version, the one that we'll walk through here. It has some fill-in-the-blanks so you can practice as you're learning. The solutions are the ones without _t, and they're available here, so if you want to look at the solutions you can look here directly. Great, so let's jump into it. I'm going to start with the introduction and EDA notebook, and I will share my screen. One second, hold on, let me just set up my environment. Okay, so the way that this tutorial is structured is it's broken down into seven steps. I'm using the _t file, so again I recommend you use the _t file. We'll
start with loading the data and doing some exploratory analysis. We'll look at heuristics, so popularity-based recommendations, then content-based filtering and collaborative filtering, the approaches that we talked about earlier, and then we'll go into deep learning. In the deep learning section itself we'll cover some large language models, so how you can incorporate large language models in your recommendation systems, and so on. That will be part of our deep learning tutorial. So this is the flow, and I have structured all the notebooks around it: the notebooks are labeled with the different steps that we'll be going through. Great, so let's go into it. First chapter: loading
the data. Again, you'll find all the data in the data folder, data-1m, and here I'm just loading it. In the CSV files the data is separated by tabs, so it's technically not a CSV, it's a TSV, tab-separated values, so you can just add the tab as the separator here. Let's run this and let it load the data. Cool, our data is loaded. Let's quickly print the first rows of our movies data set, so I'm going to do movies.head(). There are three data sets. There is the movies data set, which contains movie IDs, titles, and genres, so what the movie is and what type of movie it is: animation, children's, comedy, and so on. There's a users data set, which we'll look into, but let's quickly investigate the movies data set a bit more. I'm going to print the info for the movies data set. Great, so it looks like we have a total of 3,883 movies, and it looks like there are no nulls. There's a movie ID, there's a title, and then there's the genre. I want to investigate this data
set, so what I'll do is build a word cloud. With a word cloud, we can look at all the words in the movie titles, and that will give us a picture of which words were most common in titles in the 1980s and 1990s, which is when this data set is from. There's this code for generating a word cloud; the library that I'm using is called the wordcloud library. It has stop words in it, so words like "the", "is", and "and" are removed from the titles. We pre-process the titles a bit here, and then we can use the matplotlib library to quickly plot the title word cloud that we generate. I'm going to run this, and, whoops, I also need to generate the title corpus here. It takes a little while to run, and there we go, this is our word cloud. What it does is size each word based on how often it is mentioned in the movie titles. You can see that back in the late '90s a lot of movie titles mentioned "man", "love", "night", "day", "dead", and so on. This is just pretty cool to see as you're doing exploratory
analysis for your data set. We can also plot the genres. In order to look at what the genres, or categories, of the movies were, we can quickly look at the counts per genre. What I've done is use pandas to split the genres. If you look at our movies' genres, they are separated by this pipe character, so I split the genres on the pipe and then do a quick count of how many movies there are for each genre. Then I use my standard plotting libraries to plot this out: I plot the values on the x-axis and the index on the other axis. Once you do the genre count, it generates a pandas Series that has the genre as the index and the number of times it was mentioned as the values. Cool, so this is the distribution of our movie genres. It looks like a lot of movies back in those days were dramas: we have about 1,600 movies that had a drama genre tag. There are a lot of comedy, action, and thriller movies, and very few movies that were fantasy, western, animation, and so on.
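The genre count described above can be sketched with pandas as follows. The three rows are typed in by hand as a stand-in for the movies table, and the column names `title` and `genres` are assumptions about how the notebook's data frame is laid out.

```python
import pandas as pd

# Tiny stand-in for the movies table; column names are assumptions.
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Heat (1995)", "Sabrina (1995)"],
    "genres": [
        "Animation|Children's|Comedy",
        "Action|Crime|Thriller",
        "Comedy|Romance",
    ],
})

# Split each pipe-delimited string into a list, give every genre its own row,
# then count how many movies carry each genre tag.
genre_counts = movies["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

The result is a pandas Series with the genre as the index and the count as the values, which is the shape the plotting step described above expects.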
The second data set that we have is called users, so let's look at the users data set. Pretty standard: if I do a head(), it has user IDs and a bunch of information about our users, like gender, age, and other demographic characteristics. Let's do a users.info() here. It shows we have 6,040 users, and it also shows that there are basically no nulls, which is pretty good. In the interest of time I'm going to skip the plotting here, but you can take a look at the solution, which shows the demographics plotted for our users: we have a lot more males in our data set, what the age distribution is, what the top 10 occupations are, and the age distribution for different occupations. This is really interesting, but it's not directly applicable to the things that we'll be talking about next, so I'm going to skip it, and you can take a look at the solution notebook. Great, okay, so the third data set that I want to quickly talk about
before we go into the methodologies is the ratings data set. I'm going to print ratings.head(), and (we're going to ignore these two columns for now) I see, for a given user and a given movie, what the rating is and the timestamp for when they gave the rating. There are some noisy columns in the data set, so let's just drop them quickly using the standard pandas DataFrame function to drop these. Oh, sorry, they've already been dropped, so let's just do ratings.head(). Cool, so this is what the ratings data set looks like: for each user and movie ID, the rating that the user gave the movie and the timestamp for that. Let's do a quick ratings.info(), and yep, it looks like there are no nulls.
Great, let's also plot the distribution of ratings. I'm just going to use the standard matplotlib library and plot the distribution of ratings. As you can see here, the ratings are skewed towards three, four, and five, so users tend to give a lot more high ratings than low ratings. It looks like there's quite a bit of bias in our data set, but that's fine, let's work with this bias. That's very common: if you're looking at any recommendation system, or at any ratings, you'll see that a lot of users tend to give ratings at the extremes, because generally it's when users are super happy or super dissatisfied with a product that they tend to rate it. If it's average, they tend to ignore it. So that's a very common problem across all the data sets in recommendation systems. Cool.
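To check the skew numerically rather than from the plot alone, a sketch along these lines could be used. The twelve ratings are hypothetical, chosen only to mimic a distribution that leans towards the high end; they are not drawn from MovieLens.

```python
import pandas as pd

# Hypothetical ratings on a 1-5 scale, standing in for the real ratings column.
ratings = pd.Series([5, 4, 4, 3, 5, 4, 2, 3, 4, 1, 5, 4], name="rating")

# Share of each rating value, listed in rating order.
distribution = ratings.value_counts(normalize=True).sort_index()
print(distribution)

# Quartiles of the same ratings.
print(ratings.quantile([0.25, 0.5, 0.75]))
```

The same `distribution` Series can be passed to a bar plot to reproduce the kind of chart shown in the notebook.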
So I'm also going to do a quick describe() of the data set. Again, we have about a million ratings, that's the 1M data set that we're looking at. For the rating, the minimum is one, and it looks like the 50th percentile is four, which is very interesting, so again our data set is skewed towards ratings of three, four, and five. The 25th percentile is three. But I think this is all rounded up; I think the 50th percentile was about 3.5 or something when I last investigated it. Cool, so I'm going to combine my data sets: I'm going to merge my movies and ratings data sets and put them in a combined data set. In the data-1m folder there's already a combined data set, but I'm just going to combine them here because it's good to look at everything together, and that's the main data set that we'll be looking at. Cool, so this notebook has a bunch of EDA. Sorry for rushing through it, but feel free to take a look at the solutions. I've also written some notes there that talk about the general EDA that can be
done. Great, okay, so let's jump into our second tutorial. One second, let me just open up my notes. Okay, so in this tutorial we'll quickly talk about popularity-based recommendations. Again, these are very simple, nothing complicated, so this is a pretty short notebook. I'm going to import my standard libraries: import pandas, import numpy as np. Oops, I'm not connected to my kernel. I'm going to read my data using the read_csv function. This is the combined data set that I'm looking at; it combines users, movies, and ratings into one data frame, so for a given movie we have its description, the ratings that different users have given, and the demographics of those users. In popularity-based recommendation systems, what we do is look at which movies are the most popular. So I'm going to group my data frame by title and then look at the rating. I'm going to look at the median, so what the median rating is, as well as how many users have given a rating. I'm looking at both median and count, and then I'm going to sort my values by the median. Looking at this, it looks like To Live, a 1994 movie, has the highest median. Actually, not the highest median, since a bunch of movies have a median of 5. But as I mentioned in the slides, we should not directly use popularity
based recommendations as they are, because we need to take into account how many times a given movie has been rated. So I use the Bayesian average, which is what we talked about. The Bayesian average establishes a prior: it looks at the global rating across all your movies, and you can set a minimum threshold, that is, the minimum number of ratings that you want to require. This is a common formula for calculating a Bayesian average, and I've written down a function for it here. We multiply C by m, where C is our global mean rating (for the global value I'm actually taking a median here) and m is that minimum count; in the usual form of the formula, the movie's own ratings are then added on top and the total is divided by m plus the movie's rating count. To calculate the threshold that I want to apply as my minimum number of ratings, I can take a quantile of the number of ratings that users have provided per movie. Let's say you have five movies and the quantile comes out at 10 ratings; let's establish that as the minimum number of ratings that we require. So I'm just going to run this cell and then again sort the values by
Bayesian average and generate a head() here. Great, so I see the rankings have completely changed from what I saw before. I see American Beauty, a very popular classic movie, high up there; it has a very high count and a very high Bayesian average rating. Star Wars, Saving Private Ryan, The Matrix, The Silence of the Lambs: very classic, very popular movies from the '90s, all showing up here. And if I look at this movie called Schlafes Bruder (Brother of Sleep), which has a count of one, I see it has a Bayesian average of three. So this has been downgraded from 5 to 3 because it has a very low rating count.
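The Bayesian average walked through above can be sketched with pandas. This is a minimal version of the commonly used formula, bayesian_avg = (C * m + sum of the title's ratings) / (m + number of ratings), not a copy of the notebook's function. The data and column names are hypothetical, and the sketch uses the global mean for C, whereas the talk mentions taking a median.

```python
import pandas as pd

# Hypothetical ratings: title A has many good ratings, B has a single 5, C is weak.
df = pd.DataFrame({
    "title": ["A"] * 6 + ["B"] * 1 + ["C"] * 3,
    "rating": [5, 4, 5, 4, 5, 4, 5, 2, 3, 2],
})

stats = df.groupby("title")["rating"].agg(["sum", "count", "mean"])

C = df["rating"].mean()            # prior: global mean rating (3.9 here)
m = stats["count"].quantile(0.5)   # prior weight: median ratings-per-title (3 here)

# Titles with few ratings are pulled towards C; titles with many keep their own mean.
stats["bayesian_avg"] = (C * m + stats["sum"]) / (m + stats["count"])

ranked = stats.sort_values("bayesian_avg", ascending=False)
print(ranked)
```

In this toy data, title B has the highest plain mean (a single 5.0), but after the adjustment A ranks first, which mirrors the one-rating movie being pulled down in the notebook. The quantile used for `m` is a tuning choice: a higher quantile demands more ratings before a title's own average dominates.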
This is our heuristic-based recommendation, very simplistic, but if you're using popularity-based heuristic recommendations, I encourage you to use a Bayesian average to account for the count of ratings. Awesome, now let's jump into our next approach, which is content-based filtering. Before I do, let me take a quick look at the chat to see if there are any questions. Looks like no, so I think we're good to proceed. Let me go back and share my screen. Again, this might feel a bit rushed, and I apologize for that. I suggest that you look at these notebooks and the solutions, which are all provided at the GitHub link. All of them have really detailed descriptions and comments, so hopefully that helps you understand the code as I'm walking through it. Great, okay, so let's look at content-based recommendations. Again, doing the standard stuff: loading my libraries (pandas, NumPy, matplotlib) and reading my data sets, so the ratings data set, the users data set, the movies data set, and the combined data set. Cool, all read in. We talked about the theory of content-based recommenders a bit, but basically, if a user watches a given movie, we find similar movies based on genres, actors, director, storyline, and so on, and then recommend those similar movies to that user. So you need to hand-code those
features. Great, okay, so there are multiple steps that one goes through to build a content-based recommender system; there's a flowchart here. We'll extract features of the different items. In our case we don't have a lot of features, because this is an educational data set, so we'll mainly look at the genre of a movie. If you look at our movies data set, the only feature that we have is genre, so we're going to find movies similar to a given movie based on the genres that are encoded. A given movie can have multiple genres, and that is important information that we'll use. Let's start with pre-processing our genres a bit. Genres are currently encoded with a pipe character between all the genres that a movie is part of, so I'm going to split my movie genre column on the pipe and put the result into an array. I'm also going to fill any nulls. This data set does not have nulls in general, but I have this in just in case there's an anomaly or something. So I'm running this pre-processing code, and when I do that I see that my genres are split into an array, and this array indicates which genres a given movie title is tagged with. For computing the features
of a given movie I'm going to use TF-IDF. TF-IDF is a very popular, industry-wide approach. In our case, it considers every movie as a separate document and looks at the genres each movie is encoded with. If a lot of movies in our data set have a genre like children's, it down-weights the children's genre, and if a movie has a more specific genre like animation, and very few movies have animation as a genre, it gives that genre more weight, so those movies end up close together. It's a way to encode features that is usually used in natural language processing, but it's a methodology that we can use on our data set as well. scikit-learn has a TF-IDF vectorizer, so I'm going to use the TfidfVectorizer from scikit-learn here. I'm going to remove stop words, the English stop words, and then I will just fit the TfidfVectorizer on my data set. What it has done is, for every movie in my data set, it has built 127 features. It has built 127 features because I've also used n-grams, and an n-gram basically looks at words that occur together. So it has 127 features, which are
basically the genres of a given movie. Once I have this matrix, I can calculate cosine similarity. There are a bunch of different similarity metrics some of you might be aware of, like cosine distance, Euclidean distance, Manhattan distance, and so on. I'm going to use cosine similarity here. Again, scikit-learn has cosine similarity built in, so I'm going to use that and calculate the cosine similarity of my TF-IDF matrix with itself. If you look at our TF-IDF matrix as a dense matrix, todense() I think (oops, it doesn't have that attribute), okay, this is what the matrix looks like. It has all the movies, and each movie is encoded based on its genres. It shows zeros because there are 127 values in each row, so some in between might be populated. So I calculate cosine similarity, and then I have a function here that, given a movie title, finds the most similar movies to that movie. Running this function, I'm going to calculate similar movies to this movie called Toy Story. If I run this, it gives me the
similar movies to the movie byy story and as you would see um I think this makes sense to me um Aladin again a kid's movie animated movie uh American
Tale I haven't seen it so I don't know what movies it is Rugrats bug lives Toy Story 2o very similar to the movie here so if we using this methodology using
tfidf and calculating similarity scores um if we input a movie we can eventually calculate uh what the similar movies are um similar if I look at the Matrix
it gives me very similar movie to the Matrix so Nemesis um you know Nemesis is a movie that I've not seen I've seen Terminator
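A minimal sketch of the steps described above, not the notebook's actual code. The toy `movies` table, its column names, the `ngram_range=(1, 2)` setting, and the helper name `get_similar_movies` are all assumptions made for illustration; the talk only says that English stop words are removed and that an n-gram is used.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the movies table (hypothetical data, not the talk's data set).
movies = pd.DataFrame({
    "title": ["Toy Story", "Aladdin", "A Bug's Life", "The Matrix", "Terminator 2"],
    "genres": [
        "Animation Children Comedy",
        "Animation Children Comedy Musical",
        "Animation Children Comedy",
        "Action Sci-Fi Thriller",
        "Action Sci-Fi Thriller",
    ],
})

# Encode each movie's genre string as a TF-IDF vector.
# ngram_range=(1, 2) is an assumed setting: single words plus word pairs.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(movies["genres"])

# Cosine similarity of the TF-IDF matrix with itself: one row and column per movie.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


def get_similar_movies(title, top_n=3):
    """Return the top_n movies most similar to `title`, by genre TF-IDF."""
    idx = movies.index[movies["title"] == title][0]
    scores = pd.Series(cosine_sim[idx], index=movies["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n)


print(get_similar_movies("Toy Story", top_n=2))
```

With this toy data, movies that share genre words with Toy Story score highest, while movies with no genre words in common score zero.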
But yeah, movies along similar themes. Great. So once we have generated these content-based recommendations, what
you can basically do is, if a user has watched a movie and rated any movie five stars, you can just recommend all similar movies to that, based on
this content-based recommendation system. Great. So there is another way to calculate recommendations for a
user using the content-based approach. One of the ways, as I mentioned before, was that you can technically just take the latest movie that a user has given five stars to, and then you can surface all the
similar movies to that. The other way to do it is using something called user profile creation, and what user profile creation does is
that, you know, we have these vectors, these TF-IDF vectors that we've created. Basically, for a given user, take all the movies that the user has rated
highly, let's say four or five; you can average those vectors together for that user. So let's say I'm user one and I've given movies five and six a
really high rating, a rating of four or five. I can average those vectors; I can take a simple mean, a simple average, of those vectors. So let's say I've given these two movies a very
high rating. I can just average them and then calculate similarity to that average vector, so calculate the similarity of each movie to that average
vector. And this is good because instead of relying on just one movie, you are generating a user profile using the average of all the vectors. So the user profile is all the
movies that they've rated highly, just averaged, and this is basically depicting a given user's taste, because for every movie that a user has watched, we are
taking that as a component in this user profile. So the code here basically walks through it. I'm going to skip that, again in the interest of time.
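Since the notebook code is skipped in the talk, here is a hedged sketch of what user profile creation could look like. The toy `movies` and `ratings` tables, their column names, the `min_rating=4.0` threshold, and the function name `recommend_for_user` are assumptions for illustration, not the speaker's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the movies and ratings tables.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["Toy Story", "Aladdin", "A Bug's Life", "The Matrix", "Terminator 2"],
    "genres": [
        "Animation Children Comedy",
        "Animation Children Comedy Musical",
        "Animation Children Comedy",
        "Action Sci-Fi Thriller",
        "Action Sci-Fi Thriller",
    ],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 2, 4],
    "rating": [5.0, 4.0, 2.0],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["genres"])


def recommend_for_user(user_id, min_rating=4.0, top_n=2):
    """Average the TF-IDF vectors of highly rated movies, then rank unseen movies."""
    user_ratings = ratings[ratings["userId"] == user_id]
    liked_ids = user_ratings.loc[user_ratings["rating"] >= min_rating, "movieId"]
    # Row positions of the liked movies (assumes the user has at least one).
    liked_rows = np.flatnonzero(movies["movieId"].isin(liked_ids).to_numpy())

    # User profile: simple mean of the liked movies' TF-IDF vectors.
    profile = np.asarray(tfidf_matrix[liked_rows].mean(axis=0)).reshape(1, -1)

    # Similarity of every movie to the profile; drop movies already rated.
    scores = cosine_similarity(profile, tfidf_matrix).ravel()
    ranked = movies.assign(score=scores)
    ranked = ranked[~ranked["movieId"].isin(user_ratings["movieId"])]
    return ranked.sort_values("score", ascending=False).head(top_n)


print(recommend_for_user(1, top_n=1)[["title", "score"]])
```

Averaging means one unusual five-star rating has less influence than it would if recommendations were based on a single movie.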
Basically, it creates a user vector, it averages a user's profile based on the movies that they've watched, and then it calculates those similarities. So
for every movie in our data set, it calculates the movie's similarity with the profile that we've created for a given user, and then there are simple standard functions for
calculating recommendations for the movies. So again, solutions are here in
the notebook version, and once you create a profile, you can just enter any user ID and it will output the list of movies that we should recommend to that given
user. Great. Okay, so we have five minutes. I am going to talk about
collaborative filtering quickly, because I think this is the most popular approach, to be honest. In industry I've used this the most,
even over neural networks. This is really simple, really intuitive, and that's why I want to spend the next five minutes talking about the
collaborative filtering approach. Great. So opening this notebook 04 to collaborative filtering, let's load a bunch
of libraries, and I'm going to just copy-paste things from my solution in the interest of time. Let's load the data
sets, and I am going to do a quick head here. I have my combined data set. Okay, cool. So if you remember from the theory
that we talked about, the key component of collaborative filtering is the user-movie matrix. The values in this matrix are the ratings from the user, and basically
we need to generate this matrix from this
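One common way to build the user-movie matrix described above is a pandas pivot. This is only a sketch under assumptions: the toy `ratings` table and its column names (`userId`, `title`, `rating`) are hypothetical, and the talk does not show which call the notebook actually uses.

```python
import pandas as pd

# Hypothetical combined ratings table: one row per (user, movie) rating.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title": ["Toy Story", "The Matrix", "Toy Story", "Aladdin", "The Matrix"],
    "rating": [5.0, 3.0, 4.0, 5.0, 4.5],
})

# Rows are users, columns are movies, values are ratings.
# Movies a user has not rated show up as NaN, which is why the matrix is sparse.
user_movie_matrix = ratings.pivot_table(
    index="userId", columns="title", values="rating"
)

print(user_movie_matrix)
```

Note that `pivot_table` averages duplicate (user, movie) pairs by default, which is a design choice of this sketch rather than something stated in the talk.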