AI Foundations Course – Python, Machine Learning, Deep Learning, Data Science
By freeCodeCamp.org
Summary
## Key takeaways
- **AI Talent Shortage**: Over 80% of companies worldwide cannot find data scientists and AI professionals, with demand projected to grow as the data science market exceeds $400 billion. [00:33], [00:55]
- **High AI Salaries**: Data science and AI professionals earn $150,000 to $200,000 in the US, with top machine learning experts exceeding $500,000. [01:15], [01:25]
- **Machine Learning Applications**: Machine learning powers healthcare diagnostics like cancer detection, finance fraud detection, Netflix recommenders, and autonomous vehicles via deep learning. [12:10], [17:01]
- **Core ML Skills Roadmap**: Master linear algebra, calculus, statistics, Python libraries like scikit-learn and TensorFlow, plus top algorithms from linear regression to XGBoost. [17:33], [27:35]
- **Portfolio Projects Essential**: Build recommender systems, regression for salary prediction, spam classification, and customer segmentation using K-means to showcase supervised and unsupervised skills. [38:57], [43:42]
- **Bias-Variance Tradeoff**: Model error decomposes into bias, variance, and irreducible error; complex models have low bias but high variance, simple models the opposite, requiring balance to minimize test error. [01:05:23], [01:11:06]
Topics Covered
- AI Demand Outstrips Supply by 80%
- Recommender Systems Power Netflix Sales
- Master Math Before ML Algorithms
- Bias-Variance Tradeoff Drives Model Choice
- L1 Lasso Shrinks Features to Zero
Full Transcript
learn about machine learning and AI with this comprehensive 11-hour course this is not just a crash course this course covers everything from fundamental concepts to advanced algorithms complete with real-world case studies in recommender systems and predictive analytics this course goes beyond theory to provide hands-on implementation experience career guidance and great insights from industry professionals it also includes a career guide on how to build a data science career launch a startup and prepare for interviews Tatev Aslanyan from LunarTech developed this course over 80% of the companies worldwide are unable to find data scientists and AI professionals to bring their ideas into the market and to become more competitive in the next decade the demand for data science and ML professionals is only going to increase as the market in data science and AI is projected to pass the 400 billion valuation about 5% of all employees worldwide are asking their employers to get training in the field of generative
AI and machine learning and if this didn't convince you to get into this lucrative and highly in-demand industry then let me tell you that the salaries of data science and AI professionals are at the moment about $150,000 up to $200,000 in the US and in some cases the salaries of the most in-demand AI and machine learning professionals can pass $500,000 in the US welcome to this involved crash course in machine learning and data science in these 11 hours you will get a comprehensive overview of machine learning and data science from different perspectives both the theory practice implementation career insights and what you can expect from this career this will be a great course for anyone who wants to become a machine learning engineer or AI engineer so here's what we are going to cover as part of this comprehensive crash course in machine learning so we are going to start with the machine learning roadmap for 2024 here we are going to provide you a structured overview of the machine learning landscape helping you understand what it is like to become a machine learning engineer what you can
expect from this career what exactly you need to learn what kind of skill sets from what kind of Industries and also you are going to see what kind of career
directions you can take in the field of machine learning so how you can get into machine learning how you can kickstart a career and what it is that you need to
learn after that we are going to get into the top machine learning algorithms so here you will learn the most important machine learning algorithms
from linear regression to advanced algorithms like the boosting algorithms of course this won't be a comprehensive machine learning course because it is aimed to provide you the basics and the fundamentals but this would be a great starting point for you to get a taste of what it's like to learn the theory of machine learning you will learn the theory you will also learn the definitions the pros and the cons of these algorithms along with the practical Python implementation so this will be a great way to learn the basics
and to also learn how to implement this in Python of course as a prerequisite for this it is required that you know the basics of Python like how to create lists how to work with scikit-learn or how to create variables so this is important next up we have the hands-on case studies so after learning the
basics in machine learning in terms of the theory and implementation in Python with the real examples you are ready to get into a handson machine learning work
and this won't be just quick case studies that you can complete in 30 minutes but rather those will be three involved different case studies so we will start with the basics like performing a behavior analysis and data analytics which is always a must when it comes to becoming machine learning or AI engineers so you will learn how to perform data analytics how to perform customer segmentation using Python how to perform data wrangling how to do exploratory data analysis all in Python and then to make those important conclusions and tell your data story this is really important as an AI professional to know data science and data analytics so this first case study which is the superstore customer behavior analysis that will be conducted and presented to you by Vahe Aslanyan co-founder of LunarTech will provide you good insights into the basics of machine learning and how to do data analytics and data science in a real-life case study.
In the second case study we will then get more hands-on with machine learning and we will be predicting Californian house prices we will do exploratory data analysis we will use Python to clean the data use statistics to perform outlier detection and data visualization we will also perform causal analysis and we will be using linear regression to perform the predictions by leveraging practical data analytics but this time also data science skills and combining this with Python libraries like scikit-learn and the third case study will be about building a movie recommender system so here we will explore NLP natural language processing another very important topic in the field of AI and machine learning these days and here we'll be using NLP we will also use machine learning and data science tools to develop a recommender algorithm so this project will then enhance your skills in text data analysis how to process this text data how to use Python for doing that as well as practical machine learning applications like building a recommender system keep in mind that you can also put these case studies on your resume to showcase your experience after we are done with these three end-to-end involved case studies we are going to provide you career insights now as a data science and AI professional you have two choices you can either decide to get into the corporate world so
become a data scientist or a professional or you can decide to build your own startup and to provide you information on both of these directions
in the first conversation you will join me and the data science manager from Allianz, Cornelius, where you can learn from him how to break into the field of data science and machine learning especially from a traditional background here you can get a lot of tips on succeeding in this field how to get promoted what to expect from interviews what that selection process is like and much more about a data science and AI corporate career so once we are done with that conversation then we will provide you the next choice which is about building a startup as a machine learning or AI professional so here you can then listen to the conversation between co-founder of LunarTech Vahe Aslanyan and a serial entrepreneur and successful investor Adam Coffey so here Adam Coffey will then provide you a lot of insights on how to launch a startup how to raise funds what to expect from this type of career so once we are done with this career insight as well we will then get into the final part of this course which is about interview preparation we'll conclude with providing you the most
popular machine learning interview questions with the corresponding detailed answers this will be great for anyone who wants to Ace their interviews and who is now preparing for machine
learning or AI interviews this crash course for 11 hours is more than just a short introduction it's an involved comprehensive overview of everything
that you can expect from the world of machine learning and AI if you want to become more hands-on and get the entire comprehensive overview and learn everything in one place to become a job-ready machine learning and AI professional then make sure to check out LunarTech.ai our data science boot camps and many other courses that will provide you that all-in-one approach to become a job-ready professional if you like this video make sure to like subscribe and comment so if you're ready I'm really excited let's get started hi there in this video we are going to talk about how you can get into machine
learning in 2024 first we are going to start with all the skills that you need in order to get into machine learning step by step what are the topics that you need to
cover and what are the topics that you need to study in order to get into machine learning we are going to talk about what is machine learning then we are going to cover step by step what are
the exact topics and the skills that you need in order to become a machine learning researcher or just get into machine learning then we're going to cover the type of exact projects you can
complete so examples of portfolio projects in order to put it on your resume and to start to apply for machine learning related jobs and then we are
going to also talk about the type of Industries that you can get into once you have all the skills and you want to get into machine learning so the exact career path and what kind of business
titles are usually related to machine learning we are also going to talk about the average salary that you can expect for each of those different machine learning related positions at the end of
this video you are going to know what exactly machine learning is where it is used what kind of skills there are that you need in order to get into machine learning in 2024 and what kind of career path with what kind of compensation you can expect with the corresponding business titles when you want to start your career in machine learning I'm Tatev co-founder of LunarTech and I come from an econometrics and statistics background I've been in the tech field and specifically in data science and AI for the last five years working across different data science and AI projects across the globe and now I'm going to tell you what exactly machine learning is and what are the skill sets that you need in order to get into machine learning in 2024 so without further ado let's get started so what is machine learning machine learning is a branch of artificial intelligence or AI that helps to build models based on the data and then learn from this data in order to make different decisions so we will first start with the definition of machine learning what machine learning is and what are the different sorts of applications of machine learning that you most likely have heard of but you didn't know were based on machine learning so machine learning
is a branch of artificial intelligence that is using data in order to learn from this data by using different sorts of algorithms and it's being used across different industries starting from healthcare to entertainment in order to improve the customer experience identify customer behavior improve the sales for the businesses and it also helps governments to make decisions so it really has a wide range of applications so let's start with healthcare for instance machine learning is being used in healthcare to help with the diagnosis of diseases it can help to diagnose cancer during COVID it helped many hospitals to identify whether people were getting more severe effects or getting pneumonia based on those pictures and that was all based on machine learning and specifically computer vision in healthcare it's also being used for drug discovery it's being used for personalized medicine for personalizing treatment plans to improve the operations of the hospitals to understand what is the amount of people and patients that a hospital can expect in each of those days and weeks and also to estimate the amount of doctors that need to be available the amount of people that the hospital can expect in the emergency room based on the day or the time of the day and this is basically another machine
learning application then we have uh machine learning in finance machine learning is being largely used in finance for different applications starting from fraud detection in credit
cards or in other sorts of banking operations it's also being used in trading specifically in combination with quantitative finance to help traders to make decisions whether they need to go short or long on different stocks or bonds or different assets in general to estimate the price that those stocks and assets will have in real time in the most accurate way it's also being used in retail it helps to understand the estimated demand for certain products in certain warehouses it also helps to understand what is the most appropriate or closest warehouse from which the items for that corresponding customer should be shipped so it's optimizing the operations it's also being used to build different recommender systems and search engines like the famous Amazon is doing so every time when you go to Amazon and you are searching for a product you will most likely see many item recommendations and that's based on machine learning because Amazon is gathering the data and comparing your behavior so based on what you have bought based on what you are searching to other customers and those items to other items in order to understand what are the items that you will most likely be interested in and eventually will buy and that's exactly based on machine learning and specifically different sorts of recommender system algorithms and then we have marketing where machine learning is being heavily used because this can help to understand
what are these different tactics and specific targeting groups that you belong to and how retailers can target you in order to reduce their marketing cost and to result in higher conversion rates so to ensure that you buy their product then we have machine learning in autonomous vehicles which is based on machine learning and specifically deep learning applications and then we also have natural language processing which is highly related to the famous ChatGPT I'm sure you are using it and that's based on machine learning and specifically the large language models so the Transformers the large language models where you go and provide your text and then a question and ChatGPT will provide an answer to you or in fact any other virtual assistant or chatbots those are all
based on machine learning and then we have also uh smart home devices so Alexa is based on machine learning also in agriculture uh machine
learning is being used heavily these days to estimate what the weather conditions will be uh to understand what will be the uh production of different
plants uh what will be the um outcome of this uh to understand and to make decisions uh also how they can optimize those uh crop uh yields to monitor for
uh soil health and for different sorts of applications that can just in general uh improve the uh revenue for the farmers then we have of course in the
entertainment so the Vivid example is Netflix that uses the uh data uh that you are providing uh related to the movies and also based on what kind of
movies you are watching Netflix is building this super smart recommender system to recommend you movies that you most likely will be interested
in and you will also like it so in all this machine learning is being used and it's actually super powerful topic and super powerful uh field to get into and
in the upcoming 10 years this is only going to grow so if you have made that decision or you are about to make that decision to get into machine learning continue watching this video because I'm
going to tell you exactly what kind of skills you need and what kind of practical projects you can complete in order to get into machine learning in
2024 so you first need to start with mathematics you also need to know python you also need to know statistics you will need to know machine learning and
you will need to know some NLP to get into machine learning so let's now unpack each of those skill sets so independent of the type of machine learning you are going to do you need to know mathematics and specifically you need to know linear algebra so you need to know what is matrix multiplication what are the vectors matrices the dot product you need to know how you can multiply those different matrices a matrix with a vector what are these different rules the dimensions also what it means to take the transpose of a matrix the inverse of a matrix the identity matrix the diagonal matrix those are all concepts as part of linear algebra that you need to know as part of your mathematical skill set in order to understand those different machine learning algorithms.
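As a quick, minimal sketch of these linear algebra operations in Python (using NumPy; the matrix and vector values below are made up purely for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])   # a 2x2 matrix
B = np.array([[1.0, 4.0], [2.0, 5.0]])
v = np.array([1.0, 2.0])                 # a vector

print(A @ B)                 # matrix multiplication (inner dimensions must match)
print(A @ v)                 # matrix-vector multiplication
print(np.dot(v, v))          # dot product of two vectors
print(A.T)                   # transpose of a matrix
print(np.linalg.inv(A))      # inverse of a (non-singular) matrix
print(np.eye(2))             # identity matrix
print(np.diag([1.0, 3.0]))   # diagonal matrix
```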
Then as part of your mathematics you also need to know calculus and specifically differential theory so you need to know these different theorems such as the chain rule the rule of differentiating when you have a sum of instances when you have a constant multiplied with an instance when you have a sum but also subtraction division multiplication of two items and then you need to take the derivative of that what is this idea of a derivative what is the idea of a partial derivative what is the idea of the Hessian so first-order derivative second-order derivative and it would be also great to know basic integration theory so we have differentiation and the opposite of it is integration theory so this is kind of basic you don't need to know too much when it comes to calculus but those are basic things that you need to know in
order to succeed in machine learning then the next concepts such as discrete mathematics so you need to know what is this idea of graph theory what are combinations and combinatorics what is this idea of complexity which is important when you want to become a machine learning engineer because you need to understand what is this big O notation so you need to understand what is the complexity of n squared the complexity of n the complexity of n log n and beyond that you need to know some basic mathematics which usually comes from high school so you need to know multiplication division you need to understand multiplying amounts which are within parentheses you need to understand different symbols that represent mathematical values you need to know this idea of using x and y and then what is x squared what is y squared what is x to the power of 3 so different exponents of the different variables then you need to know what is a logarithm what is the logarithm at base 2 what is the logarithm at base e and then at base 10 what is the idea of e what is the idea of pi what is this idea of the exponent and the logarithm and how those transform when it comes to taking the derivative of the logarithm or taking the derivative of the exponent those are all values and topics that are actually quite basic they might sound complicated but they are actually not so if someone explains
it to you clearly then you will definitely understand it from the first go and for this to understand all those different mathematical concepts so linear algebra calculus differential theory and then discrete mathematics and those different symbols you need to go for instance and look for courses or YouTube tutorials that are about basic mathematics for machine learning and AI don't go and look further you can check for instance Khan Academy which is quite a favorite when it comes to learning math both for uni students and also for just people who want to learn mathematics and this will be your guide or you can check our resources at LunarTech.ai because we are also going to provide these resources for you in case you want to learn mathematics for your machine learning journey the next skill set that you need to gain in order
to break into machine learning is the statistics so you need to know this is a must statistics if you want to get into machine learning and in AI in general so
there are few topics that you must um study when it comes comes to statistics and uh those are descriptive statistics multivariate statistics inferential
statistics probability distribution and some bial thinking so let's start with descriptive statistics when it comes to descriptive statistics you need to know what is side
of mean uh median standard deviation variance and uh just in general how you can uh analyze the data with using this
descriptive measure me so distance measures but also variational measures then the next topic area that you need to know as part of your statistical
journey is the inferential statistics so you need to know those Infamous theories such as Central limit theorem the law of
uh large numbers uh and how you can um relate to this idea of population sample unbias sample and also u a hypothesis
testing confidence interval statistical sign ific an uh and uh how you can test different theories by using uh this idea of statistical significance uh what is
the power of the test what is type one error what is type two error so uh this is super important for understanding different SS of machine learning applications if you want to get into
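To make the hypothesis testing ideas a bit more concrete, here is a minimal sketch of a two-sample t-test in Python with SciPy; the two samples are simulated, so the numbers are only illustrative and not part of the course material:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# two simulated samples, e.g. a control group and a treatment group
control = rng.normal(loc=100, scale=15, size=200)
treatment = rng.normal(loc=104, scale=15, size=200)

# two-sample t-test: the null hypothesis says the population means are equal
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# alpha is the type I error rate we accept (statistical significance level)
alpha = 0.05
print("statistically significant" if p_value < alpha else "not significant")
```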
Then you have probability distributions and this idea of probabilities so to understand those different machine learning concepts you need to know what probabilities are so what is this idea of probability what is this idea of sample versus population what does it mean to estimate a probability what are those different rules of probability so conditional probability and those probability values and rules that you can usually apply when you have a probability of products or a probability of sums and then you need to know some popular probability distribution functions and those are the Bernoulli distribution the binomial distribution the normal distribution the uniform distribution the exponential distribution so those are all super important distributions that you need to know in order to understand this idea of normality and normalization also this idea of Bernoulli trials and relating different probability distributions to different higher-level statistical concepts so rolling a die the probability of it how it is related to the Bernoulli distribution or to the binomial distribution and those are super important when it comes to hypothesis testing but also for many other machine learning applications.
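Here is a small, optional sketch of how a few of these distributions can be explored in Python with scipy.stats; the parameters are arbitrary examples:

```python
from scipy import stats

# Bernoulli: a single trial with success probability p
print(stats.bernoulli.pmf(1, p=0.5))        # P(X = 1) = 0.5

# Binomial: number of sixes in 10 rolls of a fair die
print(stats.binom.pmf(2, n=10, p=1/6))      # P(exactly two sixes)

# Normal: density and cumulative probability
print(stats.norm.pdf(0, loc=0, scale=1))    # standard normal density at 0
print(stats.norm.cdf(1.96))                 # roughly 0.975

# Uniform on [0, 1] and Exponential with rate 1 (scale = 1/rate)
print(stats.uniform.rvs(size=3))            # three random draws
print(stats.expon.mean(scale=1.0))          # expected value = 1.0
```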
So then we have Bayesian thinking this is super important when it comes to more advanced machine learning but also some basic machine learning you need to know what is the Bayes theorem which arguably is one of the most popular statistical theorems out there comparable also to the central limit theorem you need to know what is conditional probability what is this Bayes theorem and how it relates to conditional probability what is this Bayesian statistics idea at a very high level you don't need to know everything in super detail but you need to know these concepts at least at a higher level in order to understand machine learning so to learn statistics and the fundamental concepts of statistics you can check out the Fundamentals of Statistics course at LunarTech.ai here you can learn all the required concepts and topics and you can practice them in order to get into machine learning and to gain the statistical
skills the next skill set that you must know is the fundamentals to machine learning so this covers not only the basics of machine learning but also the
most popular machine learning algorithms so you need to know the different mathematical sides of these algorithms step by step how they work what are the benefits of them what are the downsides and which one to use for what type of applications so you need to know this categorization of supervised versus unsupervised versus semi-supervised then you need to know what is the idea of classification regression or clustering then you need to know also time series analysis you also need to know these different popular algorithms including linear regression also logistic regression LDA so linear discriminant analysis you need to know KNN you need to know decision trees both the classification and regression case you need to know random forest bagging but also boosting so popular boosting algorithms like LightGBM GBM so gradient boosting models and you need to know XGBoost you also need to know some unsupervised learning algorithms such as K-means usually used for clustering you need to know DBSCAN which is becoming more and more popular among clustering algorithms you also need to know hierarchical clustering and for all these types of models you need to understand the idea behind them what are the advantages and disadvantages whether they can be applied for unsupervised versus supervised versus semi-supervised you need to know whether they are for regression classification or for clustering besides these popular
algorithms and models you also need to know the basics of training a machine learning model so you need to know the process behind training validating and testing your machine learning algorithms so you need to know what it means to perform hyperparameter tuning what are those different optimization algorithms that can be used to optimize your parameters such as GD SGD SGD with momentum Adam and AdamW.
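To give an intuition for what these optimizers do at their core, here is a minimal sketch of plain gradient descent fitting a simple linear regression in NumPy; the data is synthetic and the learning rate is an arbitrary choice, so this is for intuition only rather than how you would train models in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 5.0 + rng.normal(0, 1, size=100)   # true slope 3, intercept 5

w, b = 0.0, 0.0          # parameters to learn
lr = 0.01                # learning rate (a hyperparameter)

for step in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    # gradients of the mean squared error loss with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w     # gradient descent update
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # should be close to 3 and 5
```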
You also need to know the testing process this idea of splitting the data into train validation and then test you need to know resampling techniques why they are used including bootstrapping and cross-validation and the different sorts of cross-validation techniques such as leave-one-out cross-validation k-fold cross-validation the validation set approach you also need to know this idea of metrics and how you can use different metrics to evaluate your machine learning models such as classification types of metrics like F1 score F-beta precision recall cross-entropy and also you need to know some metrics that can be used to evaluate regression types of problems like the mean squared error so MSE the root mean squared error or RMSE the MAE so the absolute version of those different sorts of errors or the
residual sum of squares for all these cases you not only need to know at a higher level what those algorithms or those topics or concepts are doing but you actually need to know the mathematics behind them their benefits and their disadvantages because during the interviews you can definitely expect questions that will test not only your higher-level understanding but also this background knowledge if you want to learn machine learning and you want to gain those skills then feel free to check out my Fundamentals of Machine Learning course which is part of the Ultimate Data Science Boot Camp at LunarTech.ai or you can also check out and download for free the fundamentals of machine learning handbook that I published with freeCodeCamp then the next skill set that you definitely need
to gain is knowledge of Python Python is actually one of the most popular programming languages out there and it's being used by software engineers AI engineers machine learning engineers data scientists so this is the universal language I would say when it comes to programming so if you're considering getting into machine learning in 2024 then Python will be your friend so knowing the theory is one thing then implementing it in the actual job is another and that's exactly where Python comes in handy so you need to know Python in order to perform descriptive statistics in order to train machine learning models or more advanced machine learning models or deep learning models you can use it for training validation and testing of your models and also for building different sorts of applications so Python is super powerful and therefore it's also gaining such high popularity across the globe because it has so many libraries it has TensorFlow and PyTorch both of which are a must if you want to get not only into machine learning but also the advanced levels of machine learning so if you are considering AI engineering jobs or machine learning engineering jobs and you want to train for instance deep learning models or you want to build large language models or generative AI models then you definitely need to learn PyTorch and TensorFlow which are frameworks that are used in order to implement different deep learning models which are advanced machine learning models here are a few libraries that you need to know in order to get into machine learning so you definitely need to know pandas NumPy you need to know scikit-learn SciPy you also need to know NLTK for text data you also need to know TensorFlow and PyTorch for a bit more advanced machine learning and besides this there are also data visualization libraries that I would definitely suggest you practice with which are matplotlib and specifically pyplot and also seaborn
when it comes to Python besides knowing how to use libraries you also need to know some basic data structures so you need to know what these variables are how you can create variables what are the matrices and arrays how indexing works and also what are the lists what are the sets so unique lists what are the different operations you can perform how does sorting for instance work I would definitely suggest you know some basic data structures and algorithms such as binary sort so an optimal way to sort your arrays you also need to know data processing in Python so you need to understand how to identify missing data how to identify duplicates in your data how to clean this how to perform feature engineering so how to combine multiple variables or to perform operations to create new variables you also need to know how you can aggregate your data how you can filter your data how you can sort your data and of course you also need to know how you can perform A/B testing in Python and how you can train machine learning models how you can test them and how you can evaluate them and also visualize their performance.
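Here is a minimal sketch of what that data processing work can look like in pandas; the tiny DataFrame is invented purely for illustration:

```python
import pandas as pd

# a tiny made-up dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "age": [25, 32, 32, None, 41],
    "spend": [120.0, 300.0, 300.0, 80.0, 150.0],
})

print(df.isna().sum())                              # identify missing data per column
df = df.drop_duplicates()                           # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())    # impute missing ages

# feature engineering: combine columns into a new variable
df["spend_per_year_of_age"] = df["spend"] / df["age"]

# aggregating, filtering and sorting
print(df.groupby("customer")["spend"].sum())        # aggregate spend per customer
print(df[df["spend"] > 100])                        # filter rows
print(df.sort_values("spend", ascending=False))     # sort by spend
```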
If you want to learn Python then the easiest thing you can do is just to Google for Python for data science or Python for machine learning tutorials or blogs or you can even try out the Python for Data Science course at LunarTech.ai in order to learn all these basics and the usage of these libraries and some practical examples when it comes to Python for machine learning the next skill set that you need to gain in order to get into
machine learning is the basic introduction to NLP natural language processing so you need to know how to work with text Data given that these
days the text data is the cornerstone of all these different advanced algorithms such as GPTs Transformers the attention mechanisms so those applications that you see as part of building a chatbot or these personalized applications based on text data they are all based on NLP so therefore you need to know these basics of NLP to just get started with machine learning so you need to know this idea of text data what are those strings how you can clean text data so how you can clean that dirty data that you get and what are the steps involved such as lowercasing removing punctuation tokenization also what is this idea of stemming lemmatization stop words and how you can use NLTK in Python in order to perform this cleaning.
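A minimal sketch of that cleaning pipeline with NLTK follows; the sample sentence is made up, and the nltk.download calls fetch the corpora this sketch assumes are available:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of the resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The movies were GREAT, and the actors' performances amazed everyone!"

tokens = nltk.word_tokenize(text.lower())                            # lowercasing + tokenization
tokens = [t for t in tokens if t not in string.punctuation]          # remove punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # remove stop words

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])           # stemming: "movies" -> "movi"
print([lemmatizer.lemmatize(t) for t in tokens])   # lemmatization: "movies" -> "movie"
```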
You also need to know this idea of embeddings and you can also learn this idea of TF-IDF which is a basic NLP algorithm you can also learn this idea of word embeddings subword embeddings and character embeddings if you want to learn the basics of NLP you can check out those concepts and learn them as part of blogs there are many tutorials on YouTube you can also try the Introduction to NLP course at LunarTech.ai in order to learn these different basics that form NLP if you want to go beyond this intro-to-medium level of machine learning and you also want
to learn a bit more advanced machine learning and this is something that you need to know after you have gained all these previous skills that I mentioned then you can gain this knowledge and skill set by learning deep learning and you can also consider getting into generative AI topics so you can for instance learn what are RNNs what are ANNs what are CNNs you can learn what is this autoencoder concept what are variational autoencoders what are generative adversarial networks so GANs you can understand what is this idea of reconstruction error you can understand these different sorts of neural networks what is this idea of backpropagation the optimization of these algorithms by using the different optimization algorithms such as GD SGD SGD with momentum Adam AdamW RMSprop you can also go one step beyond and get into generative AI topics such as the variational autoencoders like I just mentioned but also the large language models so if you want to move towards the NLP side of generative AI and you want to know how ChatGPT has been invented how the GPTs work or the BERT model then you will definitely need to get into this topic of language models so what are the n-grams what is the attention mechanism what is the difference between self-attention and attention what is a one-head self-attention mechanism what is a multi-head self-attention mechanism you also need to know at a high level this encoder-decoder architecture of Transformers so you need to know the architecture of Transformers and how they solve different problems of recurrent neural networks or RNNs and LSTMs you can also look into encoder-based or decoder-based algorithms such as GPTs or the BERT model and those all will help you to not only get into machine learning but also stand out from all the other candidates by having this advanced knowledge let's now talk about different sorts of projects that you can complete in order to train
your machine learning skill set that you just learned uh so there are few projects that I suggest you to complete and you can put this on your resume to start to apply for machine learning
roles the first application and project that I would suggest you to do is building a basic recommender system whether it's a job recommender system or a movie recommender system in this way you can showcase how you can use for instance text data from those job advertisements or how you can use numeric data such as the ratings of the movies in order to build a top-N recommender system this will showcase your understanding of distance measures such as cosine similarity the KNN algorithm idea and this will help you to tackle this specific area of data science and machine learning.
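A minimal sketch of such a content-based recommender with scikit-learn, using TF-IDF text features and cosine similarity; the movie titles and plot descriptions are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# tiny made-up catalogue: title -> plot description
movies = {
    "Space Quest": "astronauts explore a distant planet and fight aliens",
    "Galaxy Wars": "space battles between alien fleets and brave astronauts",
    "Love in Paris": "a romantic story about two artists in Paris",
    "Kitchen Dreams": "a young chef opens a restaurant in Paris",
}

titles = list(movies.keys())
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(movies.values())      # TF-IDF representation of each plot

# cosine similarity between every pair of movies
sim = cosine_similarity(matrix)

# recommend the most similar movie to "Space Quest" (excluding itself)
query = titles.index("Space Quest")
scores = sorted(enumerate(sim[query]), key=lambda x: x[1], reverse=True)
best = next(i for i, s in scores if i != query)
print("Because you liked Space Quest, try:", titles[best])   # likely "Galaxy Wars"
```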
The next project I would suggest you to do will be to build a regression-based model so in this way you will showcase that you understand this idea of regression how to work with predictive analytics and a predictive model that has a dependent variable a response variable that is in numeric format so here for instance you can estimate the salaries of jobs based on the characteristics of the job based on data which you can get for instance from open-source web pages such as Kaggle and you can then use different sorts of regression algorithms to perform your predictions of the salaries evaluate the model and then compare the performance of these different machine learning regression-based algorithms for instance you can use linear regression you can use the regression version of decision trees you can use random forest you can use GBM XGBoost in order to showcase and then in one graph compare the performance of these different algorithms by using a single regression ML model metric for instance the RMSE this project will showcase that you understand how you can train a regression model how you can test it and validate it and it will showcase your understanding of the optimization of these regression algorithms and that you understand the concept of hyperparameter tuning.
The next project that I would suggest you to do in order to showcase your classification knowledge so when it comes to predicting a class for an observation given the feature space would be to build a classification model that would classify emails as being spam or not spam so you can use publicly available data that will be describing a specific email and then you will have multiple emails and the idea is to build a machine learning model that would classify the email to class zero and class one where class zero for instance can be not spam and class one being spam so with this binary classification you will showcase that you know how to train a machine learning model for classification purposes and you can here use for instance logistic regression you can also use the decision tree for the classification case you can also use random forest XGBoost for classification GBM for classification and with all these models you can then obtain the performance metrics such as the F1 score or you can plot the ROC curve or the area under the curve metric and you can also compare those different classification models so in this way you will also tackle another area of expertise when it comes to machine learning.
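A minimal sketch of the spam classifier with scikit-learn; the handful of example emails is invented, whereas the real project would use a public spam dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

emails = [
    "win a free prize now, click here",
    "cheap loans, limited offer, act now",
    "meeting rescheduled to monday at 10am",
    "please find the quarterly report attached",
    "congratulations you won a free lottery ticket",
    "can you review my pull request today",
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(emails)   # bag-of-words features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0, stratify=labels)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
preds = clf.predict(X_test)

print("F1 score:", f1_score(y_test, preds))
print("ROC AUC :", roc_auc_score(y_test, probs))
```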
Then the final project that I would suggest you to do would be from unsupervised learning to showcase another area of expertise and here you can for instance use data to segment your customers into good better and best customers based on their transaction history the amount of money that they are spending in the store so in this case you can for instance use K-means DBSCAN hierarchical clustering and then you can evaluate your clustering algorithms and then select the one that performs the best so you will then in this case cover yet another area of machine learning which would be super important to showcase that you can not only handle recommender systems or supervised learning but also unsupervised learning.
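A minimal sketch of that segmentation idea with scikit-learn, comparing K-means, DBSCAN, and hierarchical clustering by silhouette score on simulated customer features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# simulated customers: total spend and number of transactions (three rough groups)
rng = np.random.default_rng(7)
spend = np.concatenate([rng.normal(100, 20, 100), rng.normal(500, 50, 100), rng.normal(1500, 100, 100)])
orders = np.concatenate([rng.normal(3, 1, 100), rng.normal(12, 2, 100), rng.normal(30, 4, 100)])
X = StandardScaler().fit_transform(np.column_stack([spend, orders]))

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
}

# pick the algorithm whose clusters get the best silhouette score
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:          # silhouette needs at least two clusters
        print(name, round(silhouette_score(X, labels), 3))
```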
And the reason why I suggest you to cover all these different areas and complete these four different projects is because in this way you will be covering different expertise and areas of machine learning so you will also be putting projects on your resume that cover different sorts of algorithms different sorts of metrics and approaches and it will showcase that you actually know a lot about machine learning now if you want to go beyond the basic or medium level and you want to be considered for medium or advanced machine learning levels and positions you also need to know a bit more which means that you need to complete more advanced projects for instance
if you want to apply for generative AI related or large language model related positions I would suggest you to complete a project where you are building a very basic large language model and specifically the pre-training process which is the most difficult one so in this case for instance you can build a baby GPT and I'll put a link here that you can follow where I'm building a baby GPT a basic pre-trained GPT algorithm where I am using text data publicly available data in order to process data in the same way that GPT is doing along with the encoder part of the Transformers in this way you will showcase to your hiring managers that you understand this architecture behind Transformers the architecture behind the large language models and the GPTs and you understand how you can use PyTorch in Python in order to do this advanced NLP and generative AI task.
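As a flavor of what such a project involves, here is a minimal sketch of a single masked self-attention head in PyTorch, the basic building block behind GPT-style models; the dimensions are arbitrary and this is only an illustrative fragment, not the baby GPT project itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, as used in GPT-style models."""
    def __init__(self, embed_dim, head_dim, block_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)
        # causal mask so position t can only attend to positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v

# toy usage: batch of 4 sequences, 8 tokens each, 32-dim embeddings
x = torch.randn(4, 8, 32)
head = SelfAttentionHead(embed_dim=32, head_dim=16, block_size=8)
print(head(x).shape)  # torch.Size([4, 8, 16])
```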
And finally let's now talk about the common career path and the business titles that you can expect from a career in machine learning so assuming that you have gained all the skills that are a must for breaking into machine learning there are different sorts of business titles that you can apply for in order to get into machine learning so when it comes to machine learning there are different fields that are covered as part of this so first we have the general machine learning researcher a machine learning researcher is basically doing research so training testing evaluating different machine learning algorithms they are usually people who come from an academic background but it doesn't mean that you cannot get into machine learning research without a degree in statistics mathematics or machine learning specifically not at all so if you have this desire and this passion for reading and doing research and you don't mind reading research papers then a machine learning researcher job would be a good fit for you so machine learning combined with research then sets you up for the machine learning researcher role then we have the machine learning engineer so the machine learning engineer is
the engineering version of the machine learning expertise which means that we are combining machine learning skills with engineering skills such as productionizing pipelines building robust pipelines scalability of the model considering all these different aspects of the model not only from the performance side when it comes to the quality of the algorithm but also its scalability when putting it in front of many users so when it comes to combining engineering with machine learning then you get machine learning engineering so if you are someone who is a software engineer and you want to get into machine learning then machine learning engineering would be the best fit for you so for machine learning engineering you not only need to have all these different skills that I already mentioned but you also need to have a good grasp of the scalability of algorithms the data structures and algorithms type of skill set the complexity of the model also system design so this one converges more towards and is similar to the software engineering position combined with machine learning rather
than your pure machine learning or AI role then we have the AI research versus AI engineering positions so the AI research position is similar to the machine learning research position and the AI engineer position is similar to the machine learning engineer position with only a single difference when it comes to machine learning we are specifically talking about traditional machine learning so linear regression logistic regression and also random forest XGBoost bagging and when it comes to the AI research and AI engineer positions here we are tackling the more advanced machine learning so here we are talking about deep learning models such as RNNs LSTMs GRUs CNNs or computer vision applications and we are also talking about generative AI models large language models so we are talking about the Transformers the implementation of Transformers the GPTs T5 all these different algorithms that are from more advanced AI topics rather than traditional machine learning for those you will then be applying for AI research and AI engineering positions and finally you have these different sorts of specializations niches within AI for instance NLP researcher and NLP engineer or even data science positions for which you will need to know machine learning and knowing machine learning will set you apart for those sorts of positions so also the business titles such as data scientist or technical data science positions NLP researcher NLP engineer for all of these you will need to know machine learning and knowing machine learning will help you to break into
those positions and those career paths in this lecture we will go through the basic concepts in machine learning that are needed to understand and follow conversations and solve main problems using machine learning a strong understanding of machine learning basics is an important step for anyone looking to learn more about or work with machine learning we'll be looking at three core concepts in this tutorial we will define and look into the difference between supervised and unsupervised machine learning models then we will look into the difference between the regression
and classification type of machine learning models after this we will look into the process of training machine learning models from scratch and how to evaluate them by introducing performance
metrics what you can use depending on the type of machine learning model or problem you are dealing with so whether it's a supervised or unsupervised whether it's regression versus
classification type of problem machine learning methods are categorized into two types depending on the existence of the label data in the training data set which is especially
important in the training process so we are talking about the so-called dependent variable that we saw in the section on the fundamentals of statistics supervised and unsupervised machine learning models are two main types of machine learning algorithms one key difference between the two is the level of supervision during the training phase supervised machine learning algorithms are guided by the labeled examples while unsupervised algorithms are not a supervised learning model is more reliable but it also requires a larger amount of labeled data which can be time-consuming and quite expensive to obtain examples of supervised machine learning models include regression and classification types of models on the other hand unsupervised machine learning algorithms are trained on unlabeled data the model must find patterns and relationships in the data without the guidance of correct outputs so we no longer have a dependent variable so unsupervised ML models require training data that consists only of independent variables or the features and there is no dependent variable or label data that can supervise the algorithm when learning from the data examples of unsupervised models are clustering models and outlier detection techniques supervised machine learning methods are categorized into two types depending on the type of dependent
variable they are predicting so we have the regression type and we have the classification type some key differences between regression and classification include the output type the evaluation metrics and their application with regard to the output type regression algorithms predict continuous values while classification algorithms predict categorical values with regard to the evaluation metrics different evaluation metrics are being used for regression and classification tasks for example mean squared error is commonly used to evaluate regression models while accuracy is commonly used to evaluate classification models when it comes to applications regression and classification models are used in entirely different types of applications regression models are often used for prediction tasks while classification models are used for decision-making tasks regression algorithms are used to predict a continuous value such as a price or probability for example a regression model might be used to predict the price of a house based on its size location or other features examples of regression types of machine learning models are linear regression fixed effects regression XGBoost regression etc classification algorithms on the other hand are used to predict categorical values these algorithms take an input and classify it into one of several predetermined categories for example a classification model might be used to classify emails as spam or as not spam or to identify the type of an object in an image examples of classification types of machine learning models are logistic regression XGBoost classification random forest classification let us now look into
different types of performance metrics we can use in order to evaluate different types of machine learning models for regression models common evaluation metrics include the residual sum of squares which is the RSS the mean squared error which is the MSE the root mean squared error or RMSE and the mean absolute error which is the MAE these metrics measure the difference between the predicted values and the true values with a lower value indicating a better fit for the model so let's go through these metrics one by one the first one is the RSS or the residual sum of squares this is a metric commonly used in the setting of linear regression when we are evaluating the performance of the model in estimating the different coefficients and here the beta is a coefficient the y_i is our dependent variable value and the y_hat is the predicted value as you can see the RSS or the residual sum of squares of beta is equal to the sum of the squares of (y_i minus y_hat_i) across all i from 1 up to n where i is the index of each row or individual or observation included in the data the second metric is the MSE or the mean squared error which is the average of the squared differences between the predicted values and the true values so as you can see the MSE is equal to 1/n times the sum across all i of (y_i minus y_hat_i) squared as you can see the RSS and the MSE are quite similar in terms of their formulas the only difference is that we are adding a 1/n and this makes it the average across all the squared differences between the predicted value and the actual true value a lower value of MSE indicates a better fit the RMSE which is the root mean squared error is the square root of the MSE so as you can see it has the same formula as the MSE only with the difference that we are adding a square root on top of that formula a lower value of RMSE indicates a better fit and finally the MAE or the mean absolute error is the average absolute difference between the predicted values so the y_hat and the true values or y_i a lower value of this indicates a better fit the choice of a regression metric depends on the specific problem you are trying to solve and the nature of your data for instance the MSE is commonly used when you want to penalize large errors more than the small ones MSE is sensitive to outliers which means that it may not be the best choice when your data contains many outliers or extreme values the RMSE on the other hand which is the square root of the MSE is easier to interpret so it's more easily interpretable because it's in the same units as the target variable it is commonly used when you want to compare the performance of different models or when you want to report the error in a way that is easier to understand and to explain the MAE is commonly used when you want to penalize all errors equally regardless of their magnitude and MAE is less sensitive to outliers compared to MSE.
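Here is a minimal sketch of computing these regression metrics in Python; the true and predicted values are made-up numbers purely to show the formulas in action:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])     # actual values y_i
y_pred = np.array([2.5, 5.5, 6.0, 9.5])     # predicted values y_hat_i

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
mse = np.mean((y_true - y_pred) ** 2)        # mean squared error (= RSS / n)
rmse = np.sqrt(mse)                          # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error

print(rss, mse, rmse, mae)
# the same MSE and MAE values via scikit-learn
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```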
For classification models common evaluation metrics include accuracy precision recall and F1 score these metrics measure the ability of the machine learning model to correctly classify instances into the correct categories let's briefly look into these metrics individually so the accuracy is the proportion of correct predictions made by the model it's calculated by taking the correct predictions so the correct number of predictions and dividing by the total number of predictions which means correct predictions plus incorrect predictions next we will look into the precision so precision is the proportion of true positive predictions among all positive predictions made by the model and it's equal to true positives divided by true positives plus false positives so all the positive predictions true positives are cases where the model correctly predicts a positive outcome while false positives are the cases where the model incorrectly predicts a positive outcome the next metric is recall recall is the proportion of true positive predictions among all actual positive instances it's calculated as the number of true positive predictions divided by the total number of actual positive instances which means dividing the true positives by true positives plus false negatives so for example let's say we are looking into a medical test a true positive would be a case where the test correctly identifies a patient as having a disease while a false positive would be a case where the test incorrectly identifies a healthy patient as having the disease and the final score is the F1 score the F1 score is the harmonic mean of the precision and recall with a higher value indicating a better balance between precision and recall and it's calculated as two times recall times precision divided by recall plus precision.
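A minimal sketch of these classification metrics in Python with scikit-learn; the label vectors are invented toy values:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (1 = positive, e.g. has the disease)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy :", accuracy_score(y_true, y_pred))      # (tp + tn) / all predictions
print("precision:", precision_score(y_true, y_pred))     # tp / (tp + fp)
print("recall   :", recall_score(y_true, y_pred))        # tp / (tp + fn)
print("f1 score :", f1_score(y_true, y_pred))            # 2 * precision * recall / (precision + recall)
```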
For unsupervised models such as clustering models the performance is typically evaluated using metrics that measure the similarity of the data points within a cluster and the dissimilarity of the data points between different clusters we have three types of metrics that we can use homogeneity is a measure of the degree to which all of the data points within a single cluster belong to the same class a higher value indicates a more homogeneous cluster so as you can see the homogeneity h, where h is simply the short way of describing homogeneity, is equal to 1 minus the conditional entropy of the classes given the cluster assignments divided by the entropy of the classes if you're wondering what this entropy is then stay tuned as we are going to discuss entropy when we discuss clustering as well as decision trees the next metric is the silhouette score the silhouette score is a measure of the similarity of a data point to its own cluster compared to the other clusters a higher silhouette score indicates that the data point is well matched to its own cluster this is usually used for DBSCAN or K-means so here the silhouette score can be represented by this formula the silhouette score s(o) is equal to b(o) minus a(o) divided by the maximum of a(o) and b(o) where s(o) is the silhouette coefficient of the data point characterized by o, a(o) is the average distance between o and all the other data points in the cluster to which o belongs and b(o) is the minimum average distance from o to all the clusters to which o does not belong the final metric we will look into is the completeness completeness is another measure of the degree to which all of the data points that belong to a particular class are assigned to the same cluster a higher value indicates a more complete cluster.
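A minimal sketch of these clustering metrics with scikit-learn, on a small synthetic dataset where the true classes are known so that homogeneity and completeness can be computed:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

# synthetic data with 3 known classes
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("homogeneity :", homogeneity_score(y_true, labels))    # each cluster contains mostly one class
print("completeness:", completeness_score(y_true, labels))   # each class ends up mostly in one cluster
print("silhouette  :", silhouette_score(X, labels))          # cohesion vs separation, no labels needed
```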
Let's conclude this lecture by going through the step-by-step process of evaluating a machine learning model, in a very simplified version, since there are many additional considerations and techniques that may be needed depending on the specific task and the characteristics of the data. Knowing how to properly train a machine learning model is really important, since this defines the accuracy of the results and the conclusions you will make. The training process starts with preparing the data. This includes splitting the data into training and test sets or, if you are using more advanced resampling techniques that we will talk about later, splitting your data into multiple sets. The training set of your data is used to feed the model; if you also have a validation set, then this validation set is used to optimize your hyperparameters and to pick the best model, while the test set is used to evaluate the model's performance. In the later lectures of this section we will talk in detail about these different techniques, as well as what training means, what the test set means, what validation means, as
well as what the hyper parameter tuning means secondly we need to choose an algorithm or set of algorithms and train the model on the training data and save
the fitted model there are many different algorithms to choose from and the appropriate algorithm will depend on the specific task and the characteristics of the data as a third
step we need to adjust the model parameters to minimize the error on the training set by performing hyperparameter tuning for this we need to use validation data and then we can
select the best model that results in the least possible validation error rate in this step we want to look for the optimal set of parameters that are included as part of our model to end up
with a model that has the least possible error, so it performs in the best possible way. In the final two steps we need to evaluate the model. We are always interested in the test error rate, and not the training or the validation error rates, because during training we have only used the training and validation sets and never the test set, so this test error rate will give you an idea of how well the model will generalize to new, unseen data. We need to use the optimal set of parameters from the hyperparameter tuning stage and the training data to train the model again with these hyperparameters and the best model, so we can use the best fitted model to get the predictions on the test data, and this will help us to calculate our test error
rate. Once we have calculated the test error rate and we have also obtained our best model, we are ready to save the predictions. So once we are satisfied with the model performance and we have tuned the parameters, we can use the model to make predictions on new, unseen data, on the test data, and compute the performance metrics for the model using the predictions and the real values of the target variable from the test data.
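Here is a minimal sketch of this train/validation/test workflow in scikit-learn; the synthetic data, the choice of a k-nearest-neighbors classifier, and the single hyperparameter being tuned are all assumptions made just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: split into training, validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Steps 2-3: train candidate models and tune a hyperparameter on the validation set
best_k, best_val_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Steps 4-5: retrain with the best hyperparameter and evaluate on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"best k = {best_k}, validation accuracy = {best_val_acc:.3f}, test accuracy = {test_acc:.3f}")
```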
And this completes this lecture. In this lecture we have spoken about the basics of machine learning, we have discussed the difference between unsupervised and supervised learning models as well as regression and classification, we have discussed in detail the different types of performance metrics we can use to evaluate different types of machine learning models, and we have looked into a simplified version of the step-by-step process to train a machine learning model. In this lecture, lecture number two, we will discuss a very important concept which you need to know before considering and applying any statistical or machine learning model; here I'm talking about the bias of the model and the variance of the model and the tradeoff between the two, which we call the bias-variance tradeoff. Whenever you are using a statistical, econometric, or machine learning model, no matter how simple the model is,
you should always evaluate your model and check its error rate in all these cases it comes down to the tradeoff you make between the variance of the model and the bias of your model because there
is always a catch when it comes to the model choice and the performance. Let us first define what the bias and the variance of a machine learning model are. The inability of the model to capture the true relationship in the data is called bias; hence, machine learning models that are able to detect the true relationship in the data have low bias. Usually, complex models or more flexible models tend to have a lower bias than simpler models. Mathematically, the bias of the model can be expressed as the expectation of the difference between the estimate and the true value, that is, E[f hat(x)] - f(x). Let us also define the variance of the model. The variance of the model is the inconsistency level or the variability of the model's performance
when applying the model to different data sets when the same model that is trained using training data performs entirely differently than on the test data this means that there is a large
variation or variance in the model complex models or more flexible models tend to have a higher variance than simpler models in order to evaluate the performance of
the model we need to look at the amount of error that the model is making for Simplicity let's assume we have the following simple regression model which aims to use a single independent variable X to model the numeric y
dependent variable. That is, we fit our model on our training observations, where we have pairs of independent and dependent variables (x1, y1), (x2, y2), up to (xn, yn), and we obtain an estimate f hat of the true function f based on the training observations. We can then compute f hat(x1), f hat(x2), up to f hat(xn), which are the estimates y1 hat, y2 hat, ..., yn hat for our dependent variable values y1, y2, ..., yn. If these are approximately equal to the actual values, so y1 hat is approximately equal to y1, y2 hat is approximately equal to y2, and so on, then the training error rate will be small. However, if we are really interested in whether our model is predicting the dependent variable appropriately, then instead of looking at the training error rate we want to look at our test error rate. The expected test error of the model is the expected squared difference between the real test values and their predictions, where the predictions are made using the machine learning model: E[(y - y hat)^2]. We can rewrite this error as a sum of two quantities: the left part is [f(x) - f hat(x)]^2, and the second part is the variance of the error term, Var(epsilon). So the accuracy of y hat as a prediction for y depends on these two quantities, which we can call the reducible error, equal to [f(x) - f hat(x)]^2, and the irreducible error, the variance of epsilon. In general, f hat will not be a perfect estimate of f, and this inaccuracy will introduce some error. This error is reducible, since we can potentially improve the accuracy of f hat by using the most appropriate machine learning model and the best version of it to estimate f. However, even if it were possible to find a model that would estimate f perfectly, so that the estimated response took the form y hat = f(x), our prediction would still have some error in it. This happens because y is also a
function of the error term epsilon, which by definition cannot be predicted by using our feature X, so there will always be some error that is not predictable. The variability associated with the error epsilon also affects the accuracy of the predictions, and this is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by epsilon. This error contains all the features that are not included in our model, so all the unknown factors that have an influence on our dependent variable but are not included as part of our data. But we can reduce the reducible error, which is based on two values: the variance of the estimates and the bias of the model. If we simplify the mathematical expression describing the error that we got, then it is equal to the variance of our model, plus the squared bias of our model, plus the irreducible error. So even if we cannot reduce the irreducible error, we can reduce the reducible error, which is based on the two values, the variance and the squared bias. Though the mathematical derivation is out of the scope of this course, just keep in mind that the reducible error of the model can be described as the sum of the variance of the model and the squared bias of the model. So, mathematically, the expected error of a supervised machine learning model is equal to the squared bias of the model, plus the variance of the model, plus the irreducible error.
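Written out as a formula, this is the standard bias-variance decomposition of the expected test error at a point x0 (a sketch of the result the lecture describes, not the full derivation):

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\left(\operatorname{Bias}\big[\hat{f}(x_0)\big]\right)^2}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\big[\hat{f}(x_0)\big]}_{\text{variance}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}},
\qquad
\operatorname{Bias}\big[\hat{f}(x_0)\big] = E\big[\hat{f}(x_0)\big] - f(x_0).
```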
Therefore, in order to minimize the expected test error rate on the unseen data, we need to select the machine learning method that simultaneously achieves low variance and low bias, and that's exactly what we call the bias-variance tradeoff. The problem is that there is a negative correlation between the variance and the bias of the model. Another thing that is highly related to the bias and the variance of
the model is the flexibility of the machine learning model so flexibility of the machine learning model has a direct impact on its variance and on its
bias let's look at this relationships one by one so complex models or more flexible models tend to have a lower bias but at the same time complex models
or more flexible models tend to have higher variance than simpler models. So, as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias of the model; at the same time, the variance of such models increases. As the flexibility of the model decreases, the model finds it more difficult to find the true patterns in the data, which then increases the bias of the model but also decreases the variance of the model. Keep this topic in mind; we will continue it in the next lecture, when we will be
discussing the topic of overfitting and how to solve the overfitting problem by using regularization in this lecture lecture number three we will talk about very important concept called
overfitting and how we can solve overfitting by using different techniques including regularization this topic is related to the previous lecture and to the topics
of error of the model train error rate test error rate bias and a variance of the machine learning model overfitting is important to know and also how to solve it with
regularization because this topic can lead to inaccurate predictions and a lack of generalization of the model to new data knowing how to detect and prevent
overfitting is crucial in building effective machine learning models questions about this topic are almost guaranteed to appear during every single data science
interview in the previous lecture we discussed the relationship between model flexibility and the variance as well as the bias of the model we saw that as the
flexibility of the model increases model finds the true pattern in the data easier which reduces the bias of the model but at the same time the variance of such models
increases so as the flexibility of the model decreases model finds it more difficult to find a two patterns in the data which then increases the bias of the model and decreases the variance of
the model let's first formally Define what the overfitting problem is as well as what the underfitting is so overfitting occurs when the model performs well in
the training while the model performs worse on the test data so you end up having a low training error rate but a high test error rate and in the ideal world we want our test error rate to be
low or at least that the training error rate is equal to the test error rate overfitting is a common problem in machine learning where a model learns the detail and noise in training data to
the point where it negatively impacts the performance of the model on this new data so the model follows the data too closely closer than it should this means
that the noise or random fluctuations of training data is picked up and learned as concepts by the model which it should actually ignore the problem is that the noise or
random component of the training data will be very different from the noise in the new data the model will therefore be less effective in making predictions on new data overfitting is caused by having
too many features, too complex a model, or too little data. When the model is overfitting, the model also has high variance and low bias. Usually, the higher the model flexibility, the higher the risk of overfitting, because then we have a higher risk of the model following the data, and the noise, too closely. Underfitting is the other way around: it occurs when the model is too simple to capture the true patterns in the data, so even the training error rate is high. Given that overfitting is a much bigger problem, and we ideally want to fix the case when our test error rate is large, we will only focus on overfitting. This is also a topic that you can expect during your data science interviews, as well as something that you need to be aware of whenever you are training a machine learning model. All
right so now that we know what overfitting is we should now talk about how we can fix this problem there are several ways of fixing or preventing overfitting first you can reduce the
complexity of the model we saw that higher the complexity of the model higher is the chance of the following the data including the noise too closely resulting in overfitting therefore
reducing the flexibility of the model will reduce the overfitting as well. This can be done by using a simpler model with fewer parameters or by applying regularization techniques such as L1 or L2 regularization, which we will talk about in a bit. A second solution is to collect more data: the more data you have, the less likely your model will overfit. A third solution is to use resampling techniques, one of which is cross-validation. This is a technique that allows you to train and test your model on different subsets of your data, which can help you to identify whether your model is overfitting; we will discuss cross-validation as well as other resampling techniques later in this section, and there is a short sketch of it right after this list of fixes. Another solution is to apply early stopping. Early stopping is a technique where you monitor the performance of the model on a validation set during the training process and stop the training when the performance starts to decrease. Another solution is to use ensemble methods: by combining multiple models, such as decision trees, overfitting can be reduced, and we will be covering many popular ensemble techniques in this course as well. Finally, you can use what we call dropout. Dropout is a regularization technique for reducing overfitting in neural networks by dropping out, or setting to zero, some of the neurons during the training process. From time to time, dropout-related questions do appear during data science interviews for people with no experience, so if someone asks you about dropout, then at least you will remember that it's a technique used to solve overfitting in the setting of deep learning. It's worth noting that there is no single solution that works for all types of overfitting, and often a combination of the techniques we just talked about should be used to address the problem.
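As promised above, here is a minimal sketch of using cross-validation to spot overfitting with scikit-learn; the synthetic dataset and the deliberately flexible decision tree are assumptions chosen only to make the gap between training and validation scores visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

# An unconstrained tree is flexible enough to memorize the training data
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation, keeping both train and validation scores
scores = cross_validate(model, X, y, cv=5, return_train_score=True)

print("mean train accuracy:     ", np.mean(scores["train_score"]))
print("mean validation accuracy:", np.mean(scores["test_score"]))
# A large gap between the two is a typical sign of overfitting
```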
We saw that when the model is overfitting, the model has high variance and low bias. By definition, regularization, or what we also call shrinkage, is a method that shrinks some of the estimated coefficients toward zero to penalize unimportant variables for increasing the variance of the model. This is a technique used to solve the overfitting problem by introducing a little bias into the model while significantly decreasing its variance. There are three types of regularization techniques that are widely known in the industry: the first one is ridge regression or L2 regularization, the second one is lasso regression or L1 regularization, and finally the third one is dropout,
which is a regularization technique used in deep learning. We will cover the first two types in this lecture. Let's now talk about ridge regression or L2 regularization. Ridge regression, or L2 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards zero. Ridge regression introduces a little bias into the model while significantly reducing the model variance. Ridge regression is a variation of linear regression, but instead of only minimizing the sum of squared residuals, as linear regression does, it aims to minimize the sum of squared residuals with the sum of squared coefficients added on top, what we call the L2 regularization term. Let's look at a multiple linear regression example with p independent variables or predictors that are used to model the dependent variable y. If you have followed the statistical section of this course, you might also recall that the most popular estimation technique for estimating the parameters of the linear regression, assuming its assumptions are satisfied, is ordinary least squares, or OLS, which finds the optimal coefficients by minimizing the sum of squared residuals, or the RSS. So ridge regression is pretty similar to
the OLS, except that the coefficients are estimated by minimizing a slightly different cost or loss function. This is the loss function of ridge regression, where beta_j is the coefficient of the model for variable j, beta_0 is the intercept, x_ij is the input value for variable j and observation i, y_i is the target variable or the dependent variable for observation i, n is the number of examples, and lambda is what we call the regularization parameter of ridge regression. So this is the loss function of OLS that you can see here, with a penalization term added; it combines what we call the RSS with the penalty.
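Reconstructed from these definitions, the ridge loss function being described looks like this:

```latex
L_{\text{ridge}}(\beta)
  = \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}}_{\text{RSS}}
  + \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{L2 penalty}},
\qquad \lambda \ge 0 .
```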
If you check out the very first lecture in this section, where we spoke about the different metrics that can be used to evaluate regression-type models, you can see the RSS and its definition. If you compare that expression, you can easily see that the left term is the formula for the RSS, including an intercept, and the right term is what we call the penalty amount, which basically represents lambda times the sum of the squares of the coefficients included in our model. Here lambda, which is always non-negative, so always larger than or equal to zero, is the tuning parameter or the penalty parameter. This expression of the sum of squared coefficients is called the L2 norm, which is why we call this L2-penalty-based regression or L2 regularization. In this way, ridge regression assigns a penalty to the variables by shrinking their coefficients towards zero, which reduces the overall model variance,
but these coefficients will never become exactly zero, so the model parameters are never set to exactly zero, which means that all p predictors of the model are still intact. This is a key property of ridge regression to keep in mind: it shrinks the parameters towards zero but never sets them exactly equal to zero. The L2 norm is a mathematical term coming from linear algebra, and it stands for the Euclidean norm. We spoke about the penalty parameter lambda, what we also call the tuning parameter, which serves to control the relative impact of the penalty on the regression coefficient estimates. When lambda is equal to zero, the penalty term has no effect and ridge regression will reproduce the ordinary least squares estimates, but as lambda increases, the impact of the shrinkage penalty grows and the ridge regression coefficient estimates approach zero. What is important to keep in mind, which you can also see from this graph, is that in ridge regression a large lambda will assign a penalty to some variables by shrinking their coefficients towards zero, but they will never become exactly zero, which becomes a problem when you are dealing with a model that has a large number of features: your model then has low interpretability. Ridge regression's advantage over ordinary least squares comes from the earlier-introduced bias-variance tradeoff phenomenon: as lambda, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. The main advantages of ridge regression are: it solves overfitting; it can shrink the regression coefficients of less important predictors towards zero; it can improve the prediction accuracy as well, by reducing the variance while increasing the bias of the model; it is less sensitive to outliers in the data compared to linear regression; and it is computationally less expensive compared to lasso regression. The main disadvantage of ridge regression is low model interpretability when p, the number of features in your model, is large.
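To see ridge regression in practice, here is a minimal scikit-learn sketch; the synthetic regression data and the particular alpha values (scikit-learn's name for the lambda penalty parameter) are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data with only a few informative features
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=1)

# Plain OLS for comparison
ols = LinearRegression().fit(X, y)
print("OLS coefficients:  ", ols.coef_.round(2))

# Ridge with increasing penalty: coefficients shrink towards zero but never reach exactly zero
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"Ridge (alpha={alpha}):", ridge.coef_.round(2))
```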
Let's now look into another regularization technique, called lasso regression or L1 regularization. By definition, lasso regression, or L1 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients toward zero and setting some to exactly zero. Lasso regression, like ridge regression, introduces a little bias into the model while significantly reducing the model variance. There is, however, a small difference between the two regression techniques that makes a huge difference in their results. We saw that one of the biggest disadvantages of ridge regression is that it will always include all the predictors, all p predictors, in the final model, whereas lasso overcomes this disadvantage: in ridge regression, a large lambda or penalty parameter will assign a penalty to some variables by shrinking their coefficients toward zero, but they will never become exactly zero, which becomes a problem when your model has a large number of features and low interpretability, and lasso regression overcomes this disadvantage of ridge regression. Let's have a look at the loss function of L1
regularization. This is the loss function of OLS, which is the left part of the formula, called the RSS, combined with a penalty amount, which is the right-hand side of the expression: lambda times the sum of the absolute values of the coefficients beta_j. As you can see, this is the RSS that we just saw, which is exactly the same as in the loss function of OLS, and then we are adding the second term, which is basically lambda, the penalization parameter, multiplied by the sum of the absolute values of the coefficients beta_j, where j goes from one to p, and p is the number of predictors included in our model. Here, once again, lambda, which is always non-negative, larger than or equal to zero, is the tuning parameter or the penalty parameter. This expression of the sum of the absolute values of the coefficients is called the L1 norm, which is why we call this L1-penalty-based regression or L1 regularization.
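Written out from these same definitions, the lasso loss function being described is:

```latex
L_{\text{lasso}}(\beta)
  = \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}}_{\text{RSS}}
  + \underbrace{\lambda \sum_{j=1}^{p}\lvert\beta_j\rvert}_{\text{L1 penalty}},
\qquad \lambda \ge 0 .
```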
In this way, lasso regression assigns a penalty to some of the variables by shrinking their coefficients towards zero and setting some of these parameters to exactly zero. This means that some of the coefficients will end up being exactly equal to zero, which is a key difference between lasso regression and ridge regression. The L1 norm is a mathematical term coming from linear algebra, and it stands for the Manhattan norm or distance. You might see here a key difference when comparing the visual representation of lasso regression to the visual representation of ridge regression: if you look at this point, you can see that there will be cases where our coefficients will be set to exactly zero, this is where we have this intersection, whereas in the case of ridge regression, you can recall that there was not a single such intersection, so there were points where the circle was close to the intersection points, but there was not a single point where there was an intersection and the coefficients were set to zero. And that's the key difference between these two regression-type models, between these two regularization techniques. The main advantages of lasso regression are: it solves overfitting; lasso regression can shrink the regression coefficients of less important predictors toward zero and set some to exactly zero; and, as the model filters some variables out, lasso indirectly also performs what we call feature selection, such that the resulting model is highly interpretable, with fewer features, and
much more interpretable compared to ridge regression. Lasso can also improve the prediction accuracy of the model by reducing the variance while increasing the bias of the model, though not as much as ridge regression.
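Here is a minimal sketch contrasting lasso with ridge on the same kind of synthetic data as before; the dataset and the alpha value are assumptions, and the point to notice is that some lasso coefficients come out exactly zero while the ridge ones only get small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 10 features actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(2))   # small, but non-zero
print("Lasso coefficients:", lasso.coef_.round(2))   # several exactly 0.0 -> feature selection
print("Features kept by lasso:", (lasso.coef_ != 0).sum())
```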
Earlier, when speaking about correlation, we also briefly discussed the concept of causation. We discussed that correlation is not causation, and we also briefly spoke about the method used to determine whether there is causation or not: that model is the famous linear regression. Even if this model is recognized as a simple approach, it's one of the few methods that allows identifying features that have an impact, a statistically significant impact, on a variable that we are interested in and want to explain, and it also helps you identify how, and by how much, the target variable changes when changing the independent variable values. To understand the concept of
linear regression you should also know and understand the concepts of dependent variable independent variable linearity and statistical significant effect dependent variables are often referred
to as response variables or explained variables. By definition, the dependent variable is the variable that is being measured or tested; it's called the dependent variable because it's thought to depend on the independent variables. So you can have one or multiple independent variables, but you can have only one dependent variable that you are interested in; that is your target variable. Let's now look into the independent variable definition. Independent variables are often referred to as regressors or explanatory variables, and by definition the independent variable is the variable that is being manipulated or controlled in the experiment and is believed to have an effect on the dependent variable. Put differently, the value of the dependent variable is said to depend on the value of the independent variable. For example, in an experiment to test the effect of having a degree on the wage, the degree variable would be your independent
variable and the wage would be your dependent variable finally let's look into the very important concept of statistical significance we call the effect statistically significant if it's
unlikely to have occurred by random chance in other words a statistically significant effect is one that is likely to be real and not due to a random
chance. Let's now define the linear regression model formally, and then we will dive deep into the theoretical and practical details. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is assumed to be linear. When the linear regression model is based on a single independent variable, we call this model simple linear regression; when the model is based on multiple independent variables, we call it multiple linear regression. Let's look at the mathematical expression describing
linear regression. You can recall that when the linear regression model is based on a single independent variable, we just call it simple linear regression. The expression that you see here is the most common mathematical expression describing simple linear regression: y_i = beta_0 + beta_1 * x_i + u_i. In this expression, y_i is the dependent variable, and the i that you see here is the index corresponding to the i-th row. So whenever you are getting
data and you want to analyze this data you will have multiple rows and if your multiple rows describe the observation that you have in your data so it can be
people, or it can be other observations describing your data; the i then characterizes the specific row, each row that you have in your data, and y_i is then the dependent variable's value corresponding to that row. The same holds for x_i: x_i is then the independent variable, or the explanatory variable, or
the regressor that you have in your model which is the variable that we are testing so we want to manipulate it to see whether this variable has a statistically significant impact on the
dependent variable y so we want to see whether unit change in the X will result in a specific change in the Y and what kind of change is
that. So beta_0 that you see here is not a variable; it's called the intercept or constant, something that is unknown, so we don't have it in our data, and it is one of the parameters of linear regression. It's an unknown number which the linear regression model should estimate. We want to use the linear regression model to find out this unknown value, as well as the second unknown value, which is beta_1, and we can also estimate the error terms, which are represented by u_i. So beta_1, next to x_i, next to the independent variable, is also not a variable; like beta_0, it is an unknown parameter of the linear regression model, an unknown number which the linear regression model should estimate. Beta_1 is often referred to as the slope coefficient of variable X, which is the number that quantifies how much the dependent variable y will change if the
independent variable X will change by one unit so that's exactly what we are most interested in the beta one because this is the coefficient and this is the unknown number that will help us to
understand and answer the question whether our independent variable X has a statistically significant impact on our dependent variable y finally the U that
you see here, or the u_i in the expression, is the error term, or the amount of mistake that the model makes when explaining the target variable. We add this value since we know that we can never exactly and accurately estimate the target variable; we will always make some amount of estimation error and we can never estimate the exact value of y, so we need to account for this mistake that we are going to make, and that we know in advance we are going to make, by adding an error term to our model. Let's also have a brief look at how
multiple linear regression is usually expressed in mathematical terms so you might recall that difference between the simple linear regression and multiple linear regression is that the first one
has a single independent variable in it, whereas the latter, the multiple linear regression, like the name suggests, has multiple independent variables in it, so more than one. Knowing this type of expression is critical, since they not only appear a lot in interviews, but you will also see them in data science blogs, in presentations, in books, and in papers. So being able to quickly identify them and say, ah, I remember seeing this once, will help you to more easily understand and follow the process and the story line. What you see here you can read as y_i = beta_0 + beta_1 * x_1i + beta_2 * x_2i + beta_3 * x_3i + u_i. This is the most common mathematical expression describing multiple linear regression, in this case with three independent variables; if you were to have more independent variables, you would add them with their corresponding indices and coefficients. In this case, the
method will aim to estimate the model parameters which are beta 0o beta 1 beta 2 and beta 3 so like before Yi is our a dependent variable which is always a
single one so we only have one dependent variable then we have beta 0 which is our intercept or the constant then we have our first slope coefficient which is beta 1 corresponding to our first
independent variable X1. Then we have x_1i, which stands for the first independent variable, with index one, and the i stands for the index corresponding to the row. So whenever we have multiple linear regression, we always need to specify two indices, and not only one like we had in our simple linear regression: the index that characterizes which independent variable we are referring to, so whether it's independent variable one, two, or three, and then the index that specifies which row we are referring to, which is the index i. You might notice that in this case all the row indices are the same, because we are looking into one specific row and we are representing this row by using the independent variables, the error term, and the dependent variable. Then we are adding our next term, which is beta_2 * x_2i, so beta_2 is our third unknown parameter in the model and the second slope coefficient, corresponding to our second independent variable, and then we have our third independent variable with the corresponding slope coefficient beta_3,
and, as always, we also add an error term to account for the error that we know we are going to make. Now that we know what linear regression is and how to express it in mathematical terms, you might be asking the next logical question: how do we find those unknown parameters in the model in order to find out how the independent variables impact the dependent variable? Finding these unknown parameters is called estimation in data science and in general. So we are interested in finding the values that best
approximate the unknown values in our model, and we call this process estimation. One technique used to estimate linear regression parameters is called OLS, or ordinary least squares. The main idea behind this approach, the OLS, is to find the best-fitting straight line, the regression line, through a set of paired X and Y values, so our independent and dependent variable values, by minimizing the sum of squared errors: that is, to minimize the sum of squares of the differences between the observed values of the dependent variable and the values predicted by our model, this linear function of the independent variables, which are the residuals. This is a lot of information, so let's go through it step by step.
In linear regression, when we were expressing our simple linear regression, we had this error term, and we can never know the actual error term, but what we can do is estimate the value of the error term, which we call the residual. We want to minimize the sum of squared residuals: because we don't know the errors, we want to find a line that will best fit our data in such a way that the error we are making, the sum of squared errors, is as small as possible, and since we don't know the errors, we can estimate them by each time looking at the value predicted by our model and the true value, subtracting them from each other, and seeing how well our model is estimating the values that we have, so how well our model is estimating the unknown parameters. So we want to minimize the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variables, that is, minimize the sum of squared residuals. We define the estimate of a parameter or variable by adding a hat on top of the variable or parameter. In this case, you can see that y_i hat is equal to beta_0 hat plus beta_1 hat times x_i; you can see that we no longer have an error term here, and we say that y_i hat is the
estimated value of y_i, beta_0 hat is the estimated value of beta_0, and beta_1 hat is the estimated value of our beta_1, while x_i is still our data, so the values that we have in our data, and therefore it doesn't have a hat, since it does not need to be estimated. What we want to do is estimate our dependent variable, and we want to compare the estimated value we got using OLS with the actual, real value, such that we can calculate our errors, or rather the estimate of the error, which is represented by u_i hat. So u_i hat is equal to y_i minus y_i hat, where u_i hat is simply the estimate of the error term, or the residual. This predicted error is always referred to as the residual, so make sure that you do not confuse the error with the residual: the error can never be observed, you can never calculate it and you will never know it, but what you can do is predict the error, and when you predict the error, you get a residual. What OLS is trying to do is minimize the amount of error it's making; therefore it looks at the sum of squared residuals across all the observations and tries to find the line that will minimize its value. Therefore we say that OLS tries to find the best-fitting straight line such that it minimizes the sum of squared residuals.
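In symbols, using the hat notation just introduced, the residual and the OLS objective for simple linear regression are:

```latex
\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i,
\qquad
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^{2}.
```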
So far we have discussed this model mainly from the perspective of causal analysis, in order to identify features that have a statistically significant impact on the response variable, but linear regression can also be used as a prediction model for modeling linear relationships. So let's refresh our memory with the definition of the linear regression model. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is linear. In lecture number six from the statistical section we also discussed
how mathematically we can express what we call Simple linear regression and a multiple linear regression so this how the uh simple linear regression can be represented so uh in case of simple
linear regression you might recall that we are dealing with just a single independent variable and we always have just one dependent variable both in the single linear regression and in the
multiple linear regression so here you can see that Yi is equal to beta 0 + beta 1 * XI + UI where Y is the dependent variable and I is basically
the index of each observation or the row and then the beta 0 is The Intercept which is also known as constant and then the beta 1 is a slope coefficient or a parameter corresponding to the
independent variable X, which is an unknown constant that we want to estimate along with beta_0, and then x_i is the independent variable corresponding to observation i, and then finally
the UI is the error term corresponding to the observation I do keep in mind that this error term we are adding because we do know that we always are going to make a mistake and we can never perfectly estimate the dependent
variable therefore to account for this mistake we are adding this UI so let's also recall the estimation technique that we use to estimate the parameters of the linear regression
model, so beta_0 and beta_1, and to predict the response variable. We call this estimation technique OLS, or ordinary least squares. OLS is an estimation technique for estimating the unknown parameters in a linear regression model to predict the response or the dependent variable. So we need to estimate beta_0, to get beta_0 hat, and we need to estimate beta_1, to get beta_1 hat, in order to obtain y_i hat, where y_i hat is equal to beta_0 hat plus beta_1 hat times x_i, and where the
difference between y_i hat and y_i, so between the predicted value and the true value of the dependent variable, will then produce our estimate of the error, or what we also call the residual. The main idea behind this approach is to find the best-fitting straight line, the regression line, through the set of paired X and Y values by minimizing the sum of squared residuals. We want to minimize our errors as much as possible; therefore we take their squared version, sum them up, and minimize this entire error. So, to minimize the sum of squared residuals, the differences between the observed dependent variable and the values predicted by the linear function of the independent variables, we need to
use the OLS one of the most common questions related to linear regression that comes time and time again in the uh data science related interviews is a topic of
the assumptions of the linear regression model so you need to know each of these five fundamental assumptions of the linear regression and the OLS and also you need to know how to test whether
each of these assumptions are satisfied so the first assumption is the linearity Assumption which states that the relationship between the independent variables and the dependent variable is
linear we also say that the model is linear in parameters you can also check whether the linearity assumption is Satisfied by plotting the residual to the fitted values if the pattern is not
linear, then the estimates will be biased; in this case we say that the linearity assumption is violated, and we need to use more flexible models, such as tree-based models that we will discuss in a
bit that are able to model these nonlinear relationships the second assumption in the linear regression is the Assumption about randomness of the sample which means that the data is randomly sampled
and which basically means that the errors or the residuales of the different observations in the data are independent of each other you can also check whether the uh second assumption
so this assumption about random sample is Satisfied by plotting the residuals you can then check whether the mean of this residuales is around zero and if not then the OLS estimate will be biased
and the second assumption is violated; this means that you are systematically over- or under-predicting the dependent variable. The third assumption is the exogeneity assumption, which is a really important assumption often asked about during data science interviews. Exogeneity means that each independent variable is uncorrelated with the error terms. Exogeneity refers to the assumption that the independent variables are not affected by the error term in the model; in other words, the independent variables are assumed to be determined independently of the errors in the model. Exogeneity is a key assumption of the linear regression model, as it allows us to interpret the estimated coefficients as representing the true causal effect of the independent variables on the dependent variable. If the independent variables are not exogenous, then the estimated coefficients may be biased and the interpretation of the results may be invalid. In this case we call this problem an endogeneity problem, and we say that the independent variable is not exogenous but endogenous. It's important to carefully consider the exogeneity assumption when building a linear regression model, as violating this assumption can lead to invalid or misleading results. If this assumption is satisfied for an independent variable in the linear model, we call this independent variable exogenous; otherwise we call it endogenous and we say that we have an endogeneity problem. Endogeneity refers to the
situation in which the independent variables in the linear regression model are correlated with the error terms in the model in other words the errors are not independent of the independent
variables endogeneity is a violation of one of the key assumptions of the linear regression model which is that the independent variables are exogeneous or not affected by the errors in the model
endogenity can arise in a number of ways for example it can be caused by omitted variable bias in which an important predictor of the dependent variable is not included in the model it can also be
caused by reverse causality, in which the dependent variable affects the independent variable. Those two are very popular examples of cases where we can get an endogeneity problem, and those are things that you should know whenever you are interviewing for data science roles, especially when it's related to machine learning, because those questions are asked in order to test whether you understand the concept of exogeneity versus endogeneity, in which cases you can get endogeneity, and how you can solve it. In the case of omitted variable bias, let's say you are
estimating a person's salary, and you are using as independent variables their education, their number of years of experience, and some other factors, but you are not including in your model a feature that would describe the intelligence of the person, for instance their IQ. Given that these are very important indicators of how a person performs in their field, and can definitely have an indirect impact on their salary, not including these variables will result in omitted variable bias, because their effect will then be incorporated in your error term, and this can also relate to the other independent variables, since IQ is also related to the education that you have: the higher your IQ, usually the higher your education. In this way you will have an error term that includes an important variable, the omitted variable, which is then correlated with one or multiple of the independent variables included in your model. The other example, the other cause of the endogeneity problem, is reverse causality. What reverse causality means is basically that not only does the independent variable have an impact on the dependent variable, but the dependent variable also has an impact on the independent variable, so there is a reverse relationship, which is something we want to avoid. We want the features included in our model to have only an impact on the dependent variable, so they explain the dependent variable and not the other way around, because if you have it the other way, so the dependent variable impacts your independent variable, then you will have the error term being related to this independent variable, because there are components that also define your dependent variable. So knowing a few examples such as these that can cause endogeneity, that can violate the exogeneity assumption, is really
important. You can also check the exogeneity assumption by conducting a formal statistical test; this is called the Hausman test. This is an econometric test that helps to understand whether you have an exogeneity violation or not, but it is out of the scope of this course. I will, however, include many resources related to exogeneity, endogeneity, omitted variable bias, and reverse causality, and also how the Hausman test can be conducted; for that, check out the interview preparation guide, where you can also find the corresponding free resources. The fourth assumption in linear regression is the assumption of homoscedasticity. Homoscedasticity refers to the assumption that the variance of the errors is constant across all predicted values. This assumption is also known as the homogeneity of variance. Homoscedasticity is an important assumption of the linear regression model, as it allows us to use certain statistical techniques and make inferences about the parameters of the model.
If the errors are not homoscedastic, then the results of these techniques may be invalid or misleading. If this assumption is violated, then we say that we have heteroskedasticity. Heteroskedasticity refers to the situation in which the variance of the error terms in a linear regression model is not constant across all the predicted values, so we have a varying variance; in other words, the assumption of homoscedasticity in that case is violated and we say we have a problem of heteroskedasticity. Heteroskedasticity can be a real problem in linear regression, because it can lead to invalid or misleading results; for example, the standard errors of the estimates and the confidence intervals for the parameters may be incorrect, which means that the statistical tests may also have incorrect type one error rates. You might recall, when we were discussing linear regression as part of the fundamentals of statistics section of this course, that we looked into the output that comes from Python and saw that we get estimates as part of the output, as well as standard errors, then the t-test, so the Student's t-test, and then the corresponding p-values and the 95% confidence intervals. Whenever there is a heteroskedasticity problem, the coefficients might still be accurate, but the corresponding standard errors, the Student's t-test, which is based on the standard error, the p-values, as well as the confidence intervals may not be accurate. So you might get good and reasonable coefficients, but then you don't know how to correctly evaluate them: you might end up stating that certain independent variables are statistically significant, because their coefficients are statistically significant since their p-values are small, but in reality those p-values are misleading because they are based on the wrong statistical test and on the wrong
standard errors. You can check this assumption by plotting the residuals against the fitted values and seeing whether there is a funnel-like shape. If the spread of the residuals is roughly constant and there is no funnel-like shape, then you have constant variance; but if you do see this funnel-like shape, it indicates that your variances are not constant, and then we say we have a problem of heteroskedasticity.
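Here is a minimal sketch of such a residual plot with scikit-learn and matplotlib; the synthetic data with deliberately increasing noise is an assumption used just to produce a visible funnel shape.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data where the noise grows with x, so the errors are heteroskedastic
x = rng.uniform(0, 10, 300)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # noise standard deviation increases with x

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Residuals vs fitted values: a funnel shape suggests heteroskedasticity
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot (funnel shape = non-constant variance)")
plt.show()
```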
If you have heteroskedasticity, you can no longer use the OLS and the linear regression, and instead you need to look for other, more advanced econometric regression techniques that do not make such a strong assumption regarding the variance of your residuals. You can, for instance, use GLS, FGLS, or GMM, and these types of solutions will help to solve the heteroskedasticity problem, as they do not make strong assumptions regarding the variance in your
model the fifth and the final assumption in linear regression is the Assumption about no perfect multicolinearity this assumption states that there are no exact linear relationships between the
independent variables multicolinearity refers to the case when two or more independent variables in your linear regression model are high correlated with each other this can be a problem
because it can lead to unstable and unreliable estimate of the parameters in the model perfect multicolinearity happens when the independent variables are perfectly correlated with each other
meaning that one variable can be perfectly predicted from the other ones and this can cause the estimated coefficient in your linear regression model to be infinite or undefined and
can lead your errors to be uh entirely misleading when making a prediction using this model if perfect multicolinearity is detected it may be necessary to remove one if not more
problematic variables such that you will avoid having correlated variables in your model and even if the perfect multicolinearity is not present multicolinearity at a high level can
still be a problem if the correlations between the independent variables are high; in this case the estimates of the parameters may be imprecise, the model may be entirely misleading, and it will result in less reliable predictions. To test the multicollinearity assumption you have different options. The first way you can do that is by using a formal statistical and econometric multicollinearity test, which will help you to identify which variables cause a problem and whether you have perfect multicollinearity in your linear regression model. You can also plot a heat map based on the correlation matrix of your features; you will then have the correlations per pair of independent variables plotted as part of your heat map, and you can identify all the pairs of features that are highly correlated with each other. Those are problematic features, one of which should be removed from your model, and in this way, by showing the heat map, you can also show your stakeholders why you have removed certain variables from your model, whereas explaining a formal econometric test is much more complex, because it involves more advanced econometrics and linear regression theory.
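Here is a minimal sketch of such a correlation heat map using pandas and matplotlib; the small hypothetical DataFrame with deliberately correlated columns is made up purely for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical features: x2 is almost a copy of x1, so that pair is highly correlated
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.1, size=n),  # nearly collinear with x1
    "x3": rng.normal(size=n),                          # independent feature
})

corr = df.corr()
print(corr.round(2))

# Heat map of the correlation matrix
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heat map of the features")
plt.show()
```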
So if you're wondering how you can perform this formal test, and you want to prepare for questions related to perfect multicollinearity as well as how to solve the perfect multicollinearity problem in your linear regression model, then head towards the interview preparation guide included in this part of the course, in order to answer such questions and also to see the 30 most popular interview questions you can expect from this section. Now let's look into an example of linear regression in order to see how
all those pieces of the puzzle come together. Let's say we have collected data on class size and test scores for a sample of students, and we want to model the linear relationship between the class size and the test score using a linear regression model. As we have one independent variable, we are dealing with a simple linear regression, and the model equation would be as follows: test score = beta_0 + beta_1 * class size + epsilon. Here the class size is the single independent variable that we have in our model, the test score is the dependent variable, beta_0 is the intercept or the constant, and beta_1 is the coefficient of interest, as it's the coefficient corresponding to our independent variable, and it will help us to understand the impact of a unit change in the class size on the test score. Finally, we include in our model the error term, to account for the mistakes that we are definitely going to make when estimating the dependent variable, the test score. The goal is to estimate the coefficients beta_0 and beta_1 from the data and use the estimated model to predict the test score based on a class size. Once we have the estimates, we can interpret them as follows. The y-intercept, beta_0, represents the expected test score when the class size is zero; it represents the base score that a student would have obtained if the class size had been zero. The coefficient for the class size, beta_1, represents the change in the test score associated with a one-unit change in the class size: a positive coefficient would imply that a one-unit change in the class size increases the test score, whereas a negative coefficient would imply that a one-unit change in the class size decreases the test score, correspondingly. We can then use this model, with the OLS estimates, to predict the test score for any given class size. So let's go ahead and implement
that in Python if you're wondering how this can be done then head towards the resources section as well as the part of the Python for data science where you can learn more about how to work with
pandas DataFrames, how to import the data, as well as how to fit a linear regression model. The problem is as follows: we have collected data on the class size, and we have this independent variable, so, as you can see here, we have the students_data and then we have the class size, which is our feature, and we want to estimate the y, which is the test score. Here is a sample of code that will fit a linear regression model. We are keeping everything very simple: we are not splitting our data into training and test sets and then fitting the model on the training data and making predictions on the test set; we just want to see how we can interpret the coefficients, so we keep everything very simple.
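A minimal sketch of what such code could look like is below; the file name students_data.csv and the column names class_size and test_score are assumptions, since the actual notebook is not reproduced here, and the printed numbers will depend on the data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the hypothetical students data (assumed columns: class_size, test_score)
students_data = pd.read_csv("students_data.csv")

X = students_data[["class_size"]]   # single feature, kept 2-dimensional for scikit-learn
y = students_data["test_score"]

# Fit the simple linear regression on all the data (no train/test split, interpretation only)
model = LinearRegression().fit(X, y)

print("Intercept (beta_0):  ", model.intercept_)
print("Coefficient (beta_1):", model.coef_[0])
```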
You can see here that we are getting an intercept equal to 63.7, and the coefficient corresponding to our single independent variable, class size, is equal to minus 0.14. What this means is that each increase of the class size by one unit will result in a decrease of the test score by 0.14, so there is a negative relationship between the two. Now the next question is whether there is statistical significance, whether
the coefficient is actually significant and whether the class size actually has a statistically significant impact on the dependent variable. But all those are things that we have discussed as part of the fundamentals of statistics section of this course, and we are also going to look into a linear regression example when we discuss hypothesis testing. So I would highly suggest you stop here to revisit the fundamentals of statistics section of this course to refresh your memory in terms of linear regression, and then also check the hypothesis testing section of the course in order to look into a specific example of linear regression where we discuss the standard errors, how you can evaluate your OLS estimation results, how you can use the Student's t-test, the p-value, and the confidence intervals, and how you can estimate them. In this way you will learn for now only the theory related to the coefficients, and then you can add on top of this theory once you have learned the other sections and the
other topics in this course let's finally discuss the advantages and the disadvantages of the linear regression model so some of the advantages of the linear regression model are the following the linear regression is
relatively simple and easy to understand and to implement linear regression models are well suited for understanding the relationship between a single independent variable and a dependent
variable. Also, linear regression can handle multiple independent variables and can estimate the unique relationship between each independent variable and the dependent variable. A linear regression model can also be extended to handle more complex forms, such as polynomials and interaction terms, allowing for more flexibility in modeling the data. Also, a linear regression model can be easily regularized to prevent overfitting, which is a common problem in modeling, as we saw in the beginning of this section: you can use, for instance, Ridge regression, which is an extension of linear regression, or Lasso regression, which is also an extension of the linear regression model. And finally, linear regression models are widely supported by software packages and libraries, making them easy to implement and to analyze. Some of the disadvantages of linear regression are the following. Linear regression models make a lot of strong assumptions, for instance linearity between the independent variables and the dependent variable, while the true relationship can actually be nonlinear; the model will then not be able to capture the complexity of the data, the nonlinearity, and the predictions will be inaccurate. Therefore it's really important to have data with a linear relationship for linear regression to work. Linear regression also assumes that the error terms are normally distributed, homoscedastic, and independent across observations; violations of these strong assumptions will lead to biased and inefficient estimates. Linear regression is also sensitive to outliers, which can have a disproportionate effect on the estimates of the regression coefficients. Linear regression does not easily handle categorical independent variables, which often require additional data preparation or the use of indicator variables or encodings. Finally, linear regression also assumes that the independent variables are exogenous and not affected by the error terms; if this assumption is
violated then the result of the model may be misleading imagine you have a friend Alex who collects stamps every month Alex buys a certain number of stamps and
you notice that the amount Alex spends seems to depend on the number of stamps bought now you want to create a little tool that can predict how much Alex will
spend next month based on the number of stamps bought this is where linear regression comes into play in technical terms we're trying to predict the
dependent variable amount spent based on the independent variable number of stamps bought below is some simple
Python code using scikit-learn to perform linear regression on a created data set. The linear regression analysis was carried out through a structured process
using Python's numpy and matplotlib libraries as well as scikit-learn's LinearRegression class. Initially the libraries were imported to facilitate numerical computations and data visualization; this foundational step ensures that all necessary functions and methods are available for executing the analysis. Subsequently the data was organized, with stamps bought serving as the independent variable and amount spent as the dependent variable. The stamps array was reshaped into a two-dimensional array using reshape; this modification was necessary because the scikit-learn library
requires input features in a specific format once the data was appropriately formatted a linear regression model was instantiated from the linear regression
class and then trained with the fit method by passing the reshaped stamps bought and amount spent arrays to this method the model learned the relationship between the number of
stamps bought and the corresponding amount spent. The trained model was then used to predict the expenditure for a hypothetical future scenario where 10
stamps are bought this was accomplished using the predict method which was called with an input array representing 10 stamps the model used its learned
parameters to estimate the outcome based on this input for visualization the original data points and the regression line were
plotted using matplotlib: the scatter function was used to plot the data points in blue, illustrating the actual amount spent for different quantities of
stamps bought the regression line plotted in red using the plot function demonstrated the predicted relationship as learned by the model finally the
prediction for 10 stamps was displayed using a print statement this demonstrated how the model's predictions can be interpreted and used providing a
specific numerical estimate for the amount likely to be spent on stamps in the given scenario this complete process from data preparation through training
to prediction showcases how linear regression can be applied to derive insights from Real World data effectively let's go through some of the
concepts and variables we are using in this chapter. Sample data: the term stamps bought refers to the number of stamps Alex bought each month, and amount spent represents the corresponding money spent. Creating and training the model: we use LinearRegression from scikit-learn to create and train our model using fit. Predictions: the trained model is then used to predict the amount Alex will spend for a given number of stamps; in the code we predict the amount for 10 stamps. Plotting: we plot the original data points in blue and the predicted line in red to visually understand our model's prediction capability. Displaying the prediction: finally, we print out the predicted spending for a specific number of stamps, 10 in this case.
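Below is a minimal sketch of the stamps code as described; the stamps_bought and amount_spent values are made up, so the exact prediction will differ from the course's notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Number of stamps bought each month (reshaped to 2D, as scikit-learn expects)
stamps_bought = np.array([1, 3, 5, 7, 9, 11, 13, 15]).reshape(-1, 1)
# Corresponding amount spent each month
amount_spent = np.array([2.5, 7.0, 12.0, 17.5, 22.0, 27.5, 32.0, 37.5])

model = LinearRegression()
model.fit(stamps_bought, amount_spent)

# Predict the spend for a month in which Alex buys 10 stamps
prediction = model.predict(np.array([[10]]))

# Blue data points, red regression line
plt.scatter(stamps_bought, amount_spent, color="blue", label="Actual spend")
plt.plot(stamps_bought, model.predict(stamps_bought), color="red", label="Regression line")
plt.xlabel("Stamps bought")
plt.ylabel("Amount spent")
plt.legend()
plt.show()

print(f"Predicted spend for 10 stamps: {prediction[0]:.2f}")
```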
This graph illustrates the outcome of a simple linear regression analysis, which seeks to capture the relationship between two variables: the number of stamps purchased and the total expenditure incurred. The red line depicted in the graph is known as the regression line, representing the best fit through the plotted green data points. The slope of this regression line is particularly telling: it quantifies the increase in total cost associated
with each additional stamp purchased such insights are invaluable for budgeting purposes or for forecasting future expenses based on past purchasing Behavior each of the green data points
on the graph corresponds to an actual purchase event where both the quantity of stamps bought and the precise amount spent are known the close clustering of
these points around the regression line strongly supports the presence of a linear relationship between the number of stamps purchased and the total
expenditure this alignment suggests that the model has effectively captured the underlying Trend in the data offering a
reliable basis for predictions employing simple linear regression in scenarios like this provides a clear and quantifiable understanding of how one
variable affects another. In the realm of data analysis, this method serves as a powerful tool, enabling analysts to draw
significant conclusions and make informed decisions based on the observed relationships between variables the Simplicity yet robustness of linear
regression make it an indispensable technique in the toolkit of anyone seeking to extrapolate future behavior from historical data now let's go
through some other examples where linear regression is used one real estate pricing linear regression is widely used in real estate to predict house prices
based on various features such as square footage number of bedrooms number of bathrooms age of the house and location
for instance a regression model can help determine how much an additional bathroom adds to The house's value for example a real estate company uses
linear regression to understand how the proximity to City centers affects the market price of properties they find that for each mile closer to the city
center house prices increase by an average of $10,000 two credit scoring financial institutions often employ linear regression to predict the credit
worthiness of individuals based on historical financial data including income levels existing debts and past repayment histories for example a bank may use
linear regression to determine how an applicant's credit score changes with variations in their debt to income ratio this helps in deciding whether to approve a loan
application three supply chain costs linear regression can analyze and predict costs associated with different components of the supply chain such as
Transportation labor and materials based on factors like distance fuel prices and labor rates for example a manufacturing company uses linear regression to
predict Logistics costs as a function of fuel price fluctuations and shipping distance to better manage their budget
and set product prices four Health Care in healthcare linear regression could be used to predict patient outcomes based on treatment methods dosage levels and
patient demographics for example a medical research team applies linear regression to study the relationship between dosage of a new
drug and patient recovery rate the findings indicate that increasing the drug dose by one unit enhances recovery rates by
5% five academic performance educational institutions might use linear regression to predict student Performance Based on study habits attendance rates and
previous grades grades for example a university conducts a study using linear regression to understand how the number of hours spent
studying per week impacts students GPA the analysis reveals that every additional hour of study per week correlates with an increase of 0.05 in
GPA. Six, energy consumption: energy companies can use linear regression to forecast consumption levels based on factors like temperature, time of the year, and economic activity. For example, an energy utility uses linear regression to model how electricity usage increases with rising temperatures during summer months; the model helps the company prepare for peak demand. We will now demonstrate how to use logistic regression to predict Jenny's book preferences. We have a data set where
each entry records the number of pages in the books Jenny read and whether she liked them step one import libraries we start by importing numpy to manage our
data, matplotlib for visualization, and scikit-learn's LogisticRegression and accuracy_score for building and evaluating our model. Step two, prepare the data: we begin by setting up our data set. Pages holds the number of pages in each book as the independent variable, and likes records Jenny's reaction as a dependent binary variable (one for like and zero for dislike). It's essential to reshape pages into a 2D array, since scikit-learn models expect features in this format; this preparation ensures our model can interpret the data
correctly step three create and train the model we defined a logistic regression model and train it with our data set using the fit method this
method optimizes the model parameters to best explain the relationship between the number of pages and Jenny's preferences training involves finding the statistical parameters that minimize
prediction error effectively learning from the data step four make predictions after training we use the predict method
to estimate Jenny's reaction to a book with 260 pages; this part of the code shows how the trained model applies what it has learned to new data. Step five,
plotting we visually present the data and model predictions we plot the data points in green displaying actual likes and dislikes the model's predicted
probabilities are shown in red giving a visual representation of the likelihood of liking books of various page lengths specific markers highlight the query
point at 260 Pages helping to contextualize the prediction visually step six displaying prediction finally we display the prediction result
directly from our model using a print statement this tells us if Jenny is predicted to like or dislike the book based on the number of pages illustrating the practical application
of our logistic regression model. Conclusion: this process illustrates how logistic regression can be used to make predictions based on historical data, providing insights that can be applied in various fields beyond just reading preferences.
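Here is a minimal sketch of the six steps just described; the page counts and like/dislike labels are made up, so the fitted curve is only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: prepare the data (pages reshaped to 2D, likes as a binary target)
pages = np.array([100, 150, 180, 220, 260, 300, 350, 400, 450, 500]).reshape(-1, 1)
likes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 1 = like, 0 = dislike

# Step 3: create and train the model
model = LogisticRegression()
model.fit(pages, likes)
print("Training accuracy:", accuracy_score(likes, model.predict(pages)))

# Step 4: predict Jenny's reaction to a 260-page book
new_book = np.array([[260]])
prediction = model.predict(new_book)[0]

# Step 5: plot data points (green) and the predicted like-probability curve (red)
grid = np.linspace(pages.min(), pages.max(), 200).reshape(-1, 1)
plt.scatter(pages, likes, color="green", label="Actual reactions")
plt.plot(grid, model.predict_proba(grid)[:, 1], color="red", label="P(like)")
plt.scatter(260, model.predict_proba(new_book)[0, 1], marker="x", s=100, label="Query: 260 pages")
plt.xlabel("Number of pages")
plt.ylabel("Probability of liking")
plt.legend()
plt.show()

# Step 6: display the prediction
print("Jenny is predicted to", "like" if prediction == 1 else "dislike", "this book.")
```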
Now let's go over the results produced by our code. The slope of the red line, not a linear change: the slope of the red line is positive, indicating an upward curve as we move along the x-axis representing the number of pages. This is crucial because it means the relationship between page count and Jenny's liking isn't a simple straight-line increase; in other words, each additional page doesn't increase
her enjoyment by the same amount. The sigmoid's effect: the s-shaped curve typical of logistic regression means that the change in probability is more
pronounced for certain ranges of page length there might be zones where adding more pages drastically increases the likelihood of liking the book in
contrast other zones might show very little increase in probability despite many more pages this is different from linear regression where the slope remains constant and the change is
directly proportional. Reasoning behind the slope: this curve's shape could suggest some
underlying patterns in Jenny's preferences short is not sweet very short books might not be her cup of tea as the stories Could Feel
underdeveloped The Sweet Spot there might be a middle range of page counts where the probability jumps significantly indicating her favorite
type of book length. Too long, oh too much: perhaps extremely long books are thought to have a diminishing return on her enjoyment. Green dots, model accuracy:
proximity matters the fact that the green dots representing actual books Jenny has read are clustered tightly around the red line is a very good sign it signifies that the model's
predictions closely align with Jenny's real World preferences data validation this clustering demonstrates the model has successfully picked up on the
underlying pattern between page number and Jenny's like dislike reactions consequently we can have more confidence in its predictions for new
books exceptions if some green dots were far away from the line that would be a cause for concern it would mean the model is consistently mispredicting in those regions and might need
refinement. The threshold line at 0.5, decision time: the line at 0.5 is where we translate the continuous
probability values into the Practical like or dislike recommendations for Jenny not set in stone while 0.5 is a common threshold it's not mandatory
depending on how much we prioritize avoiding false positives like predictions that turn out to be dislikes or false negatives dislike predictions when Jenny would have actually enjoyed
the book, we might move this threshold higher or lower. Customizing for Jenny: if Jenny indicates she generally likes to try
books even if she's not super sure we might lower the threshold this gives her more recommendations even if the certainty of her liking them is slightly
lower this logistic regression model reveals that Jenny's book preferences are influenced by page count in a nonlinear way and it has successfully
learned the underlying pattern to provide more informed recommendations let's explore in what other sections logistic regression is used we'll explore how logistic
regression can be used to predict customer churn for a subscription-based company churn or the act of customers canceling their subscriptions poses a
significant challenge for such companies. Logistic regression, with its binary outcome prediction capabilities, is an ideal choice for this scenario. The problem at
hand is straightforward the company wants to proactively identify customers who are at high risk of cancelling their subscriptions also known as churning
logistic regression is particularly suitable for this task because it deals with binary outcomes where a customer either churns one or continues their
subscription how logistic regression is used data Gathering the company starts by collecting historical data on various aspects of customer behavior and
demographics this includes usage patterns such as frequency and feature engagement support interactions such as tickets opened and types of complaints
plan types, distinguishing between basic and premium subscriptions; demographic information like age, location, etc. Model training: the collected data is divided
into training and testing sets a logistic regression model is then trained using the training data during training the model learns how different factors such as usage patterns and
demographics relate to the probability of churn scoring new customers once the model is trained it can be used to score new customers each customer's data is
fed into the trained model, which generates a probability score between zero and one indicating their likelihood of churning. Proactive action: customers with high churn probabilities are identified and receive targeted attention; interventions may include offering special deals, personalized outreach, or addressing common pain points known to lead to churn.
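The transcript describes this churn workflow only conceptually; the sketch below is a hypothetical illustration of the scoring step with scikit-learn, and the column names and data are invented:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical customer data with a churned label (1 = cancelled)
data = pd.DataFrame({
    "monthly_logins":  [30, 2, 25, 1, 18, 4, 22, 3],
    "support_tickets": [0, 5, 1, 6, 2, 4, 0, 7],
    "premium_plan":    [1, 0, 1, 0, 1, 0, 1, 0],
    "churned":         [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data.drop(columns="churned")
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Score "new" customers: probability of churn between 0 and 1
churn_prob = model.predict_proba(X_test)[:, 1]
at_risk = X_test[churn_prob > 0.5]  # flag high-risk customers for proactive outreach
print("Churn probabilities:", churn_prob)
print("Customers flagged for intervention:")
print(at_risk)
```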
Imagine Sarah, who loves cooking and trying various fruits. She's noticed that the fruits she enjoys tend to fall into certain size and sweetness ranges. Could she predict whether she'll like a new fruit based on these characteristics? Linear discriminant analysis (LDA) is the perfect tool to help her out. LDA is a powerful technique for classifying things based on several features; think about how facial
recognition software can identify individuals in Sarah's case we can use LDA to find patterns in the size and sweetness of fruits she's liked or
disliked in the past LDA will search for a way to create a sort of like versus dislike boundary based on these features
imagine each fruit as a point on a graph where the x-axis is size and the y-axis is sweetness. LDA tries to draw the best possible line to separate the liked and disliked fruits. Of course some overlap might happen; maybe there are some small fruits she surprisingly loves. LDA
aims to find the line that does the best possible job of separating the groups overall LDA is great when you have multiple features to consider at once
Sarah could try looking at size or sweetness alone but LDA lets her combine this information for potentially better predictions so let's go through the code
now first we need to ensure that we have all the necessary tools at our disposal we'll be using python for our coding tasks and will import the powerful
libraries numpy for numerical operations and matplotlib.pyplot for data visualization. Additionally we'll utilize scikit-learn's LinearDiscriminantAnalysis module for implementing LDA. Now let's create a sample data set: we'll curate a
data set consisting of eight fruits each characterized by two features size and sweetness these features will serve as our inputs while the corresponding
labels will indicate whether each fruit is liked one or disliked zero this data set will form the foundation for training our predictive model with our
data set ready it's time to build our predictive model using linear discriminant analysis we'll create an instance of an LDA model object and
proceed to train it using the features, size and sweetness, and the corresponding labels from our sample data set. We'll select a new fruit with a size of 2.5 and a sweetness of 6 as our test case; using these feature values the model will predict whether Sarah would
like this fruit or not this prediction will provide valuable insights into the model's decision-making process and its ability to generalize to unseen
instances to visualize the results of our analysis we'll plot the sample data set on a scatter plot in this plot fruits that are liked by Sarah will be
represented by Blue markers while disliked fruits will be marked in yellow additionally we'll highlight the new fruit being predicted with a distinct
red X marker, providing a clear visual representation of its classification. After making the prediction we'll display the outcome: we'll print a statement indicating whether Sarah is likely to enjoy the new fruit based on the model's classification decision. This insight will offer valuable information on how the model interprets the given features and arrives at its prediction.
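Here is a minimal sketch of the fruit example as described; the eight size/sweetness pairs and the like/dislike labels are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Each fruit: [size, sweetness]; label 1 = liked, 0 = disliked
fruits = np.array([[1.0, 2.0], [1.5, 3.0], [2.0, 7.0], [2.5, 8.0],
                   [3.0, 3.5], [3.5, 9.0], [4.0, 4.0], [4.5, 5.0]])
likes = np.array([0, 0, 1, 1, 0, 1, 0, 0])

lda = LinearDiscriminantAnalysis()
lda.fit(fruits, likes)

# Predict whether Sarah would like a new fruit with size 2.5 and sweetness 6
new_fruit = np.array([[2.5, 6.0]])
prediction = lda.predict(new_fruit)[0]

# Liked fruits in blue, disliked in yellow, the new fruit as a red X
plt.scatter(fruits[likes == 1, 0], fruits[likes == 1, 1], color="blue", label="Liked")
plt.scatter(fruits[likes == 0, 0], fruits[likes == 0, 1], color="yellow", label="Disliked")
plt.scatter(2.5, 6.0, color="red", marker="x", s=100, label="New fruit")
plt.xlabel("Size")
plt.ylabel("Sweetness")
plt.legend()
plt.show()

print("Sarah is predicted to", "like" if prediction == 1 else "dislike", "the new fruit.")
```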
Here are several considerations we should make about the resulting plot, which shows fruit enjoyment based on size and sweetness: the x-axis represents size, the y-axis represents sweetness, and the data points are colored in orange (like) and blue (dislike). Here are some observations you can make. Class separation: there appears to be some
separation between the orange like and blue dislike data points this suggests that size and sweetness may be useful
factors in predicting whether Sarah will enjoy a particular fruit. Overlap between classes: there's also some overlap between the two classes,
particularly in the region of larger and sweeter fruits this indicates that size and sweetness alone may not perfectly predict Sarah's preferences other
factors not considered here might also influence her enjoyment here are additional considerations we should make sample size the number of data points is
not visible in the image a larger data set might provide a clearer picture of the class separation and improve the model's generalizability nonlinear relationships
the graph assumes a linear relationship between size sweetness and enjoyment if the true relationship is more complex a model like LDA might not capture it
perfectly overall the data suggests a potential link between fruit size sweetness and Sarah's preferences LDA is a promising technique to explore for
building a classif ification model but it's important to consider potential limitations and the need for more data
if available where decision trees take root decision trees with their intuitive branching structure find use across
various Industries and problem domains let's dive into some key areas where they prove particularly effective business and finance customer
segmentation analyze customer data to identify group groups with similar behaviors or purchasing patterns for targeted marketing
strategies fraud detection identify patterns and transactions that may indicate fraudulent activity credit risk assessment evaluate the credit
worthiness of loan applicants based on their financial history and other factors operations management optimize decision making in areas like inventory
management Logistics and resource allocation Health Care medical diagnosis support assist in diagnosing diseases by guiding clinicians through a series of
questions and tests based on patient symptoms and medical history treatment planning help determine the most suitable treatment options based on
patient characteristics and disease severity disease risk prediction identify individuals at high risk of developing certain health conditions
based on factors like lifestyle family history and medical data science and engineering fault diagnosis isolate the cause of malfunctions or failures in
complex systems by analyzing sensor data and system logs classification in biology categorize species based on their characteristics or DNA sequences
remote sensing analyze satellite imagery to classify land cover types or identify areas affected by natural disasters
customer service troubleshooting guides create interactive decision trees to guide customers through troubleshooting steps for products or Services chatbots
power automated chatbots that can categorize customer inquiries and provide appropriate responses reducing weight times and improving support
efficiency other applications game playing design AI OPP components in games that can make strategic decision logistic regression is a
popular approach for performing classification when there are two classes but when the classes are well separated or the number of classes
exceeds two, the parameter estimates for the logistic regression model are surprisingly unstable. Unlike logistic regression, LDA does not suffer from this instability problem when the number of classes is more than two. If n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is again more stable than the logistic regression model.
The bias-variance tradeoff: the key challenge in machine learning lies in finding the right balance between bias and variance;
generally reducing bias increases variance and vice versa complex models tend towards low bias and high variance while simpler models tend towards the
opposite the ideal model finds a sweet spot between underfitting High bias and overfitting high variance a balance that
depends heavily on the nature of your specific problem and the trade-off you're willing to make between flexibility and
stability. Naive Bayes versus logistic regression: naive Bayes is known for high bias; it assumes features are independent, which often isn't true. However, its simplicity makes it less prone to overfitting (low variance) and computationally fast to train. On the other hand, logistic regression is more flexible (low bias) and can model complex decision boundaries; this comes at the risk of overfitting (high variance), especially with many features or little regularization. When to choose which: if you prioritize speed and simplicity, naive Bayes might be a good starting point; when your data relationships are unlikely to be simple and independent, logistic regression's flexibility becomes valuable. However, if you choose logistic regression you need to actively manage overfitting, potentially using techniques like regularization. Step one, importing libraries: we import the libraries needed for plotting purposes and scikit-learn's logistic regression and linear
discriminant analysis for classification tasks step two generating synthetic data next we Define a function called
generate data to create synthetic data for our classification experiment the function generates data points from
three initial classes, each centered at (0, 0), (3, 0), and (6, 0); for each class, random data points are generated around the respective center using a Gaussian distribution. Step three, data generation and model fitting: we generate a data set with 40 samples per class using the
generate data function then we fit logistic regression and LDA models to this data set step four analyzing the
results after fitting the models we print the coefficients for both logistic regression and LDA for the initial three classes these coefficients provide
insights into the decision boundaries learned by each model step five adding an extra class we then introduce a new class to our data set by generating
additional data points centered at (9, 0); we append this new class to our data set and update the corresponding labels accordingly. Step six, refitting the
models following the addition of the new class we refit both logistic regression and LDA models to the updated data set
with four classes step seven analyzing the results for four classes finally we print the coefficients for both models after fitting them to the data set with
four classes. This allows us to observe how the decision boundaries change with the inclusion of the new class.
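Below is a minimal sketch of this experiment; the generate_data helper is a plausible reconstruction, and the spread of the Gaussian clusters is an assumption since the transcript does not state it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def generate_data(centers, n_per_class=40, spread=1.0, seed=0):
    """Draw n_per_class Gaussian points around each 2D center and label them by class."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal(loc=c, scale=spread, size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

# Three initial classes centered at (0, 0), (3, 0) and (6, 0)
X, y = generate_data([(0, 0), (3, 0), (6, 0)])
logreg = LogisticRegression(max_iter=1000).fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)
print("Logistic regression coefficients (3 classes):\n", logreg.coef_)
print("LDA coefficients (3 classes):\n", lda.coef_)

# Add a fourth class centered at (9, 0) and refit both models
X4, y4 = generate_data([(0, 0), (3, 0), (6, 0), (9, 0)])
logreg4 = LogisticRegression(max_iter=1000).fit(X4, y4)
lda4 = LinearDiscriminantAnalysis().fit(X4, y4)
print("Logistic regression coefficients (4 classes):\n", logreg4.coef_)
print("LDA coefficients (4 classes):\n", lda4.coef_)
```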
Tom is a movie Enthusiast who watches films across different genres and Records his feedback whether he liked them or not he has noticed that whether
he likes a film might depend on two aspects the movie's length and its genre can we predict whether Tom will like a movie based on these two characteristics
using naive Bayes? Technically, we want to predict a binary outcome (like or dislike) based on the independent variables movie length and genre. Let's delve into the
code's functionality. The initial step involves importing essential libraries necessary for the code's operations: numpy facilitates numerical operations while matplotlib aids in visualization. Additionally, the Gaussian naive Bayes implementation from scikit-learn is
imported to utilize its functionalities following the Imports the script defines sample data representing movie features and corresponding likes each movie is
characterized by its length in minutes and a genre code notably genres are numerically encoded with zero signifying action one representing romance and so
forth. This structured representation prepares the data for subsequent analysis. Moving forward, a Gaussian naive Bayes model is instantiated; this model serves
as the predictive engine leveraging its inherent assumptions about feature Independence to classify movie likes subsequently the model is trained using
the provided movie features and their Associated likes once the model is trained it is ready to make predictions a new movie is defined with its length
and genre code; leveraging the trained naive Bayes model, predictions are made regarding whether Tom would like this movie based on its features. This step
demonstrates the practical application of the trained model in making real world predictions the script proceeds to visualize the data set through a scatter
plot each existing movie is plotted based on its length and genre code with liked movies depicted in one color and disliked movies in another Additionally
the new movie is plotted with a distinct marker, providing visual context to aid interpretation.
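Here is a minimal sketch of the movie example as described; the movie lengths, genre codes, and likes are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Each movie: [length in minutes, genre code (0 = action, 1 = romance, ...)]
movies_features = np.array([[120, 0], [150, 0], [90, 1], [110, 1],
                            [140, 0], [95, 1], [180, 0], [100, 1]])
movies_likes = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # 1 = liked, 0 = disliked

model = GaussianNB()
model.fit(movies_features, movies_likes)

# Predict Tom's reaction to a new 100-minute movie with genre code 1
new_movie = np.array([[100, 1]])
prediction = model.predict(new_movie)[0]

liked = movies_likes == 1
plt.scatter(movies_features[liked, 0], movies_features[liked, 1], marker="o", label="Liked")
plt.scatter(movies_features[~liked, 0], movies_features[~liked, 1], marker="o", label="Disliked")
plt.scatter(100, 1, color="red", marker="x", s=100, label="New movie")
plt.xlabel("Movie length (minutes)")
plt.ylabel("Genre code")
plt.legend()
plt.show()

print("Tom will like the new movie." if prediction == 1 else "Tom will not like the new movie.")
```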
Here are a few observations and conclusions we can make. Clear separation: we can see that there is a clear separation between the like and dislike clusters; this indicates that the naive Bayes model has successfully found a distinct boundary based on the movie length and genre features, and the strong separation suggests these features are powerful predictors of preference. Prediction confidence: since the new movie's red X falls squarely within a cluster, we can have a high degree of confidence in the model's prediction.
further exploration even with good separation movies lying close to the boundary deserve closer examination they may offer insights into less predictable
cases and help refine the model further overlapping clusters if the like and dislike groups overlap significantly it
suggests that movie length and genre alone May not be enough to accurately predict preferences in all situations model limitations in cases of
overlap, naive Bayes might be too simplistic; exploring more sophisticated models like decision trees or support vector machines could improve accuracy. Need more data: a larger and more diverse data set, or the inclusion of additional features (e.g. star ratings, director), could help uncover clearer patterns in situations where the current features aren't sufficient. Genre significance: if distinct clusters form based on genre codes, it means the naive Bayes model recognizes genre as a strong indicator of preference. Personalized recommendations: this genre-based insight can be used to tailor recommendations; if a user consistently enjoys a particular genre, movies within that genre should be prioritized even if their length deviates from the usual pattern. Caveats: it's important to remember that genre preferences are subjective and a given genre might naturally have mixed reactions.
A high-bias model is like trying to fit a curved data set with a straight line; in contrast, a low-bias model is more flexible, allowing it to potentially match intricate trends in the data. The next piece of code serves to elucidate the intricate concepts of
bias, variance, and the bias-variance tradeoff by juxtaposing naive Bayes and logistic regression classifiers on a synthetic data set. Here's a cohesive explanation of the code's functionality. To begin, the script generates a synthetic data set comprising two classes arranged in circular patterns; the make_circles function from scikit-learn is employed for this purpose, creating data that challenges the assumptions of naive Bayes due to its nonlinear separability. Following data generation, the script proceeds to train two distinct classifiers. Firstly, a Gaussian naive Bayes model is trained on the data set; this choice aligns with the high-bias, low-variance characteristics of naive Bayes, given its simplicity and assumption of feature independence. Secondly, a logistic regression model is trained with regularization; regularization is introduced to combat overfitting, a concern for logistic regression due to its flexibility potentially leading to higher variance. Once the models are trained, the script visualizes their decision boundaries through plots; showcasing the decision boundaries of both models, it becomes evident that naive Bayes delineates a simpler linear boundary while logistic regression captures the data set's nonlinearity. Furthermore, the script calculates the accuracy of each model on a held-out test set, facilitating a comparative analysis of their performance. A deeper understanding of the bias-variance trade-off emerges from this comparison: naive Bayes exhibits higher bias due to its simplifying assumptions, resulting in a less complex decision boundary; on the contrary, logistic regression's flexibility enables it to learn the nonlinear pattern with lower bias, albeit at the risk of overfitting. The importance of contextual considerations becomes apparent: while logistic regression often boasts lower bias, naive Bayes' simplicity and computational efficiency may render it preferable in certain contexts. Model selection hinges on factors such as data set characteristics, computational constraints, and the relative significance of interpretability versus raw predictive performance. It's crucial to acknowledge the limitations and caveats of the presented example: the observed results may vary on different data sets or with alternative hyperparameter configurations; additionally, while decision boundary visualization aids comprehension, accuracy metrics are equally essential for comprehensive model evaluation. Finally, the script underscores the significance of continuous learning in machine learning: it advocates for a methodical approach involving experimentation with diverse models, rigorous evaluation, and judicious selection based on problem-specific requirements and performance
metrics.
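Here is a minimal sketch of the comparison being described; the make_circles parameters (sample size, noise, factor) are assumptions, since the transcript does not state them, and the decision-boundary plots are omitted for brevity:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Two classes arranged in circular patterns: a nonlinearly separable data set
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

nb = GaussianNB().fit(X_train, y_train)                   # high bias, low variance
logreg = LogisticRegression(C=1.0).fit(X_train, y_train)  # more flexible, regularized

print("Naive Bayes test accuracy:        ", accuracy_score(y_test, nb.predict(X_test)))
print("Logistic regression test accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```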
Alex is intrigued by the relationship between the number of hours studied and the scores obtained by students. Alex collected data from his peers about their study hours and respective test scores; he wonders, can we predict a student's score based on the number of hours they study? Let's leverage decision tree regression to uncover this. Technically, we're predicting a continuous outcome (test score) based on an independent variable (study hours). Let's dissect the code to understand its functionality. Importing libraries: we begin by importing the necessary libraries; numpy assists in numerical operations, matplotlib facilitates visualization, and DecisionTreeRegressor from scikit-learn is utilized
for decision tree regression sample data definition following the library Imports the code defines sample data this data
includes the number of hours studied and the corresponding test scores achieved; each entry pairs a study hour with its corresponding test score, forming the data set for training the
regression model creating and training the model a decision tree regression model is instantiated with a maximum depth set to
three this parameter controls the maximum number of levels within the decision tree subsequently the model is trained using the provided study hours
and their corresponding test score prediction after training the model is capable of making predictions an example
Study Hour 5.5 hours is chosen and the model predicts the test score corresponding to this input based on its
training. Plotting the decision tree: the code then generates a visualization of the decision tree regression model; this visualization elucidates the
decision-making process of the model utilizing the provided features study hours plotting study hours versus test scores another plot is created to
illustrate the relationship between study hours and test scores this scatter plot displays the actual data points while the regression line portrays the
predictions made by the decision tree regression model additionally the predicted test score for the new study hour is highlighted
Displaying the prediction: finally, the code prints out the predicted test score for the specified study hours. Now here are some key features. Sample data: study hours contains the hours studied and test scores contains the corresponding test scores. Creating and training the model: we create a decision tree regressor with a specified maximum depth to prevent overfitting and train it with fit using our data. Plotting the decision tree: plot_tree helps visualize the decision-making process of the model, representing splits based on study hours. Prediction and plotting: we predict the test score for a new study-hour value, 5.5 in this example, and visualize the original data points, the decision tree's predicted scores, and the new prediction.
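Here is a minimal sketch of the study-hours example as described; the hours and scores are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
test_scores = np.array([50, 55, 60, 68, 72, 80, 85, 88])

# Limit the depth to 3 to keep the tree small and reduce overfitting
model = DecisionTreeRegressor(max_depth=3)
model.fit(study_hours, test_scores)

# Predict the score for 5.5 hours of study
predicted_score = model.predict(np.array([[5.5]]))[0]

# Visualize the fitted tree
plot_tree(model, feature_names=["study_hours"], filled=True)
plt.show()

# Plot actual scores (red), the stepwise predictions (orange) and the new prediction (green X)
grid = np.linspace(1, 8, 200).reshape(-1, 1)
plt.scatter(study_hours, test_scores, color="red", label="Actual scores")
plt.plot(grid, model.predict(grid), color="orange", label="Predicted (step function)")
plt.scatter(5.5, predicted_score, color="green", marker="x", s=100, label="5.5 hours")
plt.xlabel("Study hours")
plt.ylabel("Test score")
plt.legend()
plt.show()

print(f"Predicted test score for 5.5 hours of study: {predicted_score:.1f}")
```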
Now here are a few conclusions we can make from the decision tree regressor visualization and the study hours versus test scores plot. Observations from the plot, a step function: the orange line representing the predicted scores demonstrates a clear step
function a typical characteristic of decision tree regressors this indicates that the model provides constant predictions within certain ranges of
study hours, changing abruptly at specific thresholds where new rules (splits) apply. Prediction accuracy: the red dots
actual scores mostly align with the orange step function predicted scores suggesting that the decision tree model does a good job of capturing the general Trend in the data the close alignment
also indicates that the model handles the nonlinear relationship between study hours and test scores well within the constraints of its maximum depth
specific prediction the plot marks a prediction for 5.5 hours of study green X this specific prediction falls on the
step increase suggesting that additional study hours Beyond a certain threshold significantly improve the predicted test score according to the model's training
data conclusions model fit the decision tree appears well fitted to the range of data presented without signs of overfitting or
underfitting the choice of maximum depth seems appropriate balancing model complexity and generalization utility of decision tree
for educational data: decision trees are useful for educational data like study hours and test scores because they can easily model thresholds (e.g. minimum hours needed to achieve a certain score) that are intuitive for educational planning and interventions. Implications for students and educators: this model can help in setting realistic study goals based on expected outcomes; for instance, educators can advise students about the probable benefit of studying an
additional hour based on the model's predictions potential for refinement while the current model provides valuable insights further refinement
with additional features like type of study material individual student Baseline performance Etc could enhance prediction accuracy testing the model on
more diverse data sets or incorporating Ensemble methods like random forests could provide a more robust analysis and
mitigate any variance not captured by a single decision tree visualization and interpretation the stepwise visualization AIDS in understanding how
additional study hours could lead to increments in test scores which is valuable for explaining Model Behavior to non-technical
stakeholders. Where decision trees take root: decision trees, with their intuitive branching structure, find use across various industries and problem domains;
let's dive into some key areas where they prove particularly effective business and finance customer segmentation analyze customer data to
identify groups with similar behaviors or purchasing patterns for targeted marketing strategies fraud detection identify
patterns and transactions that may indicate fraudulent activity credit risk assessment evaluate the credit worthiness of loan applicants based on
their financial history and other factors operations management optimize decision making in areas like Inventory
management Logistics and resource allocation Health Care medical diagnosis support assist in diagnosing diseases by guiding clinicians through a series of
questions and tests based on patient symptoms and medical history treatment planning help determine the most suitable treatment options based on
patient characteristics and disease severity disease risk prediction identify individuals at high risk of developing certain health conditions
based on factors like lifestyle family history and medical data science and engineering fault diagnosis isolate the cause of malfunctions or failures in
complex systems by analyzing sensor data and system logs classification in biology categorize species based on their characteristics or DNA sequences
remote sensing analyze satellite imagery to classify land cover types or identify areas affected by natural disasters
Customer service: troubleshooting guides, create interactive decision trees to guide customers through troubleshooting steps for products or services; chatbots, power automated chatbots that can categorize customer inquiries and provide appropriate responses, reducing wait times and improving support
efficiency other applications game playing design AI opponents in games that can make strategic decisions based on the state of the game e-commerce
personalized product recommendations based on user browsing behavior and past purchases; Human Resources: identify key factors influencing
employee retention and make informed decisions why decision trees Thrive here decision trees excel in these scenarios
due to several factors interpretability the decision making process is transparent allowing humans to understand the reasoning behind the
model's predictions handles diverse data accommodates both numerical and categorical features nonlinear relationships can capture complex
nonlinear patterns within data versatility applicable for both classification predicting a class label and regression meet Lucy a fitness coach who
is curious about predicting her client's weight loss based on their daily calorie intake and workout duration Lucy has data from past clients but recognizes
that individual predictions might be prone to errors. Let's utilize bagging to create a more stable prediction model. Technically, we'll predict a continuous outcome (weight loss) based on two independent variables (daily calorie intake and workout duration), using bagging to reduce variance in predictions. Let's now go through the
code. Importing libraries: we begin by importing the necessary libraries; numpy is imported as np to facilitate numerical operations and matplotlib.pyplot is imported as plt for visualization purposes. Here are several key features. Clients data contains the daily calorie intake and workout duration, and weight loss contains the corresponding weight loss.
Train-test split: we split the data into training and test sets to validate the model's predictive performance. Creating and training the model: we instantiate BaggingRegressor with DecisionTreeRegressor as the base estimator and train it using fit with our training data. Prediction and evaluation: we predict weight loss for the test data, evaluating prediction quality with the mean squared error (MSE). Visualizing one of the base estimators: optionally, visualize one tree from the ensemble to understand individual decision-making processes, keeping in mind that an individual tree may not perform well but collectively they produce stable predictions.
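Here is a minimal sketch of the bagging example as described; the clients_data values are made up, so the printed MSE will not match the 0.75 quoted below (note that scikit-learn's BaggingRegressor uses a decision tree as its default base estimator):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Each client: [daily calorie intake, workout duration in minutes]
clients_data = np.array([[1800, 30], [2000, 45], [2200, 20], [1600, 60],
                         [2500, 15], [1900, 50], [2100, 35], [1700, 55]])
weight_loss = np.array([3.0, 3.5, 2.0, 4.5, 1.5, 4.0, 2.8, 4.2])  # pounds

X_train, X_test, y_train, y_test = train_test_split(
    clients_data, weight_loss, test_size=0.25, random_state=0)

# Bagging: many trees fit on bootstrap samples, predictions averaged
model = BaggingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("True weight loss:     ", y_test)
print("Predicted weight loss:", predictions)
print("Mean squared error:   ", mean_squared_error(y_test, predictions))
```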
A breakdown of the key points. True weight loss: this range, 2 to 4.5 lb, represents the actual weight loss experienced by the clients in the test set. Predicted weight loss: this range, 3.1 to 3.96 lb, represents the model's predictions for weight loss in the test set. Mean squared error (MSE), 0.75: this metric measures the average squared difference between the predicted and the true weight loss; a lower MSE generally indicates better model performance. In simpler terms, the model predicts weight loss somewhat accurately, but there are deviations between the predictions and the actual weight loss experienced by the clients; on average the model's squared error was 0.75. In our previous explorations of machine learning we've come to recognize the inherent strengths
and limitations of individual models some models are masters of Simplicity offering quick interpretable results While others thrive in handling complex
high-dimensional data sets yet all models are susceptible to the twin challenges of bias and variance this is where bagging enters as a powerful Ally
leveraging the wisdom of crowds to forge better predictive Frameworks applications of bagging regression problems imagine you're attempting to
predict housing prices within a bustling City factors like square footage location number of bedrooms and countless others collectively influence
the price a single linear regression model might struggle to capture the intricate relationships between these features bagging comes to the Rescue by
training multiple regressors like decision trees on diverse samples of the data and averaging their predictions we reduce variance and improve
accuracy. Classification quests: perhaps you're tasked with classifying customer reviews as positive or negative; a lone naive Bayes classifier might make
oversimplified assumptions about word independence resulting in subpar performance bagging empowers us to
assemble an ensemble of classifiers each member of The Ensemble casts a vote and the majority vote often yields a superior classification decision
mitigating the shortcomings of any single model. Image recognition: the vast world of image recognition presents unique
challenges with high dimensional data convolutional neural networks cnns while remarkably powerful can fall prey to
overfitting with bagging at our disposal we can create a council of independently trained cnns where each Network focuses on distinct subsets of the image data
aggregating their predictions instills robustness and can significantly improve classification results harnessing the power of diversity the Cornerstone of
bagging lies in cultivating diversity within its Ensemble by constructing models on varying bootstrapped samples of the original data set each model
develops slightly different biases this clever strategy combats bias through its Collective approach moreover the random
nature of sampling reduces variance, particularly when working with unstable algorithms like decision
trees real world examples Healthcare in medical diagnosis where Precision is Paramount bagging is widely used ensembles of models trained on patient
data often lead to enhanced accuracy in identifying diseases contributing to Better healthc Care decision making Finance Financial instit utions employ
bagging for critical tasks such as fraud detection and risk assessment aggregated models built with bagging techniques are frequently more efficient at detecting
anomalies and spotting fraudulent patterns aiding in the protection of valuable assets environmental science bagged
models are leveraged in tasks ranging from land cover classification to climate modeling the ability to create more stable and reliable predictions
from diverse data sets and models proves invaluable when tackling complex environmental challenges remember while bagging empowers us with
stronger predictive prowess it's not a Magic Bullet random forests provide an improvement over bag trees by way of a
small tweak that decorrelates the trees as in bagging we build a number of decision trees on bootstrap training samples but when building these decision
trees each time a split in a tree is considered a random sample of M predictors is chosen as split candidates from the full set of P predictors the
split is allowed to use only one of those m predictors a fresh and random sample of M predictors is taken at each
split and typically we choose MP that is the number of predictors considered at each split is approximately equal to the square root of the total number of
predictors this is also the reason why random Forest is called random the main difference between bagging and random forests is the choice of predictor
subset size M decorrelates the trees using a small value of M in building a random Forest will typically be helpful when we have a large number of
correlated predictors so if you have a problem of multicolinearity RF is a good method to fix that problem so unlike in
bagging in the case of random forest in each tree split not all P predictors are considered but only randomly selected M predictors from it this results in not
similar trees being decorrelated and due to the fact that averaging decorrelated trees results in smaller variants random Forest is more
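To make the m versus p idea concrete, here is a minimal sketch (not from the course) of how the predictor subset size is controlled in scikit-learn: max_features plays the role of m, so setting it to all features essentially recovers bagged trees, while "sqrt" gives the usual random-forest choice.

```python
# Sketch: bagged trees consider all p predictors at every split,
# while a random forest considers only a random subset of size m ~ sqrt(p).
from sklearn.ensemble import RandomForestRegressor

bagged_trees = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)    # m = p, i.e. plain bagging of trees
random_forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)  # m ≈ sqrt(p), decorrelated trees
```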
Noah is a botanist who has collected data about various plant species and their characteristics, such as leaf size and flower color. Noah is curious whether he could predict a plant species based on these features. Here we'll utilize random forest, an ensemble learning method, to help him classify plants. Technically, we aim to classify plant species based on certain predictive variables using a random
forest model let's walk through the provided code step by step importing libraries the code starts by importing
necessary libraries such as numpy for numerical operations, matplotlib for visualization, RandomForestClassifier from scikit-learn for random forest classification, train_test_split for splitting the data, and classification_report for evaluating classification performance. Data preparation: the code defines two numpy arrays, plants_features, containing the features of the plants (leaf size and flower color), and plants_species, containing the corresponding species labels. Each row in plants_features represents a plant, and
the corresponding entry in Plants species denotes its species zero or one train test split the data set is split into training and testing sets using the
train test split function this ensures that the model is trained on a portion of the data and evaluated on an unseen portion the split ratio is set to 75%
training data and 25% testing data model initialization and training a random forest classifier model model is
initialized with 10 estimators (trees) and a random state of 42 for reproducibility. The initialized model is then trained using the training data X_train and y_train. Random forests build multiple decision trees and combine their predictions to improve generalization performance. Prediction and evaluation:
the trained model is used to make predictions (y_pred) on the test data X_test; these predictions are then evaluated using the classification_report function, which generates a detailed report including precision, recall, F1 score, and support for each class. Displaying prediction and evaluation: the classification report
containing evaluation metrics is printed to the console providing insights into the model's performance visualization two visualizations are generated to gain
insights into the data and model. Scatter plot of species: this plot visualizes the distribution of plant features, leaf size and flower color, for each species;
different marker shapes and colors represent different species making it easier to distinguish between them feature importance this horizontal bar
plot visualizes the importance of each feature, leaf size and flower color, in predicting plant species; features with higher importance values contribute more to the model's decision-making process.
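As a rough sketch of the plant-classification workflow walked through above, the code might look like the following. The feature values are invented for illustration; the course's actual arrays differ.

```python
# Sketch of the random forest plant classifier described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Columns: leaf size (cm), flower color encoded as 0/1. Values are made up for illustration.
plants_features = np.array([[3.1, 0], [2.5, 0], [4.8, 1], [5.2, 1],
                            [3.4, 0], [4.9, 1], [2.8, 0], [5.0, 1]])
plants_species = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # species labels: 0 or 1

X_train, X_test, y_train, y_test = train_test_split(
    plants_features, plants_species, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("feature importances:", model.feature_importances_)   # values behind the bar chart
```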
Interpreting the scatter plot: the scatter plot shows the distribution of plant data points based on their leaf size and flower color. The different marker colors, green and red, represent the two plant species, 0 and 1. Partial separation: there appears to be some separation between the green and red data points, suggesting that leaf size and flower color might be
partially effective in distinguishing between the two plant species overlap we can also observe some overlap between the green and red
clusters particularly towards the center of the plot this overlap indicates that some plants might have similar Leaf size and flower color features regardless of
species this overlap may lead to some classification Errors By the random forest model potential challenges the overlap
between the two species in the scatter plot suggests that the model might struggle to accurately classify plants that fall in these overlapping areas. Overall, the scatter plot provides a
visual representation of the data that can be helpful in interpreting the performance of the random forest model key points from the bar chart the bar
chart depicts the feature importances for leaf size and flower color the height of the bar represents the features importance in this case the bar
for flower color is considerably higher than the one for leaf size. Interpretation: this visualization confirms what we might have inferred
from the scatter plot flower color carries more weight in the model's decision-making process for plant species classification this is likely because
the flower color data seems to have a clearer separation between the two species green and red dots in the scatter plot
overall conclusions the random forest model partially leverages both Leaf size and flower color for classification but flower color appears to be the more
dominant feature. The model's performance might be limited by the overlap between the two species in the data, particularly for plants with similar leaf size and flower color values. Random forest, a versatile machine learning algorithm, finds applications across various domains, including finance
and banking Healthcare e-commerce marketing and more finance and banking in fraud detection random forests excel
in spotting irregular patterns in transactions leveraging features such as transaction amount location frequency and Merchant type to flag potential
fraudulent activities for credit risk assessment these models evaluate borrowers creditworthiness by analyzing factors like income debt to income ratio
credit history and employment status predicting the likelihood of default accurately additionally in stock market prediction random forests leverage
historical stock prices, company fundamentals, news sentiment, and market trends to forecast future prices, though this remains challenging. Healthcare: in medical diagnosis, random forests classify patients based on various medical data like test results, symptoms, and patient
history aiding health care providers in making informed decisions for drug Discovery and development researchers utilize these
models to identify potential drug candidates by analyzing molecular structures, gene expression data, and existing drug information. Furthermore, in personalized medicine, random forests help tailor treatments to individual patients by considering factors like genetics, medical history,
and lifestyle enabling predictions of patient response to specific therapies or medication dosages e-commerce and marketing customer segmentation benefits
from random forests, as they group customers based on purchase history, browsing behavior, demographics, etc., facilitating targeted marketing and
personalization efforts for product recommendations these models analyze customer purchase patterns product ratings and search history to offer
relevant product suggestions, thereby enhancing user experience and boosting sales. Moreover, in churn prediction, random
forests identify customers at risk of leaving by examining usage patterns service interactions and demographic
data allowing for proactive retention strategies other notable areas random forests find applications in environmental science aiding tasks like
land cover classifications using satellite imagery monitoring deforestation and assessing climate change
impact in image analysis they assist in image classification tasks such as facial recognition object detection in self-driving cars and analyzing Medical
Imaging scans furthermore in network intrusion detection random forests help identify suspicious Network traffic patterns by
analyzing features like source and destination IP addresses protocols used and packet sizes contributing to cyber security efforts it's important to note
that while random forests are generally robust to overfitting due to their Ensemble nature careful feature selection and hyperparameter tuning are
crucial for optimal performance. Additionally, random forests work well with both categorical and numerical features, and can handle data with missing values or outliers. Unlike bagging, which averages correlated decision trees, and random forest, which averages uncorrelated decision trees, boosting aims to improve
the predictions resulting from a decision tree boosting is a supervised machine learning model that can be used for both regression and classification
problems. Unlike bagging or random forest, where the trees are built independently from each other using bootstrapped samples (copies) of the initial training data, in boosting the trees are built sequentially and depend on each other: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. It's a method of converting weak learners into strong learners. In boosting, each new tree is fit on a modified version of the original data set, so unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model; that is, we fit a tree using the current residuals, rather than the outcome y, as the
response we then add this new decision tree into the fitted function in order to update the residuals each of these trees can be rather small with just a
few terminal nodes, determined by the parameter d in the algorithm. Now let's have a look at the three most popular boosting models in machine learning. The first ensemble algorithm we will look into today is AdaBoost. Like in all boosting techniques, in the case of AdaBoost the trees are built using information from the previous tree, and more specifically the part of the tree which didn't perform well; this is called the weak learner, a decision stump. This decision stump is built using only a single predictor, and not all predictors, to perform the prediction. So AdaBoost combines weak learners to make classifications, and each stump is made by using the previous stump's errors. Here is the step-by-step plan for building an AdaBoost model. Step one, initial weight assignment: assign an equal weight to all observations in the sample, where this weight represents the importance of the observation being correctly classified; with a weight of 1/N, all samples are equally important at this stage. Step two, optimal predictor selection: the first stump is built by obtaining the RSS (in case of regression) or the Gini index or entropy (in case of classification) for each predictor and picking the stump that does the best job in terms of prediction accuracy; the stump with the smallest RSS or Gini index/entropy is selected as the next tree. Step three, computing the stump's weight, and step four, updating the observation weights: we increase the weight of the observations which have been incorrectly predicted and decrease the weights of the remaining observations which were correctly classified, so that the next stump will place higher importance on correctly predicting these observations. Step five, building the next stump: based on the updated weights, use the weighted Gini index to choose the next stump. Step six, combining the stumps: all the stumps are combined while taking into account their importance (a weighted sum). Imagine a scenario where we
aim to predict house prices based on certain features like the number of rooms and age of the house for this example let's generate synthetic data
where num_rooms is the number of rooms in the house, house_age is the age of the house in years, and price is the price of the house in thousands of dollars. Importing libraries: the code starts by importing the necessary libraries, numpy for numerical operations, pandas for data manipulation, matplotlib for visualization, and specific modules from scikit-learn for machine learning tasks like model selection, ensemble learning,
and evaluation metrics data generation synthetic data is generated to mimic a real world scenario random numbers are generated to represent the number of
rooms in a house (num_rooms), the age of the house (house_age), and noise; the price of the house (price) is then calculated
based on a linear relationship with the number of rooms age of the house and added noise data visualization the generated
data is visualized using scatter plots; two scatter plots are created, one showing the relationship between the number of rooms and the house price, and
the other showing the relationship between the age of the house and the house price this visualization helps in understanding the distribution and
relationships between features and the target variable price data splitting the data is split into training and testing
sets using the train test split function from psyit learn this step is essential for training the model on one subset of data training set and evaluating its
performance on another subset testing set model initialization and training an adaboost regressor model is initialized with specific parameters like the number
of estimators decision trees set to 100 and a random seed for reproducibility the model is then trained using the training data X train
and Y train model evaluation once trained the model makes predictions on the test data X test the mean squared
error (MSE) and root mean squared error (RMSE) metrics are calculated to evaluate the model's performance compared to the actual house prices (y_test). Result visualization: the actual house prices (y_test) and the predicted prices are visualized using a scatter plot; the plot also includes a diagonal line representing perfect predictions. This visualization aids in assessing how closely the model's predictions align with the actual prices.
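A minimal sketch of this AdaBoost regression workflow on synthetic housing data might look like the following. The coefficients, noise level, and sample size are invented for illustration; the course's generated data will differ.

```python
# Sketch of AdaBoost regression on synthetic house-price data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
num_rooms = rng.integers(2, 8, size=500)
house_age = rng.integers(0, 50, size=500)
noise = rng.normal(0, 20, size=500)
price = 50 * num_rooms - 1.5 * house_age + 300 + noise      # price in $1,000s (illustrative)

X = np.column_stack([num_rooms, house_age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.25, random_state=42)

model_ada = AdaBoostRegressor(n_estimators=100, random_state=42)
model_ada.fit(X_train, y_train)

predictions = model_ada.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.2f}, RMSE: {np.sqrt(mse):.2f}")
```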
The scatter plots provided show two key relationships. Number of rooms versus price: this is represented by the green data points; there appears to be a positive correlation, meaning as the number of
rooms increases the price of the house also tends to increase this is likely because houses with more rooms are generally larger and more expensive
house age versus price this is represented by the red data points the relationship here is less clear there might be a slight negative correlation
where newer houses lower House age tend to be more expensive however the data points are scattered making it difficult to draw a definitive
conclusion additional points to consider the data points show some variation around the general Trends this indicates that there might
be other factors influencing house price besides the number of rooms and House age EG location amenities it's important to note that this is simulated data and
real estate prices can be influenced by many complex factors overall the scatter plot suggests that the number of rooms has a positive correlation with house
price, while the relationship between house age and price is less clear-cut. The scatter plots you saw provide valuable insights into the data, but AdaBoost
plays a crucial role in uncovering the underlying relationship between features number of rooms House age and price
here's how capturing complex relationships the data exhibits some scatter around the general Trends suggesting the price isn't perfectly
explained by just the number of rooms and house age; AdaBoost excels in handling such scenarios. It's an ensemble method that
combines multiple weak decision trees into a stronger final model these decision trees can effectively capture nonlinear patterns in the data providing
a more nuanced understanding of the price feature relationship than a single linear model focus on informative features while both Scatter Plots offer
clues, AdaBoost goes beyond simply visualizing correlations; it analyzes the data to determine which features, number
of rooms House age are most informative for predicting price by focusing on these features during the decision tree creation process adaboost prioritizes
the factors that have the strongest influence on price. Iterative refinement: AdaBoost works in a stagewise manner; it trains a series of weak decision trees, each focusing on correcting the errors of the previous one. By visualizing the data we can get a general sense of the trends, but AdaBoost iteratively refines its understanding through these multiple stages, ultimately leading to a more accurate prediction model. In essence, the scatter plots provide a starting point for understanding the data, but AdaBoost acts as a powerful tool to leverage that initial understanding and build a more robust model that captures the complexities of the price-
feature relationship Ada boost and gradient boosting are very similar to each other but compared to Ada boost which starts the process by selecting a stump and continuing to build it by
using the weak learners from the previous stump, gradient boosting starts with a single leaf instead of a tree or a stump; the outcome corresponding to this chosen leaf is then an initial guess for the outcome variable. Like in the case of AdaBoost, gradient boosting uses the previous tree's errors to build a new tree, but unlike in AdaBoost, the trees that gradient boosting builds are larger than a stump; there is a parameter where we set a maximum number of leaves to make sure the tree is not overfitting. Gradient boosting uses the learning rate to scale the gradient contributions, and it is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance on testing data. The major difference between the AdaBoost and gradient boosting algorithms is how the two identify the shortcomings of weak learners (for example, decision trees): while the AdaBoost model identifies the shortcomings by using high-weight data points, gradient boosting does the same by using gradients in the loss function. The loss function needs a special mention here, as it is the error term: the loss function is a measure indicating how good a model's coefficients are at fitting the underlying data, and a logical understanding of the loss function would depend on what we
are trying to optimize early stopping the special process of tuning the number of iterations for an algorithm such as GBM and random Forest is called early
stopping a phenomenon we touched upon when discussing the decision trees early stopping performs model optimization by monitoring the model's performance on a
separate test data set and stopping the training procedure once the performance on the test data stops improving Beyond a certain number of iterations it avoids
overfitting by attempting to automatically select the inflection point where performance on the test data set starts to decrease while performance
on the training data set continues to improve, as the model starts to overfit. In the context of GBM, early stopping can be based either on an out-of-bag (OOB) sample set or on cross-validation (CV). As mentioned earlier, the ideal time to stop training the model is when the validation error has decreased and started to stabilize, before it starts increasing due to overfitting.
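One simple way to implement this kind of early stopping with scikit-learn's gradient boosting is to monitor a held-out validation set with staged_predict. This is a sketch under assumed synthetic data, not the course's code.

```python
# Sketch: pick the number of boosting iterations by monitoring validation error.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=1, random_state=0)
gbm.fit(X_train, y_train)

# Validation error after each boosting stage; stop roughly where it bottoms out.
val_errors = [mean_squared_error(y_val, y_pred) for y_pred in gbm.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_errors)) + 1
print("best number of trees:", best_n_trees)

# GradientBoostingRegressor also supports built-in early stopping via
# the n_iter_no_change and validation_fraction parameters.
```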
To build a GBM, follow this step-by-step process. Step one: train the model on the existing data to predict the outcome variable. Step two: compute the error rate using the predictions and the real values
pseudo residual step three use the existing features and the pseudo residual as the outcome variable to predict the residuals again step four
use the predicted residuals to update the predictions from step one while scaling this contribution to the tree with a learning rate
hyperparameter step five repeat steps 1 to 4 the process of updating the pseudo residuals and the tree while scaling with the learning rate to move slowly in
the right direction, until there is no longer an improvement or we come to our stopping rule. The idea is that each time we add a new scaled tree to the model, the residual should get smaller.
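The five steps above can be sketched as a tiny from-scratch boosting loop. This is purely illustrative (the course itself uses scikit-learn's GradientBoostingRegressor, walked through next), and the data here is synthetic.

```python
# Toy sketch of the boosting loop: repeatedly fit small trees to the current residuals
# and add them in, scaled by a learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean(), dtype=float)    # step 1: initial guess
trees = []

for _ in range(100):
    residuals = y - prediction                          # step 2: pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=1)           # small tree (a stump here)
    tree.fit(X, residuals)                              # step 3: fit the residuals
    prediction += learning_rate * tree.predict(X)       # step 4: update, scaled by the learning rate
    trees.append(tree)                                  # step 5: repeat until a stopping rule

print("training MSE:", np.mean((y - prediction) ** 2))
```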
Let's break down the provided code step by step. Model initialization and training: the code initializes a gradient boosting
regressor model (model_gbm) with specific parameters, such as the number of estimators (trees) set to 100, the learning rate set to 0.1, the maximum depth of each tree set to 1, and a random seed for reproducibility. This model is then trained using the training data X_train and y_train. Gradient boosting builds an
ensemble of weak Learners decision trees in this case sequentially with each tree learning from the errors made by the previous ones predictions after training
the model is used to make predictions on the test data X test the predict method is applied to model GBM and the predicted house prices are stored in the
variable predictions. Model evaluation: to assess the model's performance, the mean squared error (MSE) and root mean squared error (RMSE) metrics are calculated; these metrics quantify the average squared difference between the actual house prices (y_test) and the predicted prices (predictions), and lower values indicate better model performance. The calculated MSE and RMSE are then printed to the console using formatted strings. Result visualization: finally, the code generates a scatter plot to visualize the relationship between the actual house prices (y_test) and the predicted prices (predictions). The scatter plot displays the actual prices on the x-axis and the predicted prices on the y-axis; additionally, a diagonal dashed line is drawn to represent perfect predictions, where actual prices equal predicted prices.
this visualization helps in assessing how closely the model's predictions align with the actual prices providing insights into the model's accuracy and
potential areas for improvement. Now if we compare this result with the result that we got before with AdaBoost, we can say the following. Scatter plot characteristics: both plots follow a similar structure, with predicted prices on the y-axis and actual prices on the x-axis; points represent individual
predictions with ideal predictions lying on a dashed diagonal line indicating where predicted prices equal actual
prices performance indication Ada boost the points are spread around the diagonal but show a trend of underestimating higher values as seen from the concentration of points below
the line as actual prices increase GBM the points are more tightly clustered around the diagonal line throughout the range of values
suggesting that GBM predicts both low and high prices with better accuracy than Ada boost algorithm Effectiveness the GBM model generally appears to
perform better, given the closer clustering of points around the identity line; this indicates a more accurate prediction across the range of house prices. The AdaBoost plot shows greater
deviation from the line especially at higher price points suggesting less consistency in prediction accuracy across the price
Spectrum data distribution both models handle the full range of data from about 150 to 500 in units consistent across
both models but the GBM seems to manage the upper range more effectively overall from these plots we can infer that GBM provides a more
accurate and consistent prediction for house prices compared to AdaBoost, particularly at higher price points, where AdaBoost tends to underestimate
values one of the most popular boosting or Ensemble algorithms is Extreme gradient boosting XG boost the difference between the GBM and XG
boost is that in the case of XG boost the second order derivatives are calculated second order gradients this provides more information about the direction of gradients and how
to get to the minimum of the loss function remember that this is needed to identify the weak learner and improve the model by improving the weak Learners
The idea behind XGBoost is that the second-order derivative tends to be more precise in terms of finding the accurate direction. XGBoost also applies advanced regularization in the form of L1 or L2 norms to address overfitting, and unlike AdaBoost, XGBoost is parallelizable due to its special
caching mechanism making it convenient to handle large and complex data sets also to speed up the training XG boost uses an approximate greedy algorithm to
consider only a limited amount of thresholds for splitting the nodes of the trees to build an XG boost model
follow this step-by-step process. Step one: fit a single decision tree; in this step a loss function (for example, NDCG) is calculated to evaluate the model. Step two: add the second tree; this is done such that when this second tree is added to the model, it lowers the loss function based on first- and second-order derivatives compared to the previous tree, where we also use a learning rate, eta. Step three: finding the direction of
the next move using the first degree and second degree derivatives we can find the direction in which the loss function decreases the largest this is basically
the gradient of the loss function with regard to the output of the previous model step four splitting the nodes to
split the observations, XGBoost uses an approximate greedy algorithm with approximate weighted quantiles (usually quantiles that have a similar sum of weights) for finding the split value of the nodes; it doesn't consider all the candidate thresholds, but instead it uses the quantiles of that predictor only.
optimal learning rate can be determined by using cross validation and grid search Imagine you have a data set containing information about various houses and their prices the data set
includes features like the number of bedrooms bathrooms the total area the year built and so on and you want to predict the price of a house based on these features let's dissect the
provided code step by step model initialization and training the code starts by importing the XG boost Library import XG boost as xgb XG boost is a
powerful implementation of gradient boosting machines next an XG boost regressor model model xgb is initialized
with specific parameters: the objective is set to reg:squarederror, indicating that the model aims to minimize the mean squared error loss function. Additionally,
the number of estimators trees is set to 100 and a seed value of 42 is specified for reproducibility the initialized model is then trained using the training
data X_train and y_train. XGBoost builds an ensemble of decision trees sequentially, optimizing a specified loss function. Predictions: after training, the trained model (model_xgb) is used to make predictions on the test data X_test; the predict method is applied to the model, and the predicted house prices are stored in the variable predictions. Model evaluation: the code proceeds to evaluate the model's performance using two common metrics, mean squared error (MSE) and root mean squared error (RMSE); these metrics quantify the average squared difference between the actual house prices (y_test) and the predicted prices (predictions), and lower values indicate better model performance. The calculated MSE and RMSE are then
printed to the console using formatted strings result visualization lastly the code generates a scatter plot to visually compare the actual house prices
(y_test) with the predicted prices (predictions). The scatter plot displays the actual prices on the x-axis and the predicted prices on the y-axis; additionally, a diagonal dashed line is drawn to represent perfect predictions where actual prices equal predicted prices. This visualization aids in assessing the model's accuracy by examining how closely the predicted prices align with the actual prices, providing insights into the model's performance.
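As a minimal sketch of the XGBoost workflow just described (requires the xgboost package; the housing features below are synthetic placeholders, not the course's dataset):

```python
# Sketch of XGBoost regression on synthetic housing data.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = np.column_stack([rng.integers(1, 6, 1000),       # bedrooms
                     rng.integers(1, 4, 1000),       # bathrooms
                     rng.uniform(50, 300, 1000)])    # area
y = 30 * X[:, 0] + 20 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 15, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model_xgb = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, random_state=42)
model_xgb.fit(X_train, y_train)

predictions = model_xgb.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.2f}, RMSE: {np.sqrt(mse):.2f}")
```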
Now let's compare AdaBoost, GBM, and XGBoost; here's how they compare. AdaBoost: the AdaBoost plot shows
predictions that tend to underestimate the actual prices especially as the values increase this is evident from the larger number of points lying below the
diagonal line in the higher price ranges the overall fit to the diagonal is less tight suggesting higher prediction errors or bias particularly for higher
priced houses GBM the GBM model produces predictions that are generally closer to the diagonal across all price ranges this indicates better accuracy and
consistency in predicting both lower and higher priced houses compared to adaboost the points in the GBM plot are more tightly clustered around the diagonal indicating lower variance in
prediction errors XG boost the XG boost plot also shows a tight clustering of points
around the diagonal similar to GBM this suggests a high level of accuracy unlike GBM the XG boost plot seems to slightly overestimate the lowest priced houses
while matching or slightly underestimating the highest priced houses but still maintains a close adherence to the diagonal
line summary of comparison accuracy and consistency both GBM and XG boost exhibit high accuracy with their predictions closely clustered around the
diagonal line showing their effectiveness in both lower and higher price predictions Ada boost however shows more variance and a tendency to
underestimate, particularly at higher price points. Performance at different price ranges: GBM and XGBoost handle extremes in prices better than AdaBoost, which struggles with underestimation as prices increase. General predictive performance: XGBoost and GBM are quite comparable, with slight
differences in how they handle the very low and very high ends of the price spectrum; AdaBoost appears to be less reliable, especially for higher-priced properties. From these observations, GBM and XGBoost seem more suitable for scenarios where precise and consistent predictions across a wide range of house prices are critical; AdaBoost might be more prone to prediction errors, particularly in higher price brackets. Hi, I'm Vah, and in this project
we will learn how to understand your customers better track sales patterns and show those results if you like working with data or own the store this video will show you how to use
information to make better choices and get better results. You will split your customers into smaller groups based on how they shop; this helps you send the right messages to the right people and give them offers they will like. Loyal customers are the best: you will use data to find your biggest supporters and those who are ready to spend more, and then you can reward your best customers with programs that fit their shopping habits; this makes them happy and stops them from going to other stores. We will use data to guess what people will buy and when they will buy it; you will find sales patterns among different items and figure out what cool new products people will want. This lets you always have the right stuff at the right time: you won't have too many items, everything will sell, and customers will be surprised by how well you know what they need. We'll look at how sales change throughout the year; this helps you plan for busy times and slowdowns early, and know exactly when to have big
sales we will use location data and what people say about you to find places where sales are going well and where you could grow you will even show it all on the map this helps you spend your
advertising money wisely find great spots for new stores and even choose the perfect things to sell in each place so let's get
started. All right, let's now go over the dataset I will be using. We are using the Superstore sales dataset, and it has 9,800 rows and the columns order ID, order date, ship date, ship mode (standard class, second class, or other classes), the customer ID together with the customer name, and the segment, meaning who bought the product, whether a consumer, a corporate, or a home office. The clients mainly come from the United States, and the city of the United States they come from is also specified. So we shall import this into our Google Colab and start working on it. Okay, so let's now import the necessary Python libraries: we import pandas as pd, we also import numpy as np, matplotlib, and seaborn as sns, along with a couple of other utilities we'll be using. Perfect. So this is how it looked on Kaggle, and this is also how it looks once we have imported it. Let's now look at the data frame's info: everything seems to be consistent except the postal code; it seems that 11 postal codes are missing. Okay, so what we can do is fill in those null values. As you can see, we have replaced the null postal codes, for customers that didn't have any postal code, and we have put a zero inside. All right, so let's now move on to checking for duplicates: if df.duplicated().sum() is greater than zero we print that duplicates exist, and if not we print that no duplicates were found. All right, as you can see, there exist no duplicates.
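A rough sketch of these loading and cleaning steps might look like the following; the file name superstore.csv and the column name Postal Code are assumptions, since the exact notebook isn't shown on screen here.

```python
# Sketch of the loading and cleaning steps narrated above.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("superstore.csv")    # assumed file name
df.info()                             # spot the ~11 missing postal codes

df["Postal Code"] = df["Postal Code"].fillna(0)   # fill missing postal codes with 0

if df.duplicated().sum() > 0:
    print("duplicates exist")
else:
    print("no duplicates found")
```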
Let's now move on to customer segmentation. Let's first create a variable for the types of customers and extract the column called segment out of our data frame. As you can see from our data frame, we have a column named segment; this segment column lists the types of customers. In our data frame we
have both consumer and corporate customers so let's get started with customer segmentation the main problem is that many large businesses struggle to understand the contribution and
importance of their various customer segments they often lack precise information about their main buyers relying on intuition rather than data
this leads to misallocation of resources resulting in Revenue loss and decreased customer satisfaction for example if your store primarily caters to Consumers
it's crucial to tailor your marketing and customer satisfaction efforts to resonate with their needs and preferences by focusing your resources on understanding and Catering to your consumer base you can avoid
misallocating resources to large corporates. This ensures you're providing a satisfying customer experience for your primary demographic, ultimately leading to increased customer loyalty and revenue growth. We can also create a pie chart or bar chart from it to clearly illustrate the revenue contribution of each customer segment, and this will allow us to tailor more of our marketing and customer satisfaction resources accordingly. Once you've completed customer segmentation, the next step depends on your strategic goals. Here are a few ways to proceed. Focus on your most valuable segment: if your existing customer segmentation reveals a particularly profitable segment, such as consumers, tailor your marketing, product offerings, and customer service to deepen your engagement with that group. Target new segments: if you want to attract more corporates or home offices, you'll need to understand their unique needs and pain points; start by researching these segments, what are their challenges, what solutions would appeal to them, then develop tailored messaging and consider offering specialized products or
services to attract these new customer types. All right, so let's get started. This extracts the types of customers from the data frame: perfect, so it's consumer, corporate, and home office; those are all the segment values in our data frame. All right, so let's count the unique values in our segment column, and we will store this as number of customers. What this does is count the unique values in our segment column and reset the index to turn them into a column, and then we can rename the columns: we want to give our segment column a name like total customers or type of customer; I will go with type of customer, so we say number of customers is equal to number of customers dot rename, and the column that is named segment we rename to type of customer. Now if you print number of customers, there are 5,101 consumers, 2,953 corporate buyers, and 1,746 home offices. And if you want to create a pie chart out of this, we can plot it by saying plt.pie on the number of customers, base the pie chart on the counts, and label it with the customer types. Perfect. All right, so as you can see we renamed the columns to type of customer and total customers, and you can see from this pie chart that our main segment is consumers with 52%, 30% of our orders come from corporates, and 18% from home offices.
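A short sketch of the counting and pie chart steps, assuming the column is called Segment and the df from the loading sketch earlier:

```python
# Sketch: count orders per customer segment and show the breakdown as a pie chart.
import matplotlib.pyplot as plt

segment_counts = df["Segment"].value_counts()    # Series: index = type of customer, values = counts
print(segment_counts)

plt.pie(segment_counts.values, labels=segment_counts.index, autopct="%1.0f%%")
plt.title("Orders by customer segment")
plt.show()
```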
You can see who we have to focus on, which is consumers. While consumers hold the majority, focusing solely on them overlooks significant potential within the
corporate and home office segments let's explore how to balance resource allocation for all three segments to maximize growth to gain even deeper
insights we should integrate our customer data with sales figures this analysis will help us identify which segments generate the most Revenue per
customer average order value and overall profitability customer lifetime value additionally we can segment customers by
purchase frequency and basket size to understand their buying Behavior within each segment here are some additional questions to consider for a more
comprehensive analysis customer acquisition cost CAC how much does it cost to acquire a customer in each segment
customer satisfaction how satisfied are customers in each segment churn rate what is the rate at which customers leave in each segment by analyzing these
factors alongside revenue and customer lifetime value we can create a customer segmentation model that prioritizes
segments based on their overall value and growth potential. We can also plot a bar graph for the total sales for each customer type: we group the data by the segment column and calculate the total sales for each segment. Right now you don't see the exact sales numbers; with the bar chart you can see the exact sales numbers for each customer type, so let's plot it.
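A minimal sketch of that aggregation and bar chart, assuming columns named Segment and Sales:

```python
# Sketch: total sales per customer segment shown as a bar chart.
import matplotlib.pyplot as plt

sales_per_segment = df.groupby("Segment")["Sales"].sum().reset_index()
print(sales_per_segment)

plt.bar(sales_per_segment["Segment"], sales_per_segment["Sales"])
plt.ylabel("Total sales")
plt.show()
```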
So there are around 1.2 million from our consumers and we
have around 600 or 700 thousand from corporates. Now we can also plot a bar chart from this, with type of customer on one axis and sales per segment on the other. This bar chart effectively illustrates the distribution of sales across our customer segments: consumers account for the largest portion of sales (1.2 million), followed by corporates (1.0 million) and home offices (0.8 million). While the chart is clear, a deeper analysis can help us optimize our marketing efforts. Customer lifetime value (CLTV): calculate the CLTV of each segment to identify which segments
generate the most Revenue over time this will help prioritize customer segments for marketing efforts for example if you
find that the home office segment has a higher CLTV than the Consumer segment you may want to invest more resources in marketing campaigns targeting home
office customers. Market research: conduct market research to understand the specific needs and preferences of each customer segment; this will inform the
development of targeted marketing campaigns for instance you might discover that consumers in your data are price sensitive while corporate
customers are more interested in bulk discounts and reliable service you can use this knowledge to tailor your marketing messages to each segment
average order value analyze average order value by segment to identify opportunities to increase Revenue per
customer let's say your analysis reveals that corporate customers have a higher average order value than consumers you could develop marketing campaigns that
encourage consumers to purchase bundles or higher pric products to increase their average order value customer acquisition cost CAC how much does it
cost to acquire a customer in each segment knowing CAC can help determine the return on investment Roi for
marketing efforts here's an example let's say it cost $100 to acquire a new corporate customer but only $20 to acquire a new consumer customer if the
CLTV (customer lifetime value) of a corporate customer is significantly higher than the CLTV of a consumer customer, then spending $100 to acquire a
corporate customer may still be profitable however if the CLTV of the corporate customer is only slightly higher than the CLTV of the consumer
customer you may want to focus your marketing efforts on acquiring more consumers because the cost of acquisition is much lower customer
satisfaction how satisfied are customers in each segment understanding satisfaction levels can help identify
areas for improvement and reduce churn here's an example you can conduct surveys or collect customer feedback to understand satisfaction levels if you
find that corporate customers are less satisfied than consumer customers you may want to investigate the reasons for their dissatisfaction and make changes to
improve their experience. This could involve improving your customer service, offering more competitive pricing for corporate customers, or developing
products or services that better meet the needs of corporate customers we can also create a pie chart
for our sales, which you can do with plt.pie on the total sales per segment, naming the labels by type of customer. 51% of our sales come from our consumers, 30% from our corporates, and 19% from home offices. All right, so let's now move on to customer loyalty. As a business,
you want to make sure that your most loyal customers stay happy this will make sure that those customers keep on coming back keep on bringing new people and also placing new
orders so you will decrease the cost on acquisition of new customers because there will be already existing customers
and you'll also be able to make sure that your Revenue either stay at the same level or increases by keeping your
most loyal customers happy and you want to do that as a business now we can do this by either the following ways we can
rank the most loyal customers by the number of orders they have placed or by the total amount they have spent. Say you have analyzed your data, pinpointing your 30 most loyal customers; this represents a significant opportunity to strengthen these relationships and maximize their lifetime value. Here's a powerful approach: design a targeted email specifically for these high-value segments, proactively offering personalized support with inquiries such as "how can we assist you today?". This demonstrates your commitment to their success, proactively addressing potential issues and fostering a deep sense of loyalty. Loyalty programs: consider a tiered loyalty program that offers exclusive rewards tailored to your most valuable customers; this could include early
access to new products personalized discounts or even point-based reward systems personalized experiences leverage your Data Insights to go beyond
email consider personalized website recommendations targeted promotions based on past purchase history or even handwritten thank you notes for high
value customers customer feedback loops make sure your top customers feel heard Implement surveys or invite them to participate in exclusive focus groups
this demonstrates you value their input and are actively using feedback to improve the customer experience Community Building depending on your
business model fostering a community among your most loyal customers can create a sense of belonging this could involve access to online forums
exclusive events or opportunities to network with like-minded individuals now this strategy extends Beyond customer satisfaction
prioritizing the experience of your top customers directly correlates with increased retention, positive referrals, and ultimately improved revenue. Now let's dive deeper and see who our most loyal customers are. All right, so let's
now get started with that. Let's first display the first three rows of our data frame. As you can see, there is a column called sales, and each customer has a specific ID with a specific name, so if you count the number of times a customer ID shows up, you also have that customer's total number of orders, which you can then use later however you want. So let's start with doing that. Now let's rename the columns: we want the column order ID, which now holds the order counts, to be named total orders, so we rename the column that is equal to order ID to total orders, with inplace set to True. Okay, so now let's identify the repeat customers, the customers with an order frequency greater than one: repeat customers equals the rows of customer order frequency where total orders is greater than one. Perfect. Now we want to organize this by sorting, and we can do that by saying repeat customers sorted equals repeat customers dot sort values. Perfect, now let's print this out: print repeat customers sorted dot head 12, to display our top 12 customers, with the index reset.
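As a rough sketch of this repeat-customer analysis (the column names Order ID, Customer ID, Customer Name, and Segment are assumptions about the dataset):

```python
# Sketch: count orders per customer, keep repeat customers, and rank them.
customer_order_frequency = (df.groupby(["Customer ID", "Customer Name", "Segment"])["Order ID"]
                              .count()           # counts order lines per customer
                              .reset_index()
                              .rename(columns={"Order ID": "Total Orders"}))

repeat_customers = customer_order_frequency[customer_order_frequency["Total Orders"] > 1]
repeat_customers_sorted = repeat_customers.sort_values(by="Total Orders", ascending=False)
print(repeat_customers_sorted.head(12).reset_index(drop=True))
```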
So the customer with the name William Brown, who is a consumer, has placed a total of 35 orders. This is the list of your top customers, and as a business or superstore you can decide exactly the number of total orders a person or a business has to place in order to be considered a loyal customer, and then according to that you can tailor your services. Now, the data clearly reveals that a small group of customers place orders with considerably higher frequency (30-plus): we have William Brown with 35 orders, another home office customer with 34, and many consumers and one corporate with 32. So it shows clearly that we have a loyal group of customers. There's also significant potential for our home
office segment several of our most loyal customers belong to the home office segment now this implies that the home office segment has a strong potential for customer loyalty and deserves
targeted marketing efforts. It also shows that we don't have just one group of loyal customers; we have home offices, consumers, and corporates.
while there are many consumers it doesn't mean that we have to focus on one segment it means that we still have to devise a plan that caters to our
multiple segments. So, some recommendations: we can prioritize loyal customers, segment customers by order frequency, and develop exclusive offers, rewards, or early-access programs tailored to these customers; for example, we can provide them exclusive discounts, reward programs, and earlier access. We can also target more home offices, because we see that home offices keep coming back, and we are able to satisfy quite a few of them; that means we have catered to their needs and provided a good enough service for them to keep coming back, which means our product is great for home offices and we can target more home offices using content marketing, social media ads, or other types of marketing strategies. We can also analyze the way we provided service to these customers, because it worked out pretty well, and if we provide this kind of service to newly arriving customers, then we increase the chance that they also become loyal customers. So those are
several conclusions we can make. Now, we can also identify loyal customers by sales. So far we identified them by the total number of orders they have placed, but we can also use the amount of sales, the total amount spent, to identify them, because a person can come and place 35 orders, but if they place 35 one-dollar orders, then obviously that's just $35; the order count doesn't say anything about the sales amount. So ideally you also want to organize it by the sales amount, to be able to identify the actual top-spending and loyal customers. That said, when there is a significant customer, let's say someone has spent around 25,000, that can also be done in one order, so it doesn't mean that it's a repeat customer; it's just a top spender.
Now let's start with identifying our top-spending customers. Let's first create a variable customer sales equal to the data frame grouped by customer ID; we also want to see the customer name and what type of customer they are (the segment), and we want to sum up the sales for each of them and reset the index. Now let's get our top spenders by ranking them in descending order, meaning our top spenders will be ranked all the way at the top: customer sales dot sort values by sales.
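As a minimal sketch of that ranking (column names again assumed):

```python
# Sketch: total spend per customer, ranked from highest to lowest.
customer_sales = (df.groupby(["Customer ID", "Customer Name", "Segment"])["Sales"]
                    .sum().reset_index())
top_spenders = customer_sales.sort_values(by="Sales", ascending=False)
print(top_spenders.head(10))
```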
So Sean Miller has spent the most; he is from a home office, with a total amount of about 25,000 USD. William Brown has placed the most orders, 35, but William Brown is nowhere to be found here, and likewise Sean Miller, the customer who has spent the most in our superstore, was nowhere to be found among the top repeat customers, meaning that repeat orders don't really define spending habits. It depends on the way you run a superstore: obviously I would want our customers to come back, but I would dedicate my resources to the customers who spend the most, because those are the customers who bring the most business to me, meaning those are the customers I have to keep happy. So the total number of orders is useful, but it doesn't really say that much about a customer's spending habits and their value to your store. All right, let's now go over to the next chapter, which is shipping.
now as a Superstore you also want to know what shipping methods customers prefer and which are the most cost effective and
reliable and overall knowing this impacts your customer satisfaction and also meaning it also has great impact on
your revenue. So, for example, Amazon has multiple shipping methods, but it has one most popular shipping method, which keeps the most customers happy and also makes Amazon the most money. So as a superstore you want to know which one of your shipping methods is the most
reliable. So for our shipping modes, let's create a variable: we take the ship mode column of the data frame, count those values, and of course reset the index. Our standard class is by far the most popular; it's almost four times more popular than the next shipping mode, with first class, second class, and same day making up the rest. So let's create a pie chart of this with plt.pie on the shipping modes. All right, so these are the shipping methods: the most popular one is standard class, which around 60% of the orders use, and the rest is around 40%. As a superstore, or as any store, you invest in your shipping, so you end up buying some kind of deals with delivery companies like DHL, UPS, and others, and sometimes you end up recommending the wrong option to your customer. Let's say second class is fast, but it ends up costing the customer way too much; the customer ends up not buying your product, and this decreases revenue for your store. But if you know that standard class is the most popular option, then you can have a button saying "this is our most popular option", which is standard class, and most of the time people choose the most popular option. This will help the superstore save the cost of investing in the other options, or dedicate resources according to the amount of business each class brings, and it also allows the superstore to recommend its most popular option, which is standard class.
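A short sketch of the shipping-mode breakdown, assuming the column is called Ship Mode:

```python
# Sketch: count orders per shipping mode and show the shares as a pie chart.
import matplotlib.pyplot as plt

shipping_mode = df["Ship Mode"].value_counts()    # Series: index = ship mode, values = counts
print(shipping_mode)

plt.pie(shipping_mode.values, labels=shipping_mode.index, autopct="%1.0f%%")
plt.title("Orders by shipping mode")
plt.show()
```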
The problem that many superstores have is that they have stores in many locations, in many states, but they don't know how well each one is performing. On a dashboard, for example, you could track that, but without it they have no idea how well each of the stores in each state is performing, leaving them clueless about where there is underperformance or, for example, where there is a high-potential area in which they can open a new store.
so let's move on to this chapter which is geographical analysis so many stores have hard time in identifying high potential areas or also identifying
stores that are underperforming so things like Walmart Target they have like many branches and
they they will want to know how well each branch is doing and the perfect way to do this is by counting up the number
of sales for each City the number of sales for each state and then this will allow you to see which of the states or which of the cities is performing the best and which of them is performing the
least, and dedicate your resources accordingly. So let's say a store in one city is simply losing money, for years or more; then you will want to adjust your strategy according to that, so maybe you will want to close this store or adjust it in a way so that it starts bringing in more profit or revenue. Well, so let's get started with that.
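A rough sketch of the geographic breakdown narrated next, assuming columns named State, City, and Sales:

```python
# Sketch: order counts per state and city, plus sales per state.
orders_per_state = df["State"].value_counts()
print(orders_per_state.head(20))     # e.g. California at the top

orders_per_city = df["City"].value_counts()
print(orders_per_city.head(15))      # e.g. New York at the top

sales_per_state = df.groupby("State")["Sales"].sum().sort_values(ascending=False)
print(sales_per_state.head(10))
```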
All right, so as you can see, the most popular state is California and the least popular of our top 20 is New Jersey. Maybe you can go over this list and, in a few of the states where there's still high potential for a profitable store, such as Washington, take a closer look. From this you can see that Washington is performing fourth, while New Jersey is performing the least of our top 20, so you can conclude that you might have to work on New Jersey more to increase the order count, which also allows you to increase your revenue; or you can see that California is your most popular state, so you might want to keep California happy. You can also do it per city: we take the city column of the data frame, count the values, reset the index, and print the top 15 cities. The most popular city is New York, with an order count of 891, then Los Angeles, and Jackson is the least popular of our top 15; you can also increase this to the top 25. So not only can you focus on the states, but for each state you can also focus on the cities that are underperforming or overperforming. This allows you to dedicate your resources to the cities you want, maybe to increase your revenue or your potential, or maybe there is a city, for example Long Beach, where there's high potential but you're not using any of your
resources. Now we can also organize it as sales per state, let's say state sales: previously we did it by order count, and we can also do it by state sales, where we sum up the sales, reset the index, and rank them in descending order. Perfect. As you can see, our most popular state is still California, then New York, and then the order changes slightly, with Texas next; this is the popularity of the states according to the sales amount. Let's also sort it per city. The most popular cities by sales are New York, Los Angeles, and San Francisco; this is almost exactly the same as our previous analysis on the cities:
nothing has really changed. All right, as a store you want to be able to track your most popular category of products, your bestselling products, and sales performance across categories and sub-categories: find the sweet spots where strong categories also have top-selling sub-categories, spot weaker sub-categories inside otherwise strong categories that might need improvement, and watch product popularity fluctuate. Seeing whether popularity trends up and down seasonally helps you forecast future demand, and you can also group by location, since each location might have a different popular product that you want to place in a certain spot to maximize your store revenue. So let's get started with finding our top performing products and categories. First, let's extract the product categories from our data frame with unique. Right now we have only three sorts of products: each row has a category and a sub-category (cases, chairs, and so on), but there are mainly three categories, which are Furniture, Office Supplies and Technology. Now let's look at the types of sub-categories per product; printing the product sub-categories shows Chairs and a bunch of others. Next, let's group the data by product category and see how many sub-categories each one has; for example, Office Supplies might have twenty sub-categories and Furniture might have five, so let's see how many sub-categories each one actually has.
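A minimal sketch of those two steps, continuing with `df` and assuming the columns are named 'Category' and 'Sub-Category' as in the common Superstore file:

```python
# the main product categories and their sub-categories
print(df["Category"].unique())
print(df["Sub-Category"].unique())

# how many distinct sub-categories each category contains
print(df.groupby("Category")["Sub-Category"].nunique())
```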
So there are nine sub-categories for Office Supplies, four for Furniture, and four for Technology, which makes Office Supplies the most varied category. Now we can also find our top performing sub-categories: take the sub-category column, sum up the sales, and group by sub-category.
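A sketch of that aggregation, under the same column-name assumptions:

```python
# total sales per sub-category, largest first
subcat_sales = (df.groupby("Sub-Category")["Sales"].sum()
                  .sort_values(ascending=False))
print(subcat_sales)
```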
Our most popular sub-category is in Technology, specifically Phones, which has the highest amount of sales, followed by Chairs in Furniture and Storage in Office Supplies. From this you can see your most popular sub-categories and decide which ones to recommend on a front page or in the store. Now let's see which of our main categories has the most sales by grouping by product category. As expected, Technology is the most popular one, then Furniture, then Office Supplies, so maybe inside your store you give it a somewhat larger department, or place it in the first row right in front of the customers, to present your most popular option immediately. This will of course allow you to increase
your revenue and sales. If you want to create a pie chart for this, you can take the top product categories organized by sales and use the category names as labels. It turns out Technology performs a little better than the other two, but the difference is not that large. All right, now let's see which of our sub-categories is the most popular one. Remember that we already computed which sub-category had the most sales; now let's turn it into a bar graph by sorting the sub-category sales in descending order and plotting the sales per sub-category as bars.
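The spoken code is hard to follow here, so this is a hedged sketch of both plots with pandas and matplotlib (column names as assumed above):

```python
import matplotlib.pyplot as plt

# pie chart of sales by main category
category_sales = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
category_sales.plot(kind="pie", autopct="%1.1f%%")
plt.title("Sales by Category")
plt.ylabel("")
plt.show()

# bar graph of sales by sub-category, sorted descending
subcat_sales = df.groupby("Sub-Category")["Sales"].sum().sort_values(ascending=False)
subcat_sales.plot(kind="bar")
plt.title("Sales by Sub-Category")
plt.ylabel("Sales")
plt.show()
```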
This confirms that our most popular options are Phones and Chairs. Since these generate the most sales, customers are clearly willing to pay for them, so you could spend more of your marketing resources on phones and chairs: the resources you have already put into them are working, which suggests that if you increase that spend, your sales will increase accordingly. You can also conclude that Art, Envelopes and Labels aren't that popular, so maybe you give a discount now to clear them out and buy fewer of them in the future, which frees you to buy more of the popular options like phones and chairs. Or you can investigate why they are not popular: maybe these are simply bad envelopes, or the wrong kind of art that people don't like, and if you chose a completely different assortment customers might end up buying it. This shows exactly how stores can use this data to optimize their sales and how their resources are allocated, so they end up making more money and more sales.
Businesses love making sales; they love seeing revenue and profits increase. It's all lovely, but you should be able to track your sales so that you can see what situation you are in and adjust to it, and what better way than a pie chart, a bar graph, or a simple line graph to see how much growth or decline you're experiencing? For example, if revenue is declining year over year or month over month, you can see there is a problem and allocate resources toward fixing it, whether that is investing more in customer preferences, putting more money into marketing, improving whatever makes your customers more satisfied, or adopting new technologies. Those are all things you can do when you see declining revenue, but first you must be able to see it coming. Businesses also struggle with unstable growth: they may grow one month, and the next month there is no growth or even a decline, so as a business you want to see that in order to stabilize the growth and keep growing continuously. There are also missed seasonal opportunities: if a business isn't aware of how sales change throughout the year, it could miss out on maximizing profits during big seasons; maybe in some seasons a certain product is high in demand, but you don't have enough stock to cover it, so you end up unable to meet the demand and lose out on revenue and profits. Those issues concern yearly sales trends; there are also problems around quarterly and monthly sales, for example
cash flow issues. Many businesses experience cash flow issues: maybe one day they look at their bank account, see that they are out of money, and cannot invest more in their business. There is also inventory imbalance and ineffective marketing. With a cash flow issue, drastic dips in sales during specific quarters or months can lead to cash crunches, making it hard to pay suppliers, employees, or ongoing expenses. With inventory imbalance, in some periods you are overstocked and have to give items away, while in other periods you are understocked and unable to meet the demand. And maybe your marketing is ineffective: if you spend a significant amount of time and money on marketing and don't reach your desired outcome, there is a major issue with your campaign, and you can see that in the sales you're making; for example, if you increase marketing spend and there is no significant increase in sales, you're doing something wrong. There is also lagging response to emerging trends: monthly sales data can highlight new trends or drops in demand much more quickly than yearly overviews, so you can react much faster. For example, a certain product released in 2024 is suddenly in high demand in many countries; you want to adjust to that demand and get supplies for the product, but you cannot do that if you only track yearly sales or don't track at all. Those are all the problems that exist if you are not able to track your sales, be it monthly, quarterly, or yearly, and we intend to solve that problem by graphing the sales and drawing conclusions from the results.
All right, let's get started. First, convert the order date column to datetime format: set the order date column equal to pd.to_datetime of itself, with dayfirst equal to True. Then group the data by year and calculate the total sales amount for each year: create a yearly sales variable, group by the year of the order date, and sum the sales. Reset the index and give the columns appropriate names, because right now the grouping column is still called order date; it should be named Year, and the sales column should be named Total Sales. Now let's print this out to see the total amount of sales for each year, and we can also plot a bar graph of it.
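A minimal sketch of those steps (the 'Order Date' and 'Sales' column names and the day-first date format are assumptions based on the narration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# parse the order date and total the sales per year
df["Order Date"] = pd.to_datetime(df["Order Date"], dayfirst=True)
yearly_sales = df.groupby(df["Order Date"].dt.year)["Sales"].sum().reset_index()
yearly_sales.columns = ["Year", "Total Sales"]
print(yearly_sales)

# bar graph of total sales per year (swap kind="line" for the line version used later)
yearly_sales.plot(kind="bar", x="Year", y="Total Sales", legend=False)
plt.ylabel("Total Sales")
plt.show()
```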
All right, from this bar graph a few conclusions can be made; for
example, there is steady growth from 2016 to 2018, which might be explained by effective new product launches, economic factors, or marketing efforts. Those are all explanations a person could offer, but you can only confirm such conclusions when you have more data available, and in this data frame we don't have marketing costs or any other costs, so our conclusions are pretty limited. What we can see is that this bar graph, combined with any other chart, for example marketing cost, would let a business draw a good number of conclusions. We can also plot the same totals as a normal line graph of year against total sales, which reads a little differently; I actually prefer this sort of graph over a bar graph for tracking yearly sales, because it shows the amount of increase or decrease much more clearly. We can also focus on the quarterly sales, like I said, to be able to react quickly to emerging trends or to any kind of change. Now let's
again convert the order date column to datetime format and aggregate the total sales per quarter.
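A hedged sketch of the quarterly aggregation (continuing from the sketch above, with 'Order Date' already parsed to datetime):

```python
# total sales per quarter
quarterly_sales = (df.set_index("Order Date")["Sales"]
                     .resample("Q").sum())

quarterly_sales.plot(marker="o")
plt.title("Quarterly Sales")
plt.ylabel("Sales")
plt.show()
```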
From our quarterly sales we can see a steady increase, and then, all of a sudden, around July it blows up to new heights. From this graph we can see that Q3 and Q4 did very well and Q1 and Q2 didn't, so something might have changed: a seasonal trend, increased marketing, a new product, or targeting a specific customer segment, and as a business it's really important to know which. You can also expect that if you follow a similar line of actions you might again see higher demand for your products in Q3 and Q4, so you might want to overstock, or you can analyze this further and replicate the successful strategies in future quarters; this helps the business grow steadily and increase its revenue, which is good. For Q1 you can see the year starts out pretty slowly; businesses may want to start the year more quickly, so they can investigate that too. For example, is this seasonal for the industry, with certain products simply not in high demand in certain seasons? Did a competitor run some kind of marketing, or use some strategy to drive more customers to them? Or did we change our marketing effort in Q3 while our marketing was not productive in Q1 and Q2? So maybe you want to investigate this
much more deeply, not quarterly but monthly. Let's do that now: we start the same way, converting the order date column to datetime format and then converting the dates to monthly periods before summing the sales.
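A minimal sketch of the monthly aggregation under the same column-name assumptions:

```python
# total sales per month
monthly_sales = df.groupby(df["Order Date"].dt.to_period("M"))["Sales"].sum()

monthly_sales.plot(marker="o", figsize=(12, 5))
plt.title("Monthly Sales")
plt.ylabel("Sales")
plt.show()
```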
All right, from this chart you can see
that it's growing month over month, apart from the first month of 2019 and the third month of 2018. This is generally an upward trend, which suggests healthy sales, and it looks like August and December might be your seasonal peaks, so you might want to overstock products then. You can also see seasonal dips in the third month of 2018 and 2019 and in November of 2018, so you might consider seasonal promotions to stimulate off-season sales, or diversify your product and service offerings to reduce the reliance on seasonal demand; maybe you want to deploy new marketing strategies there, or target new customer segments by introducing new products so that you offset the seasonal trends. Overall it seems pretty consistent, so there appears to be a healthy sales trend, and you might want to invest more in your proven strategies, which might be a certain marketing tactic, promotion, or product offering for a certain month; and, like I said, for the dips the store might try deploying new marketing strategies or introducing new products to target new customer segments, which might offset the dip. That said, it's important to consider the time frame: one year certainly holds a fair amount of data, but it does not reveal seasonal patterns very accurately, because what is true for one year might be the complete opposite in another. So you might want to look at a longer sales line graph, maybe five years; if the pattern holds, that suggests a genuine seasonal trend, and if not, you act accordingly. All right, we have covered the sales trends, so we can
move on to the next chapter, which is mapping. We want to create a map of sales per state: each state is colored according to its amount of sales, so a state with a high amount of sales should be colored yellow and one with little or no sales should be colored blue. The question is, why would someone want to do this? Now,
companies looking to expand into new geographic areas face the challenge of identifying the most promising states and regions for their products or services. For example, how do you know whether your product will sell in a certain state? One tactic people often use is to check whether a similar store is already operating in that state or city; if there is, and the market is not saturated, meaning there is still a substantial number of people who might buy your product, then it's a good idea to go there. For example, a company that manufactures athletic apparel is considering expanding its retail footprint; by analyzing total sales data by US state, it can see that states with a high concentration of fitness centers and an active population, such as California, Texas, or Florida, might be good candidates for new stores, and if there are currently no sports stores there, even better. Or say you're a business that wants to strategically allocate its marketing budget and sales team: you have stores all over the states and want to optimize for each one; maybe one state is performing well and another is not, so from the map you can see which state is underperforming and allocate your resources accordingly, maximizing your return on investment by optimizing certain strategies. But if you don't know which state is performing well and which is not, you have no information on where you have to optimize. For example, a national pizza chain wants to optimize its marketing spend; sales data reveals that its pizzerias in the Midwest consistently outperform those on the West Coast, which suggests it might need to allocate more marketing budget to increase brand awareness and sales in the western states. You might also want to do this for competitor analysis: staying ahead of the competition means understanding where your competitors are having the most success, and analyzing their sales patterns across states can reveal their geographic strengths and weaknesses. For example, a coffee roasting company notices that a competing coffee brand is experiencing high sales in the Pacific Northwest states; this could indicate that the competitor has established strong partnerships with local grocery stores or run a lot of successful marketing campaigns in that region, and the company can use this information to target similar grocery stores or develop competitive marketing strategies for the
Pacific Northwest. So, without further ado, let's get started. This time, instead of writing the code live, I will just walk you through it. First we import the plotting library, Plotly, and initialize it in the Jupyter notebook. We also build the mapping for all 50 states so we can add an abbreviation column to the data frame, because the state codes aren't in it yet. Then we calculate the amount of sales for each state by grouping by state, which is exactly what we need, add the abbreviations to that sum of sales, and finally plot it. This is how the map looks:
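The exact code is only shown on screen, so here is a hedged equivalent using plotly.express (the abbreviation dictionary is truncated to a few states for brevity; extend it to all 50):

```python
import plotly.express as px

# map full state names to two-letter codes (only a few shown here)
state_abbrev = {"California": "CA", "Texas": "TX", "New York": "NY",
                "Florida": "FL", "Washington": "WA", "New Jersey": "NJ",
                "North Dakota": "ND"}

state_sales = df.groupby("State")["Sales"].sum().reset_index()
state_sales["Abbrev"] = state_sales["State"].map(state_abbrev)

# Viridis colour scale: blue = low sales, yellow = high sales
fig = px.choropleth(state_sales, locations="Abbrev", locationmode="USA-states",
                    color="Sales", scope="usa",
                    color_continuous_scale="Viridis")
fig.show()
```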
The blue areas are the ones with a low amount of sales, and the yellow area, which is California, has a high amount of sales. From this map you can see where your main sales come from and optimize accordingly. Say you have a chain of pizza stores and want to see which of your states is performing the best and which the poorest, so you can spend most of your energy fixing what doesn't work: from this you can see California is doing great, so you can leave it alone, but Texas, for example, is not performing that well, so you might allocate more marketing budget or other resources there to start getting more sales, because in Texas there are still plenty of people who eat pizza but they are not buying from you, so why is that? You can also read it the other way around: suppose this is a completely different business, say a retail store for sporting goods, and all these states already have one of your stores. You can conclude that California is performing really well, so it's probably not a good idea to open yet another store there, since that market may already be saturated; but you could go to Florida, for example, and start selling similar sporting goods there, because that market is still relatively new and not as saturated. All right, that was that. We can also create a bar graph out of it, and from it you can see most of
the total sales per state: California is doing the best and North Dakota is doing the worst. And remember that we previously showed how large each of our categories is, and did the same for our sub-categories, but we never put them in the same plot. Here we display our main product categories, Furniture, Office Supplies and Technology, and within each category its sub-categories, everything sized and organized by sales. Chairs sell the most in our Furniture category, then Tables, with smaller blocks for Bookcases and Furnishings; in Office Supplies, Storage performs the best while Envelopes and Labels perform the worst; and in our Technology category, Phones perform the best, followed by Machines, Accessories and Copiers. From this you can see that Phones is overall the best sub-category, even a little larger than Chairs. This is a much better way to display the data if you're trying to make an argument, and of course you can also do it this way.
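The code for this nested category/sub-category plot isn't shown in the transcript; one way to produce a similar figure is a treemap, for example with plotly.express (an assumption on my part, not necessarily the library used in the video):

```python
import plotly.express as px

# categories sized by total sales, split into their sub-categories
fig = px.treemap(df, path=["Category", "Sub-Category"], values="Sales")
fig.show()
```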
All right, I hope you enjoyed this project, I definitely did, and I will see you in the next video. In this part we are going to talk
about a case study in the field of predictive analytics and causal analysis. We are going to use a simple yet powerful regression technique called linear regression in order to perform causal analysis and predictive analytics. By causal analysis I mean that we are going to look into the correlations and try to figure out which features have an impact on the housing price, on the house value: which features describing the house define and cause the variation in house prices. The goal of this case study is to practice the linear regression model and to get a first feeling of how you can use a simple machine learning model to perform model training and model evaluation, and also use it for causal analysis, where you try to identify features that have a statistically significant impact on your response variable, your dependent variable. Here is the step-by-step process that we are going to follow in
order to find out which features define the Californian house values. First we are going to understand the set of independent variables that we have and the response variable. For our multiple linear regression model we will see which techniques we need and which Python libraries we need to load in order to conduct this case study, so first we load all these libraries and understand why we need them. Then we conduct data loading and data preprocessing; this is a very important step, and I deliberately didn't want to give you clean data, because in a real hands-on data science job you won't get clean data: you will get dirty data containing missing values and outliers, and those are things you need to handle before you proceed to the actual fun part, which is the modeling and the analysis. Therefore we will do missing data analysis and remove the missing data from our Californian house price data, and conduct outlier detection: we will identify outliers and learn different visualization techniques in Python that you can use to find them and then remove them from your data. Then we perform data visualization, exploring the data with different plots to learn more about it and about these outliers, combining different statistical techniques with Python. Then we do correlation analysis to identify potentially problematic features, which is something I would suggest you do regardless of the nature of your case study, to understand what kind of variables you have, what the relationship between them is, and whether you are dealing with problematic variables. Then we move to the fun part, performing the multiple linear regression in order to do the causal analysis, which means identifying the features of the Californian house blocks that define the value of the Californian houses. Finally, we will very quickly do another implementation of the same multiple linear regression, to give you not one but two different ways of conducting it, because linear regression can be used not only for causal analysis but also as a standalone, common machine learning regression model; therefore I will also show you how to use scikit-learn as a second way of training and then predicting the Californian house values. So, without further ado, let's get started. Once you become a data scientist
or machine learning researcher or machine learning engineer, there will be hands-on data science projects where the business comes to you and says: here we have this data, and we want to understand which features have the biggest influence on this other factor. In our case study, let's assume we have a client interested in identifying the features that define the house price. Maybe it's someone who wants to invest in houses, someone interested in buying houses, perhaps renovating and then reselling them for a profit, or maybe someone in the long-term investment market, where people buy real estate as an investment, hold it for a long time and sell it later, or for some other purpose. The end goal in this specific case is to identify the features of the house that make it priced at a certain level: the features of the house that cause its price and value. We are going to use a very popular dataset that is available on Kaggle and originally comes from scikit-learn, called California Housing Prices. I'll make sure to put the link to this dataset in my GitHub account, under the repository dedicated to this case study, and I will also point out additional links you can use to learn more about it. This dataset is derived from the 1990 US Census, using one row per census block group. A block group is the smallest geographical unit for which the US Census Bureau publishes sample data; a block group typically has a population of 600 to 3,000 people. A household is a group of people residing within a single home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation
Resorts so um let's now look into uh the variables that are available in this specific data set so uh what we have here is the med
in which is the median income in block group so uh this um touches the uh financial side and uh Financial level of
the uh block uh block of households then we have House age so this is the median house age in the block group uh then we have average
rooms which is the average number of rooms uh per household then we have average bedroom which is the average number of bedrooms per
household then we have population which is the uh blog group population so that's basically like we just saw that's the number of people who live in that
block then we have a uh o OU uh which is basically the average number of household members uh then we have latitude and longitude which are the latitude and
longitude of this uh block group that we are looking into so as you can see here we are dealing with aggregated data so we don't have the uh
data per household; rather, the data is calculated, averaged and aggregated per block. This is very common in data science when we want to reduce the dimension of the data, have sensible numbers, and create cross-sectional data. Cross-sectional data means we have multiple observations with data for a single time period; in this case the block is the aggregation unit. We have already learned, in the theory lectures, the idea of the median: there are different descriptive measures we can use to aggregate data, one is the mean and another is the median, and oftentimes, especially if we are dealing with a skewed distribution, one that is not symmetric but right-skewed or left-skewed, we need to use the median, because the median is then a better representation of the scale of the data than the mean. In this case we will soon see, when visualizing the data, that we are indeed dealing with skewed data. So this is a very simple, basic dataset with not too many features, a great way to get your hands on an actual machine learning use case; we will keep it simple, yet learn the basics and the fundamentals well, so that learning more difficult and more advanced
models will be much easier for you. Let's now get into the actual coding part. Here I will be using Google Colab, and I will share the link to this notebook, together with the data, in my Python for Data Science repository, so you can use it to follow this tutorial with me. We always start with importing libraries. We could run a linear regression manually, without libraries, using matrix multiplication, but I would suggest you not do that; you can do it for fun, or to understand the matrix multiplication and the linear algebra behind linear regression. If you want to get hands-on and use linear regression the way you would in your day-to-day job, you will instead use a library such as scikit-learn, or the statsmodels API. To cover this topic and get hands-on, I decided to showcase this example not only in one library, scikit-learn, but also in statsmodels. The reason is that many people use linear regression just for predictive analytics, and for that scikit-learn is the go-to option; but if you want to use linear regression for causal analysis, to identify and interpret the independent variables that have a statistically significant impact on your response variable, then you need another library, a very handy one for linear regression, called statsmodels.api, from which you import the sm functionality, and that will help you do exactly that. Later on we will see how nicely this library provides the output exactly as you would learn in a traditional econometrics or introduction-to-linear-regression class. I'm going to give you all this background information like no one before, and we're going to interpret and learn everything so that you start your machine learning journey in a proper, high-quality way. In
this case, the first thing we import is the pandas library, as pd, and then the NumPy library as np. We need pandas to create a pandas DataFrame, to read the data, and then to perform data wrangling, identifying the missing data and outliers, so common data wrangling and preprocessing steps. Then we use NumPy, which is commonly used whenever you are visualizing data or dealing with matrices or arrays, so pandas and NumPy are used side by side. Then we use matplotlib, specifically pyplot; this library is very important when you want to visualize data. Then we have seaborn, another handy data visualization library in Python. Whenever you want to visualize data in Python, matplotlib and seaborn are two very handy libraries you must know; if you like a cooler undertone of colors, seaborn will be your go-to option, because the visualizations you create are more appealing than plain matplotlib, but the underlying way of working, plotting scatter plots, lines, or heat maps, is the same. Then we have statsmodels.api, the library from which we import sm, the linear regression model we will use for our causal analysis. I'm also importing, from scikit-learn's linear_model, the LinearRegression model, which is basically similar; you can use either, but scikit-learn is the common way of working with machine learning models. Whenever you are doing predictive analytics, so you are not using the data to identify features that have a statistically significant impact on the response variable, features that influence and cause the dependent variable, but rather you just want to train the model on this data and then test it on unseen data, you can use scikit-learn. And scikit-learn is something you will use not only for linear regression but also for other machine learning models: think of KNN, logistic regression, random forests, decision trees, boosting techniques such as LightGBM, and clustering techniques like K-means and DBSCAN; anything that fits into this category of traditional machine learning models you will find there. Therefore I didn't want to limit this tutorial to statsmodels only, which we could do if we wanted the case study to be purely about linear regression, but instead I wanted to also showcase the usage of scikit-learn, because scikit-learn is something you can use beyond linear regression, for all these other kinds of machine learning models; and given that this course is designed to introduce you to the world of machine learning, I thought we would combine this with scikit-learn, something you are going to see time and time again when using Python for machine learning. Then I'm also importing train_test_split from scikit-learn's model_selection, so that we can split our data into train and test sets; the full set of imports is pulled together in the sketch below.
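A minimal sketch of those imports as they are described:

```python
import pandas as pd                      # data loading and wrangling
import numpy as np                       # arrays and matrices
import matplotlib.pyplot as plt          # plotting
import seaborn as sns                    # statistical visualizations
import statsmodels.api as sm             # OLS with full statistical output (causal analysis)
from sklearn.linear_model import LinearRegression     # prediction-oriented linear regression
from sklearn.model_selection import train_test_split  # train/test splitting
```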
Now, before we move on to the actual training and testing, we need to first load our data. Therefore,
what I did was to put this housing.csv data into a folder in Google Colab, under sample data. That's the data you can download when you go to that specific page: you download the housing data (roughly 409 KB), and that's exactly what I downloaded and then uploaded here in Google Colab, so housing.csv sits in this folder. I copy its path and create a variable that holds it, so file_path is the string variable holding the path of the data. Then I take this file_path and put it into pd.read_csv, which is a function we can use to load data; pd stands for pandas, the short alias, and read_csv is the function
we take from the pandas library, and within the parentheses we pass file_path.
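A minimal sketch, assuming the file was uploaded to Colab's sample_data folder (the exact path depends on where you place the CSV):

```python
file_path = "/content/sample_data/housing.csv"  # hypothetical path to the uploaded CSV
data = pd.read_csv(file_path)
```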
If you want to learn more about these basics, variables, and the different data structures, some basic Python for data science, then, to keep this tutorial structured, I will not cover that here; feel free to check the Python for Data Science course, and I will put the link in the comments below so you can learn that first and then come back to this tutorial to learn how to use Python in combination with linear regression. The first thing that I tend to do, before moving on to the actual execution
stage, is to look into the data and perform data exploration. What I tend to do is look at the data fields, the names of the variables available in the data, and you can do that with data.columns; this lists the columns in your data, which are the names of your data fields.
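In code, that is simply:

```python
print(data.columns)  # the column (data field) names
```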
Let's go ahead and run it. We see that we have longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
population, so basically the number of people living in those households and houses, then households, median_income, median_house_value, and ocean_proximity. You might notice that the names of these variables are a bit different from the official documentation of the California housing data: the naming is different, but the underlying explanation is the same; it is just represented with nicer naming. It is a common thing in Python, when dealing with data, to have these underscores in the names, so we have housing_median_age, which in the documentation is called HouseAge, a bit different, but the meaning is the same: it is still the median house age in the block group. One thing you can also notice is that the official documentation does not have one extra variable that we have here, ocean_proximity; this describes the closeness of the house to the ocean, which of course for some people can definitely mean an increase or decrease in the house price. So we have all these variables, and the next thing
I tend to do is look into the actual data, and one thing we can do is simply look at the top 10 rows instead of printing the entire data frame. When we execute this part of the code, you can see the top 10 rows of our data.
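For example:

```python
data.head(10)  # first 10 rows of the DataFrame
```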
We have the longitude, the latitude, and the housing median age, where you can see values like 41, 21, and 52 years, basically the median age of the houses per block. Then we have the total number of rooms: we see that in one block the houses have a total of about 7,099 rooms, so we are already seeing data that consists of large numbers, which is something to take into account when dealing with machine learning models, and especially with linear regression. Then we have total_bedrooms, and then population, households, median_income, median_house_value, and ocean_proximity. One thing you can see right off the bat is that longitude and latitude have some unique characteristics: longitude is negative and latitude is positive, but that's fine for linear regression, because what it basically looks at is whether a variation in certain independent variables, in this case
longitude and latitude, causes a change in the dependent variable. Just to refresh our memory about what linear regression will do here: we are dealing with multiple linear regression because we have more than one independent variable; the independent variables are the different features that describe the house, except the house price, because median house value is the dependent variable. That is basically what we are trying to figure out: we want to see which features of the house cause, and so define, the house price; we want to identify the features that cause a change in our dependent variable, and specifically what the change in the median house value is if we apply a one-unit change to an independent feature. With multiple linear regression, as we learned in the theory lectures, what the model does during causal analysis is hold all the other independent variables constant and then investigate, for a specific independent variable, what a one-unit increase in that variable does to the dependent variable. So if we, for instance, change the housing median age by one unit, what will be the corresponding change in the median house value, keeping everything else constant?
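Written out in the usual notation, the model and the "one unit, all else constant" interpretation look like this (with y the median house value and x_1 through x_p the house features):

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\frac{\partial\, \mathbb{E}[y \mid x]}{\partial x_j} = \beta_j
```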
That is basically the idea behind multiple linear regression applied to this specific use case. Here, what we also want to
do is find out the data types and learn a bit more about our data before proceeding to the next step, and for that I tend to use the info function in pandas: given that the data is a pandas DataFrame, I just call data.info(), and this shows the data type and the number of non-null values per variable.
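That is:

```python
data.info()  # column data types and non-null counts
```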
As we already noticed from the header, and as is confirmed here, ocean_proximity is a variable that is not numeric: you can see values like NEAR BAY which, unlike all the other values, are represented by strings. This is something we need to take into account, because later, when we do the data preprocessing and actually run the model, we will need to do something with this specific variable; we need to process it. For the rest we are dealing with numeric variables: longitude, latitude and all the other variables, including our dependent variable, are numeric (float64). The only variable that needs to be taken care of is ocean_proximity, which, as we will also see later, is a categorical string variable, and what this basically
means is that it has different categories. Let's actually check that very quickly and look at all the unique values for this variable: we take the name of the variable, copying it from the overview here, and call unique, which should give us the unique values of this categorical variable.
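That is:

```python
data["ocean_proximity"].unique()  # distinct categories of the string variable
```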
So here we go: we have five different unique values for this categorical string variable. This means that ocean_proximity can take five different values: NEAR BAY, less than one hour from the ocean (<1H OCEAN), INLAND, NEAR OCEAN, and ISLAND. What this means is that we are dealing with a feature that describes the distance of the block from the ocean,
and the underlying idea is that maybe this specific feature has a statistically significant impact on the house value, meaning it might be possible that for some people, in certain areas or countries, living near the ocean increases the value of the house. If there is huge demand for houses near the ocean, so people prefer to live near the ocean, then most likely there will be a positive relationship; if there is a negative relationship, it means that in that area, in California for instance, people do not prefer to live near the ocean, and houses in areas further from the ocean will have higher values. This is something we want to figure out with this linear regression: we want to understand which features define the value of the house, so we can say that if a house has certain characteristics, its price will most likely be higher or lower. Linear regression helps us not only to understand what those features are, but also how much higher or lower the value of the house will be if it has certain characteristics, or if we increase a certain characteristic by one unit. So
next we are going to look into the missing data. In order to have a proper machine learning model we need to do some data preprocessing, and for that we need to check for missing values in our data and understand the amount of NaN values per data field; this will help us understand whether we can simply remove those missing values or whether we need to do imputation. Depending on the amount of missing data, we can decide which of those solutions to take.
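A minimal sketch of that check, including the percentage view mentioned next:

```python
# absolute number of missing values per column
print(data.isnull().sum())

# the same, expressed as a percentage of all rows
print(data.isnull().sum() / len(data) * 100)
```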
Here we can see that we don't have any null values for longitude, latitude, housing median age, or any of the other variables, except one independent variable: total_bedrooms. Out of all the observations we have, the total_bedrooms variable has 207 cases where we do not have the corresponding information. When it comes to representing these numbers as percentages, which is something you should do as your next step, we can see that out of the entire dataset only about 1% of total_bedrooms is missing. This is really important, because simply looking at the raw count of missing observations per data field is not very helpful: you cannot tell how much of the data is missing in relative terms. If for a certain variable 50% or 80% is missing, it means that for the majority of your house blocks you don't have that information, and including it will not be beneficial for your model, nor accurate; it will result in a biased model, because if for the majority of observations you have no information and for certain observations you do, you will automatically skew your results and get biased results. Therefore, if a specific variable is missing for the majority of your dataset, I would suggest you simply drop that independent variable. In this case we have just one percent of the house blocks missing that
information, which gives me confidence to rather keep this independent variable and just drop the observations that have no total_bedrooms information. Another solution, instead of dropping observations or the entire independent variable, is to use some sort of imputation technique. This means we try to find a way to systematically fill in a replacement for the missing value: we can use mean imputation, median imputation, or more advanced model-based statistical or econometric approaches. For now this is out of the scope of this problem, but I would say: look at the percentage of observations for which the independent variable has missing values; if it is low, say less than 10%, and you have a large dataset, you should be comfortable dropping those observations, but if you have a small dataset, say only 100 observations, and for them 20% or 40% is missing, then consider imputation, so try to find values that can be used to replace those missing
values. Once we have this information and we have identified the missing values, the next thing is to clean the data. What I'm doing here is taking the data we have and using the dropna function, which drops the observations where the value is missing: I'm dropping all observations for which total_bedrooms has a null value, so I'm getting rid of my missing observations. After doing that, I check whether I actually got rid of them: printing data.isnull().sum(), so summing up the number of missing (NaN) values per variable, shows that I no longer have any missing observations, so I successfully deleted them all.
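A minimal sketch of that cleaning step:

```python
# drop the rows where total_bedrooms is missing, then re-check
data = data.dropna(subset=["total_bedrooms"])
print(data.isnull().sum())
```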
Now the next step is to describe the data through some descriptive statistics and through data visualization. Before moving on to the causal analysis or predictive analysis in any traditional machine learning approach, try to first look into the data, try to understand it and see whether you notice some patterns: what is the mean of the different numeric data fields, do you have certain categorical values that cause unbalanced data? Those are things you can discover early on, before moving on to model training and testing and blindly believing the numbers. Data visualization techniques and data exploration are a great way to understand the data you have before using it to train and test
a machine learning model. Here I'm using the traditional describe function of pandas, data.describe(), which gives me the descriptive statistics of my data.
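That is simply:

```python
data.describe()  # count, mean, std, min, quartiles, max per numeric column
```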
Here we can see that in total we have about 20,600 observations, and we also get the mean of all the variables. You can see that each variable has the same count, which basically means that
for all variables I have the same number of rows. Then I have the mean, so per variable we have its mean, and then the standard deviation, the square root of the variance; we have the minimum and the maximum, but also the 25th percentile, the 50th percentile, and the 75th percentile. Percentiles and quartiles are statistical terms we use often: the 25th percentile is the first quartile, the 50th percentile is the second quartile, or the median, and the 75th percentile is the third quartile. What this basically means is that these percentiles help us understand the thresholds when looking at the
observations that fall under the 25% mark and above it. The standard deviation helps us interpret the variation in the data at the unit, so the scale, of that variable. In this case the variable is median house value: the mean is approximately 206,000, so roughly 206K, and the standard deviation is 115K. What this means is that in the dataset we will find blocks whose median house value is around 206K plus 115K, which is about 321K, so there will be blocks where the median house value is around 321K, and there will also be blocks where the median house value is around 91K, so 206K minus 115K. That is the idea behind the standard deviation: the variation in your data. So
next we can interpret the minimum and maximum of the data fields. The minimum helps you understand the smallest value you have per numeric data field, and the maximum the largest, so the range of values you are looking at. In the case of the median house value this means: what is the lowest median house value per block, and what is the highest? This helps you understand, when looking at this aggregated data, which blocks have the cheapest houses in terms of valuation and which are the most expensive blocks. We can see that the cheapest block has a median house value of about 15K, specifically 14,999, and the block with the highest valuation has a median house value of $500,001, which means that when we look at our blocks of houses, the median house value in the most expensive blocks is at most roughly 500K. The next thing I tend to do is visualize the data, and I tend to start with the dependent variable. This is
the variable of interest, the target variable or response variable, which in our case is the median house value; this will serve as our dependent variable. What I want to do is plot a histogram in order to understand the distribution of median house values: I want to see, when looking at the data, which median house values appear most frequently, and which blocks have unusual, less frequently appearing median house values. By plotting this type of plot you can see some outliers, some frequently appearing values, but also values that lie outside of the usual range, and this
will help you identify and learn more about your data and spot outliers in it. Here I'm using the seaborn library; given that I already imported the libraries earlier, there is no need to import them again. What I'm doing is setting the style to a white background with a grid, then initializing the size of the figure: plt comes from matplotlib's pyplot, and I'm setting the figure size to 10 by 6, so 10 wide and 6 high. Then we have the main plot: I'm using the histplot function from seaborn, and from the cleaned data, from which we removed the missing values, I'm picking the variable of interest, the median house value, and plotting this histogram using the forest green color. Then I set the title of the figure, Distribution of Median House Values, the x label, which is the name of the variable on the x-axis, median house value, and the y label, the name of the variable on the y-axis, and finally I call plt.show(), which means show me the figure.
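Put together, a minimal sketch of that plot (assuming the Kaggle column name median_house_value):

```python
sns.set_style("whitegrid")                 # white background with a grid
plt.figure(figsize=(10, 6))                # 10 wide by 6 high
sns.histplot(data["median_house_value"], color="forestgreen")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value")
plt.ylabel("Frequency")
plt.show()
```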
That is basically how visualization works in Python: we first set the figure size, then call the plotting function with the right variable, so we provide the data to the visualization, then we set the title, the x label and the y label, and then we say show me the visualization. If you want to learn more about these visualization techniques, make sure to check the Python for Data Science course, because that one will help you understand, slowly and in detail, how you can visualize your data. What we are visualizing here is the frequency of the
median house values in the entire dataset. This means we are looking at the number of times each median house value appears in the dataset: we want to understand whether certain median house values appear very often and whether others do not appear that often, in which case they can perhaps be considered outliers, because in our data we only want to keep the most relevant and representative data points. We want to derive conclusions that hold for the majority of our observations and not for outliers, and we will then use that representative data to run our linear regression and draw conclusions. Looking at this graph, we can see a certain cluster of median house values that appear quite often; those are the cases where the frequency is high. For instance, a median house value of about 160 to 170K appears very frequently, with a frequency above 5,000; those are the most frequently appearing median house values. You can also see, at both ends, houses whose median house value does not appear very often, so their frequency is low; roughly speaking, those are unusual house blocks and can be considered outliers, and the same holds for the blocks on the far right, where the frequency is also very low. This means that in our population of houses, the California house prices, you will most likely see blocks of houses whose median value is between, say, 70K and 300 or 350K, and anything below or above that is considered unusual: you don't often see house blocks with a median house value of less than 60 or 70K, or above 370 or 400K. Do keep in mind that we are dealing with data from the year 1990, not current prices, because nowadays California houses are much more expensive; this data comes from 1990, so do
What we can then do is use the idea of the interquartile range (IQR) to remove these outliers. What this basically means is that we look at the lowest 25 percent of values, the first quartile Q1 (the 25th percentile), and at the upper part, the third quartile Q3 (the 75th percentile). Using these two thresholds we can identify the observations, the blocks, whose median house value lies below the 25th percentile or above the 75th percentile. Basically we want to keep the middle part of our data: the blocks whose median house value is above the lowest 25 percent and below the largest 25 percent, the so-called normal and representative blocks, and remove the very small and the very large median house values. The statistical term for Q3 minus Q1 is the interquartile range; you don't strictly need to know the name, but I think it is worth understanding, because this is a very popular way of doing a data-driven removal of outliers. I select the 25th percentile using the quantile function from pandas, so I'm saying: find for me the value that splits my observations into the smallest 25 percent and the largest 75 percent when it comes to the median house value. That gives me Q1; similarly, Q3 is the value below which 75 percent of the observations fall, and we will use it to remove the very large median house values, the upper 25 percent. To calculate the interquartile range we take Q3 and subtract Q1 from it. To understand this idea of Q1 and Q3, the quartiles, a bit better, let's actually print them.
So let's remove the filtering part for now and just run it. As you can see, we find that Q1, the 25th percentile or first quartile, is equal to $119,500. What this means is that the smallest 25 percent of the observations have a median house value below $119,500, and the remaining 75 percent of our observations have a median house value above $119,500. Then Q3, the third quartile or 75th percentile, describes the threshold that separates the lowest 75 percent of median house values from the most expensive ones, the highest 25 percent. We see that this threshold is $264,700, which means that the blocks with the highest valuations, the top 25 percent of median house values, lie above $264,700.
That's what we want to remove: the observations with the smallest and the largest median house values. It is common practice with the interquartile range approach to multiply the IQR by 1.5 in order to obtain the lower bound and the upper bound, the thresholds we use to remove the blocks whose median house value is very small or very large. So we multiply the IQR by 1.5; when we subtract this value from Q1 we get our lower bound, and when we add this value to Q3 we get our upper bound. After we clean these outliers from our data we end up with a smaller data set: previously we had 20,433 observations and now we have 19,369, so we have removed roughly a thousand, or a bit over a thousand, observations from our data.
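A rough sketch of that IQR-based filtering, again assuming the cleaned DataFrame is called data_clean (the variable names are mine, for illustration):

```python
# First and third quartiles of the median house value
q1 = data_clean["median_house_value"].quantile(0.25)
q3 = data_clean["median_house_value"].quantile(0.75)
print(q1, q3)  # roughly 119,500 and 264,700 for this data

# Interquartile range and the usual 1.5 * IQR bounds
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the "normal", representative blocks
data_clean = data_clean[
    (data_clean["median_house_value"] >= lower_bound)
    & (data_clean["median_house_value"] <= upper_bound)
]
print(len(data_clean))
```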
Next, let's look into some other variables, for instance the median income. One other technique we can use to identify outliers in the data is the box plot. I wanted to showcase different approaches for visualizing the data and identifying outliers, so that you become familiar with several techniques. So let's go ahead and plot the box plot. A box plot is a statistical way to represent your data: the central box represents the interquartile range, the IQR, and its bottom and top edges indicate the 25th percentile (the first quartile) and the 75th percentile (the third quartile) respectively. The length of this box, the dark part you see here, covers the middle 50 percent of your data for the median income, and the line inside the box, the one in a contrasting color, represents the median of the data set; the median is the middle value when the data is sorted in ascending order. Then we have the whiskers in our box plot: these lines extend from the top and the bottom of the box and indicate the range of the rest of the data set excluding the outliers; they typically reach up to 1.5 times the IQR above Q3 and 1.5 times the IQR below Q1.
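A minimal sketch of that box plot, using the same hypothetical DataFrame and column names as above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
# The box spans Q1 to Q3 (the IQR), the inner line is the median,
# whiskers extend 1.5 * IQR, and points beyond them are drawn as outliers
sns.boxplot(x=data_clean["median_income"], color="forestgreen")
plt.title("Box Plot of Median Income")
plt.show()
```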
That is something we also saw just previously when we were removing the outliers from the median house value. To identify the outliers, you can quickly see all the points that lie more than 1.5 times the IQR above the third quartile, the 75th percentile. Those are blocks of houses with an unusually high median income, which is something we want to remove from our data, and we can use exactly the same approach as before for the median house value: we identify Q1, the first quartile or 25th percentile, and Q3, the third quartile or 75th percentile, compute the IQR, obtain the lower bound and the upper bound using the 1.5 scaling, and then use those bounds as filters to keep in the data only the observations whose median income is above the lower bound and below the upper bound. We are using the lower bound and the upper bound to perform double filtering: two filters in the same row, as you can see, combined with parentheses and the & operator to tell Python that, first, the observation must have a median income above the lower bound and, at the same time, it must have a median income below the upper bound. If a block, an observation in the data, satisfies both of these criteria, then we are dealing with a good, normal point and we keep it, and the result is our new data. Let's go ahead and execute this code; in this case all the outliers lie on the high end of the box plot, and we end up with clean data. I'm taking this clean data and assigning it to data, just for simplicity.
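A sketch of that double filter on the median income, with the same hypothetical variable names as before:

```python
# Same IQR logic, this time for the median income
q1 = data_clean["median_income"].quantile(0.25)
q3 = data_clean["median_income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Two conditions combined with &: keep only blocks whose median income
# lies between the lower and the upper bound
data = data_clean[
    (data_clean["median_income"] >= lower_bound)
    & (data_clean["median_income"] <= upper_bound)
]
```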
This data is now much cleaner and a better representation of the population, which is what we ideally want: we want to find out which features describe and define the house value, not based on unique and rare houses that are too expensive or that sit in blocks with very high-income residents, but based on the true, most frequently appearing data. We want to know which features define the house value for common houses and common areas, for people with average or normal income. That's what we want to find.
The next thing I tend to do, especially for regression and causal analyses, is to plot the correlation heat map. This means we compute the correlation matrix, the pairwise correlation score for each pair of variables in our data. When it comes to linear regression, one of the assumptions we learned during the theory part is that we should not have perfect multicollinearity, which means there should not be a high correlation between any pair of independent variables: knowing one should not automatically tell us the value of another independent variable. If the correlation between two independent variables is very high, we might be dealing with multicollinearity, which is something we want to avoid. A heat map is a great way to identify whether we have this type of problematic independent variables and whether we need to drop one of them, or maybe several, to ensure we end up with a proper linear regression model whose assumptions are satisfied. Here we use seaborn to plot the heat map. As you can see, the colors range from very light, almost white, to very dark green, where light indicates a strong negative correlation and very dark green a very strong positive correlation. We know that the Pearson correlation takes values between minus one and one: minus one means a very strong negative correlation and one a very strong positive correlation. The correlation of a variable with itself, for example between longitude and longitude, is equal to one, which is why the diagonal is all ones: those are the pairwise correlations of the variables with themselves. The values below the diagonal mirror the values above it, because the correlation between two variables does not depend on which one you put first: the correlation between longitude and latitude is the same as the correlation between latitude and longitude.
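A rough sketch of such a heat map with seaborn; the color map, annotation format and the numeric-column selection are choices I'm assuming here, and the original notebook may differ:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))

# Pairwise Pearson correlations between the numeric columns,
# annotated with the correlation values
corr_matrix = data.select_dtypes("number").corr()
sns.heatmap(corr_matrix, annot=True, cmap="Greens", fmt=".2f")

plt.title("Correlation Heat Map")
plt.show()
```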
Now that we have refreshed our memory, let's look at the actual numbers in this heat map. As we can see, there is a section where the independent variables have a low positive correlation with the remaining independent variables: the light green cells indicate a low positive relationship between those pairs of variables. One thing that is very interesting is the middle part of the heat map, where we have these dark cells. The numbers below the diagonal are the ones we interpret, and remember that below and above the diagonal are mirror images. Here we already see a problem, because we are dealing with variables, which will be independent variables in our model, that have a high correlation with each other. Why is this a problem? Because, as we saw in the theory section, one of the assumptions of linear regression is that we should not have a multicollinearity problem. Perfect multicollinearity means we are dealing with independent variables so highly correlated that knowing the value of one automatically tells us the value of the other. When we have a correlation of 0.93, which is very high, or 0.98, those two independent variables have an extremely strong positive relationship. This is a problem because it can cause our model to produce very large standard errors and an inaccurate, non-generalizable model, which is something we want to avoid; we want to ensure that the assumptions of our model are satisfied.
Here we are dealing with the independent variables total_bedrooms and households, which means that the number of total bedrooms per block and the number of households are highly positively correlated, and that is a problem. Ideally we want to drop one of these two independent variables, and the reason we can do that is that, given they are so highly correlated, they already explain a similar type of information; they contain a similar type of variation. Including both simply doesn't make sense: on one hand it potentially violates the model assumptions, and on the other hand it adds little value, because one already captures the variation the other one shows. So total_bedrooms basically contains similar information to households, and we might as well drop one of the two. The question is which one, and that is something we can decide by also looking at the other correlations: total_bedrooms has a high correlation with households, but total_rooms also has a very high correlation with households, so there is yet another independent variable that is highly correlated with households, and total_rooms also has a high correlation with total_bedrooms. This means we can check which variable most often has high correlations with the rest of the independent variables, and in this case the two largest numbers involve total_bedrooms: it has a correlation of 0.93 with total_rooms and, at the same time, a very high correlation of 0.98 with households. So total_bedrooms has the highest correlations with the remaining independent variables, and we might as well drop it. Before you do that, though, I would suggest one more quick visual check: look at the correlation of total_bedrooms with the dependent variable, to understand how strong a relationship it has with the response variable we are studying. We see that total_bedrooms has a correlation of only about 0.05 with the response variable, the median house value, whereas total_rooms has a much higher one, so I already feel comfortable excluding and dropping total_bedrooms from our data in order to ensure we are not dealing with multicollinearity. That's exactly what I'm doing here: I'm dropping total_bedrooms, and after doing that we no longer have total_bedrooms as a column.
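That drop is a one-liner in pandas; the column name follows this dataset:

```python
# Drop the highly collinear feature; axis=1 means drop a column, not a row
data = data.drop("total_bedrooms", axis=1)
print(data.columns)
```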
Before moving on to the actual causal analysis, there is one more step I wanted to show you, which is super important for causal analysis and introductory econometrics. When you have a string categorical variable, there are a few ways to deal with it. One easy way that you will see on the web is to map it to integer codes, which basically means transforming all these string values, NEAR BAY, <1H OCEAN, INLAND, NEAR OCEAN, ISLAND, into numbers, so that the ocean_proximity variable takes values such as 1, 2, 3, 4, 5. That is one way of doing it, but a better way when using this type of variable in linear regression is to transform the string categorical variable into what we call dummy variables. A dummy variable takes two possible values; it is a binary, Boolean variable that can be zero or one, where one means the condition is satisfied and zero means it is not. Let me give you an example. In this specific case ocean_proximity has five different values, and ocean_proximity is just a single variable. What we will do is use the get_dummies function from pandas to go from this one variable to five different variables, one per category: new variables indicating whether the block is near the bay, whether it is less than one hour from the ocean, whether it is inland, whether it is near the ocean, or whether it is on an island. Each of these will be a separate binary dummy variable taking the values zero and one, which means we go from one string categorical variable to five different dummy variables, one for each of the five categories. We then combine them with the original data and drop the ocean_proximity column.
On one hand we are getting rid of the string variable, which is problematic for linear regression when combined with the scikit-learn library, because scikit-learn cannot handle this type of data directly in a linear regression; on the other hand we are making our job easier when it comes to interpreting the results, because interpreting a linear regression for causal analysis is much easier with dummy variables than with one string categorical variable. Just to give you an example: from this string variable we create the five dummy variables you can see here. If we look at one category, say ocean_proximity_INLAND, then for all the rows where the value is zero the criterion is not satisfied, meaning the house block we are dealing with is not inland, and for all the rows where ocean_proximity_INLAND is equal to one the criterion is satisfied and we are dealing with house blocks that are indeed inland. One thing to keep in mind when transforming a string categorical variable into a set of dummies is that you always need to drop one of the categories. The reason comes from the theory: we should have no perfect multicollinearity, so we cannot include five dummy variables that are perfectly collinear. If we include all of them, then whenever we know that a block of houses is not near the bay, not less than one hour from the ocean, not inland and not near the ocean, we automatically know it must belong to the remaining category, the island category, so ocean_proximity_ISLAND must be equal to one. That is exactly the definition of perfect multicollinearity, which we want to avoid. So, to keep one of the OLS assumptions from being violated, we need to drop one of those categories.
That's exactly what I'm doing here. Let's first see the full set of categories we got: less than one hour from the ocean, inland, island, near bay, and near ocean. Let's drop one of them, say the island category. We can do that very simply with data = data.drop(...), passing the name of the variable in quotation marks and axis=1; in this way I'm dropping one of the dummy variables I created, in order to avoid violating the no-perfect-multicollinearity assumption. Once I print the columns, we can check that this column no longer appears, and here we go: we have successfully deleted that variable. Let's also get the head of the data. Now you can see that we no longer have a string in our data, but instead we got four additional binary variables out of a string categorical variable with five categories.
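A compact sketch of the whole dummy-encoding step; the category labels follow the California housing dataset, and the column prefix is just one reasonable naming choice:

```python
import pandas as pd

# One dummy column per ocean_proximity category
dummies = pd.get_dummies(data["ocean_proximity"], prefix="ocean_proximity")

# Combine with the original data and drop the string column
data = pd.concat([data.drop("ocean_proximity", axis=1), dummies], axis=1)

# Drop one category (here ISLAND) to avoid the dummy-variable trap,
# i.e. perfect multicollinearity with the intercept
data = data.drop("ocean_proximity_ISLAND", axis=1)

print(data.columns)
data.head()
```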
All right, now we are ready to do the actual work. When it comes to training a machine learning model, or a statistical model, we learned during the theory that we always, always need to split the data into a train set and a test set; that is the minimum. In some cases we also need a train, validation and test split, so that we can train the model on the training data, optimize it on the validation data to find the optimal set of hyperparameters, and then apply the fitted and optimized model to the unseen test data. We are going to skip the validation set for simplicity, especially given that we are dealing with a very simple machine learning model, linear regression, and we will split our data into train and test only. First I create a list with the names of the variables we are going to use to train the model, so a set of independent variables and a dependent variable. In our multiple linear regression the independent variables are longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we built from the categorical variable. Then I specify that the target, the response or dependent variable, is the median house value. This is the value we want to model, because we want to see which features have a statistically significant impact on the dependent variable, which features describing the houses in a block cause a change, a variation, in the median house value. So X is the data restricted to the columns with those feature names, and the target is the median house value, which is the column we select from the data.
So we are doing data filtering and selection here. What I'm using next is the train_test_split function from scikit-learn; you might recall that in the beginning we imported the model_selection module and, from sklearn.model_selection, the train_test_split function. This is a function you are going to need a lot in machine learning, because it is a very easy way to split your data. The arguments of this function are, first, the matrix or data frame that contains the independent variables, in our case X, so you fill in X; the second argument is the dependent variable, y; then we have test_size, which is the proportion of observations you want to put in the test set, and implicitly the proportion you leave for training. If you pass 0.2, your test set will be 20 percent of your entire data and the remaining 80 percent will be your training data; the function automatically understands that you want this 80/20 division. Finally, you can also set the random_state: the split is random, the data is randomly sampled from the entire data set, and to ensure that your results are reproducible, that you get the same results the next time you run this notebook, and that you and I get the same results, we fix a random state; 111 is just a number I liked and decided to use here. When we run this command, you can see that the training set size is about 15K and the test size about 3.9K, and when you look at these numbers you get a verification that you are dealing with the 80/20 split.
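A sketch of that selection and split, assuming data already contains the cleaned, dummy-encoded columns; the feature list below mirrors the one described above, and the dummy column names are the ones pd.get_dummies would produce for this dataset:

```python
from sklearn.model_selection import train_test_split

feature_names = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "population", "households", "median_income",
    "ocean_proximity_<1H OCEAN", "ocean_proximity_INLAND",
    "ocean_proximity_NEAR BAY", "ocean_proximity_NEAR OCEAN",
]

X = data[feature_names]          # independent variables
y = data["median_house_value"]   # dependent (target) variable

# 80% train / 20% test, reproducible thanks to the fixed random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)
print(X_train.shape, X_test.shape)
```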
Then we go and do the training. One thing to keep in mind is that here we are using the sm module, the one we imported from statsmodels.api. This is one library we can use to conduct our causal analysis and train a linear regression model. This library does not automatically add the first column of ones to your set of independent variables: it only looks at the features you have provided, and those are all the independent variables. But we learned from the theory that in linear regression we always add the intercept, the beta zero; if you go back to the theory lectures you can see this beta zero being added both to the simple linear regression and to the multiple linear regression. It ensures that we estimate the intercept, which here is the average median house value when all the other features are equal to zero. Therefore, given that statsmodels.api does not add this constant column for the intercept, we need to add it manually, which is why we call sm.add_constant on X_train; now our X table, our X data frame, gets a column of ones added to the features. Let me actually show you this before doing the training, because I think it is something you should be aware of.
So let's pause here: I'm going to print X_train with the constant added, and I'm also going to print the same feature data frame before adding the constant, so that you see what I mean. As you can see, the first one is just the set of all columns that form the independent variables, the features. When we add the constant, you can see that we now have an initial column of ones. This is done so that we can estimate the beta zero, the intercept, and perform a valid multiple linear regression; otherwise you don't have an intercept, and that is just not what you are looking for. The scikit-learn library does this automatically, so when you are using statsmodels.api you should add the constant, and when I use scikit-learn I do it without adding the constant. If you are wondering why we use this specific model, as we already discussed, just to refresh your memory: we are using statsmodels.api because it has the nice property of showing a summary of your results, your p-values, your t-tests, your standard errors, which is exactly what you want when you are performing a proper causal analysis and you want to identify the features that have a statistically significant impact on your dependent variable. If you are using a machine learning model, including linear regression, only for predictive analytics, then you can use scikit-learn without worrying about statsmodels.api. So this is about adding the constant.
Now we are ready to actually fit, or train, our model. What we need to do is use sm.OLS; OLS is the ordinary least squares estimation technique we also discussed as part of the theory. We first provide the dependent variable, y_train, and then the feature set, X_train with the constant added. Then we call .fit(), which means: take the OLS model, use y_train as my dependent variable and X_train with the constant as my set of independent variables, and fit the OLS linear regression on this specific data. If you are wondering about the difference between train and test, make sure to revisit the theory lectures, because there I go into the concepts of training and testing and how we divide the data in detail. The y and X, as we have already discussed during this tutorial, simply mark the distinction between the independent variables, defined by X, and the dependent variable, defined by y: y_train and y_test are the dependent variable for the training and test data, and X_train and X_test are the training and test features. We use X_train and y_train to fit the model, to learn from the data; then, when it comes to evaluating the model, we take the fitted model, which has learned from both the dependent variable and the independent variables, apply it to the unseen data X_test, obtain the predictions, and compare them to the true values y_test, to see how different y_test is from the predictions for this unseen data and to evaluate how well the model manages to predict median house values it has not seen. That is just background information and a refresher. In this case we are simply fitting the model on the training dependent variable and the training independent variables with the constant added, and then we are ready to print the summary.
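A sketch of the statsmodels workflow just described; the variable names are the hypothetical ones used in the split sketch above:

```python
import statsmodels.api as sm

# statsmodels does not add the intercept column itself,
# so we prepend a column of ones to the features
X_train_const = sm.add_constant(X_train)

# Ordinary least squares: dependent variable first, then the features
model = sm.OLS(y_train, X_train_const)
model_fitted = model.fit()

# Full summary: coefficients, standard errors, t-statistics, p-values,
# R-squared, F-statistic, AIC/BIC, and so on
print(model_fitted.summary())
```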
Now let's interpret those results. The first thing we can see is that all the coefficients, all the independent variables, are statistically significant. How can I say this? If we look here, we can see the column of p-values; this is the first thing you need to look at when you get the results of a causal analysis with linear regression. Here the p-values are very small. Just to refresh our memory, the p-value is the probability of obtaining a test statistic at least this extreme purely by random chance, that is, of seeing a statistically significant-looking result just by chance rather than because the null hypothesis is actually false and should be rejected. That is one thing, but the summary gives us much more. The first thing you can do is verify that you used the correct dependent variable: you can see here that the dependent variable is the median house value. The model used to estimate the coefficients is OLS, and the method is least squares, which is simply the technique of minimizing the sum of squared residuals. The date we ran this analysis is the 26th of January 2024. We have the number of observations, which is the number of training observations, the 80 percent of our original data. We have the R-squared, which is the metric that showcases the goodness of fit of your model: it is commonly used in linear regression to gauge how well your model fits the data with the regression line, its maximum is one and its minimum is zero. Here it is 0.58, approximately 0.59, which means that the independent variables you have included are able to explain about 59 percent of the entire variation in the response variable, the median house value. What does this mean? On one hand, you have a reasonable amount of information: anything above 0.5 is quite good, and it means you can explain more than half of the variation in the median house value. On the other hand, it also means there is roughly 40 percent of the variation, information about the house values, that your data does not capture, so you might consider looking for additional independent variables to add on top of the existing ones, in order to increase the amount of variation your model can explain. The R-squared is, in this sense, the standard way to describe the quality of your regression model. Another thing we have is the adjusted R-squared; here the adjusted R-squared and the R-squared are the same, about 0.59, which usually means the number of features you are using is fine. Once you overwhelm your model with too many features, you will notice the adjusted R-squared diverging from the R-squared: the adjusted R-squared helps you understand whether your model performs well only because you keep adding variables, or because those variables really contain useful information, since the plain R-squared automatically increases as you add more independent variables even when they are not useful and only add complexity and possibly overfit your model without providing additional information. Then we have the F-statistic, which corresponds to the F-test. The F-test comes from statistics; you don't strictly need to know it, but check out the Fundamentals of Statistics course if you want to, because it tests whether all of these independent variables together help to explain your dependent variable, the median house value. If the F-statistic is very large, or the p-value of the F-statistic is very small, essentially 0.000, it means all your independent variables are jointly statistically significant: together they help explain the median house value, which means you have a good set of independent variables. Then we have the log-likelihood, not super relevant in this case, and the AIC and BIC, which stand for Akaike information criterion and Bayesian information criterion. Those are not necessary to know for now, but as you advance in your machine learning career it may be useful to understand them at a high level; for now, think of them as values describing the information you gain when adding this set of independent variables to your model. This is optional; ignore it for now if you don't know it.
Okay, let's now go into the fun part. In this part of the summary table we first have the set of independent variables: our constant, which is the intercept, then longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we created. Then we have the coefficients corresponding to those independent variables; these are the beta-zero-hat, beta-one-hat, beta-two-hat and so on, the parameters of the linear regression model that our OLS method has estimated from the data we provided. Before interpreting these variables, the first thing to do, as I mentioned in the beginning, is to look at the p-value column, which shows which independent variables are statistically significant. The table you get from statsmodels.api is usually read at the 5 percent significance level, so the alpha, the threshold of statistical significance, is 5 percent, and any p-value smaller than 0.05 means you are dealing with a statistically significant independent variable. The next thing you can see, to the left, is the t-statistic: each p-value is based on a t-test, and the t-test, as we learned during the theory (you can also check the Fundamentals of Statistics course from LunarTech for a more detailed treatment), tests the hypothesis of whether each of these independent variables individually has a statistically significant impact on the dependent variable. Whenever this t-test has a p-value smaller than 0.05, you are dealing with a statistically significant independent variable, and in this case we are lucky: all our independent variables are statistically significant. The next question is whether the effect is positive or negative, which you can see from the signs of the coefficients: longitude has a negative coefficient, latitude a negative coefficient, housing median age a positive coefficient, and so on. A negative coefficient means that the independent variable causes a negative change in the dependent variable. More specifically, let's look at, say, total_rooms, whose coefficient is -2.67. It means that if we increase the total number of rooms by one additional unit, one more room added to total_rooms, then the median house value decreases by $2.67. Now you might wonder how this is possible. First of all, the coefficient is quite small, so the relationship is not very strong; the magnitude of this coefficient is modest. On the other hand, you can argue that at some point adding more rooms just doesn't add any value, and in some cases it even decreases the value of the house; at least that is what this data suggests. So if there is a negative coefficient, a one-unit increase in that specific independent variable, all else constant, results in a decrease in the dependent variable, in this case a $2.67 decrease in the median house value for total_rooms. We also refer to this as the ceteris paribus assumption in econometrics, which means everything else held constant. So, one more time, to make sure we are clear: if we add one more room to the total number of rooms, then the median house value decreases by $2.67, provided that the longitude, latitude, housing median age, population, households, median income and all the other characteristics stay the same. Now let's look at the opposite case, a coefficient that is positive and large, which is the housing median age. It means that if we have two house blocks with exactly the same characteristics, the same longitude and latitude, the same total number of rooms, population, households and median income, the same distance from the ocean, but one of them has one additional year of housing median age, so it is one year older, then the median house value of that block is higher by $846. The block with one more year of median age has an $846 higher median house value compared to the one that is identical except for a median age one year lower. So one additional year in the median age results in an $846 increase in the median house value, everything else constant. This covers the idea of negative versus positive coefficients and their magnitudes.
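As a tiny worked illustration of that ceteris paribus reading, here is the arithmetic, using the rounded coefficient values quoted above purely for illustration:

```python
# Rounded coefficients quoted from the summary above (illustrative only)
coef_total_rooms = -2.67          # dollars per additional room
coef_housing_median_age = 846.0   # dollars per additional year of median age

# Two otherwise identical blocks, differing by +10 rooms and +1 year of age
delta_value = coef_total_rooms * 10 + coef_housing_median_age * 1
print(delta_value)  # 846 - 26.7 = 819.3 dollars higher predicted median value
```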
Now let's look at one dummy variable, explain the idea behind it and how to interpret it; it is a good way to understand how dummy variables are read in the context of linear regression. One of the independent variables is ocean_proximity_INLAND, and its coefficient is about -2.108e+05, which simply means approximately -210.8K. What this means is the following: suppose we have two blocks of houses with exactly the same characteristics, the same longitude and latitude, the same housing median age, the same total number of rooms, population, households and median income, with a single difference: one block is located inland and the other is not. In this case the reference, the category we removed earlier, was the island category, as you might recall. So if the block of houses is inland, its median house value is on average about 210K lower than that of a block with exactly the same characteristics that is not inland, for instance one that is on an island. With dummy variables there is always an underlying reference category, the one you deleted from the string categorical variable, and you interpret each dummy relative to that category. This might sound complex, but it really isn't; it is just a matter of practice and of understanding what a dummy variable represents: the criterion is either satisfied or it is not. In this specific case, if you have two blocks of houses with exactly the same characteristics, and one block is inland while the other is not, say it is on an island, then the inland block will on average have a $210,000 lower median house value than the island block, which kind of makes sense, because in California people might prefer living in an island location, so those houses might be in higher demand than houses in inland locations. To summarize: the longitude has a statistically significant impact on the median house value; the latitude and the housing median age cause statistically significant differences in the median house value when they change; the total number of rooms has an impact, and so do the population, the households, the median income and the proximity to the ocean. This is because all their p-values are essentially zero, smaller than 0.05, so they all have a statistically significant impact on the median house value in the California housing market. As for interpretation, we have covered only a few coefficients, for the sake of simplicity and to keep this case study from taking too long: we interpreted the housing median age and the total number of rooms, but you can also interpret the population and the median income, and we interpreted one of the dummy variables, but feel free to interpret all the others as well. By doing this you can even build an entire case-study paper in which you explain, in one or two pages, the results you obtained, and this will showcase that you understand how to interpret linear regression results. Another thing I would suggest is to comment on the standard errors, so let's now look into them.
We can see that the standard errors we are getting are huge, and this is a direct result of the fourth assumption being violated. This case study is important and useful precisely because it showcases what happens when some of your assumptions are satisfied and some are violated. In this specific case the assumption that the errors have a constant variance is violated: we have a heteroskedasticity issue, and we see it reflected in our results. This is a good example of a situation where, even without formally checking the assumptions, the very large standard errors already hint that heteroskedasticity is most likely present and that our homoskedasticity assumption is violated. Keep this idea of large standard errors in mind, because we will see that it also becomes a problem for the performance of the model, and that we end up with a large prediction error because of it. One more comment on the total rooms and the housing median age: in some cases the linear regression results might not seem logical, but sometimes there actually is an underlying explanation, or maybe your model is simply overfitting or biased; that is also possible, and it is something you can investigate by checking your OLS assumptions. Before going to that stage, I wanted to briefly show you the idea of predictions.
We have now fitted our model on the training data and we are ready to perform predictions. We can take our fitted model and use the test data, X_test, to predict median house values for the blocks of houses for which we are not providing the corresponding median house price. On this unseen data we apply the model we have already fitted, obtain the predicted median house values, and then compare these predictions to the true median house values, which we have but are not yet exposing, to see how good a job our model does of estimating the unknown median house values for the test data: the blocks for which we provided the characteristics in X_test but not y_test. As with training, we add a constant with this library, and then we call model_fitted.predict on the test data; the result is the test predictions. Once we do this and print them, you can see that we get a list of house values: the predicted median house values for the blocks of houses included in the test data, the 20 percent of our entire data set.
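A sketch of that prediction step with statsmodels, using the same hypothetical variable names as in the fitting sketch above:

```python
import statsmodels.api as sm

# The test features need the same constant column as the training features
X_test_const = sm.add_constant(X_test)

# Predicted median house values for the unseen blocks
test_predictions = model_fitted.predict(X_test_const)
print(test_predictions.head())
```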
Like I mentioned just before, in order to ensure that your model is performing well you need to check the OLS assumptions. During the theory section we learned that there are a few assumptions your model and your data should satisfy for OLS to provide unbiased and efficient estimates. Efficient means the estimates are precise and their standard errors are low, something we also see in the summary results: the standard error measures how efficient your estimates are, how much the coefficients shown in this table could vary. If a coefficient could swing across a very wide range, its standard error will be very large, which is a bad sign; if you are dealing with a precise estimation, the standard error will be low. Unbiased estimates mean that your estimates are a true representation of the pattern between each independent variable and the response variable. If you want to learn more about this idea of bias, unbiasedness and efficiency, make sure to check the Fundamentals of Statistics course at LunarTech, because it explains these concepts clearly and in detail; here I'm assuming you know them, or at least I suggest you know them at a high level. Now let's quickly check the OLS assumptions. The first assumption is the linearity assumption: your model should be linear in parameters.
One way of checking that is by using your fitted model and its predictions: you take y_test, the true median house values for your test data, and the test predictions, the predicted median house values for that unseen data, plot them against each other, and also plot the line you would get in the ideal situation where the model makes no error and returns the exact true values. Then you look at how linear this relationship actually is. If the pattern of observed versus predicted values, where observed means the real test y and predicted means the test predictions, is roughly linear and close to that perfect line, then assumption one, the linearity assumption, is satisfied, and you can say that your data and your model are indeed linear in parameters.
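A sketch of that observed-versus-predicted check; the diagonal reference line is the ideal no-error case:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(y_test, test_predictions, alpha=0.3)

# Reference line: points would lie on it if the predictions were perfect
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="red")

plt.xlabel("Observed median house value (y_test)")
plt.ylabel("Predicted median house value")
plt.title("Observed vs Predicted: linearity check")
plt.show()
```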
Then we have the second assumption, which states that your sample should be random; this basically translates into the expectation of your error terms being equal to zero. One way of checking this is simply to take the residuals from your fitted model, model_fitted.resid, and compute their average, which is a good estimate of the expectation of the errors; the residuals are the estimates of your true error terms. Here I just round the result to two decimals. If this average error, the estimate of the errors that we refer to as residuals, is equal to zero, which is the case here, it means that the expectation of the error terms, or at least its estimate based on the residuals, is indeed zero. Another way of checking this second assumption, that the model is based on a random sample and therefore the expectation of the error terms is zero, is to plot the residuals versus the fitted values: we take the residuals from the fitted model, compare them to the fitted values of the model, and look at this scatter plot, checking whether the pattern is symmetric around zero. You can see that the zero line runs right through the middle of the pattern, which means that on average the residuals are centred around zero, so the mean of the residuals is equal to zero, which is exactly what we calculated before, and therefore we can say we are indeed dealing with a random sample.
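A sketch of both of those checks, the residual mean and the residuals-versus-fitted plot, using the fitted statsmodels model from above:

```python
import matplotlib.pyplot as plt

# Residuals are the in-sample estimates of the error terms
residuals = model_fitted.resid
print(round(residuals.mean(), 2))  # should print 0.00 if assumption 2 holds

plt.figure(figsize=(8, 6))
plt.scatter(model_fitted.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color="red")  # residuals should scatter symmetrically around zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted Values")
plt.show()
```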
This plot is also super useful when it comes to the fourth assumption, which we will get to a bit later. For now let's check the third assumption, the assumption of exogeneity. Exogeneity means that each of our independent variables should be uncorrelated with the error terms: there is no omitted variable bias and no reverse causality, which means the independent variable has an impact on the dependent variable but not the other way around; the dependent variable should not cause the independent variable. There are a few ways of checking this. One straightforward way is to compute the correlation coefficient between each independent variable and the residuals obtained from your fitted model, the best estimates of your error terms. That is a simple, quick technique to understand whether there is a correlation between your independent variables and your error terms.
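A sketch of that quick correlation check, looping over the hypothetical training columns used earlier:

```python
# Correlation between each feature and the residuals should be close to zero
for column in X_train.columns:
    corr = X_train[column].corr(model_fitted.resid)
    print(f"{column}: {corr:.3f}")
```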
Another, more advanced way, a bit more towards the econometric side, is to use the Durbin-Wu-Hausman test. This is a more formal econometric test to find out whether the exogeneity assumption is satisfied or whether you have endogeneity, which means that one or more of your independent variables is potentially correlated with your error terms. I won't go into the details of this test; I'll put some explanation here, and feel free to check any introductory econometrics course to learn more about the Durbin-Wu-Hausman test for the exogeneity assumption. The fourth assumption is homoskedasticity, which states that the error terms should have a constant variance: when we look at the variation the model is making across different observations, that variation should be roughly constant. If instead we see observations for which the residuals are quite small and others for which they are quite large, as in this figure, we are dealing with heteroskedasticity, which means the homoskedasticity assumption is violated: our error terms do not have a constant variance across all observations, and the spread differs from observation to observation. When we have heteroskedasticity we should consider somewhat more flexible approaches, like GLS, FGLS or GMM, all somewhat more advanced econometric methods.
this all but for machine learning traditional machine learning site by using the psychic learn so uh in here um
I'm using the um standard scaler function in order to uh scale my data because we saw uh in the summary of the
table um that we got from the stats uh Mor API that our data is at a very high scale because the uh median house values are those large numbers the uh age uh
the median age of the house is in this very large numbers that's something that you want to avoid when you are using the linear regression as a Predictive Analytics model when you are using it for interpreting purposes then you
should keep the skilles because it's easier to interpret those values and to understand uh what is this difference in the median price uh of the h house when
you compare different characteristics of the blocks of houses but when it comes to using it for Predictive Analytics purposes which means that you really care about the accuracy of your
predictions then you need to uh scale your data and ensure that your data is standardized one way of doing that is by using this standard scaler function uh in the psyit learn.
The way I do it is that I initialize the scaler by calling StandardScaler(), which I just imported from the scikit-learn library, and then I take the scaler and call fit_transform on X_train, which basically means: take the independent variables and scale and standardize them. Standardization simply means we standardize the data we have to ensure that some large values do not wrongly influence the predictive power of the model, so the model is not confused by the large numbers and does not pick up a spurious variation, but instead focuses on the true variation in the data, namely how much a change in one independent variable causes a change in the dependent variable. Given that we are dealing with a supervised learning algorithm, X_train_scaled will then contain our standardized training features, so the independent variables, and X_test_scaled will contain our standardized test features, the unseen data that the model will not see during training but only during prediction.
Then we will also use y_train: y_train is the dependent variable in our supervised model and corresponds to the training data. We first initialize the linear regression, so the LinearRegression model from scikit-learn; this is just the empty linear regression model. Then we take this initialized model and fit it on the training data, so X_train_scaled, the scaled training features, and the dependent variable from the training data, y_train. Do note that I'm not scaling the dependent variable; this is common practice, because you don't want to standardize your dependent variable, you want to ensure that your features are standardized. What you care about is the variation in your features, so that the model doesn't get confused when it's learning from those features and when it looks at the impact of those features on your dependent variable.
So I am fitting the model on the training data, the features and the dependent variable, and then I'm using this fitted model, lr, which has already learned from those features and the dependent variable during supervised training. I'm then using X_test_scaled, the standardized test data, to perform the prediction, so to predict the median house values for the test data, the unseen data. You can notice that nowhere am I using y_test; I'm keeping y_test to myself, since it contains the true values of the dependent variable, so that I can then compare them to the predicted values and see how well my model was able to get the predictions right. Now let's do one more step: I'm importing the metrics from scikit-learn, such as mean_squared_error, and I'm using the mean squared error to find out how well my model was able to predict those house prices. This tells us that on average we are making an error of about $59,000 on the median house prices; whether that is large or small is something we can look into. A minimal sketch of these steps in scikit-learn is shown below.
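For reference, here is a hedged sketch of the pipeline just described (scaling, fitting, predicting, computing the error), assuming X_train, X_test, y_train, y_test already exist from a train/test split; variable names follow the description rather than the exact course notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn scaling on the training features only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)                 # note: y_train is left unscaled
y_pred = lr.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)  # roughly corresponds to the ~$59,000 average error mentioned above, if read as an RMSE
```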
Like I mentioned in the beginning, the idea behind using linear regression in this specific course is not to use it as pure traditional machine learning, but rather to perform causal analysis and to see how we can interpret it. When it comes to the quality of the predictive power of the model, if you want to improve it, the next step could be to check whether your model is overfitting and then, for instance, apply lasso regularization, so lasso regression, which addresses overfitting (a minimal sketch follows below). You can also consider going back and removing more outliers from the data; maybe the outliers we removed were not enough.
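As an illustration only, since the course does not show this code, lasso regression can be swapped in for the plain linear regression in a couple of lines; the alpha value here is a placeholder that would normally be tuned, for example with cross-validation:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0)            # regularization strength; tune via cross-validation
lasso.fit(X_train_scaled, y_train)  # same scaled features and unscaled target as before
print(lasso.coef_)                  # the L1 penalty can shrink some coefficients exactly to zero
```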
Another thing you can do is to consider somewhat more advanced machine learning algorithms, because it can be that, even though the regression assumptions are satisfied, using more flexible models like random forests, decision trees, or boosting techniques will be more appropriate and will give you higher predictive power. Consider also working more with the scaled or normalized versions of your data.
Interested in machine learning or data science? Then this course is for you. We will build a movie recommender system using feature selection, count vectorization, and cosine similarity, and at the end of the video we will create a web app using Streamlit. Building projects is one of the most effective ways to thoroughly learn a concept and develop essential skills. This guide will walk you through building a movie recommendation system that's tailored to user preferences. We will leverage a dataset of roughly 10,000 movies as our foundation; while the approach is intentionally simple, it establishes the core building blocks common to the most sophisticated recommenders in the industry, think Netflix, Spotify, and others. We will harness the power and versatility of Python to manipulate and analyze our data: the pandas library will streamline data preparation, and scikit-learn will provide robust machine learning tools, namely CountVectorizer and cosine similarity. User experience is key, so we will design an intuitive web application for effortless movie selection and recommendation display at the end of the video. You will develop a data-driven mindset, understand the essential steps in building a recommendation system, master core machine learning techniques tied to data manipulation, feature engineering, and machine learning for user recommendations, and create a user-centered solution that delivers a seamless experience for personalized movie suggestions. So let's get started.
All right, so let's now go over the dataset we will be using for the movie recommender system. We are using the TMDB movies dataset from Kaggle. This dataset is crucial for developing a system that recommends films tailored to your preferences and introduces you to new titles. We selected it for its comprehensive movie data: it includes an id, which is essential for movie identification, a title, genre, original language, and many other features, but the key features we will focus on are id, title, genre, and overview, and on top of this we will combine overview and genre together into a tags feature. We selected this dataset because of its size: it contains around 10K top-rated TMDB movies and, as you can see, it has many features. So let's move on to the next chapter, which is feature engineering.
Okay, so now let's go over the features that we'll be using. Features in a recommender system are essentially the data points you use to make decisions about what to recommend. These features help identify similarities between movies, which is crucial for generating personalized recommendations. For your system to be effective, it's vital to select features that offer meaningful insight into the content of the movies and the preferences of the users, so you have to be careful about which features you choose. For our movie recommender system we will focus on several key features. Id: this serves as a unique identifier for each movie, crucial for indexing and retrieving movie information accurately. Title: the most basic yet essential feature, used to identify movies by their names. Genre: this categorizes movies into different groups, facilitating recommendations based on content similarity and user preferences; genre plays a pivotal role in personalization. And of course the overview: offering a brief summary of the movie's plot, the overview acts as a rich source for content-based filtering through NLP. We will be using the overview combined with the genre to create a very comprehensive descriptor for each movie,
so that you're able to recommend movies more accurately. Combining the overview with the genre into a single tags feature gives us a fuller picture of each movie. This combination helps the system to better analyze and find movies that are similar in theme, story, or style. For example, let's consider a movie like Inception. Its overview might read something like "a thief who steals corporate secrets through dream-sharing technology is tasked with planting an idea into the mind of a CEO", while the genre could be listed as action, science fiction, adventure. If you combine these into one text, "action science fiction adventure thief corporate secrets dream-sharing technology planting an idea CEO", you get a much fuller picture, which might lead the system to recommend a movie like The Matrix. The main point is that when you combine the overview with the genre you get a much richer feature, which you can convert into a much more informative numerical data point, and which you can use to better recommend movies.
Before using the overview and genre data, we pre-process the selected features, because many movies include stop words in their title, genre, or overview, and those are words that don't contribute anything meaningful. For "The Lord of the Rings", for example, we would remove "the" and "of". So we pre-process the text, remove stop words such as "the", "and", "in", and "his", and clean our data. With that, we have done our feature selection; now let's move on to content-based versus collaborative filtering recommender systems, and explore
the key recommender systems currently being used by Netflix, Amazon, and other big tech companies. There are two main types: content-based and collaborative filtering recommender systems. Let's start with the content-based recommender system. A content-based recommender system only uses the features and the overview of the movies that you have liked to recommend similar movies. It's like telling a friend "I liked Iron Man", and, based on its features, such as the director, the genre, or the overview, it recommends similar movies. It won't use any other data, for example what other people have liked, what their ratings of other movies are, or what ratings you have given to other movies; it's based solely on the features of the movies you have liked previously. For example, if you've enjoyed Inception, a content-based system might suggest Interstellar, because both movies share a similar director, a complex narrative structure, genre, and overview.
Now let's go on to collaborative filtering recommender systems. On Netflix, for example, if you have watched and enjoyed Stranger Things, Netflix might recommend The Witcher to you, because other users who liked Stranger Things also enjoyed The Witcher. What's important to note here is that it doesn't use the features, the overview, or the other informative data points; it only uses what other users have liked, specifically users with similar preferences to yours. That's the difference: the recommendation is made based on the viewing habits and preferences of a larger group of viewers with similar taste to yours. This method doesn't rely on item features but on the wisdom of the crowd; it uses patterns of ratings or interactions from many users to predict what an individual might like. For example, if users who liked The Avengers also enjoyed Guardians of the Galaxy, you might receive a recommendation for Guardians of the Galaxy if you liked The Avengers. While both systems are effective on their own, combining them would enhance the accuracy of the recommender system: for example, you start off with a content-based recommender system, but once you start collecting more data you can also use collaborative filtering to provide more accurate recommendations.
In this session we will focus on the crucial element of transforming text into numerical vectors. Our models can't be trained on raw text, but they can be trained on numerical vectors, which means we have to convert our text into vectors. To do that we use CountVectorizer. This method simplifies text analysis by ignoring the order of words and instead focusing on their frequency. By turning text into numerical data we'll be able to compare documents, a vital function that allows our system to process and organize large amounts of text data efficiently. We aim to provide a straightforward, practical understanding of this essential technique so that we can move on to our next chapter, which is cosine similarity.
To provide a more targeted example, let's say we consider three movie overviews: "an action-packed adventure", "adventure movies inspire me", and "we both love heart-racing adventures". If we list the words used across these sentences, we get a vocabulary like: an, action, packed, adventure, movies, inspire, me, we, both, love, heart, racing. Each overview can then be described by how often it uses each of those words. The first overview, "an action-packed adventure", uses "an", "action", "packed", and "adventure" once each and none of the other words, so its vector is 1 1 1 1 0 0 0 0 0 0 0 0. The second overview, "adventure movies inspire me", uses "adventure", "movies", "inspire", and "me" once each and none of the rest, so its vector is 0 0 0 1 1 1 1 0 0 0 0 0. You can do the same for "we both love heart-racing adventures", but I think you get the idea. This step is key, as it transforms textual information into a numerical format, a vector. Each movie overview is converted to a vector in a high-dimensional space where each unique word is a dimension and the word's frequency in the overview is the value in that dimension. This structure allows machine learning models to interpret the data and handle tasks such as genre classification.
As another example, let's say we have the movie titles Iron Man 1, Iron Man 2, Iron Man 3, and The Avengers, and we try to create vectors out of them. The words used across the titles are: iron, man, 1, 2, 3, the, avengers. For Iron Man 1 you can expect the vector 1 1 1 0 0 0 0; for Iron Man 2 the vector 1 1 0 1 0 0 0; for Iron Man 3 the vector 1 1 0 0 1 0 0; and for The Avengers the vector 0 0 0 0 0 1 1.
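As a hedged illustration of the same idea in code, not part of the course notebook, scikit-learn's CountVectorizer produces exactly these kinds of count vectors (note that it lowercases text and, by default, drops single-character tokens, so its vocabulary may differ slightly from the hand-worked one above):

```python
from sklearn.feature_extraction.text import CountVectorizer

overviews = [
    "an action packed adventure",
    "adventure movies inspire me",
    "we both love heart racing adventures",
]

cv = CountVectorizer()
vectors = cv.fit_transform(overviews)

print(cv.get_feature_names_out())  # the vocabulary, one dimension per unique word
print(vectors.toarray())           # one count vector per overview
```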
This process of count vectorization, translating movie titles and overviews into numerical vectors, is a cornerstone of text analysis. It allows us to convert unstructured text, such as movie titles, into a structured format that can later be used in our machine learning algorithms, and we can extend this process to a larger and more complex body of text. For instance, we could apply count vectorization to movie descriptions, reviews, or even entire scripts; regardless of the text's size or complexity, the count vectorization method remains effective and allows us to handle a broad spectrum of text data.
To understand the concept of cosine similarity in the context of movies, let's take a simple example. Imagine we are comparing two movies based on their genres: one is "Sci-Fi Thriller" and the other is just "Sci-Fi". Cosine similarity is defined as the dot product of the two vectors divided by the product of their magnitudes, cos(A, B) = (A · B) / (||A|| × ||B||). With the vocabulary [sci-fi, thriller], "Sci-Fi Thriller" converts to the vector A = (1, 1) and "Sci-Fi" to the vector B = (1, 0). The dot product is 1×1 + 1×0 = 1. The magnitude of A = (1, 1) is the square root of 1² + 1², which is √2, and the magnitude of B = (1, 0) is the square root of 1² + 0², which is 1. So the cosine similarity is 1 / (√2 × 1) = 1/√2 ≈ 0.71; that is how similar the two movies are.
Now let's take a broader example. We have Iron Man 1, and we compare it with The Avengers and Oppenheimer. Iron Man 1 is action and sci-fi, The Avengers is action, sci-fi, and adventure, and Oppenheimer is drama and historical. We will calculate the cosine similarity between these movies based on their genres and recommend movies accordingly. If we make vectors out of the genres of each movie, using the vocabulary [action, sci-fi, adventure, drama, historical], then Iron Man 1 is (1, 1, 0, 0, 0), The Avengers is (1, 1, 1, 0, 0), and Oppenheimer is (0, 0, 0, 1, 1). Let's calculate the similarity between Iron Man 1 and The Avengers. As I mentioned, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, ||A|| × ||B||. Taking Iron Man 1 = (1, 1, 0, 0, 0) and The Avengers = (1, 1, 1, 0, 0), the dot product is 1×1 + 1×1 + 0×1 + 0×0 + 0×0 = 2, the magnitudes are √2 and √3, and so the similarity is 2 / (√2 × √3) ≈ 0.82.
Now, if we calculate the cosine similarity between Iron Man 1 and Oppenheimer, we get zero, because the two genre vectors share no common words, so their dot product is 0.
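A minimal sketch of the same calculation in code, purely illustrative rather than the course's implementation, could look like this:

```python
import numpy as np

def cosine(a, b):
    # dot product of the two vectors divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# genre vocabulary: [action, sci-fi, adventure, drama, historical]
iron_man_1  = np.array([1, 1, 0, 0, 0])
avengers    = np.array([1, 1, 1, 0, 0])
oppenheimer = np.array([0, 0, 0, 1, 1])

print(cosine(iron_man_1, avengers))     # ~0.82
print(cosine(iron_man_1, oppenheimer))  # 0.0
```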
So, as you can see, we calculate the cosine similarity using this formula. All right, let's get started. I have written the code already, so I will just walk through it.
Step one is to import pandas: since we will be using pandas, we must first import it, and install it if needed. This is the data exploration and pre-processing part. Let's start with the feature selection part, which means we first list all the columns in order to identify the relevant features. These are all the columns we have inside our dataset; we are going to combine overview and genre into a column which will have the name tags, and that's it. Perfect.
Now there is a new column, tags; we have this additional tags column, and it is the only text we will be using to run our model on, which means we don't need overview and genre anymore, so we can get rid of them. We do that by setting movies equal to movies.drop and dropping the columns overview and genre. Perfect: as you can see, we now have only id, title, and tags, and this is great. A brief sketch of these steps is shown below.
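Here is a minimal, hedged sketch of the data preparation just described; the CSV file name and the exact column names are assumptions based on the description of the TMDB dataset, not copied from the course notebook:

```python
import pandas as pd

# load the Kaggle TMDB dataset (file name is illustrative)
movies = pd.read_csv("top10K-TMDB-movies.csv")

# keep the key features and combine overview + genre into a single 'tags' column
movies = movies[["id", "title", "overview", "genre"]]
movies["tags"] = movies["overview"].fillna("") + " " + movies["genre"].fillna("")

# overview and genre are no longer needed on their own
movies = movies.drop(columns=["overview", "genre"])
print(movies.columns)  # id, title, tags
```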
Now on to the text cleaning part. We will import NLTK and the necessary modules for our text preprocessing, and run this cell first, because it will download the necessary NLTK resources for pre-processing. We start the cleaning function by checking the text: first we want to make sure it is actually a string, which we do by saying "if not isinstance(text, str)". Then we want the text column to be all lowercase, we want to remove punctuation and digits, and we also want to tokenize the text and join the words back together. Next, we import and, if needed, install the remaining dependencies, apply the clean_text function to the tags column, and create a new column with the cleaned text. So this is where we clean the data.
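A hedged sketch of such a cleaning function, assuming NLTK is installed; the function and column names are illustrative, since the exact course code is not fully audible in the transcript:

```python
import string
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer resources

def clean_text(text):
    if not isinstance(text, str):   # guard against missing / non-string overviews
        return ""
    text = text.lower()
    # strip punctuation and digits
    text = "".join(ch for ch in text if ch not in string.punctuation and not ch.isdigit())
    tokens = word_tokenize(text)    # tokenize, then join the words back together
    return " ".join(tokens)

movies["tags_clean"] = movies["tags"].apply(clean_text)
```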
Okay, so let's initialize the CountVectorizer with max_features set to 10,000 and stop_words set to "english". What we are saying here is that our maximum number of features is 10,000 and that we must remove all the stop words contained in the English vocabulary. Right, so now we fit the text data and vectorize it into an array. Perfect. Now we can import cosine_similarity and compute the cosine similarity: through this line we calculate the cosine similarity between the movies based on their vector representations; we have already vectorized the tags through our CountVectorizer, so now we can calculate the similarity between them by calling cosine_similarity on the vectorized data.
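A minimal sketch of these two steps, vectorizing the cleaned tags and computing the similarity matrix, with the same parameters mentioned above; the variable names are assumptions carried over from the earlier sketches:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(max_features=10000, stop_words="english")
vectors = cv.fit_transform(movies["tags_clean"]).toarray()

# pairwise cosine similarity between every pair of movies (n_movies x n_movies)
similarity = cosine_similarity(vectors)
```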
Perfect, it runs; let's look at the similarity matrix. Great. Now if we call .info() on the data frame, it is still the same: 10,000 ids, 10,000 titles, and still some missing overviews or tags. Perfect, everything looks fine. Now we want to see how our model has performed, so we are going to identify and print the titles of the top five most similar movies to a given movie based on cosine similarity. Here we will do it for movie index 4. Which one was movie 4? It is The Godfather Part II. With this function we can check whether our model works: what we want to do is find the movies that have the highest cosine similarity with movie 4, return the list sorted in reverse order, meaning the most similar movie comes on top, and take only the first five entries, because we want to recommend five movies. Then, of course, we print the titles from the list containing the five most similar movies to movie 4.
Perfect: since movie 4 is The Godfather Part II, the recommendations that come back are other Godfather-related titles. All right, so let's now create a function that recommends movies based on the title of the movie, not the id, just the title, and check that it works; a hedged sketch of such a function is below.
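A minimal sketch of a title-based recommendation function, under the same assumed variable names (the movies DataFrame and the similarity matrix) as in the earlier sketches:

```python
def recommend(title, n=5):
    # look up the row index of the selected movie by its title
    index = movies[movies["title"] == title].index[0]

    # pair each movie with its similarity score and sort, highest first
    scores = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)

    # skip position 0 (the movie itself) and print the next n titles
    for i, _ in scores[1:n + 1]:
        print(movies.iloc[i].title)

recommend("The Godfather Part II")
```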
Okay, so now let's save the modified data frame and the similarity matrix for later use; we are going to use them in our web app. So let's import pickle, dump the movie list and the similarity matrix to files, and check that we can load the saved similarity scores back. Perfect. A minimal sketch of this step is below.
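The file names below are illustrative, not necessarily the ones used in the course:

```python
import pickle

# save the prepared DataFrame and the similarity matrix for the web app
pickle.dump(movies, open("movie_list.pkl", "wb"))
pickle.dump(similarity, open("similarity.pkl", "wb"))

# quick check that the saved similarity matrix loads back correctly
similarity_loaded = pickle.load(open("similarity.pkl", "rb"))
print(similarity_loaded.shape)
```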
All right, I will see you in the last part, which is building the actual web app. As you can see, the front end will be provided to you, and the only thing you have to do is go through the contents of the app file: import streamlit, import pickle, import requests. We first have to write the function to fetch a movie poster using the movie id; this function will be provided to you, and it works by connecting to the API. Then we load the movie data and the similarity matrix; you do that by setting movies equal to pickle.load (we dumped them in Colab, and now we're loading them), and we do the same for the similarity matrix and the movie list. Now let's create the header for the web app, which will show
the title "Movie Recommender System". Next we have to bring in the necessary Streamlit components: we will be creating an image carousel, and for that we must first import the components module. We'll be using the carousel component from our front end, and as a baseline we want to be able to fetch movie posters using the movie ids, so there is a base set of movies shown before anything is recommended, just so the page is not empty, and this works just fine. Of course, we then display the image carousel component and create a drop-down menu for choosing movies on the page. Now let's create our recommend function.
The way we are going to do it is by first finding the index of the selected movie in our data frame based on its title, then calculating its similarity scores against all the movies and ranking them from the most similar, meaning the highest cosine similarity score, to the lowest. We are going to recommend five movies; of course, you could recommend ten or fifty movies, that's completely up to you. So let's first create an index variable, then compute the similarity scores and sort them by distance. As you can see, we first calculate the similarity scores and return them in reverse order, from the highest ranking to the lowest. Then we initialize two empty lists, one for the recommended movie names and one for their posters, and fill them in.
Perfect, and it works perfectly. All right, let me walk you through the code to show you exactly how you can do it as well. We first import streamlit, pickle, and of course requests, to fetch the poster images for our movie ids, because we don't have them stored. Then we load our data, the movies list and of course the similarity scores, and we grab the titles. Next we create the header of our web app, and to create a carousel we must first import the components module for Streamlit. Here we initialize our carousel component and feature some movie posters, so there is a basic list of images, or rather movies, that people can browse before anything is recommended; it's just a basic image carousel. Then we display the image carousel component and, of course, create the drop-down menu. And here's the main function of our movie recommender system: we first find the index of the selected movie, then we calculate its distance, or rather the similarity score, against all the movies and return the ones that rank the highest, which means the most similar movies are returned. So the main function allows us to recommend movies based on the index: we initialize the index of the movie, calculate the most similar movies with respect to our selected movie, initialize the lists of movie names and posters, fill them in, and once everything is filled we return them. This is the button we click to recommend movies; we have five columns, and each column shows a title, and it works just fine, as you can see, when we click recommend for Iron Man. A compact end-to-end sketch of such an app is shown below.
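For reference, here is a hedged, simplified Streamlit sketch consistent with the walkthrough above; poster fetching via the TMDB API is omitted, and the file names are the same assumed ones as in the pickle sketch:

```python
import pickle
import streamlit as st

# load the artifacts saved earlier
movies = pickle.load(open("movie_list.pkl", "rb"))
similarity = pickle.load(open("similarity.pkl", "rb"))

st.header("Movie Recommender System")
selected = st.selectbox("Select a movie", movies["title"].values)

def recommend(title, n=5):
    index = movies[movies["title"] == title].index[0]
    scores = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i].title for i, _ in scores[1:n + 1]]

if st.button("Show recommendations"):
    names = recommend(selected)
    for col, name in zip(st.columns(5), names):
        col.text(name)
```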
Thank you for watching this video; I hope you enjoyed creating the movie recommendation system. If you'd like to watch more of this content, make sure to subscribe, and other than that, I will see you in the next video. When you want to enter the industry, you need to show that you can do the work for the role you're going to be hired for; my starting point is showing a data science portfolio, showing that you could actually do the work. A strong data science portfolio, plus communication and translation skills and business acumen, is all a plus, something extra. As a hiring manager, during the hiring process you pay attention to the projects candidates have completed, unless they already have experience.
Hi everybody! Welcome, Cornelius, really excited to have you here: an experienced data scientist, a top voice in the field of data science, with a wealth of knowledge to share with you. Cornelius, you are a data science manager at Allianz, so can you walk us through your journey in the data science field and how you climbed the corporate ladder? Oh, that's a nice story, I think. I will go back to the beginning,
before I go into all the corporate stuff. I started like every student, every aspiring scientist: I wanted to become a scientist, but I didn't major in any data science field; I actually majored in biology and evolutionary biology. My bachelor's and my master's thesis were all about biology, and I'm a researcher at heart. But there was a moment when I decided I wanted to become a data scientist. It first came during my master's studies, when I was listening to a webinar and trying to see what kind of job I could have in the future, because as a biologist, especially in my country, Indonesia, it's a little bit hard to find a job with real financial security. So I tried to find something that was still related to my biology background, that still needed my passion for research, but that could actually make some money, and that's when I found something called data science. Watching that webinar about data scientists, I tried to learn about it: okay, they use statistics, they use these tools, they use data to actually solve business problems, and it's actually used in business. That day I realized, okay, this could be my next career move. So during my master's, even before I graduated, I tried to learn as much as possible about data science: I joined online courses, I joined communities on social media, and I read as much as possible. When I came back from my master's to Indonesia, I focused on self-learning again, I joined offline classes for data science, and I tried to make as many connections as possible from the network I already had, basically asking, "could I become a data scientist in your company?" With a bit of hard work, from 2018 to 2019, so about a year and a half of moving from field to field, I became a data scientist.
Super cool, that's quite an impressive journey. In such a short amount of time you managed to go through all these different levels, because data science can be tough, especially in the beginning when you want to get your first job as a junior data scientist with almost no experience, right? And you went from biology, not a so-called traditional data science study, all the way to the field of data science. So let's talk about that: can you talk about some of the challenges you overcame when you were just starting out with no experience at all? Yeah, there were a lot of challenges. When I first started as a data scientist I didn't know
anything; I knew a little bit about programming, a little bit about statistics, a little bit here and there. But once you are in a corporate environment, like my first job was, there are a lot of things to learn, especially on the business side. You can develop some model, I mean, you can run the code, but is it solving the business problem? Is it really valuable to our business? Is it convincing enough to be used by the business people, the business team, or even by the customers, so that they actually want to use my model? This is something I learned during my first year: I was still learning what the business itself is about, because as a junior data scientist you usually work to execute, to perform whatever the task is going to be, right? But in a corporation it's a little bit different. In a startup you might get to do a bit of everything, but in a corporation everything is already structured and organized, the business is already moving, the business already knows what it wants, what makes money, and what data science should be there for: data science could be there to automate things or to investigate new opportunities in the business. But as a junior data scientist you need to understand why the business process works the way it does. For example, in my early experience as a junior, one of the first models I created was
what is called a propensity-to-buy model. It's a model that tries to predict which customers should be approached to buy a new product. It seems easy: you just take the customer data, figure out who bought and who didn't, and build a model on that. But there are actually a lot of moving parts, a lot of puzzle pieces: you have the model, but what product are you actually selling? Who is the target customer? Who is going to act on this list? Where is the communication going to happen? How long is the campaign going to run, is it continuous or just one time? There is so much moving around in there, and you need to work a lot with the business. That's one of the areas I was still relearning, but it's actually one of the things that boosted my career, because I learned so much about the business, and I also learned about communication with the business side. Like I said before, you want to convince the customer, and by customer I mean the user, the people who are going to use this model. I can do my job, but can you believe in my job? That takes really good translation from our technical terms into business terms,
basically. I could present my model as, for example, a random forest: this is how it works, this is the precision, this is the recall of the classification model. But no, they don't care, right? What I understood is that, for example, when I create a propensity model and we simulate it, I can tie it to their KPIs: your KPI is, say, monthly revenue, and in the simulation this model could increase revenue by 20% compared to your normal process. That's the kind of example that works, but it requires rethinking how we go from our technical language to how the business actually speaks. I always call it translation.
I think I couldn't have said it better than what you just said. It's really important: also from my experience, and from what I have seen in my colleagues, being a data scientist is not just crunching numbers. Many people think it's just statistics, or maybe some mathematics, or purely data, but it's exactly what you just said, all these different skill sets that have to come together, like business acumen, which you just mentioned, and communication, translation from business to technical terms. Usually the product managers will never come and tell you "please make a classification model for me to classify this thing"; they just come and say "I need to do this", and then you need to realize that, oh, you need a classification model, you need to select certain approaches. And you perfectly captured that this is a combination of different things which from one view might seem very easy, but when you dive deeper: how to clean the data, where to collect the data, where to store it, how to process it, how to make an impact, how to measure it. So I totally understand how you could grow and go through the corporate ladder very quickly, because early on in your career you had an opportunity to learn and gain all these different skills. And I think for our audience, for our aspiring data scientists, this is definitely a note to make: if they want to grow quickly in their career, they need to be prepared to work on their communication skills, their business skills like you did, their translation skills, how to translate business KPIs and OKRs into actual data science problems. On that note, since you mentioned a very interesting project: was there any particular project early in your journey, when you were a junior or mid-level data scientist, that made your career, that set you apart from the others and got you promoted? Yeah, precisely,
it's actually cumulative; I wouldn't say it was just one project. But maybe I'll go back a little to the beginning. One of the things I tried to do from my junior days onwards was to take initiative. I wasn't just going along with "okay, this is your project, you need to do this and that"; even at my junior level, I tried to talk to my boss: "I know this is a really interesting project, I want to take it", or "I know this is going to have a good business impact, so can we talk to the business user about a project we could create together?" So I tried to take initiative in my own career, so I could see where I could move at that time. It accumulated, and there is one project from that period that I remember: an NLP project. It basically tried to predict whether a customer email contained a complaint or not. Customers sometimes complain, right, but it's not just about whether there is a complaint; what we wanted to predict is whether this kind of complaint could be damaging to our reputation or not. The project I undertook was something new; it was not the kind of project that had been done before in our company, and I had to convince the business user and my boss that I wanted to take it on, that it could be really useful for our business, and that I would take responsibility for managing it. It went well, and even now the business users still want to work with me on all kinds of projects, because I started that project and took the initiative, and they still approach me when they have a problem or another idea beyond the ones we've already done. So it starts simple, it starts with yourself, but this kind of thing will be seen, by your boss, your colleagues, any business partner. So just take initiative; I think it can really make your career, it really makes you stand out.
So, be proactive: if you are already a junior data scientist, you already have some basic skills, you have worked under the supervision of some senior data scientists for a couple of months, now it's time to take initiative, to look around, network, and see what kind of projects are on the table, and then go and try to make something of them. Even if something seems boring or uninteresting, you kind of need to dig deeper, like you did, right? You basically identified the projects and said to your boss, "well, can I work on this?", and whether it is impactful matters, because at the end of the day you will always go to the next level if the project you are doing has a lot of impact, right? Yes, yes, that's true. And to add to that: even before I became a data scientist, I already knew what I wanted to be and what I wanted to do, and once I got there I didn't stop; I tried to make what I call a master plan. This is going to be my career, but I need to take it into my own hands. This is also what my boss always said, and it really inspired me: your career is in your hands, your life is in your hands, so every move you make, you need to be responsible for it, but it also needs to be something that improves you as a person, your career, or something else. That's why I always try to take initiative. And it worked out, right? Because now you are a data science manager, you are a top voice in data science, so it worked out well, which is amazing.
On that note, given your non-traditional data science background, because you have a background in biology, and there are many people among our listeners who want to make a career change or who come from a non-traditional data science background, meaning they don't have a traditional statistics, mathematics, or data science master's degree, I wanted to ask you: can you tell our audience about the impact your unique background had on your career? It's really interesting, yeah, because as a biologist I don't use much of the specific content of my education, for
example genes or proteins; of course I'm not using those in my everyday work. It's more about how I learned to think methodically during my time working as a biologist, as a researcher: how I structured my work back then, how I used statistics. In the end, I still use that scientific way of thinking from my education. It really plays a part in how I break things down, because in academia the structure of your work has to go from the theory, to the methodology, to how it is going to work, to the results and the conclusion. All of that previous experience actually helped me in working as a data scientist. Of course, every person has a different background; for me, because I come from the science side, that kind of research methodology is really useful as a data scientist. But I know there could be people coming from literature, from philosophy, from languages, something really not related to programming or data at all, and I think they can bring a different, unique perspective as well. In the end, how you approached your work during your studies can carry over to the way you work now, and if you already have the basics, I think you can learn it even if you are not from a data science major.
So you basically used your background to your advantage, even if it was a non-traditional data science background. Yes. So what would be your advice to anyone who has this non-traditional background, in terms of specific steps? Yes, having a roadmap, for me, is the best one. I know it sounds a little bit cliché, but having a roadmap to follow is really helpful. There is already a lot online about how to become a data scientist step by step: first learn statistics, second learn programming, third learn machine learning, fourth build machine learning and data science projects, fifth learn to apply, to communicate, to present. These steps might seem boring, but following them is actually important, because it's a structured way to learn. To become a data scientist, I'd say just start small, but those skills, statistics, programming, and machine learning, are all connected to each other, and they are going to help you become the data scientist you want to be. So just follow a roadmap that already exists; it's going to be more helpful than trying to learn by yourself, jumping here and there, learning statistics and then forgetting about the math, or learning programming and then forgetting about how to present it to the business user. Just follow the steps one by one, and I think you will already be in good shape.
So, learning in an organized way, basically following the roadmap. Yeah. Okay, and have a clear plan, I suppose? Yes, having a clear plan is actually really helpful, and this comes from my experience as well, because I tried to create a plan for myself to become a data scientist, and then I tried to follow that plan, and it actually worked, maybe not perfectly, but this kind of plan really helped me because it makes you focus. If you are not focusing on that one thing, you could end up going anywhere and lose your way to becoming the data scientist you want to be. Right, right, I absolutely agree with you: have a clear plan and learn in an organized way, and don't just go from one course to another trying to learn everything. That would indeed be the best way, because otherwise everyone will spend a ton of time learning one skill, and by the time they finish, there is a new skill to learn. Amazing. And when it comes to leading, because you are a data science manager and you have gone through these
different steps already and you are leading teams: what, in your opinion, has helped you balance the technical side and leading people? How do you manage projects and people and, at the same time, keep driving your career? Yeah, that's really hard, I have to say, but it's something I also learned during my corporate time, because leading and being an individual contributor are two different things. On one side, the technical part, we already understand; but to become a leader of people we need to understand how to delegate, which tasks, which projects, and then trust the others that they can actually get it done. And it goes both ways: I trust you as a colleague, and I would say you are also a friend, someone I'm working together with here; we are not just boss and subordinate, "okay, you do this, you do that, bye". No, we try to communicate together: what are the problems here, how can we work together on this? I try not to hold back either; I still jump in and do things myself. But trusting people and delegating to people is very important, and of course it comes with learning. As an individual contributor I still really love coding, I still love programming, I still do it right now, and I'd happily get hands-on with a model at any time, but trusting others is a learning process: learning to manage the project, and learning that I can actually do this much better now that I'm in this position. So yeah, it's still a learning process for me; it's a never-ending journey, I would say, but delegating to people I trust, having trust in others, is the key to balancing things.
On that note, because some of our listeners who are aspiring tech people and haven't worked in the field may not know this concept of an individual contributor that you just mentioned: for anyone who doesn't know the term, we usually have two different career paths in data science. When you just join, you usually join as an individual contributor, so you work on the technical stuff, and as you go through your career there are usually two ways to go; one is the individual path, staying an individual contributor, and the other is the managerial path, which I assume is the one you are following, because you are managing people. Yes. So, on that topic, you mentioned working as an individual contributor, and now you are a manager, which means you are managing and supervising people, and on the note of trust, because that's something you mentioned a lot when explaining your leadership skills, trusting in others: how do you build trust as a data scientist when you have just joined your data science job? Yes, by proof, I think, because that's what I do to build trust with my boss and with my colleagues: I prove that I can actually do my job, and then I
try to present what my results are going to be, and I try to be proactive in my work. Trust is not built just by you saying it; trust is built by having proof of your work. Trust comes from the things you have already done, from the action itself. I know communicating is maybe not that easy either, but your work is like a promise. As a junior data scientist you can say "yeah, I can do it, I can do it", and plenty of people say they can do it, but how do they actually work? It's the work that becomes the proof behind the trust. So as a junior, I think the best way is to just do the work. Even if it's small, and as you said previously it might be boring stuff, it might seem like it's not leading anywhere, that kind of work shows that you can actually do the work, rather than just saying so. Get the work done, basically. Yeah, get the work done. Right, right, that makes sense: then you can build the trust and make sure you help your supervisor, because at the end of the day that's the job of the junior data scientist, to help the senior data scientists and other leaders in the team make their lives easier, whether it's, unfortunately, data collection, data analysis, or something else. And on the note of
leading teams and becoming a manager: in your opinion, how can one decide whether they want to take the path towards individual contribution, so becoming a principal data scientist and staying an individual contributor, or whether they should consider the path towards becoming a manager, a data science manager? What qualities do you need to have to go towards one path or the other? I think it comes from yourself, because every single person has a different evaluation of what they want to be. For example, for my part, I love both sides; I already said I love both, being an individual contributor and leading, so I wanted to actually have this experience as a leader as well. Because I already had good experience as an individual contributor, I tried moving to the other side too, to try the leadership part. But of course it comes back to yourself again: what do you want to do in your future? I cannot tell every single person that becoming a manager is the best path because it has the best money, or that staying an individual contributor is much better; it really comes down to yourself, your comfort zone, where you want to be in the future. Every single career path has its pros and cons, it's always like that; it's just that whatever decision you make, you need to be responsible for it. That makes sense. So what is it like to be a
data science manager? Walk us through your day-to-day and your responsibilities, so the listeners can get an idea of what it means to be a data science manager. Yes. The main thing, I would say, is still leading the team: having meetings with the team and managing the projects, what our projects are right now, how they are going, how they are progressing, so managing the progress of the work, and then trying to balance that with the business users, what their expectations are, and translating that into our technical work so we can actually make that kind of progress. So it's balancing with everybody, from every side. That's the kind of work that usually fills my day: managing the people, what they are going to work on this week or today, and then, when the business users come, figuring out how we are going to make the project successful. So that's the day-to-day, basically. That makes sense: managing people, a lot of communication, meetings; that's definitely also part of the managerial position, that's for sure. So if you don't like meetings and you don't like communicating with product people and business people, you should not choose that path, right? That's true, that's true; I would say that's the con, because if you look at my calendar, there can be a lot of meetings, eight, nine, ten, twelve, and then suddenly it's like, hmm, when do I actually do my own work? Yeah, for sure, I've heard that a lot. So that's something for our listeners to take into account when choosing which career path they want to
follow. And on the note of projects, because you have been leading projects and you have also been climbing the corporate ladder successfully as a data science manager at such a young age: can you remember maybe one impactful project that you went through, a tough project that you and your team completed, and what were the main takeaways from it? There are so many projects, but I think the really impactful one right now, and it's still ongoing work, is that we are trying to build a claims fraud detection project. Basically, every company needs a fraud detection model; this is really impactful to the business because fraud usually doesn't happen that often, but when it happens it can damage the finances, damage the reputation, damage the whole body of the business. It's a long-running project that needs a lot of consideration: which part of the business it comes from, which customers, where the dataset comes from, so it really involves a lot of stakeholders and a lot of technical people. This kind of project takes a lot; people come and go, because it's a really hard project, but at the very least I would say it's making an impact, because the business users are using it and they trust our results. There is always a lot of room for improvement, because, as I said before, fraud can be really damaging if we don't handle it properly, so we need to present the results from this claims fraud model as something that can be understood by the business: we try to build the model as well as possible, we need the explainability to be as good as possible, and we need to integrate it into the business processes as much as possible. So there are a lot of moving parts that are still being worked on. It's the project I really remember to this day, and the kind of work I did there I try to carry over into other projects, because this kind of complexity really needs to be taken into consideration. Okay, amazing, fraud detection sounds like a tough problem;
if you have a high fraud rate, that can seriously impact your operations, and it is money that we are talking about, after all. And on the note of money: you have gone through these different steps, you usually start as a junior data scientist, or as an intern in data science if you don't have any background at all and haven't completed any projects, and then you go through the steps; usually you become a mid-level data scientist, then a senior data scientist, and then a data science manager like you are today. So what does it take to get a promotion? Is it only the technical side, building that trust like you just said, making an impact, or do you also need to actively promote yourself? Because I know it really differs from company to company, but there are some common qualities that differentiate data scientists who stay in the same role for many years, even if they have a very good skill set, from the data scientists who, like yourself, very quickly go
through these different steps and get promoted a lot yes I think like withc corporate yeah I said before it's really depend on
the business as well, of course, because a promotion only comes, first, if there is a business position to be filled — that's the first stage. If you want a really high position, it might not go as fast as mine if your company doesn't even need it. But for a promotion — basically a promotion that increases your money, increases your financial security — I believe the financial part is actually the most important. I think for most people — I know some people don't want to get promoted, they just want a higher salary rather than a bigger title — but my biggest quality, as I said before, is taking initiative. I discussed with my manager what I want my career to be, and I asked, as I said before: can I get a promotion? And then we basically negotiated: yes, we can do it in this amount of time, with your kind of proof of work. Everything I had already done, I had already shown. I'm not saying I got promoted in six months; I tried to build it up during that time, and it actually took a year of meetings. I asked — I even had to ask — so it's really an initiative you take, because you don't even know whether it's possible or not if people are not asking. I know some company cultures are a little bit strict, or some people are just too shy, but it depends on the culture of the company. In my company, my voice was really heard: it doesn't matter whether you ask to be promoted or not, we have open discussions. But in a company that is a little bit stricter in that regard, you need to be a little bit proactive and show that you are the best at your work, show that compared to your colleagues, yes, I am the one. I would say it is a competition — in a corporate setting it is still very often a competition if you frame it that way, but it's a really healthy competition if you
want to move up the ladder. Okay, amazing. So basically: don't be shy to ask, because if you don't ask you might not get it; read the room, meaning you need to know your boss and whether he is open or not to a conversation like that; understand the company culture and whether it's fine to promote yourself; and build that negotiation skill set to negotiate for a higher salary or a better business title, because sometimes that's also important for your position in the company. But do it at the right time, meaning show your work first, and only with the right cards in hand go and promote yourself, right? Yes, yes, you need to have a strategy for that, basically. But I know it takes time and it takes skill to do that. I
know not everyone is actually brave enough to do that, or even thinks about it — everyone is different — but of course you can try to copy other people's strategies a little bit. What I said just now, if you want to copy my strategy: after a year of working hard, when you have already taken initiative, you ask your boss; and when that strategy works, if your company has a culture a little bit like mine, it can really pay off. Okay, so that's very good advice, I think: don't shy away from asking,
and know your strategy, plan in advance. Yes. On the note of planning and personal branding — planning your personal brand — now, Cornelius, you have a huge following, you are a Top Voice on LinkedIn, and you have about 30,000 followers on LinkedIn, right? Almost 30,000. Still almost, yes — if you are listening now, go and follow Cornelius, maybe we can get him to 30,000. Yes. So how important is it to have a personal brand as a data scientist? It's actually very important. I will talk about what it does outside of my work, because of course my work is the core, but outside of my work it gives me a lot
of chances, a lot of opportunities. Like right now, I'm talking with you because I already have a personal brand, right? It opens a lot of doors: I get a lot of new friends, a lot of new networking, a lot of opportunities outside of my job — a lot of freelance work, a lot of writing. I love writing, so I try to focus myself on writing as well; some content creators actually do video, do TikTok for example, but I love article writing. I have a newsletter, and I try to build my personal brand there as a data scientist who writes. I don't call it a side hustle anymore, because it has become my personal hustle that improves my income and my security. So personal branding can give you more career choices. I read a lot about business entrepreneurs: they usually don't have only one source of income, they have multiple, spread here and there, and I see this actually coming from personal branding as well — they can have multiple sources of income because they promote themselves. Like myself: I try to promote myself as a data scientist who knows some of the stuff in the data science world, and then the opportunities keep coming, because I keep promoting here and there and try to build up the choices I could make in the future. The doors that open are more numerous, so in the case that maybe someday I get laid off or something — you never know in a company, you never know what could happen; it might feel secure right now, but maybe in the future there's something like a financial crisis, a layoff, or even a pandemic like before, where people who felt financially secure suddenly got cut off — I see my personal brand right now as a security measure as well, something that could give me a runway in case something like that happens.
Really, yeah, for sure, because as you just mentioned, in many parts of the world there are currently many data scientists being laid off just because of the economic situation. The message you are giving to our audience is that your personal brand is not just something for that scenario, but also in general: it opens many doors for you, new opportunities, new networks, new contacts, and also, in case something happens — you're being terminated or laid off for some reason — it will be a great way for you to have a source of income, so you don't put all your eggs in one basket. That's the message you are giving, right? Basically, yes, very precisely. And on the note of your newsletter, because you have a popular newsletter and it is called Non-Brand Data, can you tell us more about that? Yes. At first I just called it Non-Brand
Data because, in the early years when I was trying to create a newsletter and a brand, I didn't want to box myself into one kind of branding — that I could only be a data engineer, or a scientist doing NLP stuff — so it's a newsletter that can talk about anything related to data. At the moment I try to talk about two things: first, more about career, so business and career as a data scientist, those kinds of tips and my opinions; and second, more about the technical stuff, so I still try to put my technical knowledge in my newsletter — I talk about Python, MLOps, and how to integrate them. Basically I try to combine understanding the business with understanding programming and technical stuff. It's still data science, just angled at helping you, as a data scientist, improve your career. Amazing. So if you are listening now, make sure to check out the Non-Brand Data newsletter from Cornelius, hit the subscribe button, and you will get a lot
of help with your career, how to start a data science career, and also the technical stuff, like Cornelius just mentioned. Yes. Also, let's talk about the coaching that you do, because you have a Topmate link and also a way to share your knowledge, to coach other aspiring data scientists or people who are already in the field. Can you tell us more about what kind of coaching you do? Yes, it's more about how you can move as a data scientist in your career — where do you want to move, like I said before, manager or individual contributor — but basically how you want to move as a data scientist in your career. Of course it is also open to people who are not data scientists yet and want to break into the data science field; it's open for that kind of coaching as well. Amazing. So if you're looking for a coach, make sure to get in touch
with Cornelius — he might be able to help you with your career but also with the technical part. Yes, he's on Topmate, so we will put the link in the description. So, on the note of personal branding: let's say a person comes and asks for your coaching services, or just your advice in general. You have a huge personal brand at the moment, a huge following on LinkedIn. What would be your advice, in a more actionable way, on how to build a personal brand from scratch? Yes, I will speak from my personal experience. First, I actually did networking before I posted anything
much. Before that, I came up with a plan of what I want to be: I'm a data scientist, this is what I know, and I tried to build that — that was going to be my brand, it was going to be about data science. Then I tried to do networking. The first time you post on social media, of course, you're not going to have a lot of followers reading it or a lot of people noticing it. That's why I also tried to approach people who already had big followings. I have a friend — his name is Kin, and he's not that active anymore — but he's the one who first helped me get into the network in the data science field as a data scientist, and because of his followers I got more followers as well, and then I started getting more and more momentum. So it's about announcing your brand and then networking with people who already have those numbers — basically, I would say it's a numbers game, if you put it that way. And third, consistency. I'm still consistently posting; it's already been five years, I think, and I'm still consistent until now, posting stuff even when it's hard. One of the things that makes people shy away after their first post is that not a lot of people are liking or commenting on their content. But no — just keep posting. Every single post you have might be valuable for someone. You might think, this is too easy, people already know this, but maybe someone is actually reading your content and thinking, okay, this is how you do it. I thought about that before — okay, there are not many likes, but there are people who actually say, thank you for your post — and that makes my motivation higher. At first it might not show much, but just keep posting consistently, because it takes mental strength, I think, to keep posting when your numbers are not that great. And of course it still takes a strategy to improve your social media and personal branding. So what I think it comes down to: have a plan for your personal brand, build the network, and keep consistent. I think those three are already good enough if you want to build a personal brand. Amazing. So if
you are someone who is listening and who doesn't yet have any personal brand, following these steps will help you gain that personal brand online but also offline. So on that note, Cornelius, do you think that having coaching services, having a newsletter, having a LinkedIn following, and of course also networking is enough for your personal brand, or are there other media or channels that you usually use to build your personal brand as a technical person? I'm talking about, for example, GitHub — a place to store, create, and showcase your code, and also to showcase how you can tell a story about data. Yes, yes, that's true. GitHub is really the community for data people — not even just data, for any
programmer — to have your portfolio shown there. But also, for myself, because I'm really a writer, after writing I use Medium to host all my articles; I'm still quite active there, though right now I'm a bit more focused on my newsletter. Depending on your passion, I think as a strategy you need to branch out a little bit. For example, mine are Medium and GitHub, but for some people who like video, maybe you could try being present on YouTube or on TikTok, because I know some friends there — one of my friends is actually really active on TikTok compared to LinkedIn. But you really need to understand what you want your audience to be. For example, I like LinkedIn because it's a really professional place as a social platform, with a lot of professional communication, so that's why I'm really active there trying to build my personal brand. But if your personal branding targets more casual people, then maybe X may be a better place — I know a lot of big data people are on X as well, with a lot of followers. I try to post here and there on X too, but it's mostly in Bahasa Indonesia, so it may be a little bit harder to follow if you don't speak Indonesian. If you are just starting, try to focus on one platform first — have one social media platform, build from there, and then branch out. Things like GitHub and Medium, I think, are just places to show your portfolio, more about where you can show your work; they complement the personal branding that you're already building on social media. GitHub, Medium, or some others are for the portfolio, but pick your starting place and focus your effort there. I think that's one important part. I think that's amazing advice, to keep it simple, because sometimes it can be
overwhelming if you have so many social media accounts or different channels that you need to keep up with — you will just run out of time or you will burn out. So start simple, basically, that's what you are saying, and then they will all start to complement each other: X or TikTok, GitHub, Medium, LinkedIn, Facebook — it really depends on your target profile, like you just mentioned. Yes. Amazing. Let's now dive deeper and become a bit more technical, because data science is such a buzzword. There are people who understand data science as data analytics, there are people who understand data science as machine learning engineering, and with the AI revolution there are now many people who understand a data scientist as someone who is dealing with LLMOps, large language models, deep learning, machine learning — so many parts that many data scientists across the world are learning and implementing now. In your opinion, as a data science manager with a wealth of knowledge and experience in the field, what is, for you, a data scientist, and what does it take to be a data scientist? Yes, this is still a question
that I'm actually still asking myself, because previously I said a data scientist is someone who works with data to bring value to the business, but right now it's moving toward much more than that. A data scientist is someone who brings value to the business and drives better decisions for the business. I always say a business, because a data scientist always needs a business to work with — even if you're a solopreneur working solo, your product will have to be promoted in some way, so there is some business value there. That's why I always say that right now a data scientist is not just someone who brings value but also someone who brings better decisions to the business. To become a data scientist, I would say it takes a lot of mental strength: it takes consistency, it takes wanting to learn all the time, every day, every year, always learning, right? And being pro-business, which I already said before, so I don't need to repeat it again. But yeah, just keep wanting to learn and keep up with the latest technology. This latest technology — I believe more and more that it's going to change how we work, just like it's changing how I work right now. So as data scientists we should try to keep following that technology, because if you are really not following it, we will, as I said, basically be replaced. As data scientists we have an advantage, because we are usually the ones working on creating that AI, complementing that AI, right? So on that note, I think
you already answered my question, because I wanted to ask you: do you think that AI is a buzzword? Many people think that this new era of generative AI — ChatGPT, you know, Claude, LLMs — is a hype that will just go away, but you just answered that in your opinion it's not going away anytime soon and it's going to make a huge impact. So can you walk us through some of the recent developments that you are aware of and believe are going to make a huge impact, and in which industry specifically, in your opinion? Yes. If you want to talk about technology right now, we already have a lot of these
generative models. OpenAI already has a lot of models that are really good, Claude has also been really great lately, and I feel like Sora is now trying to create not just images but video. But what I really see right now is the business side, the non-technical people. Previously it was just used for simple stuff, but now business people are starting to see how useful it is — not just technical people but non-technical people are trying to build products based on it. Just from my side, in the past two or three weeks I have already met with a lot of stakeholders — here in Indonesia, big stakeholders — who actually want to try building this kind of product and use this AI to simplify their business processes. Basically, I feel like in Indonesia they already see the potential of how it could be used, but of course right now it's our job as well, as data scientists, to prove that this kind of AI is actually useful to the business. It's not just that the business wants to use it and makes the effort from their side; we as data scientists have to make an effort as well to show that yes, this AI can be useful for your business, and to prove that when these two things combine — business and AI — it can become big, it can become a game changer. That's why I'm really quite confident that it can actually change the world, because it's not just me personally using it and getting a benefit from it; everyone I see from my side is already starting to use it, trying to implement it in their business, and it's actually useful. Well, that's amazing to hear, because when I'm
hearing talk about AI replacing data scientists, I feel like that's something that is highly arguable. So first I want to get your opinion on this: do you think AI will replace data scientists, and how can you future-proof yourself as a data scientist? Yes, I think it's not going to replace us wholly; I think it's more about some of the tasks
we do. For example, maybe some detection tasks or code generation could be delegated to AI, but of course restructuring all of the work, deciding where it is going to be used and how it is going to be managed, still takes a data scientist. That's why the data scientist role just keeps evolving, right? Because those tasks can be replaced by AI, we as data scientists don't just need to understand how to code well; we need to understand how to manage this code better, we need to actually document it better, we need to understand where it's going to sit in the business, where it's going to be used, and which of us data scientists is really going to be using this AI. So we need to understand all this latest technology as well. What I want to say is that data scientists need to become full-stack, and that cannot be avoided. For myself as well, I try to learn a lot of MLOps, ML operationalization, because I don't think that's going to be replaced by AI either — the structure could change, but yes. I couldn't say it better, because being
able to become this full-stack professional — knowing the data side, the machine learning side, deep learning, but also the recent developments in AI at least at a high level: what an LLM is, what LLMOps is, those cloud technologies and how they can be used, right? Yes, and also having the business acumen and the communication, because there is always communication between the business and the data scientist, there is always a need for translation. As long as you are able to do so and continuously develop yourself with the technology, there is no way you will be replaced — because, by the way, the other day I was reading that currently there is an 85% gap between the demand for data science and AI professionals and the supply. So unless you are doing a manual job — I always tend to say, unless you do something repetitive that can be replaced by AI — you are good to go. Yes. So you should definitely have the motivation to get into data science if you like it. On the note of getting into data science: you are now in a position of hiring data scientists, so what is the skill set you pay attention to? Because with this recent boom of LLMs and generative AI, many aspiring data scientists, instead of starting with the fundamentals, start with the difficult stuff — training neural networks, understanding RNNs, attention mechanisms, Transformers, diffusion models — and then try to show with these projects that they are experienced professionals. Okay, what do you pay attention to when you are hiring and you are getting these
different resumes? Okay, so the first thing — the thing that I don't actually pay much attention to — is which university they went to, what their GPA is, their age, or their background. I find those kinds of things discriminative, so I try to make every call on merit. Of course, when you are in a business there are business needs, so the first thing is filling the business need when hiring. For a junior data scientist, what I want to see is whether you can actually do the work: have you already done at least some data science projects, already created some data science projects? And for those projects, how did your thought process go — what motivated you to do this kind of project, what motivated you to write this kind of code, what motivated you to present this kind of result? So data science projects are what I really look at. Of course, having internship experience or previous job experience really helps, and I look at that as well, but I know it's a little bit harder for juniors — for fresh data scientists — to come up with that, because those positions are really hard to fill. That's why I at least take a look at their data science portfolio projects; that's the most important part I check. As for communication, business acumen, translating to the business — I think those are skills you learn once you are already inside a business; when you are trying to enter a business, you need to at least show that you can do the work for the position you are going to be hired for. Of course, understanding a little bit about the business really helps, it's still really helpful, it's a plus — but my starting point is: show your data science portfolio, show that you can actually do the work. Right, so a strong
data science portfolio, basically. You learn the communication and translation skills and business acumen on the job — those are all a plus, extra — but for you as a hiring manager, during the hiring process you pay attention to the projects that they have completed, unless they already have experience. Yes. And on that note, what would you suggest — a couple of examples of such projects that an aspiring data scientist with zero experience, fresh out of college, maybe even with a non-data-science education, could put on their resume to impress you? Well, a lot of people impress me because they are so
smart, but there are a lot of projects like that — the kind you can find on Kaggle or UCI, those somewhat complex data science projects. What would really impress me is if you could actually formulate a business problem from those data sets, explain why you developed this type of model, and show that the model you developed actually solves it — or even without modeling at all: you don't need a model, but can you formulate a business problem from this data set and then try to solve it using a data science technique, maybe just a clustering technique, maybe customer segmentation or dimensionality reduction, and actually show, this is how I solve the business problem that I formulated with this data set, and this is how I did it? That's what really impresses me, because what I want to see in your data science project portfolio is basically the thought process, and usually the thought process comes in when you already have a business problem you want to solve. Of course, the data sets that are publicly available may be a little bit limited, but if within that limited data you can creatively think about a problem and then try to solve it, it will really impress me. Right, so end to end, basically,
and also having that extra skill set beyond just solving the problem in a technical way. Yes, solving the problem in more than just a technical way, you could say that. Amazing. And one last question, because we have spoken about many important topics that I believe many aspiring data scientists would be interested in: what do you see as the future of data science in the upcoming five years? Well, it's going to change a lot, I think, with all this AI development
and all the data and technical stuff. I know that in five years data science will be invaluable to the business. Like I said before, AI was a buzzword previously, but it's a buzzword that is now used by the business, and if businesses can actually use it better in their companies over these five years, they will try to hire as many data scientists as possible just to make sure this goes smoothly. And I'm absolutely sure that businesses in these five years will be using a lot of automation from our data, from data science techniques. Amazing. Now, Cornelius, thank
you so much for joining us today. I think your insights and all your tips were invaluable for our listeners who are interested in tech and in data science. For our listeners: make sure to follow Cornelius on LinkedIn, and also check out his newsletter, Non-Brand Data. And if you are looking for a coach, definitely go for Cornelius — he will be able to help you. Thank you so much, Cornelius, it was a real pleasure to have you here. Thank you so much, thank you.
So they're going to buy a business, and that'll be the base, and then on top of that base we're going to add other businesses to try to help it grow exponentially faster. And I used 100% bank debt to buy those 23 companies, and the net result is a tremendous amount of shareholder value created. Tesla could have come crashing down if the lenders started saying, you know, enough is enough. Joining us today is Adam Coffey, who brings over 21 years of experience in building businesses, having
served as a CEO for three major companies supported by nine different private equity sponsors. Adam has managed transactions worth over $2.5 billion and advised top Fortune 500 companies. During his time, Adam organized 58 business deals and significantly increased company values, achieving fivefold returns for investors. He grew one company's value from 10 million to over a billion dollars, earning him recognition as one of the most influential leaders by the Orange County Business Journal. Adam is also a best-selling author, a popular speaker, and a mentor to aspiring leaders. His extensive background includes roles in healthcare, manufacturing, and beyond; his diverse skill set also includes being a licensed contractor, a pilot, an army veteran, and a former executive at GE for 10 years. Today we will dive into the proven strategies that aspiring tech entrepreneurs and fresh graduates need to thrive in today's competitive landscape. We will uncover invaluable insights on how to navigate the tech world, cut through the noise, build investor trust and secure funding, as well as forge lasting partnerships. Finally, we will learn how to plan and execute a lucrative exit that maximizes your hard-earned success. The podcast is hosted by Vahe, an experienced software engineer and tech entrepreneur, co-founder of LunarTech, which is on a mission to democratize data science and AI. So without further ado, let's get started. Welcome, Adam, we are excited to have you join us today. Now, Adam, you are a big deal in private equity and in business — you've done deals worth over 2.5
billion, you have also advised top Fortune 500 companies and written bestselling books on business. Adam, could you please share your journey, and how you got to where you are today, with our audience? Happy to, happy to — and hey, by the way, good to see you, good to be here with all your listeners out there. You know, I think for all of
us, life is a journey and a building of a set of experiences that make us who we are. As a young person I served in the US military; service in the military taught me something about discipline, teamwork, leadership. Engineering made me a meticulous planner. I'm a pilot — pilots don't take off unless we know where we're going — so that taught me, as an entrepreneur, to plan an exit from the beginning and always have an exit and a destination in mind. I spent 10 years working for Jack Welch in what I call the Camelot era of GE. GE was the world's largest company, Fortune number one on the Fortune 500 list; the company was growing so fast it was doubling in size every three years, and that really informed my thinking about growth, and GE taught me how to run a business. Then I spent 21 years as a CEO building three different national companies for nine different private equity firms — bought 58 companies, a buy-and-build guy, a turnaround guy — and I've got two and a half billion dollars in CEO exits under my belt. That kind of led me to writing books about how to do this. I wanted to educate; I'm turning 60 here shortly and I wanted to start thinking about legacy and how I could teach the next generation of entrepreneurs and business owners to excel at this game, at this thing that's been so influenced by private equity. That led me to hanging up my CEO cleats a couple of years ago. I started a consulting business, I've got clients all over the globe, I help them with scaling, with doing M&A, teaching them the tricks that the big institutional investors use to create shareholder wealth, and then I help people exit. I work with private equity firms, I work with individuals and founders. I'm having a ball — I work more hours now than I ever did when I was a CEO, so so much for slowing down; I think I've actually sped it up. Awesome. And can you share a few stories on how you acquired new
businesses and grew them and sold them — the top businesses you worked on? Well, usually in my world — again, I've been doing this with large institutional shareholders — they always start with what's called a platform company. They're going to buy a base business, and on top of that base we're going to add other businesses to try to help it grow exponentially faster. If I take my last company as an example: a private equity firm buys a company, it's a platform company, it has 200 million plus in revenue, they buy it with a combination of debt and equity from their fund, and then they bring me in. The company has not done well, so it's time to bring in the guy to turn it around, to fix it, to get it scaling again, and then start doing a buy-and-build. So I then bought 23 companies over a five-year period — eight total in the first hold period, 15 in the second — and started bolting on these other businesses to go from being regional to national, national to international, depending on which company I was building at the time. In addition to growing through M&A, I would also focus on improving the business I started with: usually investing in technology, trying to do my best to increase the revenues and profitability of the base business, and then also a lot of effort around organic growth, to get the business that was underwhelming and not doing really well to grow like it had never grown before organically. So I've learned how to build, I'll say, a very balanced, growth-oriented company, but no question that M&A is the largest component of shareholder value creation. In that example, for those 23 companies we bought, on average I paid five times earnings for each one — they were smaller and plentiful — and I used 100% bank debt to buy those 23 companies, and I used the cash flow of the 23 businesses to service the debt while I was collecting them, buying them. And then when we go to market, we sell, and we sold the first time at a multiple of around 14 times. So things I was buying at five times I'm now selling at 14 times, and the net result is a tremendous amount of shareholder value created. Then you add in the organic growth, you add in the margin improvement, and that's kind of my recipe for the perfect exit. In that case, in my first exit, the three-year period, it was a 4x multiple of invested capital, so shareholders were happy, investors were happy, the management team was thrilled, we made a ton of money, you know, when things go
well. And so, as you know, in the tech industry the opportunity for creating value and wealth is immense, but it's also very competitive. We have many fresh graduates coming straight out of university and trying to make a new business, a new startup, but they have no idea how to do this. So what strategy or what mindset would you recommend to our ambitious tech entrepreneurs who want to get their foot in the door? Yeah, so tech is an entirely different world — you have to get to different concepts, you know, software as a service or a
tech-enabled platform definitely brings a higher valuation, call it at exit. But oftentimes, from a tech startup perspective, I'd say a lot of people out there are trying to create the next best thing, and oftentimes I tell people: instead of trying to create something new that doesn't yet exist, potentially solve an old problem, or put a new spin on something that's already out there. I think too often we try too hard as entrepreneurs to create something new and differentiated that the world's never seen before, and sometimes boring old problems still need help and still need solving; they can be updated and solved in a new, modern fashion. So I think sometimes entrepreneurs overthink complexity. When I'm talking to people about what constitutes a great company, I tell them to think about basic human needs — think about needs versus wants. In a bad economy, a down economy, if my business is focused on needs, I'm not going to get hurt as badly; my revenue streams will still be fairly consistent. But if my product or service is a want, then if I'm laid off, or I'm unemployed, or I'm feeling a pinch from high interest rates, I can slow down, avoid, or completely ignore that spend for an extended period of time until the economy comes back. So we have to be concerned with the cyclicality of the broader economies in the world — we go through up cycles, we go through down cycles, and the world can throw us curveballs like COVID — so we have to be very thoughtful: if we're going to start something, I want it to be needs-based. If my roof is leaking and it's raining outside and water is pouring on my head, I have to fix that whether I'm broke or not. But if I wanted to put fancy new accessories on my big monster truck, and I'm unemployed and I don't have any money, then I just look at the magazine and dream about what I would do — I don't have the money, so I don't do it. It's a discretionary spend. So: needs versus wants. Then we want subscription-based versus, I'll call it, project-based — some type of product that customers are going to pay us a monthly fee for. It's shocking to me, if I go to my credit card statements and look at all the monthly fees I'm paying — for Adobe Acrobat, for Google Cloud, for Apple this or that — I spend a fortune every month on just these recurring, contracted-type charges. And that's also the key to entrepreneurial success: once I find a customer, I want to create a recurring revenue stream. Even in games, people might offer a free game, but there are in-app purchases to help augment it. So if I'm thinking from a tech perspective: needs versus wants, contracted revenue stream versus one-time use or project-based, and then, in a perfect world, low capital expenditure — not a lot of money to further develop or refine a product once it's created. Creating a profile like that leads to high profitability and high free cash flow, and with high free cash flow comes the ability to service a lot of debt, which means buyers who want to use debt as a primary funding source can service a lot of debt because there's a lot of cash flow. So if you can build a business with high free cash flow that's focused on needs, not wants, and has a recurring, contracted revenue stream, you're going to do much, much better. Yeah, that's great
advice. Oftentimes entrepreneurs start working on some kind of new project; they think they are solving a problem in a certain way, but actually they're not solving any new problem at all, and they end up hitting a wall when they talk to an investor who asks, isn't someone else already doing that? Yeah — sometimes it's boring industries, and we solve a problem there, but they're a staple, a mainstay of an economy. We can get a lot better traction when we solve a common problem for common people rather than create a new problem that someone doesn't know they have yet and then have to convince them they need our product to solve it. Yeah, 100%. And with startups —
many startups need human capital, they need some kind of capital to be able to employ new people, to invest in marketing or in other resources. Now, building trust with investors is very important, because they want to know that they are investing in someone who is trustworthy and that they will not only get their money back but also get a multiple return. So what would you advise new people entering the field on building trust with investors? Well, this is the age-old problem and the age-old question, right?
Chicken or the egg: I have no revenue, but I need people. The venture capital investor says, I don't want to give you a bunch of capital that you're going to waste on the come; I need you to prove the concept and prove that you can actually create these revenue streams. So it's a very delicate balance, and it makes startups a very difficult place to be. Oftentimes I ask myself, do I want to build or do I want to buy? I'll look at the existing market and say, look, if I start from scratch I have a very high probability of failure, I have a lot of hurdles I'm going to have to cross, and I might ask myself: is there an existing company that has the existing technology or the existing product that I can buy? As a result I've got a company that has revenue, customers, a history of profitability, and then it's a different game. In the startup world we use things like founders' equity and we try to attract people by telling them how rich they're going to be one day down the road, and that's a hard sell. I've got to tell you, I get contacted constantly by people who want to offer me founders' equity to help them, and you know what? I work for cash flow, I don't work for founders' equity, and when I'm sitting on the boards of companies they give me stock anyway — so I get stock in an existing company with real revenue, real customers, and I get cash flow. So I personally won't work in a tech-type startup world where there's only founders' equity involved. I think we have to be realistic, and we have to profile. Any time I need people, my goal and objective is to hire the best people I can find for the company that I want to be in five years, not the company I am today. Part of my tenets is that I have to pay a fair market wage, but if I can't, because I'm cash-constrained, then the only tool I've got is incentive equity to try to attract people, and then my profile might change. I may not be looking for an established executive who's used to making seven figures a year, because I have no money to pay them; I'm looking for a different profile: a younger person, an up-and-coming person, a person with great skills, but they live in their mother's basement or in an apartment, their cost structure is low, they don't yet have kids, they're not yet married, and as a result I can attract them with the equity potential despite the lack of cash, because their needs are lower. It's like — I'm a seven-figure, eight-figure guy every year, so I don't work for free, I don't work for equity that may or may not pay off in 10 years, I work for cash and equity. So we have to think about the talent we need, where we are going to find it, and how we are going to attract and retain it, and we have to build a profile of the type of person we think would be uniquely qualified to go on this entrepreneurial journey with us, especially when we're cash-constrained in the beginning and just don't have the right level of capital. So I need brilliance on a budget, and I'm going to look for the profile of a person who's got low cash flow needs, where my small, paltry salary will at least cover their basic needs, because they have cheap basic needs but brilliant skills, and they're trying to become, call it, the next tech billionaire or multi-millionaire — they'll believe in the journey and they'll use, call it, sweat equity to get
there. Now, in your experience, how do successful companies balance innovation with sustainable growth? For example, we have a lot of businesses that are innovating, but they keep on innovating in an unsustainable way. A small tech entrepreneur who's trying to create something that has legs, I'll call it — something with the long-lasting ability to build sustainable revenue in the future — at some
point we have to shift entrepreneurial gears and say it's good enough — it's good enough for now — and our focus needs to be scaling. The innovation, the investment — we don't necessarily want to stop, but we do need to throttle back. So if I've gotten to a proof of concept and I'm out in the marketplace, there is a point where we have to be thinking: well, if I continue to spend money I don't have, innovating, innovating, innovating — while this is important, I'm never going to build a sustainable business if I don't also keep my eye on the ball and the fact that my investors need to see a return and I need to create revenue. So as I get out of the gates, out into the market, when I start to see revenue coming in, we really have to drive revenue hard and show sustainable, high levels of revenue growth and the high margins we were hoping for, and we have to demonstrate this. We have a lot of initial effort on, call it, the technology side to innovate and create the product; once we get out and launch, that needs to scale back, and our efforts need to be replaced by focusing on marketing and sales and building the revenue stream. We have to remember that in order to build the best business in the world, it still has to be fed with cash, and investors will eventually run out of patience and pull the rug out from under us if we can't prove that we've got revenue. I think back to Elon Musk in the early days of Tesla, or Jeff Bezos at Amazon — on any given day, Tesla could have come crashing down if the lenders started saying, enough is enough, I'm not loaning you any more, it's time for you to either make money or shut down. He was able to navigate that, as was Jeff when he was building Amazon, but the typical small entrepreneur isn't going to get that kind of treatment; they are not going to be able to sustain innovation and investment on a hope and a prayer if they cannot prove that the money is there. So don't forget that while we may be interested in technologically changing the world, there's the commercial aspect: we've got to make money, and before I worry about making money big, I need to prove to people I can make money small. Once I've got my product to a stage where it's ready for revenue, I need to turn down innovation, turn up marketing, and really focus on driving revenue creation and customer adoption so that I can start generating cash, which will let me go back to innovating at a future time. We have to be balanced. A lot of times entrepreneurs forget the commercial aspect, and the commercial aspect is: we've got to make money. We're so busy innovating we forget that we have to make money, and before long investors get tired of us, because there are a thousand other things for them to invest in; they pull the rug out from under us and we crash and burn. So the best technology on the planet does not guarantee you commercial success; we have to drive commercial success as soon as we're able, in order to prove the sustainability of our business. 100%. And I feel like ego
has some kind of role in that, where an entrepreneur is very convinced, for certain reasons but also because of their ego, that this innovation will only drive growth, although in reality it only hinders their growth. How would you — well, that's why dreamers dream and doers do, you know.
There's what I call the accidental arrogance of success. People get so into their own self-promotion — I'm God's gift, and what I've done is going to change the world. I see those pitches every day from people out there: Adam, my idea plus your wallet equals the best thing the planet has ever seen. And I'm like, first of all, if you're talking to me about money, you don't understand my value, because my value is what's up here, not what's in my wallet. Money is a commodity; there are trillions of it out there looking for investments right now, so all you have to do is know where it is, go get it, treat it well, give it an outsized return, and you'll get funded. Money is a commodity, money is not the issue; people who are focused on "money is my problem" don't understand how money works. So in addition to being, call it, a tech genius, they need to have business acumen and/or partner with someone who understands business — they can be the strange person who's locked in the dark room for 20 hours a day innovating and creating something great, but they still need some business person out there to be the front end. When we get arrogant — and keep in mind most of these people have not created anything yet — if they have an arrogance of success before they're actually generating revenue and building something special, then boy, that's an entrepreneur who's going to have a hard time finding capital and finding money. There's a fine line between arrogance and confidence; we need to be confident, we shouldn't be arrogant. If we're arrogant with no money, arrogant with an idea but no revenue, then investors just simply walk away — that's not an adventure I'm going to back. So we have to be careful about letting the arrogance of our genius cloud our thinking, and ultimately investors see right through that. It's okay to be arrogant if you're the richest man on the planet and you have arrived; when you have an idea and no revenue and you're arrogant with investors, that's not a good recipe for success. And we are almost hitting the time — could you tell us about your services? For example, we have new startups, but they have
no idea how to do business, they don't have the business acumen, so maybe they can talk to you? Yeah, so I do consulting work. People can read my books — they're cheap, I donate my royalties to charity, and all three of my books have been number one bestsellers, so thank you to everybody out there who reads them. I've been on hundreds of podcasts just like this, and I do these freely; if you go to ListenNotes you can find them there. I teach seminars globally — those are relatively low cost — and I'll do boot camps where we spend two to four days together and I really get in-depth about all things around growth, raising capital, and selling businesses. And then I work one-on-one with dozens of entrepreneurs; I have a peer group we call the Chairman Group that I run with my business partner JT Foxx. You can reach out to me on LinkedIn, you can go to my website, AdamECoffey.com. I'd say LinkedIn is where you'll find me the most; I'm most active there, and it's really the only social media platform I'm on. On Twitter I post some things once in a while; I'm not on Instagram or Facebook — there's a fake Adam Coffey out there, believe it or not. I guess you've arrived when there are people imitating you, so on Facebook and Instagram you'll find fake Adam Coffeys trying to take money from you. I'm trying to help people, not bilk them for money. So I'm a consultant, and I do consulting work with all kinds of different people — private equity firms, family offices, etc. So thanks for having me on, I appreciate you and your listeners out there. Good luck, take care of people, and revenue will happen. Thank you, Adam.
The next question is: what is gradient descent? Gradient descent is an optimization algorithm that we use in both machine learning and deep learning to minimize the loss function of our model. This means we iteratively improve the model parameters in order to minimize the cost function and end up with a set of model parameters that optimize the model, so that it produces highly accurate predictions. In order to understand gradient descent, we need to understand what the loss function is (the cost function is another way of referring to the loss function), we need to understand the flow of a neural network and how the training process works, which we have seen in the previous questions, and then we need to understand the idea of iteratively improving the model and
why we are doing that. Let's start from the very beginning. We have just learned that during the training process of a neural network we first do the forward pass, which means we iteratively compute the activations: we take our input data, pass it with the corresponding weight parameters and bias vectors through the hidden layers, activate those neurons using activation functions, and continue through the multiple hidden layers until we end up computing the output for that specific forward pass — the predictions, y-hat.
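As an illustration of the forward pass just described, here is a minimal sketch, assuming NumPy, a single hidden layer with a ReLU activation, and hypothetical toy dimensions; it is not the exact network from the course, just the idea of propagating inputs through weights, biases, and activations.

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z) applied element-wise
    return np.maximum(0.0, z)

def forward_pass(X, W1, b1, W2, b2):
    """One forward pass through a single-hidden-layer network.

    X  : (n_samples, n_features) input data
    W1 : (n_features, n_hidden) hidden-layer weights
    b1 : (n_hidden,) hidden-layer bias vector
    W2 : (n_hidden, 1) output-layer weights
    b2 : (1,) output-layer bias
    """
    z1 = X @ W1 + b1       # pre-activations of the hidden layer
    a1 = relu(z1)          # activate the hidden neurons
    y_hat = a1 @ W2 + b2   # network output (predictions)
    return y_hat

# Hypothetical example: 5 observations, 3 features, 4 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # randomly initialized parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
print(forward_pass(X, W1, b1, W2, b2).shape)    # (5, 1) predictions, one per observation
```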
predictions why hat so once we perform this in the for our very initial iteration of training the neural network we need to have a set of model
parameters that we can start uh training process in the first place so we therefore need to initialize those parameters in our model and we have specifically two type of model
parameters in the neural network we have the weights and we have the bias factors as we have seen in the previous questions so then the question is well how much error are we making if we are
using this specific set of weights and bias vectors cuz those are the parameters that we can change in order to improve the accuracy of our model so
So if we use this very initial version of the model parameters, the weights and the bias vectors, and we compute the output, the y hat, then we need to understand how much error the model is making based on this set of model parameters. That's the loss function: the loss function, or the cost function, measures the average error that we make when we use these weights and bias vectors to perform the predictions. As you already know from machine learning, we have regression-type tasks and classification-type tasks, and based on the problem you are solving you can decide what kind of loss function you will use in order to measure how well your model is doing. The idea behind the neural network training process is that you want to iteratively improve these model parameters, the weights and the bias vectors, such that you end up with the set of best and most optimal weights and bias vectors that results in the smallest amount of error, which means that you came up with a neural network that produces highly accurate predictions, which is our entire goal when using neural networks. Loss functions, if you are dealing with classification-type problems, can be the cross entropy, which is usually the go-to choice for classification tasks, but you can also use the F1 score or the F-beta score, or precision and recall. Besides this, in case you have a regression-type task, you can use the MSE, the RMSE, or the MAE, and those are all ways to measure the performance of your model every time you change your model parameters. We have also seen, as part of the training of a neural network, that there is one fundamental algorithm that we need to use, which we referred to as backpropagation, that we use in order to understand how much change there is in our loss function when we apply a small change in our parameters. This is what we were referring to as gradients, and it comes from mathematics: as part of backprop we compute the first-order partial derivative of the loss function with respect to each of our model parameters in order to understand how much we can change each of those parameters to decrease the loss function. Then the question is how exactly gradient descent performs the optimization.
change each of those parameters in order to decrease our loss function so then the question is how exactly gradient descent is performing the optimization
so the gradient descent is using the entire training data when going through one pass and one iteration as part of the training process so for each update
of the parameters so every time it wants to update the weight factors and the bias factors it is using the entire training data which means that in one go
in one forward pass we are using all the training observations in order to compute our predictions and then compute our loss function and then perform back
propagation compute our first order derivative of the loss function with respect to each of those model parameters and they use that in order to update those parameters so the way that
the GD is performing the optimization and updating the model parameters is taking the output of the back prop which is the first order partial derivative of
the loss function with respect to the moral parameters and then multiplying it by by the Learning rate or the step size and then subtracting this amount from
the original and current model parameters in order to get the updated version of the model parameters so as you can see here this comes from the
previously showcase simple example from neural network and here when we compute the predictions we take the gradients from the back propop and then we are
using this DV which is the first order gradient of the loss function with respect to the weight parameter and then multiply this with the St size the EA
and then we are subtracting this from V which is the current weight parameter in order to get the new updated weight parameter and the same we also do for
our second parameter which is the bias Factor so one thing you can see here is that we are using this step size the learning rate which can be also
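As a rough illustration of this update rule (a minimal sketch, not the exact code used in the course; the synthetic data and variable names are my own assumptions), a full-batch gradient descent step for a simple linear model could look like this:

```python
import numpy as np

# Synthetic regression data (assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

# Initialize the two kinds of parameters: weights and bias
w = np.zeros(3)
b = 0.0
eta = 0.1                               # learning rate / step size

for epoch in range(200):
    # Forward pass on the ENTIRE training data (this is what makes it full-batch GD)
    y_hat = X @ w + b
    error = y_hat - y
    loss = np.mean(error ** 2)          # loss function: mean squared error
    # Gradients of the loss w.r.t. the parameters (first-order partial derivatives)
    dw = 2 * X.T @ error / len(y)
    db = 2 * np.mean(error)
    # Gradient descent update: parameter = parameter - eta * gradient
    w -= eta * dw
    b -= eta * db
```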
One thing you can see here is that we are using this step size, the learning rate, which can be considered a separate topic; we could go into the details behind it, but for now think of the learning rate as the step size that decides how big each step should be when we perform the updates. We know exactly how much change there will be in the loss function when we make a certain change in our parameters, so we know the gradient, and then it's up to us to decide how much of this change we want to apply: do we want to make a big jump or smaller jumps when iteratively improving the model parameters? If we take this learning rate very large, it means we will apply a bigger change, so the algorithm will make a bigger step when moving towards the global optimum, and later on we will also see that it might become problematic when we make too big a jump, especially if those jumps are not accurate. We therefore need to ensure that we optimize this learning rate, which is a hyperparameter, and we can tune it in order to find the best learning rate that will minimize the loss function and optimize our neural network. When it comes to gradient descent, the quality of this algorithm is very high; it is known as a good optimizer because it uses the entire training data when computing the gradients, so performing the backprop, and then takes this to update the model parameters. The gradients that we get based on the entire training data represent the true gradients, so we are not estimating them and not making an estimation error; instead we use the entire training data when calculating those gradients, which means that we have a good optimizer that is able to make accurate steps towards finding the global optimum. Therefore GD is known as a good optimizer, and it is able to find, with higher likelihood, the global optimum of the loss function. The problem of gradient descent is that, because it uses the entire training data every time it updates the model parameters, it is sometimes computationally not feasible or super expensive: taking the entire training data to perform just one update of your model parameters, storing that large data in memory every time, and performing those iterations on this large data means that when you have very large or very complex data, using this algorithm might take hours to optimize, in some cases even days. Therefore GD is known to be a good optimizer, but in some cases it is just not feasible to use because it is just not efficient.
The next question is: what is a loss function and what are the various loss functions used in deep learning? A loss function is used to quantify the amount of overall error that the model is making, whether it's a deep learning model or, in general, a traditional machine learning model. In all these cases we need a way to measure the amount of error the model is making, and in order to do so we make use of this idea of loss functions. A loss function is a way to measure the amount of loss the model is making, which means the amount of overall error the model makes when performing the prediction. We can have a loss when we are dealing with a classification model, and we can have a loss when we are dealing with a regression model; at the end of the day we know that, independent of the type of problem we are solving, we are always going to have errors as part of the predictions. We can never get predictions that are exactly equal to the true values we want to get, therefore we need to know what these errors are and what the overall error of the model is, such that we can know how to adjust our model in order to improve it, so the model makes less loss. Therefore the idea behind optimization techniques such as gradient descent, SGD, or RMSprop is to minimize the loss function, but to be able to do that we first need a proper loss function that measures the overall error the model is making. When it comes to the different examples of loss functions, depending on the type of problem we are dealing with we can use different sorts of loss functions. If we are dealing with a regression problem we can use the mean squared error (MSE), the root mean squared error (RMSE), or the MAE, which is another measure commonly used when evaluating regression-type problems, so as a loss function we can use these different metrics to compute the overall error in the predictions for that specific model type. Here we use as input the actual values of y, which are usually numeric values given that we have a regression-type problem, and the estimated values which come from our machine learning or deep learning model. Once we have, per iteration, the predicted values, we can use these predicted numeric values of y hat and compare them to the actual y that we have as part of our validation set, training set, or testing set in order to compute the amount of loss the model is making. There are always these two sets of input values, the y hat and the y, and then, using the mean squared error, we take the square of each of the errors made during model training, sum them up, and take the average; therefore it's also called the mean of the sum of squared errors. When it comes to classification-type problems, we can use the cross entropy, which is also known as log loss, in order to evaluate the performance of the deep learning model; this is handy when dealing with binary classification. When it comes to other types of loss functions that we can use for classification-type problems, we can use the precision, the recall, and also the F1 score or the F-beta score, which is a more general version of the F1 score for when we know specifically what is more important for us, recall versus precision, whereas in the case of the F1 score we don't know or don't care and simply want a good balance between precision and recall; the F1 score basically gives 50% importance to each of the two.
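As a quick illustration of these regression loss metrics (a minimal NumPy sketch; the y and y_hat arrays are assumed toy values, not from the course):

```python
import numpy as np

# Toy true values and model predictions (assumed for illustration)
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_hat - y
mse = np.mean(errors ** 2)          # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error
mae = np.mean(np.abs(errors))       # mean absolute error

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}")
```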
The next question is: what is cross entropy and why is it preferred as a cost function for classification-type problems? Cross entropy, which is also known as log loss, measures the performance of a classification model whose output is in terms of probabilities, which are values between zero and one. Whenever you are dealing with a classification-type problem, let's say you want to classify whether an image is of a cat or a dog, or a house should be classified as an old house versus a new house, in all those cases you have labels and you want the model to provide a probability for each of those classes per observation, such that the output of your model is, for example, that house A has a 50% probability of being classified as new and a 50% probability of being classified as old, or this image has a 70% probability of being a cat image and a 30% probability of being a dog image. In all those cases you can apply the cross entropy as a loss function, and the binary cross entropy is measured as the negative of the sum of y·log(p) + (1 − y)·log(1 − p), where y is the actual label, so in binary classification this can be, for instance, one or zero, and p is the predicted probability, a value between 0 and 1, with y the corresponding label, let's say label zero when you are dealing with a cat image and label one when you are dealing with a dog image. The mathematical explanation behind this formula is out of the scope of this question so I will not go into those details, but if you are interested make sure to check out the logistic regression model; this is part of my machine learning fundamentals handbook, which explains step by step how we end up with this log-likelihood function, how we go from products to summations after applying the logarithmic function so we get the log odds, and why we then multiply by minus one: because we ideally want to minimize the loss function, and this is the opposite of maximizing the likelihood function. What this shows is that we end up with a value that tells how well the model is performing in terms of classification, so the cross entropy will tell us whether the model is doing a good job classifying observations into the correct class.
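A minimal NumPy sketch of this binary cross-entropy formula, averaged over observations (the label and probability arrays are assumed toy values, not from the course):

```python
import numpy as np

# Toy binary labels and predicted probabilities (assumed for illustration)
y = np.array([0, 1, 1, 0])              # 0 = cat image, 1 = dog image
p = np.array([0.1, 0.8, 0.6, 0.3])      # model's predicted probability of class 1

# Binary cross entropy: -mean( y*log(p) + (1-y)*log(1-p) )
eps = 1e-12                              # small constant to avoid log(0)
bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(f"binary cross entropy = {bce:.4f}")
```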
The next question is: what kind of loss function can we apply when we are dealing with multiclass classification? In this case, when dealing with multiclass classification, we can use the multiclass cross entropy, which is often referred to together with the softmax function. The softmax loss function is a great way to measure the performance of a model that wants to classify observations into one of multiple classes, which means that we are no longer dealing with binary classification but with multiclass classification. One example of such a case is when we want to classify an image as being from a summer theme, a spring theme, or a winter theme; given that we have three different possible classes, we are no longer dealing with binary classification but with multiclass classification, which means that we also need a proper way to measure the performance of the model that will do this classification, and softmax does exactly this. Instead of getting, per observation, two different values which say what the probability is of that observation belonging to class one or class two, we will have a larger vector per observation depending on the number of classes; in this specific example we will end up having three different values, so one vector with three entries per observation, saying what the probability is that this picture is from a winter scene, what the probability is that this observation comes from a summer theme, and thirdly what the probability is that the observation comes from a spring theme. In this way we will have all the classes with their corresponding probabilities. As in the case of the cross entropy, also in the case of the softmax loss, when we have a small value it means that the model is performing a good job in terms of classifying observations into the different classes and we have well-separated classes. One thing to keep in mind when comparing cross entropy and multiclass cross entropy, or the softmax loss, is that we usually use the latter whenever we have more than two classes. You might recall from the introduction of the Transformer model, from the paper "Attention Is All You Need", that as part of the Transformer architecture a softmax layer is also applied, both when we are computing the attention scores and at the end when we want to transform our output into values that make sense and to measure the performance of the Transformer.
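A minimal NumPy sketch of softmax plus multiclass (categorical) cross entropy for the three-season example; the logits and one-hot labels are assumed toy values:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability, then normalize to probabilities
    z = logits - logits.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Toy raw model outputs (logits) for 2 images over 3 classes: winter, summer, spring
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])
probs = softmax(logits)                  # one probability vector per observation

# One-hot true labels: first image is winter, second is summer (assumed)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Multiclass cross entropy: -mean over observations of sum_k y_k * log(p_k)
eps = 1e-12
ce = -np.mean(np.sum(y_true * np.log(probs + eps), axis=1))
print(f"multiclass cross entropy = {ce:.4f}")
```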
The next question is: what is SGD and why is it used in training neural networks? SGD is, like GD, an optimization algorithm used in deep learning in order to optimize the performance of a deep learning model and to find the set of model parameters that will minimize the loss function, by iteratively improving the parameters of the model, including the weight parameters and the bias parameters. The way SGD performs the updates of the model parameters is by using a randomly selected single training observation, or just a few training observations. Unlike GD, which uses the entire training data to update the model parameters in one iteration, SGD uses just a single randomly selected training observation to perform the update. What this basically means is that instead of using the entire training data for each update, SGD makes those updates to the model parameters per training observation. There is also an importance to this random component, the stochastic element in this algorithm, hence the name stochastic gradient descent: SGD randomly samples a single or just a couple of training data points, and using those it performs the forward pass, so it computes the z scores and then the activation scores after applying the activation function, reaches the end of the forward pass where the network computes the output, the y hat, computes the loss, and then performs the backprop only on those few data points, and we get gradients which are no longer the exact gradients. In SGD, given that we are using only a few randomly selected data points or a single data point, instead of having the actual gradients we are estimating the true gradients, because the true gradients are based on the entire training data and in SGD we are using only a few data points for this optimization. What this means is that we get an imperfect estimate of those gradients as part of the backpropagation, so the gradients will contain noise. The result of this is that we make the optimization process much more efficient, because we make those parameter updates very quickly per pass by using only a few data points, and training a neural network on just a few data points is much faster and easier than using the entire training data for a single update. But this comes at the cost of the quality of SGD, because when we use only a few data points to train the model and then compute gradients which are an estimate of the true gradients, these gradients will be very noisy, imperfect, and most likely far off from the actual gradients, which also means that we will make less accurate updates to our model parameters. This means that every time the optimization algorithm tries to find the global optimum and makes those movements per iteration to move one step closer towards that optimum, most of the time it will end up making wrong decisions and picking the wrong direction, given that the gradient is the source of that choice of direction, and every time it will make those oscillations, those movements, which will be very erratic, and it will most of the time end up discovering a local optimum instead of the global optimum. Because every time it uses just a very small part of the training data, it estimates gradients which are noisy, which means that the direction it takes will most likely also be a wrong one, and when you make those wrong moves every time you will start to oscillate. This is exactly what SGD does: it makes those wrong choices when it comes to the direction of the optimization and it ends up discovering a local optimum instead of the global one. Therefore SGD is also known to be a bad optimizer: it is efficient, it is great in terms of convergence time and memory usage, because training the model on a very small amount of data and storing that small data in memory is not computationally or memory heavy, but this comes at the cost of the quality of the optimizer. In the upcoming interview questions we will learn how we can adjust this SGD algorithm in order to improve the quality of this optimization technique.
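As a rough sketch of the difference from the full-batch loop shown earlier (again an assumed toy setup, not the course's own code), an SGD variant updates the parameters from one randomly sampled observation at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy training data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b, eta = np.zeros(3), 0.0, 0.05

for epoch in range(20):
    for _ in range(len(y)):
        i = rng.integers(len(y))              # stochastic part: pick ONE random observation
        x_i, y_i = X[i], y[i]
        y_hat = x_i @ w + b                   # forward pass on a single data point
        error = y_hat - y_i
        dw = 2 * error * x_i                  # noisy estimate of the true gradient
        db = 2 * error
        w -= eta * dw                         # one parameter update per observation
        b -= eta * db
```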
The next question is: why does stochastic gradient descent, SGD, oscillate towards a local minimum? There are a few reasons why this oscillation happens, but first let's discuss what oscillation is. Oscillation is the movement that we have when we're trying to find the global optimum. Whenever we are trying to optimize the algorithm by using an optimization method like GD, SGD, RMSprop, or Adam, we are trying to minimize the loss function, and ideally we want to iteratively change our model parameters so much that we end up with the set of parameters resulting in the minimum, the global minimum of the loss function, not just a local minimum but the global one. The difference between the two is that the local minimum might appear as if it's the minimum of the loss function, but it holds only for a certain area when we are looking at this optimization process, whereas the global optimum really is the actual minimum of the loss function, and that's exactly what we are trying to chase. When we have too many oscillations, which means too many movements while we are trying to find the direction towards that global optimum, this might become problematic, because we are then making too many movements every time, and if those movements are opposite or towards the wrong direction, this will end up resulting in discovering a local optimum instead of the global optimum, something that we are trying to avoid. The oscillations happen much more often in SGD compared to GD, because in the case of GD we use the entire training data in order to compute the gradient, so the partial derivative of the loss function with respect to the parameters of the model, whereas in the case of SGD we learned that we use just some randomly selected single or few training data points in order to compute the gradients and use them to update the model parameters. For SGD this results in too many of these oscillations, because the random subsets that we use are much smaller than the training data and do not contain all the information in the training data, and this means that the gradients that we calculate in each step, when we use entirely different and very small data, can differ significantly: one time we can have one direction, the other time an entirely different direction for our movement in the optimization process, and this huge difference, this variability in the direction because of the huge difference in the gradients, can result in too many of these oscillations, so too much bouncing around when trying to find the right direction towards the global optimum, in this case the minimum of the loss function. That's the first reason, the random subsets. The second reason why in SGD we have too many of those oscillations, those movements, is the step size: the step size, the learning rate, defines how much we need to update the weights and/or the bias parameters, and the magnitude of this update is determined by this learning rate, which then also plays a role in how different these movements will be and how large the jumps will be when we are looking at the oscillations. The third reason why SGD suffers from too many oscillations, which is a bad thing because it will too often result in finding a local optimum instead of the global optimum, is the imperfect estimate: when we compute the gradient of the loss function with respect to the weight parameters or the bias vectors, if this is done on a small sample of the training data then the gradients will be noisy, whereas if we were to use the entire training data, which contains all the information about the relationships between the features and in the data in general, then the gradients would be much less noisy and much more accurate. Because we use gradients based on small data as an estimate of the actual gradients, which are based on the entire training data, this introduces noise, an imperfection when estimating the true gradient, and this imperfection can result in updates that do not always point directly towards the global optimum, which will then cause these oscillations in SGD. So at a higher level I would say that there are three reasons why SGD will have too many of these oscillations: the first one is the random subsets, the second one is the step size, and the third one is definitely the imperfect estimate of the gradient.
The next question is: how is GD different from SGD, so what is the difference between gradient descent and stochastic gradient descent? By now, given that we have gone into so much detail on SGD, I will just give you a higher-level summary of the differences between the two. For this question I would answer by making use of four different factors that cause a difference between GD and SGD: the first factor is the data usage, the second one is the update frequency, the third one is the computational efficiency, and the fourth one is the convergence pattern. Let's go into each of these factors one by one. Gradient descent uses the entire training data when training the model and computing the gradients, and uses these gradients as part of the backpropagation process to update the model parameters. SGD, unlike GD, does not use the entire training data when performing the training process and updating the model parameters in one go; instead, SGD uses just a randomly sampled single observation, or just a couple of training data points, when performing the training and uses the gradients based on those points in order to update the model parameters. That's the data usage, the amount of data that SGD uses versus GD. The second difference is the update frequency: given that GD updates the model parameters based on the entire training data every time, it makes far fewer of these updates compared to SGD, because SGD updates the model parameters very frequently, every time for a single data point or just a couple of training data points, unlike GD, which has to use the entire training data for just one single set of updates. This causes SGD to make those updates much more frequently, using just a very small amount of data. That's the difference in terms of update frequency. Another difference is the computational efficiency: GD is less computationally efficient than SGD because GD has to use the entire training data, make the computations, so the backpropagation, and then update the model parameters based on this entire training data, which can be computationally heavy, especially if you're dealing with very large and very complex data. Unlike GD, SGD is much more efficient and very fast because it uses a very small amount of data to perform the updates, which means it requires less memory to store the data and takes much less time to find a global optimum, or at least what it thinks is the global optimum, so the convergence is much faster in the case of SGD compared to GD, which makes it much more efficient than GD. The final factor that I would mention as part of this question is the convergence pattern: GD is known to be smoother and of higher quality as an optimization algorithm than SGD, and SGD is known to be a bad optimizer. The reason for this is that the efficiency of SGD comes at the cost of its quality in finding the global optimum. SGD makes all these oscillations given that it uses a very small part of the training data when estimating the true gradients, whereas GD uses the entire training data, so it doesn't need to estimate the gradients; it is able to determine the exact gradients. This causes a lot of oscillations in the case of SGD, while in the case of GD we don't have all these oscillations, so the amount of movement the algorithm makes is much smaller. SGD takes much less time to find what it believes is the global optimum, but unfortunately most of the time it confuses the global optimum with a local optimum: SGD ends up making these many movements and discovering a local optimum and confusing it with the global optimum, which is of course not desirable, because we would like to have the actual global optimum, the set of parameters that will actually minimize the loss function and find its minimum value. GD is the opposite: because it uses the true gradients, it is most of the time able to identify the true global optimum.
The next question is: how can we use optimization methods like GD in a more improved way, so how can we improve GD and what is the role of the momentum term? Whenever you hear momentum together with gradient descent, try to automatically think of SGD with momentum, because SGD with momentum is basically the improved version of SGD, and as long as you know the difference between SGD and GD, it will be much easier for you to explain what SGD with momentum is. We just discussed that SGD suffers from oscillations, so too many of those movements, and a lot of the time, because we are using a small amount of training data to estimate the true gradients, this will result in entirely different gradients and too many different sorts of updates in the weights. Of course that's something we want to avoid, because we saw and explained that too many of those movements will end up causing the optimization algorithm to mistakenly confuse the global optimum and a local optimum: it will pick the local optimum and think that it is the global optimum, but that's not the case. To solve this problem and improve the SGD algorithm, while taking into account that SGD in many aspects is much better than GD, we came up with the SGD with momentum algorithm, where SGD with momentum basically takes the benefits of SGD and then also tries to address the biggest disadvantage of SGD, which is this excess of oscillations. The way SGD with momentum does this is that it introduces the idea of momentum. Momentum is basically a way to push the optimization algorithm towards a better direction and reduce the amount of oscillations, so the amount of these random movements, and the way it does that is by adding a fraction of the previous updates we made to the model parameters, which we assume will be a good indication of the more accurate direction at this specific time step. Imagine that we are at time step t and we need to make the update: what momentum does is look at the previous updates and use the more recent updates more heavily, saying that the more recent updates will most likely be a better representation of the direction that we need to take than the very old updates, and when we take these very recent updates in the optimization process into account, we have a better and more accurate way of updating the model parameters.
Let's look at the mathematical representation just for a quick refresher. What SGD with momentum tries to do is accelerate the convergence process and, instead of having too many movements in different directions and too many different gradients and updates, it tries to stabilize this process and have more consistent updates. As part of the momentum method we obtain a momentum term v_{t+1} for the update at time step t+1, which takes the gamma (γ) and multiplies it by v_t, plus the learning rate eta (η) times the gradient, where the nabla symbol (∇) with θ underneath followed by J(θ_t) simply means the gradient of the loss function with respect to the parameter θ. So the momentum term is: v_{t+1} = γ·v_t + η·∇_θ J(θ_t). What this basically says is that we compute the momentum term for time step t+1 based on the previous updates, through the term γ·v_t, plus the common term that we saw before for SGD and GD, which is the learning rate η multiplied by the first-order partial derivative of the loss function with respect to the parameter θ. We then simply subtract this momentum term from our current parameter θ_t in order to obtain the updated version, so θ_{t+1} = θ_t − v_{t+1}, where θ is simply the model parameter. In this way we perform the updates in a more consistent way: we introduce consistency into the direction by weighting the recent adjustments more heavily, and it builds up momentum, hence the name. The momentum builds up speed towards the direction of the global optimum through more consistent gradients, enhancing the movement towards this global optimum, the global minimum of the loss function, and this in turn will of course improve the quality of the optimization algorithm, and we will end up discovering the global optimum rather than a local optimum. To summarize, what SGD with momentum does is take the SGD algorithm, so it again uses a small amount of training data when performing the model parameter updates, but unlike plain SGD it tries to replicate GD's quality when it comes to finding the actual global optimum, and the way it does that is by introducing this momentum term, which helps introduce consistency into the updates and reduce the oscillations the algorithm makes, producing a much smoother path towards discovering the actual global optimum of the loss function.
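A minimal sketch of the momentum update described above, reusing the assumed toy linear-regression setup from the earlier snippets (variable names like gamma and v_w are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b = np.zeros(3), 0.0
eta, gamma = 0.01, 0.9                        # learning rate and momentum coefficient
v_w, v_b = np.zeros(3), 0.0                   # momentum terms, start at zero

for epoch in range(50):
    for _ in range(len(y)):
        i = rng.integers(len(y))              # SGD part: one random observation
        error = (X[i] @ w + b) - y[i]
        dw, db = 2 * error * X[i], 2 * error
        # Momentum term: v_{t+1} = gamma * v_t + eta * gradient
        v_w = gamma * v_w + eta * dw
        v_b = gamma * v_b + eta * db
        # Parameter update: theta_{t+1} = theta_t - v_{t+1}
        w -= v_w
        b -= v_b
```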
The next question is: compare batch gradient descent to mini-batch gradient descent and to stochastic gradient descent. Here we have three different versions of the gradient descent algorithm: the traditional batch gradient descent, often referred to simply as GD; the second algorithm is the mini-batch gradient descent; and the third algorithm is SGD, the stochastic gradient descent. The three algorithms are very close to each other, but they differ in terms of their efficiency and the amount of data they use when performing the model training and the model parameter updates, so let's go through them one by one. The batch gradient descent, this is the original GD: this method involves the traditional approach of using the entire training data for each iteration when computing the gradients, so doing the backprop, and then taking these gradients as an input for the optimization algorithm to perform a single update of the model parameters, then on to the next iteration, again using the entire training data to compute the gradients and update the model parameters. Here, for the batch gradient descent, we are not estimating the true gradients, we are actually computing the gradients, because we have the entire training data. Batch gradient descent, thanks to this quality of using the entire training data, has very high quality, so it's very stable and able to identify the actual global optimum; however, this comes at a cost of efficiency, because the batch gradient descent uses the entire training data, it needs to put this entire training data into memory every time, and it is very slow when performing the optimization, especially when dealing with large and complex data sets. Next we have the other extreme of the batch gradient descent, which is SGD. SGD, unlike GD, and we saw this previously when discussing the previous interview questions, uses stochastically, so randomly, sampled single or just a few training observations in order to perform the training, so computing the gradients, performing the backprop, and then using the optimization to update the model parameters in each iteration, which means that we do not compute the actual gradients but estimate the true gradients, because we use just a small part of the training data. This of course comes at a cost of the quality of the algorithm: although it's efficient to use only a small sample from the training data when doing the backprop and the training, you don't need to store the entire training data in memory but just a very small portion of it, we perform the model updates quickly and find the so-called optimum much quicker compared to GD, but this comes at the cost of the quality of the algorithm, because it starts to make too many of these oscillations due to the noisy gradients, which then ends up confusing the global optimum with a local optimum. Then finally we have our third optimization algorithm, which is the mini-batch gradient descent, and this mini-batch gradient descent is basically the middle ground between the batch gradient descent and the original SGD, the stochastic gradient descent. The way mini-batch works is that it tries to strike a balance between the traditional GD and SGD: it tries to take the advantages of SGD when it comes to efficiency and combine them with the advantages of GD when it comes to stability, consistency of the updates, and finding the actual global optimum. The way it does that is by randomly sampling the training observations into batches, where each batch is much bigger compared to SGD, and it then uses these smaller portions of the training data in each iteration to do the backprop and then update the model parameters. Think of this like k-fold cross validation, where we sample our training data into k different folds, in this case batches, and then use these in order to train the model, and in the case of neural networks to use the mini-batch gradient descent to update the model parameters such as the weights and bias vectors. The three have a lot of similarities but they also have differences, and in this interview question your interviewer is trying to test whether you understand the benefits of each of them and what the purpose of having mini-batch gradient descent is.
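A minimal sketch of the mini-batch variant, again on the assumed toy regression setup; the batch size of 16 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b, eta, batch_size = np.zeros(3), 0.0, 0.05, 16

for epoch in range(50):
    # Shuffle the data, then split it into mini-batches for this epoch
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        error = (X_b @ w + b) - y_b           # forward pass on one mini-batch
        dw = 2 * X_b.T @ error / len(y_b)     # gradient estimated from the batch
        db = 2 * np.mean(error)
        w -= eta * dw                         # one update per mini-batch
        b -= eta * db
```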
The next question is: what is RMSprop and how does it work? We just saw that RMSprop is one of the examples of what can be defined as an adaptive optimization process. RMSprop stands for root mean squared propagation, and it is, like GD, SGD, and SGD with momentum, an optimization algorithm that tries to minimize the loss function of your deep learning model, to find the set of model parameters that will minimize the loss function. What RMSprop does is try to address some of the shortcomings of the traditional gradient descent algorithm, and it is especially useful when we are dealing with the vanishing gradient problem or the exploding gradient problem. We saw before that a very big problem during the training of deep neural networks is this concept of vanishing or exploding gradients: when the gradients start to converge towards zero, so they become very small and almost vanish, or when the gradients are so big that they are exploding, becoming very large, and they result in a large amount of oscillations. To avoid this, what RMSprop does is use an adaptive learning rate: it adjusts the learning rate, and for this process it uses the idea of a running average of the squared gradients, which is loosely related to second-order information and the concept of the Hessian. It also uses a decay parameter, which regulates how much of the average magnitude of the recent gradients we take into account when updating the model parameters, so basically the amount of information that we need to take into account from the recent adjustments. In this case this means that parameters with large gradients will have their effective learning rate reduced, so whenever we have large gradients for a parameter we will be damping its updates, which means that we control the exploding gradient effect, and of course the other way around holds true: in the case of RMSprop, for the parameters that have small gradients, we will be increasing their effective learning rate to ensure that the gradient does not vanish, and in this way we will be controlling and smoothing the process. RMSprop uses this decay rate, the beta, which is a number usually around 0.9, and it controls how quickly the running average forgets the oldest gradients. As you can see, we have this running average v_t which is equal to β·v_{t−1} + (1 − β)·g_t², so this is basically our squared-gradient term. Then what we do is take this running average and use it to adapt and adjust our learning rate: in the second expression, θ_{t+1}, the updated version of the parameter, is equal to θ_t, the current parameter, minus the learning rate divided by the square root of this running average v_t plus some epsilon, which is usually a small number just to ensure that we are not dividing η by zero in case our running average is equal to zero, and then we simply multiply this by our gradient, so θ_{t+1} = θ_t − (η / (√v_t + ε))·g_t. As you can see, depending on the parameter we will have a different effective learning rate, and we will keep updating it. By adapting the learning rate in this way, RMSprop stabilizes the optimization process, prevents all these random movements, these oscillations, and at the same time ensures smoother convergence; it also ensures that our network, especially for deep neural networks, doesn't suffer from the vanishing gradient problem and from the exploding gradient problem, which can be a serious problem when we are trying to optimize our deep neural network.
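A minimal sketch of this RMSprop update on the same assumed toy problem (the beta, eta, and eps values are illustrative defaults, not prescribed by the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b = np.zeros(3), 0.0
eta, beta, eps = 0.01, 0.9, 1e-8              # learning rate, decay rate, small epsilon
v_w, v_b = np.zeros(3), 0.0                   # running averages of squared gradients

for epoch in range(100):
    error = (X @ w + b) - y
    dw = 2 * X.T @ error / len(y)
    db = 2 * np.mean(error)
    # Running average of squared gradients: v_t = beta*v_{t-1} + (1-beta)*g_t^2
    v_w = beta * v_w + (1 - beta) * dw ** 2
    v_b = beta * v_b + (1 - beta) * db ** 2
    # Parameter update with the adapted learning rate: theta -= eta / (sqrt(v) + eps) * g
    w -= eta / (np.sqrt(v_w) + eps) * dw
    b -= eta / (np.sqrt(v_b) + eps) * db
```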
The next question is: what are L2 and L1 regularizations and how do they prevent overfitting in a neural network? Both L1 and L2 regularizations are shrinkage or regularization techniques that are used both in traditional machine learning and in deep learning in order to prevent the model from overfitting, so to make the model more generalizable, like dropout. You might know from traditional machine learning models what L1 and L2 regularization do: L2 regularization is also referred to as Ridge regression and L1 regularization is also referred to as Lasso regression. What L1 regularization does is add a regularization or penalization term that is based on the penalization parameter lambda (λ) multiplied by a term based on the absolute values of the weights. This is different from L2 regularization, the Ridge regularization, which adds to our loss function a regularization term based on lambda, the penalization parameter, multiplied by the square of the weights. You can see how the two are different: one is based on what we call the L1 norm and the other on what we call the L2 norm, hence the names L1 and L2 regularization. Both of them are used with the same motivation, to prevent overfitting. What L1 does differently from L2 is that L1 can set the weight of certain neurons exactly equal to zero, so in some way it also performs feature selection, whereas L2 regularization shrinks the weights towards zero but never sets them exactly equal to zero, so in this respect L2 doesn't perform feature selection and only performs regularization, whereas L1 can be used not only for shrinking the weights and regularizing the network but also for performing feature selection when you have too many features. You might be wondering how this helps to prevent overfitting: well, when you shrink the weights towards zero and regularize these small or large weights, methods such as L1 or L2 regularization will ensure that the model doesn't overfit to the training data. You regularize the weights, and this in turn regularizes the network, because the weights define how much erratic behavior will be prevented: if you have too large weights and you reduce and regularize them, it will ensure that you don't have exploding gradients, it will also ensure that the network doesn't rely heavily on certain neurons, and this will then ensure that your model is not overfitting and not memorizing the training data, which might also include noise and outlier points.
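A minimal NumPy sketch of adding these penalty terms to a loss (the lambda value, weights, and data-loss number are assumed toy values; in practice frameworks such as TensorFlow or scikit-learn expose this as a regularization option):

```python
import numpy as np

# Toy weights of a layer and a toy data loss value (assumptions for illustration)
w = np.array([0.8, -1.2, 0.0, 2.5])
data_loss = 0.42          # e.g. an MSE or cross-entropy value computed elsewhere
lam = 0.01                # penalization parameter lambda

l1_penalty = lam * np.sum(np.abs(w))   # L1 / Lasso: sum of absolute weights
l2_penalty = lam * np.sum(w ** 2)      # L2 / Ridge: sum of squared weights

loss_with_l1 = data_loss + l1_penalty  # minimizing this can push some weights exactly to zero
loss_with_l2 = data_loss + l2_penalty  # minimizing this shrinks weights towards, but not to, zero
print(loss_with_l1, loss_with_l2)
```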
The next question is: what is the curse of dimensionality in machine learning and AI? The curse of dimensionality is a known phenomenon in machine learning, especially when we are dealing with distance-based, neighborhood-based models like KNN or K-means, where we need to compute distances using distance measures such as the Euclidean distance, cosine distance, or Manhattan distance. Whenever we have high-dimensional data, so many features in our data, the model starts to really suffer from the curse of dimensionality: the complexity rises when the model needs to compute these distances between pairs of observations, but given that we have so many features it becomes problematic and sometimes even infeasible to obtain those distances, and calculating these distances in some cases doesn't even make sense, because they no longer reflect the actual pairwise relationship, the distance between those two observations, when we have so many features. That's what we call the curse of dimensionality: we have a curse on our ML or AI model when we have high dimensionality and we want to compute these distances between pairs of observations. This can introduce data sparsity, computational challenges, and a risk of overfitting for our problem, the model becomes less generalizable, and it also becomes a problem in terms of picking a distance measure that can handle this high dimensionality of our data.
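As a small illustration of why distances stop being informative in high dimensions (an assumed NumPy experiment, not from the course), you can compare pairwise Euclidean distances for random points in low versus high dimension and watch the relative spread between the nearest and farthest pairs shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_features, n_points=100):
    # Random points in the unit hypercube of the given dimensionality
    points = rng.random((n_points, n_features))
    # All pairwise Euclidean distances
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    d = dists[np.triu_indices(n_points, k=1)]
    # Relative contrast: how different the closest and farthest pairs are
    return (d.max() - d.min()) / d.mean()

for dim in [2, 10, 100, 1000]:
    print(f"{dim:>4} features: relative distance spread = {distance_spread(dim):.3f}")
```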