AI Foundations Course – Python, Machine Learning, Deep Learning, Data Science
By freeCodeCamp.org
Summary
## Key takeaways
- **AI Talent Shortage**: Over 80% of companies worldwide cannot find data scientists and AI professionals, with demand projected to grow as the data science market exceeds $400 billion. [00:33], [00:55]
- **High AI Salaries**: Data science and AI professionals earn $150,000 to $200,000 in the US, with top machine learning experts exceeding $500,000. [01:15], [01:25]
- **Machine Learning Applications**: Machine learning powers healthcare diagnostics like cancer detection, finance fraud detection, Netflix recommenders, and autonomous vehicles via deep learning. [12:10], [17:01]
- **Core ML Skills Roadmap**: Master linear algebra, calculus, statistics, Python libraries like scikit-learn and TensorFlow, plus top algorithms from linear regression to XGBoost. [17:33], [27:35]
- **Portfolio Projects Essential**: Build recommender systems, regression for salary prediction, spam classification, and customer segmentation using K-means to showcase supervised and unsupervised skills. [38:57], [43:42]
- **Bias-Variance Tradeoff**: Model error decomposes into bias, variance, and irreducible error; complex models have low bias but high variance, simple models the opposite, requiring balance to minimize test error. [01:05:23], [01:11:06]
Topics Covered
- AI Demand Outstrips Supply by 80%
- Recommender Systems Power Netflix Sales
- Master Math Before ML Algorithms
- Bias-Variance Tradeoff Drives Model Choice
- L1 Lasso Shrinks Features to Zero
Full Transcript
learn about machine learning and AI with this comprehensive 11-hour course this is not just a crash course this course covers everything from fundamental concepts to advanced algorithms complete with real-world case studies in recommender systems and predictive analytics this course goes beyond theory to provide hands-on implementation experience career guidance and great insights from industry professionals it also includes a career guide on how to build a data science career launch a startup and prepare for interviews Tatev Aslanyan from LunarTech developed this course over 80% of the companies worldwide are unable to find data scientists and AI professionals to bring their ideas into the market and to become more competitive in the next decade the demand for data science and ML professionals is only going to increase as the market in data science and AI is projected to pass the 400 billion valuation about 5% of all employees worldwide are asking their employers to get training in the field of generative
AI and machine learning and if this didn't convince you to get into this lucrative and highly in-demand industry then let me tell you that the salaries of data science and AI professionals are at the moment about $150,000 up to $200,000 in the US and in some cases the salaries of the most in-demand AI and machine learning professionals can pass $500,000 in the US welcome to this involved crash course in machine learning and data science in these 11 hours you will get a comprehensive overview of machine learning and data science from different perspectives both the theory practice implementation career insights and what you can expect from this career this will be a great course for anyone who wants to become a machine learning engineer or AI engineer so here's what we are going to cover as part of this comprehensive crash course in machine learning so we are going to start with the machine learning roadmap for 2024 here we are going to provide you a structured overview of the machine learning landscape helping you understand what it is like to become a machine learning engineer what you can
expect from this career what exactly you need to learn what kind of skill sets from what kind of Industries and also you are going to see what kind of career
directions you can take in the field of machine learning so how you can get into machine learning how you can kickstart a career and what it is that you need to
learn after that we are going to get into the top machine learning algorithms so here you will learn the most important machine learning algorithms
from linear regression to advanced algorithms like the boosting algorithms of course this won't be a comprehensive machine learning course because it is aimed to provide you the basics and the fundamentals but this would be a great starting point for you to get a taste of what it's like to learn the theory of machine learning you will learn the theory you will also learn the definitions the pros and the cons of these algorithms along with the practical Python implementation so this will be a great way to learn the basics
and to also learn how to implement this in Python of course as a prerequisite for this it is required that you know the basics of Python like how to create lists how to work with scikit-learn or how to create variables so this is important next up we have the hands-on case studies so after learning the
basics in machine learning in terms of the theory and implementation in Python with the real examples you are ready to get into a handson machine learning work
and this won't be just quick case studies that you can complete in 30 minutes but rather those will be three involved different case studies so we will start with the basics like performing a behavior analysis and data analytics which is always a must when it comes to becoming machine learning or AI engineers so you will learn how to perform data analytics how to perform customer segmentation using Python how to perform data wrangling how to do exploratory data analysis all in Python and then to make those important conclusions and tell your data story this is really important as an AI professional to know data science and data analytics so this first case study which is the superstore customer behavior analysis that will be conducted and presented to you by Vahe Aslanyan co-founder of LunarTech will provide you good insights into the basics of machine learning and how to do data analytics and data science in a real-life case study.
In the second case study we will then get more hands-on with machine learning and we will be predicting Californian house prices we will do exploratory data analysis we will use Python to clean the data use statistics to perform outlier detection and data visualization we will also perform causal analysis and we will be using linear regression to perform the predictions by leveraging practical data analytics but this time also data science skills and combining this with Python libraries like scikit-learn and the third case study will be about building a movie recommender system so here we will explore NLP natural language processing another very important topic in the field of AI and machine learning these days and here we'll be using NLP we will also use machine learning and data science tools to develop a recommender algorithm so this project will then enhance your skills in text data analysis how to process this text data how to use Python for doing that as well as practical machine learning applications like building a recommender system keep in mind that you can also put these case studies on your resume to showcase your experience after we are done with these three end-to-end involved case studies we are going to provide you career insights now as a data science and AI professional you have two choices you can either decide to get into the corporate world so
become a data scientist or a professional or you can decide to build your own startup and to provide you information on both of these directions
in the first conversation you will join me and the data science manager from Allianz, Cornelius, where you can learn from him how to break into the field of data science and machine learning especially from a traditional background here you can get a lot of tips on succeeding in this field how to get promoted what to expect from interviews what that selection process is like and much more about a data science and AI corporate career so once we are done with that conversation then we will provide you the next choice which is about building a startup as a machine learning or AI professional so here you can then listen to the conversation between co-founder of LunarTech Vahe Aslanyan and a serial entrepreneur and successful investor Adam Coffey so here Adam Coffey will then provide you a lot of insights on how to launch a startup how to raise funds what to expect from this type of career so once we are done with this career insight as well we will then get into the final part of this course which is about interview preparation we'll conclude with providing you the most
popular machine learning interview questions with the corresponding detailed answers this will be great for anyone who wants to Ace their interviews and who is now preparing for machine
learning or AI interviews this crash course for 11 hours is more than just a short introduction it's an involved comprehensive overview of everything
that you can expect from the world of machine learning and AI if you want to become more hands-on and get the entire comprehensive overview and learn everything in one place to become a job-ready machine learning and AI professional then make sure to check out LunarTech.ai our data science boot camps and many other courses that will provide you that all-in-one approach to become a job-ready professional if you like this video make sure to like subscribe and comment so if you're ready I'm really excited let's get started hi there in this video we are going to talk about how you can get into machine
learning in 2024 first we are going to start with all the skills that you need in order to get into machine learning step by step what are the topics that you need to
cover and what are the topics that you need to study in order to get into machine learning we are going to talk about what is machine learning then we are going to cover step by step what are
the exact topics and the skills that you need in order to become a machine learning researcher or just get into machine learning then we're going to cover the type of exact projects you can
complete so examples of portfolio projects in order to put it on your resume and to start to apply for machine learning related jobs and then we are
going to also talk about the type of Industries that you can get into once you have all the skills and you want to get into machine learning so the exact career path and what kind of business
titles are usually related to machine learning we are also going to talk about the average salary that you can expect for each of those different machine learning related positions at the end of
this video you are going to know what exactly machine learning is where it is used what kind of skills there are that you need in order to get into machine learning in 2024 and what kind of career path with what kind of compensation you can expect with the corresponding business titles when you want to start your career in machine learning I'm Tatev co-founder of LunarTech and I come from an econometrics and statistics background I've been in the tech field and specifically in data science and AI for the last five years working across different data science and AI projects across the globe and now I'm going to tell you what exactly machine learning is and what are the skill sets that you need in order to get into machine learning in 2024 so without further ado let's get started so what is machine learning machine learning is a branch of artificial intelligence or AI that helps to build models based on the data and then learn from this data in order to make different decisions so we will first start with the definition of machine learning what machine learning is and what are the different sorts of applications of machine learning that you most likely have heard of but you didn't know were based on machine learning so machine learning
is a branch of artificial intelligence that is using data in order to learn from this data by using different sorts of algorithms and it's being used across different industries starting from healthcare to entertainment in order to improve the customer experience identify customer behavior improve the sales for the businesses and it also helps governments to make decisions so it really has a wide range of applications so let's start with healthcare for instance machine learning is being used in healthcare to help with the diagnosis of diseases it can help to diagnose cancer during COVID it helped many hospitals to identify whether people were getting more severe effects or getting pneumonia based on those pictures and that was all based on machine learning and specifically computer vision in healthcare it's also being used for drug discovery it's being used for personalized medicine for personalizing treatment plans to improve the operations of the hospitals to understand what is the amount of people and patients that a hospital can expect in each of those days and weeks and also to estimate the amount of doctors that need to be available the amount of people that the hospital can expect in the emergency room based on the day or the time of the day and this is basically another machine
learning application then we have uh machine learning in finance machine learning is being largely used in finance for different applications starting from fraud detection in credit
cards or in other sorts of banking operations it's also being used in trading specifically in combination with quantitative finance to help traders to make decisions whether they need to go short or long on different stocks or bonds or different assets in general to estimate the price that those stocks and assets will have in real time in the most accurate way it's also being used in retail it helps to understand the estimated demand for certain products in certain warehouses it also helps to understand what is the most appropriate or closest warehouse from which the items for that corresponding customer should be shipped so it's optimizing the operations it's also being used to build different recommender systems and search engines like the famous Amazon is doing so every time when you go to Amazon and you are searching for a product you will most likely see many item recommendations and that's based on machine learning because Amazon is gathering the data and comparing your behavior so based on what you have bought based on what you are searching to other customers and those items to other items in order to understand what are the items that you will most likely be interested in and eventually will buy and that's exactly based on machine learning and specifically different sorts of recommender system algorithms and then we have marketing where machine learning is being heavily used because this can help to understand
what are these different tactics and specific targeting groups that you belong to and how retailers can target you in order to reduce their marketing cost and to result in higher conversion rates so to ensure that you buy their product then we have machine learning in autonomous vehicles which is based on machine learning and specifically deep learning applications and then we also have natural language processing which is highly related to the famous ChatGPT I'm sure you are using it and that's based on machine learning and specifically the large language models so the Transformers the large language models where you go and provide your text and then a question and ChatGPT will provide an answer to you or in fact any other virtual assistant or chatbots those are all
based on machine learning and then we have also uh smart home devices so Alexa is based on machine learning also in agriculture uh machine
learning is being used heavily these days to estimate what the weather conditions will be uh to understand what will be the uh production of different
plants uh what will be the um outcome of this uh to understand and to make decisions uh also how they can optimize those uh crop uh yields to monitor for
uh soil health and for different sorts of applications that can just in general uh improve the uh revenue for the farmers then we have of course in the
entertainment so the Vivid example is Netflix that uses the uh data uh that you are providing uh related to the movies and also based on what kind of
movies you are watching Netflix is building this super smart recommender system to recommend you movies that you most likely will be interested
in and you will also like it so in all this machine learning is being used and it's actually super powerful topic and super powerful uh field to get into and
in the upcoming 10 years this is only going to grow so if you have made that decision or you are about to make that decision to get into machine learning continue watching this video because I'm
going to tell you exactly what kind of skills you need and what kind of practical projects you can complete in order to get into machine learning in
2024 so you first need to start with mathematics you also need to know python you also need to know statistics you will need to know machine learning and
you will need to know some NLP to get into machine learning so let's now unpack each of those skill sets so independent of the type of machine learning you are going to do you need to know mathematics and specifically you need to know linear algebra so you need to know what is matrix multiplication what are the vectors matrices the dot product you need to know how you can multiply those different matrices a matrix with a vector what are these different rules the dimensions also what it means to take the transpose of a matrix the inverse of a matrix the identity matrix the diagonal matrix those are all concepts as part of linear algebra that you need to know as part of your mathematical skill set in order to understand those different machine learning algorithms.
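As a quick, minimal sketch of these linear algebra operations in Python (using NumPy; the matrix and vector values below are made up purely for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])   # a 2x2 matrix
B = np.array([[1.0, 4.0], [2.0, 5.0]])
v = np.array([1.0, 2.0])                 # a vector

print(A @ B)                 # matrix multiplication (inner dimensions must match)
print(A @ v)                 # matrix-vector multiplication
print(np.dot(v, v))          # dot product of two vectors
print(A.T)                   # transpose of a matrix
print(np.linalg.inv(A))      # inverse of a (non-singular) matrix
print(np.eye(2))             # identity matrix
print(np.diag([1.0, 3.0]))   # diagonal matrix
```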
Then as part of your mathematics you also need to know calculus and specifically differential theory so you need to know these different theorems such as the chain rule the rule of differentiating when you have a sum of instances when you have a constant multiplied with an instance when you have a sum but also subtraction division multiplication of two items and then you need to take the derivative of that what is this idea of a derivative what is the idea of a partial derivative what is the idea of the Hessian so first-order derivative second-order derivative and it would be also great to know basic integration theory so we have differentiation and the opposite of it is integration theory so this is kind of basic you don't need to know too much when it comes to calculus but those are basic things that you need to know in
order to succeed in machine learning then the next concepts such as discrete mathematics so you need to know what is this idea of graph theory what are combinations and combinatorics what is this idea of complexity which is important when you want to become a machine learning engineer because you need to understand what is this big O notation so you need to understand what is the complexity of n squared the complexity of n the complexity of n log n and beyond that you need to know some basic mathematics which usually comes from high school so you need to know multiplication division you need to understand multiplying amounts which are within parentheses you need to understand different symbols that represent mathematical values you need to know this idea of using x and y and then what is x squared what is y squared what is x to the power of 3 so different exponents of the different variables then you need to know what is a logarithm what is the logarithm at base 2 what is the logarithm at base e and then at base 10 what is the idea of e what is the idea of pi what is this idea of the exponent and the logarithm and how those transform when it comes to taking the derivative of the logarithm or taking the derivative of the exponent those are all values and topics that are actually quite basic they might sound complicated but they are actually not so if someone explains
it to you clearly then you will definitely understand it from the first go and for this to understand all those different mathematical concepts so linear algebra calculus differential theory and then discrete mathematics and those different symbols you need to go for instance and look for courses or YouTube tutorials that are about basic mathematics for machine learning and AI don't go and look further you can check for instance Khan Academy which is quite a favorite when it comes to learning math both for uni students and also for just people who want to learn mathematics and this will be your guide or you can check our resources at LunarTech.ai because we are also going to provide these resources for you in case you want to learn mathematics for your machine learning journey the next skill set that you need to gain in order
to break into machine learning is the statistics so you need to know this is a must statistics if you want to get into machine learning and in AI in general so
there are few topics that you must um study when it comes comes to statistics and uh those are descriptive statistics multivariate statistics inferential
statistics probability distribution and some bial thinking so let's start with descriptive statistics when it comes to descriptive statistics you need to know what is side
of mean uh median standard deviation variance and uh just in general how you can uh analyze the data with using this
descriptive measure me so distance measures but also variational measures then the next topic area that you need to know as part of your statistical
journey is the inferential statistics so you need to know those Infamous theories such as Central limit theorem the law of
uh large numbers uh and how you can um relate to this idea of population sample unbias sample and also u a hypothesis
testing confidence interval statistical sign ific an uh and uh how you can test different theories by using uh this idea of statistical significance uh what is
the power of the test what is type one error what is type two error so uh this is super important for understanding different SS of machine learning applications if you want to get into
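To make the hypothesis testing ideas a bit more concrete, here is a minimal sketch of a two-sample t-test in Python with SciPy; the two samples are simulated, so the numbers are only illustrative and not part of the course material:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# two simulated samples, e.g. a control group and a treatment group
control = rng.normal(loc=100, scale=15, size=200)
treatment = rng.normal(loc=104, scale=15, size=200)

# two-sample t-test: the null hypothesis says the population means are equal
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# alpha is the type I error rate we accept (statistical significance level)
alpha = 0.05
print("statistically significant" if p_value < alpha else "not significant")
```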
Then you have probability distributions and this idea of probabilities so to understand those different machine learning concepts you need to know what probabilities are so what is this idea of probability what is this idea of sample versus population what does it mean to estimate a probability what are those different rules of probability so conditional probability and those probability values and rules that you can usually apply when you have a probability of products or a probability of sums and then you need to know some popular probability distribution functions and those are the Bernoulli distribution the binomial distribution the normal distribution the uniform distribution the exponential distribution so those are all super important distributions that you need to know in order to understand this idea of normality and normalization also this idea of Bernoulli trials and relating different probability distributions to different higher-level statistical concepts so rolling a die the probability of it how it is related to the Bernoulli distribution or to the binomial distribution and those are super important when it comes to hypothesis testing but also for many other machine learning applications.
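Here is a small, optional sketch of how a few of these distributions can be explored in Python with scipy.stats; the parameters are arbitrary examples:

```python
from scipy import stats

# Bernoulli: a single trial with success probability p
print(stats.bernoulli.pmf(1, p=0.5))        # P(X = 1) = 0.5

# Binomial: number of sixes in 10 rolls of a fair die
print(stats.binom.pmf(2, n=10, p=1/6))      # P(exactly two sixes)

# Normal: density and cumulative probability
print(stats.norm.pdf(0, loc=0, scale=1))    # standard normal density at 0
print(stats.norm.cdf(1.96))                 # roughly 0.975

# Uniform on [0, 1] and Exponential with rate 1 (scale = 1/rate)
print(stats.uniform.rvs(size=3))            # three random draws
print(stats.expon.mean(scale=1.0))          # expected value = 1.0
```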
So then we have Bayesian thinking this is super important when it comes to more advanced machine learning but also some basic machine learning you need to know what is the Bayes theorem which arguably is one of the most popular statistical theorems out there comparable also to the central limit theorem you need to know what is conditional probability what is this Bayes theorem and how it relates to conditional probability what is this Bayesian statistics idea at a very high level you don't need to know everything in super detail but you need to know these concepts at least at a higher level in order to understand machine learning so to learn statistics and the fundamental concepts of statistics you can check out the Fundamentals of Statistics course at LunarTech.ai here you can learn all the required concepts and topics and you can practice them in order to get into machine learning and to gain the statistical
skills the next skill set that you must know is the fundamentals to machine learning so this covers not only the basics of machine learning but also the
most popular machine learning algorithms so you need to know the different mathematical sides of these algorithms step by step how they work what are the benefits of them what are the downsides and which one to use for what type of applications so you need to know this categorization of supervised versus unsupervised versus semi-supervised then you need to know what is the idea of classification regression or clustering then you need to know also time series analysis you also need to know these different popular algorithms including linear regression also logistic regression LDA so linear discriminant analysis you need to know KNN you need to know decision trees both the classification and regression case you need to know random forest bagging but also boosting so popular boosting algorithms like LightGBM GBM so gradient boosting models and you need to know XGBoost you also need to know some unsupervised learning algorithms such as K-means usually used for clustering you need to know DBSCAN which is becoming more and more popular among clustering algorithms you also need to know hierarchical clustering and for all these types of models you need to understand the idea behind them what are the advantages and disadvantages whether they can be applied for unsupervised versus supervised versus semi-supervised you need to know whether they are for regression classification or for clustering besides these popular
algorithms and models you also need to know the basics of training a machine learning model so you need to know the process behind training validating and testing your machine learning algorithms so you need to know what it means to perform hyperparameter tuning what are those different optimization algorithms that can be used to optimize your parameters such as GD SGD SGD with momentum Adam and AdamW.
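To give an intuition for what these optimizers do at their core, here is a minimal sketch of plain gradient descent fitting a simple linear regression in NumPy; the data is synthetic and the learning rate is an arbitrary choice, so this is for intuition only rather than how you would train models in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 5.0 + rng.normal(0, 1, size=100)   # true slope 3, intercept 5

w, b = 0.0, 0.0          # parameters to learn
lr = 0.01                # learning rate (a hyperparameter)

for step in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    # gradients of the mean squared error loss with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w     # gradient descent update
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # should be close to 3 and 5
```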
You also need to know the testing process this idea of splitting the data into train validation and then test you need to know resampling techniques why they are used including bootstrapping and cross-validation and the different sorts of cross-validation techniques such as leave-one-out cross-validation k-fold cross-validation the validation set approach you also need to know this idea of metrics and how you can use different metrics to evaluate your machine learning models such as classification types of metrics like F1 score F-beta precision recall cross-entropy and also you need to know some metrics that can be used to evaluate regression types of problems like the mean squared error so MSE the root mean squared error or RMSE the MAE so the absolute version of those different sorts of errors or the
residual sum of squares for all these cases you not only need to know at a higher level what those algorithms or those topics or concepts are doing but you actually need to know the mathematics behind them their benefits and their disadvantages because during the interviews you can definitely expect questions that will test not only your higher-level understanding but also this background knowledge if you want to learn machine learning and you want to gain those skills then feel free to check out my Fundamentals of Machine Learning course which is part of the Ultimate Data Science Boot Camp at LunarTech.ai or you can also check out and download for free the fundamentals of machine learning handbook that I published with freeCodeCamp then the next skill set that you definitely need
to gain is knowledge of Python Python is actually one of the most popular programming languages out there and it's being used by software engineers AI engineers machine learning engineers data scientists so this is the universal language I would say when it comes to programming so if you're considering getting into machine learning in 2024 then Python will be your friend so knowing the theory is one thing then implementing it in the actual job is another and that's exactly where Python comes in handy so you need to know Python in order to perform descriptive statistics in order to train machine learning models or more advanced machine learning models or deep learning models you can use it for training validation and testing of your models and also for building different sorts of applications so Python is super powerful and therefore it's also gaining such high popularity across the globe because it has so many libraries it has TensorFlow and PyTorch both of which are a must if you want to get not only into machine learning but also the advanced levels of machine learning so if you are considering AI engineering jobs or machine learning engineering jobs and you want to train for instance deep learning models or you want to build large language models or generative AI models then you definitely need to learn PyTorch and TensorFlow which are frameworks that are used in order to implement different deep learning models which are advanced machine learning models here are a few libraries that you need to know in order to get into machine learning so you definitely need to know pandas NumPy you need to know scikit-learn SciPy you also need to know NLTK for text data you also need to know TensorFlow and PyTorch for a bit more advanced machine learning and besides this there are also data visualization libraries that I would definitely suggest you practice with which are matplotlib and specifically pyplot and also seaborn
when it comes to Python besides knowing how to use libraries you also need to know some basic data structures so you need to know what these variables are how you can create variables what are the matrices and arrays how indexing works and also what are the lists what are the sets so unique lists what are the different operations you can perform how does sorting for instance work I would definitely suggest you know some basic data structures and algorithms such as binary sort so an optimal way to sort your arrays you also need to know data processing in Python so you need to understand how to identify missing data how to identify duplicates in your data how to clean this how to perform feature engineering so how to combine multiple variables or to perform operations to create new variables you also need to know how you can aggregate your data how you can filter your data how you can sort your data and of course you also need to know how you can perform A/B testing in Python and how you can train machine learning models how you can test them and how you can evaluate them and also visualize their performance.
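Here is a minimal sketch of what that data processing work can look like in pandas; the tiny DataFrame is invented purely for illustration:

```python
import pandas as pd

# a tiny made-up dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "age": [25, 32, 32, None, 41],
    "spend": [120.0, 300.0, 300.0, 80.0, 150.0],
})

print(df.isna().sum())                              # identify missing data per column
df = df.drop_duplicates()                           # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())    # impute missing ages

# feature engineering: combine columns into a new variable
df["spend_per_year_of_age"] = df["spend"] / df["age"]

# aggregating, filtering and sorting
print(df.groupby("customer")["spend"].sum())        # aggregate spend per customer
print(df[df["spend"] > 100])                        # filter rows
print(df.sort_values("spend", ascending=False))     # sort by spend
```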
If you want to learn Python then the easiest thing you can do is just to Google for Python for data science or Python for machine learning tutorials or blogs or you can even try out the Python for Data Science course at LunarTech.ai in order to learn all these basics and the usage of these libraries and some practical examples when it comes to Python for machine learning the next skill set that you need to gain in order to get into
machine learning is the basic introduction to NLP natural language processing so you need to know how to work with text Data given that these
days the text data is the cornerstone of all these different advanced algorithms such as GPTs Transformers the attention mechanisms so those applications that you see as part of building a chatbot or these personalized applications based on text data they are all based on NLP so therefore you need to know these basics of NLP to just get started with machine learning so you need to know this idea of text data what are those strings how you can clean text data so how you can clean that dirty data that you get and what are the steps involved such as lowercasing removing punctuation tokenization also what is this idea of stemming lemmatization stop words and how you can use NLTK in Python in order to perform this cleaning.
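A minimal sketch of that cleaning pipeline with NLTK follows; the sample sentence is made up, and the nltk.download calls fetch the corpora this sketch assumes are available:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of the resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The movies were GREAT, and the actors' performances amazed everyone!"

tokens = nltk.word_tokenize(text.lower())                            # lowercasing + tokenization
tokens = [t for t in tokens if t not in string.punctuation]          # remove punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # remove stop words

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])           # stemming: "movies" -> "movi"
print([lemmatizer.lemmatize(t) for t in tokens])   # lemmatization: "movies" -> "movie"
```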
You also need to know this idea of embeddings and you can also learn this idea of TF-IDF which is a basic NLP algorithm you can also learn this idea of word embeddings subword embeddings and character embeddings if you want to learn the basics of NLP you can check out those concepts and learn them as part of blogs there are many tutorials on YouTube you can also try the Introduction to NLP course at LunarTech.ai in order to learn these different basics that form NLP if you want to go beyond this intro-to-medium level of machine learning and you also want
to learn a bit more advanced machine learning and this is something that you need to know after you have gained all these previous skills that I mentioned then you can gain this knowledge and skill set by learning deep learning and you can also consider getting into generative AI topics so you can for instance learn what are RNNs what are ANNs what are CNNs you can learn what is this autoencoder concept what are variational autoencoders what are generative adversarial networks so GANs you can understand what is this idea of reconstruction error you can understand these different sorts of neural networks what is this idea of backpropagation the optimization of these algorithms by using the different optimization algorithms such as GD SGD SGD with momentum Adam AdamW RMSprop you can also go one step beyond and get into generative AI topics such as the variational autoencoders like I just mentioned but also the large language models so if you want to move towards the NLP side of generative AI and you want to know how ChatGPT has been invented how the GPTs work or the BERT model then you will definitely need to get into this topic of language models so what are the n-grams what is the attention mechanism what is the difference between self-attention and attention what is a one-head self-attention mechanism what is a multi-head self-attention mechanism you also need to know at a high level this encoder-decoder architecture of Transformers so you need to know the architecture of Transformers and how they solve different problems of recurrent neural networks or RNNs and LSTMs you can also look into encoder-based or decoder-based algorithms such as GPTs or the BERT model and those all will help you to not only get into machine learning but also stand out from all the other candidates by having this advanced knowledge let's now talk about different sorts of projects that you can complete in order to train
your machine learning skill set that you just learned uh so there are few projects that I suggest you to complete and you can put this on your resume to start to apply for machine learning
roles the first application and project that I would suggest you to do is building a basic recommender system whether it's a job recommender system or a movie recommender system in this way you can showcase how you can use for instance text data from those job advertisements or how you can use numeric data such as the ratings of the movies in order to build a top-N recommender system this will showcase your understanding of distance measures such as cosine similarity the KNN algorithm idea and this will help you to tackle this specific area of data science and machine learning.
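A minimal sketch of such a content-based recommender with scikit-learn, using TF-IDF text features and cosine similarity; the movie titles and plot descriptions are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# tiny made-up catalogue: title -> plot description
movies = {
    "Space Quest": "astronauts explore a distant planet and fight aliens",
    "Galaxy Wars": "space battles between alien fleets and brave astronauts",
    "Love in Paris": "a romantic story about two artists in Paris",
    "Kitchen Dreams": "a young chef opens a restaurant in Paris",
}

titles = list(movies.keys())
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(movies.values())      # TF-IDF representation of each plot

# cosine similarity between every pair of movies
sim = cosine_similarity(matrix)

# recommend the most similar movie to "Space Quest" (excluding itself)
query = titles.index("Space Quest")
scores = sorted(enumerate(sim[query]), key=lambda x: x[1], reverse=True)
best = next(i for i, s in scores if i != query)
print("Because you liked Space Quest, try:", titles[best])   # likely "Galaxy Wars"
```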
The next project I would suggest you to do will be to build a regression-based model so in this way you will showcase that you understand this idea of regression how to work with predictive analytics and a predictive model that has a dependent variable a response variable that is in numeric format so here for instance you can estimate the salaries of jobs based on the characteristics of the job based on data which you can get for instance from open-source web pages such as Kaggle and you can then use different sorts of regression algorithms to perform your predictions of the salaries evaluate the model and then compare the performance of these different machine learning regression-based algorithms for instance you can use linear regression you can use the regression version of decision trees you can use random forest you can use GBM XGBoost in order to showcase and then in one graph compare the performance of these different algorithms by using a single regression ML model metric for instance the RMSE this project will showcase that you understand how you can train a regression model how you can test it and validate it and it will showcase your understanding of the optimization of these regression algorithms and that you understand the concept of hyperparameter tuning.
The next project that I would suggest you to do in order to showcase your classification knowledge so when it comes to predicting a class for an observation given the feature space would be to build a classification model that would classify emails as being spam or not spam so you can use publicly available data that will be describing a specific email and then you will have multiple emails and the idea is to build a machine learning model that would classify the email to class zero and class one where class zero for instance can be not spam and class one being spam so with this binary classification you will showcase that you know how to train a machine learning model for classification purposes and you can here use for instance logistic regression you can also use the decision tree for the classification case you can also use random forest XGBoost for classification GBM for classification and with all these models you can then obtain the performance metrics such as the F1 score or you can plot the ROC curve or the area under the curve metric and you can also compare those different classification models so in this way you will also tackle another area of expertise when it comes to machine learning.
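A minimal sketch of the spam classifier with scikit-learn; the handful of example emails is invented, whereas the real project would use a public spam dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

emails = [
    "win a free prize now, click here",
    "cheap loans, limited offer, act now",
    "meeting rescheduled to monday at 10am",
    "please find the quarterly report attached",
    "congratulations you won a free lottery ticket",
    "can you review my pull request today",
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(emails)   # bag-of-words features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0, stratify=labels)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
preds = clf.predict(X_test)

print("F1 score:", f1_score(y_test, preds))
print("ROC AUC :", roc_auc_score(y_test, probs))
```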
Then the final project that I would suggest you to do would be from unsupervised learning to showcase another area of expertise and here you can for instance use data to segment your customers into good better and best customers based on their transaction history the amount of money that they are spending in the store so in this case you can for instance use K-means DBSCAN hierarchical clustering and then you can evaluate your clustering algorithms and then select the one that performs the best so you will then in this case cover yet another area of machine learning which would be super important to showcase that you can not only handle recommender systems or supervised learning but also unsupervised learning.
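A minimal sketch of that segmentation idea with scikit-learn, comparing K-means, DBSCAN, and hierarchical clustering by silhouette score on simulated customer features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# simulated customers: total spend and number of transactions (three rough groups)
rng = np.random.default_rng(7)
spend = np.concatenate([rng.normal(100, 20, 100), rng.normal(500, 50, 100), rng.normal(1500, 100, 100)])
orders = np.concatenate([rng.normal(3, 1, 100), rng.normal(12, 2, 100), rng.normal(30, 4, 100)])
X = StandardScaler().fit_transform(np.column_stack([spend, orders]))

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
}

# pick the algorithm whose clusters get the best silhouette score
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:          # silhouette needs at least two clusters
        print(name, round(silhouette_score(X, labels), 3))
```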
And the reason why I suggest you to cover all these different areas and complete these four different projects is because in this way you will be covering different expertise and areas of machine learning so you will also be putting projects on your resume that cover different sorts of algorithms different sorts of metrics and approaches and it will showcase that you actually know a lot about machine learning now if you want to go beyond the basic or medium level and you want to be considered for medium or advanced machine learning levels and positions you also need to know a bit more which means that you need to complete more advanced projects for instance
if you want to apply for generative AI related or large language model related positions I would suggest you to complete a project where you are building a very basic large language model and specifically the pre-training process which is the most difficult one so in this case for instance you can build a baby GPT and I'll put a link here that you can follow where I'm building a baby GPT a basic pre-trained GPT algorithm where I am using text data publicly available data in order to process data in the same way that GPT is doing along with the encoder part of the Transformers in this way you will showcase to your hiring managers that you understand this architecture behind Transformers the architecture behind the large language models and the GPTs and you understand how you can use PyTorch in Python in order to do this advanced NLP and generative AI task.
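As a flavor of what such a project involves, here is a minimal sketch of a single masked self-attention head in PyTorch, the basic building block behind GPT-style models; the dimensions are arbitrary and this is only an illustrative fragment, not the baby GPT project itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, as used in GPT-style models."""
    def __init__(self, embed_dim, head_dim, block_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)
        # causal mask so position t can only attend to positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v

# toy usage: batch of 4 sequences, 8 tokens each, 32-dim embeddings
x = torch.randn(4, 8, 32)
head = SelfAttentionHead(embed_dim=32, head_dim=16, block_size=8)
print(head(x).shape)  # torch.Size([4, 8, 16])
```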
And finally let's now talk about the common career path and the business titles that you can expect from a career in machine learning so assuming that you have gained all the skills that are a must for breaking into machine learning there are different sorts of business titles that you can apply for in order to get into machine learning so when it comes to machine learning there are different fields that are covered as part of this so first we have the general machine learning researcher a machine learning researcher is basically doing research so training testing evaluating different machine learning algorithms they are usually people who come from an academic background but it doesn't mean that you cannot get into machine learning research without a degree in statistics mathematics or machine learning specifically not at all so if you have this desire and this passion for reading and doing research and you don't mind reading research papers then a machine learning researcher job would be a good fit for you so machine learning combined with research then sets you up for the machine learning researcher role then we have the machine learning engineer so the machine learning engineer is
the engineering version of the machine learning expertise which means that we are combining machine learning skills with engineering skills such as productionizing pipelines building robust pipelines scalability of the model considering all these different aspects of the model not only from the performance side when it comes to the quality of the algorithm but also its scalability when putting it in front of many users so when it comes to combining engineering with machine learning then you get machine learning engineering so if you are someone who is a software engineer and you want to get into machine learning then machine learning engineering would be the best fit for you so for machine learning engineering you not only need to have all these different skills that I already mentioned but you also need to have a good grasp of the scalability of algorithms the data structures and algorithms type of skill set the complexity of the model also system design so this one converges more towards and is similar to the software engineering position combined with machine learning rather
than your pure machine learning or AI role then we have the AI research versus AI engineering positions so the AI research position is similar to the machine learning research position and the AI engineer position is similar to the machine learning engineer position with only a single difference when it comes to machine learning we are specifically talking about traditional machine learning so linear regression logistic regression and also random forest XGBoost bagging and when it comes to the AI research and AI engineer positions here we are tackling the more advanced machine learning so here we are talking about deep learning models such as RNNs LSTMs GRUs CNNs or computer vision applications and we are also talking about generative AI models large language models so we are talking about the Transformers the implementation of Transformers the GPTs T5 all these different algorithms that are from more advanced AI topics rather than traditional machine learning for those you will then be applying for AI research and AI engineering positions and finally you have these different sorts of specializations niches within AI for instance NLP researcher and NLP engineer or even data science positions for which you will need to know machine learning and knowing machine learning will set you apart for those sorts of positions so also the business titles such as data scientist or technical data science positions NLP researcher NLP engineer for all of these you will need to know machine learning and knowing machine learning will help you to break into
those positions and those career paths in this lecture we will go through the basic concepts in machine learning that are needed to understand and follow conversations and solve main problems using machine learning a strong understanding of machine learning basics is an important step for anyone looking to learn more about or work with machine learning we'll be looking at three core concepts in this tutorial we will define and look into the difference between supervised and unsupervised machine learning models then we will look into the difference between the regression
and classification type of machine learning models after this we will look into the process of training machine learning models from scratch and how to evaluate them by introducing performance
metrics what you can use depending on the type of machine learning model or problem you are dealing with so whether it's a supervised or unsupervised whether it's regression versus
classification type of problem machine learning methods are categorized into two types depending on the existence of the label data in the training data set which is especially
important in the training process so we are talking about the so-called dependent variable that we saw in the section on the fundamentals of statistics supervised and unsupervised machine learning models are two main types of machine learning algorithms one key difference between the two is the level of supervision during the training phase supervised machine learning algorithms are guided by the labeled examples while unsupervised algorithms are not a supervised learning model is more reliable but it also requires a larger amount of labeled data which can be time-consuming and quite expensive to obtain examples of supervised machine learning models include regression and classification types of models on the other hand unsupervised machine learning algorithms are trained on unlabeled data the model must find patterns and relationships in the data without the guidance of correct outputs so we no longer have a dependent variable so unsupervised ML models require training data that consists only of independent variables or the features and there is no dependent variable or label data that can supervise the algorithm when learning from the data examples of unsupervised models are clustering models and outlier detection techniques supervised machine learning methods are categorized into two types depending on the type of dependent
variable they are predicting so we have the regression type and we have the classification type some key differences between regression and classification include the output type the evaluation metrics and their application with regard to the output type regression algorithms predict continuous values while classification algorithms predict categorical values with regard to the evaluation metrics different evaluation metrics are being used for regression and classification tasks for example mean squared error is commonly used to evaluate regression models while accuracy is commonly used to evaluate classification models when it comes to applications regression and classification models are used in entirely different types of applications regression models are often used for prediction tasks while classification models are used for decision-making tasks regression algorithms are used to predict a continuous value such as a price or probability for example a regression model might be used to predict the price of a house based on its size location or other features examples of regression types of machine learning models are linear regression fixed effects regression XGBoost regression etc classification algorithms on the other hand are used to predict categorical values these algorithms take an input and classify it into one of several predetermined categories for example a classification model might be used to classify emails as spam or as not spam or to identify the type of an object in an image examples of classification types of machine learning models are logistic regression XGBoost classification random forest classification let us now look into
different types of performance metrics we can use in order to evaluate different types of machine learning models for regression models common evaluation metrics include the residual sum of squares which is the RSS the mean squared error which is the MSE the root mean squared error or RMSE and the mean absolute error which is the MAE these metrics measure the difference between the predicted values and the true values with a lower value indicating a better fit for the model so let's go through these metrics one by one the first one is the RSS or the residual sum of squares this is a metric commonly used in the setting of linear regression when we are evaluating the performance of the model in estimating the different coefficients and here the beta is a coefficient the y_i is our dependent variable value and the y_hat is the predicted value as you can see the RSS or the residual sum of squares of beta is equal to the sum of the squares of (y_i minus y_hat_i) across all i from 1 up to n where i is the index of each row or individual or observation included in the data the second metric is the MSE or the mean squared error which is the average of the squared differences between the predicted values and the true values so as you can see the MSE is equal to 1/n times the sum across all i of (y_i minus y_hat_i) squared as you can see the RSS and the MSE are quite similar in terms of their formulas the only difference is that we are adding a 1/n and this makes it the average across all the squared differences between the predicted value and the actual true value a lower value of MSE indicates a better fit the RMSE which is the root mean squared error is the square root of the MSE so as you can see it has the same formula as the MSE only with the difference that we are adding a square root on top of that formula a lower value of RMSE indicates a better fit and finally the MAE or the mean absolute error is the average absolute difference between the predicted values so the y_hat and the true values or y_i a lower value of this indicates a better fit the choice of a regression metric depends on the specific problem you are trying to solve and the nature of your data for instance the MSE is commonly used when you want to penalize large errors more than the small ones MSE is sensitive to outliers which means that it may not be the best choice when your data contains many outliers or extreme values the RMSE on the other hand which is the square root of the MSE is easier to interpret so it's more easily interpretable because it's in the same units as the target variable it is commonly used when you want to compare the performance of different models or when you want to report the error in a way that is easier to understand and to explain the MAE is commonly used when you want to penalize all errors equally regardless of their magnitude and MAE is less sensitive to outliers compared to MSE.
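Here is a minimal sketch of computing these regression metrics in Python; the true and predicted values are made-up numbers purely to show the formulas in action:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])     # actual values y_i
y_pred = np.array([2.5, 5.5, 6.0, 9.5])     # predicted values y_hat_i

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
mse = np.mean((y_true - y_pred) ** 2)        # mean squared error (= RSS / n)
rmse = np.sqrt(mse)                          # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error

print(rss, mse, rmse, mae)
# the same MSE and MAE values via scikit-learn
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```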
For classification models common evaluation metrics include accuracy precision recall and F1 score these metrics measure the ability of the machine learning model to correctly classify instances into the correct categories let's briefly look into these metrics individually so the accuracy is the proportion of correct predictions made by the model it's calculated by taking the correct predictions so the correct number of predictions and dividing by the total number of predictions which means correct predictions plus incorrect predictions next we will look into the precision so precision is the proportion of true positive predictions among all positive predictions made by the model and it's equal to true positives divided by true positives plus false positives so all the positive predictions true positives are cases where the model correctly predicts a positive outcome while false positives are the cases where the model incorrectly predicts a positive outcome the next metric is recall recall is the proportion of true positive predictions among all actual positive instances it's calculated as the number of true positive predictions divided by the total number of actual positive instances which means dividing the true positives by true positives plus false negatives so for example let's say we are looking into a medical test a true positive would be a case where the test correctly identifies a patient as having a disease while a false positive would be a case where the test incorrectly identifies a healthy patient as having the disease and the final score is the F1 score the F1 score is the harmonic mean of the precision and recall with a higher value indicating a better balance between precision and recall and it's calculated as two times recall times precision divided by recall plus precision.
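A minimal sketch of these classification metrics in Python with scikit-learn; the label vectors are invented toy values:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (1 = positive, e.g. has the disease)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy :", accuracy_score(y_true, y_pred))      # (tp + tn) / all predictions
print("precision:", precision_score(y_true, y_pred))     # tp / (tp + fp)
print("recall   :", recall_score(y_true, y_pred))        # tp / (tp + fn)
print("f1 score :", f1_score(y_true, y_pred))            # 2 * precision * recall / (precision + recall)
```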
For unsupervised models such as clustering models the performance is typically evaluated using metrics that measure the similarity of the data points within a cluster and the dissimilarity of the data points between different clusters we have three types of metrics that we can use homogeneity is a measure of the degree to which all of the data points within a single cluster belong to the same class a higher value indicates a more homogeneous cluster so as you can see the homogeneity h, where h is simply the short way of describing homogeneity, is equal to 1 minus the conditional entropy of the classes given the cluster assignments divided by the entropy of the classes if you're wondering what this entropy is then stay tuned as we are going to discuss entropy when we discuss clustering as well as decision trees the next metric is the silhouette score the silhouette score is a measure of the similarity of a data point to its own cluster compared to the other clusters a higher silhouette score indicates that the data point is well matched to its own cluster this is usually used for DBSCAN or K-means so here the silhouette score can be represented by this formula the silhouette score s(o) is equal to b(o) minus a(o) divided by the maximum of a(o) and b(o) where s(o) is the silhouette coefficient of the data point characterized by o, a(o) is the average distance between o and all the other data points in the cluster to which o belongs and b(o) is the minimum average distance from o to all the clusters to which o does not belong the final metric we will look into is the completeness completeness is another measure of the degree to which all of the data points that belong to a particular class are assigned to the same cluster a higher value indicates a more complete cluster.
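A minimal sketch of these clustering metrics with scikit-learn, on a small synthetic dataset where the true classes are known so that homogeneity and completeness can be computed:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

# synthetic data with 3 known classes
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("homogeneity :", homogeneity_score(y_true, labels))    # each cluster contains mostly one class
print("completeness:", completeness_score(y_true, labels))   # each class ends up mostly in one cluster
print("silhouette  :", silhouette_score(X, labels))          # cohesion vs separation, no labels needed
```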
Let's conclude this lecture by going through the step-by-step process of evaluating a machine learning model, in a very simplified version, since there are many additional considerations and techniques that may be needed depending on the specific task and the characteristics of the data. Knowing how to properly train a machine learning model is really important, since this defines the accuracy of the results and the conclusions you will make. The training process starts with preparing the data. This includes splitting the data into training and test sets or, if you are using more advanced resampling techniques that we will talk about later, splitting your data into multiple sets. The training set of your data is used to feed the model; if you also have a validation set, then this validation set is used to optimize your hyperparameters and to pick the best model, while the test set is used to evaluate the model's performance. In the later lectures of this section we will talk in detail about these different techniques, as well as what training means, what the test set means, what validation means, as
well as what the hyper parameter tuning means secondly we need to choose an algorithm or set of algorithms and train the model on the training data and save
the fitted model there are many different algorithms to choose from and the appropriate algorithm will depend on the specific task and the characteristics of the data as a third
step we need to adjust the model parameters to minimize the error on the training set by performing hyperparameter tuning for this we need to use validation data and then we can
select the best model that results in the least possible validation error rate in this step we want to look for the optimal set of parameters that are included as part of our model to end up
with a model that has the least possible error, so it performs in the best possible way. In the final two steps we need to evaluate the model. We are always interested in the test error rate, and not the training or the validation error rates, because during training we have only used the training and validation sets and never the test set, so this test error rate will give you an idea of how well the model will generalize to new, unseen data. We need to use the optimal set of parameters from the hyperparameter tuning stage and the training data to train the model again with these hyperparameters and the best model, so we can use the best fitted model to get the predictions on the test data, and this will help us to calculate our test error
rate. Once we have calculated the test error rate and we have also obtained our best model, we are ready to save the predictions. So once we are satisfied with the model performance and we have tuned the parameters, we can use the model to make predictions on new, unseen data, on the test data, and compute the performance metrics for the model using the predictions and the real values of the target variable from the test data.
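Here is a minimal sketch of this train/validation/test workflow in scikit-learn; the synthetic data, the choice of a k-nearest-neighbors classifier, and the single hyperparameter being tuned are all assumptions made just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: split into training, validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Steps 2-3: train candidate models and tune a hyperparameter on the validation set
best_k, best_val_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Steps 4-5: retrain with the best hyperparameter and evaluate on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"best k = {best_k}, validation accuracy = {best_val_acc:.3f}, test accuracy = {test_acc:.3f}")
```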
And this completes this lecture. In this lecture we have spoken about the basics of machine learning, we have discussed the difference between unsupervised and supervised learning models as well as regression and classification, we have discussed in detail the different types of performance metrics we can use to evaluate different types of machine learning models, and we have looked into a simplified version of the step-by-step process to train a machine learning model. In this lecture, lecture number two, we will discuss a very important concept which you need to know before considering and applying any statistical or machine learning model; here I'm talking about the bias of the model and the variance of the model and the tradeoff between the two, which we call the bias-variance tradeoff. Whenever you are using a statistical, econometric, or machine learning model, no matter how simple the model is,
you should always evaluate your model and check its error rate in all these cases it comes down to the tradeoff you make between the variance of the model and the bias of your model because there
is always a catch when it comes to the model choice and the performance. Let us first define what the bias and the variance of a machine learning model are. The inability of the model to capture the true relationship in the data is called bias; hence, machine learning models that are able to detect the true relationship in the data have low bias. Usually, complex models or more flexible models tend to have a lower bias than simpler models. Mathematically, the bias of the model can be expressed as the expectation of the difference between the estimate and the true value, that is, E[f hat(x)] - f(x). Let us also define the variance of the model. The variance of the model is the inconsistency level or the variability of the model's performance
when applying the model to different data sets when the same model that is trained using training data performs entirely differently than on the test data this means that there is a large
variation or variance in the model complex models or more flexible models tend to have a higher variance than simpler models in order to evaluate the performance of
the model we need to look at the amount of error that the model is making for Simplicity let's assume we have the following simple regression model which aims to use a single independent variable X to model the numeric y
dependent variable. That is, we fit our model on our training observations, where we have pairs of independent and dependent variables (x1, y1), (x2, y2), up to (xn, yn), and we obtain an estimate f hat of the true function f based on the training observations. We can then compute f hat(x1), f hat(x2), up to f hat(xn), which are the estimates y1 hat, y2 hat, ..., yn hat for our dependent variable values y1, y2, ..., yn. If these are approximately equal to the actual values, so y1 hat is approximately equal to y1, y2 hat is approximately equal to y2, and so on, then the training error rate will be small. However, if we are really interested in whether our model is predicting the dependent variable appropriately, then instead of looking at the training error rate we want to look at our test error rate. The expected test error of the model is the expected squared difference between the real test values and their predictions, where the predictions are made using the machine learning model: E[(y - y hat)^2]. We can rewrite this error as a sum of two quantities: the left part is [f(x) - f hat(x)]^2, and the second part is the variance of the error term, Var(epsilon). So the accuracy of y hat as a prediction for y depends on these two quantities, which we can call the reducible error, equal to [f(x) - f hat(x)]^2, and the irreducible error, the variance of epsilon. In general, f hat will not be a perfect estimate of f, and this inaccuracy will introduce some error. This error is reducible, since we can potentially improve the accuracy of f hat by using the most appropriate machine learning model and the best version of it to estimate f. However, even if it were possible to find a model that would estimate f perfectly, so that the estimated response took the form y hat = f(x), our prediction would still have some error in it. This happens because y is also a
function of the error term epsilon, which by definition cannot be predicted by using our feature X, so there will always be some error that is not predictable. The variability associated with the error epsilon also affects the accuracy of the predictions, and this is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by epsilon. This error contains all the features that are not included in our model, so all the unknown factors that have an influence on our dependent variable but are not included as part of our data. But we can reduce the reducible error, which is based on two values: the variance of the estimates and the bias of the model. If we simplify the mathematical expression describing the error that we got, then it is equal to the variance of our model, plus the squared bias of our model, plus the irreducible error. So even if we cannot reduce the irreducible error, we can reduce the reducible error, which is based on the two values, the variance and the squared bias. Though the mathematical derivation is out of the scope of this course, just keep in mind that the reducible error of the model can be described as the sum of the variance of the model and the squared bias of the model. So, mathematically, the expected error of a supervised machine learning model is equal to the squared bias of the model, plus the variance of the model, plus the irreducible error.
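Written out as a formula, this is the standard bias-variance decomposition of the expected test error at a point x0 (a sketch of the result the lecture describes, not the full derivation):

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\left(\operatorname{Bias}\big[\hat{f}(x_0)\big]\right)^2}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\big[\hat{f}(x_0)\big]}_{\text{variance}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}},
\qquad
\operatorname{Bias}\big[\hat{f}(x_0)\big] = E\big[\hat{f}(x_0)\big] - f(x_0).
```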
Therefore, in order to minimize the expected test error rate on the unseen data, we need to select the machine learning method that simultaneously achieves low variance and low bias, and that's exactly what we call the bias-variance tradeoff. The problem is that there is a negative correlation between the variance and the bias of the model. Another thing that is highly related to the bias and the variance of
the model is the flexibility of the machine learning model so flexibility of the machine learning model has a direct impact on its variance and on its
bias let's look at this relationships one by one so complex models or more flexible models tend to have a lower bias but at the same time complex models
or more flexible models tend to have higher variance than simpler models. So, as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias of the model; at the same time, the variance of such models increases. As the flexibility of the model decreases, the model finds it more difficult to find the true patterns in the data, which then increases the bias of the model but also decreases the variance of the model. Keep this topic in mind; we will continue it in the next lecture, when we will be
discussing the topic of overfitting and how to solve the overfitting problem by using regularization in this lecture lecture number three we will talk about very important concept called
overfitting and how we can solve overfitting by using different techniques including regularization this topic is related to the previous lecture and to the topics
of error of the model train error rate test error rate bias and a variance of the machine learning model overfitting is important to know and also how to solve it with
regularization because this topic can lead to inaccurate predictions and a lack of generalization of the model to new data knowing how to detect and prevent
overfitting is crucial in building effective machine learning models questions about this topic are almost guaranteed to appear during every single data science
interview in the previous lecture we discussed the relationship between model flexibility and the variance as well as the bias of the model we saw that as the
flexibility of the model increases model finds the true pattern in the data easier which reduces the bias of the model but at the same time the variance of such models
increases so as the flexibility of the model decreases model finds it more difficult to find a two patterns in the data which then increases the bias of the model and decreases the variance of
the model let's first formally Define what the overfitting problem is as well as what the underfitting is so overfitting occurs when the model performs well in
the training while the model performs worse on the test data so you end up having a low training error rate but a high test error rate and in the ideal world we want our test error rate to be
low or at least that the training error rate is equal to the test error rate overfitting is a common problem in machine learning where a model learns the detail and noise in training data to
the point where it negatively impacts the performance of the model on this new data so the model follows the data too closely closer than it should this means
that the noise or random fluctuations of training data is picked up and learned as concepts by the model which it should actually ignore the problem is that the noise or
random component of the training data will be very different from the noise in the new data the model will therefore be less effective in making predictions on new data overfitting is caused by having
too many features, too complex a model, or too little data. When the model is overfitting, the model also has high variance and low bias. Usually, the higher the model flexibility, the higher the risk of overfitting, because then we have a higher risk of the model following the data, and the noise, too closely. Underfitting is the other way around: it occurs when the model is too simple to capture the true patterns in the data, so even the training error rate is high. Given that overfitting is a much bigger problem, and we ideally want to fix the case when our test error rate is large, we will only focus on overfitting. This is also a topic that you can expect during your data science interviews, as well as something that you need to be aware of whenever you are training a machine learning model. All
right so now that we know what overfitting is we should now talk about how we can fix this problem there are several ways of fixing or preventing overfitting first you can reduce the
complexity of the model we saw that higher the complexity of the model higher is the chance of the following the data including the noise too closely resulting in overfitting therefore
reducing the flexibility of the model will reduce the overfitting as well. This can be done by using a simpler model with fewer parameters or by applying regularization techniques such as L1 or L2 regularization, which we will talk about in a bit. A second solution is to collect more data: the more data you have, the less likely your model will overfit. A third solution is to use resampling techniques, one of which is cross-validation. This is a technique that allows you to train and test your model on different subsets of your data, which can help you to identify whether your model is overfitting; we will discuss cross-validation as well as other resampling techniques later in this section, and there is a short sketch of it right after this list of fixes. Another solution is to apply early stopping. Early stopping is a technique where you monitor the performance of the model on a validation set during the training process and stop the training when the performance starts to decrease. Another solution is to use ensemble methods: by combining multiple models, such as decision trees, overfitting can be reduced, and we will be covering many popular ensemble techniques in this course as well. Finally, you can use what we call dropout. Dropout is a regularization technique for reducing overfitting in neural networks by dropping out, or setting to zero, some of the neurons during the training process. From time to time, dropout-related questions do appear during data science interviews for people with no experience, so if someone asks you about dropout, then at least you will remember that it's a technique used to solve overfitting in the setting of deep learning. It's worth noting that there is no single solution that works for all types of overfitting, and often a combination of the techniques we just talked about should be used to address the problem.
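As promised above, here is a minimal sketch of using cross-validation to spot overfitting with scikit-learn; the synthetic dataset and the deliberately flexible decision tree are assumptions chosen only to make the gap between training and validation scores visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

# An unconstrained tree is flexible enough to memorize the training data
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation, keeping both train and validation scores
scores = cross_validate(model, X, y, cv=5, return_train_score=True)

print("mean train accuracy:     ", np.mean(scores["train_score"]))
print("mean validation accuracy:", np.mean(scores["test_score"]))
# A large gap between the two is a typical sign of overfitting
```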
We saw that when the model is overfitting, the model has high variance and low bias. By definition, regularization, or what we also call shrinkage, is a method that shrinks some of the estimated coefficients toward zero to penalize unimportant variables for increasing the variance of the model. This is a technique used to solve the overfitting problem by introducing a little bias into the model while significantly decreasing its variance. There are three types of regularization techniques that are widely known in the industry: the first one is ridge regression or L2 regularization, the second one is lasso regression or L1 regularization, and finally the third one is dropout,
which is a regularization technique used in deep learning. We will cover the first two types in this lecture. Let's now talk about ridge regression or L2 regularization. Ridge regression, or L2 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards zero. Ridge regression introduces a little bias into the model while significantly reducing the model variance. Ridge regression is a variation of linear regression, but instead of only minimizing the sum of squared residuals, as linear regression does, it aims to minimize the sum of squared residuals with the sum of squared coefficients added on top, what we call the L2 regularization term. Let's look at a multiple linear regression example with p independent variables or predictors that are used to model the dependent variable y. If you have followed the statistical section of this course, you might also recall that the most popular estimation technique for estimating the parameters of the linear regression, assuming its assumptions are satisfied, is ordinary least squares, or OLS, which finds the optimal coefficients by minimizing the sum of squared residuals, or the RSS. So ridge regression is pretty similar to
the OLS, except that the coefficients are estimated by minimizing a slightly different cost or loss function. This is the loss function of ridge regression, where beta_j is the coefficient of the model for variable j, beta_0 is the intercept, x_ij is the input value for variable j and observation i, y_i is the target variable or the dependent variable for observation i, n is the number of examples, and lambda is what we call the regularization parameter of ridge regression. So this is the loss function of OLS that you can see here, with a penalization term added; it combines what we call the RSS with the penalty.
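Reconstructed from these definitions, the ridge loss function being described looks like this:

```latex
L_{\text{ridge}}(\beta)
  = \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}}_{\text{RSS}}
  + \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{L2 penalty}},
\qquad \lambda \ge 0 .
```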
If you check out the very first lecture in this section, where we spoke about the different metrics that can be used to evaluate regression-type models, you can see the RSS and its definition. If you compare that expression, you can easily see that the left term is the formula for the RSS, including an intercept, and the right term is what we call the penalty amount, which basically represents lambda times the sum of the squares of the coefficients included in our model. Here lambda, which is always non-negative, so always larger than or equal to zero, is the tuning parameter or the penalty parameter. This expression of the sum of squared coefficients is called the L2 norm, which is why we call this L2-penalty-based regression or L2 regularization. In this way, ridge regression assigns a penalty to the variables by shrinking their coefficients towards zero, which reduces the overall model variance,
but these coefficients will never become exactly zero, so the model parameters are never set to exactly zero, which means that all p predictors of the model are still intact. This is a key property of ridge regression to keep in mind: it shrinks the parameters towards zero but never sets them exactly equal to zero. The L2 norm is a mathematical term coming from linear algebra, and it stands for the Euclidean norm. We spoke about the penalty parameter lambda, what we also call the tuning parameter, which serves to control the relative impact of the penalty on the regression coefficient estimates. When lambda is equal to zero, the penalty term has no effect and ridge regression will reproduce the ordinary least squares estimates, but as lambda increases, the impact of the shrinkage penalty grows and the ridge regression coefficient estimates approach zero. What is important to keep in mind, which you can also see from this graph, is that in ridge regression a large lambda will assign a penalty to some variables by shrinking their coefficients towards zero, but they will never become exactly zero, which becomes a problem when you are dealing with a model that has a large number of features: your model then has low interpretability. Ridge regression's advantage over ordinary least squares comes from the earlier-introduced bias-variance tradeoff phenomenon: as lambda, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. The main advantages of ridge regression are: it solves overfitting; it can shrink the regression coefficients of less important predictors towards zero; it can improve the prediction accuracy as well, by reducing the variance while increasing the bias of the model; it is less sensitive to outliers in the data compared to linear regression; and it is computationally less expensive compared to lasso regression. The main disadvantage of ridge regression is low model interpretability when p, the number of features in your model, is large.
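To see ridge regression in practice, here is a minimal scikit-learn sketch; the synthetic regression data and the particular alpha values (scikit-learn's name for the lambda penalty parameter) are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data with only a few informative features
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=1)

# Plain OLS for comparison
ols = LinearRegression().fit(X, y)
print("OLS coefficients:  ", ols.coef_.round(2))

# Ridge with increasing penalty: coefficients shrink towards zero but never reach exactly zero
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"Ridge (alpha={alpha}):", ridge.coef_.round(2))
```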
Let's now look into another regularization technique, called lasso regression or L1 regularization. By definition, lasso regression, or L1 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients toward zero and setting some to exactly zero. Lasso regression, like ridge regression, introduces a little bias into the model while significantly reducing the model variance. There is, however, a small difference between the two regression techniques that makes a huge difference in their results. We saw that one of the biggest disadvantages of ridge regression is that it will always include all the predictors, all p predictors, in the final model, whereas lasso overcomes this disadvantage: in ridge regression, a large lambda or penalty parameter will assign a penalty to some variables by shrinking their coefficients toward zero, but they will never become exactly zero, which becomes a problem when your model has a large number of features and low interpretability, and lasso regression overcomes this disadvantage of ridge regression. Let's have a look at the loss function of L1
regularization. This is the loss function of OLS, which is the left part of the formula, called the RSS, combined with a penalty amount, which is the right-hand side of the expression: lambda times the sum of the absolute values of the coefficients beta_j. As you can see, this is the RSS that we just saw, which is exactly the same as in the loss function of OLS, and then we are adding the second term, which is basically lambda, the penalization parameter, multiplied by the sum of the absolute values of the coefficients beta_j, where j goes from one to p, and p is the number of predictors included in our model. Here, once again, lambda, which is always non-negative, larger than or equal to zero, is the tuning parameter or the penalty parameter. This expression of the sum of the absolute values of the coefficients is called the L1 norm, which is why we call this L1-penalty-based regression or L1 regularization.
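Written out from these same definitions, the lasso loss function being described is:

```latex
L_{\text{lasso}}(\beta)
  = \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}}_{\text{RSS}}
  + \underbrace{\lambda \sum_{j=1}^{p}\lvert\beta_j\rvert}_{\text{L1 penalty}},
\qquad \lambda \ge 0 .
```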
In this way, lasso regression assigns a penalty to some of the variables by shrinking their coefficients towards zero and setting some of these parameters to exactly zero. This means that some of the coefficients will end up being exactly equal to zero, which is a key difference between lasso regression and ridge regression. The L1 norm is a mathematical term coming from linear algebra, and it stands for the Manhattan norm or distance. You might see here a key difference when comparing the visual representation of lasso regression to the visual representation of ridge regression: if you look at this point, you can see that there will be cases where our coefficients will be set to exactly zero, this is where we have this intersection, whereas in the case of ridge regression, you can recall that there was not a single such intersection, so there were points where the circle was close to the intersection points, but there was not a single point where there was an intersection and the coefficients were set to zero. And that's the key difference between these two regression-type models, between these two regularization techniques. The main advantages of lasso regression are: it solves overfitting; lasso regression can shrink the regression coefficients of less important predictors toward zero and set some to exactly zero; and, as the model filters some variables out, lasso indirectly also performs what we call feature selection, such that the resulting model is highly interpretable, with fewer features, and
much more interpretable compared to ridge regression. Lasso can also improve the prediction accuracy of the model by reducing the variance while increasing the bias of the model, though not as much as ridge regression.
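Here is a minimal sketch contrasting lasso with ridge on the same kind of synthetic data as before; the dataset and the alpha value are assumptions, and the point to notice is that some lasso coefficients come out exactly zero while the ridge ones only get small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 10 features actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(2))   # small, but non-zero
print("Lasso coefficients:", lasso.coef_.round(2))   # several exactly 0.0 -> feature selection
print("Features kept by lasso:", (lasso.coef_ != 0).sum())
```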
Earlier, when speaking about correlation, we also briefly discussed the concept of causation. We discussed that correlation is not causation, and we also briefly spoke about the method used to determine whether there is causation or not: that model is the famous linear regression. Even if this model is recognized as a simple approach, it's one of the few methods that allows identifying features that have an impact, a statistically significant impact, on a variable that we are interested in and want to explain, and it also helps you identify how, and by how much, the target variable changes when changing the independent variable values. To understand the concept of
linear regression you should also know and understand the concepts of dependent variable independent variable linearity and statistical significant effect dependent variables are often referred
to as response variables or explained variables. By definition, the dependent variable is the variable that is being measured or tested; it's called the dependent variable because it's thought to depend on the independent variables. So you can have one or multiple independent variables, but you can have only one dependent variable that you are interested in; that is your target variable. Let's now look into the independent variable definition. Independent variables are often referred to as regressors or explanatory variables, and by definition the independent variable is the variable that is being manipulated or controlled in the experiment and is believed to have an effect on the dependent variable. Put differently, the value of the dependent variable is said to depend on the value of the independent variable. For example, in an experiment to test the effect of having a degree on the wage, the degree variable would be your independent
variable and the wage would be your dependent variable finally let's look into the very important concept of statistical significance we call the effect statistically significant if it's
unlikely to have occurred by random chance in other words a statistically significant effect is one that is likely to be real and not due to a random
chance. Let's now define the linear regression model formally, and then we will dive deep into the theoretical and practical details. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is assumed to be linear. When the linear regression model is based on a single independent variable, we call this model simple linear regression; when the model is based on multiple independent variables, we call it multiple linear regression. Let's look at the mathematical expression describing
linear regression. You can recall that when the linear regression model is based on a single independent variable, we just call it simple linear regression. The expression that you see here is the most common mathematical expression describing simple linear regression: y_i = beta_0 + beta_1 * x_i + u_i. In this expression, y_i is the dependent variable, and the i that you see here is the index corresponding to the i-th row. So whenever you are getting
data and you want to analyze this data you will have multiple rows and if your multiple rows describe the observation that you have in your data so it can be
people, or it can be other observations describing your data; the i then characterizes the specific row, each row that you have in your data, and y_i is then the dependent variable's value corresponding to that row. The same holds for x_i: x_i is then the independent variable, or the explanatory variable, or
the regressor that you have in your model which is the variable that we are testing so we want to manipulate it to see whether this variable has a statistically significant impact on the
dependent variable y so we want to see whether unit change in the X will result in a specific change in the Y and what kind of change is
that. So beta_0 that you see here is not a variable; it's called the intercept or constant, something that is unknown, so we don't have it in our data, and it is one of the parameters of linear regression. It's an unknown number which the linear regression model should estimate. We want to use the linear regression model to find out this unknown value, as well as the second unknown value, which is beta_1, and we can also estimate the error terms, which are represented by u_i. So beta_1, next to x_i, next to the independent variable, is also not a variable; like beta_0, it is an unknown parameter of the linear regression model, an unknown number which the linear regression model should estimate. Beta_1 is often referred to as the slope coefficient of variable X, which is the number that quantifies how much the dependent variable y will change if the
independent variable X will change by one unit so that's exactly what we are most interested in the beta one because this is the coefficient and this is the unknown number that will help us to
understand and answer the question whether our independent variable X has a statistically significant impact on our dependent variable y finally the U that
you see here, or the u_i in the expression, is the error term, or the amount of mistake that the model makes when explaining the target variable. We add this value since we know that we can never exactly and accurately estimate the target variable; we will always make some amount of estimation error and we can never estimate the exact value of y, so we need to account for this mistake that we are going to make, and that we know in advance we are going to make, by adding an error term to our model. Let's also have a brief look at how
multiple linear regression is usually expressed in mathematical terms so you might recall that difference between the simple linear regression and multiple linear regression is that the first one
has a single independent variable in it, whereas the latter, the multiple linear regression, like the name suggests, has multiple independent variables in it, so more than one. Knowing this type of expression is critical, since they not only appear a lot in interviews, but you will also see them in data science blogs, in presentations, in books, and in papers. So being able to quickly identify them and say, ah, I remember seeing this once, will help you to more easily understand and follow the process and the story line. What you see here you can read as y_i = beta_0 + beta_1 * x_1i + beta_2 * x_2i + beta_3 * x_3i + u_i. This is the most common mathematical expression describing multiple linear regression, in this case with three independent variables; if you were to have more independent variables, you would add them with their corresponding indices and coefficients. In this case, the
method will aim to estimate the model parameters which are beta 0o beta 1 beta 2 and beta 3 so like before Yi is our a dependent variable which is always a
single one so we only have one dependent variable then we have beta 0 which is our intercept or the constant then we have our first slope coefficient which is beta 1 corresponding to our first
independent variable X1. Then we have x_1i, which stands for the first independent variable, with index one, and the i stands for the index corresponding to the row. So whenever we have multiple linear regression, we always need to specify two indices, and not only one like we had in our simple linear regression: the index that characterizes which independent variable we are referring to, so whether it's independent variable one, two, or three, and then the index that specifies which row we are referring to, which is the index i. You might notice that in this case all the row indices are the same, because we are looking into one specific row and we are representing this row by using the independent variables, the error term, and the dependent variable. Then we are adding our next term, which is beta_2 * x_2i, so beta_2 is our third unknown parameter in the model and the second slope coefficient, corresponding to our second independent variable, and then we have our third independent variable with the corresponding slope coefficient beta_3,
and, as always, we also add an error term to account for the error that we know we are going to make. Now that we know what linear regression is and how to express it in mathematical terms, you might be asking the next logical question: how do we find those unknown parameters in the model in order to find out how the independent variables impact the dependent variable? Finding these unknown parameters is called estimation in data science and in general. So we are interested in finding the values that best
approximate the unknown values in our model, and we call this process estimation. One technique used to estimate linear regression parameters is called OLS, or ordinary least squares. The main idea behind this approach, the OLS, is to find the best-fitting straight line, the regression line, through a set of paired X and Y values, so our independent and dependent variable values, by minimizing the sum of squared errors: that is, to minimize the sum of squares of the differences between the observed values of the dependent variable and the values predicted by our model, this linear function of the independent variables, which are the residuals. This is a lot of information, so let's go through it step by step.
In linear regression, when we were expressing our simple linear regression, we had this error term, and we can never know the actual error term, but what we can do is estimate the value of the error term, which we call the residual. We want to minimize the sum of squared residuals: because we don't know the errors, we want to find a line that will best fit our data in such a way that the error we are making, the sum of squared errors, is as small as possible, and since we don't know the errors, we can estimate them by each time looking at the value predicted by our model and the true value, subtracting them from each other, and seeing how well our model is estimating the values that we have, so how well our model is estimating the unknown parameters. So we want to minimize the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variables, that is, minimize the sum of squared residuals. We define the estimate of a parameter or variable by adding a hat on top of the variable or parameter. In this case, you can see that y_i hat is equal to beta_0 hat plus beta_1 hat times x_i; you can see that we no longer have an error term here, and we say that y_i hat is the
estimated value of y_i, beta_0 hat is the estimated value of beta_0, and beta_1 hat is the estimated value of our beta_1, while x_i is still our data, so the values that we have in our data, and therefore it doesn't have a hat, since it does not need to be estimated. What we want to do is estimate our dependent variable, and we want to compare the estimated value we got using OLS with the actual, real value, such that we can calculate our errors, or rather the estimate of the error, which is represented by u_i hat. So u_i hat is equal to y_i minus y_i hat, where u_i hat is simply the estimate of the error term, or the residual. This predicted error is always referred to as the residual, so make sure that you do not confuse the error with the residual: the error can never be observed, you can never calculate it and you will never know it, but what you can do is predict the error, and when you predict the error, you get a residual. What OLS is trying to do is minimize the amount of error it's making; therefore it looks at the sum of squared residuals across all the observations and tries to find the line that will minimize its value. Therefore we say that OLS tries to find the best-fitting straight line such that it minimizes the sum of squared residuals.
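In symbols, using the hat notation just introduced, the residual and the OLS objective for simple linear regression are:

```latex
\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i,
\qquad
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^{2}.
```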
So far we have discussed this model mainly from the perspective of causal analysis, in order to identify features that have a statistically significant impact on the response variable, but linear regression can also be used as a prediction model for modeling linear relationships. So let's refresh our memory with the definition of the linear regression model. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is linear. In lecture number six from the statistical section we also discussed
how mathematically we can express what we call Simple linear regression and a multiple linear regression so this how the uh simple linear regression can be represented so uh in case of simple
linear regression you might recall that we are dealing with just a single independent variable and we always have just one dependent variable both in the single linear regression and in the
multiple linear regression so here you can see that Yi is equal to beta 0 + beta 1 * XI + UI where Y is the dependent variable and I is basically
the index of each observation or the row and then the beta 0 is The Intercept which is also known as constant and then the beta 1 is a slope coefficient or a parameter corresponding to the
independent variable X, which is an unknown constant that we want to estimate along with beta_0, and then x_i is the independent variable corresponding to observation i, and then finally
the UI is the error term corresponding to the observation I do keep in mind that this error term we are adding because we do know that we always are going to make a mistake and we can never perfectly estimate the dependent
variable therefore to account for this mistake we are adding this UI so let's also recall the estimation technique that we use to estimate the parameters of the linear regression
model, so beta_0 and beta_1, and to predict the response variable. We call this estimation technique OLS, or ordinary least squares. OLS is an estimation technique for estimating the unknown parameters in a linear regression model to predict the response or the dependent variable. So we need to estimate beta_0, to get beta_0 hat, and we need to estimate beta_1, to get beta_1 hat, in order to obtain y_i hat, where y_i hat is equal to beta_0 hat plus beta_1 hat times x_i, and where the
difference between y_i hat and y_i, so between the predicted value and the true value of the dependent variable, will then produce our estimate of the error, or what we also call the residual. The main idea behind this approach is to find the best-fitting straight line, the regression line, through the set of paired X and Y values by minimizing the sum of squared residuals. We want to minimize our errors as much as possible; therefore we take their squared version, sum them up, and minimize this entire error. So, to minimize the sum of squared residuals, the differences between the observed dependent variable and the values predicted by the linear function of the independent variables, we need to
use the OLS one of the most common questions related to linear regression that comes time and time again in the uh data science related interviews is a topic of
the assumptions of the linear regression model so you need to know each of these five fundamental assumptions of the linear regression and the OLS and also you need to know how to test whether
each of these assumptions are satisfied so the first assumption is the linearity Assumption which states that the relationship between the independent variables and the dependent variable is
linear we also say that the model is linear in parameters you can also check whether the linearity assumption is Satisfied by plotting the residual to the fitted values if the pattern is not
linear, then the estimates will be biased; in this case we say that the linearity assumption is violated, and we need to use more flexible models, such as tree-based models that we will discuss in a
bit that are able to model these nonlinear relationships the second assumption in the linear regression is the Assumption about randomness of the sample which means that the data is randomly sampled
and which basically means that the errors or the residuales of the different observations in the data are independent of each other you can also check whether the uh second assumption
so this assumption about random sample is Satisfied by plotting the residuals you can then check whether the mean of this residuales is around zero and if not then the OLS estimate will be biased
and the second assumption is violated; this means that you are systematically over- or under-predicting the dependent variable. The third assumption is the exogeneity assumption, which is a really important assumption often asked about during data science interviews. Exogeneity means that each independent variable is uncorrelated with the error terms. Exogeneity refers to the assumption that the independent variables are not affected by the error term in the model; in other words, the independent variables are assumed to be determined independently of the errors in the model. Exogeneity is a key assumption of the linear regression model, as it allows us to interpret the estimated coefficients as representing the true causal effect of the independent variables on the dependent variable. If the independent variables are not exogenous, then the estimated coefficients may be biased and the interpretation of the results may be invalid. In this case we call this problem an endogeneity problem, and we say that the independent variable is not exogenous but endogenous. It's important to carefully consider the exogeneity assumption when building a linear regression model, as violating this assumption can lead to invalid or misleading results. If this assumption is satisfied for an independent variable in the linear model, we call this independent variable exogenous; otherwise we call it endogenous and we say that we have an endogeneity problem. Endogeneity refers to the
situation in which the independent variables in the linear regression model are correlated with the error terms in the model in other words the errors are not independent of the independent
variables endogeneity is a violation of one of the key assumptions of the linear regression model which is that the independent variables are exogeneous or not affected by the errors in the model
endogenity can arise in a number of ways for example it can be caused by omitted variable bias in which an important predictor of the dependent variable is not included in the model it can also be
caused by reverse causality, in which the dependent variable affects the independent variable. Those two are very popular examples of cases where we can get an endogeneity problem, and those are things that you should know whenever you are interviewing for data science roles, especially when it's related to machine learning, because those questions are asked in order to test whether you understand the concept of exogeneity versus endogeneity, in which cases you can get endogeneity, and how you can solve it. In the case of omitted variable bias, let's say you are
estimating a person's salary, and you are using as independent variables their education, their number of years of experience, and some other factors, but you are not including in your model a feature that would describe the intelligence of the person, for instance their IQ. Given that these are very important indicators of how a person performs in their field, and can definitely have an indirect impact on their salary, not including these variables will result in omitted variable bias, because their effect will then be incorporated in your error term, and this can also relate to the other independent variables, since IQ is also related to the education that you have: the higher your IQ, usually the higher your education. In this way you will have an error term that includes an important variable, the omitted variable, which is then correlated with one or multiple of the independent variables included in your model. The other example, the other cause of the endogeneity problem, is reverse causality. What reverse causality means is basically that not only does the independent variable have an impact on the dependent variable, but the dependent variable also has an impact on the independent variable, so there is a reverse relationship, which is something we want to avoid. We want the features included in our model to have only an impact on the dependent variable, so they explain the dependent variable and not the other way around, because if you have it the other way, so the dependent variable impacts your independent variable, then you will have the error term being related to this independent variable, because there are components that also define your dependent variable. So knowing a few examples such as these that can cause endogeneity, that can violate the exogeneity assumption, is really
important. You can also check the exogeneity assumption by conducting a formal statistical test; this is called the Hausman test. This is an econometric test that helps to understand whether you have an exogeneity violation or not, but it is out of the scope of this course. I will, however, include many resources related to exogeneity, endogeneity, omitted variable bias, and reverse causality, and also how the Hausman test can be conducted; for that, check out the interview preparation guide, where you can also find the corresponding free resources. The fourth assumption in linear regression is the assumption of homoscedasticity. Homoscedasticity refers to the assumption that the variance of the errors is constant across all predicted values. This assumption is also known as the homogeneity of variance. Homoscedasticity is an important assumption of the linear regression model, as it allows us to use certain statistical techniques and make inferences about the parameters of the model.
If the errors are not homoscedastic, then the results of these techniques may be invalid or misleading. If this assumption is violated, then we say that we have heteroskedasticity. Heteroskedasticity refers to the situation in which the variance of the error terms in a linear regression model is not constant across all the predicted values, so we have a varying variance; in other words, the assumption of homoscedasticity in that case is violated and we say we have a problem of heteroskedasticity. Heteroskedasticity can be a real problem in linear regression, because it can lead to invalid or misleading results; for example, the standard errors of the estimates and the confidence intervals for the parameters may be incorrect, which means that the statistical tests may also have incorrect type one error rates. You might recall, when we were discussing linear regression as part of the fundamentals of statistics section of this course, that we looked into the output that comes from Python and saw that we get estimates as part of the output, as well as standard errors, then the t-test, so the Student's t-test, and then the corresponding p-values and the 95% confidence intervals. Whenever there is a heteroskedasticity problem, the coefficients might still be accurate, but the corresponding standard errors, the Student's t-test, which is based on the standard error, the p-values, as well as the confidence intervals may not be accurate. So you might get good and reasonable coefficients, but then you don't know how to correctly evaluate them: you might end up stating that certain independent variables are statistically significant, because their coefficients are statistically significant since their p-values are small, but in reality those p-values are misleading because they are based on the wrong statistical test and on the wrong
standard errors. You can check this assumption by plotting the residuals against the fitted values and seeing whether there is a funnel-like shape. If the spread of the residuals is roughly constant and there is no funnel-like shape, then you have constant variance; but if you do see this funnel-like shape, it indicates that your variances are not constant, and then we say we have a problem of heteroskedasticity.
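Here is a minimal sketch of such a residual plot with scikit-learn and matplotlib; the synthetic data with deliberately increasing noise is an assumption used just to produce a visible funnel shape.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data where the noise grows with x, so the errors are heteroskedastic
x = rng.uniform(0, 10, 300)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # noise standard deviation increases with x

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Residuals vs fitted values: a funnel shape suggests heteroskedasticity
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot (funnel shape = non-constant variance)")
plt.show()
```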
If you have heteroskedasticity, you can no longer use the OLS and the linear regression, and instead you need to look for other, more advanced econometric regression techniques that do not make such a strong assumption regarding the variance of your residuals. You can, for instance, use GLS, FGLS, or GMM, and these types of solutions will help to solve the heteroskedasticity problem, as they do not make strong assumptions regarding the variance in your
model the fifth and the final assumption in linear regression is the Assumption about no perfect multicolinearity this assumption states that there are no exact linear relationships between the
independent variables multicolinearity refers to the case when two or more independent variables in your linear regression model are high correlated with each other this can be a problem
because it can lead to unstable and unreliable estimate of the parameters in the model perfect multicolinearity happens when the independent variables are perfectly correlated with each other
meaning that one variable can be perfectly predicted from the other ones and this can cause the estimated coefficient in your linear regression model to be infinite or undefined and
can lead your errors to be uh entirely misleading when making a prediction using this model if perfect multicolinearity is detected it may be necessary to remove one if not more
problematic variables such that you will avoid having correlated variables in your model and even if the perfect multicolinearity is not present multicolinearity at a high level can
still be a problem if the correlations between the independent variables are high; in this case the estimates of the parameters may be imprecise, the model may be entirely misleading, and it will result in less reliable predictions. To test the multicollinearity assumption you have different options. The first way you can do that is by using a formal statistical and econometric multicollinearity test, which will help you to identify which variables cause a problem and whether you have perfect multicollinearity in your linear regression model. You can also plot a heat map based on the correlation matrix of your features; you will then have the correlations per pair of independent variables plotted as part of your heat map, and you can identify all the pairs of features that are highly correlated with each other. Those are problematic features, one of which should be removed from your model, and in this way, by showing the heat map, you can also show your stakeholders why you have removed certain variables from your model, whereas explaining a formal econometric test is much more complex, because it involves more advanced econometrics and linear regression theory.
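Here is a minimal sketch of such a correlation heat map using pandas and matplotlib; the small hypothetical DataFrame with deliberately correlated columns is made up purely for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical features: x2 is almost a copy of x1, so that pair is highly correlated
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.1, size=n),  # nearly collinear with x1
    "x3": rng.normal(size=n),                          # independent feature
})

corr = df.corr()
print(corr.round(2))

# Heat map of the correlation matrix
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heat map of the features")
plt.show()
```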
So if you're wondering how you can perform this formal test, and you want to prepare for questions related to perfect multicollinearity as well as how to solve the perfect multicollinearity problem in your linear regression model, then head towards the interview preparation guide included in this part of the course, in order to answer such questions and also to see the 30 most popular interview questions you can expect from this section. Now let's look into an example of linear regression in order to see how
all those pieces of the puzzle come together. Let's say we have collected data on class size and test scores for a sample of students, and we want to model the linear relationship between the class size and the test score using a linear regression model. As we have one independent variable, we are dealing with a simple linear regression, and the model equation would be as follows: test score = beta_0 + beta_1 * class size + epsilon. Here the class size is the single independent variable that we have in our model, the test score is the dependent variable, beta_0 is the intercept or the constant, and beta_1 is the coefficient of interest, as it's the coefficient corresponding to our independent variable, and it will help us to understand the impact of a unit change in the class size on the test score. Finally, we include in our model the error term, to account for the mistakes that we are definitely going to make when estimating the dependent variable, the test score. The goal is to estimate the coefficients beta_0 and beta_1 from the data and use the estimated model to predict the test score based on a class size. Once we have the estimates, we can interpret them as follows. The y-intercept, beta_0, represents the expected test score when the class size is zero; it represents the base score that a student would have obtained if the class size had been zero. The coefficient for the class size, beta_1, represents the change in the test score associated with a one-unit change in the class size: a positive coefficient would imply that a one-unit change in the class size increases the test score, whereas a negative coefficient would imply that a one-unit change in the class size decreases the test score, correspondingly. We can then use this model, with the OLS estimates, to predict the test score for any given class size. So let's go ahead and implement
that in Python if you're wondering how this can be done then head towards the resources section as well as the part of the Python for data science where you can learn more about how to work with
pandas DataFrames, how to import the data, as well as how to fit a linear regression model. The problem is as follows: we have collected data on the class size, and we have this independent variable, so, as you can see here, we have the students_data and then we have the class size, which is our feature, and we want to estimate the y, which is the test score. Here is a sample of code that will fit a linear regression model. We are keeping everything very simple: we are not splitting our data into training and test sets and then fitting the model on the training data and making predictions on the test set; we just want to see how we can interpret the coefficients, so we keep everything very simple.
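A minimal sketch of what such code could look like is below; the file name students_data.csv and the column names class_size and test_score are assumptions, since the actual notebook is not reproduced here, and the printed numbers will depend on the data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the hypothetical students data (assumed columns: class_size, test_score)
students_data = pd.read_csv("students_data.csv")

X = students_data[["class_size"]]   # single feature, kept 2-dimensional for scikit-learn
y = students_data["test_score"]

# Fit the simple linear regression on all the data (no train/test split, interpretation only)
model = LinearRegression().fit(X, y)

print("Intercept (beta_0):  ", model.intercept_)
print("Coefficient (beta_1):", model.coef_[0])
```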
You can see here that we are getting an intercept equal to 63.7, and the coefficient corresponding to our single independent variable, class size, is equal to minus 0.14. What this means is that each increase of the class size by one unit will result in a decrease of the test score by 0.14, so there is a negative relationship between the two. Now the next question is whether there is statistical significance, whether
the coefficient is actually significant and whether the class size actually has a statistically significant impact on the dependent variable. But all those are things that we have discussed as part of the fundamentals of statistics section of this course, and we are also going to look into a linear regression example when we discuss hypothesis testing. So I would highly suggest you stop here to revisit the fundamentals of statistics section of this course to refresh your memory in terms of linear regression, and then also check the hypothesis testing section of the course in order to look into a specific example of linear regression where we discuss the standard errors, how you can evaluate your OLS estimation results, how you can use the Student's t-test, the p-value, and the confidence intervals, and how you can estimate them. In this way you will learn for now only the theory related to the coefficients, and then you can add on top of this theory once you have learned the other sections and the
other topics in this course let's finally discuss the advantages and the disadvantages of the linear regression model so some of the advantages of the linear regression model are the following the linear regression is
relatively simple and easy to understand and to implement linear regression models are well suited for understanding the relationship between a single independent variable and a dependent
variable. Also, linear regression can handle multiple independent variables and can estimate the unique relationship between each independent variable and the dependent variable. A linear regression model can also be extended to handle more complex forms, such as polynomials and interaction terms, allowing for more flexibility in modeling the data. Also, a linear regression model can be easily regularized to prevent overfitting, which is a common problem in modeling, as we saw in the beginning of this section: you can use, for instance, Ridge regression, which is an extension of linear regression, or Lasso regression, which is also an extension of the linear regression model. And finally, linear regression models are widely supported by software packages and libraries, making them easy to implement and to analyze. Some of the disadvantages of linear regression are the following. Linear regression models make a lot of strong assumptions, for instance linearity between the independent variables and the dependent variable, while the true relationship can actually be nonlinear; the model will then not be able to capture the complexity of the data, the nonlinearity, and the predictions will be inaccurate. Therefore it's really important to have data with a linear relationship for linear regression to work. Linear regression also assumes that the error terms are normally distributed, homoscedastic, and independent across observations; violations of these strong assumptions will lead to biased and inefficient estimates. Linear regression is also sensitive to outliers, which can have a disproportionate effect on the estimates of the regression coefficients. Linear regression does not easily handle categorical independent variables, which often require additional data preparation or the use of indicator variables or encodings. Finally, linear regression also assumes that the independent variables are exogenous and not affected by the error terms; if this assumption is
violated then the result of the model may be misleading imagine you have a friend Alex who collects stamps every month Alex buys a certain number of stamps and
you notice that the amount Alex spends seems to depend on the number of stamps bought now you want to create a little tool that can predict how much Alex will
spend next month based on the number of stamps bought this is where linear regression comes into play in technical terms we're trying to predict the
dependent variable amount spent based on the independent variable number of stamps bought below is some simple
Python code using scikit-learn to perform linear regression on a created data set. The linear regression analysis was carried out through a structured process
using Python's numpy and matplotlib libraries as well as scikit-learn's LinearRegression class. Initially the libraries were imported to facilitate numerical computations and data visualization; this foundational step ensures that all necessary functions and methods are available for executing the analysis. Subsequently the data was organized, with stamps bought serving as the independent variable and amount spent as the dependent variable. The stamps array was reshaped into a two-dimensional array using reshape; this modification was necessary because the scikit-learn library
requires input features in a specific format once the data was appropriately formatted a linear regression model was instantiated from the linear regression
class and then trained with the fit method by passing the reshaped stamps bought and amount spent arrays to this method the model learned the relationship between the number of
stamps bought and the corresponding amount spent. The trained model was then used to predict the expenditure for a hypothetical future scenario where 10
stamps are bought this was accomplished using the predict method which was called with an input array representing 10 stamps the model used its learned
parameters to estimate the outcome based on this input for visualization the original data points and the regression line were
plotted using matplotlib: the scatter function was used to plot the data points in blue, illustrating the actual amount spent for different quantities of
stamps bought the regression line plotted in red using the plot function demonstrated the predicted relationship as learned by the model finally the
prediction for 10 stamps was displayed using a print statement this demonstrated how the model's predictions can be interpreted and used providing a
specific numerical estimate for the amount likely to be spent on stamps in the given scenario this complete process from data preparation through training
to prediction showcases how linear regression can be applied to derive insights from Real World data effectively let's go through some of the
concepts and variables we are using in this chapter. Sample data: the term stamps bought refers to the number of stamps Alex bought each month, and amount spent represents the corresponding money spent. Creating and training the model: we use LinearRegression from scikit-learn to create and train our model using fit. Predictions: the trained model is then used to predict the amount Alex will spend for a given number of stamps; in the code we predict the amount for 10 stamps. Plotting: we plot the original data points in blue and the predicted line in red to visually understand our model's prediction capability. Displaying the prediction: finally, we print out the predicted spending for a specific number of stamps, 10 in this case.
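Below is a minimal sketch of the stamps code as described; the stamps_bought and amount_spent values are made up, so the exact prediction will differ from the course's notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Number of stamps bought each month (reshaped to 2D, as scikit-learn expects)
stamps_bought = np.array([1, 3, 5, 7, 9, 11, 13, 15]).reshape(-1, 1)
# Corresponding amount spent each month
amount_spent = np.array([2.5, 7.0, 12.0, 17.5, 22.0, 27.5, 32.0, 37.5])

model = LinearRegression()
model.fit(stamps_bought, amount_spent)

# Predict the spend for a month in which Alex buys 10 stamps
prediction = model.predict(np.array([[10]]))

# Blue data points, red regression line
plt.scatter(stamps_bought, amount_spent, color="blue", label="Actual spend")
plt.plot(stamps_bought, model.predict(stamps_bought), color="red", label="Regression line")
plt.xlabel("Stamps bought")
plt.ylabel("Amount spent")
plt.legend()
plt.show()

print(f"Predicted spend for 10 stamps: {prediction[0]:.2f}")
```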
This graph illustrates the outcome of a simple linear regression analysis, which seeks to capture the relationship between two variables: the number of stamps purchased and the total expenditure incurred. The red line depicted in the graph is known as the regression line, representing the best fit through the plotted green data points. The slope of this regression line is particularly telling: it quantifies the increase in total cost associated
with each additional stamp purchased such insights are invaluable for budgeting purposes or for forecasting future expenses based on past purchasing Behavior each of the green data points
on the graph corresponds to an actual purchase event where both the quantity of stamps bought and the precise amount spent are known the close clustering of
these points around the regression line strongly supports the presence of a linear relationship between the number of stamps purchased and the total
expenditure this alignment suggests that the model has effectively captured the underlying Trend in the data offering a
reliable basis for predictions employing simple linear regression in scenarios like this provides a clear and quantifiable understanding of how one
variable affects another. In the realm of data analysis, this method serves as a powerful tool, enabling analysts to draw
significant conclusions and make informed decisions based on the observed relationships between variables the Simplicity yet robustness of linear
regression make it an indispensable technique in the toolkit of anyone seeking to extrapolate future behavior from historical data now let's go
through some other examples where linear regression is used one real estate pricing linear regression is widely used in real estate to predict house prices
based on various features such as square footage number of bedrooms number of bathrooms age of the house and location
for instance a regression model can help determine how much an additional bathroom adds to The house's value for example a real estate company uses
linear regression to understand how the proximity to City centers affects the market price of properties they find that for each mile closer to the city
center house prices increase by an average of $10,000 two credit scoring financial institutions often employ linear regression to predict the credit
worthiness of individuals based on historical financial data including income levels existing debts and past repayment histories for example a bank may use
linear regression to determine how an applicant's credit score changes with variations in their debt to income ratio this helps in deciding whether to approve a loan
application three supply chain costs linear regression can analyze and predict costs associated with different components of the supply chain such as
Transportation labor and materials based on factors like distance fuel prices and labor rates for example a manufacturing company uses linear regression to
predict Logistics costs as a function of fuel price fluctuations and shipping distance to better manage their budget
and set product prices four Health Care in healthcare linear regression could be used to predict patient outcomes based on treatment methods dosage levels and
patient demographics for example a medical research team applies linear regression to study the relationship between dosage of a new
drug and patient recovery rate the findings indicate that increasing the drug dose by one unit enhances recovery rates by
5% five academic performance educational institutions might use linear regression to predict student Performance Based on study habits attendance rates and
previous grades grades for example a university conducts a study using linear regression to understand how the number of hours spent
studying per week impacts students GPA the analysis reveals that every additional hour of study per week correlates with an increase of 0.05 in
GPA. Six, energy consumption: energy companies can use linear regression to forecast consumption levels based on factors like temperature, time of the year, and economic activity. For example, an energy utility uses linear regression to model how electricity usage increases with rising temperatures during summer months; the model helps the company prepare for peak demand. We will now demonstrate how to use logistic regression to predict Jenny's book preferences. We have a data set where
each entry records the number of pages in the books Jenny read and whether she liked them step one import libraries we start by importing numpy to manage our
data, matplotlib for visualization, and scikit-learn's LogisticRegression and accuracy_score for building and evaluating our model. Step two, prepare the data: we begin by setting up our data set. Pages holds the number of pages in each book as the independent variable, and likes records Jenny's reaction as a dependent binary variable (one for like and zero for dislike). It's essential to reshape pages into a 2D array, since scikit-learn models expect features in this format; this preparation ensures our model can interpret the data
correctly step three create and train the model we defined a logistic regression model and train it with our data set using the fit method this
method optimizes the model parameters to best explain the relationship between the number of pages and Jenny's preferences training involves finding the statistical parameters that minimize
prediction error effectively learning from the data step four make predictions after training we use the predict method
to estimate Jenny's reaction to a book with 260 pages; this part of the code shows how the trained model applies what it has learned to new data. Step five,
plotting we visually present the data and model predictions we plot the data points in green displaying actual likes and dislikes the model's predicted
probabilities are shown in red giving a visual representation of the likelihood of liking books of various page lengths specific markers highlight the query
point at 260 Pages helping to contextualize the prediction visually step six displaying prediction finally we display the prediction result
directly from our model using a print statement this tells us if Jenny is predicted to like or dislike the book based on the number of pages illustrating the practical application
of our logistic regression model. Conclusion: this process illustrates how logistic regression can be used to make predictions based on historical data, providing insights that can be applied in various fields beyond just reading preferences.
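Here is a minimal sketch of the six steps just described; the page counts and like/dislike labels are made up, so the fitted curve is only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: prepare the data (pages reshaped to 2D, likes as a binary target)
pages = np.array([100, 150, 180, 220, 260, 300, 350, 400, 450, 500]).reshape(-1, 1)
likes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 1 = like, 0 = dislike

# Step 3: create and train the model
model = LogisticRegression()
model.fit(pages, likes)
print("Training accuracy:", accuracy_score(likes, model.predict(pages)))

# Step 4: predict Jenny's reaction to a 260-page book
new_book = np.array([[260]])
prediction = model.predict(new_book)[0]

# Step 5: plot data points (green) and the predicted like-probability curve (red)
grid = np.linspace(pages.min(), pages.max(), 200).reshape(-1, 1)
plt.scatter(pages, likes, color="green", label="Actual reactions")
plt.plot(grid, model.predict_proba(grid)[:, 1], color="red", label="P(like)")
plt.scatter(260, model.predict_proba(new_book)[0, 1], marker="x", s=100, label="Query: 260 pages")
plt.xlabel("Number of pages")
plt.ylabel("Probability of liking")
plt.legend()
plt.show()

# Step 6: display the prediction
print("Jenny is predicted to", "like" if prediction == 1 else "dislike", "this book.")
```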
Now let's go over the results produced by our code. The slope of the red line, not a linear change: the slope of the red line is positive, indicating an upward curve as we move along the x-axis representing the number of pages. This is crucial because it means the relationship between page count and Jenny's liking isn't a simple straight-line increase; in other words, each additional page doesn't increase
her enjoyment by the same amount. The sigmoid's effect: the s-shaped curve typical of logistic regression means that the change in probability is more
pronounced for certain ranges of page length there might be zones where adding more pages drastically increases the likelihood of liking the book in
contrast other zones might show very little increase in probability despite many more pages this is different from linear regression where the slope remains constant and the change is
directly proportional. Reasoning behind the slope: this curve's shape could suggest some
underlying patterns in Jenny's preferences short is not sweet very short books might not be her cup of tea as the stories Could Feel
underdeveloped The Sweet Spot there might be a middle range of page counts where the probability jumps significantly indicating her favorite
type of book length. Too long, oh too much: perhaps extremely long books are thought to have a diminishing return on her enjoyment. Green dots, model accuracy:
proximity matters the fact that the green dots representing actual books Jenny has read are clustered tightly around the red line is a very good sign it signifies that the model's
predictions closely align with Jenny's real World preferences data validation this clustering demonstrates the model has successfully picked up on the
underlying pattern between page number and Jenny's like dislike reactions consequently we can have more confidence in its predictions for new
books exceptions if some green dots were far away from the line that would be a cause for concern it would mean the model is consistently mispredicting in those regions and might need
refinement. The threshold line at 0.5, decision time: the line at 0.5 is where we translate the continuous
probability values into the Practical like or dislike recommendations for Jenny not set in stone while 0.5 is a common threshold it's not mandatory
depending on how much we prioritize avoiding false positives like predictions that turn out to be dislikes or false negatives dislike predictions when Jenny would have actually enjoyed
the book, we might move this threshold higher or lower. Customizing for Jenny: if Jenny indicates she generally likes to try
books even if she's not super sure we might lower the threshold this gives her more recommendations even if the certainty of her liking them is slightly
lower this logistic regression model reveals that Jenny's book preferences are influenced by page count in a nonlinear way and it has successfully
learned the underlying pattern to provide more informed recommendations let's explore in what other sections logistic regression is used we'll explore how logistic
regression can be used to predict customer churn for a subscription-based company churn or the act of customers canceling their subscriptions poses a
significant challenge for such companies. Logistic regression, with its binary outcome prediction capabilities, is an ideal choice for this scenario. The problem at
hand is straightforward the company wants to proactively identify customers who are at high risk of cancelling their subscriptions also known as churning
logistic regression is particularly suitable for this task because it deals with binary outcomes where a customer either churns one or continues their
subscription how logistic regression is used data Gathering the company starts by collecting historical data on various aspects of customer behavior and
demographics this includes usage patterns such as frequency and feature engagement support interactions such as tickets opened and types of complaints
plan types, distinguishing between basic and premium subscriptions; demographic information like age, location, etc. Model training: the collected data is divided
into training and testing sets a logistic regression model is then trained using the training data during training the model learns how different factors such as usage patterns and
demographics relate to the probability of churn scoring new customers once the model is trained it can be used to score new customers each customer's data is
fed into the trained model, which generates a probability score between zero and one indicating their likelihood of churning. Proactive action: customers with high churn probabilities are identified and receive targeted attention; interventions may include offering special deals, personalized outreach, or addressing common pain points known to lead to churn.
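The transcript describes this churn workflow only conceptually; the sketch below is a hypothetical illustration of the scoring step with scikit-learn, and the column names and data are invented:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical customer data with a churned label (1 = cancelled)
data = pd.DataFrame({
    "monthly_logins":  [30, 2, 25, 1, 18, 4, 22, 3],
    "support_tickets": [0, 5, 1, 6, 2, 4, 0, 7],
    "premium_plan":    [1, 0, 1, 0, 1, 0, 1, 0],
    "churned":         [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data.drop(columns="churned")
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Score "new" customers: probability of churn between 0 and 1
churn_prob = model.predict_proba(X_test)[:, 1]
at_risk = X_test[churn_prob > 0.5]  # flag high-risk customers for proactive outreach
print("Churn probabilities:", churn_prob)
print("Customers flagged for intervention:")
print(at_risk)
```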
Imagine Sarah, who loves cooking and trying various fruits. She's noticed that the fruits she enjoys tend to fall into certain size and sweetness ranges. Could she predict whether she'll like a new fruit based on these characteristics? Linear discriminant analysis (LDA) is the perfect tool to help her out. LDA is a powerful technique for classifying things based on several features; think about how facial
recognition software can identify individuals in Sarah's case we can use LDA to find patterns in the size and sweetness of fruits she's liked or
disliked in the past LDA will search for a way to create a sort of like versus dislike boundary based on these features
imagine each fruit as a point on a graph where the x-axis is size and the y-axis is sweetness. LDA tries to draw the best possible line to separate the liked and disliked fruits. Of course some overlap might happen; maybe there are some small fruits she surprisingly loves. LDA
aims to find the line that does the best possible job of separating the groups overall LDA is great when you have multiple features to consider at once
Sarah could try looking at size or sweetness alone but LDA lets her combine this information for potentially better predictions so let's go through the code
now first we need to ensure that we have all the necessary tools at our disposal we'll be using python for our coding tasks and will import the powerful
libraries numpy for numerical operations and matplotlib.pyplot for data visualization. Additionally we'll utilize scikit-learn's LinearDiscriminantAnalysis module for implementing LDA. Now let's create a sample data set: we'll curate a
data set consisting of eight fruits each characterized by two features size and sweetness these features will serve as our inputs while the corresponding
labels will indicate whether each fruit is liked one or disliked zero this data set will form the foundation for training our predictive model with our
data set ready it's time to build our predictive model using linear discriminant analysis we'll create an instance of an LDA model object and
proceed to train it using the features, size and sweetness, and the corresponding labels from our sample data set. We'll select a new fruit with a size of 2.5 and a sweetness of 6 as our test case; using these feature values the model will predict whether Sarah would
like this fruit or not this prediction will provide valuable insights into the model's decision-making process and its ability to generalize to unseen
instances to visualize the results of our analysis we'll plot the sample data set on a scatter plot in this plot fruits that are liked by Sarah will be
represented by Blue markers while disliked fruits will be marked in yellow additionally we'll highlight the new fruit being predicted with a distinct
red X marker, providing a clear visual representation of its classification. After making the prediction we'll display the outcome: we'll print a statement indicating whether Sarah is likely to enjoy the new fruit based on the model's classification decision. This insight will offer valuable information on how the model interprets the given features and arrives at its prediction.
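Here is a minimal sketch of the fruit example as described; the eight size/sweetness pairs and the like/dislike labels are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Each fruit: [size, sweetness]; label 1 = liked, 0 = disliked
fruits = np.array([[1.0, 2.0], [1.5, 3.0], [2.0, 7.0], [2.5, 8.0],
                   [3.0, 3.5], [3.5, 9.0], [4.0, 4.0], [4.5, 5.0]])
likes = np.array([0, 0, 1, 1, 0, 1, 0, 0])

lda = LinearDiscriminantAnalysis()
lda.fit(fruits, likes)

# Predict whether Sarah would like a new fruit with size 2.5 and sweetness 6
new_fruit = np.array([[2.5, 6.0]])
prediction = lda.predict(new_fruit)[0]

# Liked fruits in blue, disliked in yellow, the new fruit as a red X
plt.scatter(fruits[likes == 1, 0], fruits[likes == 1, 1], color="blue", label="Liked")
plt.scatter(fruits[likes == 0, 0], fruits[likes == 0, 1], color="yellow", label="Disliked")
plt.scatter(2.5, 6.0, color="red", marker="x", s=100, label="New fruit")
plt.xlabel("Size")
plt.ylabel("Sweetness")
plt.legend()
plt.show()

print("Sarah is predicted to", "like" if prediction == 1 else "dislike", "the new fruit.")
```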
Here are several considerations we should make about the resulting plot, which shows fruit enjoyment based on size and sweetness: the x-axis represents size, the y-axis represents sweetness, and the data points are colored in orange (like) and blue (dislike). Here are some observations you can make. Class separation: there appears to be some
separation between the orange like and blue dislike data points this suggests that size and sweetness may be useful
factors in predicting whether Sarah will enjoy a particular fruit. Overlap between classes: there's also some overlap between the two classes,
particularly in the region of larger and sweeter fruits this indicates that size and sweetness alone may not perfectly predict Sarah's preferences other
factors not considered here might also influence her enjoyment here are additional considerations we should make sample size the number of data points is
not visible in the image a larger data set might provide a clearer picture of the class separation and improve the model's generalizability nonlinear relationships
the graph assumes a linear relationship between size sweetness and enjoyment if the true relationship is more complex a model like LDA might not capture it
perfectly overall the data suggests a potential link between fruit size sweetness and Sarah's preferences LDA is a promising technique to explore for
building a classif ification model but it's important to consider potential limitations and the need for more data
if available where decision trees take root decision trees with their intuitive branching structure find use across
various Industries and problem domains let's dive into some key areas where they prove particularly effective business and finance customer
segmentation analyze customer data to identify group groups with similar behaviors or purchasing patterns for targeted marketing
strategies fraud detection identify patterns and transactions that may indicate fraudulent activity credit risk assessment evaluate the credit
worthiness of loan applicants based on their financial history and other factors operations management optimize decision making in areas like inventory
management Logistics and resource allocation Health Care medical diagnosis support assist in diagnosing diseases by guiding clinicians through a series of
questions and tests based on patient symptoms and medical history treatment planning help determine the most suitable treatment options based on
patient characteristics and disease severity disease risk prediction identify individuals at high risk of developing certain health conditions
based on factors like lifestyle family history and medical data science and engineering fault diagnosis isolate the cause of malfunctions or failures in
complex systems by analyzing sensor data and system logs classification in biology categorize species based on their characteristics or DNA sequences
remote sensing analyze satellite imagery to classify land cover types or identify areas affected by natural disasters
customer service troubleshooting guides create interactive decision trees to guide customers through troubleshooting steps for products or Services chatbots
power automated chatbots that can categorize customer inquiries and provide appropriate responses reducing weight times and improving support
efficiency other applications game playing design AI OPP components in games that can make strategic decision logistic regression is a
popular approach for performing classification when there are two classes but when the classes are well separated or the number of classes
exceeds two, the parameter estimates for the logistic regression model are surprisingly unstable. Unlike logistic regression, LDA does not suffer from this instability problem when the number of classes is more than two. If n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is again more stable than the logistic regression model.
The bias-variance tradeoff: the key challenge in machine learning lies in finding the right balance between bias and variance;
generally reducing bias increases variance and vice versa complex models tend towards low bias and high variance while simpler models tend towards the
opposite the ideal model finds a sweet spot between underfitting High bias and overfitting high variance a balance that
depends heavily on the nature of your specific problem and the trade-off you're willing to make between flexibility and
stability. Naive Bayes versus logistic regression: naive Bayes is known for high bias; it assumes features are independent, which often isn't true. However, its simplicity makes it less prone to overfitting (low variance) and computationally fast to train. On the other hand, logistic regression is more flexible (low bias) and can model complex decision boundaries; this comes at the risk of overfitting (high variance), especially with many features or little regularization. When to choose which: if you prioritize speed and simplicity, naive Bayes might be a good starting point; when your data relationships are unlikely to be simple and independent, logistic regression's flexibility becomes valuable. However, if you choose logistic regression you need to actively manage overfitting, potentially using techniques like regularization. Step one, importing libraries: we import the libraries needed for plotting purposes and scikit-learn's logistic regression and linear
discriminant analysis for classification tasks step two generating synthetic data next we Define a function called
generate data to create synthetic data for our classification experiment the function generates data points from
three initial classes, each centered at (0, 0), (3, 0), and (6, 0); for each class, random data points are generated around the respective center using a Gaussian distribution. Step three, data generation and model fitting: we generate a data set with 40 samples per class using the
generate data function then we fit logistic regression and LDA models to this data set step four analyzing the
results after fitting the models we print the coefficients for both logistic regression and LDA for the initial three classes these coefficients provide
insights into the decision boundaries learned by each model step five adding an extra class we then introduce a new class to our data set by generating
additional data points centered at (9, 0); we append this new class to our data set and update the corresponding labels accordingly. Step six, refitting the
models following the addition of the new class we refit both logistic regression and LDA models to the updated data set
with four classes step seven analyzing the results for four classes finally we print the coefficients for both models after fitting them to the data set with
four classes. This allows us to observe how the decision boundaries change with the inclusion of the new class.
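Below is a minimal sketch of this experiment; the generate_data helper is a plausible reconstruction, and the spread of the Gaussian clusters is an assumption since the transcript does not state it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def generate_data(centers, n_per_class=40, spread=1.0, seed=0):
    """Draw n_per_class Gaussian points around each 2D center and label them by class."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal(loc=c, scale=spread, size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

# Three initial classes centered at (0, 0), (3, 0) and (6, 0)
X, y = generate_data([(0, 0), (3, 0), (6, 0)])
logreg = LogisticRegression(max_iter=1000).fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)
print("Logistic regression coefficients (3 classes):\n", logreg.coef_)
print("LDA coefficients (3 classes):\n", lda.coef_)

# Add a fourth class centered at (9, 0) and refit both models
X4, y4 = generate_data([(0, 0), (3, 0), (6, 0), (9, 0)])
logreg4 = LogisticRegression(max_iter=1000).fit(X4, y4)
lda4 = LinearDiscriminantAnalysis().fit(X4, y4)
print("Logistic regression coefficients (4 classes):\n", logreg4.coef_)
print("LDA coefficients (4 classes):\n", lda4.coef_)
```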
Tom is a movie Enthusiast who watches films across different genres and Records his feedback whether he liked them or not he has noticed that whether
he likes a film might depend on two aspects the movie's length and its genre can we predict whether Tom will like a movie based on these two characteristics
using naive Bayes? Technically, we want to predict a binary outcome (like or dislike) based on the independent variables movie length and genre. Let's delve into the
code's functionality. The initial step involves importing essential libraries necessary for the code's operations: numpy facilitates numerical operations while matplotlib aids in visualization. Additionally, the Gaussian naive Bayes implementation from scikit-learn is
imported to utilize its functionalities following the Imports the script defines sample data representing movie features and corresponding likes each movie is
characterized by its length in minutes and a genre code notably genres are numerically encoded with zero signifying action one representing romance and so
forth. This structured representation prepares the data for subsequent analysis. Moving forward, a Gaussian naive Bayes model is instantiated; this model serves
as the predictive engine leveraging its inherent assumptions about feature Independence to classify movie likes subsequently the model is trained using
the provided movie features and their Associated likes once the model is trained it is ready to make predictions a new movie is defined with its length
and genre code; leveraging the trained naive Bayes model, predictions are made regarding whether Tom would like this movie based on its features. This step
demonstrates the practical application of the trained model in making real world predictions the script proceeds to visualize the data set through a scatter
plot each existing movie is plotted based on its length and genre code with liked movies depicted in one color and disliked movies in another Additionally
the new movie is plotted with a distinct marker, providing visual context to aid interpretation.
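Here is a minimal sketch of the movie example as described; the movie lengths, genre codes, and likes are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Each movie: [length in minutes, genre code (0 = action, 1 = romance, ...)]
movies_features = np.array([[120, 0], [150, 0], [90, 1], [110, 1],
                            [140, 0], [95, 1], [180, 0], [100, 1]])
movies_likes = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # 1 = liked, 0 = disliked

model = GaussianNB()
model.fit(movies_features, movies_likes)

# Predict Tom's reaction to a new 100-minute movie with genre code 1
new_movie = np.array([[100, 1]])
prediction = model.predict(new_movie)[0]

liked = movies_likes == 1
plt.scatter(movies_features[liked, 0], movies_features[liked, 1], marker="o", label="Liked")
plt.scatter(movies_features[~liked, 0], movies_features[~liked, 1], marker="o", label="Disliked")
plt.scatter(100, 1, color="red", marker="x", s=100, label="New movie")
plt.xlabel("Movie length (minutes)")
plt.ylabel("Genre code")
plt.legend()
plt.show()

print("Tom will like the new movie." if prediction == 1 else "Tom will not like the new movie.")
```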
Here are a few observations and conclusions we can make. Clear separation: we can see that there is a clear separation between the like and dislike clusters; this indicates that the naive Bayes model has successfully found a distinct boundary based on the movie length and genre features, and the strong separation suggests these features are powerful predictors of preference. Prediction confidence: since the new movie's red X falls squarely within a cluster, we can have a high degree of confidence in the model's prediction.
further exploration even with good separation movies lying close to the boundary deserve closer examination they may offer insights into less predictable
cases and help refine the model further overlapping clusters if the like and dislike groups overlap significantly it
suggests that movie length and genre alone May not be enough to accurately predict preferences in all situations model limitations in cases of
overlap, naive Bayes might be too simplistic; exploring more sophisticated models like decision trees or support vector machines could improve accuracy. Need more data: a larger and more diverse data set, or the inclusion of additional features (e.g. star ratings, director), could help uncover clearer patterns in situations where the current features aren't sufficient. Genre significance: if distinct clusters form based on genre codes, it means the naive Bayes model recognizes genre as a strong indicator of preference. Personalized recommendations: this genre-based insight can be used to tailor recommendations; if a user consistently enjoys a particular genre, movies within that genre should be prioritized even if their length deviates from the usual pattern. Caveats: it's important to remember that genre preferences are subjective and a given genre might naturally have mixed reactions.
A high-bias model is like trying to fit a curved data set with a straight line; in contrast, a low-bias model is more flexible, allowing it to potentially match intricate trends in the data. The next piece of code serves to elucidate the intricate concepts of
bias, variance, and the bias-variance tradeoff by juxtaposing naive Bayes and logistic regression classifiers on a synthetic data set. Here's a cohesive explanation of the code's functionality. To begin, the script generates a synthetic data set comprising two classes arranged in circular patterns; the make_circles function from scikit-learn is employed for this purpose, creating data that challenges the assumptions of naive Bayes due to its nonlinear separability. Following data generation, the script proceeds to train two distinct classifiers. Firstly, a Gaussian naive Bayes model is trained on the data set; this choice aligns with the high-bias, low-variance characteristics of naive Bayes, given its simplicity and assumption of feature independence. Secondly, a logistic regression model is trained with regularization; regularization is introduced to combat overfitting, a concern for logistic regression due to its flexibility potentially leading to higher variance. Once the models are trained, the script visualizes their decision boundaries through plots; showcasing the decision boundaries of both models, it becomes evident that naive Bayes delineates a simpler linear boundary while logistic regression captures the data set's nonlinearity. Furthermore, the script calculates the accuracy of each model on a held-out test set, facilitating a comparative analysis of their performance. A deeper understanding of the bias-variance trade-off emerges from this comparison: naive Bayes exhibits higher bias due to its simplifying assumptions, resulting in a less complex decision boundary; on the contrary, logistic regression's flexibility enables it to learn the nonlinear pattern with lower bias, albeit at the risk of overfitting. The importance of contextual considerations becomes apparent: while logistic regression often boasts lower bias, naive Bayes' simplicity and computational efficiency may render it preferable in certain contexts. Model selection hinges on factors such as data set characteristics, computational constraints, and the relative significance of interpretability versus raw predictive performance. It's crucial to acknowledge the limitations and caveats of the presented example: the observed results may vary on different data sets or with alternative hyperparameter configurations; additionally, while decision boundary visualization aids comprehension, accuracy metrics are equally essential for comprehensive model evaluation. Finally, the script underscores the significance of continuous learning in machine learning: it advocates for a methodical approach involving experimentation with diverse models, rigorous evaluation, and judicious selection based on problem-specific requirements and performance
metrics.
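Here is a minimal sketch of the comparison being described; the make_circles parameters (sample size, noise, factor) are assumptions, since the transcript does not state them, and the decision-boundary plots are omitted for brevity:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Two classes arranged in circular patterns: a nonlinearly separable data set
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

nb = GaussianNB().fit(X_train, y_train)                   # high bias, low variance
logreg = LogisticRegression(C=1.0).fit(X_train, y_train)  # more flexible, regularized

print("Naive Bayes test accuracy:        ", accuracy_score(y_test, nb.predict(X_test)))
print("Logistic regression test accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```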
Alex is intrigued by the relationship between the number of hours studied and the scores obtained by students. Alex collected data from his peers about their study hours and respective test scores; he wonders, can we predict a student's score based on the number of hours they study? Let's leverage decision tree regression to uncover this. Technically, we're predicting a continuous outcome (test score) based on an independent variable (study hours). Let's dissect the code to understand its functionality. Importing libraries: we begin by importing the necessary libraries; numpy assists in numerical operations, matplotlib facilitates visualization, and DecisionTreeRegressor from scikit-learn is utilized
for decision tree regression sample data definition following the library Imports the code defines sample data this data
includes the number of hours studied and the corresponding test scores achieved; each entry pairs a study hour with its corresponding test score, forming the data set for training the
regression model creating and training the model a decision tree regression model is instantiated with a maximum depth set to
three this parameter controls the maximum number of levels within the decision tree subsequently the model is trained using the provided study hours
and their corresponding test score prediction after training the model is capable of making predictions an example
Study Hour 5.5 hours is chosen and the model predicts the test score corresponding to this input based on its
training. Plotting the decision tree: the code then generates a visualization of the decision tree regression model; this visualization elucidates the
decision-making process of the model utilizing the provided features study hours plotting study hours versus test scores another plot is created to
illustrate the relationship between study hours and test scores this scatter plot displays the actual data points while the regression line portrays the
predictions made by the decision tree regression model additionally the predicted test score for the new study hour is highlighted
Displaying the prediction: finally, the code prints out the predicted test score for the specified study hours. Now here are some key features. Sample data: study hours contains the hours studied and test scores contains the corresponding test scores. Creating and training the model: we create a decision tree regressor with a specified maximum depth to prevent overfitting and train it with fit using our data. Plotting the decision tree: plot_tree helps visualize the decision-making process of the model, representing splits based on study hours. Prediction and plotting: we predict the test score for a new study-hour value, 5.5 in this example, and visualize the original data points, the decision tree's predicted scores, and the new prediction.
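Here is a minimal sketch of the study-hours example as described; the hours and scores are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
test_scores = np.array([50, 55, 60, 68, 72, 80, 85, 88])

# Limit the depth to 3 to keep the tree small and reduce overfitting
model = DecisionTreeRegressor(max_depth=3)
model.fit(study_hours, test_scores)

# Predict the score for 5.5 hours of study
predicted_score = model.predict(np.array([[5.5]]))[0]

# Visualize the fitted tree
plot_tree(model, feature_names=["study_hours"], filled=True)
plt.show()

# Plot actual scores (red), the stepwise predictions (orange) and the new prediction (green X)
grid = np.linspace(1, 8, 200).reshape(-1, 1)
plt.scatter(study_hours, test_scores, color="red", label="Actual scores")
plt.plot(grid, model.predict(grid), color="orange", label="Predicted (step function)")
plt.scatter(5.5, predicted_score, color="green", marker="x", s=100, label="5.5 hours")
plt.xlabel("Study hours")
plt.ylabel("Test score")
plt.legend()
plt.show()

print(f"Predicted test score for 5.5 hours of study: {predicted_score:.1f}")
```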
Now here are a few conclusions we can make from the decision tree regressor visualization and the study hours versus test scores plot. Observations from the plot, a step function: the orange line representing the predicted scores demonstrates a clear step
function a typical characteristic of decision tree regressors this indicates that the model provides constant predictions within certain ranges of
study hours, changing abruptly at specific thresholds where new rules (splits) apply. Prediction accuracy: the red dots
actual scores mostly align with the orange step function predicted scores suggesting that the decision tree model does a good job of capturing the general Trend in the data the close alignment
also indicates that the model handles the nonlinear relationship between study hours and test scores well within the constraints of its maximum depth
specific prediction the plot marks a prediction for 5.5 hours of study green X this specific prediction falls on the
step increase suggesting that additional study hours Beyond a certain threshold significantly improve the predicted test score according to the model's training
data conclusions model fit the decision tree appears well fitted to the range of data presented without signs of overfitting or
underfitting the choice of maximum depth seems appropriate balancing model complexity and generalization utility of decision tree
for educational data: decision trees are useful for educational data like study hours and test scores because they can easily model thresholds (e.g. minimum hours needed to achieve a certain score) that are intuitive for educational planning and interventions. Implications for students and educators: this model can help in setting realistic study goals based on expected outcomes; for instance, educators can advise students about the probable benefit of studying an
additional hour based on the model's predictions potential for refinement while the current model provides valuable insights further refinement
with additional features like type of study material individual student Baseline performance Etc could enhance prediction accuracy testing the model on
more diverse data sets or incorporating Ensemble methods like random forests could provide a more robust analysis and
mitigate any variance not captured by a single decision tree visualization and interpretation the stepwise visualization AIDS in understanding how
additional study hours could lead to increments in test scores which is valuable for explaining Model Behavior to non-technical
stakeholders. Where decision trees take root: decision trees, with their intuitive branching structure, find use across various industries and problem domains;
let's dive into some key areas where they prove particularly effective business and finance customer segmentation analyze customer data to
identify groups with similar behaviors or purchasing patterns for targeted marketing strategies fraud detection identify
patterns and transactions that may indicate fraudulent activity credit risk assessment evaluate the credit worthiness of loan applicants based on
their financial history and other factors operations management optimize decision making in areas like Inventory
management Logistics and resource allocation Health Care medical diagnosis support assist in diagnosing diseases by guiding clinicians through a series of
questions and tests based on patient symptoms and medical history treatment planning help determine the most suitable treatment options based on
patient characteristics and disease severity disease risk prediction identify individuals at high risk of developing certain health conditions
based on factors like lifestyle family history and medical data science and engineering fault diagnosis isolate the cause of malfunctions or failures in
complex systems by analyzing sensor data and system logs classification in biology categorize species based on their characteristics or DNA sequences
remote sensing analyze satellite imagery to classify land cover types or identify areas affected by natural disasters
Customer service: troubleshooting guides, create interactive decision trees to guide customers through troubleshooting steps for products or services; chatbots, power automated chatbots that can categorize customer inquiries and provide appropriate responses, reducing wait times and improving support
efficiency other applications game playing design AI opponents in games that can make strategic decisions based on the state of the game e-commerce
personalized product recommendations based on user browsing behavior and past purchases; Human Resources: identify key factors influencing
employee retention and make informed decisions why decision trees Thrive here decision trees excel in these scenarios
due to several factors interpretability the decision making process is transparent allowing humans to understand the reasoning behind the
model's predictions handles diverse data accommodates both numerical and categorical features nonlinear relationships can capture complex
nonlinear patterns within data versatility applicable for both classification predicting a class label and regression meet Lucy a fitness coach who
is curious about predicting her client's weight loss based on their daily calorie intake and workout duration Lucy has data from past clients but recognizes
that individual predictions might be prone to errors. Let's utilize bagging to create a more stable prediction model. Technically, we'll predict a continuous outcome (weight loss) based on two independent variables (daily calorie intake and workout duration), using bagging to reduce variance in predictions. Let's now go through the
code. Importing libraries: we begin by importing the necessary libraries; numpy is imported as np to facilitate numerical operations and matplotlib.pyplot is imported as plt for visualization purposes. Here are several key features. Clients data contains the daily calorie intake and workout duration, and weight loss contains the corresponding weight loss.
Train-test split: we split the data into training and test sets to validate the model's predictive performance. Creating and training the model: we instantiate BaggingRegressor with DecisionTreeRegressor as the base estimator and train it using fit with our training data. Prediction and evaluation: we predict weight loss for the test data, evaluating prediction quality with the mean squared error (MSE). Visualizing one of the base estimators: optionally, visualize one tree from the ensemble to understand individual decision-making processes, keeping in mind that an individual tree may not perform well but collectively they produce stable predictions.
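Here is a minimal sketch of the bagging example as described; the clients_data values are made up, so the printed MSE will not match the 0.75 quoted below (note that scikit-learn's BaggingRegressor uses a decision tree as its default base estimator):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Each client: [daily calorie intake, workout duration in minutes]
clients_data = np.array([[1800, 30], [2000, 45], [2200, 20], [1600, 60],
                         [2500, 15], [1900, 50], [2100, 35], [1700, 55]])
weight_loss = np.array([3.0, 3.5, 2.0, 4.5, 1.5, 4.0, 2.8, 4.2])  # pounds

X_train, X_test, y_train, y_test = train_test_split(
    clients_data, weight_loss, test_size=0.25, random_state=0)

# Bagging: many trees fit on bootstrap samples, predictions averaged
model = BaggingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("True weight loss:     ", y_test)
print("Predicted weight loss:", predictions)
print("Mean squared error:   ", mean_squared_error(y_test, predictions))
```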
A breakdown of the key points. True weight loss: this range, 2 to 4.5 lb, represents the actual weight loss experienced by the clients in the test set. Predicted weight loss: this range, 3.1 to 3.96 lb, represents the model's predictions for weight loss in the test set. Mean squared error (MSE), 0.75: this metric measures the average squared difference between the predicted and the true weight loss; a lower MSE generally indicates better model performance. In simpler terms, the model predicts weight loss somewhat accurately, but there are deviations between the predictions and the actual weight loss experienced by the clients; on average the model's squared error was 0.75. In our previous explorations of machine learning we've come to recognize the inherent strengths
and limitations of individual models some models are masters of Simplicity offering quick interpretable results While others thrive in handling complex
high-dimensional data sets yet all models are susceptible to the twin challenges of bias and variance this is where bagging enters as a powerful Ally
leveraging the wisdom of crowds to forge better predictive Frameworks applications of bagging regression problems imagine you're attempting to
predict housing prices within a bustling City factors like square footage location number of bedrooms and countless others collectively influence
the price a single linear regression model might struggle to capture the intricate relationships between these features bagging comes to the Rescue by
training multiple regressors like decision trees on diverse samples of the data and averaging their predictions we reduce variance and improve
accuracy. Classification quests: perhaps you're tasked with classifying customer reviews as positive or negative; a lone naive Bayes classifier might make
oversimplified assumptions about word independence resulting in subpar performance bagging empowers us to
assemble an ensemble of classifiers each member of The Ensemble casts a vote and the majority vote often yields a superior classification decision
mitigating the shortcomings of any single model. Image recognition: the vast world of image recognition presents unique
challenges with high dimensional data convolutional neural networks cnns while remarkably powerful can fall prey to
overfitting with bagging at our disposal we can create a council of independently trained cnns where each Network focuses on distinct subsets of the image data
aggregating their predictions instills robustness and can significantly improve classification results harnessing the power of diversity the Cornerstone of
bagging lies in cultivating diversity within its Ensemble by constructing models on varying bootstrapped samples of the original data set each model
develops slightly different biases this clever strategy combats bias through its Collective approach moreover the random
nature of sampling reduces variance, particularly when working with unstable algorithms like decision
trees real world examples Healthcare in medical diagnosis where Precision is Paramount bagging is widely used ensembles of models trained on patient
data often lead to enhanced accuracy in identifying diseases contributing to Better healthc Care decision making Finance Financial instit utions employ
bagging for critical tasks such as fraud detection and risk assessment aggregated models built with bagging techniques are frequently more efficient at detecting
anomalies and spotting fraudulent patterns aiding in the protection of valuable assets environmental science bagged
models are leveraged in tasks ranging from land cover classification to climate modeling the ability to create more stable and reliable predictions
from diverse data sets and models proves invaluable when tackling complex environmental challenges remember while bagging empowers us with
stronger predictive prowess it's not a Magic Bullet random forests provide an improvement over bag trees by way of a
small tweak that decorrelates the trees as in bagging we build a number of decision trees on bootstrap training samples but when building these decision
trees each time a split in a tree is considered a random sample of M predictors is chosen as split candidates from the full set of P predictors the
split is allowed to use only one of those m predictors a fresh and random sample of M predictors is taken at each
split and typically we choose MP that is the number of predictors considered at each split is approximately equal to the square root of the total number of
predictors this is also the reason why random Forest is called random the main difference between bagging and random forests is the choice of predictor
subset size M decorrelates the trees using a small value of M in building a random Forest will typically be helpful when we have a large number of
correlated predictors so if you have a problem of multicolinearity RF is a good method to fix that problem so unlike in
bagging in the case of random forest in each tree split not all P predictors are considered but only randomly selected M predictors from it this results in not
similar trees being decorrelated and due to the fact that averaging decorrelated trees results in smaller variants random Forest is more
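To make the m versus p idea concrete, here is a minimal sketch (not from the course) of how the predictor subset size is controlled in scikit-learn: max_features plays the role of m, so setting it to all features essentially recovers bagged trees, while "sqrt" gives the usual random-forest choice.

```python
# Sketch: bagged trees consider all p predictors at every split,
# while a random forest considers only a random subset of size m ~ sqrt(p).
from sklearn.ensemble import RandomForestRegressor

bagged_trees = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)    # m = p, i.e. plain bagging of trees
random_forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)  # m ≈ sqrt(p), decorrelated trees
```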
Noah is a botanist who has collected data about various plant species and their characteristics, such as leaf size and flower color. Noah is curious whether he could predict a plant species based on these features. Here we'll utilize random forest, an ensemble learning method, to help him classify plants. Technically, we aim to classify plant species based on certain predictive variables using a random
forest model let's walk through the provided code step by step importing libraries the code starts by importing
necessary libraries such as numpy for numerical operations, matplotlib for visualization, RandomForestClassifier from scikit-learn for random forest classification, train_test_split for splitting the data, and classification_report for evaluating classification performance. Data preparation: the code defines two numpy arrays, plants_features, containing the features of the plants (leaf size and flower color), and plants_species, containing the corresponding species labels. Each row in plants_features represents a plant, and
the corresponding entry in Plants species denotes its species zero or one train test split the data set is split into training and testing sets using the
train test split function this ensures that the model is trained on a portion of the data and evaluated on an unseen portion the split ratio is set to 75%
training data and 25% testing data model initialization and training a random forest classifier model model is
initialized with 10 estimators (trees) and a random state of 42 for reproducibility. The initialized model is then trained using the training data X_train and y_train. Random forests build multiple decision trees and combine their predictions to improve generalization performance. Prediction and evaluation:
the trained model is used to make predictions (y_pred) on the test data X_test; these predictions are then evaluated using the classification_report function, which generates a detailed report including precision, recall, F1 score, and support for each class. Displaying prediction and evaluation: the classification report
containing evaluation metrics is printed to the console providing insights into the model's performance visualization two visualizations are generated to gain
insights into the data and model. Scatter plot of species: this plot visualizes the distribution of plant features, leaf size and flower color, for each species;
different marker shapes and colors represent different species making it easier to distinguish between them feature importance this horizontal bar
plot visualizes the importance of each feature, leaf size and flower color, in predicting plant species; features with higher importance values contribute more to the model's decision-making process.
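As a rough sketch of the plant-classification workflow walked through above, the code might look like the following. The feature values are invented for illustration; the course's actual arrays differ.

```python
# Sketch of the random forest plant classifier described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Columns: leaf size (cm), flower color encoded as 0/1. Values are made up for illustration.
plants_features = np.array([[3.1, 0], [2.5, 0], [4.8, 1], [5.2, 1],
                            [3.4, 0], [4.9, 1], [2.8, 0], [5.0, 1]])
plants_species = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # species labels: 0 or 1

X_train, X_test, y_train, y_test = train_test_split(
    plants_features, plants_species, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("feature importances:", model.feature_importances_)   # values behind the bar chart
```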
Interpreting the scatter plot: the scatter plot shows the distribution of plant data points based on their leaf size and flower color. The different marker colors, green and red, represent the two plant species, 0 and 1. Partial separation: there appears to be some separation between the green and red data points, suggesting that leaf size and flower color might be
partially effective in distinguishing between the two plant species overlap we can also observe some overlap between the green and red
clusters particularly towards the center of the plot this overlap indicates that some plants might have similar Leaf size and flower color features regardless of
species this overlap may lead to some classification Errors By the random forest model potential challenges the overlap
between the two species in the scatter plot suggests that the model might struggle to accurately classify plants that fall in these overlapping areas. Overall, the scatter plot provides a
visual representation of the data that can be helpful in interpreting the performance of the random forest model key points from the bar chart the bar
chart depicts the feature importances for leaf size and flower color the height of the bar represents the features importance in this case the bar
for flower color is considerably higher than the one for leaf size. Interpretation: this visualization confirms what we might have inferred
from the scatter plot flower color carries more weight in the model's decision-making process for plant species classification this is likely because
the flower color data seems to have a clearer separation between the two species green and red dots in the scatter plot
overall conclusions the random forest model partially leverages both Leaf size and flower color for classification but flower color appears to be the more
dominant feature. The model's performance might be limited by the overlap between the two species in the data, particularly for plants with similar leaf size and flower color values. Random forest, a versatile machine learning algorithm, finds applications across various domains, including finance
and banking Healthcare e-commerce marketing and more finance and banking in fraud detection random forests excel
in spotting irregular patterns in transactions leveraging features such as transaction amount location frequency and Merchant type to flag potential
fraudulent activities for credit risk assessment these models evaluate borrowers creditworthiness by analyzing factors like income debt to income ratio
credit history and employment status predicting the likelihood of default accurately additionally in stock market prediction random forests leverage
historical stock prices, company fundamentals, news sentiment, and market trends to forecast future prices, though this remains challenging. Healthcare: in medical diagnosis, random forests classify patients based on various medical data like test results, symptoms, and patient
history aiding health care providers in making informed decisions for drug Discovery and development researchers utilize these
models to identify potential drug candidates by analyzing molecular structures, gene expression data, and existing drug information. Furthermore, in personalized medicine, random forests help tailor treatments to individual patients by considering factors like genetics, medical history,
and lifestyle enabling predictions of patient response to specific therapies or medication dosages e-commerce and marketing customer segmentation benefits
from random forests, as they group customers based on purchase history, browsing behavior, demographics, etc., facilitating targeted marketing and
personalization efforts for product recommendations these models analyze customer purchase patterns product ratings and search history to offer
relevant product suggestions, thereby enhancing user experience and boosting sales. Moreover, in churn prediction, random
forests identify customers at risk of leaving by examining usage patterns service interactions and demographic
data allowing for proactive retention strategies other notable areas random forests find applications in environmental science aiding tasks like
land cover classifications using satellite imagery monitoring deforestation and assessing climate change
impact in image analysis they assist in image classification tasks such as facial recognition object detection in self-driving cars and analyzing Medical
Imaging scans furthermore in network intrusion detection random forests help identify suspicious Network traffic patterns by
analyzing features like source and destination IP addresses protocols used and packet sizes contributing to cyber security efforts it's important to note
that while random forests are generally robust to overfitting due to their Ensemble nature careful feature selection and hyperparameter tuning are
crucial for optimal performance. Additionally, random forests work well with both categorical and numerical features, and can handle data with missing values or outliers. Unlike bagging, which averages correlated decision trees, and random forest, which averages uncorrelated decision trees, boosting aims to improve
the predictions resulting from a decision tree boosting is a supervised machine learning model that can be used for both regression and classification
problems. Unlike bagging or random forest, where the trees are built independently from each other using bootstrapped samples (copies) of the initial training data, in boosting the trees are built sequentially and depend on each other: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. It's a method of converting weak learners into strong learners. In boosting, each new tree is fit on a modified version of the original data set, so unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model; that is, we fit a tree using the current residuals, rather than the outcome y, as the
response we then add this new decision tree into the fitted function in order to update the residuals each of these trees can be rather small with just a
few terminal nodes, determined by the parameter d in the algorithm. Now let's have a look at the three most popular boosting models in machine learning. The first ensemble algorithm we will look into today is AdaBoost. Like in all boosting techniques, in the case of AdaBoost the trees are built using information from the previous tree, and more specifically the part of the tree which didn't perform well; this is called the weak learner, a decision stump. This decision stump is built using only a single predictor, and not all predictors, to perform the prediction. So AdaBoost combines weak learners to make classifications, and each stump is made by using the previous stump's errors. Here is the step-by-step plan for building an AdaBoost model. Step one, initial weight assignment: assign an equal weight to all observations in the sample, where this weight represents the importance of the observation being correctly classified; with a weight of 1/N, all samples are equally important at this stage. Step two, optimal predictor selection: the first stump is built by obtaining the RSS (in case of regression) or the Gini index or entropy (in case of classification) for each predictor and picking the stump that does the best job in terms of prediction accuracy; the stump with the smallest RSS or Gini index/entropy is selected as the next tree. Step three, computing the stump's weight, and step four, updating the observation weights: we increase the weight of the observations which have been incorrectly predicted and decrease the weights of the remaining observations which were correctly classified, so that the next stump will place higher importance on correctly predicting these observations. Step five, building the next stump: based on the updated weights, use the weighted Gini index to choose the next stump. Step six, combining the stumps: all the stumps are combined while taking into account their importance (a weighted sum). Imagine a scenario where we
aim to predict house prices based on certain features like the number of rooms and age of the house for this example let's generate synthetic data
where num_rooms is the number of rooms in the house, house_age is the age of the house in years, and price is the price of the house in thousands of dollars. Importing libraries: the code starts by importing the necessary libraries, numpy for numerical operations, pandas for data manipulation, matplotlib for visualization, and specific modules from scikit-learn for machine learning tasks like model selection, ensemble learning,
and evaluation metrics data generation synthetic data is generated to mimic a real world scenario random numbers are generated to represent the number of
rooms in a house (num_rooms), the age of the house (house_age), and noise; the price of the house (price) is then calculated
based on a linear relationship with the number of rooms age of the house and added noise data visualization the generated
data is visualized using scatter plots; two scatter plots are created, one showing the relationship between the number of rooms and the house price, and
the other showing the relationship between the age of the house and the house price this visualization helps in understanding the distribution and
relationships between features and the target variable price data splitting the data is split into training and testing
sets using the train test split function from psyit learn this step is essential for training the model on one subset of data training set and evaluating its
performance on another subset testing set model initialization and training an adaboost regressor model is initialized with specific parameters like the number
of estimators decision trees set to 100 and a random seed for reproducibility the model is then trained using the training data X train
and Y train model evaluation once trained the model makes predictions on the test data X test the mean squared
error (MSE) and root mean squared error (RMSE) metrics are calculated to evaluate the model's performance compared to the actual house prices (y_test). Result visualization: the actual house prices (y_test) and the predicted prices are visualized using a scatter plot; the plot also includes a diagonal line representing perfect predictions. This visualization aids in assessing how closely the model's predictions align with the actual prices.
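A minimal sketch of this AdaBoost regression workflow on synthetic housing data might look like the following. The coefficients, noise level, and sample size are invented for illustration; the course's generated data will differ.

```python
# Sketch of AdaBoost regression on synthetic house-price data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
num_rooms = rng.integers(2, 8, size=500)
house_age = rng.integers(0, 50, size=500)
noise = rng.normal(0, 20, size=500)
price = 50 * num_rooms - 1.5 * house_age + 300 + noise      # price in $1,000s (illustrative)

X = np.column_stack([num_rooms, house_age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.25, random_state=42)

model_ada = AdaBoostRegressor(n_estimators=100, random_state=42)
model_ada.fit(X_train, y_train)

predictions = model_ada.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.2f}, RMSE: {np.sqrt(mse):.2f}")
```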
The scatter plots provided show two key relationships. Number of rooms versus price: this is represented by the green data points; there appears to be a positive correlation, meaning as the number of
rooms increases the price of the house also tends to increase this is likely because houses with more rooms are generally larger and more expensive
house age versus price this is represented by the red data points the relationship here is less clear there might be a slight negative correlation
where newer houses lower House age tend to be more expensive however the data points are scattered making it difficult to draw a definitive
conclusion additional points to consider the data points show some variation around the general Trends this indicates that there might
be other factors influencing house price besides the number of rooms and House age EG location amenities it's important to note that this is simulated data and
real estate prices can be influenced by many complex factors overall the scatter plot suggests that the number of rooms has a positive correlation with house
price, while the relationship between house age and price is less clear-cut. The scatter plots you saw provide valuable insights into the data, but AdaBoost
plays a crucial role in uncovering the underlying relationship between features number of rooms House age and price
here's how capturing complex relationships the data exhibits some scatter around the general Trends suggesting the price isn't perfectly
explained by just the number of rooms and house age; AdaBoost excels in handling such scenarios. It's an ensemble method that
combines multiple weak decision trees into a stronger final model these decision trees can effectively capture nonlinear patterns in the data providing
a more nuanced understanding of the price feature relationship than a single linear model focus on informative features while both Scatter Plots offer
clues, AdaBoost goes beyond simply visualizing correlations; it analyzes the data to determine which features, number
of rooms House age are most informative for predicting price by focusing on these features during the decision tree creation process adaboost prioritizes
the factors that have the strongest influence on price. Iterative refinement: AdaBoost works in a stagewise manner; it trains a series of weak decision trees, each focusing on correcting the errors of the previous one. By visualizing the data we can get a general sense of the trends, but AdaBoost iteratively refines its understanding through these multiple stages, ultimately leading to a more accurate prediction model. In essence, the scatter plots provide a starting point for understanding the data, but AdaBoost acts as a powerful tool to leverage that initial understanding and build a more robust model that captures the complexities of the price-
feature relationship Ada boost and gradient boosting are very similar to each other but compared to Ada boost which starts the process by selecting a stump and continuing to build it by
using the weak learners from the previous stump, gradient boosting starts with a single leaf instead of a tree or a stump; the outcome corresponding to this chosen leaf is then an initial guess for the outcome variable. Like in the case of AdaBoost, gradient boosting uses the previous tree's errors to build a new tree, but unlike in AdaBoost, the trees that gradient boosting builds are larger than a stump; there is a parameter where we set a maximum number of leaves to make sure the tree is not overfitting. Gradient boosting uses the learning rate to scale the gradient contributions, and it is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance on testing data. The major difference between the AdaBoost and gradient boosting algorithms is how the two identify the shortcomings of weak learners (for example, decision trees): while the AdaBoost model identifies the shortcomings by using high-weight data points, gradient boosting does the same by using gradients in the loss function. The loss function needs a special mention here, as it is the error term: the loss function is a measure indicating how good a model's coefficients are at fitting the underlying data, and a logical understanding of the loss function would depend on what we
are trying to optimize early stopping the special process of tuning the number of iterations for an algorithm such as GBM and random Forest is called early
stopping a phenomenon we touched upon when discussing the decision trees early stopping performs model optimization by monitoring the model's performance on a
separate test data set and stopping the training procedure once the performance on the test data stops improving Beyond a certain number of iterations it avoids
overfitting by attempting to automatically select the inflection point where performance on the test data set starts to decrease while performance
on the training data set continues to improve, as the model starts to overfit. In the context of GBM, early stopping can be based either on an out-of-bag (OOB) sample set or on cross-validation (CV). As mentioned earlier, the ideal time to stop training the model is when the validation error has decreased and started to stabilize, before it starts increasing due to overfitting.
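One simple way to implement this kind of early stopping with scikit-learn's gradient boosting is to monitor a held-out validation set with staged_predict. This is a sketch under assumed synthetic data, not the course's code.

```python
# Sketch: pick the number of boosting iterations by monitoring validation error.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=1, random_state=0)
gbm.fit(X_train, y_train)

# Validation error after each boosting stage; stop roughly where it bottoms out.
val_errors = [mean_squared_error(y_val, y_pred) for y_pred in gbm.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_errors)) + 1
print("best number of trees:", best_n_trees)

# GradientBoostingRegressor also supports built-in early stopping via
# the n_iter_no_change and validation_fraction parameters.
```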
To build a GBM, follow this step-by-step process. Step one: train the model on the existing data to predict the outcome variable. Step two: compute the error rate using the predictions and the real values
pseudo residual step three use the existing features and the pseudo residual as the outcome variable to predict the residuals again step four
use the predicted residuals to update the predictions from step one while scaling this contribution to the tree with a learning rate
hyperparameter step five repeat steps 1 to 4 the process of updating the pseudo residuals and the tree while scaling with the learning rate to move slowly in
the right direction, until there is no longer an improvement or we come to our stopping rule. The idea is that each time we add a new scaled tree to the model, the residual should get smaller.
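The five steps above can be sketched as a tiny from-scratch boosting loop. This is purely illustrative (the course itself uses scikit-learn's GradientBoostingRegressor, walked through next), and the data here is synthetic.

```python
# Toy sketch of the boosting loop: repeatedly fit small trees to the current residuals
# and add them in, scaled by a learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean(), dtype=float)    # step 1: initial guess
trees = []

for _ in range(100):
    residuals = y - prediction                          # step 2: pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=1)           # small tree (a stump here)
    tree.fit(X, residuals)                              # step 3: fit the residuals
    prediction += learning_rate * tree.predict(X)       # step 4: update, scaled by the learning rate
    trees.append(tree)                                  # step 5: repeat until a stopping rule

print("training MSE:", np.mean((y - prediction) ** 2))
```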
Let's break down the provided code step by step. Model initialization and training: the code initializes a gradient boosting
regressor model (model_gbm) with specific parameters, such as the number of estimators (trees) set to 100, the learning rate set to 0.1, the maximum depth of each tree set to 1, and a random seed for reproducibility. This model is then trained using the training data X_train and y_train. Gradient boosting builds an
ensemble of weak Learners decision trees in this case sequentially with each tree learning from the errors made by the previous ones predictions after training
the model is used to make predictions on the test data X test the predict method is applied to model GBM and the predicted house prices are stored in the
variable predictions. Model evaluation: to assess the model's performance, the mean squared error (MSE) and root mean squared error (RMSE) metrics are calculated; these metrics quantify the average squared difference between the actual house prices (y_test) and the predicted prices (predictions), and lower values indicate better model performance. The calculated MSE and RMSE are then printed to the console using formatted strings. Result visualization: finally, the code generates a scatter plot to visualize the relationship between the actual house prices (y_test) and the predicted prices (predictions). The scatter plot displays the actual prices on the x-axis and the predicted prices on the y-axis; additionally, a diagonal dashed line is drawn to represent perfect predictions, where actual prices equal predicted prices.
this visualization helps in assessing how closely the model's predictions align with the actual prices providing insights into the model's accuracy and
potential areas for improvement. Now if we compare this result with the result that we got before with AdaBoost, we can say the following. Scatter plot characteristics: both plots follow a similar structure, with predicted prices on the y-axis and actual prices on the x-axis; points represent individual
predictions with ideal predictions lying on a dashed diagonal line indicating where predicted prices equal actual
prices performance indication Ada boost the points are spread around the diagonal but show a trend of underestimating higher values as seen from the concentration of points below
the line as actual prices increase GBM the points are more tightly clustered around the diagonal line throughout the range of values
suggesting that GBM predicts both low and high prices with better accuracy than Ada boost algorithm Effectiveness the GBM model generally appears to
perform better, given the closer clustering of points around the identity line; this indicates a more accurate prediction across the range of house prices. The AdaBoost plot shows greater
deviation from the line especially at higher price points suggesting less consistency in prediction accuracy across the price
Spectrum data distribution both models handle the full range of data from about 150 to 500 in units consistent across
both models but the GBM seems to manage the upper range more effectively overall from these plots we can infer that GBM provides a more
accurate and consistent prediction for house prices compared to AdaBoost, particularly at higher price points, where AdaBoost tends to underestimate
values one of the most popular boosting or Ensemble algorithms is Extreme gradient boosting XG boost the difference between the GBM and XG
boost is that in the case of XG boost the second order derivatives are calculated second order gradients this provides more information about the direction of gradients and how
to get to the minimum of the loss function remember that this is needed to identify the weak learner and improve the model by improving the weak Learners
The idea behind XGBoost is that the second-order derivative tends to be more precise in terms of finding the accurate direction. XGBoost also applies advanced regularization in the form of L1 or L2 norms to address overfitting, and unlike AdaBoost, XGBoost is parallelizable due to its special
caching mechanism making it convenient to handle large and complex data sets also to speed up the training XG boost uses an approximate greedy algorithm to
consider only a limited amount of thresholds for splitting the nodes of the trees to build an XG boost model
follow this step-by-step process. Step one: fit a single decision tree; in this step a loss function (for example, NDCG) is calculated to evaluate the model. Step two: add the second tree; this is done such that when this second tree is added to the model, it lowers the loss function based on first- and second-order derivatives compared to the previous tree, where we also use a learning rate, eta. Step three: finding the direction of
the next move using the first degree and second degree derivatives we can find the direction in which the loss function decreases the largest this is basically
the gradient of the loss function with regard to the output of the previous model step four splitting the nodes to
split the observations, XGBoost uses an approximate greedy algorithm with approximate weighted quantiles (usually quantiles that have a similar sum of weights) for finding the split value of the nodes; it doesn't consider all the candidate thresholds, but instead it uses the quantiles of that predictor only.
optimal learning rate can be determined by using cross validation and grid search Imagine you have a data set containing information about various houses and their prices the data set
includes features like the number of bedrooms bathrooms the total area the year built and so on and you want to predict the price of a house based on these features let's dissect the
provided code step by step model initialization and training the code starts by importing the XG boost Library import XG boost as xgb XG boost is a
powerful implementation of gradient boosting machines next an XG boost regressor model model xgb is initialized
with specific parameters: the objective is set to reg:squarederror, indicating that the model aims to minimize the mean squared error loss function. Additionally,
the number of estimators trees is set to 100 and a seed value of 42 is specified for reproducibility the initialized model is then trained using the training
data X_train and y_train. XGBoost builds an ensemble of decision trees sequentially, optimizing a specified loss function. Predictions: after training, the trained model (model_xgb) is used to make predictions on the test data X_test; the predict method is applied to the model, and the predicted house prices are stored in the variable predictions. Model evaluation: the code proceeds to evaluate the model's performance using two common metrics, mean squared error (MSE) and root mean squared error (RMSE); these metrics quantify the average squared difference between the actual house prices (y_test) and the predicted prices (predictions), and lower values indicate better model performance. The calculated MSE and RMSE are then
printed to the console using formatted strings result visualization lastly the code generates a scatter plot to visually compare the actual house prices
(y_test) with the predicted prices (predictions). The scatter plot displays the actual prices on the x-axis and the predicted prices on the y-axis; additionally, a diagonal dashed line is drawn to represent perfect predictions where actual prices equal predicted prices. This visualization aids in assessing the model's accuracy by examining how closely the predicted prices align with the actual prices, providing insights into the model's performance.
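As a minimal sketch of the XGBoost workflow just described (requires the xgboost package; the housing features below are synthetic placeholders, not the course's dataset):

```python
# Sketch of XGBoost regression on synthetic housing data.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = np.column_stack([rng.integers(1, 6, 1000),       # bedrooms
                     rng.integers(1, 4, 1000),       # bathrooms
                     rng.uniform(50, 300, 1000)])    # area
y = 30 * X[:, 0] + 20 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 15, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model_xgb = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, random_state=42)
model_xgb.fit(X_train, y_train)

predictions = model_xgb.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.2f}, RMSE: {np.sqrt(mse):.2f}")
```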
Now let's compare AdaBoost, GBM, and XGBoost; here's how they compare. AdaBoost: the AdaBoost plot shows
predictions that tend to underestimate the actual prices especially as the values increase this is evident from the larger number of points lying below the
diagonal line in the higher price ranges the overall fit to the diagonal is less tight suggesting higher prediction errors or bias particularly for higher
priced houses GBM the GBM model produces predictions that are generally closer to the diagonal across all price ranges this indicates better accuracy and
consistency in predicting both lower and higher priced houses compared to adaboost the points in the GBM plot are more tightly clustered around the diagonal indicating lower variance in
prediction errors XG boost the XG boost plot also shows a tight clustering of points
around the diagonal similar to GBM this suggests a high level of accuracy unlike GBM the XG boost plot seems to slightly overestimate the lowest priced houses
while matching or slightly underestimating the highest priced houses but still maintains a close adherence to the diagonal
line summary of comparison accuracy and consistency both GBM and XG boost exhibit high accuracy with their predictions closely clustered around the
diagonal line showing their effectiveness in both lower and higher price predictions Ada boost however shows more variance and a tendency to
underestimate, particularly at higher price points. Performance at different price ranges: GBM and XGBoost handle extremes in prices better than AdaBoost, which struggles with underestimation as prices increase. General predictive performance: XGBoost and GBM are quite comparable, with slight
differences in how they handle the very low and very high ends of the price spectrum; AdaBoost appears to be less reliable, especially for higher-priced properties. From these observations, GBM and XGBoost seem more suitable for scenarios where precise and consistent predictions across a wide range of house prices are critical; AdaBoost might be more prone to prediction errors, particularly in higher price brackets. Hi, I'm Vah, and in this project
we will learn how to understand your customers better track sales patterns and show those results if you like working with data or own the store this video will show you how to use
information to make better choices and get better results. You will split your customers into smaller groups based on how they shop; this helps you send the right messages to the right people and give them offers they will like. Loyal customers are the best: you will use data to find your biggest supporters and those who are ready to spend more, and then you can reward your best customers with programs that fit their shopping habits; this makes them happy and stops them from going to other stores. We will use data to guess what people will buy and when they will buy it; you will find sales patterns among different items and figure out what cool new products people will want. This lets you always have the right stuff at the right time: you won't have too many items, everything will sell, and customers will be surprised by how well you know what they need. We'll look at how sales change throughout the year; this helps you plan for busy times and slowdowns early, and know exactly when to have big
sales we will use location data and what people say about you to find places where sales are going well and where you could grow you will even show it all on the map this helps you spend your
advertising money wisely find great spots for new stores and even choose the perfect things to sell in each place so let's get
started. All right, let's now go over the dataset I will be using. We are using the Superstore sales dataset, and it has 9,800 rows and the columns order ID, order date, ship date, ship mode (standard class, second class, or other classes), the customer ID together with the customer name, and the segment, meaning who bought the product, whether a consumer, a corporate, or a home office. The clients mainly come from the United States, and the city of the United States they come from is also specified. So we shall import this into our Google Colab and start working on it. Okay, so let's now import the necessary Python libraries: we import pandas as pd, we also import numpy as np, matplotlib, and seaborn as sns, along with a couple of other utilities we'll be using. Perfect. So this is how it looked on Kaggle, and this is also how it looks once we have imported it. Let's now look at the data frame's info: everything seems to be consistent except the postal code; it seems that 11 postal codes are missing. Okay, so what we can do is fill in those null values. As you can see, we have replaced the null postal codes, for customers that didn't have any postal code, and we have put a zero inside. All right, so let's now move on to checking for duplicates: if df.duplicated().sum() is greater than zero we print that duplicates exist, and if not we print that no duplicates were found. All right, as you can see, there exist no duplicates.
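A rough sketch of these loading and cleaning steps might look like the following; the file name superstore.csv and the column name Postal Code are assumptions, since the exact notebook isn't shown on screen here.

```python
# Sketch of the loading and cleaning steps narrated above.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("superstore.csv")    # assumed file name
df.info()                             # spot the ~11 missing postal codes

df["Postal Code"] = df["Postal Code"].fillna(0)   # fill missing postal codes with 0

if df.duplicated().sum() > 0:
    print("duplicates exist")
else:
    print("no duplicates found")
```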
Let's now move on to customer segmentation. Let's first create a variable for the types of customers and extract the column called segment out of our data frame. As you can see from our data frame, we have a column named segment; this segment column lists the types of customers. In our data frame we
have both consumer and corporate customers so let's get started with customer segmentation the main problem is that many large businesses struggle to understand the contribution and
importance of their various customer segments they often lack precise information about their main buyers relying on intuition rather than data
this leads to misallocation of resources resulting in Revenue loss and decreased customer satisfaction for example if your store primarily caters to Consumers
it's crucial to tailor your marketing and customer satisfaction efforts to resonate with their needs and preferences by focusing your resources on understanding and Catering to your consumer base you can avoid
misallocating resources to large corporates. This ensures you're providing a satisfying customer experience for your primary demographic, ultimately leading to increased customer loyalty and revenue growth. We can also create a pie chart or bar chart from it to clearly illustrate the revenue contribution of each customer segment, and this will allow us to tailor more of our marketing and customer satisfaction resources accordingly. Once you've completed customer segmentation, the next step depends on your strategic goals. Here are a few ways to proceed. Focus on your most valuable segment: if your existing customer segmentation reveals a particularly profitable segment, such as consumers, tailor your marketing, product offerings, and customer service to deepen your engagement with that group. Target new segments: if you want to attract more corporates or home offices, you'll need to understand their unique needs and pain points; start by researching these segments, what are their challenges, what solutions would appeal to them, then develop tailored messaging and consider offering specialized products or
services to attract these new customer types. All right, so let's get started. This extracts the types of customers from the data frame: perfect, so it's consumer, corporate, and home office; those are all the segment values in our data frame. All right, so let's count the unique values in our segment column, and we will store this as number of customers. What this does is count the unique values in our segment column and reset the index to turn them into a column, and then we can rename the columns: we want to give our segment column a name like total customers or type of customer; I will go with type of customer, so we say number of customers is equal to number of customers dot rename, and the column that is named segment we rename to type of customer. Now if you print number of customers, there are 5,101 consumers, 2,953 corporate buyers, and 1,746 home offices. And if you want to create a pie chart out of this, we can plot it by saying plt.pie on the number of customers, base the pie chart on the counts, and label it with the customer types. Perfect. All right, so as you can see we renamed the columns to type of customer and total customers, and you can see from this pie chart that our main segment is consumers with 52%, 30% of our orders come from corporates, and 18% from home offices.
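A short sketch of the counting and pie chart steps, assuming the column is called Segment and the df from the loading sketch earlier:

```python
# Sketch: count orders per customer segment and show the breakdown as a pie chart.
import matplotlib.pyplot as plt

segment_counts = df["Segment"].value_counts()    # Series: index = type of customer, values = counts
print(segment_counts)

plt.pie(segment_counts.values, labels=segment_counts.index, autopct="%1.0f%%")
plt.title("Orders by customer segment")
plt.show()
```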
You can see who we have to focus on, which is consumers. While consumers hold the majority, focusing solely on them overlooks significant potential within the
corporate and home office segments let's explore how to balance resource allocation for all three segments to maximize growth to gain even deeper
insights we should integrate our customer data with sales figures this analysis will help us identify which segments generate the most Revenue per
customer average order value and overall profitability customer lifetime value additionally we can segment customers by
purchase frequency and basket size to understand their buying Behavior within each segment here are some additional questions to consider for a more
comprehensive analysis customer acquisition cost CAC how much does it cost to acquire a customer in each segment
customer satisfaction how satisfied are customers in each segment churn rate what is the rate at which customers leave in each segment by analyzing these
factors alongside revenue and customer lifetime value we can create a customer segmentation model that prioritizes
segments based on their overall value and growth potential. We can also plot a bar graph for the total sales for each customer type: we group the data by the segment column and calculate the total sales for each segment. Right now you don't see the exact sales numbers; with the bar chart you can see the exact sales numbers for each customer type, so let's plot it.
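A minimal sketch of that aggregation and bar chart, assuming columns named Segment and Sales:

```python
# Sketch: total sales per customer segment shown as a bar chart.
import matplotlib.pyplot as plt

sales_per_segment = df.groupby("Segment")["Sales"].sum().reset_index()
print(sales_per_segment)

plt.bar(sales_per_segment["Segment"], sales_per_segment["Sales"])
plt.ylabel("Total sales")
plt.show()
```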
So there are around 1.2 million from our consumers and we
have around 600 or 700 thousand from corporates. Now we can also plot a bar chart from this, with type of customer on one axis and sales per segment on the other. This bar chart effectively illustrates the distribution of sales across our customer segments: consumers account for the largest portion of sales (1.2 million), followed by corporates (1.0 million) and home offices (0.8 million). While the chart is clear, a deeper analysis can help us optimize our marketing efforts. Customer lifetime value (CLTV): calculate the CLTV of each segment to identify which segments
generate the most Revenue over time this will help prioritize customer segments for marketing efforts for example if you
find that the home office segment has a higher CLTV than the Consumer segment you may want to invest more resources in marketing campaigns targeting home
office customers. Market research: conduct market research to understand the specific needs and preferences of each customer segment; this will inform the
development of targeted marketing campaigns for instance you might discover that consumers in your data are price sensitive while corporate
customers are more interested in bulk discounts and reliable service you can use this knowledge to tailor your marketing messages to each segment
average order value analyze average order value by segment to identify opportunities to increase Revenue per
customer let's say your analysis reveals that corporate customers have a higher average order value than consumers you could develop marketing campaigns that
encourage consumers to purchase bundles or higher pric products to increase their average order value customer acquisition cost CAC how much does it
cost to acquire a customer in each segment knowing CAC can help determine the return on investment Roi for
marketing efforts here's an example let's say it cost $100 to acquire a new corporate customer but only $20 to acquire a new consumer customer if the
CLTV (customer lifetime value) of a corporate customer is significantly higher than the CLTV of a consumer customer, then spending $100 to acquire a
corporate customer may still be profitable however if the CLTV of the corporate customer is only slightly higher than the CLTV of the consumer
customer you may want to focus your marketing efforts on acquiring more consumers because the cost of acquisition is much lower customer
satisfaction how satisfied are customers in each segment understanding satisfaction levels can help identify
areas for improvement and reduce churn here's an example you can conduct surveys or collect customer feedback to understand satisfaction levels if you
find that corporate customers are less satisfied than consumer customers you may want to investigate the reasons for their dissatisfaction and make changes to
improve their experience. This could involve improving your customer service, offering more competitive pricing for corporate customers, or developing
products or services that better meet the needs of corporate customers we can also create a pie chart
for our sales, which you can do with plt.pie on the total sales per segment, naming the labels by type of customer. 51% of our sales come from our consumers, 30% from our corporates, and 19% from home offices. All right, so let's now move on to customer loyalty. As a business,
you want to make sure that your most loyal customers stay happy this will make sure that those customers keep on coming back keep on bringing new people and also placing new
orders so you will decrease the cost on acquisition of new customers because there will be already existing customers
and you'll also be able to make sure that your Revenue either stay at the same level or increases by keeping your
most loyal customers happy and you want to do that as a business now we can do this by either the following ways we can
rank the most loyal customers by the number of orders they have placed or by the total amount they have spent. Say you have analyzed your data, pinpointing your 30 most loyal customers; this represents a significant opportunity to strengthen these relationships and maximize their lifetime value. Here's a powerful approach: design a targeted email specifically for these high-value segments, proactively offering personalized support with inquiries such as "how can we assist you today?". This demonstrates your commitment to their success, proactively addressing potential issues and fostering a deep sense of loyalty. Loyalty programs: consider a tiered loyalty program that offers exclusive rewards tailored to your most valuable customers; this could include early
access to new products personalized discounts or even point-based reward systems personalized experiences leverage your Data Insights to go beyond
email consider personalized website recommendations targeted promotions based on past purchase history or even handwritten thank you notes for high
value customers customer feedback loops make sure your top customers feel heard Implement surveys or invite them to participate in exclusive focus groups
this demonstrates you value their input and are actively using feedback to improve the customer experience Community Building depending on your
business model fostering a community among your most loyal customers can create a sense of belonging this could involve access to online forums
exclusive events or opportunities to network with like-minded individuals now this strategy extends Beyond customer satisfaction
prioritizing the experience of your top customers directly correlates with increased retention, positive referrals, and ultimately improved revenue. Now let's dive deeper and see who our most loyal customers are. All right, so let's
now get started with that. Let's first display the first three rows of our data frame. As you can see, there is a column called sales, and each customer has a specific ID with a specific name, so if you count the number of times a customer ID shows up, you also have that customer's total number of orders, which you can then use later however you want. So let's start with doing that. Now let's rename the columns: we want the column order ID, which now holds the order counts, to be named total orders, so we rename the column that is equal to order ID to total orders, with inplace set to True. Okay, so now let's identify the repeat customers, the customers with an order frequency greater than one: repeat customers equals the rows of customer order frequency where total orders is greater than one. Perfect. Now we want to organize this by sorting, and we can do that by saying repeat customers sorted equals repeat customers dot sort values. Perfect, now let's print this out: print repeat customers sorted dot head 12, to display our top 12 customers, with the index reset.
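As a rough sketch of this repeat-customer analysis (the column names Order ID, Customer ID, Customer Name, and Segment are assumptions about the dataset):

```python
# Sketch: count orders per customer, keep repeat customers, and rank them.
customer_order_frequency = (df.groupby(["Customer ID", "Customer Name", "Segment"])["Order ID"]
                              .count()           # counts order lines per customer
                              .reset_index()
                              .rename(columns={"Order ID": "Total Orders"}))

repeat_customers = customer_order_frequency[customer_order_frequency["Total Orders"] > 1]
repeat_customers_sorted = repeat_customers.sort_values(by="Total Orders", ascending=False)
print(repeat_customers_sorted.head(12).reset_index(drop=True))
```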
So the customer with the name William Brown, who is a consumer, has placed a total of 35 orders. This is the list of your top customers, and as a business or superstore you can decide exactly the number of total orders a person or a business has to place in order to be considered a loyal customer, and then according to that you can tailor your services. Now, the data clearly reveals that a small group of customers place orders with considerably higher frequency (30-plus): we have William Brown with 35 orders, another home office customer with 34, and many consumers and one corporate with 32. So it shows clearly that we have a loyal group of customers. There's also significant potential for our home
office segment several of our most loyal customers belong to the home office segment now this implies that the home office segment has a strong potential for customer loyalty and deserves
targeted marketing efforts. It also shows that we don't have just one group of loyal customers; we have home offices, consumers, and corporates.
while there are many consumers it doesn't mean that we have to focus on one segment it means that we still have to devise a plan that caters to our
multiple segments. So, some recommendations: we can prioritize loyal customers, segment customers by order frequency, and develop exclusive offers, rewards, or early-access programs tailored to these customers; for example, we can provide them exclusive discounts, reward programs, and earlier access. We can also target more home offices, because we see that home offices keep coming back, and we are able to satisfy quite a few of them; that means we have catered to their needs and provided a good enough service for them to keep coming back, which means our product is great for home offices and we can target more home offices using content marketing, social media ads, or other types of marketing strategies. We can also analyze the way we provided service to these customers, because it worked out pretty well, and if we provide this kind of service to newly arriving customers, then we increase the chance that they also become loyal customers. So those are
several conclusions we can make. Now, we can also identify loyal customers by sales. So far we identified them by the total number of orders they have placed, but we can also use the amount of sales, the total amount spent, to identify them, because a person can come and place 35 orders, but if they place 35 one-dollar orders, then obviously that's just $35; the order count doesn't say anything about the sales amount. So ideally you also want to organize it by the sales amount, to be able to identify the actual top-spending and loyal customers. That said, when there is a significant customer, let's say someone has spent around 25,000, that can also be done in one order, so it doesn't mean that it's a repeat customer; it's just a top spender.
Now let's start with identifying our top-spending customers. Let's first create a variable customer sales equal to the data frame grouped by customer ID; we also want to see the customer name and what type of customer they are (the segment), and we want to sum up the sales for each of them and reset the index. Now let's get our top spenders by ranking them in descending order, meaning our top spenders will be ranked all the way at the top: customer sales dot sort values by sales.
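As a minimal sketch of that ranking (column names again assumed):

```python
# Sketch: total spend per customer, ranked from highest to lowest.
customer_sales = (df.groupby(["Customer ID", "Customer Name", "Segment"])["Sales"]
                    .sum().reset_index())
top_spenders = customer_sales.sort_values(by="Sales", ascending=False)
print(top_spenders.head(10))
```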
So Sean Miller has spent the most; he is from a home office, with a total amount of about 25,000 USD. William Brown has placed the most orders, 35, but William Brown is nowhere to be found here, and likewise Sean Miller, the customer who has spent the most in our superstore, was nowhere to be found among the top repeat customers, meaning that repeat orders don't really define spending habits. It depends on the way you run a superstore: obviously I would want our customers to come back, but I would dedicate my resources to the customers who spend the most, because those are the customers who bring the most business to me, meaning those are the customers I have to keep happy. So the total number of orders is useful, but it doesn't really say that much about a customer's spending habits and their value to your store. All right, let's now go over to the next chapter, which is shipping.
now as a Superstore you also want to know what shipping methods customers prefer and which are the most cost effective and
reliable and overall knowing this impacts your customer satisfaction and also meaning it also has great impact on
your revenue. So, for example, Amazon has multiple shipping methods, but it has one most popular shipping method, which keeps the most customers happy and also makes Amazon the most money. So as a superstore you want to know which one of your shipping methods is the most
reliable. So for our shipping modes, let's create a variable: we take the ship mode column of the data frame, count those values, and of course reset the index. Our standard class is by far the most popular; it's almost four times more popular than the next shipping mode, with first class, second class, and same day making up the rest. So let's create a pie chart of this with plt.pie on the shipping modes. All right, so these are the shipping methods: the most popular one is standard class, which around 60% of the orders use, and the rest is around 40%. As a superstore, or as any store, you invest in your shipping, so you end up buying some kind of deals with delivery companies like DHL, UPS, and others, and sometimes you end up recommending the wrong option to your customer. Let's say second class is fast, but it ends up costing the customer way too much; the customer ends up not buying your product, and this decreases revenue for your store. But if you know that standard class is the most popular option, then you can have a button saying "this is our most popular option", which is standard class, and most of the time people choose the most popular option. This will help the superstore save the cost of investing in the other options, or dedicate resources according to the amount of business each class brings, and it also allows the superstore to recommend its most popular option, which is standard class.
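A short sketch of the shipping-mode breakdown, assuming the column is called Ship Mode:

```python
# Sketch: count orders per shipping mode and show the shares as a pie chart.
import matplotlib.pyplot as plt

shipping_mode = df["Ship Mode"].value_counts()    # Series: index = ship mode, values = counts
print(shipping_mode)

plt.pie(shipping_mode.values, labels=shipping_mode.index, autopct="%1.0f%%")
plt.title("Orders by shipping mode")
plt.show()
```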
The problem that many superstores have is that they have stores in many locations, in many states, but they don't know how well each one is performing. On a dashboard, for example, you could track that, but without it they have no idea how well each of the stores in each state is performing, leaving them clueless about where there is underperformance or, for example, where there is a high-potential area in which they can open a new store.
so let's move on to this chapter which is geographical analysis so many stores have hard time in identifying high potential areas or also identifying
stores that are underperforming so things like Walmart Target they have like many branches and
they they will want to know how well each branch is doing and the perfect way to do this is by counting up the number
of sales for each City the number of sales for each state and then this will allow you to see which of the states or which of the cities is performing the best and which of them is performing the
least, and dedicate your resources accordingly. So let's say a store in one city is simply losing money, for years or more; then you will want to adjust your strategy according to that, so maybe you will want to close this store or adjust it in a way so that it starts bringing in more profit or revenue. Well, so let's get started with that.
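A rough sketch of the geographic breakdown narrated next, assuming columns named State, City, and Sales:

```python
# Sketch: order counts per state and city, plus sales per state.
orders_per_state = df["State"].value_counts()
print(orders_per_state.head(20))     # e.g. California at the top

orders_per_city = df["City"].value_counts()
print(orders_per_city.head(15))      # e.g. New York at the top

sales_per_state = df.groupby("State")["Sales"].sum().sort_values(ascending=False)
print(sales_per_state.head(10))
```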
All right, so as you can see, the most popular state is California and the least popular of our top 20 is New Jersey. Maybe you can go over this list and, in a few of the states where there's still high potential for a profitable store, such as Washington, take a closer look. From this you can see that Washington is performing fourth, while New Jersey is performing the least of our top 20, so you can conclude that you might have to work on New Jersey more to increase the order count, which also allows you to increase your revenue; or you can see that California is your most popular state, so you might want to keep California happy. You can also do it per city: we take the city column of the data frame, count the values, reset the index, and print the top 15 cities. The most popular city is New York, with an order count of 891, then Los Angeles, and Jackson is the least popular of our top 15; you can also increase this to the top 25. So not only can you focus on the states, but for each state you can also focus on the cities that are underperforming or overperforming. This allows you to dedicate your resources to the cities you want, maybe to increase your revenue or your potential, or maybe there is a city, for example Long Beach, where there's high potential but you're not using any of your
resources. Now we can also organize it as sales per state, let's say state sales: previously we did it by order count, and we can also do it by state sales, where we sum up the sales, reset the index, and rank them in descending order. Perfect. As you can see, our most popular state is still California, then New York, and then the order changes slightly, with Texas next; this is the popularity of the states according to the sales amount. Let's also sort it per city. The most popular cities by sales are New York, Los Angeles, and San Francisco; this is almost exactly the same as our previous analysis on the cities:
nothing has really changed. All right, as a store you want to be able to track your most popular category of products, your bestselling products, and sales performance across categories and sub-categories: find the sweet spots where strong categories also have top-selling sub-categories, spot weaker sub-categories inside otherwise strong categories that might need improvement, and watch product popularity fluctuate. Seeing whether popularity trends up and down seasonally helps you forecast future demand, and you can also group by location, since each location might have a different popular product that you want to place in a certain spot to maximize your store revenue. So let's get started with finding our top performing products and categories. First, let's extract the product categories from our data frame with unique. Right now we have only three sorts of products: each row has a category and a sub-category (cases, chairs, and so on), but there are mainly three categories, which are Furniture, Office Supplies and Technology. Now let's look at the types of sub-categories per product; printing the product sub-categories shows Chairs and a bunch of others. Next, let's group the data by product category and see how many sub-categories each one has; for example, Office Supplies might have twenty sub-categories and Furniture might have five, so let's see how many sub-categories each one actually has.
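A minimal sketch of those two steps, continuing with `df` and assuming the columns are named 'Category' and 'Sub-Category' as in the common Superstore file:

```python
# the main product categories and their sub-categories
print(df["Category"].unique())
print(df["Sub-Category"].unique())

# how many distinct sub-categories each category contains
print(df.groupby("Category")["Sub-Category"].nunique())
```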
So there are nine sub-categories for Office Supplies, four for Furniture, and four for Technology, which makes Office Supplies the most varied category. Now we can also find our top performing sub-categories: take the sub-category column, sum up the sales, and group by sub-category.
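A sketch of that aggregation, under the same column-name assumptions:

```python
# total sales per sub-category, largest first
subcat_sales = (df.groupby("Sub-Category")["Sales"].sum()
                  .sort_values(ascending=False))
print(subcat_sales)
```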
Our most popular sub-category is in Technology, specifically Phones, which has the highest amount of sales, followed by Chairs in Furniture and Storage in Office Supplies. From this you can see your most popular sub-categories and decide which ones to recommend on a front page or in the store. Now let's see which of our main categories has the most sales by grouping by product category. As expected, Technology is the most popular one, then Furniture, then Office Supplies, so maybe inside your store you give it a somewhat larger department, or place it in the first row right in front of the customers, to present your most popular option immediately. This will of course allow you to increase
your revenue and sales. If you want to create a pie chart for this, you can take the top product categories organized by sales and use the category names as labels. It turns out Technology performs a little better than the other two, but the difference is not that large. All right, now let's see which of our sub-categories is the most popular one. Remember that we already computed which sub-category had the most sales; now let's turn it into a bar graph by sorting the sub-category sales in descending order and plotting the sales per sub-category as bars.
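The spoken code is hard to follow here, so this is a hedged sketch of both plots with pandas and matplotlib (column names as assumed above):

```python
import matplotlib.pyplot as plt

# pie chart of sales by main category
category_sales = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
category_sales.plot(kind="pie", autopct="%1.1f%%")
plt.title("Sales by Category")
plt.ylabel("")
plt.show()

# bar graph of sales by sub-category, sorted descending
subcat_sales = df.groupby("Sub-Category")["Sales"].sum().sort_values(ascending=False)
subcat_sales.plot(kind="bar")
plt.title("Sales by Sub-Category")
plt.ylabel("Sales")
plt.show()
```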
This confirms that our most popular options are Phones and Chairs. Since these generate the most sales, customers are clearly willing to pay for them, so you could spend more of your marketing resources on phones and chairs: the resources you have already put into them are working, which suggests that if you increase that spend, your sales will increase accordingly. You can also conclude that Art, Envelopes and Labels aren't that popular, so maybe you give a discount now to clear them out and buy fewer of them in the future, which frees you to buy more of the popular options like phones and chairs. Or you can investigate why they are not popular: maybe these are simply bad envelopes, or the wrong kind of art that people don't like, and if you chose a completely different assortment customers might end up buying it. This shows exactly how stores can use this data to optimize their sales and how their resources are allocated, so they end up making more money and more sales.
Businesses love making sales; they love seeing revenue and profits increase. It's all lovely, but you should be able to track your sales so that you can see what situation you are in and adjust to it, and what better way than a pie chart, a bar graph, or a simple line graph to see how much growth or decline you're experiencing? For example, if revenue is declining year over year or month over month, you can see there is a problem and allocate resources toward fixing it, whether that is investing more in customer preferences, putting more money into marketing, improving whatever makes your customers more satisfied, or adopting new technologies. Those are all things you can do when you see declining revenue, but first you must be able to see it coming. Businesses also struggle with unstable growth: they may grow one month, and the next month there is no growth or even a decline, so as a business you want to see that in order to stabilize the growth and keep growing continuously. There are also missed seasonal opportunities: if a business isn't aware of how sales change throughout the year, it could miss out on maximizing profits during big seasons; maybe in some seasons a certain product is high in demand, but you don't have enough stock to cover it, so you end up unable to meet the demand and lose out on revenue and profits. Those issues concern yearly sales trends; there are also problems around quarterly and monthly sales, for example
cash flow issues. Many businesses experience cash flow issues: maybe one day they look at their bank account, see that they are out of money, and cannot invest more in their business. There is also inventory imbalance and ineffective marketing. With a cash flow issue, drastic dips in sales during specific quarters or months can lead to cash crunches, making it hard to pay suppliers, employees, or ongoing expenses. With inventory imbalance, in some periods you are overstocked and have to give items away, while in other periods you are understocked and unable to meet the demand. And maybe your marketing is ineffective: if you spend a significant amount of time and money on marketing and don't reach your desired outcome, there is a major issue with your campaign, and you can see that in the sales you're making; for example, if you increase marketing spend and there is no significant increase in sales, you're doing something wrong. There is also lagging response to emerging trends: monthly sales data can highlight new trends or drops in demand much more quickly than yearly overviews, so you can react much faster. For example, a certain product released in 2024 is suddenly in high demand in many countries; you want to adjust to that demand and get supplies for the product, but you cannot do that if you only track yearly sales or don't track at all. Those are all the problems that exist if you are not able to track your sales, be it monthly, quarterly, or yearly, and we intend to solve that problem by graphing the sales and drawing conclusions from the results.
All right, let's get started. First, convert the order date column to datetime format: set the order date column equal to pd.to_datetime of itself, with dayfirst equal to True. Then group the data by year and calculate the total sales amount for each year: create a yearly sales variable, group by the year of the order date, and sum the sales. Reset the index and give the columns appropriate names, because right now the grouping column is still called order date; it should be named Year, and the sales column should be named Total Sales. Now let's print this out to see the total amount of sales for each year, and we can also plot a bar graph of it.
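A minimal sketch of those steps (the 'Order Date' and 'Sales' column names and the day-first date format are assumptions based on the narration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# parse the order date and total the sales per year
df["Order Date"] = pd.to_datetime(df["Order Date"], dayfirst=True)
yearly_sales = df.groupby(df["Order Date"].dt.year)["Sales"].sum().reset_index()
yearly_sales.columns = ["Year", "Total Sales"]
print(yearly_sales)

# bar graph of total sales per year (swap kind="line" for the line version used later)
yearly_sales.plot(kind="bar", x="Year", y="Total Sales", legend=False)
plt.ylabel("Total Sales")
plt.show()
```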
All right, from this bar graph a few conclusions can be made; for
example, there is steady growth from 2016 to 2018, which might be explained by effective new product launches, economic factors, or marketing efforts. Those are all explanations a person could offer, but you can only confirm such conclusions when you have more data available, and in this data frame we don't have marketing costs or any other costs, so our conclusions are pretty limited. What we can see is that this bar graph, combined with any other chart, for example marketing cost, would let a business draw a good number of conclusions. We can also plot the same totals as a normal line graph of year against total sales, which reads a little differently; I actually prefer this sort of graph over a bar graph for tracking yearly sales, because it shows the amount of increase or decrease much more clearly. We can also focus on the quarterly sales, like I said, to be able to react quickly to emerging trends or to any kind of change. Now let's
again convert the order date column to datetime format and aggregate the total sales per quarter.
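A hedged sketch of the quarterly aggregation (continuing from the sketch above, with 'Order Date' already parsed to datetime):

```python
# total sales per quarter
quarterly_sales = (df.set_index("Order Date")["Sales"]
                     .resample("Q").sum())

quarterly_sales.plot(marker="o")
plt.title("Quarterly Sales")
plt.ylabel("Sales")
plt.show()
```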
From our quarterly sales we can see a steady increase, and then, all of a sudden, around July it blows up to new heights. From this graph we can see that Q3 and Q4 did very well and Q1 and Q2 didn't, so something might have changed: a seasonal trend, increased marketing, a new product, or targeting a specific customer segment, and as a business it's really important to know which. You can also expect that if you follow a similar line of actions you might again see higher demand for your products in Q3 and Q4, so you might want to overstock, or you can analyze this further and replicate the successful strategies in future quarters; this helps the business grow steadily and increase its revenue, which is good. For Q1 you can see the year starts out pretty slowly; businesses may want to start the year more quickly, so they can investigate that too. For example, is this seasonal for the industry, with certain products simply not in high demand in certain seasons? Did a competitor run some kind of marketing, or use some strategy to drive more customers to them? Or did we change our marketing effort in Q3 while our marketing was not productive in Q1 and Q2? So maybe you want to investigate this
much more deeply, not quarterly but monthly. Let's do that now: we start the same way, converting the order date column to datetime format and then converting the dates to monthly periods before summing the sales.
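A minimal sketch of the monthly aggregation under the same column-name assumptions:

```python
# total sales per month
monthly_sales = df.groupby(df["Order Date"].dt.to_period("M"))["Sales"].sum()

monthly_sales.plot(marker="o", figsize=(12, 5))
plt.title("Monthly Sales")
plt.ylabel("Sales")
plt.show()
```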
All right, from this chart you can see
that it's growing month over month, apart from the first month of 2019 and the third month of 2018. This is generally an upward trend, which suggests healthy sales, and it looks like August and December might be your seasonal peaks, so you might want to overstock products then. You can also see seasonal dips in the third month of 2018 and 2019 and in November of 2018, so you might consider seasonal promotions to stimulate off-season sales, or diversify your product and service offerings to reduce the reliance on seasonal demand; maybe you want to deploy new marketing strategies there, or target new customer segments by introducing new products so that you offset the seasonal trends. Overall it seems pretty consistent, so there appears to be a healthy sales trend, and you might want to invest more in your proven strategies, which might be a certain marketing tactic, promotion, or product offering for a certain month; and, like I said, for the dips the store might try deploying new marketing strategies or introducing new products to target new customer segments, which might offset the dip. That said, it's important to consider the time frame: one year certainly holds a fair amount of data, but it does not reveal seasonal patterns very accurately, because what is true for one year might be the complete opposite in another. So you might want to look at a longer sales line graph, maybe five years; if the pattern holds, that suggests a genuine seasonal trend, and if not, you act accordingly. All right, we have covered the sales trends, so we can
move on to the next chapter, which is mapping. We want to create a map of sales per state: each state is colored according to its amount of sales, so a state with a high amount of sales should be colored yellow and one with little or no sales should be colored blue. The question is, why would someone want to do this? Now,
companies looking to expand into new geographic areas face the challenge of identifying the most promising states and regions for their products or services. For example, how do you know whether your product will sell in a certain state? One tactic people often use is to check whether a similar store is already operating in that state or city; if there is, and the market is not saturated, meaning there is still a substantial number of people who might buy your product, then it's a good idea to go there. For example, a company that manufactures athletic apparel is considering expanding its retail footprint; by analyzing total sales data by US state, it can see that states with a high concentration of fitness centers and an active population, such as California, Texas, or Florida, might be good candidates for new stores, and if there are currently no sports stores there, even better. Or say you're a business that wants to strategically allocate its marketing budget and sales team: you have stores all over the states and want to optimize for each one; maybe one state is performing well and another is not, so from the map you can see which state is underperforming and allocate your resources accordingly, maximizing your return on investment by optimizing certain strategies. But if you don't know which state is performing well and which is not, you have no information on where you have to optimize. For example, a national pizza chain wants to optimize its marketing spend; sales data reveals that its pizzerias in the Midwest consistently outperform those on the West Coast, which suggests it might need to allocate more marketing budget to increase brand awareness and sales in the western states. You might also want to do this for competitor analysis: staying ahead of the competition means understanding where your competitors are having the most success, and analyzing their sales patterns across states can reveal their geographic strengths and weaknesses. For example, a coffee roasting company notices that a competing coffee brand is experiencing high sales in the Pacific Northwest states; this could indicate that the competitor has established strong partnerships with local grocery stores or run a lot of successful marketing campaigns in that region, and the company can use this information to target similar grocery stores or develop competitive marketing strategies for the
Pacific Northwest. So, without further ado, let's get started. This time, instead of writing the code live, I will just walk you through it. First we import the plotting library, Plotly, and initialize it in the Jupyter notebook. We also build the mapping for all 50 states so we can add an abbreviation column to the data frame, because the state codes aren't in it yet. Then we calculate the amount of sales for each state by grouping by state, which is exactly what we need, add the abbreviations to that sum of sales, and finally plot it. This is how the map looks:
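The exact code is only shown on screen, so here is a hedged equivalent using plotly.express (the abbreviation dictionary is truncated to a few states for brevity; extend it to all 50):

```python
import plotly.express as px

# map full state names to two-letter codes (only a few shown here)
state_abbrev = {"California": "CA", "Texas": "TX", "New York": "NY",
                "Florida": "FL", "Washington": "WA", "New Jersey": "NJ",
                "North Dakota": "ND"}

state_sales = df.groupby("State")["Sales"].sum().reset_index()
state_sales["Abbrev"] = state_sales["State"].map(state_abbrev)

# Viridis colour scale: blue = low sales, yellow = high sales
fig = px.choropleth(state_sales, locations="Abbrev", locationmode="USA-states",
                    color="Sales", scope="usa",
                    color_continuous_scale="Viridis")
fig.show()
```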
The blue areas are the ones with a low amount of sales, and the yellow area, which is California, has a high amount of sales. From this map you can see where your main sales come from and optimize accordingly. Say you have a chain of pizza stores and want to see which of your states is performing the best and which the poorest, so you can spend most of your energy fixing what doesn't work: from this you can see California is doing great, so you can leave it alone, but Texas, for example, is not performing that well, so you might allocate more marketing budget or other resources there to start getting more sales, because in Texas there are still plenty of people who eat pizza but they are not buying from you, so why is that? You can also read it the other way around: suppose this is a completely different business, say a retail store for sporting goods, and all these states already have one of your stores. You can conclude that California is performing really well, so it's probably not a good idea to open yet another store there, since that market may already be saturated; but you could go to Florida, for example, and start selling similar sporting goods there, because that market is still relatively new and not as saturated. All right, that was that. We can also create a bar graph out of it, and from it you can see most of
the total sales per state: California is doing the best and North Dakota is doing the worst. And remember that we previously showed how large each of our categories is, and did the same for our sub-categories, but we never put them in the same plot. Here we display our main product categories, Furniture, Office Supplies and Technology, and within each category its sub-categories, everything sized and organized by sales. Chairs sell the most in our Furniture category, then Tables, with smaller blocks for Bookcases and Furnishings; in Office Supplies, Storage performs the best while Envelopes and Labels perform the worst; and in our Technology category, Phones perform the best, followed by Machines, Accessories and Copiers. From this you can see that Phones is overall the best sub-category, even a little larger than Chairs. This is a much better way to display the data if you're trying to make an argument, and of course you can also do it this way.
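The code for this nested category/sub-category plot isn't shown in the transcript; one way to produce a similar figure is a treemap, for example with plotly.express (an assumption on my part, not necessarily the library used in the video):

```python
import plotly.express as px

# categories sized by total sales, split into their sub-categories
fig = px.treemap(df, path=["Category", "Sub-Category"], values="Sales")
fig.show()
```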
All right, I hope you enjoyed this project, I definitely did, and I will see you in the next video. In this part we are going to talk
about a case study in the field of predictive analytics and causal analysis. We are going to use a simple yet powerful regression technique called linear regression in order to perform causal analysis and predictive analytics. By causal analysis I mean that we are going to look into the correlations and try to figure out which features have an impact on the housing price, on the house value: which features describing the house define and cause the variation in house prices. The goal of this case study is to practice the linear regression model and to get a first feeling of how you can use a simple machine learning model to perform model training and model evaluation, and also use it for causal analysis, where you try to identify features that have a statistically significant impact on your response variable, your dependent variable. Here is the step-by-step process that we are going to follow in
order to find out which features define the Californian house values. First we are going to understand the set of independent variables that we have and the response variable. For our multiple linear regression model we will see which techniques we need and which Python libraries we need to load in order to conduct this case study, so first we load all these libraries and understand why we need them. Then we conduct data loading and data preprocessing; this is a very important step, and I deliberately didn't want to give you clean data, because in a real hands-on data science job you won't get clean data: you will get dirty data containing missing values and outliers, and those are things you need to handle before you proceed to the actual fun part, which is the modeling and the analysis. Therefore we will do missing data analysis and remove the missing data from our Californian house price data, and conduct outlier detection: we will identify outliers and learn different visualization techniques in Python that you can use to find them and then remove them from your data. Then we perform data visualization, exploring the data with different plots to learn more about it and about these outliers, combining different statistical techniques with Python. Then we do correlation analysis to identify potentially problematic features, which is something I would suggest you do regardless of the nature of your case study, to understand what kind of variables you have, what the relationship between them is, and whether you are dealing with problematic variables. Then we move to the fun part, performing the multiple linear regression in order to do the causal analysis, which means identifying the features of the Californian house blocks that define the value of the Californian houses. Finally, we will very quickly do another implementation of the same multiple linear regression, to give you not one but two different ways of conducting it, because linear regression can be used not only for causal analysis but also as a standalone, common machine learning regression model; therefore I will also show you how to use scikit-learn as a second way of training and then predicting the Californian house values. So, without further ado, let's get started. Once you become a data scientist
or machine learning researcher or machine learning engineer, there will be hands-on data science projects where the business comes to you and says: here we have this data, and we want to understand which features have the biggest influence on this other factor. In our case study, let's assume we have a client interested in identifying the features that define the house price. Maybe it's someone who wants to invest in houses, someone interested in buying houses, perhaps renovating and then reselling them for a profit, or maybe someone in the long-term investment market, where people buy real estate as an investment, hold it for a long time and sell it later, or for some other purpose. The end goal in this specific case is to identify the features of the house that make it priced at a certain level: the features of the house that cause its price and value. We are going to use a very popular dataset that is available on Kaggle and originally comes from scikit-learn, called California Housing Prices. I'll make sure to put the link to this dataset in my GitHub account, under the repository dedicated to this case study, and I will also point out additional links you can use to learn more about it. This dataset is derived from the 1990 US Census, using one row per census block group. A block group is the smallest geographical unit for which the US Census Bureau publishes sample data; a block group typically has a population of 600 to 3,000 people. A household is a group of people residing within a single home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation
Resorts so um let's now look into uh the variables that are available in this specific data set so uh what we have here is the med
in which is the median income in block group so uh this um touches the uh financial side and uh Financial level of
the uh block uh block of households then we have House age so this is the median house age in the block group uh then we have average
rooms which is the average number of rooms uh per household then we have average bedroom which is the average number of bedrooms per
household then we have population which is the uh blog group population so that's basically like we just saw that's the number of people who live in that
block then we have a uh o OU uh which is basically the average number of household members uh then we have latitude and longitude which are the latitude and
longitude of this uh block group that we are looking into so as you can see here we are dealing with aggregated data so we don't have the uh
data per household; rather, the data is calculated, averaged and aggregated per block. This is very common in data science when we want to reduce the dimension of the data, have sensible numbers, and create cross-sectional data. Cross-sectional data means we have multiple observations with data for a single time period; in this case the block is the aggregation unit. We have already learned, in the theory lectures, the idea of the median: there are different descriptive measures we can use to aggregate data, one is the mean and another is the median, and oftentimes, especially if we are dealing with a skewed distribution, one that is not symmetric but right-skewed or left-skewed, we need to use the median, because the median is then a better representation of the scale of the data than the mean. In this case we will soon see, when visualizing the data, that we are indeed dealing with skewed data. So this is a very simple, basic dataset with not too many features, a great way to get your hands on an actual machine learning use case; we will keep it simple, yet learn the basics and the fundamentals well, so that learning more difficult and more advanced
models will be much easier for you. Let's now get into the actual coding part. Here I will be using Google Colab, and I will share the link to this notebook, together with the data, in my Python for Data Science repository, so you can use it to follow this tutorial with me. We always start with importing libraries. We could run a linear regression manually, without libraries, using matrix multiplication, but I would suggest you not do that; you can do it for fun, or to understand the matrix multiplication and the linear algebra behind linear regression. If you want to get hands-on and use linear regression the way you would in your day-to-day job, you will instead use a library such as scikit-learn, or the statsmodels API. To cover this topic and get hands-on, I decided to showcase this example not only in one library, scikit-learn, but also in statsmodels. The reason is that many people use linear regression just for predictive analytics, and for that scikit-learn is the go-to option; but if you want to use linear regression for causal analysis, to identify and interpret the independent variables that have a statistically significant impact on your response variable, then you need another library, a very handy one for linear regression, called statsmodels.api, from which you import the sm functionality, and that will help you do exactly that. Later on we will see how nicely this library provides the output exactly as you would learn in a traditional econometrics or introduction-to-linear-regression class. I'm going to give you all this background information like no one before, and we're going to interpret and learn everything so that you start your machine learning journey in a proper, high-quality way. In
this case, the first thing we import is the pandas library, as pd, and then the NumPy library as np. We need pandas to create a pandas DataFrame, to read the data, and then to perform data wrangling, identifying the missing data and outliers, so common data wrangling and preprocessing steps. Then we use NumPy, which is commonly used whenever you are visualizing data or dealing with matrices or arrays, so pandas and NumPy are used side by side. Then we use matplotlib, specifically pyplot; this library is very important when you want to visualize data. Then we have seaborn, another handy data visualization library in Python. Whenever you want to visualize data in Python, matplotlib and seaborn are two very handy libraries you must know; if you like a cooler undertone of colors, seaborn will be your go-to option, because the visualizations you create are more appealing than plain matplotlib, but the underlying way of working, plotting scatter plots, lines, or heat maps, is the same. Then we have statsmodels.api, the library from which we import sm, the linear regression model we will use for our causal analysis. I'm also importing, from scikit-learn's linear_model, the LinearRegression model, which is basically similar; you can use either, but scikit-learn is the common way of working with machine learning models. Whenever you are doing predictive analytics, so you are not using the data to identify features that have a statistically significant impact on the response variable, features that influence and cause the dependent variable, but rather you just want to train the model on this data and then test it on unseen data, you can use scikit-learn. And scikit-learn is something you will use not only for linear regression but also for other machine learning models: think of KNN, logistic regression, random forests, decision trees, boosting techniques such as LightGBM, and clustering techniques like K-means and DBSCAN; anything that fits into this category of traditional machine learning models you will find there. Therefore I didn't want to limit this tutorial to statsmodels only, which we could do if we wanted the case study to be purely about linear regression, but instead I wanted to also showcase the usage of scikit-learn, because scikit-learn is something you can use beyond linear regression, for all these other kinds of machine learning models; and given that this course is designed to introduce you to the world of machine learning, I thought we would combine this with scikit-learn, something you are going to see time and time again when using Python for machine learning. Then I'm also importing train_test_split from scikit-learn's model_selection, so that we can split our data into train and test sets; the full set of imports is pulled together in the sketch below.
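A minimal sketch of those imports as they are described:

```python
import pandas as pd                      # data loading and wrangling
import numpy as np                       # arrays and matrices
import matplotlib.pyplot as plt          # plotting
import seaborn as sns                    # statistical visualizations
import statsmodels.api as sm             # OLS with full statistical output (causal analysis)
from sklearn.linear_model import LinearRegression     # prediction-oriented linear regression
from sklearn.model_selection import train_test_split  # train/test splitting
```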
Now, before we move on to the actual training and testing, we need to first load our data. Therefore,
what I did was to put this housing.csv data into a folder in Google Colab, under sample data. That's the data you can download when you go to that specific page: you download the housing data (roughly 409 KB), and that's exactly what I downloaded and then uploaded here in Google Colab, so housing.csv sits in this folder. I copy its path and create a variable that holds it, so file_path is the string variable holding the path of the data. Then I take this file_path and put it into pd.read_csv, which is a function we can use to load data; pd stands for pandas, the short alias, and read_csv is the function
we take from the pandas library, and within the parentheses we pass file_path.
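A minimal sketch, assuming the file was uploaded to Colab's sample_data folder (the exact path depends on where you place the CSV):

```python
file_path = "/content/sample_data/housing.csv"  # hypothetical path to the uploaded CSV
data = pd.read_csv(file_path)
```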
If you want to learn more about these basics, variables, and the different data structures, some basic Python for data science, then, to keep this tutorial structured, I will not cover that here; feel free to check the Python for Data Science course, and I will put the link in the comments below so you can learn that first and then come back to this tutorial to learn how to use Python in combination with linear regression. The first thing that I tend to do, before moving on to the actual execution
stage, is to look into the data and perform data exploration. What I tend to do is look at the data fields, the names of the variables available in the data, and you can do that with data.columns; this lists the columns in your data, which are the names of your data fields.
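In code, that is simply:

```python
print(data.columns)  # the column (data field) names
```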
Let's go ahead and run it. We see that we have longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
population, so basically the number of people living in those households and houses, then households, median_income, median_house_value, and ocean_proximity. You might notice that the names of these variables are a bit different from the official documentation of the California housing data: the naming is different, but the underlying explanation is the same; it is just represented with nicer naming. It is a common thing in Python, when dealing with data, to have these underscores in the names, so we have housing_median_age, which in the documentation is called HouseAge, a bit different, but the meaning is the same: it is still the median house age in the block group. One thing you can also notice is that the official documentation does not have one extra variable that we have here, ocean_proximity; this describes the closeness of the house to the ocean, which of course for some people can definitely mean an increase or decrease in the house price. So we have all these variables, and the next thing
I tend to do is look into the actual data, and one thing we can do is simply look at the top 10 rows instead of printing the entire data frame. When we execute this part of the code, you can see the top 10 rows of our data.
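For example:

```python
data.head(10)  # first 10 rows of the DataFrame
```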
We have the longitude, the latitude, and the housing median age, where you can see values like 41, 21, and 52 years, basically the median age of the houses per block. Then we have the total number of rooms: we see that in one block the houses have a total of about 7,099 rooms, so we are already seeing data that consists of large numbers, which is something to take into account when dealing with machine learning models, and especially with linear regression. Then we have total_bedrooms, and then population, households, median_income, median_house_value, and ocean_proximity. One thing you can see right off the bat is that longitude and latitude have some unique characteristics: longitude is negative and latitude is positive, but that's fine for linear regression, because what it basically looks at is whether a variation in certain independent variables, in this case
longitude and latitude, causes a change in the dependent variable. Just to refresh our memory about what linear regression will do here: we are dealing with multiple linear regression because we have more than one independent variable; the independent variables are the different features that describe the house, except the house price, because median house value is the dependent variable. That is basically what we are trying to figure out: we want to see which features of the house cause, and so define, the house price; we want to identify the features that cause a change in our dependent variable, and specifically what the change in the median house value is if we apply a one-unit change to an independent feature. With multiple linear regression, as we learned in the theory lectures, what the model does during causal analysis is hold all the other independent variables constant and then investigate, for a specific independent variable, what a one-unit increase in that variable does to the dependent variable. So if we, for instance, change the housing median age by one unit, what will be the corresponding change in the median house value, keeping everything else constant?
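Written out in the usual notation, the model and the "one unit, all else constant" interpretation look like this (with y the median house value and x_1 through x_p the house features):

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\frac{\partial\, \mathbb{E}[y \mid x]}{\partial x_j} = \beta_j
```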
That is basically the idea behind multiple linear regression applied to this specific use case. Here, what we also want to
do is find out the data types and learn a bit more about our data before proceeding to the next step, and for that I tend to use the info function in pandas: given that the data is a pandas DataFrame, I just call data.info(), and this shows the data type and the number of non-null values per variable.
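That is:

```python
data.info()  # column data types and non-null counts
```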
As we already noticed from the header, and as is confirmed here, ocean_proximity is a variable that is not numeric: you can see values like NEAR BAY which, unlike all the other values, are represented by strings. This is something we need to take into account, because later, when we do the data preprocessing and actually run the model, we will need to do something with this specific variable; we need to process it. For the rest we are dealing with numeric variables: longitude, latitude and all the other variables, including our dependent variable, are numeric (float64). The only variable that needs to be taken care of is ocean_proximity, which, as we will also see later, is a categorical string variable, and what this basically
means is that it has different categories. Let's actually check that very quickly and look at all the unique values for this variable: we take the name of the variable, copying it from the overview here, and call unique, which should give us the unique values of this categorical variable.
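That is:

```python
data["ocean_proximity"].unique()  # distinct categories of the string variable
```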
So here we go: we have five different unique values for this categorical string variable. This means that ocean_proximity can take five different values: NEAR BAY, less than one hour from the ocean (<1H OCEAN), INLAND, NEAR OCEAN, and ISLAND. What this means is that we are dealing with a feature that describes the distance of the block from the ocean,
and the underlying idea is that maybe this specific feature has a statistically significant impact on the house value, meaning it might be possible that for some people, in certain areas or countries, living near the ocean increases the value of the house. If there is huge demand for houses near the ocean, so people prefer to live near the ocean, then most likely there will be a positive relationship; if there is a negative relationship, it means that in that area, in California for instance, people do not prefer to live near the ocean, and houses in areas further from the ocean will have higher values. This is something we want to figure out with this linear regression: we want to understand which features define the value of the house, so we can say that if a house has certain characteristics, its price will most likely be higher or lower. Linear regression helps us not only to understand what those features are, but also how much higher or lower the value of the house will be if it has certain characteristics, or if we increase a certain characteristic by one unit. So
next we are going to look into the missing data. In order to have a proper machine learning model we need to do some data preprocessing, and for that we need to check for missing values in our data and understand the amount of NaN values per data field; this will help us understand whether we can simply remove those missing values or whether we need to do imputation. Depending on the amount of missing data, we can decide which of those solutions to take.
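A minimal sketch of that check, including the percentage view mentioned next:

```python
# absolute number of missing values per column
print(data.isnull().sum())

# the same, expressed as a percentage of all rows
print(data.isnull().sum() / len(data) * 100)
```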
Here we can see that we don't have any null values for longitude, latitude, housing median age, or any of the other variables, except one independent variable: total_bedrooms. Out of all the observations we have, the total_bedrooms variable has 207 cases where we do not have the corresponding information. When it comes to representing these numbers as percentages, which is something you should do as your next step, we can see that out of the entire dataset only about 1% of total_bedrooms is missing. This is really important, because simply looking at the raw count of missing observations per data field is not very helpful: you cannot tell how much of the data is missing in relative terms. If for a certain variable 50% or 80% is missing, it means that for the majority of your house blocks you don't have that information, and including it will not be beneficial for your model, nor accurate; it will result in a biased model, because if for the majority of observations you have no information and for certain observations you do, you will automatically skew your results and get biased results. Therefore, if a specific variable is missing for the majority of your dataset, I would suggest you simply drop that independent variable. In this case we have just one percent of the house blocks missing that
information, which gives me confidence to rather keep this independent variable and just drop the observations that have no total_bedrooms information. Another solution, instead of dropping observations or the entire independent variable, is to use some sort of imputation technique. This means we try to find a way to systematically fill in a replacement for the missing value: we can use mean imputation, median imputation, or more advanced model-based statistical or econometric approaches. For now this is out of the scope of this problem, but I would say: look at the percentage of observations for which the independent variable has missing values; if it is low, say less than 10%, and you have a large dataset, you should be comfortable dropping those observations, but if you have a small dataset, say only 100 observations, and for them 20% or 40% is missing, then consider imputation, so try to find values that can be used to replace those missing
values. Once we have this information and we have identified the missing values, the next thing is to clean the data. What I'm doing here is taking the data we have and using the dropna function, which drops the observations where the value is missing: I'm dropping all observations for which total_bedrooms has a null value, so I'm getting rid of my missing observations. After doing that, I check whether I actually got rid of them: printing data.isnull().sum(), so summing up the number of missing (NaN) values per variable, shows that I no longer have any missing observations, so I successfully deleted them all.
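A minimal sketch of that cleaning step:

```python
# drop the rows where total_bedrooms is missing, then re-check
data = data.dropna(subset=["total_bedrooms"])
print(data.isnull().sum())
```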
Now the next step is to describe the data through some descriptive statistics and through data visualization. Before moving on to the causal analysis or predictive analysis in any traditional machine learning approach, try to first look into the data, try to understand it and see whether you notice some patterns: what is the mean of the different numeric data fields, do you have certain categorical values that cause unbalanced data? Those are things you can discover early on, before moving on to model training and testing and blindly believing the numbers. Data visualization techniques and data exploration are a great way to understand the data you have before using it to train and test
a machine learning model. Here I'm using the traditional describe function of pandas, data.describe(), which gives me the descriptive statistics of my data.
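That is simply:

```python
data.describe()  # count, mean, std, min, quartiles, max per numeric column
```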
Here we can see that in total we have about 20,600 observations, and we also get the mean of all the variables. You can see that each variable has the same count, which basically means that
for all variables I have the same number of rows. Then I have the mean, so per variable we have its mean, and then the standard deviation, the square root of the variance; we have the minimum and the maximum, but also the 25th percentile, the 50th percentile, and the 75th percentile. Percentiles and quartiles are statistical terms we use often: the 25th percentile is the first quartile, the 50th percentile is the second quartile, or the median, and the 75th percentile is the third quartile. What this basically means is that these percentiles help us understand the thresholds when looking at the
observations that fall under the 25% mark and above it. The standard deviation helps us interpret the variation in the data at the unit, so the scale, of that variable. In this case the variable is median house value: the mean is approximately 206,000, so roughly 206K, and the standard deviation is 115K. What this means is that in the dataset we will find blocks whose median house value is around 206K plus 115K, which is about 321K, so there will be blocks where the median house value is around 321K, and there will also be blocks where the median house value is around 91K, so 206K minus 115K. That is the idea behind the standard deviation: the variation in your data. So
next we can interpret the minimum and maximum of the data fields. The minimum helps you understand the smallest value you have per numeric data field, and the maximum the largest, so the range of values you are looking at. In the case of the median house value this means: what is the lowest median house value per block, and what is the highest? This helps you understand, when looking at this aggregated data, which blocks have the cheapest houses in terms of valuation and which are the most expensive blocks. We can see that the cheapest block has a median house value of about 15K, specifically 14,999, and the block with the highest valuation has a median house value of $500,001, which means that when we look at our blocks of houses, the median house value in the most expensive blocks is at most roughly 500K. The next thing I tend to do is visualize the data, and I tend to start with the dependent variable. This is
the variable of interest, the target variable or response variable, which in our case is the median house value; this will serve as our dependent variable. What I want to do is plot a histogram in order to understand the distribution of median house values: I want to see, when looking at the data, which median house values appear most frequently, and which blocks have unusual, less frequently appearing median house values. By plotting this type of plot you can see some outliers, some frequently appearing values, but also values that lie outside of the usual range, and this
will help you identify and learn more about your data and spot outliers in it. Here I'm using the seaborn library; given that I already imported the libraries earlier, there is no need to import them again. What I'm doing is setting the style to a white background with a grid, then initializing the size of the figure: plt comes from matplotlib's pyplot, and I'm setting the figure size to 10 by 6, so 10 wide and 6 high. Then we have the main plot: I'm using the histplot function from seaborn, and from the cleaned data, from which we removed the missing values, I'm picking the variable of interest, the median house value, and plotting this histogram using the forest green color. Then I set the title of the figure, Distribution of Median House Values, the x label, which is the name of the variable on the x-axis, median house value, and the y label, the name of the variable on the y-axis, and finally I call plt.show(), which means show me the figure.
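Put together, a minimal sketch of that plot (assuming the Kaggle column name median_house_value):

```python
sns.set_style("whitegrid")                 # white background with a grid
plt.figure(figsize=(10, 6))                # 10 wide by 6 high
sns.histplot(data["median_house_value"], color="forestgreen")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value")
plt.ylabel("Frequency")
plt.show()
```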
That is basically how visualization works in Python: we first set the figure size, then call the plotting function with the right variable, so we provide the data to the visualization, then we set the title, the x label and the y label, and then we say show me the visualization. If you want to learn more about these visualization techniques, make sure to check the Python for Data Science course, because that one will help you understand, slowly and in detail, how you can visualize your data. What we are visualizing here is the frequency of the
median house values in the entire dataset. This means we are looking at the number of times each median house value appears in the dataset: we want to understand whether certain median house values appear very often and whether others do not appear that often, in which case they can perhaps be considered outliers, because in our data we only want to keep the most relevant and representative data points. We want to derive conclusions that hold for the majority of our observations and not for outliers, and we will then use that representative data to run our linear regression and draw conclusions. Looking at this graph, we can see a certain cluster of median house values that appear quite often; those are the cases where the frequency is high. For instance, a median house value of about 160 to 170K appears very frequently, with a frequency above 5,000; those are the most frequently appearing median house values. You can also see, at both ends, houses whose median house value does not appear very often, so their frequency is low; roughly speaking, those are unusual house blocks and can be considered outliers, and the same holds for the blocks on the far right, where the frequency is also very low. This means that in our population of houses, the California house prices, you will most likely see blocks of houses whose median value is between, say, 70K and 300 or 350K, and anything below or above that is considered unusual: you don't often see house blocks with a median house value of less than 60 or 70K, or above 370 or 400K. Do keep in mind that we are dealing with data from the year 1990, not current prices, because nowadays California houses are much more expensive; this data comes from 1990, so do
What we can then do is use the idea of the interquartile range (IQR) to remove these outliers. What this basically means is that we look at the lowest 25 percent of values, the first quartile Q1 (the 25th percentile), and at the upper part, the third quartile Q3 (the 75th percentile). Using these two thresholds we can identify the observations, the blocks, whose median house value lies below the 25th percentile or above the 75th percentile. Basically we want to keep the middle part of our data: the blocks whose median house value is above the lowest 25 percent and below the largest 25 percent, the so-called normal and representative blocks, and remove the very small and the very large median house values. The statistical term for Q3 minus Q1 is the interquartile range; you don't strictly need to know the name, but I think it is worth understanding, because this is a very popular way of doing a data-driven removal of outliers. I select the 25th percentile using the quantile function from pandas, so I'm saying: find for me the value that splits my observations into the smallest 25 percent and the largest 75 percent when it comes to the median house value. That gives me Q1; similarly, Q3 is the value below which 75 percent of the observations fall, and we will use it to remove the very large median house values, the upper 25 percent. To calculate the interquartile range we take Q3 and subtract Q1 from it. To understand this idea of Q1 and Q3, the quartiles, a bit better, let's actually print them.
So let's remove the filtering part for now and just run it. As you can see, we find that Q1, the 25th percentile or first quartile, is equal to $119,500. What this means is that the smallest 25 percent of the observations have a median house value below $119,500, and the remaining 75 percent of our observations have a median house value above $119,500. Then Q3, the third quartile or 75th percentile, describes the threshold that separates the lowest 75 percent of median house values from the most expensive ones, the highest 25 percent. We see that this threshold is $264,700, which means that the blocks with the highest valuations, the top 25 percent of median house values, lie above $264,700.
That's what we want to remove: the observations with the smallest and the largest median house values. It is common practice with the interquartile range approach to multiply the IQR by 1.5 in order to obtain the lower bound and the upper bound, the thresholds we use to remove the blocks whose median house value is very small or very large. So we multiply the IQR by 1.5; when we subtract this value from Q1 we get our lower bound, and when we add this value to Q3 we get our upper bound. After we clean these outliers from our data we end up with a smaller data set: previously we had 20,433 observations and now we have 19,369, so we have removed roughly a thousand, or a bit over a thousand, observations from our data.
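A rough sketch of that IQR-based filtering, again assuming the cleaned DataFrame is called data_clean (the variable names are mine, for illustration):

```python
# First and third quartiles of the median house value
q1 = data_clean["median_house_value"].quantile(0.25)
q3 = data_clean["median_house_value"].quantile(0.75)
print(q1, q3)  # roughly 119,500 and 264,700 for this data

# Interquartile range and the usual 1.5 * IQR bounds
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the "normal", representative blocks
data_clean = data_clean[
    (data_clean["median_house_value"] >= lower_bound)
    & (data_clean["median_house_value"] <= upper_bound)
]
print(len(data_clean))
```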
Next, let's look into some other variables, for instance the median income. One other technique we can use to identify outliers in the data is the box plot. I wanted to showcase different approaches for visualizing the data and identifying outliers, so that you become familiar with several techniques. So let's go ahead and plot the box plot. A box plot is a statistical way to represent your data: the central box represents the interquartile range, the IQR, and its bottom and top edges indicate the 25th percentile (the first quartile) and the 75th percentile (the third quartile) respectively. The length of this box, the dark part you see here, covers the middle 50 percent of your data for the median income, and the line inside the box, the one in a contrasting color, represents the median of the data set; the median is the middle value when the data is sorted in ascending order. Then we have the whiskers in our box plot: these lines extend from the top and the bottom of the box and indicate the range of the rest of the data set excluding the outliers; they typically reach up to 1.5 times the IQR above Q3 and 1.5 times the IQR below Q1.
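A minimal sketch of that box plot, using the same hypothetical DataFrame and column names as above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
# The box spans Q1 to Q3 (the IQR), the inner line is the median,
# whiskers extend 1.5 * IQR, and points beyond them are drawn as outliers
sns.boxplot(x=data_clean["median_income"], color="forestgreen")
plt.title("Box Plot of Median Income")
plt.show()
```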
That is something we also saw just previously when we were removing the outliers from the median house value. To identify the outliers, you can quickly see all the points that lie more than 1.5 times the IQR above the third quartile, the 75th percentile. Those are blocks of houses with an unusually high median income, which is something we want to remove from our data, and we can use exactly the same approach as before for the median house value: we identify Q1, the first quartile or 25th percentile, and Q3, the third quartile or 75th percentile, compute the IQR, obtain the lower bound and the upper bound using the 1.5 scaling, and then use those bounds as filters to keep in the data only the observations whose median income is above the lower bound and below the upper bound. We are using the lower bound and the upper bound to perform double filtering: two filters in the same row, as you can see, combined with parentheses and the & operator to tell Python that, first, the observation must have a median income above the lower bound and, at the same time, it must have a median income below the upper bound. If a block, an observation in the data, satisfies both of these criteria, then we are dealing with a good, normal point and we keep it, and the result is our new data. Let's go ahead and execute this code; in this case all the outliers lie on the high end of the box plot, and we end up with clean data. I'm taking this clean data and assigning it to data, just for simplicity.
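A sketch of that double filter on the median income, with the same hypothetical variable names as before:

```python
# Same IQR logic, this time for the median income
q1 = data_clean["median_income"].quantile(0.25)
q3 = data_clean["median_income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Two conditions combined with &: keep only blocks whose median income
# lies between the lower and the upper bound
data = data_clean[
    (data_clean["median_income"] >= lower_bound)
    & (data_clean["median_income"] <= upper_bound)
]
```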
This data is now much cleaner and a better representation of the population, which is what we ideally want: we want to find out which features describe and define the house value, not based on unique and rare houses that are too expensive or that sit in blocks with very high-income residents, but based on the true, most frequently appearing data. We want to know which features define the house value for common houses and common areas, for people with average or normal income. That's what we want to find.
The next thing I tend to do, especially for regression and causal analyses, is to plot the correlation heat map. This means we compute the correlation matrix, the pairwise correlation score for each pair of variables in our data. When it comes to linear regression, one of the assumptions we learned during the theory part is that we should not have perfect multicollinearity, which means there should not be a high correlation between any pair of independent variables: knowing one should not automatically tell us the value of another independent variable. If the correlation between two independent variables is very high, we might be dealing with multicollinearity, which is something we want to avoid. A heat map is a great way to identify whether we have this type of problematic independent variables and whether we need to drop one of them, or maybe several, to ensure we end up with a proper linear regression model whose assumptions are satisfied. Here we use seaborn to plot the heat map. As you can see, the colors range from very light, almost white, to very dark green, where light indicates a strong negative correlation and very dark green a very strong positive correlation. We know that the Pearson correlation takes values between minus one and one: minus one means a very strong negative correlation and one a very strong positive correlation. The correlation of a variable with itself, for example between longitude and longitude, is equal to one, which is why the diagonal is all ones: those are the pairwise correlations of the variables with themselves. The values below the diagonal mirror the values above it, because the correlation between two variables does not depend on which one you put first: the correlation between longitude and latitude is the same as the correlation between latitude and longitude.
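A rough sketch of such a heat map with seaborn; the color map, annotation format and the numeric-column selection are choices I'm assuming here, and the original notebook may differ:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))

# Pairwise Pearson correlations between the numeric columns,
# annotated with the correlation values
corr_matrix = data.select_dtypes("number").corr()
sns.heatmap(corr_matrix, annot=True, cmap="Greens", fmt=".2f")

plt.title("Correlation Heat Map")
plt.show()
```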
Now that we have refreshed our memory, let's look at the actual numbers in this heat map. As we can see, there is a section where the independent variables have a low positive correlation with the remaining independent variables: the light green cells indicate a low positive relationship between those pairs of variables. One thing that is very interesting is the middle part of the heat map, where we have these dark cells. The numbers below the diagonal are the ones we interpret, and remember that below and above the diagonal are mirror images. Here we already see a problem, because we are dealing with variables, which will be independent variables in our model, that have a high correlation with each other. Why is this a problem? Because, as we saw in the theory section, one of the assumptions of linear regression is that we should not have a multicollinearity problem. Perfect multicollinearity means we are dealing with independent variables so highly correlated that knowing the value of one automatically tells us the value of the other. When we have a correlation of 0.93, which is very high, or 0.98, those two independent variables have an extremely strong positive relationship. This is a problem because it can cause our model to produce very large standard errors and an inaccurate, non-generalizable model, which is something we want to avoid; we want to ensure that the assumptions of our model are satisfied.
Here we are dealing with the independent variables total_bedrooms and households, which means that the number of total bedrooms per block and the number of households are highly positively correlated, and that is a problem. Ideally we want to drop one of these two independent variables, and the reason we can do that is that, given they are so highly correlated, they already explain a similar type of information; they contain a similar type of variation. Including both simply doesn't make sense: on one hand it potentially violates the model assumptions, and on the other hand it adds little value, because one already captures the variation the other one shows. So total_bedrooms basically contains similar information to households, and we might as well drop one of the two. The question is which one, and that is something we can decide by also looking at the other correlations: total_bedrooms has a high correlation with households, but total_rooms also has a very high correlation with households, so there is yet another independent variable that is highly correlated with households, and total_rooms also has a high correlation with total_bedrooms. This means we can check which variable most often has high correlations with the rest of the independent variables, and in this case the two largest numbers involve total_bedrooms: it has a correlation of 0.93 with total_rooms and, at the same time, a very high correlation of 0.98 with households. So total_bedrooms has the highest correlations with the remaining independent variables, and we might as well drop it. Before you do that, though, I would suggest one more quick visual check: look at the correlation of total_bedrooms with the dependent variable, to understand how strong a relationship it has with the response variable we are studying. We see that total_bedrooms has a correlation of only about 0.05 with the response variable, the median house value, whereas total_rooms has a much higher one, so I already feel comfortable excluding and dropping total_bedrooms from our data in order to ensure we are not dealing with multicollinearity. That's exactly what I'm doing here: I'm dropping total_bedrooms, and after doing that we no longer have total_bedrooms as a column.
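That drop is a one-liner in pandas; the column name follows this dataset:

```python
# Drop the highly collinear feature; axis=1 means drop a column, not a row
data = data.drop("total_bedrooms", axis=1)
print(data.columns)
```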
Before moving on to the actual causal analysis, there is one more step I wanted to show you, which is super important for causal analysis and introductory econometrics. When you have a string categorical variable, there are a few ways to deal with it. One easy way that you will see on the web is to map it to integer codes, which basically means transforming all these string values, NEAR BAY, <1H OCEAN, INLAND, NEAR OCEAN, ISLAND, into numbers, so that the ocean_proximity variable takes values such as 1, 2, 3, 4, 5. That is one way of doing it, but a better way when using this type of variable in linear regression is to transform the string categorical variable into what we call dummy variables. A dummy variable takes two possible values; it is a binary, Boolean variable that can be zero or one, where one means the condition is satisfied and zero means it is not. Let me give you an example. In this specific case ocean_proximity has five different values, and ocean_proximity is just a single variable. What we will do is use the get_dummies function from pandas to go from this one variable to five different variables, one per category: new variables indicating whether the block is near the bay, whether it is less than one hour from the ocean, whether it is inland, whether it is near the ocean, or whether it is on an island. Each of these will be a separate binary dummy variable taking the values zero and one, which means we go from one string categorical variable to five different dummy variables, one for each of the five categories. We then combine them with the original data and drop the ocean_proximity column.
On one hand we are getting rid of the string variable, which is problematic for linear regression when combined with the scikit-learn library, because scikit-learn cannot handle this type of data directly in a linear regression; on the other hand we are making our job easier when it comes to interpreting the results, because interpreting a linear regression for causal analysis is much easier with dummy variables than with one string categorical variable. Just to give you an example: from this string variable we create the five dummy variables you can see here. If we look at one category, say ocean_proximity_INLAND, then for all the rows where the value is zero the criterion is not satisfied, meaning the house block we are dealing with is not inland, and for all the rows where ocean_proximity_INLAND is equal to one the criterion is satisfied and we are dealing with house blocks that are indeed inland. One thing to keep in mind when transforming a string categorical variable into a set of dummies is that you always need to drop one of the categories. The reason comes from the theory: we should have no perfect multicollinearity, so we cannot include five dummy variables that are perfectly collinear. If we include all of them, then whenever we know that a block of houses is not near the bay, not less than one hour from the ocean, not inland and not near the ocean, we automatically know it must belong to the remaining category, the island category, so ocean_proximity_ISLAND must be equal to one. That is exactly the definition of perfect multicollinearity, which we want to avoid. So, to keep one of the OLS assumptions from being violated, we need to drop one of those categories.
That's exactly what I'm doing here. Let's first see the full set of categories we got: less than one hour from the ocean, inland, island, near bay, and near ocean. Let's drop one of them, say the island category. We can do that very simply with data = data.drop(...), passing the name of the variable in quotation marks and axis=1; in this way I'm dropping one of the dummy variables I created, in order to avoid violating the no-perfect-multicollinearity assumption. Once I print the columns, we can check that this column no longer appears, and here we go: we have successfully deleted that variable. Let's also get the head of the data. Now you can see that we no longer have a string in our data, but instead we got four additional binary variables out of a string categorical variable with five categories.
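A compact sketch of the whole dummy-encoding step; the category labels follow the California housing dataset, and the column prefix is just one reasonable naming choice:

```python
import pandas as pd

# One dummy column per ocean_proximity category
dummies = pd.get_dummies(data["ocean_proximity"], prefix="ocean_proximity")

# Combine with the original data and drop the string column
data = pd.concat([data.drop("ocean_proximity", axis=1), dummies], axis=1)

# Drop one category (here ISLAND) to avoid the dummy-variable trap,
# i.e. perfect multicollinearity with the intercept
data = data.drop("ocean_proximity_ISLAND", axis=1)

print(data.columns)
data.head()
```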
All right, now we are ready to do the actual work. When it comes to training a machine learning model, or a statistical model, we learned during the theory that we always, always need to split the data into a train set and a test set; that is the minimum. In some cases we also need a train, validation and test split, so that we can train the model on the training data, optimize it on the validation data to find the optimal set of hyperparameters, and then apply the fitted and optimized model to the unseen test data. We are going to skip the validation set for simplicity, especially given that we are dealing with a very simple machine learning model, linear regression, and we will split our data into train and test only. First I create a list with the names of the variables we are going to use to train the model, so a set of independent variables and a dependent variable. In our multiple linear regression the independent variables are longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we built from the categorical variable. Then I specify that the target, the response or dependent variable, is the median house value. This is the value we want to model, because we want to see which features have a statistically significant impact on the dependent variable, which features describing the houses in a block cause a change, a variation, in the median house value. So X is the data restricted to the columns with those feature names, and the target is the median house value, which is the column we select from the data.
So we are doing data filtering and selection here. What I'm using next is the train_test_split function from scikit-learn; you might recall that in the beginning we imported the model_selection module and, from sklearn.model_selection, the train_test_split function. This is a function you are going to need a lot in machine learning, because it is a very easy way to split your data. The arguments of this function are, first, the matrix or data frame that contains the independent variables, in our case X, so you fill in X; the second argument is the dependent variable, y; then we have test_size, which is the proportion of observations you want to put in the test set, and implicitly the proportion you leave for training. If you pass 0.2, your test set will be 20 percent of your entire data and the remaining 80 percent will be your training data; the function automatically understands that you want this 80/20 division. Finally, you can also set the random_state: the split is random, the data is randomly sampled from the entire data set, and to ensure that your results are reproducible, that you get the same results the next time you run this notebook, and that you and I get the same results, we fix a random state; 111 is just a number I liked and decided to use here. When we run this command, you can see that the training set size is about 15K and the test size about 3.9K, and when you look at these numbers you get a verification that you are dealing with the 80/20 split.
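A sketch of that selection and split, assuming data already contains the cleaned, dummy-encoded columns; the feature list below mirrors the one described above, and the dummy column names are the ones pd.get_dummies would produce for this dataset:

```python
from sklearn.model_selection import train_test_split

feature_names = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "population", "households", "median_income",
    "ocean_proximity_<1H OCEAN", "ocean_proximity_INLAND",
    "ocean_proximity_NEAR BAY", "ocean_proximity_NEAR OCEAN",
]

X = data[feature_names]          # independent variables
y = data["median_house_value"]   # dependent (target) variable

# 80% train / 20% test, reproducible thanks to the fixed random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)
print(X_train.shape, X_test.shape)
```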
Then we go and do the training. One thing to keep in mind is that here we are using the sm module, the one we imported from statsmodels.api. This is one library we can use to conduct our causal analysis and train a linear regression model. This library does not automatically add the first column of ones to your set of independent variables: it only looks at the features you have provided, and those are all the independent variables. But we learned from the theory that in linear regression we always add the intercept, the beta zero; if you go back to the theory lectures you can see this beta zero being added both to the simple linear regression and to the multiple linear regression. It ensures that we estimate the intercept, which here is the average median house value when all the other features are equal to zero. Therefore, given that statsmodels.api does not add this constant column for the intercept, we need to add it manually, which is why we call sm.add_constant on X_train; now our X table, our X data frame, gets a column of ones added to the features. Let me actually show you this before doing the training, because I think it is something you should be aware of.
So let's pause here: I'm going to print X_train with the constant added, and I'm also going to print the same feature data frame before adding the constant, so that you see what I mean. As you can see, the first one is just the set of all columns that form the independent variables, the features. When we add the constant, you can see that we now have an initial column of ones. This is done so that we can estimate the beta zero, the intercept, and perform a valid multiple linear regression; otherwise you don't have an intercept, and that is just not what you are looking for. The scikit-learn library does this automatically, so when you are using statsmodels.api you should add the constant, and when I use scikit-learn I do it without adding the constant. If you are wondering why we use this specific model, as we already discussed, just to refresh your memory: we are using statsmodels.api because it has the nice property of showing a summary of your results, your p-values, your t-tests, your standard errors, which is exactly what you want when you are performing a proper causal analysis and you want to identify the features that have a statistically significant impact on your dependent variable. If you are using a machine learning model, including linear regression, only for predictive analytics, then you can use scikit-learn without worrying about statsmodels.api. So this is about adding the constant.
Now we are ready to actually fit, or train, our model. What we need to do is use sm.OLS; OLS is the ordinary least squares estimation technique we also discussed as part of the theory. We first provide the dependent variable, y_train, and then the feature set, X_train with the constant added. Then we call .fit(), which means: take the OLS model, use y_train as my dependent variable and X_train with the constant as my set of independent variables, and fit the OLS linear regression on this specific data. If you are wondering about the difference between train and test, make sure to revisit the theory lectures, because there I go into the concepts of training and testing and how we divide the data in detail. The y and X, as we have already discussed during this tutorial, simply mark the distinction between the independent variables, defined by X, and the dependent variable, defined by y: y_train and y_test are the dependent variable for the training and test data, and X_train and X_test are the training and test features. We use X_train and y_train to fit the model, to learn from the data; then, when it comes to evaluating the model, we take the fitted model, which has learned from both the dependent variable and the independent variables, apply it to the unseen data X_test, obtain the predictions, and compare them to the true values y_test, to see how different y_test is from the predictions for this unseen data and to evaluate how well the model manages to predict median house values it has not seen. That is just background information and a refresher. In this case we are simply fitting the model on the training dependent variable and the training independent variables with the constant added, and then we are ready to print the summary.
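A sketch of the statsmodels workflow just described; the variable names are the hypothetical ones used in the split sketch above:

```python
import statsmodels.api as sm

# statsmodels does not add the intercept column itself,
# so we prepend a column of ones to the features
X_train_const = sm.add_constant(X_train)

# Ordinary least squares: dependent variable first, then the features
model = sm.OLS(y_train, X_train_const)
model_fitted = model.fit()

# Full summary: coefficients, standard errors, t-statistics, p-values,
# R-squared, F-statistic, AIC/BIC, and so on
print(model_fitted.summary())
```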
Now let's interpret those results. The first thing we can see is that all the coefficients, all the independent variables, are statistically significant. How can I say this? If we look here, we can see the column of p-values; this is the first thing you need to look at when you get the results of a causal analysis with linear regression. Here the p-values are very small. Just to refresh our memory, the p-value is the probability of obtaining a test statistic at least this extreme purely by random chance, that is, of seeing a statistically significant-looking result just by chance rather than because the null hypothesis is actually false and should be rejected. That is one thing, but the summary gives us much more. The first thing you can do is verify that you used the correct dependent variable: you can see here that the dependent variable is the median house value. The model used to estimate the coefficients is OLS, and the method is least squares, which is simply the technique of minimizing the sum of squared residuals. The date we ran this analysis is the 26th of January 2024. We have the number of observations, which is the number of training observations, the 80 percent of our original data. We have the R-squared, which is the metric that showcases the goodness of fit of your model: it is commonly used in linear regression to gauge how well your model fits the data with the regression line, its maximum is one and its minimum is zero. Here it is 0.58, approximately 0.59, which means that the independent variables you have included are able to explain about 59 percent of the entire variation in the response variable, the median house value. What does this mean? On one hand, you have a reasonable amount of information: anything above 0.5 is quite good, and it means you can explain more than half of the variation in the median house value. On the other hand, it also means there is roughly 40 percent of the variation, information about the house values, that your data does not capture, so you might consider looking for additional independent variables to add on top of the existing ones, in order to increase the amount of variation your model can explain. The R-squared is, in this sense, the standard way to describe the quality of your regression model. Another thing we have is the adjusted R-squared; here the adjusted R-squared and the R-squared are the same, about 0.59, which usually means the number of features you are using is fine. Once you overwhelm your model with too many features, you will notice the adjusted R-squared diverging from the R-squared: the adjusted R-squared helps you understand whether your model performs well only because you keep adding variables, or because those variables really contain useful information, since the plain R-squared automatically increases as you add more independent variables even when they are not useful and only add complexity and possibly overfit your model without providing additional information. Then we have the F-statistic, which corresponds to the F-test. The F-test comes from statistics; you don't strictly need to know it, but check out the Fundamentals of Statistics course if you want to, because it tests whether all of these independent variables together help to explain your dependent variable, the median house value. If the F-statistic is very large, or the p-value of the F-statistic is very small, essentially 0.000, it means all your independent variables are jointly statistically significant: together they help explain the median house value, which means you have a good set of independent variables. Then we have the log-likelihood, not super relevant in this case, and the AIC and BIC, which stand for Akaike information criterion and Bayesian information criterion. Those are not necessary to know for now, but as you advance in your machine learning career it may be useful to understand them at a high level; for now, think of them as values describing the information you gain when adding this set of independent variables to your model. This is optional; ignore it for now if you don't know it.
Okay, let's now go into the fun part. In this part of the summary table we first have the set of independent variables: our constant, which is the intercept, then longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we created. Then we have the coefficients corresponding to those independent variables; these are the beta-zero-hat, beta-one-hat, beta-two-hat and so on, the parameters of the linear regression model that our OLS method has estimated from the data we provided. Before interpreting these variables, the first thing to do, as I mentioned in the beginning, is to look at the p-value column, which shows which independent variables are statistically significant. The table you get from statsmodels.api is usually read at the 5 percent significance level, so the alpha, the threshold of statistical significance, is 5 percent, and any p-value smaller than 0.05 means you are dealing with a statistically significant independent variable. The next thing you can see, to the left, is the t-statistic: each p-value is based on a t-test, and the t-test, as we learned during the theory (you can also check the Fundamentals of Statistics course from LunarTech for a more detailed treatment), tests the hypothesis of whether each of these independent variables individually has a statistically significant impact on the dependent variable. Whenever this t-test has a p-value smaller than 0.05, you are dealing with a statistically significant independent variable, and in this case we are lucky: all our independent variables are statistically significant. The next question is whether the effect is positive or negative, which you can see from the signs of the coefficients: longitude has a negative coefficient, latitude a negative coefficient, housing median age a positive coefficient, and so on. A negative coefficient means that the independent variable causes a negative change in the dependent variable. More specifically, let's look at, say, total_rooms, whose coefficient is -2.67. It means that if we increase the total number of rooms by one additional unit, one more room added to total_rooms, then the median house value decreases by $2.67. Now you might wonder how this is possible. First of all, the coefficient is quite small, so the relationship is not very strong; the magnitude of this coefficient is modest. On the other hand, you can argue that at some point adding more rooms just doesn't add any value, and in some cases it even decreases the value of the house; at least that is what this data suggests. So if there is a negative coefficient, a one-unit increase in that specific independent variable, all else constant, results in a decrease in the dependent variable, in this case a $2.67 decrease in the median house value for total_rooms. We also refer to this as the ceteris paribus assumption in econometrics, which means everything else held constant. So, one more time, to make sure we are clear: if we add one more room to the total number of rooms, then the median house value decreases by $2.67, provided that the longitude, latitude, housing median age, population, households, median income and all the other characteristics stay the same. Now let's look at the opposite case, a coefficient that is positive and large, which is the housing median age. It means that if we have two house blocks with exactly the same characteristics, the same longitude and latitude, the same total number of rooms, population, households and median income, the same distance from the ocean, but one of them has one additional year of housing median age, so it is one year older, then the median house value of that block is higher by $846. The block with one more year of median age has an $846 higher median house value compared to the one that is identical except for a median age one year lower. So one additional year in the median age results in an $846 increase in the median house value, everything else constant. This covers the idea of negative versus positive coefficients and their magnitudes.
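As a tiny worked illustration of that ceteris paribus reading, here is the arithmetic, using the rounded coefficient values quoted above purely for illustration:

```python
# Rounded coefficients quoted from the summary above (illustrative only)
coef_total_rooms = -2.67          # dollars per additional room
coef_housing_median_age = 846.0   # dollars per additional year of median age

# Two otherwise identical blocks, differing by +10 rooms and +1 year of age
delta_value = coef_total_rooms * 10 + coef_housing_median_age * 1
print(delta_value)  # 846 - 26.7 = 819.3 dollars higher predicted median value
```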
Now let's look at one dummy variable, explain the idea behind it and how to interpret it; it is a good way to understand how dummy variables are read in the context of linear regression. One of the independent variables is ocean_proximity_INLAND, and its coefficient is about -2.108e+05, which simply means approximately -210.8K. What this means is the following: suppose we have two blocks of houses with exactly the same characteristics, the same longitude and latitude, the same housing median age, the same total number of rooms, population, households and median income, with a single difference: one block is located inland and the other is not. In this case the reference, the category we removed earlier, was the island category, as you might recall. So if the block of houses is inland, its median house value is on average about 210K lower than that of a block with exactly the same characteristics that is not inland, for instance one that is on an island. With dummy variables there is always an underlying reference category, the one you deleted from the string categorical variable, and you interpret each dummy relative to that category. This might sound complex, but it really isn't; it is just a matter of practice and of understanding what a dummy variable represents: the criterion is either satisfied or it is not. In this specific case, if you have two blocks of houses with exactly the same characteristics, and one block is inland while the other is not, say it is on an island, then the inland block will on average have a $210,000 lower median house value than the island block, which kind of makes sense, because in California people might prefer living in an island location, so those houses might be in higher demand than houses in inland locations. To summarize: the longitude has a statistically significant impact on the median house value; the latitude and the housing median age cause statistically significant differences in the median house value when they change; the total number of rooms has an impact, and so do the population, the households, the median income and the proximity to the ocean. This is because all their p-values are essentially zero, smaller than 0.05, so they all have a statistically significant impact on the median house value in the California housing market. As for interpretation, we have covered only a few coefficients, for the sake of simplicity and to keep this case study from taking too long: we interpreted the housing median age and the total number of rooms, but you can also interpret the population and the median income, and we interpreted one of the dummy variables, but feel free to interpret all the others as well. By doing this you can even build an entire case-study paper in which you explain, in one or two pages, the results you obtained, and this will showcase that you understand how to interpret linear regression results. Another thing I would suggest is to comment on the standard errors, so let's now look into them.
We can see that the standard errors we are getting are huge, and this is a direct result of the fourth assumption being violated. This case study is important and useful precisely because it showcases what happens when some of your assumptions are satisfied and some are violated. In this specific case the assumption that the errors have a constant variance is violated: we have a heteroskedasticity issue, and we see it reflected in our results. This is a good example of a situation where, even without formally checking the assumptions, the very large standard errors already hint that heteroskedasticity is most likely present and that our homoskedasticity assumption is violated. Keep this idea of large standard errors in mind, because we will see that it also becomes a problem for the performance of the model, and that we end up with a large prediction error because of it. One more comment on the total rooms and the housing median age: in some cases the linear regression results might not seem logical, but sometimes there actually is an underlying explanation, or maybe your model is simply overfitting or biased; that is also possible, and it is something you can investigate by checking your OLS assumptions. Before going to that stage, I wanted to briefly show you the idea of predictions.
We have now fitted our model on the training data and we are ready to perform predictions. We can take our fitted model and use the test data, X_test, to predict median house values for the blocks of houses for which we are not providing the corresponding median house price. On this unseen data we apply the model we have already fitted, obtain the predicted median house values, and then compare these predictions to the true median house values, which we have but are not yet exposing, to see how good a job our model does of estimating the unknown median house values for the test data: the blocks for which we provided the characteristics in X_test but not y_test. As with training, we add a constant with this library, and then we call model_fitted.predict on the test data; the result is the test predictions. Once we do this and print them, you can see that we get a list of house values: the predicted median house values for the blocks of houses included in the test data, the 20 percent of our entire data set.
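A sketch of that prediction step with statsmodels, using the same hypothetical variable names as in the fitting sketch above:

```python
import statsmodels.api as sm

# The test features need the same constant column as the training features
X_test_const = sm.add_constant(X_test)

# Predicted median house values for the unseen blocks
test_predictions = model_fitted.predict(X_test_const)
print(test_predictions.head())
```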
Like I mentioned just before, in order to ensure that your model is performing well you need to check the OLS assumptions. During the theory section we learned that there are a few assumptions your model and your data should satisfy for OLS to provide unbiased and efficient estimates. Efficient means the estimates are precise and their standard errors are low, something we also see in the summary results: the standard error measures how efficient your estimates are, how much the coefficients shown in this table could vary. If a coefficient could swing across a very wide range, its standard error will be very large, which is a bad sign; if you are dealing with a precise estimation, the standard error will be low. Unbiased estimates mean that your estimates are a true representation of the pattern between each independent variable and the response variable. If you want to learn more about this idea of bias, unbiasedness and efficiency, make sure to check the Fundamentals of Statistics course at LunarTech, because it explains these concepts clearly and in detail; here I'm assuming you know them, or at least I suggest you know them at a high level. Now let's quickly check the OLS assumptions. The first assumption is the linearity assumption: your model should be linear in parameters.
One way of checking that is by using your fitted model and its predictions: you take y_test, the true median house values for your test data, and the test predictions, the predicted median house values for that unseen data, plot them against each other, and also plot the line you would get in the ideal situation where the model makes no error and returns the exact true values. Then you look at how linear this relationship actually is. If the pattern of observed versus predicted values, where observed means the real test y and predicted means the test predictions, is roughly linear and close to that perfect line, then assumption one, the linearity assumption, is satisfied, and you can say that your data and your model are indeed linear in parameters.
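A sketch of that observed-versus-predicted check; the diagonal reference line is the ideal no-error case:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(y_test, test_predictions, alpha=0.3)

# Reference line: points would lie on it if the predictions were perfect
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="red")

plt.xlabel("Observed median house value (y_test)")
plt.ylabel("Predicted median house value")
plt.title("Observed vs Predicted: linearity check")
plt.show()
```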
Then we have the second assumption, which states that your sample should be random; this basically translates into the expectation of your error terms being equal to zero. One way of checking this is simply to take the residuals from your fitted model, model_fitted.resid, and compute their average, which is a good estimate of the expectation of the errors; the residuals are the estimates of your true error terms. Here I just round the result to two decimals. If this average error, the estimate of the errors that we refer to as residuals, is equal to zero, which is the case here, it means that the expectation of the error terms, or at least its estimate based on the residuals, is indeed zero. Another way of checking this second assumption, that the model is based on a random sample and therefore the expectation of the error terms is zero, is to plot the residuals versus the fitted values: we take the residuals from the fitted model, compare them to the fitted values of the model, and look at this scatter plot, checking whether the pattern is symmetric around zero. You can see that the zero line runs right through the middle of the pattern, which means that on average the residuals are centred around zero, so the mean of the residuals is equal to zero, which is exactly what we calculated before, and therefore we can say we are indeed dealing with a random sample.
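A sketch of both of those checks, the residual mean and the residuals-versus-fitted plot, using the fitted statsmodels model from above:

```python
import matplotlib.pyplot as plt

# Residuals are the in-sample estimates of the error terms
residuals = model_fitted.resid
print(round(residuals.mean(), 2))  # should print 0.00 if assumption 2 holds

plt.figure(figsize=(8, 6))
plt.scatter(model_fitted.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color="red")  # residuals should scatter symmetrically around zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted Values")
plt.show()
```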
This plot is also super useful when it comes to the fourth assumption, which we will get to a bit later. For now let's check the third assumption, the assumption of exogeneity. Exogeneity means that each of our independent variables should be uncorrelated with the error terms: there is no omitted variable bias and no reverse causality, which means the independent variable has an impact on the dependent variable but not the other way around; the dependent variable should not cause the independent variable. There are a few ways of checking this. One straightforward way is to compute the correlation coefficient between each independent variable and the residuals obtained from your fitted model, the best estimates of your error terms. That is a simple, quick technique to understand whether there is a correlation between your independent variables and your error terms.
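A sketch of that quick correlation check, looping over the hypothetical training columns used earlier:

```python
# Correlation between each feature and the residuals should be close to zero
for column in X_train.columns:
    corr = X_train[column].corr(model_fitted.resid)
    print(f"{column}: {corr:.3f}")
```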
Another, more advanced way, a bit more towards the econometric side, is to use the Durbin-Wu-Hausman test. This is a more formal econometric test to find out whether the exogeneity assumption is satisfied or whether you have endogeneity, which means that one or more of your independent variables is potentially correlated with your error terms. I won't go into the details of this test; I'll put some explanation here, and feel free to check any introductory econometrics course to learn more about the Durbin-Wu-Hausman test for the exogeneity assumption. The fourth assumption is homoskedasticity, which states that the error terms should have a constant variance: when we look at the variation the model is making across different observations, that variation should be roughly constant. If instead we see observations for which the residuals are quite small and others for which they are quite large, as in this figure, we are dealing with heteroskedasticity, which means the homoskedasticity assumption is violated: our error terms do not have a constant variance across all observations, and the spread differs from observation to observation. When we have heteroskedasticity we should consider somewhat more flexible approaches, like GLS, FGLS or GMM, all somewhat more advanced econometric methods.
this all but for machine learning traditional machine learning site by using the psychic learn so uh in here um
I'm using the um standard scaler function in order to uh scale my data because we saw uh in the summary of the
table um that we got from the stats uh Mor API that our data is at a very high scale because the uh median house values are those large numbers the uh age uh
the median age of the house is in this very large numbers that's something that you want to avoid when you are using the linear regression as a Predictive Analytics model when you are using it for interpreting purposes then you
should keep the skilles because it's easier to interpret those values and to understand uh what is this difference in the median price uh of the h house when
you compare different characteristics of the blocks of houses but when it comes to using it for Predictive Analytics purposes which means that you really care about the accuracy of your
predictions then you need to uh scale your data and ensure that your data is standardized one way of doing that is by using this standard scaler function uh in the psyit learn.
The way I do it is that I initialize the scaler by calling StandardScaler(), which I just imported from the scikit-learn library, and then I take the scaler and call fit_transform on X_train, which basically means: take the independent variables and scale and standardize them. Standardization simply means we standardize the data we have to ensure that some large values do not wrongly influence the predictive power of the model, so the model is not confused by the large numbers and does not pick up a spurious variation, but instead focuses on the true variation in the data, namely how much a change in one independent variable causes a change in the dependent variable. Given that we are dealing with a supervised learning algorithm, X_train_scaled will then contain our standardized training features, so the independent variables, and X_test_scaled will contain our standardized test features, the unseen data that the model will not see during training but only during prediction.
Then we will also use y_train: y_train is the dependent variable in our supervised model and corresponds to the training data. We first initialize the linear regression, so the LinearRegression model from scikit-learn; this is just the empty linear regression model. Then we take this initialized model and fit it on the training data, so X_train_scaled, the scaled training features, and the dependent variable from the training data, y_train. Do note that I'm not scaling the dependent variable; this is common practice, because you don't want to standardize your dependent variable, you want to ensure that your features are standardized. What you care about is the variation in your features, so that the model doesn't get confused when it's learning from those features and when it looks at the impact of those features on your dependent variable.
So I am fitting the model on the training data, the features and the dependent variable, and then I'm using this fitted model, lr, which has already learned from those features and the dependent variable during supervised training. I'm then using X_test_scaled, the standardized test data, to perform the prediction, so to predict the median house values for the test data, the unseen data. You can notice that nowhere am I using y_test; I'm keeping y_test to myself, since it contains the true values of the dependent variable, so that I can then compare them to the predicted values and see how well my model was able to get the predictions right. Now let's do one more step: I'm importing the metrics from scikit-learn, such as mean_squared_error, and I'm using the mean squared error to find out how well my model was able to predict those house prices. This tells us that on average we are making an error of about $59,000 on the median house prices; whether that is large or small is something we can look into. A minimal sketch of these steps in scikit-learn is shown below.
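For reference, here is a hedged sketch of the pipeline just described (scaling, fitting, predicting, computing the error), assuming X_train, X_test, y_train, y_test already exist from a train/test split; variable names follow the description rather than the exact course notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn scaling on the training features only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)                 # note: y_train is left unscaled
y_pred = lr.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)  # roughly corresponds to the ~$59,000 average error mentioned above, if read as an RMSE
```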
Like I mentioned in the beginning, the idea behind using linear regression in this specific course is not to use it as pure traditional machine learning, but rather to perform causal analysis and to see how we can interpret it. When it comes to the quality of the predictive power of the model, if you want to improve it, the next step could be to check whether your model is overfitting and then, for instance, apply lasso regularization, so lasso regression, which addresses overfitting (a minimal sketch follows below). You can also consider going back and removing more outliers from the data; maybe the outliers we removed were not enough.
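As an illustration only, since the course does not show this code, lasso regression can be swapped in for the plain linear regression in a couple of lines; the alpha value here is a placeholder that would normally be tuned, for example with cross-validation:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0)            # regularization strength; tune via cross-validation
lasso.fit(X_train_scaled, y_train)  # same scaled features and unscaled target as before
print(lasso.coef_)                  # the L1 penalty can shrink some coefficients exactly to zero
```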
Another thing you can do is to consider somewhat more advanced machine learning algorithms, because it can be that, even though the regression assumptions are satisfied, using more flexible models like random forests, decision trees, or boosting techniques will be more appropriate and will give you higher predictive power. Consider also working more with the scaled or normalized versions of your data.
Interested in machine learning or data science? Then this course is for you. We will build a movie recommender system using feature selection, count vectorization, and cosine similarity, and at the end of the video we will create a web app using Streamlit. Building projects is one of the most effective ways to thoroughly learn a concept and develop essential skills. This guide will walk you through building a movie recommendation system that's tailored to user preferences. We will leverage a dataset of roughly 10,000 movies as our foundation; while the approach is intentionally simple, it establishes the core building blocks common to the most sophisticated recommenders in the industry, think Netflix, Spotify, and others. We will harness the power and versatility of Python to manipulate and analyze our data: the pandas library will streamline data preparation, and scikit-learn will provide robust machine learning tools, namely CountVectorizer and cosine similarity. User experience is key, so we will design an intuitive web application for effortless movie selection and recommendation display at the end of the video. You will develop a data-driven mindset, understand the essential steps in building a recommendation system, master core machine learning techniques tied to data manipulation, feature engineering, and machine learning for user recommendations, and create a user-centered solution that delivers a seamless experience for personalized movie suggestions. So let's get started.
All right, so let's now go over the dataset we will be using for the movie recommender system. We are using the TMDB movies dataset from Kaggle. This dataset is crucial for developing a system that recommends films tailored to your preferences and introduces you to new titles. We selected it for its comprehensive movie data: it includes an id, which is essential for movie identification, a title, genre, original language, and many other features, but the key features we will focus on are id, title, genre, and overview, and on top of this we will combine overview and genre together into a tags feature. We selected this dataset because of its size: it contains around 10K top-rated TMDB movies and, as you can see, it has many features. So let's move on to the next chapter, which is feature engineering.
Okay, so now let's go over the features that we'll be using. Features in a recommender system are essentially the data points you use to make decisions about what to recommend. These features help identify similarities between movies, which is crucial for generating personalized recommendations. For your system to be effective, it's vital to select features that offer meaningful insight into the content of the movies and the preferences of the users, so you have to be careful about which features you choose. For our movie recommender system we will focus on several key features. Id: this serves as a unique identifier for each movie, crucial for indexing and retrieving movie information accurately. Title: the most basic yet essential feature, used to identify movies by their names. Genre: this categorizes movies into different groups, facilitating recommendations based on content similarity and user preferences; genre plays a pivotal role in personalization. And of course the overview: offering a brief summary of the movie's plot, the overview acts as a rich source for content-based filtering through NLP. We will be using the overview combined with the genre to create a very comprehensive descriptor for each movie,
so that you're able to recommend movies more accurately. Combining the overview with the genre into a single tags feature gives us a fuller picture of each movie. This combination helps the system to better analyze and find movies that are similar in theme, story, or style. For example, let's consider a movie like Inception. Its overview might read something like "a thief who steals corporate secrets through dream-sharing technology is tasked with planting an idea into the mind of a CEO", while the genre could be listed as action, science fiction, adventure. If you combine these into one text, "action science fiction adventure thief corporate secrets dream-sharing technology planting an idea CEO", you get a much fuller picture, which might lead the system to recommend a movie like The Matrix. The main point is that when you combine the overview with the genre you get a much richer feature, which you can convert into a much more informative numerical data point, and which you can use to better recommend movies.
Before using the overview and genre data, we pre-process the selected features, because many movies include stop words in their title, genre, or overview, and those are words that don't contribute anything meaningful. For "The Lord of the Rings", for example, we would remove "the" and "of". So we pre-process the text, remove stop words such as "the", "and", "in", and "his", and clean our data. With that, we have done our feature selection; now let's move on to content-based versus collaborative filtering recommender systems, and explore
the key recommender systems currently being used by Netflix, Amazon, and other big tech companies. There are two main types: content-based and collaborative filtering recommender systems. Let's start with the content-based recommender system. A content-based recommender system only uses the features and the overview of the movies that you have liked to recommend similar movies. It's like telling a friend "I liked Iron Man", and, based on its features, such as the director, the genre, or the overview, it recommends similar movies. It won't use any other data, for example what other people have liked, what their ratings of other movies are, or what ratings you have given to other movies; it's based solely on the features of the movies you have liked previously. For example, if you've enjoyed Inception, a content-based system might suggest Interstellar, because both movies share a similar director, a complex narrative structure, genre, and overview.
Now let's go on to collaborative filtering recommender systems. On Netflix, for example, if you have watched and enjoyed Stranger Things, Netflix might recommend The Witcher to you, because other users who liked Stranger Things also enjoyed The Witcher. What's important to note here is that it doesn't use the features, the overview, or the other informative data points; it only uses what other users have liked, specifically users with similar preferences to yours. That's the difference: the recommendation is made based on the viewing habits and preferences of a larger group of viewers with similar taste to yours. This method doesn't rely on item features but on the wisdom of the crowd; it uses patterns of ratings or interactions from many users to predict what an individual might like. For example, if users who liked The Avengers also enjoyed Guardians of the Galaxy, you might receive a recommendation for Guardians of the Galaxy if you liked The Avengers. While both systems are effective on their own, combining them would enhance the accuracy of the recommender system: for example, you start off with a content-based recommender system, but once you start collecting more data you can also use collaborative filtering to provide more accurate recommendations.
In this session we will focus on the crucial element of transforming text into numerical vectors. Our models can't be trained on raw text, but they can be trained on numerical vectors, which means we have to convert our text into vectors. To do that we use CountVectorizer. This method simplifies text analysis by ignoring the order of words and instead focusing on their frequency. By turning text into numerical data we'll be able to compare documents, a vital function that allows our system to process and organize large amounts of text data efficiently. We aim to provide a straightforward, practical understanding of this essential technique so that we can move on to our next chapter, which is cosine similarity.
To provide a more targeted example, let's say we consider three movie overviews: "an action-packed adventure", "adventure movies inspire me", and "we both love heart-racing adventures". If we list the words used across these sentences, we get a vocabulary like: an, action, packed, adventure, movies, inspire, me, we, both, love, heart, racing. Each overview can then be described by how often it uses each of those words. The first overview, "an action-packed adventure", uses "an", "action", "packed", and "adventure" once each and none of the other words, so its vector is 1 1 1 1 0 0 0 0 0 0 0 0. The second overview, "adventure movies inspire me", uses "adventure", "movies", "inspire", and "me" once each and none of the rest, so its vector is 0 0 0 1 1 1 1 0 0 0 0 0. You can do the same for "we both love heart-racing adventures", but I think you get the idea. This step is key, as it transforms textual information into a numerical format, a vector. Each movie overview is converted to a vector in a high-dimensional space where each unique word is a dimension and the word's frequency in the overview is the value in that dimension. This structure allows machine learning models to interpret the data and handle tasks such as genre classification.
As another example, let's say we have the movie titles Iron Man 1, Iron Man 2, Iron Man 3, and The Avengers, and we try to create vectors out of them. The words used across the titles are: iron, man, 1, 2, 3, the, avengers. For Iron Man 1 you can expect the vector 1 1 1 0 0 0 0; for Iron Man 2 the vector 1 1 0 1 0 0 0; for Iron Man 3 the vector 1 1 0 0 1 0 0; and for The Avengers the vector 0 0 0 0 0 1 1.
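As a hedged illustration of the same idea in code, not part of the course notebook, scikit-learn's CountVectorizer produces exactly these kinds of count vectors (note that it lowercases text and, by default, drops single-character tokens, so its vocabulary may differ slightly from the hand-worked one above):

```python
from sklearn.feature_extraction.text import CountVectorizer

overviews = [
    "an action packed adventure",
    "adventure movies inspire me",
    "we both love heart racing adventures",
]

cv = CountVectorizer()
vectors = cv.fit_transform(overviews)

print(cv.get_feature_names_out())  # the vocabulary, one dimension per unique word
print(vectors.toarray())           # one count vector per overview
```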
This process of count vectorization, translating movie titles and overviews into numerical vectors, is a cornerstone of text analysis. It allows us to convert unstructured text, such as movie titles, into a structured format that can later be used in our machine learning algorithms, and we can extend this process to a larger and more complex body of text. For instance, we could apply count vectorization to movie descriptions, reviews, or even entire scripts; regardless of the text's size or complexity, the count vectorization method remains effective and allows us to handle a broad spectrum of text data.
To understand the concept of cosine similarity in the context of movies, let's take a simple example. Imagine we are comparing two movies based on their genres: one is "Sci-Fi Thriller" and the other is just "Sci-Fi". Cosine similarity is defined as the dot product of the two vectors divided by the product of their magnitudes, cos(A, B) = (A · B) / (||A|| × ||B||). With the vocabulary [sci-fi, thriller], "Sci-Fi Thriller" converts to the vector A = (1, 1) and "Sci-Fi" to the vector B = (1, 0). The dot product is 1×1 + 1×0 = 1. The magnitude of A = (1, 1) is the square root of 1² + 1², which is √2, and the magnitude of B = (1, 0) is the square root of 1² + 0², which is 1. So the cosine similarity is 1 / (√2 × 1) = 1/√2 ≈ 0.71; that is how similar the two movies are.
Now let's take a broader example. We have Iron Man 1, and we compare it with The Avengers and Oppenheimer. Iron Man 1 is action and sci-fi, The Avengers is action, sci-fi, and adventure, and Oppenheimer is drama and historical. We will calculate the cosine similarity between these movies based on their genres and recommend movies accordingly. If we make vectors out of the genres of each movie, using the vocabulary [action, sci-fi, adventure, drama, historical], then Iron Man 1 is (1, 1, 0, 0, 0), The Avengers is (1, 1, 1, 0, 0), and Oppenheimer is (0, 0, 0, 1, 1). Let's calculate the similarity between Iron Man 1 and The Avengers. As I mentioned, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, ||A|| × ||B||. Taking Iron Man 1 = (1, 1, 0, 0, 0) and The Avengers = (1, 1, 1, 0, 0), the dot product is 1×1 + 1×1 + 0×1 + 0×0 + 0×0 = 2, the magnitudes are √2 and √3, and so the similarity is 2 / (√2 × √3) ≈ 0.82.
Now, if we calculate the cosine similarity between Iron Man 1 and Oppenheimer, we get zero, because the two genre vectors share no common words, so their dot product is 0.
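A minimal sketch of the same calculation in code, purely illustrative rather than the course's implementation, could look like this:

```python
import numpy as np

def cosine(a, b):
    # dot product of the two vectors divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# genre vocabulary: [action, sci-fi, adventure, drama, historical]
iron_man_1  = np.array([1, 1, 0, 0, 0])
avengers    = np.array([1, 1, 1, 0, 0])
oppenheimer = np.array([0, 0, 0, 1, 1])

print(cosine(iron_man_1, avengers))     # ~0.82
print(cosine(iron_man_1, oppenheimer))  # 0.0
```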
So, as you can see, we calculate the cosine similarity using this formula. All right, let's get started. I have written the code already, so I will just walk through it.
Step one is to import pandas: since we will be using pandas, we must first import it, and install it if needed. This is the data exploration and pre-processing part. Let's start with the feature selection part, which means we first list all the columns in order to identify the relevant features. These are all the columns we have inside our dataset; we are going to combine overview and genre into a column which will have the name tags, and that's it. Perfect.
Now there is a new column, tags; we have this additional tags column, and it is the only text we will be using to run our model on, which means we don't need overview and genre anymore, so we can get rid of them. We do that by setting movies equal to movies.drop and dropping the columns overview and genre. Perfect: as you can see, we now have only id, title, and tags, and this is great. A brief sketch of these steps is shown below.
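Here is a minimal, hedged sketch of the data preparation just described; the CSV file name and the exact column names are assumptions based on the description of the TMDB dataset, not copied from the course notebook:

```python
import pandas as pd

# load the Kaggle TMDB dataset (file name is illustrative)
movies = pd.read_csv("top10K-TMDB-movies.csv")

# keep the key features and combine overview + genre into a single 'tags' column
movies = movies[["id", "title", "overview", "genre"]]
movies["tags"] = movies["overview"].fillna("") + " " + movies["genre"].fillna("")

# overview and genre are no longer needed on their own
movies = movies.drop(columns=["overview", "genre"])
print(movies.columns)  # id, title, tags
```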
Now on to the text cleaning part. We will import NLTK and the necessary modules for our text preprocessing, and run this cell first, because it will download the necessary NLTK resources for pre-processing. We start the cleaning function by checking the text: first we want to make sure it is actually a string, which we do by saying "if not isinstance(text, str)". Then we want the text column to be all lowercase, we want to remove punctuation and digits, and we also want to tokenize the text and join the words back together. Next, we import and, if needed, install the remaining dependencies, apply the clean_text function to the tags column, and create a new column with the cleaned text. So this is where we clean the data.
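A hedged sketch of such a cleaning function, assuming NLTK is installed; the function and column names are illustrative, since the exact course code is not fully audible in the transcript:

```python
import string
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer resources

def clean_text(text):
    if not isinstance(text, str):   # guard against missing / non-string overviews
        return ""
    text = text.lower()
    # strip punctuation and digits
    text = "".join(ch for ch in text if ch not in string.punctuation and not ch.isdigit())
    tokens = word_tokenize(text)    # tokenize, then join the words back together
    return " ".join(tokens)

movies["tags_clean"] = movies["tags"].apply(clean_text)
```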
Okay, so let's initialize the CountVectorizer with max_features set to 10,000 and stop_words set to "english". What we are saying here is that our maximum number of features is 10,000 and that we must remove all the stop words contained in the English vocabulary. Right, so now we fit the text data and vectorize it into an array. Perfect. Now we can import cosine_similarity and compute the cosine similarity: through this line we calculate the cosine similarity between the movies based on their vector representations; we have already vectorized the tags through our CountVectorizer, so now we can calculate the similarity between them by calling cosine_similarity on the vectorized data.
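A minimal sketch of these two steps, vectorizing the cleaned tags and computing the similarity matrix, with the same parameters mentioned above; the variable names are assumptions carried over from the earlier sketches:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(max_features=10000, stop_words="english")
vectors = cv.fit_transform(movies["tags_clean"]).toarray()

# pairwise cosine similarity between every pair of movies (n_movies x n_movies)
similarity = cosine_similarity(vectors)
```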
Perfect, it runs; let's look at the similarity matrix. Great. Now if we call .info() on the data frame, it is still the same: 10,000 ids, 10,000 titles, and still some missing overviews or tags. Perfect, everything looks fine. Now we want to see how our model has performed, so we are going to identify and print the titles of the top five most similar movies to a given movie based on cosine similarity. Here we will do it for movie index 4. Which one was movie 4? It is The Godfather Part II. With this function we can check whether our model works: what we want to do is find the movies that have the highest cosine similarity with movie 4, return the list sorted in reverse order, meaning the most similar movie comes on top, and take only the first five entries, because we want to recommend five movies. Then, of course, we print the titles from the list containing the five most similar movies to movie 4.
Perfect: since movie 4 is The Godfather Part II, the recommendations that come back are other Godfather-related titles. All right, so let's now create a function that recommends movies based on the title of the movie, not the id, just the title, and check that it works; a hedged sketch of such a function is below.
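A minimal sketch of a title-based recommendation function, under the same assumed variable names (the movies DataFrame and the similarity matrix) as in the earlier sketches:

```python
def recommend(title, n=5):
    # look up the row index of the selected movie by its title
    index = movies[movies["title"] == title].index[0]

    # pair each movie with its similarity score and sort, highest first
    scores = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)

    # skip position 0 (the movie itself) and print the next n titles
    for i, _ in scores[1:n + 1]:
        print(movies.iloc[i].title)

recommend("The Godfather Part II")
```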
Okay, so now let's save the modified data frame and the similarity matrix for later use; we are going to use them in our web app. So let's import pickle, dump the movie list and the similarity matrix to files, and check that we can load the saved similarity scores back. Perfect. A minimal sketch of this step is below.
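The file names below are illustrative, not necessarily the ones used in the course:

```python
import pickle

# save the prepared DataFrame and the similarity matrix for the web app
pickle.dump(movies, open("movie_list.pkl", "wb"))
pickle.dump(similarity, open("similarity.pkl", "wb"))

# quick check that the saved similarity matrix loads back correctly
similarity_loaded = pickle.load(open("similarity.pkl", "rb"))
print(similarity_loaded.shape)
```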
All right, I will see you in the last part, which is building the actual web app. As you can see, the front end will be provided to you, and the only thing you have to do is go through the contents of the app file: import streamlit, import pickle, import requests. We first have to write the function to fetch a movie poster using the movie id; this function will be provided to you, and it works by connecting to the API. Then we load the movie data and the similarity matrix; you do that by setting movies equal to pickle.load (we dumped them in Colab, and now we're loading them), and we do the same for the similarity matrix and the movie list. Now let's create the header for the web app, which will show
the title "Movie Recommender System". Next we have to bring in the necessary Streamlit components: we will be creating an image carousel, and for that we must first import the components module. We'll be using the carousel component from our front end, and as a baseline we want to be able to fetch movie posters using the movie ids, so there is a base set of movies shown before anything is recommended, just so the page is not empty, and this works just fine. Of course, we then display the image carousel component and create a drop-down menu for choosing movies on the page. Now let's create our recommend function.
The way we are going to do it is by first finding the index of the selected movie in our data frame based on its title, then calculating its similarity scores against all the movies and ranking them from the most similar, meaning the highest cosine similarity score, to the lowest. We are going to recommend five movies; of course, you could recommend ten or fifty movies, that's completely up to you. So let's first create an index variable, then compute the similarity scores and sort them by distance. As you can see, we first calculate the similarity scores and return them in reverse order, from the highest ranking to the lowest. Then we initialize two empty lists, one for the recommended movie names and one for their posters, and fill them in.
Perfect, and it works perfectly. All right, let me walk you through the code to show you exactly how you can do it as well. We first import streamlit, pickle, and of course requests, to fetch the poster images for our movie ids, because we don't have them stored. Then we load our data, the movies list and of course the similarity scores, and we grab the titles. Next we create the header of our web app, and to create a carousel we must first import the components module for Streamlit. Here we initialize our carousel component and feature some movie posters, so there is a basic list of images, or rather movies, that people can browse before anything is recommended; it's just a basic image carousel. Then we display the image carousel component and, of course, create the drop-down menu. And here's the main function of our movie recommender system: we first find the index of the selected movie, then we calculate its distance, or rather the similarity score, against all the movies and return the ones that rank the highest, which means the most similar movies are returned. So the main function allows us to recommend movies based on the index: we initialize the index of the movie, calculate the most similar movies with respect to our selected movie, initialize the lists of movie names and posters, fill them in, and once everything is filled we return them. This is the button we click to recommend movies; we have five columns, and each column shows a title, and it works just fine, as you can see, when we click recommend for Iron Man. A compact end-to-end sketch of such an app is shown below.
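For reference, here is a hedged, simplified Streamlit sketch consistent with the walkthrough above; poster fetching via the TMDB API is omitted, and the file names are the same assumed ones as in the pickle sketch:

```python
import pickle
import streamlit as st

# load the artifacts saved earlier
movies = pickle.load(open("movie_list.pkl", "rb"))
similarity = pickle.load(open("similarity.pkl", "rb"))

st.header("Movie Recommender System")
selected = st.selectbox("Select a movie", movies["title"].values)

def recommend(title, n=5):
    index = movies[movies["title"] == title].index[0]
    scores = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i].title for i, _ in scores[1:n + 1]]

if st.button("Show recommendations"):
    names = recommend(selected)
    for col, name in zip(st.columns(5), names):
        col.text(name)
```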
Thank you for watching this video; I hope you enjoyed creating the movie recommendation system. If you'd like to watch more of this content, make sure to subscribe, and other than that, I will see you in the next video. When you want to enter the industry, you need to show that you can do the work for the role you're going to be hired for; my starting point is showing a data science portfolio, showing that you could actually do the work. A strong data science portfolio, plus communication and translation skills and business acumen, is all a plus, something extra. As a hiring manager, during the hiring process you pay attention to the projects candidates have completed, unless they already have experience.
Hi everybody! Welcome, Cornelius, really excited to have you here: an experienced data scientist, a top voice in the field of data science, with a wealth of knowledge to share with you. Cornelius, you are a data science manager at Allianz, so can you walk us through your journey in the data science field and how you climbed the corporate ladder? Oh, that's a nice story, I think. I will go back to the beginning,
before I go into all the corporate stuff. I started like every student, every aspiring scientist: I wanted to become a scientist, but I didn't major in any data science field; I actually majored in biology and evolutionary biology. My bachelor's and my master's thesis were all about biology, and I'm a researcher at heart. But there was a moment when I decided I wanted to become a data scientist. It first came during my master's studies, when I was listening to a webinar and trying to see what kind of job I could have in the future, because as a biologist, especially in my country, Indonesia, it's a little bit hard to find a job with real financial security. So I tried to find something that was still related to my biology background, that still needed my passion for research, but that could actually make some money, and that's when I found something called data science. Watching that webinar about data scientists, I tried to learn about it: okay, they use statistics, they use these tools, they use data to actually solve business problems, and it's actually used in business. That day I realized, okay, this could be my next career move. So during my master's, even before I graduated, I tried to learn as much as possible about data science: I joined online courses, I joined communities on social media, and I read as much as possible. When I came back from my master's to Indonesia, I focused on self-learning again, I joined offline classes for data science, and I tried to make as many connections as possible from the network I already had, basically asking, "could I become a data scientist in your company?" With a bit of hard work, from 2018 to 2019, so about a year and a half of moving from field to field, I became a data scientist.
Super cool, that's quite an impressive journey. In such a short amount of time you managed to go through all these different levels, because data science can be tough, especially in the beginning when you want to get your first job as a junior data scientist with almost no experience, right? And you went from biology, not a so-called traditional data science study, all the way to the field of data science. So let's talk about that: can you talk about some of the challenges you overcame when you were just starting out with no experience at all? Yeah, there were a lot of challenges. When I first started as a data scientist I didn't know
anything; I knew a little bit about programming, a little bit about statistics, a little bit here and there. But once you are in a corporate environment, like my first job was, there are a lot of things to learn, especially on the business side. You can develop some model, I mean, you can run the code, but is it solving the business problem? Is it really valuable to our business? Is it convincing enough to be used by the business people, the business team, or even by the customers, so that they actually want to use my model? This is something I learned during my first year: I was still learning what the business itself is about, because as a junior data scientist you usually work to execute, to perform whatever the task is going to be, right? But in a corporation it's a little bit different. In a startup you might get to do a bit of everything, but in a corporation everything is already structured and organized, the business is already moving, the business already knows what it wants, what makes money, and what data science should be there for: data science could be there to automate things or to investigate new opportunities in the business. But as a junior data scientist you need to understand why the business process works the way it does. For example, in my early experience as a junior, one of the first models I created was
what is called a propensity-to-buy model. It's a model that tries to predict which customers should be approached to buy a new product. It seems easy: you just take the customer data, figure out who bought and who didn't, and build a model on that. But there are actually a lot of moving parts, a lot of puzzle pieces: you have the model, but what product are you actually selling? Who is the target customer? Who is going to act on this list? Where is the communication going to happen? How long is the campaign going to run, is it continuous or just one time? There is so much moving around in there, and you need to work a lot with the business. That's one of the areas I was still relearning, but it's actually one of the things that boosted my career, because I learned so much about the business, and I also learned about communication with the business side. Like I said before, you want to convince the customer, and by customer I mean the user, the people who are going to use this model. I can do my job, but can you believe in my job? That takes really good translation from our technical terms into business terms,
basically. I could present my model as, for example, a random forest: this is how it works, this is the precision, this is the recall of the classification model. But no, they don't care, right? What I understood is that, for example, when I create a propensity model and we simulate it, I can tie it to their KPIs: your KPI is, say, monthly revenue, and in the simulation this model could increase revenue by 20% compared to your normal process. That's the kind of example that works, but it requires rethinking how we go from our technical language to how the business actually speaks. I always call it translation.
I think I couldn't have said it better than what you just said. It's really important: also from my experience, and from what I have seen in my colleagues, being a data scientist is not just crunching numbers. Many people think it's just statistics, or maybe some mathematics, or purely data, but it's exactly what you just said, all these different skill sets that have to come together, like business acumen, which you just mentioned, and communication, translation from business to technical terms. Usually the product managers will never come and tell you "please make a classification model for me to classify this thing"; they just come and say "I need to do this", and then you need to realize that, oh, you need a classification model, you need to select certain approaches. And you perfectly captured that this is a combination of different things which from one view might seem very easy, but when you dive deeper: how to clean the data, where to collect the data, where to store it, how to process it, how to make an impact, how to measure it. So I totally understand how you could grow and go through the corporate ladder very quickly, because early on in your career you had an opportunity to learn and gain all these different skills. And I think for our audience, for our aspiring data scientists, this is definitely a note to make: if they want to grow quickly in their career, they need to be prepared to work on their communication skills, their business skills like you did, their translation skills, how to translate business KPIs and OKRs into actual data science problems. On that note, since you mentioned a very interesting project: was there any particular project early in your journey, when you were a junior or mid-level data scientist, that made your career, that set you apart from the others and got you promoted? Yeah, precisely,
it's actually cumulative; I wouldn't say it was just one project. But maybe I'll go back a little to the beginning. One of the things I tried to do from my junior days onwards was to take initiative. I wasn't just going along with "okay, this is your project, you need to do this and that"; even at my junior level, I tried to talk to my boss: "I know this is a really interesting project, I want to take it", or "I know this is going to have a good business impact, so can we talk to the business user about a project we could create together?" So I tried to take initiative in my own career, so I could see where I could move at that time. It accumulated, and there is one project from that period that I remember: an NLP project. It basically tried to predict whether a customer email contained a complaint or not. Customers sometimes complain, right, but it's not just about whether there is a complaint; what we wanted to predict is whether this kind of complaint could be damaging to our reputation or not. The project I undertook was something new; it was not the kind of project that had been done before in our company, and I had to convince the business user and my boss that I wanted to take it on, that it could be really useful for our business, and that I would take responsibility for managing it. It went well, and even now the business users still want to work with me on all kinds of projects, because I started that project and took the initiative, and they still approach me when they have a problem or another idea beyond the ones we've already done. So it starts simple, it starts with yourself, but this kind of thing will be seen, by your boss, your colleagues, any business partner. So just take initiative; I think it can really make your career, it really makes you stand out.
So, be proactive: if you are already a junior data scientist, you already have some basic skills, you have worked under the supervision of some senior data scientists for a couple of months, now it's time to take initiative, to look around, network, and see what kind of projects are on the table, and then go and try to make something of them. Even if something seems boring or uninteresting, you kind of need to dig deeper, like you did, right? You basically identified the projects and said to your boss, "well, can I work on this?", and whether it is impactful matters, because at the end of the day you will always go to the next level if the project you are doing has a lot of impact, right? Yes, yes, that's true. And to add to that: even before I became a data scientist, I already knew what I wanted to be and what I wanted to do, and once I got there I didn't stop; I tried to make what I call a master plan. This is going to be my career, but I need to take it into my own hands. This is also what my boss always said, and it really inspired me: your career is in your hands, your life is in your hands, so every move you make, you need to be responsible for it, but it also needs to be something that improves you as a person, your career, or something else. That's why I always try to take initiative. And it worked out, right? Because now you are a data science manager, you are a top voice in data science, so it worked out well, which is amazing.
On that note, given your non-traditional data science background, because you have a background in biology, and there are many people among our listeners who want to make a career change or who come from a non-traditional data science background, meaning they don't have a traditional statistics, mathematics, or data science master's degree, I wanted to ask you: can you tell our audience about the impact your unique background had on your career? It's really interesting, yeah, because as a biologist I don't use much of the specific content of my education, for
example genes or proteins; of course I'm not using those in my everyday work. It's more about how I learned to think methodically during my time working as a biologist, as a researcher: how I structured my work back then, how I used statistics. In the end, I still use that scientific way of thinking from my education. It really plays a part in how I break things down, because in academia the structure of your work has to go from the theory, to the methodology, to how it is going to work, to the results and the conclusion. All of that previous experience actually helped me in working as a data scientist. Of course, every person has a different background; for me, because I come from the science side, that kind of research methodology is really useful as a data scientist. But I know there could be people coming from literature, from philosophy, from languages, something really not related to programming or data at all, and I think they can bring a different, unique perspective as well. In the end, how you approached your work during your studies can carry over to the way you work now, and if you already have the basics, I think you can learn it even if you are not from a data science major.
So you basically used your background to your advantage, even if it was a non-traditional data science background. Yes. So what would be your advice to anyone who has this non-traditional background, in terms of specific steps? Yes, having a roadmap, for me, is the best one. I know it sounds a little bit cliché, but having a roadmap to follow is really helpful. There is already a lot online about how to become a data scientist step by step: first learn statistics, second learn programming, third learn machine learning, fourth build machine learning and data science projects, fifth learn to apply, to communicate, to present. These steps might seem boring, but following them is actually important, because it's a structured way to learn. To become a data scientist, I'd say just start small, but those skills, statistics, programming, and machine learning, are all connected to each other, and they are going to help you become the data scientist you want to be. So just follow a roadmap that already exists; it's going to be more helpful than trying to learn by yourself, jumping here and there, learning statistics and then forgetting about the math, or learning programming and then forgetting about how to present it to the business user. Just follow the steps one by one, and I think you will already be in good shape.
So, learning in an organized way, basically following the roadmap. Yeah. Okay, and have a clear plan, I suppose? Yes, having a clear plan is actually really helpful, and this comes from my experience as well, because I tried to create a plan for myself to become a data scientist, and then I tried to follow that plan, and it actually worked, maybe not perfectly, but this kind of plan really helped me because it makes you focus. If you are not focusing on that one thing, you could end up going anywhere and lose your way to becoming the data scientist you want to be. Right, right, I absolutely agree with you: have a clear plan and learn in an organized way, and don't just go from one course to another trying to learn everything. That would indeed be the best way, because otherwise everyone will spend a ton of time learning one skill, and by the time they finish, there is a new skill to learn. Amazing. And when it comes to leading, because you are a data science manager and you have gone through these
different steps already and you are leading teams: what, in your opinion, has helped you balance the technical side and leading people? How do you manage projects and people and, at the same time, keep driving your career? Yeah, that's really hard, I have to say, but it's something I also learned during my corporate time, because leading and being an individual contributor are two different things. On one side, the technical part, we already understand; but to become a leader of people we need to understand how to delegate, which tasks, which projects, and then trust the others that they can actually get it done. And it goes both ways: I trust you as a colleague, and I would say you are also a friend, someone I'm working together with here; we are not just boss and subordinate, "okay, you do this, you do that, bye". No, we try to communicate together: what are the problems here, how can we work together on this? I try not to hold back either; I still jump in and do things myself. But trusting people and delegating to people is very important, and of course it comes with learning. As an individual contributor I still really love coding, I still love programming, I still do it right now, and I'd happily get hands-on with a model at any time, but trusting others is a learning process: learning to manage the project, and learning that I can actually do this much better now that I'm in this position. So yeah, it's still a learning process for me; it's a never-ending journey, I would say, but delegating to people I trust, having trust in others, is the key to balancing things.
On that note, because some of our listeners who are aspiring tech people and haven't worked in the field may not know this concept of an individual contributor that you just mentioned: for anyone who doesn't know the term, we usually have two different career paths in data science. When you just join, you usually join as an individual contributor, so you work on the technical stuff, and as you go through your career there are usually two ways to go; one is the individual path, staying an individual contributor, and the other is the managerial path, which I assume is the one you are following, because you are managing people. Yes. So, on that topic, you mentioned working as an individual contributor, and now you are a manager, which means you are managing and supervising people, and on the note of trust, because that's something you mentioned a lot when explaining your leadership skills, trusting in others: how do you build trust as a data scientist when you have just joined your data science job? Yes, by proof, I think, because that's what I do to build trust with my boss and with my colleagues: I prove that I can actually do my job, and then I
try to present what my results are going to be, and I try to be proactive in my work. Trust is not built just by you saying it; trust is built by having proof of your work. Trust comes from the things you have already done, from the action itself. I know communicating is maybe not that easy either, but your work is like a promise. As a junior data scientist you can say "yeah, I can do it, I can do it", and plenty of people say they can do it, but how do they actually work? It's the work that becomes the proof behind the trust. So as a junior, I think the best way is to just do the work. Even if it's small, and as you said previously it might be boring stuff, it might seem like it's not leading anywhere, that kind of work shows that you can actually do the work, rather than just saying so. Get the work done, basically. Yeah, get the work done. Right, right, that makes sense: then you can build the trust and make sure you help your supervisor, because at the end of the day that's the job of the junior data scientist, to help the senior data scientists and other leaders in the team make their lives easier, whether it's, unfortunately, data collection, data analysis, or something else. And on the note of
leading teams and becoming a manager: in your opinion, how can one decide whether they want to take the path towards individual contribution, so becoming a principal data scientist and staying an individual contributor, or whether they should consider the path towards becoming a manager, a data science manager? What qualities do you need to have to go towards one path or the other? I think it comes from yourself, because every single person has a different evaluation of what they want to be. For example, for my part, I love both sides; I already said I love both, being an individual contributor and leading, so I wanted to actually have this experience as a leader as well. Because I already had good experience as an individual contributor, I tried moving to the other side too, to try the leadership part. But of course it comes back to yourself again: what do you want to do in your future? I cannot tell every single person that becoming a manager is the best path because it has the best money, or that staying an individual contributor is much better; it really comes down to yourself, your comfort zone, where you want to be in the future. Every single career path has its pros and cons, it's always like that; it's just that whatever decision you make, you need to be responsible for it. That makes sense. So what is it like to be a
data science manager? Walk us through your day-to-day and your responsibilities, so the listeners can get an idea of what it means to be a data science manager. Yes. The main thing, I would say, is still leading the team: having meetings with the team and managing the projects, what our projects are right now, how they are going, how they are progressing, so managing the progress of the work, and then trying to balance that with the business users, what their expectations are, and translating that into our technical work so we can actually make that kind of progress. So it's balancing with everybody, from every side. That's the kind of work that usually fills my day: managing the people, what they are going to work on this week or today, and then, when the business users come, figuring out how we are going to make the project successful. So that's the day-to-day, basically. That makes sense: managing people, a lot of communication, meetings; that's definitely also part of the managerial position, that's for sure. So if you don't like meetings and you don't like communicating with product people and business people, you should not choose that path, right? That's true, that's true; I would say that's the con, because if you look at my calendar, there can be a lot of meetings, eight, nine, ten, twelve, and then suddenly it's like, hmm, when do I actually do my own work? Yeah, for sure, I've heard that a lot. So that's something for our listeners to take into account when choosing which career path they want to
follow. And on the note of projects, because you have been leading projects and you have also been climbing the corporate ladder successfully as a data science manager at such a young age: can you remember maybe one impactful project that you went through, a tough project that you and your team completed, and what were the main takeaways from it? There are so many projects, but I think the really impactful one right now, and it's still ongoing work, is that we are trying to build a claims fraud detection project. Basically, every company needs a fraud detection model; this is really impactful to the business because fraud usually doesn't happen that often, but when it happens it can damage the finances, damage the reputation, damage the whole body of the business. It's a long-running project that needs a lot of consideration: which part of the business it comes from, which customers, where the dataset comes from, so it really involves a lot of stakeholders and a lot of technical people. This kind of project takes a lot; people come and go, because it's a really hard project, but at the very least I would say it's making an impact, because the business users are using it and they trust our results. There is always a lot of room for improvement, because, as I said before, fraud can be really damaging if we don't handle it properly, so we need to present the results from this claims fraud model as something that can be understood by the business: we try to build the model as well as possible, we need the explainability to be as good as possible, and we need to integrate it into the business processes as much as possible. So there are a lot of moving parts that are still being worked on. It's the project I really remember to this day, and the kind of work I did there I try to carry over into other projects, because this kind of complexity really needs to be taken into consideration. Okay, amazing, fraud detection sounds like a tough problem;
if you have a high fraud rate, that can seriously impact your operations, and it is money that we are talking about, after all. And on the note of money: you have gone through these different steps, you usually start as a junior data scientist, or as an intern in data science if you don't have any background at all and haven't completed any projects, and then you go through the steps; usually you become a mid-level data scientist, then a senior data scientist, and then a data science manager like you are today. So what does it take to get a promotion? Is it only the technical side, building that trust like you just said, making an impact, or do you also need to actively promote yourself? Because I know it really differs from company to company, but there are some common qualities that differentiate data scientists who stay in the same role for many years, even if they have a very good skill set, from the data scientists who, like yourself, very quickly go
through these different steps and get promoted a lot yes I think like withc corporate yeah I said before it's really depend on
the business as well, of course, because a promotion only comes, first, if there is a business position to be filled — that's the first stage. If you want a really high position, it might not go as fast as mine if your company doesn't even need it. But for a promotion — basically a promotion that increases your money, increases your financial security — I believe the financial part is actually the most important. I think for most people — I know some people don't want to get promoted, they just want a higher salary rather than a bigger title — but my biggest quality, as I said before, is taking initiative. I discussed with my manager what I want my career to be, and I asked, as I said before: can I get a promotion? And then we basically negotiated: yes, we can do it in this amount of time, with your kind of proof of work. Everything I had already done, I had already shown. I'm not saying I got promoted in six months; I tried to build it up during that time, and it actually took a year of meetings. I asked — I even had to ask — so it's really an initiative you take, because you don't even know whether it's possible or not if people are not asking. I know some company cultures are a little bit strict, or some people are just too shy, but it depends on the culture of the company. In my company, my voice was really heard: it doesn't matter whether you ask to be promoted or not, we have open discussions. But in a company that is a little bit stricter in that regard, you need to be a little bit proactive and show that you are the best at your work, show that compared to your colleagues, yes, I am the one. I would say it is a competition — in a corporate setting it is still very often a competition if you frame it that way, but it's a really healthy competition if you
want to move up the ladder. Okay, amazing. So basically: don't be shy to ask, because if you don't ask you might not get it; read the room, meaning you need to know your boss and whether he is open or not to a conversation like that; understand the company culture and whether it's fine to promote yourself; and build that negotiation skill set to negotiate for a higher salary or a better business title, because sometimes that's also important for your position in the company. But do it at the right time, meaning show your work first, and only with the right cards in hand go and promote yourself, right? Yes, yes, you need to have a strategy for that, basically. But I know it takes time and it takes skill to do that. I
know not everyone is actually brave enough to do that, or even thinks about it — everyone is different — but of course you can try to copy other people's strategies a little bit. What I said just now, if you want to copy my strategy: after a year of working hard, when you have already taken initiative, you ask your boss; and when that strategy works, if your company has a culture a little bit like mine, it can really pay off. Okay, so that's very good advice, I think: don't shy away from asking,
and know your strategy, plan in advance. Yes. On the note of planning and personal branding — planning your personal brand — now, Cornelius, you have a huge following, you are a Top Voice on LinkedIn, and you have about 30,000 followers on LinkedIn, right? Almost 30,000. Still almost, yes — if you are listening now, go and follow Cornelius, maybe we can get him to 30,000. Yes. So how important is it to have a personal brand as a data scientist? It's actually very important. I will talk about what it does outside of my work, because of course my work is the core, but outside of my work it gives me a lot
of chances, a lot of opportunities. Like right now, I'm talking with you because I already have a personal brand, right? It opens a lot of doors: I get a lot of new friends, a lot of new networking, a lot of opportunities outside of my job — a lot of freelance work, a lot of writing. I love writing, so I try to focus myself on writing as well; some content creators actually do video, do TikTok for example, but I love article writing. I have a newsletter, and I try to build my personal brand there as a data scientist who writes. I don't call it a side hustle anymore, because it has become my personal hustle that improves my income and my security. So personal branding can give you more career choices. I read a lot about business entrepreneurs: they usually don't have only one source of income, they have multiple, spread here and there, and I see this actually coming from personal branding as well — they can have multiple sources of income because they promote themselves. Like myself: I try to promote myself as a data scientist who knows some of the stuff in the data science world, and then the opportunities keep coming, because I keep promoting here and there and try to build up the choices I could make in the future. The doors that open are more numerous, so in the case that maybe someday I get laid off or something — you never know in a company, you never know what could happen; it might feel secure right now, but maybe in the future there's something like a financial crisis, a layoff, or even a pandemic like before, where people who felt financially secure suddenly got cut off — I see my personal brand right now as a security measure as well, something that could give me a runway in case something like that happens.
Really, yeah, for sure, because as you just mentioned, in many parts of the world there are currently many data scientists being laid off just because of the economic situation. The message you are giving to our audience is that your personal brand is not just something for that scenario, but also in general: it opens many doors for you, new opportunities, new networks, new contacts, and also, in case something happens — you're being terminated or laid off for some reason — it will be a great way for you to have a source of income, so you don't put all your eggs in one basket. That's the message you are giving, right? Basically, yes, very precisely. And on the note of your newsletter, because you have a popular newsletter and it is called Non-Brand Data, can you tell us more about that? Yes. At first I just called it Non-Brand
Data because, in the early years when I was trying to create a newsletter and a brand, I didn't want to box myself into one kind of branding — that I could only be a data engineer, or a scientist doing NLP stuff — so it's a newsletter that can talk about anything related to data. At the moment I try to talk about two things: first, more about career, so business and career as a data scientist, those kinds of tips and my opinions; and second, more about the technical stuff, so I still try to put my technical knowledge in my newsletter — I talk about Python, MLOps, and how to integrate them. Basically I try to combine understanding the business with understanding programming and technical stuff. It's still data science, just angled at helping you, as a data scientist, improve your career. Amazing. So if you are listening now, make sure to check out the Non-Brand Data newsletter from Cornelius, hit the subscribe button, and you will get a lot
of help with your career, how to start a data science career, and also the technical stuff, like Cornelius just mentioned. Yes. Also, let's talk about the coaching that you do, because you have a Topmate link and also a way to share your knowledge, to coach other aspiring data scientists or people who are already in the field. Can you tell us more about what kind of coaching you do? Yes, it's more about how you can move as a data scientist in your career — where do you want to move, like I said before, manager or individual contributor — but basically how you want to move as a data scientist in your career. Of course it is also open to people who are not data scientists yet and want to break into the data science field; it's open for that kind of coaching as well. Amazing. So if you're looking for a coach, make sure to get in touch
with Cornelius — he might be able to help you with your career but also with the technical part. Yes, he's on Topmate, so we will put the link in the description. So, on the note of personal branding: let's say a person comes and asks for your coaching services, or just your advice in general. You have a huge personal brand at the moment, a huge following on LinkedIn. What would be your advice, in a more actionable way, on how to build a personal brand from scratch? Yes, I will speak from my personal experience. First, I actually did networking before I posted anything
much. Before that, I came up with a plan of what I want to be: I'm a data scientist, this is what I know, and I tried to build that — that was going to be my brand, it was going to be about data science. Then I tried to do networking. The first time you post on social media, of course, you're not going to have a lot of followers reading it or a lot of people noticing it. That's why I also tried to approach people who already had big followings. I have a friend — his name is Kin, and he's not that active anymore — but he's the one who first helped me get into the network in the data science field as a data scientist, and because of his followers I got more followers as well, and then I started getting more and more momentum. So it's about announcing your brand and then networking with people who already have those numbers — basically, I would say it's a numbers game, if you put it that way. And third, consistency. I'm still consistently posting; it's already been five years, I think, and I'm still consistent until now, posting stuff even when it's hard. One of the things that makes people shy away after their first post is that not a lot of people are liking or commenting on their content. But no — just keep posting. Every single post you have might be valuable for someone. You might think, this is too easy, people already know this, but maybe someone is actually reading your content and thinking, okay, this is how you do it. I thought about that before — okay, there are not many likes, but there are people who actually say, thank you for your post — and that makes my motivation higher. At first it might not show much, but just keep posting consistently, because it takes mental strength, I think, to keep posting when your numbers are not that great. And of course it still takes a strategy to improve your social media and personal branding. So what I think it comes down to: have a plan for your personal brand, build the network, and keep consistent. I think those three are already good enough if you want to build a personal brand. Amazing. So if
you are someone who is listening and who doesn't yet have any personal brand, following these steps will help you gain that personal brand online but also offline. So on that note, Cornelius, do you think that having coaching services, having a newsletter, having a LinkedIn following, and of course also networking is enough for your personal brand, or are there other media or channels that you usually use to build your personal brand as a technical person? I'm talking about, for example, GitHub — a place to store, create, and showcase your code, and also to showcase how you can tell a story about data. Yes, yes, that's true. GitHub is really the community for data people — not even just data, for any
programmer — to have your portfolio shown there. But also, for myself, because I'm really a writer, after writing I use Medium to host all my articles; I'm still quite active there, though right now I'm a bit more focused on my newsletter. Depending on your passion, I think as a strategy you need to branch out a little bit. For example, mine are Medium and GitHub, but for some people who like video, maybe you could try being present on YouTube or on TikTok, because I know some friends there — one of my friends is actually really active on TikTok compared to LinkedIn. But you really need to understand what you want your audience to be. For example, I like LinkedIn because it's a really professional place as a social platform, with a lot of professional communication, so that's why I'm really active there trying to build my personal brand. But if your personal branding targets more casual people, then maybe X may be a better place — I know a lot of big data people are on X as well, with a lot of followers. I try to post here and there on X too, but it's mostly in Bahasa Indonesia, so it may be a little bit harder to follow if you don't speak Indonesian. If you are just starting, try to focus on one platform first — have one social media platform, build from there, and then branch out. Things like GitHub and Medium, I think, are just places to show your portfolio, more about where you can show your work; they complement the personal branding that you're already building on social media. GitHub, Medium, or some others are for the portfolio, but pick your starting place and focus your effort there. I think that's one important part. I think that's amazing advice, to keep it simple, because sometimes it can be
overwhelming if you have so many social media accounts or different channels that you need to keep up with — you will just run out of time or you will burn out. So start simple, basically, that's what you are saying, and then they will all start to complement each other: X or TikTok, GitHub, Medium, LinkedIn, Facebook — it really depends on your target profile, like you just mentioned. Yes. Amazing. Let's now dive deeper and become a bit more technical, because data science is such a buzzword. There are people who understand data science as data analytics, there are people who understand data science as machine learning engineering, and with the AI revolution there are now many people who understand a data scientist as someone who is dealing with LLMOps, large language models, deep learning, machine learning — so many parts that many data scientists across the world are learning and implementing now. In your opinion, as a data science manager with a wealth of knowledge and experience in the field, what is, for you, a data scientist, and what does it take to be a data scientist? Yes, this is still a question
that I'm actually still asking myself, because previously I said a data scientist is someone who works with data to bring value to the business, but right now it's moving toward much more than that. A data scientist is someone who brings value to the business and drives better decisions for the business. I always say a business, because a data scientist always needs a business to work with — even if you're a solopreneur working solo, your product will have to be promoted in some way, so there is some business value there. That's why I always say that right now a data scientist is not just someone who brings value but also someone who brings better decisions to the business. To become a data scientist, I would say it takes a lot of mental strength: it takes consistency, it takes wanting to learn all the time, every day, every year, always learning, right? And being pro-business, which I already said before, so I don't need to repeat it again. But yeah, just keep wanting to learn and keep up with the latest technology. This latest technology — I believe more and more that it's going to change how we work, just like it's changing how I work right now. So as data scientists we should try to keep following that technology, because if you are really not following it, we will, as I said, basically be replaced. As data scientists we have an advantage, because we are usually the ones working on creating that AI, complementing that AI, right? So on that note, I think
you already answered my question, because I wanted to ask you: do you think that AI is a buzzword? Many people think that this new era of generative AI — ChatGPT, you know, Claude, LLMs — is a hype that will just go away, but you just answered that in your opinion it's not going away anytime soon and it's going to make a huge impact. So can you walk us through some of the recent developments that you are aware of and believe are going to make a huge impact, and in which industry specifically, in your opinion? Yes. If you want to talk about technology right now, we already have a lot of these
generative models. OpenAI already has a lot of models that are really good, Claude has also been really great lately, and I feel like Sora is now trying to create not just images but video. But what I really see right now is the business side, the non-technical people. Previously it was just used for simple stuff, but now business people are starting to see how useful it is — not just technical people but non-technical people are trying to build products based on it. Just from my side, in the past two or three weeks I have already met with a lot of stakeholders — here in Indonesia, big stakeholders — who actually want to try building this kind of product and use this AI to simplify their business processes. Basically, I feel like in Indonesia they already see the potential of how it could be used, but of course right now it's our job as well, as data scientists, to prove that this kind of AI is actually useful to the business. It's not just that the business wants to use it and makes the effort from their side; we as data scientists have to make an effort as well to show that yes, this AI can be useful for your business, and to prove that when these two things combine — business and AI — it can become big, it can become a game changer. That's why I'm really quite confident that it can actually change the world, because it's not just me personally using it and getting a benefit from it; everyone I see from my side is already starting to use it, trying to implement it in their business, and it's actually useful. Well, that's amazing to hear, because when I'm
hearing talk about AI replacing data scientists, I feel like that's something that is highly arguable. So first I want to get your opinion on this: do you think AI will replace data scientists, and how can you future-proof yourself as a data scientist? Yes, I think it's not going to replace us wholly; I think it's more about some of the tasks
we do. For example, maybe some detection tasks or code generation could be delegated to AI, but of course restructuring all of the work, deciding where it is going to be used and how it is going to be managed, still takes a data scientist. That's why the data scientist role just keeps evolving, right? Because those tasks can be replaced by AI, we as data scientists don't just need to understand how to code well; we need to understand how to manage this code better, we need to actually document it better, we need to understand where it's going to sit in the business, where it's going to be used, and which of us data scientists is really going to be using this AI. So we need to understand all this latest technology as well. What I want to say is that data scientists need to become full-stack, and that cannot be avoided. For myself as well, I try to learn a lot of MLOps, ML operationalization, because I don't think that's going to be replaced by AI either — the structure could change, but yes. I couldn't say it better, because being
able to become this full-stack professional — knowing the data side, the machine learning side, deep learning, but also the recent developments in AI at least at a high level: what an LLM is, what LLMOps is, those cloud technologies and how they can be used, right? Yes, and also having the business acumen and the communication, because there is always communication between the business and the data scientist, there is always a need for translation. As long as you are able to do so and continuously develop yourself with the technology, there is no way you will be replaced — because, by the way, the other day I was reading that currently there is an 85% gap between the demand for data science and AI professionals and the supply. So unless you are doing a manual job — I always tend to say, unless you do something repetitive that can be replaced by AI — you are good to go. Yes. So you should definitely have the motivation to get into data science if you like it. On the note of getting into data science: you are now in a position of hiring data scientists, so what is the skill set you pay attention to? Because with this recent boom of LLMs and generative AI, many aspiring data scientists, instead of starting with the fundamentals, start with the difficult stuff — training neural networks, understanding RNNs, attention mechanisms, Transformers, diffusion models — and then try to show with these projects that they are experienced professionals. Okay, what do you pay attention to when you are hiring and you are getting these
different resumes? Okay, so the first thing — the thing that I don't actually pay much attention to — is which university they went to, what their GPA is, their age, or their background. I find those kinds of things discriminative, so I try to make every call on merit. Of course, when you are in a business there are business needs, so the first thing is filling the business need when hiring. For a junior data scientist, what I want to see is whether you can actually do the work: have you already done at least some data science projects, already created some data science projects? And for those projects, how did your thought process go — what motivated you to do this kind of project, what motivated you to write this kind of code, what motivated you to present this kind of result? So data science projects are what I really look at. Of course, having internship experience or previous job experience really helps, and I look at that as well, but I know it's a little bit harder for juniors — for fresh data scientists — to come up with that, because those positions are really hard to fill. That's why I at least take a look at their data science portfolio projects; that's the most important part I check. As for communication, business acumen, translating to the business — I think those are skills you learn once you are already inside a business; when you are trying to enter a business, you need to at least show that you can do the work for the position you are going to be hired for. Of course, understanding a little bit about the business really helps, it's still really helpful, it's a plus — but my starting point is: show your data science portfolio, show that you can actually do the work. Right, so a strong
data science portfolio, basically. You learn the communication and translation skills and business acumen on the job — those are all a plus, extra — but for you as a hiring manager, during the hiring process you pay attention to the projects that they have completed, unless they already have experience. Yes. And on that note, what would you suggest — a couple of examples of such projects that an aspiring data scientist with zero experience, fresh out of college, maybe even with a non-data-science education, could put on their resume to impress you? Well, a lot of people impress me because they are so
smart, but there are a lot of projects like that — the kind you can find on Kaggle or UCI, those somewhat complex data science projects. What would really impress me is if you could actually formulate a business problem from those data sets, explain why you developed this type of model, and show that the model you developed actually solves it — or even without modeling at all: you don't need a model, but can you formulate a business problem from this data set and then try to solve it using a data science technique, maybe just a clustering technique, maybe customer segmentation or dimensionality reduction, and actually show, this is how I solve the business problem that I formulated with this data set, and this is how I did it? That's what really impresses me, because what I want to see in your data science project portfolio is basically the thought process, and usually the thought process comes in when you already have a business problem you want to solve. Of course, the data sets that are publicly available may be a little bit limited, but if within that limited data you can creatively think about a problem and then try to solve it, it will really impress me. Right, so end to end, basically,
and also having that extra skill set beyond just solving the problem in a technical way. Yes, solving the problem in more than just a technical way, you could say that. Amazing. And one last question, because we have spoken about many important topics that I believe many aspiring data scientists would be interested in: what do you see as the future of data science in the upcoming five years? Well, it's going to change a lot, I think, with all this AI development
and all the data and technical stuff. I know that in five years data science will be invaluable to the business. Like I said before, AI was a buzzword previously, but it's a buzzword that is now used by the business, and if businesses can actually use it better in their companies over these five years, they will try to hire as many data scientists as possible just to make sure this goes smoothly. And I'm absolutely sure that businesses in these five years will be using a lot of automation from our data, from data science techniques. Amazing. Now, Cornelius, thank
you so much for joining us today. I think your insights and all your tips were invaluable for our listeners who are interested in tech and in data science. For our listeners: make sure to follow Cornelius on LinkedIn, and also check out his newsletter, Non-Brand Data. And if you are looking for a coach, definitely go for Cornelius — he will be able to help you. Thank you so much, Cornelius, it was a real pleasure to have you here. Thank you so much, thank you.
So they're going to buy a business, and that'll be the base, and then on top of that base we're going to add other businesses to try to help it grow exponentially faster. And I used 100% bank debt to buy those 23 companies, and the net result is a tremendous amount of shareholder value created. Tesla could have come crashing down if the lenders started saying, you know, enough is enough. Joining us today is Adam Coffey, who brings over 21 years of experience in building businesses, having
served as a CEO for three major companies supported by nine different private equity sponsors. Adam has managed transactions worth over $2.5 billion and advised top Fortune 500 companies. During his time, Adam organized 58 business deals and significantly increased company values, achieving fivefold returns for investors. He grew one company's value from 10 million to over a billion dollars, earning him recognition as one of the most influential leaders by the Orange County Business Journal. Adam is also a best-selling author, a popular speaker, and a mentor to aspiring leaders. His extensive background includes roles in healthcare, manufacturing, and beyond; his diverse skill set also includes being a licensed contractor, a pilot, an army veteran, and a former executive at GE for 10 years. Today we will dive into the proven strategies that aspiring tech entrepreneurs and fresh graduates need to thrive in today's competitive landscape. We will uncover invaluable insights on how to navigate the tech world, cut through the noise, build investor trust and secure funding, as well as forge lasting partnerships. Finally, we will learn how to plan and execute a lucrative exit that maximizes your hard-earned success. The podcast is hosted by Vahe, an experienced software engineer and tech entrepreneur, co-founder of LunarTech, which is on a mission to democratize data science and AI. So without further ado, let's get started. Welcome, Adam, we are excited to have you join us today. Now, Adam, you are a big deal in private equity and in business — you've done deals worth over 2.5
billion, you have also advised top Fortune 500 companies and written bestselling books on business. Adam, could you please share your journey, and how you got to where you are today, with our audience? Happy to, happy to — and hey, by the way, good to see you, good to be here with all your listeners out there. You know, I think for all of
us, life is a journey and a building of a set of experiences that make us who we are. As a young person I served in the US military; service in the military taught me something about discipline, teamwork, leadership. Engineering made me a meticulous planner. I'm a pilot — pilots don't take off unless we know where we're going — so that taught me, as an entrepreneur, to plan an exit from the beginning and always have an exit and a destination in mind. I spent 10 years working for Jack Welch in what I call the Camelot era of GE. GE was the world's largest company, Fortune number one on the Fortune 500 list; the company was growing so fast it was doubling in size every three years, and that really informed my thinking about growth, and GE taught me how to run a business. Then I spent 21 years as a CEO building three different national companies for nine different private equity firms — bought 58 companies, a buy-and-build guy, a turnaround guy — and I've got two and a half billion dollars in CEO exits under my belt. That kind of led me to writing books about how to do this. I wanted to educate; I'm turning 60 here shortly and I wanted to start thinking about legacy and how I could teach the next generation of entrepreneurs and business owners to excel at this game, at this thing that's been so influenced by private equity. That led me to hanging up my CEO cleats a couple of years ago. I started a consulting business, I've got clients all over the globe, I help them with scaling, with doing M&A, teaching them the tricks that the big institutional investors use to create shareholder wealth, and then I help people exit. I work with private equity firms, I work with individuals and founders. I'm having a ball — I work more hours now than I ever did when I was a CEO, so so much for slowing down; I think I've actually sped it up. Awesome. And can you share a few stories on how you acquired new
businesses and grew them and sold them — the top businesses you worked on? Well, usually in my world — again, I've been doing this with large institutional shareholders — they always start with what's called a platform company. They're going to buy a base business, and on top of that base we're going to add other businesses to try to help it grow exponentially faster. If I take my last company as an example: a private equity firm buys a company, it's a platform company, it has 200 million plus in revenue, they buy it with a combination of debt and equity from their fund, and then they bring me in. The company has not done well, so it's time to bring in the guy to turn it around, to fix it, to get it scaling again, and then start doing a buy-and-build. So I then bought 23 companies over a five-year period — eight total in the first hold period, 15 in the second — and started bolting on these other businesses to go from being regional to national, national to international, depending on which company I was building at the time. In addition to growing through M&A, I would also focus on improving the business I started with: usually investing in technology, trying to do my best to increase the revenues and profitability of the base business, and then also a lot of effort around organic growth, to get the business that was underwhelming and not doing really well to grow like it had never grown before organically. So I've learned how to build, I'll say, a very balanced, growth-oriented company, but no question that M&A is the largest component of shareholder value creation. In that example, for those 23 companies we bought, on average I paid five times earnings for each one — they were smaller and plentiful — and I used 100% bank debt to buy those 23 companies, and I used the cash flow of the 23 businesses to service the debt while I was collecting them, buying them. And then when we go to market, we sell, and we sold the first time at a multiple of around 14 times. So things I was buying at five times I'm now selling at 14 times, and the net result is a tremendous amount of shareholder value created. Then you add in the organic growth, you add in the margin improvement, and that's kind of my recipe for the perfect exit. In that case, in my first exit, the three-year period, it was a 4x multiple of invested capital, so shareholders were happy, investors were happy, the management team was thrilled, we made a ton of money, you know, when things go
well. And so, as you know, in the tech industry the opportunity for creating value and wealth is immense, but it's also very competitive. We have many fresh graduates coming straight out of university and trying to make a new business, a new startup, but they have no idea how to do this. So what strategy or what mindset would you recommend to our ambitious tech entrepreneurs who want to get their foot in the door? Yeah, so tech is an entirely different world — you have to get to different concepts, you know, software as a service or a
tech-enabled platform definitely brings a higher valuation, call it at exit. But oftentimes, from a tech startup perspective, I'd say a lot of people out there are trying to create the next best thing, and oftentimes I tell people: instead of trying to create something new that doesn't yet exist, potentially solve an old problem, or put a new spin on something that's already out there. I think too often we try too hard as entrepreneurs to create something new and differentiated that the world's never seen before, and sometimes boring old problems still need help and still need solving; they can be updated and solved in a new, modern fashion. So I think sometimes entrepreneurs overthink complexity. When I'm talking to people about what constitutes a great company, I tell them to think about basic human needs — think about needs versus wants. In a bad economy, a down economy, if my business is focused on needs, I'm not going to get hurt as badly; my revenue streams will still be fairly consistent. But if my product or service is a want, then if I'm laid off, or I'm unemployed, or I'm feeling a pinch from high interest rates, I can slow down, avoid, or completely ignore that spend for an extended period of time until the economy comes back. So we have to be concerned with the cyclicality of the broader economies in the world — we go through up cycles, we go through down cycles, and the world can throw us curveballs like COVID — so we have to be very thoughtful: if we're going to start something, I want it to be needs-based. If my roof is leaking and it's raining outside and water is pouring on my head, I have to fix that whether I'm broke or not. But if I wanted to put fancy new accessories on my big monster truck, and I'm unemployed and I don't have any money, then I just look at the magazine and dream about what I would do — I don't have the money, so I don't do it. It's a discretionary spend. So: needs versus wants. Then we want subscription-based versus, I'll call it, project-based — some type of product that customers are going to pay us a monthly fee for. It's shocking to me, if I go to my credit card statements and look at all the monthly fees I'm paying — for Adobe Acrobat, for Google Cloud, for Apple this or that — I spend a fortune every month on just these recurring, contracted-type charges. And that's also the key to entrepreneurial success: once I find a customer, I want to create a recurring revenue stream. Even in games, people might offer a free game, but there are in-app purchases to help augment it. So if I'm thinking from a tech perspective: needs versus wants, contracted revenue stream versus one-time use or project-based, and then, in a perfect world, low capital expenditure — not a lot of money to further develop or refine a product once it's created. Creating a profile like that leads to high profitability and high free cash flow, and with high free cash flow comes the ability to service a lot of debt, which means buyers who want to use debt as a primary funding source can service a lot of debt because there's a lot of cash flow. So if you can build a business with high free cash flow that's focused on needs, not wants, and has a recurring, contracted revenue stream, you're going to do much, much better. Yeah, that's great
advice. Oftentimes entrepreneurs start working on some kind of new project; they think they are solving a problem in a certain way, but actually they're not solving any new problem at all, and they end up hitting a wall when they talk to an investor who asks, isn't someone else already doing that? Yeah — sometimes it's boring industries, and we solve a problem there, but they're a staple, a mainstay of an economy. We can get a lot better traction when we solve a common problem for common people rather than create a new problem that someone doesn't know they have yet and then have to convince them they need our product to solve it. Yeah, 100%. And with startups —
many startups need human capital, they need some kind of capital to be able to employ new people, to invest in marketing or in other resources. Now, building trust with investors is very important, because they want to know that they are investing in someone who is trustworthy and that they will not only get their money back but also get a multiple return. So what would you advise new people entering the field on building trust with investors? Well, this is the age-old problem and the age-old question, right?
Chicken or the egg: I have no revenue, but I need people. The venture capital investor says, I don't want to give you a bunch of capital that you're going to waste on the come; I need you to prove the concept and prove that you can actually create these revenue streams. So it's a very delicate balance, and it makes startups a very difficult place to be. Oftentimes I ask myself, do I want to build or do I want to buy? I'll look at the existing market and say, look, if I start from scratch I have a very high probability of failure, I have a lot of hurdles I'm going to have to cross, and I might ask myself: is there an existing company that has the existing technology or the existing product that I can buy? As a result I've got a company that has revenue, customers, a history of profitability, and then it's a different game. In the startup world we use things like founders' equity and we try to attract people by telling them how rich they're going to be one day down the road, and that's a hard sell. I've got to tell you, I get contacted constantly by people who want to offer me founders' equity to help them, and you know what? I work for cash flow, I don't work for founders' equity, and when I'm sitting on the boards of companies they give me stock anyway — so I get stock in an existing company with real revenue, real customers, and I get cash flow. So I personally won't work in a tech-type startup world where there's only founders' equity involved. I think we have to be realistic, and we have to profile. Any time I need people, my goal and objective is to hire the best people I can find for the company that I want to be in five years, not the company I am today. Part of my tenets is that I have to pay a fair market wage, but if I can't, because I'm cash-constrained, then the only tool I've got is incentive equity to try to attract people, and then my profile might change. I may not be looking for an established executive who's used to making seven figures a year, because I have no money to pay them; I'm looking for a different profile: a younger person, an up-and-coming person, a person with great skills, but they live in their mother's basement or in an apartment, their cost structure is low, they don't yet have kids, they're not yet married, and as a result I can attract them with the equity potential despite the lack of cash, because their needs are lower. It's like — I'm a seven-figure, eight-figure guy every year, so I don't work for free, I don't work for equity that may or may not pay off in 10 years, I work for cash and equity. So we have to think about the talent we need, where we are going to find it, and how we are going to attract and retain it, and we have to build a profile of the type of person we think would be uniquely qualified to go on this entrepreneurial journey with us, especially when we're cash-constrained in the beginning and just don't have the right level of capital. So I need brilliance on a budget, and I'm going to look for the profile of a person who's got low cash flow needs, where my small, paltry salary will at least cover their basic needs, because they have cheap basic needs but brilliant skills, and they're trying to become, call it, the next tech billionaire or multi-millionaire — they'll believe in the journey and they'll use, call it, sweat equity to get
there. Now, in your experience, how do successful companies balance innovation with sustainable growth? For example, we have a lot of businesses that are innovating, but they keep on innovating in an unsustainable way. A small tech entrepreneur who's trying to create something that has legs, I'll call it — something with the long-lasting ability to build sustainable revenue in the future — at some
point we have to shift entrepreneurial gears and say it's good enough — it's good enough for now — and our focus needs to be scaling. The innovation, the investment — we don't necessarily want to stop, but we do need to throttle back. So if I've gotten to a proof of concept and I'm out in the marketplace, there is a point where we have to be thinking: well, if I continue to spend money I don't have, innovating, innovating, innovating — while this is important, I'm never going to build a sustainable business if I don't also keep my eye on the ball and the fact that my investors need to see a return and I need to create revenue. So as I get out of the gates, out into the market, when I start to see revenue coming in, we really have to drive revenue hard and show sustainable, high levels of revenue growth and the high margins we were hoping for, and we have to demonstrate this. We have a lot of initial effort on, call it, the technology side to innovate and create the product; once we get out and launch, that needs to scale back, and our efforts need to be replaced by focusing on marketing and sales and building the revenue stream. We have to remember that in order to build the best business in the world, it still has to be fed with cash, and investors will eventually run out of patience and pull the rug out from under us if we can't prove that we've got revenue. I think back to Elon Musk in the early days of Tesla, or Jeff Bezos at Amazon — on any given day, Tesla could have come crashing down if the lenders started saying, enough is enough, I'm not loaning you any more, it's time for you to either make money or shut down. He was able to navigate that, as was Jeff when he was building Amazon, but the typical small entrepreneur isn't going to get that kind of treatment; they are not going to be able to sustain innovation and investment on a hope and a prayer if they cannot prove that the money is there. So don't forget that while we may be interested in technologically changing the world, there's the commercial aspect: we've got to make money, and before I worry about making money big, I need to prove to people I can make money small. Once I've got my product to a stage where it's ready for revenue, I need to turn down innovation, turn up marketing, and really focus on driving revenue creation and customer adoption so that I can start generating cash, which will let me go back to innovating at a future time. We have to be balanced. A lot of times entrepreneurs forget the commercial aspect, and the commercial aspect is: we've got to make money. We're so busy innovating we forget that we have to make money, and before long investors get tired of us, because there are a thousand other things for them to invest in; they pull the rug out from under us and we crash and burn. So the best technology on the planet does not guarantee you commercial success; we have to drive commercial success as soon as we're able, in order to prove the sustainability of our business. 100%. And I feel like ego
has some kind of role in that, where an entrepreneur is very convinced, for certain reasons but also because of their ego, that this innovation will only drive growth, although in reality it only hinders their growth. How would you — well, that's why dreamers dream and doers do, you know.
There's what I call the accidental arrogance of success. People get so into their own self-promotion — I'm God's gift, and what I've done is going to change the world. I see those pitches every day from people out there: Adam, my idea plus your wallet equals the best thing the planet has ever seen. And I'm like, first of all, if you're talking to me about money, you don't understand my value, because my value is what's up here, not what's in my wallet. Money is a commodity; there are trillions of it out there looking for investments right now, so all you have to do is know where it is, go get it, treat it well, give it an outsized return, and you'll get funded. Money is a commodity, money is not the issue; people who are focused on "money is my problem" don't understand how money works. So in addition to being, call it, a tech genius, they need to have business acumen and/or partner with someone who understands business — they can be the strange person who's locked in the dark room for 20 hours a day innovating and creating something great, but they still need some business person out there to be the front end. When we get arrogant — and keep in mind most of these people have not created anything yet — if they have an arrogance of success before they're actually generating revenue and building something special, then boy, that's an entrepreneur who's going to have a hard time finding capital and finding money. There's a fine line between arrogance and confidence; we need to be confident, we shouldn't be arrogant. If we're arrogant with no money, arrogant with an idea but no revenue, then investors just simply walk away — that's not an adventure I'm going to back. So we have to be careful about letting the arrogance of our genius cloud our thinking, and ultimately investors see right through that. It's okay to be arrogant if you're the richest man on the planet and you have arrived; when you have an idea and no revenue and you're arrogant with investors, that's not a good recipe for success. And we are almost hitting the time — could you tell us about your services? For example, we have new startups, but they have
no idea how to do business, they don't have the business acumen, so maybe they can talk to you? Yeah, so I do consulting work. People can read my books — they're cheap, I donate my royalties to charity, and all three of my books have been number one bestsellers, so thank you to everybody out there who reads them. I've been on hundreds of podcasts just like this, and I do these freely; if you go to ListenNotes you can find them there. I teach seminars globally — those are relatively low cost — and I'll do boot camps where we spend two to four days together and I really get in-depth about all things around growth, raising capital, and selling businesses. And then I work one-on-one with dozens of entrepreneurs; I have a peer group we call the Chairman Group that I run with my business partner JT Foxx. You can reach out to me on LinkedIn, you can go to my website, AdamECoffey.com. I'd say LinkedIn is where you'll find me the most; I'm most active there, and it's really the only social media platform I'm on. On Twitter I post some things once in a while; I'm not on Instagram or Facebook — there's a fake Adam Coffey out there, believe it or not. I guess you've arrived when there are people imitating you, so on Facebook and Instagram you'll find fake Adam Coffeys trying to take money from you. I'm trying to help people, not bilk them for money. So I'm a consultant, and I do consulting work with all kinds of different people — private equity firms, family offices, etc. So thanks for having me on, I appreciate you and your listeners out there. Good luck, take care of people, and revenue will happen. Thank you, Adam.
The next question is: what is gradient descent? Gradient descent is an optimization algorithm that we use in both machine learning and deep learning to minimize the loss function of our model. This means we iteratively improve the model parameters in order to minimize the cost function and end up with a set of model parameters that optimize the model, so that it produces highly accurate predictions. In order to understand gradient descent, we need to understand what the loss function is (the cost function is another way of referring to the loss function), we need to understand the flow of a neural network and how the training process works, which we have seen in the previous questions, and then we need to understand the idea of iteratively improving the model and
why we are doing that. Let's start from the very beginning. We have just learned that during the training process of a neural network we first do the forward pass, which means we iteratively compute the activations: we take our input data, pass it with the corresponding weight parameters and bias vectors through the hidden layers, activate those neurons using activation functions, and continue through the multiple hidden layers until we end up computing the output for that specific forward pass — the predictions, y-hat.
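As an illustration of the forward pass just described, here is a minimal sketch, assuming NumPy, a single hidden layer with a ReLU activation, and hypothetical toy dimensions; it is not the exact network from the course, just the idea of propagating inputs through weights, biases, and activations.

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z) applied element-wise
    return np.maximum(0.0, z)

def forward_pass(X, W1, b1, W2, b2):
    """One forward pass through a single-hidden-layer network.

    X  : (n_samples, n_features) input data
    W1 : (n_features, n_hidden) hidden-layer weights
    b1 : (n_hidden,) hidden-layer bias vector
    W2 : (n_hidden, 1) output-layer weights
    b2 : (1,) output-layer bias
    """
    z1 = X @ W1 + b1       # pre-activations of the hidden layer
    a1 = relu(z1)          # activate the hidden neurons
    y_hat = a1 @ W2 + b2   # network output (predictions)
    return y_hat

# Hypothetical example: 5 observations, 3 features, 4 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # randomly initialized parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
print(forward_pass(X, W1, b1, W2, b2).shape)    # (5, 1) predictions, one per observation
```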
predictions why hat so once we perform this in the for our very initial iteration of training the neural network we need to have a set of model
parameters that we can start uh training process in the first place so we therefore need to initialize those parameters in our model and we have specifically two type of model
parameters in the neural network we have the weights and we have the bias factors as we have seen in the previous questions so then the question is well how much error are we making if we are
using this specific set of weights and bias vectors cuz those are the parameters that we can change in order to improve the accuracy of our model so
So if we use this very initial version of the model parameters, the weights and the bias vectors, and we compute the output, the y hat, then we need to understand how much error the model is making based on this set of model parameters. That's the loss function: the loss function, or the cost function, measures the average error that we make when we use these weights and bias vectors to perform the predictions. As you already know from machine learning, we have regression-type tasks and classification-type tasks, and based on the problem you are solving you can decide what kind of loss function you will use in order to measure how well your model is doing. The idea behind the neural network training process is that you want to iteratively improve these model parameters, the weights and the bias vectors, such that you end up with the set of best and most optimal weights and bias vectors that results in the smallest amount of error, which means that you came up with a neural network that produces highly accurate predictions, which is our entire goal when using neural networks. Loss functions, if you are dealing with classification-type problems, can be the cross entropy, which is usually the go-to choice for classification tasks, but you can also use the F1 score or the F-beta score, or precision and recall. Besides this, in case you have a regression-type task, you can use the MSE, the RMSE, or the MAE, and those are all ways to measure the performance of your model every time you change your model parameters. We have also seen, as part of the training of a neural network, that there is one fundamental algorithm that we need to use, which we referred to as backpropagation, that we use in order to understand how much change there is in our loss function when we apply a small change in our parameters. This is what we were referring to as gradients, and it comes from mathematics: as part of backprop we compute the first-order partial derivative of the loss function with respect to each of our model parameters in order to understand how much we can change each of those parameters to decrease the loss function. Then the question is how exactly gradient descent performs the optimization.
change each of those parameters in order to decrease our loss function so then the question is how exactly gradient descent is performing the optimization
so the gradient descent is using the entire training data when going through one pass and one iteration as part of the training process so for each update
of the parameters so every time it wants to update the weight factors and the bias factors it is using the entire training data which means that in one go
in one forward pass we are using all the training observations in order to compute our predictions and then compute our loss function and then perform back
propagation compute our first order derivative of the loss function with respect to each of those model parameters and they use that in order to update those parameters so the way that
the GD is performing the optimization and updating the model parameters is taking the output of the back prop which is the first order partial derivative of
the loss function with respect to the moral parameters and then multiplying it by by the Learning rate or the step size and then subtracting this amount from
the original and current model parameters in order to get the updated version of the model parameters so as you can see here this comes from the
previously showcase simple example from neural network and here when we compute the predictions we take the gradients from the back propop and then we are
using this DV which is the first order gradient of the loss function with respect to the weight parameter and then multiply this with the St size the EA
and then we are subtracting this from V which is the current weight parameter in order to get the new updated weight parameter and the same we also do for
our second parameter which is the bias Factor so one thing you can see here is that we are using this step size the learning rate which can be also
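As a rough illustration of this update rule (a minimal sketch, not the exact code used in the course; the synthetic data and variable names are my own assumptions), a full-batch gradient descent step for a simple linear model could look like this:

```python
import numpy as np

# Synthetic regression data (assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

# Initialize the two kinds of parameters: weights and bias
w = np.zeros(3)
b = 0.0
eta = 0.1                               # learning rate / step size

for epoch in range(200):
    # Forward pass on the ENTIRE training data (this is what makes it full-batch GD)
    y_hat = X @ w + b
    error = y_hat - y
    loss = np.mean(error ** 2)          # loss function: mean squared error
    # Gradients of the loss w.r.t. the parameters (first-order partial derivatives)
    dw = 2 * X.T @ error / len(y)
    db = 2 * np.mean(error)
    # Gradient descent update: parameter = parameter - eta * gradient
    w -= eta * dw
    b -= eta * db
```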
One thing you can see here is that we are using this step size, the learning rate, which can be considered a separate topic; we could go into the details behind it, but for now think of the learning rate as the step size that decides how big each step should be when we perform the updates. We know exactly how much change there will be in the loss function when we make a certain change in our parameters, so we know the gradient, and then it's up to us to decide how much of this change we want to apply: do we want to make a big jump or smaller jumps when iteratively improving the model parameters? If we take this learning rate very large, it means we will apply a bigger change, so the algorithm will make a bigger step when moving towards the global optimum, and later on we will also see that it might become problematic when we make too big a jump, especially if those jumps are not accurate. We therefore need to ensure that we optimize this learning rate, which is a hyperparameter, and we can tune it in order to find the best learning rate that will minimize the loss function and optimize our neural network. When it comes to gradient descent, the quality of this algorithm is very high; it is known as a good optimizer because it uses the entire training data when computing the gradients, so performing the backprop, and then takes this to update the model parameters. The gradients that we get based on the entire training data represent the true gradients, so we are not estimating them and not making an estimation error; instead we use the entire training data when calculating those gradients, which means that we have a good optimizer that is able to make accurate steps towards finding the global optimum. Therefore GD is known as a good optimizer, and it is able to find, with higher likelihood, the global optimum of the loss function. The problem of gradient descent is that, because it uses the entire training data every time it updates the model parameters, it is sometimes computationally not feasible or super expensive: taking the entire training data to perform just one update of your model parameters, storing that large data in memory every time, and performing those iterations on this large data means that when you have very large or very complex data, using this algorithm might take hours to optimize, in some cases even days. Therefore GD is known to be a good optimizer, but in some cases it is just not feasible to use because it is just not efficient.
The next question is: what is a loss function and what are the various loss functions used in deep learning? A loss function is used to quantify the amount of overall error that the model is making, whether it's a deep learning model or, in general, a traditional machine learning model. In all these cases we need a way to measure the amount of error the model is making, and in order to do so we make use of this idea of loss functions. A loss function is a way to measure the amount of loss the model is making, which means the amount of overall error the model makes when performing the prediction. We can have a loss when we are dealing with a classification model, and we can have a loss when we are dealing with a regression model; at the end of the day we know that, independent of the type of problem we are solving, we are always going to have errors as part of the predictions. We can never get predictions that are exactly equal to the true values we want to get, therefore we need to know what these errors are and what the overall error of the model is, such that we can know how to adjust our model in order to improve it, so the model makes less loss. Therefore the idea behind optimization techniques such as gradient descent, SGD, or RMSprop is to minimize the loss function, but to be able to do that we first need a proper loss function that measures the overall error the model is making. When it comes to the different examples of loss functions, depending on the type of problem we are dealing with we can use different sorts of loss functions. If we are dealing with a regression problem we can use the mean squared error (MSE), the root mean squared error (RMSE), or the MAE, which is another measure commonly used when evaluating regression-type problems, so as a loss function we can use these different metrics to compute the overall error in the predictions for that specific model type. Here we use as input the actual values of y, which are usually numeric values given that we have a regression-type problem, and the estimated values which come from our machine learning or deep learning model. Once we have, per iteration, the predicted values, we can use these predicted numeric values of y hat and compare them to the actual y that we have as part of our validation set, training set, or testing set in order to compute the amount of loss the model is making. There are always these two sets of input values, the y hat and the y, and then, using the mean squared error, we take the square of each of the errors made during model training, sum them up, and take the average; therefore it's also called the mean of the sum of squared errors. When it comes to classification-type problems, we can use the cross entropy, which is also known as log loss, in order to evaluate the performance of the deep learning model; this is handy when dealing with binary classification. When it comes to other types of loss functions that we can use for classification-type problems, we can use the precision, the recall, and also the F1 score or the F-beta score, which is a more general version of the F1 score for when we know specifically what is more important for us, recall versus precision, whereas in the case of the F1 score we don't know or don't care and simply want a good balance between precision and recall; the F1 score basically gives 50% importance to each of the two.
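As a quick illustration of these regression loss metrics (a minimal NumPy sketch; the y and y_hat arrays are assumed toy values, not from the course):

```python
import numpy as np

# Toy true values and model predictions (assumed for illustration)
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_hat - y
mse = np.mean(errors ** 2)          # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error
mae = np.mean(np.abs(errors))       # mean absolute error

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}")
```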
The next question is: what is cross entropy and why is it preferred as a cost function for classification-type problems? Cross entropy, which is also known as log loss, measures the performance of a classification model whose output is in terms of probabilities, which are values between zero and one. Whenever you are dealing with a classification-type problem, let's say you want to classify whether an image is of a cat or a dog, or a house should be classified as an old house versus a new house, in all those cases you have labels and you want the model to provide a probability for each of those classes per observation, such that the output of your model is, for example, that house A has a 50% probability of being classified as new and a 50% probability of being classified as old, or this image has a 70% probability of being a cat image and a 30% probability of being a dog image. In all those cases you can apply the cross entropy as a loss function, and the binary cross entropy is measured as the negative of the sum of y·log(p) + (1 − y)·log(1 − p), where y is the actual label, so in binary classification this can be, for instance, one or zero, and p is the predicted probability, a value between 0 and 1, with y the corresponding label, let's say label zero when you are dealing with a cat image and label one when you are dealing with a dog image. The mathematical explanation behind this formula is out of the scope of this question so I will not go into those details, but if you are interested make sure to check out the logistic regression model; this is part of my machine learning fundamentals handbook, which explains step by step how we end up with this log-likelihood function, how we go from products to summations after applying the logarithmic function so we get the log odds, and why we then multiply by minus one: because we ideally want to minimize the loss function, and this is the opposite of maximizing the likelihood function. What this shows is that we end up with a value that tells how well the model is performing in terms of classification, so the cross entropy will tell us whether the model is doing a good job classifying observations into the correct class.
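A minimal NumPy sketch of this binary cross-entropy formula, averaged over observations (the label and probability arrays are assumed toy values, not from the course):

```python
import numpy as np

# Toy binary labels and predicted probabilities (assumed for illustration)
y = np.array([0, 1, 1, 0])              # 0 = cat image, 1 = dog image
p = np.array([0.1, 0.8, 0.6, 0.3])      # model's predicted probability of class 1

# Binary cross entropy: -mean( y*log(p) + (1-y)*log(1-p) )
eps = 1e-12                              # small constant to avoid log(0)
bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(f"binary cross entropy = {bce:.4f}")
```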
The next question is: what kind of loss function can we apply when we are dealing with multiclass classification? In this case, when dealing with multiclass classification, we can use the multiclass cross entropy, which is often referred to together with the softmax function. The softmax loss function is a great way to measure the performance of a model that wants to classify observations into one of multiple classes, which means that we are no longer dealing with binary classification but with multiclass classification. One example of such a case is when we want to classify an image as being from a summer theme, a spring theme, or a winter theme; given that we have three different possible classes, we are no longer dealing with binary classification but with multiclass classification, which means that we also need a proper way to measure the performance of the model that will do this classification, and softmax does exactly this. Instead of getting, per observation, two different values which say what the probability is of that observation belonging to class one or class two, we will have a larger vector per observation depending on the number of classes; in this specific example we will end up having three different values, so one vector with three entries per observation, saying what the probability is that this picture is from a winter scene, what the probability is that this observation comes from a summer theme, and thirdly what the probability is that the observation comes from a spring theme. In this way we will have all the classes with their corresponding probabilities. As in the case of the cross entropy, also in the case of the softmax loss, when we have a small value it means that the model is performing a good job in terms of classifying observations into the different classes and we have well-separated classes. One thing to keep in mind when comparing cross entropy and multiclass cross entropy, or the softmax loss, is that we usually use the latter whenever we have more than two classes. You might recall from the introduction of the Transformer model, from the paper "Attention Is All You Need", that as part of the Transformer architecture a softmax layer is also applied, both when we are computing the attention scores and at the end when we want to transform our output into values that make sense and to measure the performance of the Transformer.
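A minimal NumPy sketch of softmax plus multiclass (categorical) cross entropy for the three-season example; the logits and one-hot labels are assumed toy values:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability, then normalize to probabilities
    z = logits - logits.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Toy raw model outputs (logits) for 2 images over 3 classes: winter, summer, spring
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])
probs = softmax(logits)                  # one probability vector per observation

# One-hot true labels: first image is winter, second is summer (assumed)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Multiclass cross entropy: -mean over observations of sum_k y_k * log(p_k)
eps = 1e-12
ce = -np.mean(np.sum(y_true * np.log(probs + eps), axis=1))
print(f"multiclass cross entropy = {ce:.4f}")
```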
The next question is: what is SGD and why is it used in training neural networks? SGD is, like GD, an optimization algorithm used in deep learning in order to optimize the performance of a deep learning model and to find the set of model parameters that will minimize the loss function, by iteratively improving the parameters of the model, including the weight parameters and the bias parameters. The way SGD performs the updates of the model parameters is by using a randomly selected single training observation, or just a few training observations. Unlike GD, which uses the entire training data to update the model parameters in one iteration, SGD uses just a single randomly selected training observation to perform the update. What this basically means is that instead of using the entire training data for each update, SGD makes those updates to the model parameters per training observation. There is also an importance to this random component, the stochastic element in this algorithm, hence the name stochastic gradient descent: SGD randomly samples a single or just a couple of training data points, and using those it performs the forward pass, so it computes the z scores and then the activation scores after applying the activation function, reaches the end of the forward pass where the network computes the output, the y hat, computes the loss, and then performs the backprop only on those few data points, and we get gradients which are no longer the exact gradients. In SGD, given that we are using only a few randomly selected data points or a single data point, instead of having the actual gradients we are estimating the true gradients, because the true gradients are based on the entire training data and in SGD we are using only a few data points for this optimization. What this means is that we get an imperfect estimate of those gradients as part of the backpropagation, so the gradients will contain noise. The result of this is that we make the optimization process much more efficient, because we make those parameter updates very quickly per pass by using only a few data points, and training a neural network on just a few data points is much faster and easier than using the entire training data for a single update. But this comes at the cost of the quality of SGD, because when we use only a few data points to train the model and then compute gradients which are an estimate of the true gradients, these gradients will be very noisy, imperfect, and most likely far off from the actual gradients, which also means that we will make less accurate updates to our model parameters. This means that every time the optimization algorithm tries to find the global optimum and makes those movements per iteration to move one step closer towards that optimum, most of the time it will end up making wrong decisions and picking the wrong direction, given that the gradient is the source of that choice of direction, and every time it will make those oscillations, those movements, which will be very erratic, and it will most of the time end up discovering a local optimum instead of the global optimum. Because every time it uses just a very small part of the training data, it estimates gradients which are noisy, which means that the direction it takes will most likely also be a wrong one, and when you make those wrong moves every time you will start to oscillate. This is exactly what SGD does: it makes those wrong choices when it comes to the direction of the optimization and it ends up discovering a local optimum instead of the global one. Therefore SGD is also known to be a bad optimizer: it is efficient, it is great in terms of convergence time and memory usage, because training the model on a very small amount of data and storing that small data in memory is not computationally or memory heavy, but this comes at the cost of the quality of the optimizer. In the upcoming interview questions we will learn how we can adjust this SGD algorithm in order to improve the quality of this optimization technique.
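As a rough sketch of the difference from the full-batch loop shown earlier (again an assumed toy setup, not the course's own code), an SGD variant updates the parameters from one randomly sampled observation at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy training data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b, eta = np.zeros(3), 0.0, 0.05

for epoch in range(20):
    for _ in range(len(y)):
        i = rng.integers(len(y))              # stochastic part: pick ONE random observation
        x_i, y_i = X[i], y[i]
        y_hat = x_i @ w + b                   # forward pass on a single data point
        error = y_hat - y_i
        dw = 2 * error * x_i                  # noisy estimate of the true gradient
        db = 2 * error
        w -= eta * dw                         # one parameter update per observation
        b -= eta * db
```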
The next question is: why does stochastic gradient descent, SGD, oscillate towards a local minimum? There are a few reasons why this oscillation happens, but first let's discuss what oscillation is. Oscillation is the movement that we have when we're trying to find the global optimum. Whenever we are trying to optimize the algorithm by using an optimization method like GD, SGD, RMSprop, or Adam, we are trying to minimize the loss function, and ideally we want to iteratively change our model parameters so much that we end up with the set of parameters resulting in the minimum, the global minimum of the loss function, not just a local minimum but the global one. The difference between the two is that the local minimum might appear as if it's the minimum of the loss function, but it holds only for a certain area when we are looking at this optimization process, whereas the global optimum really is the actual minimum of the loss function, and that's exactly what we are trying to chase. When we have too many oscillations, which means too many movements while we are trying to find the direction towards that global optimum, this might become problematic, because we are then making too many movements every time, and if those movements are opposite or towards the wrong direction, this will end up resulting in discovering a local optimum instead of the global optimum, something that we are trying to avoid. The oscillations happen much more often in SGD compared to GD, because in the case of GD we use the entire training data in order to compute the gradient, so the partial derivative of the loss function with respect to the parameters of the model, whereas in the case of SGD we learned that we use just some randomly selected single or few training data points in order to compute the gradients and use them to update the model parameters. For SGD this results in too many of these oscillations, because the random subsets that we use are much smaller than the training data and do not contain all the information in the training data, and this means that the gradients that we calculate in each step, when we use entirely different and very small data, can differ significantly: one time we can have one direction, the other time an entirely different direction for our movement in the optimization process, and this huge difference, this variability in the direction because of the huge difference in the gradients, can result in too many of these oscillations, so too much bouncing around when trying to find the right direction towards the global optimum, in this case the minimum of the loss function. That's the first reason, the random subsets. The second reason why in SGD we have too many of those oscillations, those movements, is the step size: the step size, the learning rate, defines how much we need to update the weights and/or the bias parameters, and the magnitude of this update is determined by this learning rate, which then also plays a role in how different these movements will be and how large the jumps will be when we are looking at the oscillations. The third reason why SGD suffers from too many oscillations, which is a bad thing because it will too often result in finding a local optimum instead of the global optimum, is the imperfect estimate: when we compute the gradient of the loss function with respect to the weight parameters or the bias vectors, if this is done on a small sample of the training data then the gradients will be noisy, whereas if we were to use the entire training data, which contains all the information about the relationships between the features and in the data in general, then the gradients would be much less noisy and much more accurate. Because we use gradients based on small data as an estimate of the actual gradients, which are based on the entire training data, this introduces noise, an imperfection when estimating the true gradient, and this imperfection can result in updates that do not always point directly towards the global optimum, which will then cause these oscillations in SGD. So at a higher level I would say that there are three reasons why SGD will have too many of these oscillations: the first one is the random subsets, the second one is the step size, and the third one is definitely the imperfect estimate of the gradient.
The next question is: how is GD different from SGD, so what is the difference between gradient descent and stochastic gradient descent? By now, given that we have gone into so much detail on SGD, I will just give you a higher-level summary of the differences between the two. For this question I would answer by making use of four different factors that cause a difference between GD and SGD: the first factor is the data usage, the second one is the update frequency, the third one is the computational efficiency, and the fourth one is the convergence pattern. Let's go into each of these factors one by one. Gradient descent uses the entire training data when training the model and computing the gradients, and uses these gradients as part of the backpropagation process to update the model parameters. SGD, unlike GD, does not use the entire training data when performing the training process and updating the model parameters in one go; instead, SGD uses just a randomly sampled single observation, or just a couple of training data points, when performing the training and uses the gradients based on those points in order to update the model parameters. That's the data usage, the amount of data that SGD uses versus GD. The second difference is the update frequency: given that GD updates the model parameters based on the entire training data every time, it makes far fewer of these updates compared to SGD, because SGD updates the model parameters very frequently, every time for a single data point or just a couple of training data points, unlike GD, which has to use the entire training data for just one single set of updates. This causes SGD to make those updates much more frequently, using just a very small amount of data. That's the difference in terms of update frequency. Another difference is the computational efficiency: GD is less computationally efficient than SGD because GD has to use the entire training data, make the computations, so the backpropagation, and then update the model parameters based on this entire training data, which can be computationally heavy, especially if you're dealing with very large and very complex data. Unlike GD, SGD is much more efficient and very fast because it uses a very small amount of data to perform the updates, which means it requires less memory to store the data and takes much less time to find a global optimum, or at least what it thinks is the global optimum, so the convergence is much faster in the case of SGD compared to GD, which makes it much more efficient than GD. The final factor that I would mention as part of this question is the convergence pattern: GD is known to be smoother and of higher quality as an optimization algorithm than SGD, and SGD is known to be a bad optimizer. The reason for this is that the efficiency of SGD comes at the cost of its quality in finding the global optimum. SGD makes all these oscillations given that it uses a very small part of the training data when estimating the true gradients, whereas GD uses the entire training data, so it doesn't need to estimate the gradients; it is able to determine the exact gradients. This causes a lot of oscillations in the case of SGD, while in the case of GD we don't have all these oscillations, so the amount of movement the algorithm makes is much smaller. SGD takes much less time to find what it believes is the global optimum, but unfortunately most of the time it confuses the global optimum with a local optimum: SGD ends up making these many movements and discovering a local optimum and confusing it with the global optimum, which is of course not desirable, because we would like to have the actual global optimum, the set of parameters that will actually minimize the loss function and find its minimum value. GD is the opposite: because it uses the true gradients, it is most of the time able to identify the true global optimum.
The next question is: how can we use optimization methods like GD in a more improved way, so how can we improve GD and what is the role of the momentum term? Whenever you hear momentum together with gradient descent, try to automatically think of SGD with momentum, because SGD with momentum is basically the improved version of SGD, and as long as you know the difference between SGD and GD, it will be much easier for you to explain what SGD with momentum is. We just discussed that SGD suffers from oscillations, so too many of those movements, and a lot of the time, because we are using a small amount of training data to estimate the true gradients, this will result in entirely different gradients and too many different sorts of updates in the weights. Of course that's something we want to avoid, because we saw and explained that too many of those movements will end up causing the optimization algorithm to mistakenly confuse the global optimum and a local optimum: it will pick the local optimum and think that it is the global optimum, but that's not the case. To solve this problem and improve the SGD algorithm, while taking into account that SGD in many aspects is much better than GD, we came up with the SGD with momentum algorithm, where SGD with momentum basically takes the benefits of SGD and then also tries to address the biggest disadvantage of SGD, which is this excess of oscillations. The way SGD with momentum does this is that it introduces the idea of momentum. Momentum is basically a way to push the optimization algorithm towards a better direction and reduce the amount of oscillations, so the amount of these random movements, and the way it does that is by adding a fraction of the previous updates we made to the model parameters, which we assume will be a good indication of the more accurate direction at this specific time step. Imagine that we are at time step t and we need to make the update: what momentum does is look at the previous updates and use the more recent updates more heavily, saying that the more recent updates will most likely be a better representation of the direction that we need to take than the very old updates, and when we take these very recent updates in the optimization process into account, we have a better and more accurate way of updating the model parameters.
Let's look at the mathematical representation just for a quick refresher. What SGD with momentum tries to do is accelerate the convergence process and, instead of having too many movements in different directions and too many different gradients and updates, it tries to stabilize this process and have more consistent updates. As part of the momentum method we obtain a momentum term v_{t+1} for the update at time step t+1, which takes the gamma (γ) and multiplies it by v_t, plus the learning rate eta (η) times the gradient, where the nabla symbol (∇) with θ underneath followed by J(θ_t) simply means the gradient of the loss function with respect to the parameter θ. So the momentum term is: v_{t+1} = γ·v_t + η·∇_θ J(θ_t). What this basically says is that we compute the momentum term for time step t+1 based on the previous updates, through the term γ·v_t, plus the common term that we saw before for SGD and GD, which is the learning rate η multiplied by the first-order partial derivative of the loss function with respect to the parameter θ. We then simply subtract this momentum term from our current parameter θ_t in order to obtain the updated version, so θ_{t+1} = θ_t − v_{t+1}, where θ is simply the model parameter. In this way we perform the updates in a more consistent way: we introduce consistency into the direction by weighting the recent adjustments more heavily, and it builds up momentum, hence the name. The momentum builds up speed towards the direction of the global optimum through more consistent gradients, enhancing the movement towards this global optimum, the global minimum of the loss function, and this in turn will of course improve the quality of the optimization algorithm, and we will end up discovering the global optimum rather than a local optimum. To summarize, what SGD with momentum does is take the SGD algorithm, so it again uses a small amount of training data when performing the model parameter updates, but unlike plain SGD it tries to replicate GD's quality when it comes to finding the actual global optimum, and the way it does that is by introducing this momentum term, which helps introduce consistency into the updates and reduce the oscillations the algorithm makes, producing a much smoother path towards discovering the actual global optimum of the loss function.
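A minimal sketch of the momentum update described above, reusing the assumed toy linear-regression setup from the earlier snippets (variable names like gamma and v_w are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b = np.zeros(3), 0.0
eta, gamma = 0.01, 0.9                        # learning rate and momentum coefficient
v_w, v_b = np.zeros(3), 0.0                   # momentum terms, start at zero

for epoch in range(50):
    for _ in range(len(y)):
        i = rng.integers(len(y))              # SGD part: one random observation
        error = (X[i] @ w + b) - y[i]
        dw, db = 2 * error * X[i], 2 * error
        # Momentum term: v_{t+1} = gamma * v_t + eta * gradient
        v_w = gamma * v_w + eta * dw
        v_b = gamma * v_b + eta * db
        # Parameter update: theta_{t+1} = theta_t - v_{t+1}
        w -= v_w
        b -= v_b
```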
The next question is: compare batch gradient descent to mini-batch gradient descent and to stochastic gradient descent. Here we have three different versions of the gradient descent algorithm: the traditional batch gradient descent, often referred to simply as GD; the second algorithm is the mini-batch gradient descent; and the third algorithm is SGD, the stochastic gradient descent. The three algorithms are very close to each other, but they differ in terms of their efficiency and the amount of data they use when performing the model training and the model parameter updates, so let's go through them one by one. The batch gradient descent, this is the original GD: this method involves the traditional approach of using the entire training data for each iteration when computing the gradients, so doing the backprop, and then taking these gradients as an input for the optimization algorithm to perform a single update of the model parameters, then on to the next iteration, again using the entire training data to compute the gradients and update the model parameters. Here, for the batch gradient descent, we are not estimating the true gradients, we are actually computing the gradients, because we have the entire training data. Batch gradient descent, thanks to this quality of using the entire training data, has very high quality, so it's very stable and able to identify the actual global optimum; however, this comes at a cost of efficiency, because the batch gradient descent uses the entire training data, it needs to put this entire training data into memory every time, and it is very slow when performing the optimization, especially when dealing with large and complex data sets. Next we have the other extreme of the batch gradient descent, which is SGD. SGD, unlike GD, and we saw this previously when discussing the previous interview questions, uses stochastically, so randomly, sampled single or just a few training observations in order to perform the training, so computing the gradients, performing the backprop, and then using the optimization to update the model parameters in each iteration, which means that we do not compute the actual gradients but estimate the true gradients, because we use just a small part of the training data. This of course comes at a cost of the quality of the algorithm: although it's efficient to use only a small sample from the training data when doing the backprop and the training, you don't need to store the entire training data in memory but just a very small portion of it, we perform the model updates quickly and find the so-called optimum much quicker compared to GD, but this comes at the cost of the quality of the algorithm, because it starts to make too many of these oscillations due to the noisy gradients, which then ends up confusing the global optimum with a local optimum. Then finally we have our third optimization algorithm, which is the mini-batch gradient descent, and this mini-batch gradient descent is basically the middle ground between the batch gradient descent and the original SGD, the stochastic gradient descent. The way mini-batch works is that it tries to strike a balance between the traditional GD and SGD: it tries to take the advantages of SGD when it comes to efficiency and combine them with the advantages of GD when it comes to stability, consistency of the updates, and finding the actual global optimum. The way it does that is by randomly sampling the training observations into batches, where each batch is much bigger compared to SGD, and it then uses these smaller portions of the training data in each iteration to do the backprop and then update the model parameters. Think of this like k-fold cross validation, where we sample our training data into k different folds, in this case batches, and then use these in order to train the model, and in the case of neural networks to use the mini-batch gradient descent to update the model parameters such as the weights and bias vectors. The three have a lot of similarities but they also have differences, and in this interview question your interviewer is trying to test whether you understand the benefits of each of them and what the purpose of having mini-batch gradient descent is.
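A minimal sketch of the mini-batch variant, again on the assumed toy regression setup; the batch size of 16 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b, eta, batch_size = np.zeros(3), 0.0, 0.05, 16

for epoch in range(50):
    # Shuffle the data, then split it into mini-batches for this epoch
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        error = (X_b @ w + b) - y_b           # forward pass on one mini-batch
        dw = 2 * X_b.T @ error / len(y_b)     # gradient estimated from the batch
        db = 2 * np.mean(error)
        w -= eta * dw                         # one update per mini-batch
        b -= eta * db
```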
The next question is: what is RMSprop and how does it work? We just saw that RMSprop is one of the examples of what can be defined as an adaptive optimization process. RMSprop stands for root mean squared propagation, and it is, like GD, SGD, and SGD with momentum, an optimization algorithm that tries to minimize the loss function of your deep learning model, to find the set of model parameters that will minimize the loss function. What RMSprop does is try to address some of the shortcomings of the traditional gradient descent algorithm, and it is especially useful when we are dealing with the vanishing gradient problem or the exploding gradient problem. We saw before that a very big problem during the training of deep neural networks is this concept of vanishing or exploding gradients: when the gradients start to converge towards zero, so they become very small and almost vanish, or when the gradients are so big that they are exploding, becoming very large, and they result in a large amount of oscillations. To avoid this, what RMSprop does is use an adaptive learning rate: it adjusts the learning rate, and for this process it uses the idea of a running average of the squared gradients, which is loosely related to second-order information and the concept of the Hessian. It also uses a decay parameter, which regulates how much of the average magnitude of the recent gradients we take into account when updating the model parameters, so basically the amount of information that we need to take into account from the recent adjustments. In this case this means that parameters with large gradients will have their effective learning rate reduced, so whenever we have large gradients for a parameter we will be damping its updates, which means that we control the exploding gradient effect, and of course the other way around holds true: in the case of RMSprop, for the parameters that have small gradients, we will be increasing their effective learning rate to ensure that the gradient does not vanish, and in this way we will be controlling and smoothing the process. RMSprop uses this decay rate, the beta, which is a number usually around 0.9, and it controls how quickly the running average forgets the oldest gradients. As you can see, we have this running average v_t which is equal to β·v_{t−1} + (1 − β)·g_t², so this is basically our squared-gradient term. Then what we do is take this running average and use it to adapt and adjust our learning rate: in the second expression, θ_{t+1}, the updated version of the parameter, is equal to θ_t, the current parameter, minus the learning rate divided by the square root of this running average v_t plus some epsilon, which is usually a small number just to ensure that we are not dividing η by zero in case our running average is equal to zero, and then we simply multiply this by our gradient, so θ_{t+1} = θ_t − (η / (√v_t + ε))·g_t. As you can see, depending on the parameter we will have a different effective learning rate, and we will keep updating it. By adapting the learning rate in this way, RMSprop stabilizes the optimization process, prevents all these random movements, these oscillations, and at the same time ensures smoother convergence; it also ensures that our network, especially for deep neural networks, doesn't suffer from the vanishing gradient problem and from the exploding gradient problem, which can be a serious problem when we are trying to optimize our deep neural network.
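A minimal sketch of this RMSprop update on the same assumed toy problem (the beta, eta, and eps values are illustrative defaults, not prescribed by the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy data (assumption)
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

w, b = np.zeros(3), 0.0
eta, beta, eps = 0.01, 0.9, 1e-8              # learning rate, decay rate, small epsilon
v_w, v_b = np.zeros(3), 0.0                   # running averages of squared gradients

for epoch in range(100):
    error = (X @ w + b) - y
    dw = 2 * X.T @ error / len(y)
    db = 2 * np.mean(error)
    # Running average of squared gradients: v_t = beta*v_{t-1} + (1-beta)*g_t^2
    v_w = beta * v_w + (1 - beta) * dw ** 2
    v_b = beta * v_b + (1 - beta) * db ** 2
    # Parameter update with the adapted learning rate: theta -= eta / (sqrt(v) + eps) * g
    w -= eta / (np.sqrt(v_w) + eps) * dw
    b -= eta / (np.sqrt(v_b) + eps) * db
```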
The next question is: what are L2 and L1 regularizations and how do they prevent overfitting in a neural network? Both L1 and L2 regularizations are shrinkage or regularization techniques that are used both in traditional machine learning and in deep learning in order to prevent the model from overfitting, so to make the model more generalizable, like dropout. You might know from traditional machine learning models what L1 and L2 regularization do: L2 regularization is also referred to as Ridge regression and L1 regularization is also referred to as Lasso regression. What L1 regularization does is add a regularization or penalization term that is based on the penalization parameter lambda (λ) multiplied by a term based on the absolute values of the weights. This is different from L2 regularization, the Ridge regularization, which adds to our loss function a regularization term based on lambda, the penalization parameter, multiplied by the square of the weights. You can see how the two are different: one is based on what we call the L1 norm and the other on what we call the L2 norm, hence the names L1 and L2 regularization. Both of them are used with the same motivation, to prevent overfitting. What L1 does differently from L2 is that L1 can set the weight of certain neurons exactly equal to zero, so in some way it also performs feature selection, whereas L2 regularization shrinks the weights towards zero but never sets them exactly equal to zero, so in this respect L2 doesn't perform feature selection and only performs regularization, whereas L1 can be used not only for shrinking the weights and regularizing the network but also for performing feature selection when you have too many features. You might be wondering how this helps to prevent overfitting: well, when you shrink the weights towards zero and regularize these small or large weights, methods such as L1 or L2 regularization will ensure that the model doesn't overfit to the training data. You regularize the weights, and this in turn regularizes the network, because the weights define how much erratic behavior will be prevented: if you have too large weights and you reduce and regularize them, it will ensure that you don't have exploding gradients, it will also ensure that the network doesn't rely heavily on certain neurons, and this will then ensure that your model is not overfitting and not memorizing the training data, which might also include noise and outlier points.
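A minimal NumPy sketch of adding these penalty terms to a loss (the lambda value, weights, and data-loss number are assumed toy values; in practice frameworks such as TensorFlow or scikit-learn expose this as a regularization option):

```python
import numpy as np

# Toy weights of a layer and a toy data loss value (assumptions for illustration)
w = np.array([0.8, -1.2, 0.0, 2.5])
data_loss = 0.42          # e.g. an MSE or cross-entropy value computed elsewhere
lam = 0.01                # penalization parameter lambda

l1_penalty = lam * np.sum(np.abs(w))   # L1 / Lasso: sum of absolute weights
l2_penalty = lam * np.sum(w ** 2)      # L2 / Ridge: sum of squared weights

loss_with_l1 = data_loss + l1_penalty  # minimizing this can push some weights exactly to zero
loss_with_l2 = data_loss + l2_penalty  # minimizing this shrinks weights towards, but not to, zero
print(loss_with_l1, loss_with_l2)
```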
The next question is: what is the curse of dimensionality in machine learning and AI? The curse of dimensionality is a known phenomenon in machine learning, especially when we are dealing with distance-based, neighborhood-based models like KNN or K-means, where we need to compute distances using distance measures such as the Euclidean distance, cosine distance, or Manhattan distance. Whenever we have high-dimensional data, so many features in our data, the model starts to really suffer from the curse of dimensionality: the complexity rises when the model needs to compute these distances between pairs of observations, but given that we have so many features it becomes problematic and sometimes even infeasible to obtain those distances, and calculating these distances in some cases doesn't even make sense, because they no longer reflect the actual pairwise relationship, the distance between those two observations, when we have so many features. That's what we call the curse of dimensionality: we have a curse on our ML or AI model when we have high dimensionality and we want to compute these distances between pairs of observations. This can introduce data sparsity, computational challenges, and a risk of overfitting for our problem, the model becomes less generalizable, and it also becomes a problem in terms of picking a distance measure that can handle this high dimensionality of our data.
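As a small illustration of why distances stop being informative in high dimensions (an assumed NumPy experiment, not from the course), you can compare pairwise Euclidean distances for random points in low versus high dimension and watch the relative spread between the nearest and farthest pairs shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_features, n_points=100):
    # Random points in the unit hypercube of the given dimensionality
    points = rng.random((n_points, n_features))
    # All pairwise Euclidean distances
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    d = dists[np.triu_indices(n_points, k=1)]
    # Relative contrast: how different the closest and farthest pairs are
    return (d.max() - d.min()) / d.mean()

for dim in [2, 10, 100, 1000]:
    print(f"{dim:>4} features: relative distance spread = {distance_spread(dim):.3f}")
```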