Data Science Full Course | Learn Data Science in 3 Hours | Data Science for Beginners | Edureka

By edureka!

Summary

## Key takeaways

- **Data Science: A Broad, Interdisciplinary Field**: Data science is an interdisciplinary field that extracts knowledge and insights from data in various forms, employing methods from mathematics, statistics, information science, and computer science. [02:19]
- **Career Paths in Data: Analyst, Scientist, ML Engineer**: The data science career path typically starts with a foundation in mathematics and statistics, leading to roles like Data Analyst, Data Scientist, or Machine Learning Engineer, each requiring specific skill sets and offering different salary potentials. [03:25]
- **Machine Learning: Learning from Data, Not Explicit Programming**: Machine learning enables systems to automatically learn and improve from experience without explicit programming, allowing computers to program themselves and make decisions using data by detecting patterns and adjusting actions. [25:11]
- **Types of Machine Learning: Supervised, Unsupervised, Reinforcement**: Machine learning is categorized into supervised learning (input and output variables), unsupervised learning (unlabeled data for clustering), and reinforcement learning (learning through rewards and penalties in an environment). [40:57]
- **Deep Learning: Mimicking the Human Brain with Neural Networks**: Deep learning, a subset of machine learning, uses deep neural networks to mimic the human brain's functionality, enabling it to handle high-dimensional data and identify relevant features automatically. [22:11]
- **Key Python Libraries for Data Science**: Essential Python libraries for data science include Matplotlib and Seaborn for visualization, Pandas for data manipulation, NumPy for numerical operations, and Scikit-learn for machine learning algorithms. [39:01]

Topics Covered

The Alarming Rate of Data Generation
Data Science: More Than Just Big Data
Data Scientist vs. Data Analyst vs. ML Engineer Salaries
Data Science vs. AI, ML, and Deep Learning: Clarifying the Terms
Becoming a Data Scientist: The Need for Rigorous Training and Practice

Full Transcript

Hello everyone and welcome to this interesting session on data science

full course. So before we begin let's have a quick look at the agenda of this

session so first of all I'll be starting off by explaining you guys about the

evolution of data how it led to the growth of data science, machine learning,

AI and all the different aspects of data. Then we'll have a quick introduction to

data science, understand what exactly it is then we'll move forward to the data

science careers and the salary and understand what are the different job

profiles in the data science career path how to become a data scientist data

analyst or a machine learning engineer. Then we'll move on to the first and the

foremost part of data science which is statistics and after completing statistics

we'll move on to machine learning where we'll understand what exactly is machine

learning what are the different types of machine learning and how are they used

and where are they used the different algorithms and next we'll understand

what is deep learning and how deep learning is different from machine

learning, what is the relationship between AI, machine learning and deep

learning in terms of data science and understand how exactly neural network

works, how to create a neural network and much more, So let's begin our session now.

Data is increasingly shaping the systems that we interact with every day, whether

you are searching something on Google using Siri or browsing your Facebook

feed you are consuming the result of data analysis. It is increasing at a very

alarming rate where we are generating 2.5 quintillion bytes of it every day.

Now that's a lot of data and considering there are more than 3 billion

Internet users in the world a quantity that has tripled in the last 12 years

and 4.3 billion cell phone users that's a heck lot of data and this rapid growth

has generated opportunity for new professionals who can make sense out of

this data. Now, given its transformative ability, it's no wonder that so many

data-related jobs have been created in the past few years like data analysts,

data scientists, machine learning engineers, artificial intelligence

engineers and much more. And before we delve into

the details of all of these different professionals, let's understand exactly

what data science is. So data science, also known as data-driven science, is an

interdisciplinary field about scientific methods, processes and systems to extract

knowledge or insights from data in various forms, either structured or

unstructured. It is the study of where information comes from what it

represents and how it can be turned into a valuable resource in the creation of

business and IT strategies. So data science employs many techniques and

theories from fields like mathematics, statistics, information science as well

as computer science, and can be applied to small data sets also yet most people

think data science is when you are dealing with big data or large amounts

of data. So this brings the question which job profile is suitable for you, is

it the data analysts, the data scientist or the machine learning engineer. Now

data scientist has been called the sexiest job of the 21st century; nonetheless,

data science is a hot and growing field, so before we drill into data science

let's discuss all of these profiles one by one and see what these roles are and how

they work in the industries. So a data science career usually starts with

mathematics and stats as the base, which brings up the first profile in our data

science career path, which is the data analyst. So a data analyst delivers value

to the companies by taking information about specific topics and then

interpreting analyzing and presenting the finding in comprehensive reports now

many different types of businesses use data analysts to help. As experts, data

analysts are often called on to use their skills and tools to provide competitive

analysis and identify trends within industries. Most entry-level professionals

interested in going into data-related jobs start off as data analysts.

Qualifying for this role is as simple as it gets: all you need is a bachelor's

degree in computer science or mathematics and good statistical knowledge. Strong

technical skills would be a plus and can give you an edge over most other

applicants. So next we have data scientists. There are several definitions

available on data scientists, but in simple words a data

scientist is one who practices the art of data science. The highly popular term

data scientist was coined by DJ Patil and Jeff Hammerbacher. Data scientists

are those who crack complex data problems with a strong expertise in

certain scientific disciplines they work with several elements related to

mathematics statistics computer science and much more now data scientists are

usually business analysts or data analysts with a difference it is a

position for specialists and you can specialize in different types of skills

like speech analytics text analytics which is the natural language processing

image processing video processing medicine simulation material simulation

now each of these specialist roles is very limited in number and hence the

value of such a specialist is immense. Now if we talk about AI or machine

learning engineers, machine learning engineers are sophisticated programmers

who develop machines and systems that can learn and apply knowledge without

specific direction artificial intelligence is the goal of a machine

learning engineer they are computer programmers but their focus goes beyond

specifically programming machines to perform specific tasks now they create

programs that will enable machines to take actions without being specifically

directed to perform those tasks so now if we have a look at the salary trends

of all of these professionals so starting with a data analyst the average

salary in the u.s. is around 83,000 dollars or it's almost close to eighty

four thousand dollars whereas in India it's around four lakh and four thousand

rupees per annum. Now coming to data scientists, the average salary is around

ninety-one thousand dollars (about 91.5 thousand dollars), and in India it is

almost seven lakh rupees. And finally, for ML engineers, the average salary

in the U.S. is around one hundred and eleven thousand dollars, whereas in India

it is around seven lakh and twenty thousand rupees. So as you can see, the data

scientist and ML engineer positions are somewhat higher positions which require a

certain degree of expertise in that field so that's the reason why there is

a difference in the salary of all the three professionals so if you have a

look at the road map of becoming any one of these professionals,

what one first needs to do is earn a bachelor's degree.

now this bachelor's degree can be in either computer science mathematics

information technology statistics finance or even economics now after

completing a bachelor's degree, next comes fine-tuning the technical

skills. Fine-tuning the technical skills is one of the most important parts in the

roadmap, where you learn all the statistical methods and packages. You

either learn R, Python, or SAS, languages which are very important. You learn about

data warehousing business intelligence data cleaning visualization reporting

techniques. Working knowledge of Hadoop and MapReduce is very, very important, and

if you talk about machine learning techniques it is one of the most

important parts of the data science career now apart from these technical

skills there are also some business skills which are very much required so

this involves analytical problem-solving effective communication creative

thinking as well as industry knowledge now after fine-tuning your technical

skills and developing all the business skills you have the options of either

going for a job or either going for a master's degree or certification

programs. Now I would suggest you go for a master's degree, as just coming out

of the BTech world and having the technical skills is not enough so you

need to have a certain level of expertise in the field so it's better to

go for any masters or PhD programs which are in computer science statistics or

machine learning you can also go for big data certifications and you can also go

for industry certifications regarding the data analysis machine learning or

the data science. It so happens that Edureka also provides a machine learning

data analysis as well as a data science certification training they have

master's program which are equivalent to a master's degree which you get from a

certain University so do check it out guys I'll leave the link to all of these

in the description box below and after you have completed the master's degree

what comes is working on the projects which are related to this field so it's

better if you work on machine learning, deep learning, or data analytics projects

that will give you an edge over other competitors while applying for a job

scenario so a certain level of expertise in the field is also required and this

is how you will succeed in the data science career path. There are certain

skills which are required which I was talking about earlier the technical

skills and the non technical skills now if you talk about the skills which are

required to become all of these professions so they are mostly the same

so for any data analyst first of all you need to have analytical skills which

involves Maths having good knowledge of matrix multiplications the Fourier

transformations and all. Next we have communication skills, so companies looking for

a data analyst require someone who has good communication skills, who can

explain all of their technical terms to non-technical teams such as marketing or

the sales team another important skill required is critical thinking you need

to think in certain directions and gain insights from the data so that's one of

the most important parts of a data analyst's job. Obviously you need to pay

attention to the details, as a minor shift or deviation in the result or

in the calculation what you say the analysis might result in some sort of

loss of the company it's not necessarily to create a loss but it's better to

avoid any kind of deviation from the results so paying attention to the

detail is very very important and then again we talk about the mathematical

skills knowing about all the types of differentiations and integrations is

going to help a lot because you know a lot of machine learning algorithms as I

would say are mostly mathematical terms or mathematical functions so having good

knowledge of mathematics is also required apart from this the usage of

technical tools such as Python, R, and SAS. You need to know about the

big data ecosystem how it works the HDFS how to extract data create a pipeline

you know about JavaScript a little and if you talk about the skills of data

scientist it's almost the same having analytical and statistics knowledge now

another important part here is to know the machine learning algorithms as it

plays an important role in the data science career. Problem-solving skills,

obviously now another important aspect if you talk about the skill which

differs from that of a data analyst is only deep learning so deep learning I'll

talk about deep learning later in the second half or the later part of the

video so having a good knowledge of deep learning and the various frameworks such

as TensorFlow, PyTorch, and Theano, all of this is

very required for data scientists and again business communication as I

mentioned earlier is very much required because as you know these are one of the

technical roles most technical roles in the industries and the output of these

roles, or what I would say the output of what these professionals do, is not that

much technical is more business oriented so they have to explain all of these

findings to either the non-technical teams the sales the marketing and again

you need the technical tools and the skills now for machine learning engineer

obviously programming languages: having good knowledge of R, Python, C++ or Java,

it's very much required you need to know about calculus and statistics as I

mentioned earlier, learning about matrices and integration. Now another

important skill here is signal processing so a lot of times machine

learning engineers have to work on robots and signal processing they work

on human-like robots they work on robotics which mimic human behavior so a

lot of signal processing techniques are also required in this field applied

mathematics as I mentioned earlier and again neural networks it is one of the

base of artificial intelligence which is being used and again we have natural

language processing so as you know we have personal assistants like Siri and

Cortana and they work on language processing and not just language

processing you have audio processing as well as video processing so that they

can interact with a real environment and provide a certain answer to a particular

question so these were the skills I would say for all of these three roles

next if we have a look at the peripherals of data science so first of

all we have statistics needless to say there are programming languages we have

short read integrations then we have machine learning which is a big part of

data science and then again we have big data so let's start with statistics

which is the first area of data science or I should say the first milestone

which we should cover so for statistics let's understand first

what exactly is data so data in general terms refers to facts

and statistics collected together for reference or analysis when working with

statistics, it's important to recognize the different types of data. So data can

be broadly classified into numerical categorical and ordinal now data with no

inherent order or ranking such as a gender or race is called nominal data so

as you can see in the type 1 we have male female male female that is nominal

data now data with an ordered series is

called ordinal data so as you can see here we have an ordered series where we

have the customer IDs and the rating scale. Now data with only two options

is called binary data. Now in this type of data there are only two options,

like either yes or no or true or false or 1 or 0 so as you can see here we have

customer ID and in the owner or car column we have either yes or no now the

types of data we just discussed all describe the quality of something in

size appearance value or something such kind of data is broadly classified into

qualitative data now data which can be categorized into a classification data

which is based upon counts there is only a finite number of values possible and

the values cannot be subdivided meaningfully is called discrete data so

as you can see here in our example we have organization and the number of

products so this cannot be subdivided into number of sub products right and if

you talk about data which can be measured on a continuum or a scale, now

data which can have almost any numeric value and can be subdivided into finer

and finer increments is called continuous data so as you can see here

in patient ID we have weight of the patient it is 6.5 kgs now kgs can be

subdivided into grams and milligrams, and finer refinement is also possible. Now

this type of data that can be measured by the quantity of something rather than

its quality is called quantitative data. Now that we are familiar with the

different types of data qualitative and quantitative it's time to understand the

types of variables we have now there are majorly two types of variables dependent

and independent variables so if you want to know whether caffeine affects your

appetite the presence or the absence of the amount of

caffeine would be the independent variable and how hungry you are would be

the dependent variables so in statistics dependent variable is the outcome of an

experiment: as you change the independent variable, you watch what happens to the

dependent variable whereas if you talk about independent variable a variable

that is not affected by anything that you or the researcher does usually

plotted on the x-axis now the next step after knowing about

the datatypes and the variables is to know about population and sampling and

that comes into experimental research now in experimental research the aim is

to manipulate an independent variable and then examine the effect that this

change has on a dependent variable now since it is possible to manipulate the

independent variable experimental research has the advantage of enabling a

researcher to identify a cause and effect between the variables well

suppose there are 100 volunteers at the hospital and a doctor needs to check the

working of a particular medicine which has been cleared by the government so

the doctor divides those hundred patients into two groups of 50 and then

asked one group to take one type of medicine and the other group to not take

any medicine at all and then, after some time, compares the results. And in non

experimental research the researcher does not manipulate the independent

variable this is not to say that it is impossible to do so but it will either

be impractical or it will be unethical to do so so for example a researcher may

be interested in the effect of illegal recreational drug use, which is the

independent variable, on certain types of behavior, which is the dependent variable.

However, while it is possible, it would be unethical to ask an individual to take

illegal drugs in order to study what effect this has on certain behaviors. It

is always good to go for experimental research rather than non experimental

research so next in our session we have population and sampling those are two of

the most important terms in statistics so let's understand these terms so in

statistics the term population is the entire pool from which a sample is drawn.

Statisticians also speak of a population of objects or events or procedures or

observations, including such things as the quantity or

the number of vehicles owned by any person. Now a population is thus an

aggregate of creatures, things, cases and so on. And since a population commonly contains

too many individuals to study conveniently, an investigation is often

restricted to one or more samples drawn from it. Now a well-chosen sample will

contain most of the information about a particular population parameter but the

relationship between the sample and the population must be such as to allow true

inferences to be made about a population from that sample for that we have

different types of sampling techniques so in probabilities there are sampling

methods which are classified either as probability or non probability so in

probability sampling each member of the population has a known nonzero

probability of being selected. Probability methods include random sampling,

systematic sampling and stratified sampling whereas in nonprobability

sampling members are selected from a population in some non-random manner;

these include convenience sampling, judgement sampling, quota sampling and

snowball sampling while sampling is important there is another term which is

known as sampling error so sampling error is a degree to which a sample

might differ from the population when inferring to a population results are

reported plus or minus the sampling error now in probability sampling there

are three terms which are random sampling systematic sampling and

stratified sampling. So talking about random sampling, each

member of the population has an equal chance of being selected; such a

type of sampling is random sampling. Now if we talk about systematic sampling, it

is often used instead of random sampling, and it is also called the Nth name

selection technique. Now pay attention to the name, Nth name: after the

required sample size has been calculated, every Nth record is selected from the

list of population members. Now its only advantage over the random sampling

technique is its simplicity. Now the final type of sampling is stratified

sampling. So a stratum is a subset of the population that shares at least one

common characteristic. The researcher first identifies the relevant strata

and their actual representation in the population

before analysis so now that we know how our data is and what kind of sampling is

done let's have a look at the measure of center which helps describe to what

extent this pattern holds for a specific numerical value so as you can see in

measure of center we have three terms which are the mean median and mode and

I'm sure everyone must be aware of all of these terms
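As a quick refresher on the three measures of center just named, here is a minimal sketch using Python's standard `statistics` module. The `sales` list is a hypothetical sample made up for illustration; it does not come from the video.

```python
from statistics import mean, median, mode

# Hypothetical sample: products sold per day (illustrative values only).
sales = [4, 7, 7, 9, 13]

center = {
    "mean": mean(sales),      # arithmetic average: sum of values / number of values
    "median": median(sales),  # middle value once the data is sorted
    "mode": mode(sales),      # most frequently occurring value
}
print(center)
```

With an odd number of points the median is simply the middle element; with an even number, `median` averages the two middle elements.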

I'll not get into the details of these terms what's more important is to know

about the measure of spreads now a measure of spread sometime called a

measure of dispersion is used to describe the variability in the sample

or population. It is usually used in conjunction with a measure of central

tendency such as the mean or median to provide an overall description of a set

of data. Now if you talk about deviation, it is the difference between each x_i

and the mean for a sample or population, which is known as the deviation about

the mean, whereas variance is based on deviation and entails computing squares

of deviations. So as you can see here we have the formula for the variance, which

is the sum of the squared differences between each data point and the mean,

divided by the total number of data points. Standard

deviation is basically the square root of variance, so as you can see the formula

is the same, just with the square root over the variance. So that was standard

deviation and variance. Another topic in probability and statistics is skewness. So

skewness is a measure of symmetry or more precisely the lack of symmetry so

as you can see here we have left skewed symmetric non symmetric left skewed we

have right skewed so normally distributed curves are the most

symmetric curves we'll talk about normal distribution later
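The measures of spread described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact formulas from the slides: the variance divides by the total number of data points, as in the spoken description, and the skewness uses the moment-based (Fisher-Pearson) coefficient, which is an assumption since the video gives no formula for it. The function names and the `data` list are hypothetical.

```python
import math

def deviations(data):
    """Deviation of each data point about the mean."""
    m = sum(data) / len(data)
    return [x - m for x in data]

def variance(data):
    """Population variance: sum of squared deviations divided by N."""
    return sum(d * d for d in deviations(data)) / len(data)

def std_dev(data):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(data))

def skewness(data):
    """Moment-based skewness (assumed formula): m3 / m2**1.5.

    Zero for symmetric data, positive for right-skewed,
    negative for left-skewed data.
    """
    devs = deviations(data)
    n = len(data)
    m2 = sum(d ** 2 for d in devs) / n
    m3 = sum(d ** 3 for d in devs) / n
    return m3 / m2 ** 1.5

# Illustrative values only: mean is 5, with a longer tail on the right.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data), std_dev(data), skewness(data))
```

If the data is a sample rather than the whole population, the usual adjustment is to divide by N - 1 instead of N in the variance.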

so after skewness what we need to know about is the confusion matrix now

a confusion matrix is a tabular representation of actual versus

predicted values. Now this helps us find the accuracy of the model when we are

creating any machine learning or deep learning model. To find the accuracy

what we do is plot a confusion matrix, so what you need to do is calculate

the accuracy of your model by adding the true positives and the true negatives

and dividing by the true positives plus true negatives plus false positives

plus false negatives; that will give you the accuracy of the model. So as you can

see in the image we have good bad for predicted as well as actual and as you

can see here the true positive D and the true negative A are the two areas where

the prediction is correct: in D we predicted it was good and the actual value was good; in true negative A

we predicted it was bad and actually it's bad. So models which get

the higher true positives and true negatives are the ones which have the

higher accuracy, so that's what confusion matrices are for. Now the next term, and a

very important term in statistics is probability so probability is the

measure of how likely something will occur it is the ratio of desired

outcomes to the total outcomes now if I roll a dice there are six total

possibilities one two three four five and six
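The ratio of desired outcomes to total outcomes can be checked directly on the die. This is a small sketch that assumes a fair die, so every face is equally likely; the `probability` helper is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

# The six faces of a fair die (assumed equally likely).
outcomes = [1, 2, 3, 4, 5, 6]

def probability(event, outcomes):
    """Desired outcomes divided by total outcomes, as an exact fraction."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_two = probability(lambda face: face == 2, outcomes)       # a single face
p_even = probability(lambda face: face % 2 == 0, outcomes)  # faces 2, 4, 6
print(p_two, p_even)
```

Using `Fraction` keeps the result exact instead of a rounded decimal, which makes the ratio easier to compare with the spoken definition.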

now each possibility has one outcome so each has a probability of one out of six

now for instance the probability of getting a number two is one out of six

since there is only a single two on the dice now when talking about the

probability distribution techniques or the terminologies there are three

possible terms which are the probability density function normal distribution and

the central limit theorem so probability density function it is the equation

describing a continuous probability distribution so it is usually referred

as PDF now if we talk about normal distribution so the normal distribution

is a probability distribution that associates the normal random variable X

with a cumulative probability the normal distribution is defined by the following

equation, so as you can see here Y is 1 by sigma into the square root of 2 pi, the

whole multiplied by e raised to the power of minus (x minus mu) whole squared divided by

2 sigma squared, where x is a normal random variable, mu is the mean and sigma

is the standard deviation now the central limit theorem states that the

sampling distribution of the mean of any independent random variable will be

normal or nearly normal if the sample size is large enough now accuracy or the

resemblance to normal distribution depends, however, on two factors: the first

one is a number of sample points taken and second is the shape of the

underlying population now enough about statistics if you want to know more

about statistics and if you want to get in-depth knowledge over statistics you

can refer to our statistics for data science video I'll leave the link to

that video in the description box so that video talks about statistics and

probability in more depth than I explained here; it talks about

p-values, hypothesis testing, and everything else required for any data science project.
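Before leaving statistics, the normal distribution equation and the central limit theorem from this section can be tried out numerically. The sketch below is only an illustration under stated assumptions: `normal_pdf` and `sample_means` are hypothetical helper names, the population is a uniform distribution on [0, 1) chosen because it is clearly not normal, and the sample size of 30 and the 2000 trials are arbitrary choices.

```python
import math
import random
from statistics import mean, pstdev

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def sample_means(draw, sample_size, trials, rng):
    """Draw `trials` independent samples and return the mean of each one."""
    return [mean(draw(rng) for _ in range(sample_size)) for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Population: uniform on [0, 1), with mean 0.5 and variance 1/12 (not normal).
means = sample_means(lambda r: r.random(), sample_size=30, trials=2000, rng=rng)

# The sample means should cluster around 0.5 with a spread close to
# sqrt((1/12) / 30), even though the population itself is flat.
print(round(mean(means), 3), round(pstdev(means), 3))
print(normal_pdf(0.0))  # peak height of the standard normal curve
```

Increasing `sample_size` narrows the spread of the sample means, and a more skewed population needs a larger sample before the means look normal, which matches the two factors mentioned above.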

Let's move on to the next part of our data science learning path,

which is machine learning. So let's understand what exactly is machine

learning so machine learning is an application of artificial intelligence

that provides systems the ability to automatically learn and improve from

experience without being explicitly programmed now getting computers to

program themselves and also teaching them to make decisions using data where

writing software is a bottleneck let the data do the work instead now machine

learning is a class of algorithms which is data driven that is unlike normal

algorithms, it is the data that tells what the good answer is. So if we have a look

at the various features of machine learning so first of all it uses the

data to detect patterns in a data set and adjust the program actions

accordingly it focuses on the development of computer programs that

can teach themselves to grow and change when exposed to new data so it's not

just the old data on which it has been trained so whenever a new data is

entered the program changes accordingly it enables computers to find hidden

insights using iterative algorithms without being explicitly programmed

either so machine learning is a method of data analysis that automates

analytical model building. Now let's understand how exactly it works. So if we

have a look at the diagram which is given here we have traditional

programming on one side we have machine learning on the other so first of all in

traditional programming what we used to do was provide the data, provide the program,

and the computer used to generate the output so things have changed now so in

machine learning what we do is provide the data and we provide a predicted

output to the machine. Now what the machine does is learn from the data,

find hidden insights and create a model. Now it takes the output data also again

and it reiterates and trains and grows accordingly so that the model gets

better every time it's trained with the new data or the new output. So the first

and the foremost application of machine learning in the industry I would like to

get your attention towards is the navigation or the Google Maps so Google

Maps is probably the app we use whenever we go out and require assistance in

directions and traffic, right? The other day I was traveling to another city and

took the expressway, and Maps suggested, "Despite the heavy traffic, you

are on the fastest route." Now, but how does it know that? Well, it's a combination of

people currently using the service, the historic data of that route collected

over time and a few tricks acquired from the other companies everyone using maps

is providing their location their average speed the route in which they

are traveling which in turn helps Google collect massive data about the traffic

which lets it estimate the upcoming traffic and adjust your route

according to it which is pretty amazing right now coming to the second

application which is the social media if we talk about Facebook so one of the

most common applications is automatic friend tag suggestions in Facebook, and

I'm sure you might have gotten this so it's present in all the other social

media platform as well so Facebook uses face detection and image recognition to

automatically find the face of the person which matches its database and

hence it suggests us to tag that person, based on DeepFace.

Now DeepFace is Facebook's machine learning project which is responsible

for recognizing faces and identifying which person is in the picture, and it

also provides alternative tags to the images already uploaded on Facebook. So

for example if we have a look at this image and we inspect the following

image on Facebook we get the alt tag which has a particular description so in

our case what we get here is the image may contain sky grass outdoor and nature

now transportation and commuting is another industry where machine learning

is used heavily so if you have used an app to book a cab recently then you are

already using machine learning to an extent and what happens is that it

provides a personalized application which is unique to you it automatically

detects your location and provides option to either go home or office or

any other frequent places, based on your history and patterns. It uses machine

learning algorithms layered on top of historic trip data to make more

accurate ETA predictions. Now Uber, with the implementation of machine learning

on their app and their website, saw a 26 percent accuracy in delivery and pick up;

that's a huge point. Now coming to the virtual personal assistant: as the name

suggests, virtual personal assistants assist in finding useful information when asked

via voice or text. A few of the major applications of machine learning

here are speech recognition, speech-to-text conversion, natural language processing

and text-to-speech conversion all you need to do is ask a simple question like

what is my schedule for tomorrow or show my upcoming flights now for answering

your personal assistant searches for information or recalls your related

queries to collect the information recently personal assistants are being

used in chatbots which are being implemented in various food ordering

apps online training websites and also in commuting apps as well next product

recommendation now this is one of the areas where machine learning is

absolutely necessary and it was one of the few areas which gave rise to the need for

machine learning now suppose you check an item on Amazon but you do not buy it

then and there but the next day you are watching videos on YouTube and suddenly

you see an ad for the same item you switch to Facebook there also you see

the same ad and again you go back to any other site and you see the ad for the

same sort of items so how does this happen well this happens because Google

tracks your search history and recommends ads based on your search

history this is one of the coolest applications of machine learning and in

fact 35% of Amazon's revenue is generated by product recommendations

now coming to the cool and highly technological side of machine learning

we have self-driving cars if we talk about self-driving car it's here

and people are already using it now machine learning plays a very important

role in self-driving cars as I'm sure you guys might have heard about Tesla

the leader in this business and their current artificial intelligence is

driven by the hardware manufacturer Nvidia which is based on an unsupervised

learning algorithm which is a type of machine learning algorithm now Nvidia

stated that they did not train their model to detect people or any of the

objects as such the model works on deep learning and crowdsources its data

from the other vehicles and drivers it uses a lot of sensors which are a part

of IoT and according to the data gathered by McKinsey the automotive data

will hold a tremendous value of 750 billion dollars that's a lot of

dollars we are talking about now next again we have Google Translate now

remember the time when you traveled to a new place and found it difficult to

communicate with the locals or to find local spots where everything is written

in a different language well those days are gone

Google's GNMT which is the Google Neural Machine Translation is a neural

machine learning system that works on thousands of languages and dictionaries it uses

natural language processing to provide the most accurate translation of any

sentence or words since the tone of the word also matters it uses other

techniques like POS tagging named entity recognition and chunking and it is one

of the most used applications of machine learning now if we talk about dynamic

pricing setting the right price for a good or a service is an old problem in

economic theory there are a vast number of pricing strategies that depend on the

objective sought be it a movie ticket a plane ticket or a cab fare everything is

dynamically priced now in recent years machine learning has enabled pricing

solutions to track buying trends and determine more competitive product

prices now if we talk about Uber how does Uber determine the price of your

ride Uber's biggest use of machine learning

comes in the form of surge pricing a machine learning model named

Geosurge if you are getting late for a meeting and you need to book an Uber in

a crowded area get ready to pay twice the normal fare

even for flights if you're traveling in the festive season the chances are that

prices will be twice as much as the original price now coming to the final

application of machine learning we have is the online video streaming we have

Netflix Hulu and Amazon Prime video now here I'm going to explain the

application using the Netflix example so with over 100 million subscribers there

is no doubt that Netflix is the daddy of the online streaming world

Netflix's speedy rise has all the movie industrialists taken aback forcing them

to ask how on earth could one single website take on Hollywood now the answer

is machine learning the Netflix algorithm constantly gathers massive

amounts of data about user activities like when you pause rewind or fast-forward

what day you watch the content TV shows on weekdays movies on weekends the date you

watch the time you watch when you pause and leave a content so that if you

ever come back they can suggest the same video the ratings which are

about four million per day the searches which are about three million per day

the browsing and the scrolling behavior and a lot more now they collect this

data for each subscriber they have and use the recommender system and a lot of

machine learning applications and that is why they have such a huge customer

retention rate so I hope these applications are enough for you to

understand how exactly machine learning is changing the way we are interacting

with society and how fast it is affecting the world in which we live

so if you have a look at the market trend of machine learning here as

you can see initially it wasn't much in the market but if you have a look at the

2016 side there was an enormous growth in machine learning and this happened

mostly because you know earlier we had the idea of machine learning but then

again we did not have the amount of big data so as you can see the red line we

have here in the histogram and the bar plot is that of big data so big data

also increased during the years which led to the increase in the amount

of data generated and recently we got that power or I should say the

underlying technology and the hardware to support that power that lets us

create machine learning programs that will work on this big data so that is

why you see a very steep incline during the 2016 period as compared to 2012

because during 2016 we got new hardware and we were able to find

insights using that hardware and program and create models which would

work on heavy data now let's have a look at the life cycle of machine learning

so a typical machine learning life cycle has six steps so the first step is

collecting data second is data wrangling then we have the third step

where we analyze the data the fourth step where we train the algorithm the fifth

step is when we test the algorithm and the sixth step is when we deploy that

particular algorithm for industrial uses so when we talk about the first step

which is collecting data so here data is being collected from various sources and

this stage involves the collection of all the relevant data from various

sources now if we talk about data wrangling so data wrangling is the

process of cleaning and converting raw data into a format that allows

convenient consumption now this is a very important part in the machine

learning lifecycle as it's not every time that we receive data which is

clean and in a proper format sometimes there are values missing

sometimes there are wrong values sometimes the data format is different so a

major part of the machine learning lifecycle goes into data wrangling and data cleaning so
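The three wrangling problems just mentioned (missing values, wrong values, different formats) can be sketched with pandas. This is only an illustration: the table, its column names, and the cleaning rules (median fill, treating negative ages as invalid) are assumptions made for the example, not something taken from the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records with the three problems mentioned above:
# a missing value, a wrong (impossible) value, and inconsistent formats.
raw = pd.DataFrame({
    "age": [34, np.nan, 29, -5],
    "price": ["1,200", "950", "1,050", "800"],
    "city": [" Boston", "boston", "BOSTON ", "Chicago"],
})

clean = raw.copy()

# Wrong values: an age below zero is impossible, so treat it as missing.
clean.loc[clean["age"] < 0, "age"] = np.nan

# Missing values: fill the gaps with the median of the valid ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Different formats: strip thousands separators and convert text to numbers.
clean["price"] = clean["price"].str.replace(",", "", regex=False).astype(float)

# Different formats: normalise whitespace and capitalisation.
clean["city"] = clean["city"].str.strip().str.title()

print(clean)
```

Whether to fill, drop, or flag bad rows depends on the data set; the median fill here is just one common choice.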

if we talk about the next step which is data analysis so data is analyzed to

select and filter the data required to prepare the model so in this step we

take the data use machine learning algorithms to create a particular model

now next again when we have a model what we do is train the model now here we

use the data sets and the algorithm is trained on the training data set through

which the algorithm understands the patterns and the rules which govern the

particular data once we have trained the algorithm

next comes testing so the testing data set determines the accuracy of our

models so what we do is provide the test

dataset to the model and which tells us the accuracy of the particular model

whether it's 60% 70% 80% depending upon the requirement of the company and

finally we have the operation and optimization so if the speed and

accuracy of the model are acceptable then that model should be deployed in the

real system the model that is used in production should be made with all

the available data models improve with the amount of available data used to

create them the results of the model need to be incorporated in the business

strategy now after the model is deployed based upon its performance the model is

updated and improved if there is a dip in the performance the model is

retrained so all of these happen in the operation and optimization stage now
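The retrain-on-a-dip rule from the operation and optimization stage can be sketched as a simple check. The function name `needs_retraining` and the `max_drop` tolerance of 0.05 are hypothetical; the talk does not say how large a dip should trigger retraining.

```python
def needs_retraining(baseline_accuracy, recent_accuracy, max_drop=0.05):
    """Return True when accuracy has dipped by more than max_drop.

    baseline_accuracy: accuracy measured when the model was deployed.
    recent_accuracy: accuracy measured on recently labeled production data.
    max_drop: hypothetical tolerance; the source does not give a number.
    """
    return (baseline_accuracy - recent_accuracy) > max_drop


# Example: deployed at 80% accuracy, now measuring 72% on fresh data.
if needs_retraining(0.80, 0.72):
    print("performance dipped: retrain the model on the newer data")
```

In practice the check would run on a schedule against freshly labeled data, and the tolerance would come from the business requirement.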

before we move forward since machine learning is mostly done in Python and R

and if we have a look at the difference between Python and R I'm

pretty sure most of the people would go for Python and the major reason why

people go for Python is because Python has a larger number of libraries and Python

is being used in more than just data analysis and machine learning so some of

the important Python libraries which I want to discuss here so first of

all I'll talk about matplotlib now what matplotlib does is that it enables

you to make bar charts scatter plots line charts and histograms basically what it

does is help in the visualization aspect as data analysts and machine

learning engineers one needs to represent the data in such a format that

it can be understood by non-technical people such as people from

marketing people from sales and other departments as well so another important

Python library we have here is seaborn which is focused on the visuals of

statistical models which include heat maps and depict the overall

distributions sometimes people work on data which are

more geographically aligned and I would say in those cases heat maps are very

much required now next we come to scikit-learn

and scikit-learn is one of the most famous libraries of Python I would say

it's simple and efficient for data mining and for data analysis it is built on

numpy and matplotlib and it is open-source next on our list we have

pandas it is the perfect tool for data wrangling which is designed for quick

and easy data manipulation aggregation and visualization and finally we have

numpy now numpy stands for numerical Python it provides an abundance of useful

features for operations on n-arrays which are numpy arrays and matrices in Python

and mostly it is used for mathematical purposes which gives a plus point to

any machine learning algorithm so these were the important Python libraries

which one must know in order to do any Python programming for machine

learning or as such if you are doing Python programming you need to know

about all of these libraries so guys next what we are going to discuss are the

types of machine learning so then again we have three types of machine learning

which are supervised reinforcement and unsupervised machine learning so if we

talk about supervised machine learning so supervised learning is where you have

the input variable X and the output variable Y and you use an algorithm to

learn the mapping function from the input to the output so if we take the

case of object detection here or face detection I'd rather say so first of all

what we do is input the raw data in the form of labelled faces and again it's

not necessary that we just input faces to train the model what we do is input a

mixture of face and non-face images so as you can see here we have labeled faces

and labeled non-faces what we do is provide the data to the algorithm the

algorithm creates a model it uses the training dataset to understand what

exactly is in a face what exactly is in a picture which is not a face and after

the model is done with the training and processing so to test it what we do is

provide a particular input of a face or a non-face now see the major part

of supervised learning here is that we exactly know the output so when we are

providing a face we ourselves know that it's a face so to

test that particular model and get the accuracy we use the labeled input raw

data so next when we talk about unsupervised learning unsupervised

learning is the training of a model using information that is neither

classified nor labeled now this model can be used to cluster the input data in

classes on the basis of the statistical properties for example for a basket full

of vegetables we can cluster different vegetables based upon their color or

sizes so if I have a look at this particular example here we have what we

are doing is we are inputting the raw data which can be either apple banana or

mango what we don't have here which was previously there in supervised learning

are the labels so what the algorithm does is that it visually gets the

features of a particular set of data it makes clusters so what will happen is

that it will make a cluster of red looking fruits which are apples yellow

looking fruits which are bananas and based upon the shape also it determines what

exactly the fruit is and categorizes it as mango banana or apple so this is

unsupervised learning now the third type of learning which we have here is

reinforcement learning so reinforcement learning is the learning by interacting

with a space or an environment it selects the action on the basis of its

past experience the exploration and also by new choices a reinforcement learning

agent learns from the consequences of its action rather than from being taught

explicitly so if we have a look at the example here the input data we have what

it does is goes to the training goes to the agent where the agent selects the

algorithm it takes the best action from the environment gets the reward and the

model is trained so if you provide a picture of a green apple

although the apple which it particularly knows is red what it will do is it will

try to get an answer with the past experience it has and it will

recreate the algorithm and then finally provide an output which is according to

our requirements so now these were the major types of machine learning

algorithms next what we're going to do is dig

learning one by one so let's get started with supervised learning first and

understand what exactly is supervised learning and what are the different

algorithms inside it how it works the algorithms the working and we'll have a

look at the various algorithm demos now which will make you understand it in a

much better way so let's go ahead and understand what exactly is supervised

learning so supervised learning is where you have the input variable X and the

output variable Y and you use an algorithm to learn the mapping function from the

input to the output as I mentioned earlier with the example of face

detection so it is called supervised learning because the process of an algorithm

learning from the training data set can be thought of as a teacher supervising

the learning process so if we have a look at the supervised learning steps or

what I'd rather say the workflow so the model is used as you can see here we

have the historic data then again we have the random sampling we split the

data into the training data set and the testing data set using the training data

set with the help of machine learning which is supervised machine learning we

create a statistical model now after we have a model which is being generated

with the help of the training data set what we do is use the testing data set

for prediction and testing what we do is get the output and finally we have

the model validation outcome that was the training and testing so if we have

a look at the prediction part of any particular supervised learning algorithm

so the model is used for predicting the outcome of a new data set so whenever

the performance of the model degrades the model is retrained or if there are any

performance issues the model is retrained with the help of

the new data now when we talk about supervised learning there is not just one but

quite a few algorithms here so we have linear regression logistic regression

decision tree we have random forest we have naive Bayes classifiers so linear

regression is used to estimate real values such as

the cost of houses the number of cars the total sales based on the continuous

variable so that is what linear regression is now when we talk about

logistic regression it is used to estimate discrete values for example

which are binary values like zero and one yes or no true and false based on

the given set of independent variables so for example when you are talking about

something like the chance of winning or if we talk about winning which can be

true or false or will it rain today which can be yes or no so

when the output of a particular algorithm or the particular

question is either yes or no or binary then only we use logistic regression now next

we have decision trees so these are used for classification problems it

works for both categorical and continuous dependent variables and if we

talk about random forest so random forest is an ensemble of decision trees it

gives better prediction and accuracy than a decision tree so that is another

type of supervised learning algorithm and finally we have the naive Bayes

classifier so it is a classification technique based on the Bayes theorem

with an assumption of independence between predictors so we'll get more

into the details of all of these algorithms one by one so let's get

started with linear regression so first of all let us understand what exactly

linear regression is so linear regression analysis is a powerful

technique used for predicting the unknown value of a variable which is the

dependent variable from the known value of another variable which is the

independent variable so a dependent variable is the variable to be predicted

or explained in a regression model whereas an independent variable is a

variable related to the dependent variable in a regression equation so if

you have a look here at simple linear regression so it's basically equivalent

to a simple line with a slope which is y equals a plus bX where Y is

the dependent variable a is the y-intercept we have b which is the slope

of the line and X which is the independent variable
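The line y = a + bX can be recovered from data with a least-squares fit. The sketch below uses NumPy's `polyfit`; the data points are made up so that they lie exactly on y = 2 + 3x, which makes the fitted intercept and slope easy to check.

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3x, so the true values are known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
b, a = np.polyfit(x, y, 1)

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict the dependent variable for a new value of the independent variable.
x_new = 10.0
y_new = a + b * x_new
print(f"prediction at x = {x_new}: {y_new:.2f}")
```

Here `a` is the value of y when x is 0 and `b` is the change in y for a unit increase in x.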

so intercept is the value of the dependent variable Y when the value of

the independent variable X is 0 it is the point where

the line cuts the y-axis whereas slope is the change in the dependent variable

for a unit increase in the independent variable it is the tangent of the angle

made by the line with the x-axis now when we talk about the relation between

the variables we have a particular term which is known as correlation so

correlation is an important factor to check the dependencies when there are

multiple variables what it does is it gives us an insight of the mutual

relationship among variables and it is used for creating a correlation plot

with the help of the seaborn library which I mentioned earlier which is one

of the most important libraries in Python so correlation is a very important

term to know about now if we talk about regression lines so linear regression

analysis is a powerful technique used for predicting the unknown value of a

variable which is the dependent variable from the regression line which is simply

a single line that best fits the data in terms of having the smallest overall

distance from the line to the points so as you can see in the plot here we have

the different points or the data points so these are known as the fitted points

then again we have the regression line which has the smallest overall distance

from the line to the points so if you have a look at the distance between the point

to the regression line so what this line shows is the deviation from the

regression line so exactly how far the point is from the regression line so
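The deviation of each point from the regression line can be computed directly. The sketch below assumes the usual least-squares reading of "smallest overall distance", that is, the sum of squared vertical distances; the data points and the comparison line are invented for illustration.

```python
import numpy as np

# Hypothetical data points that roughly follow a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances from the points to a given line."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# Least-squares regression line for the data.
slope, intercept = np.polyfit(x, y, 1)

best = sum_squared_residuals(intercept, slope)
other = sum_squared_residuals(0.0, 2.5)  # an arbitrary alternative line

print("regression line:", best)
print("alternative line:", other)
```

Any other straight line gives a sum of squared deviations at least as large as the fitted one, which is what "best fit" means here.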

let's understand a simple use case of linear regression with the help of a

demo so first of all there is a real estate company use case which I'm going

to talk about so first of all here we have John he has some baseline for

pricing the villas and the independent houses he has in Boston so here we have

the data set description which we're going to use so this data set has

different columns such as the crime rate per capita which is CRIM it has

the proportion of residential land zoned for lots the proportion of non

retail business the river the nitric oxide concentration the average number

of rooms the proportion of owner occupied units built prior to 1940 the

distance to five Boston employment centers the

index of accessibility to radial highways and much more so first of all

let's have a look at the data set we have here

so one important thing here guys is that I'm gonna be using Jupyter notebook

to execute all my practicals you are free to use Spyder or the

console either so it basically comes down to your preference so for my

preference I'm going to use the Jupyter notebook so for this use case we're

gonna use the Boston housing data set so as you can see here we have the data set

which has CRIM ZN INDUS CHAS NOX the different variables and we have the

data set of almost I would say like 500 houses so what John needs to do is

plan the pricing of the housing depending upon all of these different

variables so that it's profitable for him to sell the house and it's easier

for the customers also to buy the house so first of all let me open the code

here for you so first of all what we're gonna do is import the libraries

necessary for this project so we're going to use numpy we're going to

import numpy as np import pandas as pd then we're gonna also import

matplotlib and then what we are going to do is read the Boston housing data set into

the bos1 variable so now what we are going to do is create two variables X

and Y so what we're gonna do is take columns 0 to 13 that is from CRIM to

LSTAT in X because that's the independent variable and Y here is the dependent

variable which is MEDV which is the final price so first of all what we need

to do is plot a correlation so what we're gonna do is import the seaborn

library as sns we're going to use the correlations to plot the correlations

between the different 0 to 13 variables what we gonna do is also use MEDV here

also so what we're going to do is sns dot heatmap of the correlations we're going to

use square to differentiate usually it comes up in

squares only or circles so we're gonna use square and for the

cmap we use YlGnBu which is the color there's no rotation

on the y axis and we're gonna rotate the x axis by 90 degrees and let's

plot it now so this is what the plot looks like so as you can see here

the thicker or the darker the color gets the more is the correlation

between the variables so for example if you have a look at CRIM and MEDV

right so as you can see here the color is very light so the correlation is

very low so one important thing we can see here is TAX and RAD TAX

is the full value property tax rate and RAD is the index of accessibility to the

radial highways now these things are highly correlated and that is natural

because the more it is connected to the highway and the closer it is to the

highway the easier it is for people to travel and hence the tax on it is

more as it is closer to the highways now what we're going to do is from sklearn

dot cross_validation we're going to import train_test_split and we're

gonna split the data set now so what we are going to do is create four variables

which are X_train X_test y_train y_test and we're going to use the

train_test_split function to split the X and Y and here we're going to use the

test size 0.33 which will split the data set so that the test size will be 33%

whereas the training size will be 67% now this is dependent on you usually it

is either 60/40 or 70/30 this depends on your use case the data you have the

kind of output you are getting the model you are creating and much more then

again from sklearn dot linear_model we're going to import LinearRegression

now this is the major function we're gonna use the LinearRegression

function which is present in sklearn which is scikit-learn so we're going to create

our linear regression model in lm the model which we are going to create and

we're going to fit the training data which has X_train and y_train

then we're gonna create prediction_y which is lm dot predict

and it takes the X_test variables which will provide the predicted Y variables
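The split, fit, and predict steps walked through here can be sketched as follows. Two things differ from the demo and are assumptions of this sketch: recent scikit-learn releases no longer ship the Boston housing loader, so a synthetic data set of the same shape (506 rows, 13 features) stands in for it, and the `sklearn.cross_validation` module named in the walkthrough was replaced by `sklearn.model_selection` in later releases, so the import comes from there. The `random_state` values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 506 rows, 13 independent variables.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Hold out 33% of the rows for testing, keep 67% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=5
)

# Fit the linear regression model on the training data.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Predict the dependent variable for the held-out rows.
prediction_y = lm.predict(X_test)

# R^2 on the test set is one way to summarise how close the predictions are.
r2 = lm.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}, R^2: {r2:.3f}")
```

With the real housing data, `X` would be the columns from CRIM to LSTAT and `y` the MEDV column, and the same calls apply unchanged.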

so now finally if we plot the scatter plot of the Y test and the y predicted

what we can see is that and we give the X label as y test and the Y label

as y predicted we can see the regression line which we have plotted in

the scatter plot and if you want to draw a regression line usually it

will go through all of these points excluding the extremities which are here

present at the endpoints so this is how a normal linear regression works in

Python what you do is create a correlation plot you split the

dataset into training and testing variables then again you define what is

going to be your test size import the linear regression model use the training

data set to fit the model use the test data set to create the predictions

and then use y_test and the predicted Y and plot the scatter

plot and see how close your model is to the original data it had and

check the accuracy of that model now typically you use these steps which were

collecting data data wrangling analyzing the data we trained

the algorithm we tested the algorithm and then we deployed so fitting a model

means that you are making your algorithm learn the relationship between

predictors and the outcomes so that you can predict the future values of the

outcome so the best fitted model has a specific set of parameters which best

defines the problem at hand since this is a linear model with the equation y

equals MX plus C in this case the parameters that the model learns from the

data are M and C so this is what model fitting is now if we have a look at

the types of fitting which are available so first of all machine learning

algorithms first attempt to solve the problem of underfitting

that is of taking a line that does not approximate the data well and making it

approximate the data better but the machine does not know where to stop in

order to solve the problem and it can go ahead from an appropriate to an overfit model

sometimes when we say a model overfits we mean that it may have a low error

rate for training data but it may not generalize well to the overall

population of the data we are interested in so we have underfit appropriate and

overfit these are the types of fitting now guys this was linear regression

which is a type of supervised learning algorithm in machine learning so next

what we're going to do is understand the need for logistic regression so let's

consider a use case as in political elections are being contested in our

country and suppose that we are interested to know which candidate will

probably win now the outcome variable's result is binary either win or lose the

predictor variables are the amount of money spent the age the popularity rank

and etc now here the best fit line in the regression model is going

below 0 and above 1 and since the value of y will be discrete that is

either 0 or 1 the linear line has to be clipped at 0 and 1 now linear regression

gives us only a single line to classify the output with linear regression our

resulting curve cannot be formulated into a single formula as you obtain

three different straight lines what we need is a new way to solve this problem

so hence people came up with logistic regression so let's understand what

exactly is logistic regression so logistic regression is a statistical method for

analyzing a data set in which there are one or more independent variables that

determine an outcome and the outcome is a binary class type so for example a patient

goes for a routine checkup in the hospital and his interest is to know

whether the cancer is benign or malignant now a patient's data such as

sugar level blood pressure age skin width and the previous medical history

are recorded and a doctor checks the patient data and determines the outcome

of his illness and severity of illness the outcome will result in binary that

is zero if the cancer is malignant and one if it's benign now logistic

regression is a statistical method used for analyzing a dataset where there are

one or more independent variables like we discussed like the sugar level blood

pressure skin width the previous medical history

and the output is a binary class type so now let's have a look at the logistic

regression curve now the logistic regression curve is also called a

sigmoid curve or the S curve the sigmoid function converts any value from minus

infinity to infinity to a value between 0 and 1 now how to decide whether

the value is 0 or 1 from this curve so let's take an example what we do is

provide a threshold value we set it we decide the output from that function so
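A minimal sketch of the sigmoid curve and the threshold step, using only the standard library. The sigmoid itself returns a value between 0 and 1, and the threshold is what turns that value into a 0 or a 1. The default threshold of 0.4 is just an illustrative choice, and treating a value exactly at the threshold as 1 is an assumption of this sketch.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.4):
    """Turn the sigmoid output into a 0/1 label using a threshold."""
    return 1 if sigmoid(z) >= threshold else 0


# Show the curve value and the resulting label for a few inputs.
for z in (-3.0, -0.5, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Lowering the threshold makes the model say 1 more often, and raising it makes it say 0 more often, so the value is usually tuned to the cost of each kind of mistake.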

let's take an example with the threshold value of 0.4 so any value above 0.4 will

be rounded off to 1 and any value below 0.4 will be reduced to 0 so similarly we

have polynomial regression also so when we have nonlinear data which cannot be

predicted with a linear model we switch to the polynomial regression now such a

scenario is shown in the below graph so as you can see here we have the equation

y equals 3x cubed plus 4x squared minus 5x plus 2 now here we cannot perform

this linearly so we need polynomial regression to solve these kinds of

problems now when we talk about logistic regression there is an important term

which is decision tree and this is one of the most used algorithms in

supervised learning now let's understand what exactly is a decision tree so a

decision tree is a tree like structure in which each internal node represents a test

on an attribute each branch represents the outcome of the test and each leaf

node represents the class label which is a decision taken after computing all

attributes a path from root to leaf represents

classification rules and a decision tree is made from our data by analyzing the

variables from the decision tree now from the tree we can easily find out

whether there will be a game tomorrow if the conditions are rainy and less windy

Now let's see how we can implement the same. Suppose we have a data set in
which we have the outlook. From the outlook, we can divide the data into
sunny, overcast and rainy. As you can see, on the sunny side we get two yeses
and three noes: the outlook is sunny, the humidity is high or normal, and the
wind is weak or strong. So for a sunny day what we have is not a pure subset,
and what we're going to do is split it further. If you have a look at
overcast, we have humidity high and normal and wind weak, so yes, during
overcast we can play. And if you have a look at the rainy area, we have three
yeses and two noes, so again we're going to split it further.

When we talk of sunny, we have humidity, and in humidity we have high and
normal. When the humidity is normal we're going to play, which is a pure
subset, and if the humidity is high we are not going to play, which is also a
pure subset. Let's do the same for the rainy day. During a rainy day we have
the wind attribute: if the wind is weak it becomes a pure subset and we're
going to play, and if the wind is strong it's a pure subset and we're not
going to play. So the final decision tree looks like this.
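
The splits described above can be written down as a few nested conditions. This is only a hand-coded sketch of the play / don't-play tree from the walkthrough, not a learned model; the function names `will_play` and `is_pure` are made up for illustration.

```python
def is_pure(labels):
    """A subset is pure when every row in it has the same label."""
    return len(set(labels)) <= 1


def will_play(outlook, humidity, wind):
    """Hand-coded version of the tree: outlook first, then humidity or wind."""
    if outlook == "overcast":
        return True                    # pure subset: always play
    if outlook == "sunny":
        return humidity == "normal"    # high humidity means no play
    if outlook == "rainy":
        return wind == "weak"          # strong wind means no play
    raise ValueError(f"unknown outlook: {outlook!r}")


# The sunny branch starts with two yeses and three noes, so it is not pure
# and has to be split further (on humidity).
sunny_labels = ["yes", "yes", "no", "no", "no"]
print(is_pure(sunny_labels))                 # False
print(will_play("rainy", "high", "weak"))    # True
```

Each `return` corresponds to a leaf, and each path of conditions from the top of the function to a `return` is one classification rule.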

First of all, we check if the outlook is sunny, overcast or rainy. If it's
overcast, we will play. If it's sunny, we then check the humidity: if the
humidity is high we will not play, and if the humidity is normal we will
play. Then again, in the case of rainy, we check the wind: if the wind is weak
the play will go on, and similarly if the wind is strong the play must stop.
So this is exactly how a decision tree works.

So let's go ahead and see how we can implement logistic regression and
decision trees. For logistic regression we're going to use the cancer data
set. This is how the data set looks: here we have the id, diagnosis,
radius_mean, texture_mean, perimeter_mean. These are the stats of particular
cancer cells or cysts which are present in the body. We have 33 columns in
total, all the way from id to Unnamed: 32. Our main goal here is to predict
whether the cancer is benign or malignant.

First of all, from sklearn.model_selection we're going to import
cross_val_score. We're going to use numpy for linear algebra, and pandas as pd
for data processing, CSV file input and SQL-like data manipulation. Then we're
going to import matplotlib, which is used for plotting graphs, and seaborn,
which is used to plot statistical graphs; in the last example we plotted a
correlation heatmap with it. From sklearn we're going to import
LogisticRegression, which is the main model behind the whole logistic
regression. We're going to import train_test_split so as to split the data
into two parts, the training and testing data sets. We're going to import
metrics to check the error and the accuracy of the model, and we're going to
import DecisionTreeClassifier.

So first of all, we create a variable data and use pandas (pd) to read the
data from the data set. Here header=0 means that the zeroth row holds our
column names. To have a look at the top six rows of the data we're going to
print data.head, and then get data.info. As you can see, we have many data
columns, such as id, diagnosis, radius_mean, texture_mean, perimeter_mean,
area_mean, smoothness_mean; we have texture_worst, symmetry_worst,
fractal_dimension_worst, and lastly we have the Unnamed column. First of all
we can see we have six rows and 33 columns. And if you have a look at all of
these columns, we get the number 569, which is the total number of
observations we have; we check whether each column is non-null, and we check
the type of each column: integer, object or float. Most of them are float and
some are integer. So now we're going to drop the unnamed column,
which is column number 32 when counting from 0. In this process we change our
data itself. If you want to keep the old data you can save a copy of it, but
then again that's of no use here. data.columns will give us all of the columns
once we remove that one, so you can see in the output that we no longer have
the final one, which was the unnamed column, and the type of the last one we
have is float.

Next, we also don't want the id column for our analysis, so we're going to
drop the id as well. As I said above, the data can be divided into three
parts, so let's divide the features according to their category. Now, as you
know, our diagnosis column is of object type, so we can map it to integer
values. What we're going to do is take data['diagnosis'] and map M to 1 and B
to 0, so that the output is either 1 or 0. Now if we use data.describe, you
can see we have 8 rows and 31 columns, because we dropped two of the columns,
and in diagnosis we have the values here.

Let's get the frequency of the cancer stages. Here we're going to use
seaborn's sns.countplot on data['diagnosis'] with the label 'Count', and then
plt.show. You can see that the count for diagnosis 0 is higher and for 1 it is
lower.

To plot the correlation among this data we're going to use plt.figure and
sns.heatmap. We plot the correlation with cbar=True, square=True and the
coolwarm colormap. As you can see, the correlation of radius_worst with
area_worst and perimeter_worst is high, and radius_worst also has a high
correlation with perimeter_mean and area_mean, because if the radius is
larger, the perimeter is larger and the area is larger.
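
To see why radius, perimeter and area end up so strongly correlated, here is a small sketch that computes the Pearson correlation by hand. The radii are made-up numbers and the cells are assumed to be perfect circles, which real measurements are not; the `pearson` helper is written out for illustration rather than taken from a library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up radii; perimeter and area follow from the circle formulas.
radii = [6.0, 8.5, 10.0, 12.5, 15.0, 18.0, 21.0, 25.0]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

r_perimeter = pearson(radii, perimeters)   # exactly linear in the radius
r_area = pearson(radii, areas)             # quadratic, but still increasing
print(round(r_perimeter, 3), round(r_area, 3))
```

Features that carry almost the same information like this are the ones the feature selection step tries to avoid using together.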

So based on the correlation plot, let's select some features for the model.
The selection is made in order to remove collinearity. We will have a
prediction variable, prediction_var, in which we have texture_mean,
perimeter_mean, smoothness_mean, compactness_mean and symmetry_mean. These are
the variables which we'll use for the prediction.

Now we're going to split the data into the training and testing data sets. Our
main data is split into train and test data sets with a test size of 0.3, that
is, a 30-to-70 ratio. Next, we check the dimensions of the training and
testing data sets, so we use the print command and pass train.shape and
test.shape. What we can see here is that we have almost 400, that is 398,
observations with 31 columns in the training data set, and 171 rows and 31
columns in the testing data set.

Then we take the training data input: we create train_X with prediction_var,
and train_y is the diagnosis, which is the output of our training data. We do
the same for the test data: test_X is for the test prediction variables and
test_y is for the test diagnosis, which is the output of the test data.

Now we're going to create a logistic regression model and call logistic.fit,
in which we fit the training data set, train_X and train_y. Then we use temp,
a temporary variable in which we predict on test_X, and we compare temp, the
prediction for test_X, with test_y to check the accuracy. The accuracy we get
here is 0.91.

That was logistic regression; now we're going to use the decision tree
classifier. We create a DecisionTreeClassifier with random_state given as 0.
Next, we compute the cross-validation score: we pass the model clf, train_X,
train_y and cv=10 to cross_val_score. Then we fit the training set; the
sample_weight we have not defined here, check_input is True and X_idx_sorted
is None. We get the parameters with deep=True, we predict using test_X, and
then predict the log probability of test_X. If we compute the score on test_X
and test_y with sample_weight None, we get the result for the decision tree.
So this is how you implement a decision tree classifier and check the accuracy
of the particular model. So that was it; next on our list is random forest.
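
The cv=10 cross-validation step above can be sketched in plain Python to show what the folds look like. This is an illustration under assumptions, not scikit-learn's implementation: it uses plain consecutive folds (scikit-learn stratifies the folds for classifiers, as far as I know), the data is made up, and the "model" is a toy that always predicts the most common training label.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for plain k-fold cross-validation."""
    # The first n_samples % k folds get one extra row so every row is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Fit on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(hits / len(test_idx))
    return scores


# Toy stand-in for a classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)


# Made-up data with 398 rows, the training-set size quoted in the walkthrough.
X = [[i] for i in range(398)]
y = [1 if i % 3 == 0 else 0 for i in range(398)]
scores = cross_val_accuracy(fit_majority, predict_majority, X, y, k=10)
print(len(scores), round(sum(scores) / len(scores), 3))
```

Averaging the ten scores gives a steadier estimate of accuracy than a single train/test split, which is the reason for using cross-validation in the first place.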

Let's understand what exactly a random forest is. Random forest is an
ensemble classifier made using many decision tree models. So what exactly are
ensemble models? Ensemble models combine the results from different models,
and the result from an ensemble model is usually better than the result of
any one of the individual models. Every tree votes for one class, and the
final decision is based upon the majority of votes. It is often better than a
single decision tree because, compared to a decision tree, it can be much
more accurate, it runs efficiently on large data sets, it can handle
thousands of input variables without variable deletion, and it gives an
estimate of which variables are important in the classification.

Let's take the example of weather data, and understand random forest with the
help of the hurricanes and typhoons data set. We have data about hurricanes
and typhoons from 1851 to 2014, and the data comprises the location, wind and
pressure of tropical cyclones in the Pacific Ocean. Based on the data, we have
to classify the storms into hurricanes, typhoons and their subcategories
according to the predefined classes. The predefined classes are: TD, a
tropical cyclone of tropical depression intensity, which is less than 34
knots; TS, a tropical cyclone of tropical storm intensity, between 34 and 63
knots; HU, a tropical cyclone of hurricane intensity, 64 knots or more; EX,
an extratropical cyclone; SD, a subtropical cyclone of subtropical depression
intensity, less than 34 knots; SS, a subtropical cyclone of subtropical storm
intensity, 34 knots or more; LO, a low that is neither a tropical cyclone, a
subtropical cyclone, nor an extratropical cyclone; and finally DB, a
disturbance of any intensity. Those were the descriptions of the predefined
classes.

As you can see, this is the data, in which we have the ID, name, date, event,
status, latitude, longitude, maximum wind, minimum pressure; there are so many
variables. Let's start by importing pandas. Then we import matplotlib, then we
use the aggregate method in matplotlib, and we use %matplotlib inline, which
shows the plots inline in the notebook, and I like it most for plots. Next, we
import seaborn as sns, which again is used to plot graphs. We import
train_test_split from sklearn.model_selection, and from scikit-learn we import
metrics for checking the accuracy. Then we import sklearn, and from sklearn we
import tree. From sklearn.ensemble we import RandomForestClassifier, from
sklearn.metrics we import confusion_matrix so as to check the accuracy, and
from sklearn.metrics we also import accuracy_score.

So let's import random, read the data set, and print the first six rows of
the data set. You can see here we have the ID, the name, date, time, event,
status, latitude and longitude; in total we have 22 columns. As you can see,
we have a column named Status, which is TS, TS, TS for the first six rows.
What we're going to do is set data['Status'] to pd.Categorical of
data['Status'], so that we make it categorical data with codes, which is
easier for the machine to understand. Rather than having the categories as
names, we use the categories as numbers, so it's easier for the computer to
do the analysis. So let's get the frequency of the different typhoons.
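
Turning category names into numeric codes can be sketched without pandas. This plain-Python version is only meant to show the idea; it assumes the codes follow the sorted order of the labels, which is how I understand pandas categoricals to behave for sortable values, and the sample statuses are made up.

```python
def to_codes(values):
    """Map each distinct label to an integer code, in sorted label order."""
    categories = sorted(set(values))
    code_of = {label: code for code, label in enumerate(categories)}
    return [code_of[v] for v in values], categories


# Made-up sample of Status values, not rows from the actual data set.
statuses = ["TS", "TS", "HU", "TD", "EX", "HU"]
codes, categories = to_codes(statuses)
print(categories)   # ['EX', 'HU', 'TD', 'TS']
print(codes)        # [3, 3, 1, 2, 0, 1]
```

Keeping the `categories` list around matters, because it is the only way to translate a predicted code such as 3 back into a readable label such as TS.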

What we're going to do is call random.seed. Then, for the prediction list, we
have to drop the status and the event, because these are unnecessary as
inputs; we're going to drop latitude and longitude, and we're going to drop
the ID, the name, the date and the time it occurred. If we print the
prediction list (ignore the error here, that's not necessary), we have the
maximum wind, the minimum pressure, low wind NE, low wind SE, low wind SW and
so on, and these are the parameters on which we're going to do the
predictions.

Now we'll split the data into training and testing data sets. We have train,
test, and we use train_test_split in the 70/30 industry-standard ratio. An
important thing to note here is that you can split it in any ratio you want;
it can be 60/40, 70/30 or 80/20. It all depends upon the model which you have
or the industrial requirement which you have. After splitting, let's check the
dimensions. The training data set comprises 18,295 rows with 22 columns,
whereas the testing data set comprises about eight thousand rows with 22
columns.

We have the training data input train_X, and we have train_y. Status is the
final output of the training data, which will tell us the status: whether
it's a TS, a TD or an HU, that is, which kind of hurricane or typhoon it is,
or any of the subcategories which are defined, like subtropical cyclone,
subtropical typhoon and more. So our prediction, or output variable, will be
the status, and this is the list of the training columns which we have here.
We have to do the same for the test variables, so we have test_X with
prediction_var and test_y with the status. So now what we're going to do is
build a random forest classifier.
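
The 70/30 split described above can be sketched in plain Python. The helper name `train_test_split_rows` and the seed are made up, and the rule of rounding the test share up is an assumption about how scikit-learn's train_test_split sizes the test set; it does reproduce the row counts quoted in the walkthrough.

```python
import math
import random


def train_test_split_rows(rows, test_size=0.3, seed=123):
    """Shuffle a copy of the rows and cut off the test share (rounded up)."""
    n_test = math.ceil(len(rows) * test_size)
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in rows: 26,137 = 18,295 training rows + 7,842 test rows.
rows = list(range(26137))
train, test = train_test_split_rows(rows, test_size=0.3)
print(len(train), len(test))   # 18295 7842
```

Shuffling before cutting matters for data like this, which is stored in date order; without it the test set would contain only the most recent storms.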

In the model we have RandomForestClassifier with n_estimators as 100, a
simple random forest model, and then we fit the training data set, which is
train_X and train_y. Then we make the prediction, which is model.predict with
test_X. This will predict for the test data, and prediction will contain the
values predicted by our model for the status column for the test inputs. If
we print the metrics accuracy_score between the prediction and test_y to
check the accuracy, we get 95% accuracy.

Now we're going to do the same with a decision tree. Again we use the model
tree.DecisionTreeClassifier, we fit train_X and train_y, which are the
training data sets, and the new prediction is model.predict for test_X. We
create a data frame with pd.DataFrame, and if we have a look at the
prediction and test_y, you can see the status has 10, 10, 3, 3, 10, 10, 11
and 5, 5, 3, 11 and 3, 3, and it goes on and on; it has 7,842 rows and 1
column. If we print the accuracy, we get 95.57 percent accuracy, and if you
have a look at the accuracy of the random forest, we get 95.66 percent, which
is more than 95.57. As I mentioned earlier, random forest usually gives a
better output, or creates a better model, than the decision tree classifier,
because it combines the results from different models: the final decision is
based upon the majority of votes, and its accuracy is usually higher than
that of the decision tree models. So let's move ahead with our naive Bayes
algorithm, and let's
see what exactly naive Bayes is. Naive Bayes is a simple but surprisingly
powerful algorithm for predictive modeling. It is a classification technique
based on Bayes' theorem with an assumption of independence among predictors.
It comprises two parts, the naive and the Bayes. In simple terms, a naive
Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature. Even if these features
depend on each other or upon the existence of the other features, all of
these properties independently contribute to the probability that a fruit is
an apple or an orange, and that is why it is known as naive. A naive Bayes
model is easy to build and particularly useful for very large data sets.

In probability theory and statistics, Bayes' theorem, which is alternatively
known as Bayes' law or Bayes' rule, also written as Bayes's theorem,
describes the probability of an event based on prior knowledge of conditions
that might be related to the event. Bayes' theorem is a way to figure out
conditional probability. Conditional probability is the probability of an
event happening given that it has some relationship to one or more other
events. For example, your probability of getting a parking space is connected
to the time of day you park, where you park, and what conventions are going
on at the same time. Bayes' theorem is slightly more nuanced: in a nutshell,
it gives us the actual probability of an event given information about tests.

So let's talk about Bayes' theorem now. Given a hypothesis H and evidence E,
Bayes' theorem states that the relationship between the probability of the
hypothesis before getting the evidence, P(H), and the probability of the
hypothesis after getting the evidence, P(H|E), is

P(H|E) = P(E|H) × P(H) / P(E)

which means it's the probability of the evidence given the hypothesis, times
the prior probability of the hypothesis, divided by the probability of the
evidence.

Let's understand it with a simple example. If a single card is drawn from a
standard deck of playing cards, the probability of that card being a king is
4 out of 52, since there are 4 kings in a standard deck of 52 cards.
Rewording this, if King is the event "this card is a king", the prior
probability of King is P(King) = 4/52, which in turn is 1/13. Now if evidence
is provided, for instance someone looks at the card and says that the single
card is a face card, then the posterior probability P(King|Face) can be
calculated using Bayes' theorem: P(King|Face) is equal to P(Face|King) times
P(King), divided by P(Face). Since every king is also a face card,
P(Face|King) is equal to 1, and since there are 3 face cards in each suit,
that is jack, king and queen, the probability of a face card is 3 out of 13.
Combining these, using Bayes' theorem we get P(King|Face) equal to 1 out of
3. So for a
joint probability distribution with events A and B, the conditional
probability of A given B is defined as P(A|B) = P(A ∩ B) / P(B), the
probability of A intersection B divided by the probability of B. Writing the
same definition for P(B|A) and equating the two expressions for P(A ∩ B) is
how we get Bayes' theorem.

Now that we know the basic proof of how we got Bayes' theorem, let's have a
look at the working of Bayes' theorem with the help of an example. Let's take
the same example of the data set of the weather forecast, in which we had
sunny, rainy and overcast. First of all, we create a frequency table using
each attribute of the data set. As you can see, we have the frequency tables
here for the outlook, the humidity and the wind. Next, we find the
probability of sunny given yes, which is 3 out of 10, and the probability of
sunny, which is 5 out of 14; this 14 comes from the total number of
observations, from yes and no together. Similarly, we find the probability of
yes, which is 10 out of 14, which is 0.71. For each frequency table we
generate these kinds of likelihood tables. The likelihood of yes given it's
sunny is equal to 0.60, that is 3/10 × 10/14 ÷ 5/14; similarly, the
likelihood of no given sunny is equal to 0.40. So here you can see that using
Bayes' theorem we have found the likelihood of yes given it's sunny and of no
given it's sunny. Similarly, we build the likelihood table for humidity and
the same for wind. For humidity, we check the probability of yes given the
humidity is high and the probability of no given the humidity is high, and we
calculate them using the same Bayes' theorem.

Suppose we have a day with the following values: the outlook is rain, the
humidity is high and the wind is weak. Since we discussed the same example
earlier with the decision tree, we know the answer, so let's not get ahead of
ourselves and let's try to find out the answer using Bayes' theorem.
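
Before working through that day, the two Bayes' theorem calculations made so far (the card example and the sunny-day likelihood) can be checked with exact fractions. This is a minimal sketch; the function name `posterior` is made up, and the inputs are the counts quoted in the walkthrough.

```python
from fractions import Fraction


def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e


# Card example: H = "the card is a king", E = "the card is a face card".
p_king = Fraction(4, 52)             # 4 kings in 52 cards, i.e. 1/13
p_face = Fraction(12, 52)            # 3 face cards in each of 4 suits, i.e. 3/13
p_face_given_king = Fraction(1)      # every king is a face card
p_king_given_face = posterior(p_face_given_king, p_king, p_face)
print(p_king_given_face)             # 1/3

# Weather example: P(sunny|yes) = 3/10, P(yes) = 10/14, P(sunny) = 5/14.
p_yes_given_sunny = posterior(Fraction(3, 10), Fraction(10, 14), Fraction(5, 14))
print(p_yes_given_sunny)             # 3/5
```

Using `Fraction` instead of floats keeps the results exact, so rounding in the intermediate values (0.71, 0.36) does not blur the final number.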

Let's understand how naive Bayes actually works. First of all, we compute the
likelihood of yes on that day. That equals the probability of outlook rain
given yes, times the probability of humidity high given yes, times the
probability of wind weak given yes, times the probability of yes. That gives
us 0.019. Similarly, the likelihood of no on that day is the probability of
outlook rain given no, times humidity high given no, times wind weak given
no, times the probability of no, and that equals 0.016.

Now we find the probability of yes and of no. For that, we take the
likelihood and divide it by the sum of the likelihoods of yes and no, and
that way we get the overall probability of yes. Using that formula, we get
the probability of yes as 0.55 and the probability of no as 0.45. Our model
predicts that there is a 55 percent chance that there will be a game tomorrow
if it's rainy, the humidity is high and the wind is weak.

Now if you have a look at the industrial use cases of naive Bayes, we have
news categorization: the news comes with a lot of tags and has to be
categorized so that the user gets the information he needs in a particular
format. Then we have spam filtering, which is one of the major use cases of
the naive Bayes classifier, as it classifies email as spam or ham. Finally,
we have weather prediction, as we just saw in the example where we predict
whether we're going to play or not; that sort of prediction is always there.

So guys, this was all about supervised learning. We discussed linear
regression and logistic regression, we discussed naive Bayes, and we
discussed random forest and decision tree. We understood how random forest is
usually better than a decision tree; in some cases it might only be equal to
a decision tree, but in most cases it is going to give us a better result.
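
The majority-vote idea behind that comparison can be sketched in a few lines. The three "trees" below are hypothetical one-line rules on a made-up `max_wind` field, not trained models; a real random forest would train each tree on a different random sample of the data.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for its prediction and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Three hypothetical "trees" that disagree slightly on where hurricanes start.
trees = [
    lambda s: "HU" if s["max_wind"] >= 64 else "TS",
    lambda s: "HU" if s["max_wind"] >= 60 else "TS",
    lambda s: "HU" if s["max_wind"] >= 70 else "TS",
]

print(forest_predict(trees, {"max_wind": 66}))   # HU (two votes against one)
```

A single tree that is wrong on a borderline case gets outvoted, which is why the combined result tends to be at least as stable as any individual tree.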

So guys, that was all about supervised learning. But before we move on, let's

go ahead and see how exactly we're going to implement naive Bayes.
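
As background for the scikit-learn walkthrough, here is a from-scratch sketch of what a Gaussian naive Bayes classifier does: it stores a prior, a mean and a variance per class and feature, and picks the class with the highest log-probability. The data is made up, the function names are illustrative, and the small variance floor is my own simplification rather than the library's smoothing.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The tiny floor keeps the variance from being exactly zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_one(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)


# Made-up sensor-like readings [acceleration_x, gyro_x]; 0 = walk, 1 = run.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [1.1, 0.9], [1.3, 1.0], [1.2, 1.1]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_one(model, [0.18, 0.2]), predict_one(model, [1.25, 0.95]))   # 0 1
```

Summing the per-feature log likelihoods is exactly the "naive" independence assumption: each feature contributes on its own, without looking at the others.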

So guys, here we have another data set, run or walk. It's a kinematic data
set, and it has been measured using a mobile sensor. We let the target
variable be y and assign all the columns after it to X. Using the scikit-learn
naive Bayes model, we're going to observe the accuracy and generate a
classification report using scikit-learn. Then we're going to repeat the
model, once using only the acceleration values as predictors and then using
only the gyro values as predictors, and we're going to comment on the
difference in accuracy between the two models.

So here we have the data set, which is run or walk; let me open that for you.
As you can see, we have the date, time, username, wrist, activity,
acceleration x, y and z, and gyro x, gyro y and gyro z. Based on it, let's see
how we can implement the naive Bayes classifier.

First of all, we import pandas as pd, then we import matplotlib for plotting.
We read the run or walk data file with pandas pd.read_csv. Let's have a look
at the info: we see that we have 88,588 rows with 11 columns, so we have the
date, time, username, wrist, activity, acceleration x, y, z and gyro x, y, z,
and the memory usage is 7.4 MB. This is how you look at the columns:
df.columns.

Now again we're going to split the data set into training and testing data
sets, so we use train_test_split. We split it into X_train, X_test, y_train
and y_test, with a test size of 0.2 here. Again, I'm saying it depends on you
what the test size is. Let's print the shape of the training set and see:
it's about 70,000 observations with six columns.

Next, from sklearn.naive_bayes we import GaussianNB, which is Gaussian naive
Bayes, and we set the classifier to GaussianNB. Then we pass the X_train and
y_train variables to the classifier. We have y_predict, which is
classifier.predict with X_test, and we compare y_predict with y_test to see
the accuracy. For that, from sklearn.metrics we import accuracy_score. Now
let's compare both of these: the accuracy we get is 95.54 percent. Another
way is to get a confusion matrix built.
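
Accuracy and the confusion matrix can both be computed by hand from the true and predicted labels, and the same four counts also give precision, recall and the F1 score. The labels below are made up for illustration, and `confusion_counts` is a hypothetical helper, not the scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


# Made-up labels: 0 = walk, 1 = run.
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_predict = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_test, y_predict)
accuracy = (tp + tn) / len(y_test)      # share of all predictions that match
precision = tp / (tp + fp)              # of the predicted runs, how many were runs
recall = tp / (tp + fn)                 # of the actual runs, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                   # 4 4 1 1
print(accuracy, precision, recall)
```

The two off-diagonal counts, `fp` and `fn`, are the misclassifications; a high accuracy simply means they are small compared with `tp` and `tn`.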

From sklearn.metrics we import confusion_matrix, and we print the matrix of
y_predict and y_test. As you can see here, we have 90 and 699, which is a very
good number. Next, we create a classification report. From metrics we import
classification_report, we set the target names to walk and run, and we print
the report using y_test and y_predict with the target names. For walking we
get a precision of 0.92 and a recall of 0.99; the F1 score is 0.96 and the
support is 8,673. For running we get a precision of 0.99, with a recall of
0.92 and an F1 score of 0.95.

So guys, this is exactly how you use Gaussian NB, the naive Bayes classifier,
on it. Supervised and unsupervised algorithms like these are all present in
the scikit-learn library. So, once again, scikit-learn is a very important
library when you are dealing with machine learning, because you do not have
to hard-code any algorithm; every algorithm is present there. All you have to
do is pass it the data: split the data set into training and testing data
sets, find the predictions, and then compare the predicted y with the test y.
That is exactly what we do every time we work on a machine learning
algorithm.

Now guys, that was all about supervised learning. Let's go ahead and
understand what exactly unsupervised learning is. Sometimes the given data is
unstructured and unlabeled, so it becomes difficult to classify the data into
different categories. Unsupervised learning helps to solve this problem. This
learning is used to cluster the input data into classes on the basis of their
statistical properties. For example, we can cluster different bikes based
upon their speed limit, their acceleration or the mileage they give.
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from data sets consisting of input data without labeled responses.

If you have a look at the workflow, or the process flow, of unsupervised
learning: the training data is a collection of information without any
labels, we have the machine learning algorithm, and then we have the
clustering model. What it does is distribute the data into different
clusters, and if you then provide any new unlabeled data, it will make a
prediction and find out to which cluster that particular data point belongs.
One of the most important algorithms in unsupervised learning is clustering,
so let's understand exactly what clustering is. Clustering is basically the
process of dividing the data set into groups consisting of similar data
points.

It means grouping objects based on the information found in the data
describing the objects or their relationships. Clustering models focus on
identifying groups of similar records and labeling the records according to
the group to which they belong. This is done without the benefit of prior
knowledge about the groups and their characteristics; in fact, we may not
even know exactly how many groups there are to look for. These models are
often referred to as unsupervised learning models, since there is no external
standard by which to judge the model's classification performance; there are
no right or wrong answers for these models.

And if we talk about why clustering is used: the goal of clustering is to
determine the intrinsic grouping in a set of unlabeled data. Sometimes the
partitioning itself is the goal, or the purpose of the clustering algorithm
is to make sense of, and extract value from, large sets of structured and
unstructured data. That is why clustering is used in the industry.

If you have a look at the various use cases of clustering in the industry,
first of all it's being used in marketing: discovering distinct groups in
customer databases, such as customers who make a lot of long-distance calls
or customers who use the internet more than calls. It is also used by
insurance companies, for identifying groups of crop insurance policyholders
with a high average claim rate, such as farmers who crash crops when it is
profitable. It is used in seismic studies to define probable areas of oil or
gas exploration based on seismic data. It is also used in the recommendation
of movies, it is used for Flickr photos, and it is used by Amazon for
recommending products, to find which category a product lies in.

So basically, if we talk about clustering, there are three types of
clustering. First of all, we have exclusive clustering, which is hard
clustering: here an item, or data point, belongs exclusively to one cluster,
not several clusters. An example of this is k-means clustering.

K-means clustering does this exclusive kind of clustering. Secondly, we have
overlapping clustering, which is also known as soft clustering. In this, an
item can belong to multiple clusters, and its degree of association with each
cluster is shown. For example, we have fuzzy, or C-means, clustering, which is
used for overlapping clustering. And finally we have hierarchical clustering:
when two clusters have a parent-child relationship, or a tree-like structure,
it is known as hierarchical clustering. As you can see from the example here,
we have a parent-child kind of relationship in the clusters given. So let's
understand what exactly k-means clustering is.

today means clustering is an inquiry um whose main goal is to group similar

elements of data points into a cluster and it is the process by which objects

are classified into a predefined number of groups

so that they are as much it is similar as possible from one group to another

group but as much as similar or possible within each group now if you have a look

at the algorithm working here you're right so first of all it starts with an

defying the number of clusters which is key then again we find the centroid we

find the distance objects to the distance object to the centroid distance

of objects to the centroid then we find the grouping based on the minimum

distance has the centroid converge if true then we make a cluster false we

then I can find the centroid repeat all of the steps again and again so let me

show you how exactly clustering works with an example here. First we need to

decide the number of clusters to be made. Now another important task here is how

to decide the right number of clusters;

we'll get into that later. So for now let's assume that the number of

clusters we have decided on is three. After that we provide the centroids

for all the clusters, which is guessing. The algorithm calculates the Euclidean

distance of each point from each centroid and assigns the data point to the

closest cluster. Now the Euclidean distance, as all of you know, is the square root of

the sum of the squared differences between the coordinates. Next, when the

centroids are calculated again, we have our new clusters for each data point.

Then again the distances from the points to the new centroids are calculated, and

then again the points are assigned to the closest cluster, and then again we

have the new centroids calculated. These steps are repeated until we have a

repetition in the centroids, or the new centroids are very close to the

previous ones. So unless our output gets repeated or the outputs are

close enough, we do not stop this process. We keep on calculating the

Euclidean distance of all the points to the centroids, then we calculate the new

centroids, and that is basically how k-means clustering works. So an

important part here is to understand how to decide the value of K, or the number

of clusters. It does not make any sense if you do not

know how many clusters you are going to make. So to decide the number of clusters

we have the elbow method. First of all, compute the sum of squared

errors, which is the SSE, for some values of K, for example two, four, six

and eight. Now the SSE, the sum of squared errors, is defined as the sum of the

squared distances between each member of a cluster and its centroid.

Mathematically it is given by the equation which is

provided here. And if you plot K against the SSE, you will see that the

error decreases as K gets larger. This is because as the number of clusters

increases, the clusters become smaller, so the distortion is also smaller. Now the idea

of the elbow method is to choose the K at which the decrease in SSE slows down abruptly. So

for example, if we have a look at the figure given here, we see that the

best number of clusters is at the elbow. As you can see, the graph here

changes abruptly after the number four, so for this particular example we're going to

use four as the number of clusters. So first of all, while working with k-means

clustering there are two key points to know. First of all, be careful about where

you start: choose the first center at random, choose the second center

so that it is far away from the first center, and similarly choose the nth center as far

away as possible from the closest of all the other centers. And the second

idea is to do many runs of k-means, each with different random starting

points, so that you get an idea of where exactly and how many clusters you need

to make, where exactly the centroids lie, and how the data is

converging. Now k-means is not exactly a very good method, so let's understand

the pros and cons of k-means clustering. We know that k-means is simple and

understandable; everyone gets it in the first go. The

items are automatically assigned to the clusters. Now if we have a look at the

cons, first of all one needs to define the number of clusters. This is a

very heavy task, because whether we have three, four or ten categories, if you do

not know what the number of clusters is going to be, it's

very difficult for anyone to guess the number of clusters. Now all the

items are forced into clusters: whether or not they actually belong to any

cluster or category, they are forced to lie in the cluster

which they are closest to. This again happens because of

not defining the correct number of clusters, or not being able to

guess the correct number of clusters. And most of all, it's unable to handle

noisy data and outliers. Machine learning engineers

and data scientists do have to clean the data, but then again it comes down to the

analysis they are doing and the method they are using. Typically

people do not clean the data for k-means clustering, or even if they clean it, there

are sometimes noisy points and outliers in the data which affect the whole

model. So that was all for k-means clustering. What we're gonna do now is

use k-means clustering on the movie data set. We have to find out the

number of clusters and divide the data accordingly. The use case is that

first of all we have a data set of five thousand movies, and what we want to

do is group the movies into clusters based on their Facebook likes. So

guys, let's have a look at the demo here. First of all what we're gonna do is

import deepcopy, NumPy, pandas and Seaborn, the various libraries which we're going

to use, and from matplotlib we're going to use pyplot, and we're gonna use

the ggplot style. Next what we're gonna do is import the data set and look at

the shape of the data set. If you have a look at the shape of the data set, we

can see that it has five thousand and forty-three rows with 28 columns, and if

you have a look at the head of the data set, we can see it has five thousand

forty-three data points. So what we're gonna do is place the data

points in the plot. We take the director Facebook likes, and we have a look at the

data columns: face number in poster, cast total Facebook likes, director

Facebook likes. So what we have done here now is take the director Facebook

likes and the actor 3 Facebook likes, right, so we have five thousand forty

three rows and two columns. Now using KMeans from sklearn, what we're

going to do is import it first: we import KMeans from sklearn.cluster.

Remember guys, sklearn is a very important library in Python for machine

learning. For the number of clusters, what we're gonna do is provide five.

Note that again the number of clusters depends upon the SSE, which is the sum of

squared errors, for which we'd use the elbow method, so I'm not going to go

into the details of that again. We're gonna fit the data with kmeans.fit,

and then find the cluster centers for the k-means and print them. So

what we find is an array of five cluster centers, and we also print the labels of

the k-means clusters. Now next what we're gonna do is plot the data which we have

with the new clusters which we have found, and for

this we're going to use Seaborn. As you can see here, we have plotted the

data into the grid, and you can see here we have five

clusters. So probably what I would say is that cluster three and cluster

zero are very very close, so it might depend. See, that's exactly what I was

going to say: the main challenge in k-means clustering is to

define the number of centers, which is K. As you can see here, the

third cluster and the zeroth cluster are

very very close to each other, so guys, they probably could have been one

cluster. And another disadvantage was that we do not exactly know how the

points are to be arranged, so it's very difficult to force the data into any

other cluster, which makes our analysis a little difficult.
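
The k-means loop and the SSE used by the elbow method, as described above, can be sketched in plain Python. This is a minimal illustration and not the code shown in the demo: the function name `kmeans`, the toy `points` list and the `seed` parameter are assumptions made for the example, and the initial centroids are simply random data points.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial guess: k distinct data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # centroids repeated, so we have converged
            break
        centroids = new_centroids
    # SSE: squared distance of every point to the centroid of its own cluster.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

# Toy data (hypothetical): two obvious groups of points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, labels, sse = kmeans(points, k=2)

# Elbow method: compute the SSE for several values of K and look for the bend.
sse_by_k = {k: kmeans(points, k)[2] for k in (1, 2, 3)}
print(labels, sse_by_k)
```

On data like this the SSE drops sharply from K = 1 to K = 2 and only slightly afterwards, which is the elbow the method looks for.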

K-means works fine, but sometimes it might be difficult to code in k-means

clustering. Now let's understand what exactly

c-means clustering is. Fuzzy c-means is an extension of k-means

clustering, the popular simple clustering technique. Fuzzy clustering,

also referred to as soft clustering, is a form of clustering in which each data

point can belong to more than one cluster. K-means tries to find

hard clusters, where each point belongs to one cluster, whereas fuzzy c-means

discovers soft clusters. In a soft cluster, any point can belong to more

than one cluster at a time, with a certain affinity value towards each.

Fuzzy c-means assigns a degree of membership, which ranges from 0 to 1, to

an object for a given cluster. There is a stipulation that the sum of the fuzzy

memberships of an object to all the clusters it belongs to must be equal to 1.

So say the degrees of membership of this particular point to both of these

clusters are 0.6 and 0.4; if you add them up you get 1. That is one of the ideas

behind fuzzy c-means. And this affinity depends on the distance

from the point to the center of the cluster: the closer the point, the higher the affinity. Now then again we have the pros

and cons of fuzzy c-means. First of all, it allows a data point to be in

multiple clusters; that's a pro. It's a more natural representation of the

behavior of genes; genes usually are involved in multiple functions, so it is

a very good type of clustering when we are talking about genes. And

again if we talk about the cons, we have to define c, which is the number of

clusters, same as K. Next, we need to determine the membership cutoff value

also, so that is time-consuming. And the clusters are

sensitive to the initial assignment of centroids, so a slight change or deviation

in the centers is going to result in a very different kind of

output from fuzzy c-means. And one of the major

disadvantages of c-means clustering is that it is a non-deterministic

algorithm, so it does not give you one particular output as such.
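
To make the idea of memberships that add up to 1 concrete, here is a small sketch of the usual fuzzy c-means membership formula for a single point with fixed cluster centers. The fuzzifier `m` (set to 2 here), the function name and the sample centers are assumptions for illustration; the video does not give these details.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster, given the cluster centers."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # The closer a center is (relative to the others), the larger the membership.
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in dists)
        for d_i in dists
    ]

# Hypothetical example: two centers, and a point that is closer to the first one.
memberships = fuzzy_memberships((2, 0), [(0, 0), (5, 0)])
print(memberships, sum(memberships))
```

The point gets a higher membership in the nearer cluster, and the memberships sum to 1 as required.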

That's that. Now let's have a look at the third type of clustering, which is

hierarchical clustering. Hierarchical clustering is an

alternative approach which builds a hierarchy from the bottom up or from the top

down, and does not require you to specify the number of clusters

beforehand. Now the algorithm works like this: first of

all we put each data point in its own cluster, then identify the closest two clusters

and combine them into one cluster, and repeat the above step till all the data

points are in a single cluster. Now there are two types of hierarchical clustering:

one is agglomerative clustering and the other one is divisive clustering.

Agglomerative clustering builds the dendrogram from the bottom level, while

divisive clustering starts with all the data points in one cluster, the root cluster.
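
The bottom-up (agglomerative) procedure described above can be sketched as follows. This is only an illustration: it uses single linkage (distance between the nearest members of two clusters) as the notion of "closest", which is an assumption, since the video does not say which linkage is meant, and the sample points are made up.

```python
import math

def agglomerative(points):
    """Bottom-up clustering with single linkage; returns the list of merges."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters (distance between their nearest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Replace the two clusters by their union and repeat.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = agglomerative(points)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

The sequence of merges is what a dendrogram draws from the bottom level upward. The nested loops also show why this approach gets slow on large data sets.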

Now again, hierarchical clustering also has some pros and cons. In the

pros, no assumption of a particular number of clusters is required,

and it may correspond to meaningful taxonomies, whereas if we talk about the

cons, once a decision is made to combine two clusters it cannot be undone,

and one of the major disadvantages of hierarchical clustering is that it

becomes very slow if we talk about very large data sets, and nowadays I think

every industry is using large data sets and collecting large amounts of data,

so hierarchical clustering is not the apt or the best method someone might

need to go for. So there's that. Now when we talk about unsupervised learning,

we have k-means clustering, and there's another important term which

people usually miss while talking about unsupervised learning: the very

important concept of market basket analysis. It is one of the key

techniques used by large retailers to uncover associations between items. It

works by looking for combinations of items that occur together frequently

in transactions. To put it another way, it allows retailers to analyze the

relationships between the items that people buy. For example, people who buy

bread also tend to buy butter. The marketing team at the retail store

should target customers who buy bread and butter and provide them an offer so

that they buy a third item, like eggs. So if a customer buys bread

and butter and sees a discount or an offer on eggs, he will be encouraged to

spend more money and buy the eggs. This is what market basket analysis is

all about. Now to find the association between two items and make

predictions about what the customers will buy, there are two algorithms, which

are association rule mining and the Apriori algorithm. So let's discuss each

of these algorithms with an example. First of all, if we have a look at

association rule mining, it's a technique that shows how items are

associated with each other. For example, customers who purchase bread have a 60%

likelihood of also purchasing jam, and customers who purchase a laptop are more

likely to purchase laptop bags. Now if you take an example of an association

rule, if you have a look at the example here, A -> B, it means that if a person

buys an item A then he will also buy an item B. Now there are three common ways

to measure a particular association, because we have to find these rules on

the basis of some statistics, right? So what we do is use support, confidence and

lift. These are the three common measures used to look at an

association rule and know exactly how good that rule is. First of all we

have support. Support gives the fraction of the transactions which

contain both item A and item B, so it's basically the frequency of the itemset in

the whole set of transactions, whereas confidence gives how often the items A and B

occur together, given the number of times A occurs,

so it's frequency(A, B) divided by frequency(A). Now what lift

indicates is the strength of the rule over the random co-occurrence of A and B.

If you have a close look at the denominator of the lift formula here, we

have support(A) x support(B). Now a major thing which can be noted from this

is that the supports of A and B are taken independently here. So if

the denominator of the lift is large, it means that the items are

selling more independently, not together, and that in turn will decrease the value

of lift. So suppose the value of lift is high: that implies

that the rule is

strong and can be used for later purposes, because in that case the

support(A) x support(B) value, which is the denominator of lift, will be low relative to the joint support,

which in turn means that there's a relationship between the items A and B.
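
The three measures just described can be written as small functions. The sketch below is illustrative only: the `baskets` list is made up for the example, and the function names are not from the video.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often B appears in the transactions that contain A."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    """Joint support of A and B relative to support(A) * support(B)."""
    return support(transactions, set(a) | set(b)) / (
        support(transactions, a) * support(transactions, b)
    )

# Hypothetical baskets, made up for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "eggs"},
    {"bread", "jam"},
    {"milk"},
    {"bread", "butter", "milk"},
]

s = support(baskets, {"bread", "butter"})
c = confidence(baskets, {"bread"}, {"butter"})
l = lift(baskets, {"bread"}, {"butter"})
print(round(s, 2), round(c, 2), round(l, 2))
```

A lift above 1 means bread and butter show up together more often than they would if they sold independently, while a lift close to 1 suggests no real association.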

So let's take an example of association rule mining and understand how exactly

it works. Let's suppose we have a set of items A, B, C, D and E, and we have a

set of transactions, which are T1, T2, T3, T4 and T5, and what we need to do is

create some sort of rules. For example, you can see A -> D, which means that if a

person buys A he buys D; if a person buys C he buys A; if a person buys A he buys

C; and the fourth one is, if a person buys B and C he in turn buys A. Now

what we need to do is calculate the support, confidence and lift of these

rules. Now then again we talk about the Apriori algorithm. The Apriori algorithm

and association rule mining go hand in hand. What the Apriori

algorithm does is use the frequent itemsets to generate the association rules, and it

is based on the concept that a subset of a frequent itemset must also be a

frequent itemset. So let's understand what a frequent itemset is and how all

of these work together. If we take the following transactions of items, we have

transactions T1 to T5, and the items are {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5} and {1, 3, 5}.

Now another important thing about support, which I forgot to mention, is

that when talking about association rule mining there is a minimum support count

that we need to set. Now the first step is to build a list of itemsets of size 1

using this transaction data set, and use a minimum support count of 2. Now let's

see how we do that. If we create the table C1 and have a close look

at it, we have the itemset {1}, which has a support of 3 because it appears

in transactions T1, T3 and T5. Similarly, if you have a look at the

single item 3, it has a support of 4: it appears in T1,

T2, T3 and T5. But if we have a look at the item 4, it only appears in one

transaction, so its support value is 1. Now the itemsets with a support

value which is less than the minimum support value, that is 2, have to be

eliminated, so the final table, which is table F1, has 1, 2, 3 and 5; it does not

contain the 4. Now what we're going to do is create the item list of size 2,

and all the combinations of the itemsets in F1 are used in this iteration. So

we've left 4 behind; we just have 1, 2, 3 and 5, so the possible itemsets are {1, 2},

{1, 3}, {1, 5}, {2, 3}, {2, 5} and {3, 5}. Then again we calculate the supports. In this case,

if we have a closer look at the table C2, we see that the itemset {1, 2}

has a support value of 1, so it has to be eliminated, and the final table F2 does

not contain {1, 2}. Similarly, we create the itemsets of size 3 and

calculate their support values, but before calculating the support let's

perform pruning on the data set. Now what is pruning? After all the

combinations are made, we go through the itemsets in table C3 to check if there

is any subset whose support is less than the minimum support value. This is the

Apriori principle. In the itemset {1, 2, 3} we can see that we have {1, 2}, and

in {1, 2, 5} again we have {1, 2}, so we'll discard both of these itemsets and

we'll be left with {1, 3, 5} and {2, 3, 5}. With {1, 3, 5} we have the subsets {1, 5}, {1, 3} and {3, 5},

which are present in table F2; then again we have {2, 3}, {2, 5} and {3, 5}, which are

also present in table F2. So we have to remove the itemsets containing {1, 2} from the table

C3 and create the table F3. Now if we use the itemsets of C3 to

create the itemsets of C4, what we find is that we have the itemset {1, 2, 3, 5}, whose

support value is 1, which is less than the minimum support value of 2. So what

we're going to do is stop and return to the previous

itemset table, that is table C3. So the final table F3 has {1, 3, 5} with

a support value of 2 and {2, 3, 5} with a support value of 2.
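
The walk-through above (tables F1, F2 and F3 with a minimum support count of 2) can be reproduced with a short level-wise sketch. The function name `apriori_counts` is an assumption for this example. It is a simplified illustration of the Apriori idea and not the mlxtend implementation used in the demo.

```python
from itertools import combinations

def apriori_counts(transactions, min_count):
    """Return one {itemset: support count} table per itemset size (F1, F2, ...)."""
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    tables = []
    size = 1
    while candidates:
        # Keep only the candidates that reach the minimum support count.
        frequent = {c: count(c) for c in candidates if count(c) >= min_count}
        if not frequent:
            break
        tables.append(frequent)
        size += 1
        # Join step: build candidates one item larger from the surviving items.
        pool = sorted({i for s in frequent for i in s})
        candidates = [
            frozenset(c) for c in combinations(pool, size)
            # Prune step: every (size - 1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        ]
    return tables

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
f1, f2, f3 = apriori_counts(transactions, min_count=2)
print(f3)
```

In this sketch the candidate {1, 2, 3, 5} is already discarded by the prune step, because it contains {1, 2}, so the loop stops after F3 without counting its support. The result is the same as in the walk-through: {1, 3, 5} and {2, 3, 5}, each with a support count of 2.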

Now what we're going to do is generate all the subsets of each frequent itemset. So

let's assume that our minimum confidence value is 60%.

For every non-empty subset S of a frequent itemset I, the output rule is S -> (I - S), that is, S

recommends I - S, if the support of I divided by the support of S is greater

than or equal to the minimum confidence value; only then do we proceed further. So

keep in mind that we have not used lift till now; we are only working with

support and confidence. Applying the rules to the itemsets of F3, we get rule 1, which

is {1, 3} -> ({1, 3, 5} - {1, 3}): it means if you buy 1 and 3,

there's a 66% chance that you'll buy item 5 also. Similarly, the rule {1, 5} -> 3

means that if you buy 1 and 5, there's a 100% chance that you will buy 3

also. Similarly, if we have a look at rules 5 and 6 here, the confidence value is less

than 60%, which was the assumed minimum confidence value, so what we're going to

do is reject these rules. Now an important thing to note here:

have a closer look at rule 5 and rule 3.

You see it has 1, 5, 3; 1, 5, 3; 3, 1, 5; it's very confusing, so one thing to keep

in mind is that the order of the itemsets is also very important; that will

help us create good rules and avoid any kind of confusion.
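
The rule-generation step described above can be sketched like this for the frequent itemset {1, 3, 5} with a minimum confidence of 60%. The helper names are assumptions for the example, and the order in which the rules come out is not meant to match the rule numbering on the slide.

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def rules_from(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules that pass the threshold."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            # confidence(S -> I - S) = support(I) / support(S)
            conf = count(itemset) / count(antecedent)
            if conf >= min_confidence:
                yield set(antecedent), set(itemset - set(antecedent)), conf

rules = list(rules_from({1, 3, 5}, min_confidence=0.6))
for antecedent, consequent, conf in rules:
    print(antecedent, "->", consequent, round(conf, 2))
```

The rules starting from {3} alone and {5} alone have a confidence of 50% and are rejected, which matches the two rejected rules in the walk-through.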

Now let's learn how association rules are used in a market basket analysis

problem. What we'll do is use the online transactional data of a

retail store for generating association rules. First of all what you need to

do is import the pandas and mlxtend libraries and read the

data. So first of all what we're going to do is read the data.

Then from mlxtend.frequent_patterns we're going to

import apriori and association_rules. As you can see here, we have the

head of the data: you can see we have the invoice number, stock code, the

description, quantity, the invoice date, unit price, customer ID and the country.

In the next step what we will do is the data cleanup, which

includes removing spaces from some of the descriptions given, and what we're

going to do is drop the rows that do not have invoice numbers and remove

the credit transactions. So here what we're gonna do is remove the rows which do not

have an invoice number; if the invoice number string contains a C, then

we're going to remove it, as those are the credits; and remove any kind of spaces from

the descriptions. As you can see here, we have something like 532,000 rows

with eight columns. Next, after the cleanup, we

need to consolidate the items into one transaction per row, with each product as a column.

For the sake of keeping the data set small, we're gonna only look at the sales

for France. So we're gonna use only France and group by invoice number and

description, with the quantities summed up, which leaves us with 392 rows and

1563 columns. Now there are a lot of zeros in the data, but we also need to

make sure any positive values are converted to a 1 and anything less than

0 is set to 0. For that we're going to use this code defining encode_units: if

x is 0 or less it returns 0, and if x is 1 or more it returns 1. So what we're

going to do is map and apply it to the whole data set we have here. Now that

we have structured the data properly, the next step is to generate the

frequent itemsets that have a support of at least 7%.
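
The encoding step and the 7% support threshold can be illustrated without the retail data. In the sketch below, only the name `encode_units` and the 0.07 threshold come from the video. The `basket` dictionary is a hypothetical stand-in for the invoice-by-product table, and plain dictionaries are used in place of the pandas data frame from the demo.

```python
def encode_units(quantity):
    """Collapse a purchased quantity into 1 (bought) or 0 (not bought)."""
    return 1 if quantity >= 1 else 0

# Hypothetical invoice -> {product: summed quantity}; -2 mimics a returned item.
basket = {
    "inv1": {"ALARM CLOCK": 12, "LUNCH BOX": 0, "POSTAGE": 3},
    "inv2": {"ALARM CLOCK": 4, "LUNCH BOX": 6, "POSTAGE": 1},
    "inv3": {"ALARM CLOCK": 0, "LUNCH BOX": -2, "POSTAGE": 2},
    "inv4": {"ALARM CLOCK": 0, "LUNCH BOX": 0, "POSTAGE": 1},
}

# Apply the encoding to every cell of the table.
encoded = {
    inv: {product: encode_units(q) for product, q in row.items()}
    for inv, row in basket.items()
}

def item_support(table, product):
    """Fraction of invoices in which the product was bought."""
    return sum(row[product] for row in table.values()) / len(table)

MIN_SUPPORT = 0.07  # the 7% threshold mentioned in the video
frequent = {
    product: item_support(encoded, product)
    for product in basket["inv1"]
    if item_support(encoded, product) >= MIN_SUPPORT
}
print(frequent)
```

With only four invoices every product passes the 7% threshold. On the real data, with hundreds of invoices and more than a thousand products, the same filter removes most items.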

Now this number is chosen so that you can get enough itemsets. Now what we're

gonna do is generate the rules with their corresponding support, confidence and

lift. We had given the minimum support as 0.07; the metric is lift, applied to the frequent

itemsets, and the threshold is one. So these are the resulting rules. Now there are a few rules with

a high lift value, which means that the combination occurs more frequently than would be

expected given the number of transactions and product combinations. In most of the

places the confidence is high as well. So these are a few of the observations

we get here. If we filter the data frame using standard pandas code for a large

lift, six, and a high confidence, 0.8, this is what the output is going to look like:

these are 1, 2, 3, 4, 5, 6, 7, 8. So as you can see here, we have the eight rules which are

the final rules given by association rule mining, and that is how

industries, or any of these large retailers we've talked about, get

to know how their products are used and how exactly they should rearrange them and

provide offers on the products so that people spend more and more money

and time in the shop. So that was all about association rule mining. So guys,

that's all for unsupervised learning. I hope you got to know about the different

formulas and how unsupervised learning works, because, you know, we did not provide any

labels to the data; all we did was create some rules without knowing what the data

is, and we did clustering, different types of clustering: k-means,

c-means, hierarchical clustering. So now coming to the third and last type of

learning, which is reinforcement learning. Reinforcement learning is a

type of machine learning where an agent is put in an environment and learns

to behave in this environment by performing certain actions and observing

the rewards which it gets from those actions. So reinforcement learning is

all about taking an appropriate action in order to maximize the reward in a

particular situation. In supervised learning the training data comprises

the input and the expected output, so the model is trained with the

expected output itself, but when it comes to reinforcement learning

there is no expected output. The reinforcement agent decides what actions

to take in order to perform a given task. In the absence of a training data set, it

is bound to learn from its experience. So let's understand reinforcement learning

with an analogy. Consider a scenario wherein a baby is learning how to walk.

Now this scenario can go in two ways. First, the baby starts walking and

makes it to the candy. Now since the candy is the end goal, the

baby is happy: it's a positive reward. Now coming to the

second scenario, the baby starts walking but falls due to some hurdle in between.

Now the baby gets hurt and does not get to the candy. It's negative: the baby is

sad, a negative reward. Just like we humans learn from our mistakes by trial and

error, reinforcement learning is also similar. We have an agent, which is the

baby, a reward, which is the candy, and many hurdles in between. The agent is supposed

to find the best possible path to reach the reward. So guys, if you have a look at

some of the important reinforcement learning definitions, first of all we

have the agent: the reinforcement learning algorithm that learns from

trial and error, that's the agent. Now if we talk about the environment, the world through

which the agent moves, or the obstacles which the agent has to conquer, are the

environment. Now actions A are all the possible steps

that the agent can take. The state S is the current condition returned by the

environment. Then again we have the reward R, an instant return from the environment

to appraise the last action. Then again we have the policy, which is pi: it is the

approach that the agent uses to determine the next action based on the current

state. We have the value V, which is the expected long-term return with discount,

as opposed to the short-term reward R. Then again we have the action value Q: this is

similar to the value, except it takes an extra parameter, which is the current

action A. Now let's talk about reward maximization for a moment.

Now a reinforcement learning agent works based on the theory of reward

maximization. This is exactly why the RL agent must be trained in such a way that it

takes the best action so that the reward is maximum.

Now the collective reward at a particular time with the respective

actions is written as G_t = R_(t+1) + R_(t+2) + and so on.

Now this equation is an ideal representation of rewards; generally

things do not work out like this while summing up the cumulative rewards. Now

let me explain this with a small game. In the figure you see a fox, some meat

and a tiger. Our reinforcement learning agent is the fox, and his end goal is to

eat the maximum amount of meat before being eaten by the tiger.

Since this fox is a clever fellow, he eats the meat that is closer to him rather

than the meat which is close to the tiger, because the closer he goes to the

tiger, the higher are his chances of getting killed. As a result,

the rewards near the tiger, even if they are bigger meat chunks, will be discounted.

This is done because of the uncertainty factor that the tiger might kill the fox.

Now the next thing to understand is how discounting of rewards works. To do

this we define a discount rate called gamma. The value of gamma is between 0 and

1; the smaller the gamma, the larger the

discount, and vice versa. So our cumulative discounted reward is G_t = the

summation over k from 0 to infinity of gamma to the power k times

R_(t+k+1), where gamma belongs to 0 to 1. But if the fox decides to explore a

bit, it can find bigger rewards, that is, this big chunk of meat. This is called

exploration. So reinforcement learning basically works on the basis of

exploration and exploitation. Exploitation is about using the

already known exploited information to heighten the rewards, whereas exploration

is all about exploring and capturing more information about the environment.
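
Going back to the discounted reward defined a moment ago, the effect of gamma is easy to see in a few lines of code. The reward list is a made-up example (two small chunks of meat, then a big chunk near the tiger), and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[k] stands for R_(t+k+1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards: two small chunks of meat, then a big chunk near the tiger.
rewards = [1, 1, 10]

undiscounted = discounted_return(rewards, gamma=1.0)   # plain sum of the rewards
discounted = discounted_return(rewards, gamma=0.5)     # later rewards count for less
print(undiscounted, discounted)
```

With gamma = 0.5 the big reward that lies two steps away only contributes a quarter of its value, which is the discounting of the meat near the tiger described above.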

There is another problem which is known as the k-armed bandit problem. The k-armed

bandit is a metaphor representing a casino slot machine with

k pull levers or arms. The user or the customer pulls any one of the levers to

win a projected reward. The objective is to select the lever

that will provide the user with the highest reward. Now here comes the

epsilon-greedy algorithm. It tries to be fair to the two opposite goals of exploration and

exploitation by using a mechanism of flipping a coin, which is like: if you

flip a coin and it comes up heads, you should explore for a moment, but if it comes up tails,

you should exploit, that is, take whatever action seems best at the present moment.
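
A minimal sketch of the epsilon-greedy idea on a k-armed bandit follows. The payout probabilities, the number of pulls and the function names are assumptions for the example. Exploring is implemented as picking any arm uniformly at random, which is the common formulation.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an arm index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                        # explore: any arm
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit: best arm

def run_bandit(payout_probs, epsilon=0.1, pulls=5000, seed=0):
    """Play a simulated slot machine and return (pull counts, reward estimates)."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    estimates = [0.0] * len(payout_probs)
    for _ in range(pulls):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Running average of the rewards seen so far for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Hypothetical two-armed machine: the second lever pays out far more often.
counts, estimates = run_bandit([0.2, 0.8])
print(counts, [round(e, 2) for e in estimates])
```

With two arms, a uniformly random exploration step lands on the best known arm half of the time and on the worst known arm the other half, which is where the epsilon / 2 figures come from.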

So with probability 1 - epsilon the epsilon-greedy algorithm exploits the

best known option, with probability epsilon / 2 the algorithm explores the

best known option, and with

probability epsilon / 2 the epsilon-greedy algorithm explores the worst

known option. Now let's talk about the Markov decision process. The mathematical

approach for mapping a solution in reinforcement learning is called the Markov

decision process, which is MDP. In a way, the purpose of reinforcement learning is

to solve a Markov decision process. Now the following parameters are used to

attain a solution: a set of actions A, a set of states S, the reward R, the

policy pi and the value V, and we have the transition function T, the probability

that an action A leads to a state S. Now to briefly sum it up, the agent must take an

action A to transition from the start state to the end state S. While doing so the

agent receives a reward R for each action he takes. The series of actions

taken by the agent defines the policy pi, and the rewards collected

define the value V. The main goal here is to maximize the rewards by

choosing the optimum policy. Now let's take an example of choosing the shortest

path. Consider the example given here. Given the above

representation, our goal here is to find the shortest path between A and D. Each

edge has a number linked to it, and this denotes the cost to traverse that edge.

Now the task at hand is to traverse from point A to D with the minimum possible

cost. In this problem the set of states is denoted by the nodes A, B, C and

D; the action is to traverse from one node to another, given by, say, A -> B

or C -> D; the reward is the cost represented by each edge; and the policy

is the path taken to reach the destination, say A to C to D. So you start off

at node A and take baby steps to your destination. Initially only the next

possible nodes are visible to you. If you follow the greedy approach, you take the

most optimal step, that is choosing A to C instead of the other options such as A to B. Now you are

at node C and want to traverse to node D. You must again choose the path wisely:

choose the path with the lowest cost. We can see that A-C-D has the lowest cost,

and hence we take that path. To conclude, the policy is A to C to D and the value

is 120. So let's understand the Q-learning algorithm,

which is one of the most used reinforcement learning algorithms, with

the help of an example. We have five rooms in a building

connected by doors, and each room is numbered from 0 through 4. The outside of

the building can be thought of as one big room, which is room number five.

Now doors 1 and 4 lead into the building from room 5,

the outside. Now let's represent the rooms on a graph, with each room as a

node and each door as a link. As you can see here, we have represented it as a

graph, and our goal is to reach node 5, which is the outer space. So what we're

gonna do in the next step is associate a reward value with each door.

The doors that lead directly to the goal have a reward of 100, whereas the

doors that do not directly connect to the target room have a reward of zero. Because the

doors are two-way, two arrows are assigned between each pair of connected rooms, and each arrow

contains an instant reward value. After that, the terminology in

Q-learning includes the terms state and action. Each room, including room 5, represents a state, and the

agent's movement from one room to another room represents an action. In this

figure a state is depicted as a node, while an action is represented by the

arrows. So for example, let's say an agent traverses from room 2 to

room 5. The initial state is gonna be state 2, then the next step is

from state 2 to state 3. Next it moves from state 3 to either state 2, 1 or 4, and

if it goes to 4 it reaches state 5. So that's how you represent the whole

traversal of any particular agent in all of these rooms: the states are represented by nodes and the

actions by arrows. So we can put this state diagram and the instant reward values

into a reward table, which is the matrix R. As you can see, the -1 here in

the table represents the null values, because you cannot go from 1 to 1, right,

and since there is no way to go from 1 to 0, that is also -1. So

-1 represents the null values, whereas 0

represents zero reward and 100 represents the reward for going to

room 5. One more important thing to know here is that if you're in room 5, then

you can go to room 5 itself, and the reward is 100. So what we need to do is add

another matrix Q, representing the memory of what the agent has learned through

experience. The rows of matrix Q represent the current state of the agent,

whereas the columns represent the possible actions leading to the next

state. Now the formula to calculate the Q matrix is: Q at a

particular state and given action is equal to R of that state and action,

plus gamma, the parameter which we discussed

earlier, which ranges from 0 to 1, times the maximum of Q over the next state

and all its actions, that is Q(state, action) = R(state, action) + gamma x max[Q(next state, all actions)]. So let's understand this with an example. Here are the

nine steps which any Q-learning algorithm has. First of

all, set the gamma parameter and the environment rewards in the matrix R.

Then initialize the matrix Q to zero, select a random initial

state, set the initial state as the current state, select one among all the possible

actions for the current state, using this possible action consider going to the

next state, when you get the next state

get the maximum Q value for it based upon all its possible actions, compute the Q value using the formula, and repeat the

above steps until the current state equals your goal.

set the values of the learning parameters gamma which is 0.8 and

initial state as room number one so the next initialize the Q matrix a zero

matrix so on the left hand side as you can see here we have the Q matrix which

has all the values as 0 now from room 1 you can either go to room 3 or room 5 so

Let's select room 5, because that's our end goal. From room 5, calculate the maximum Q value for this next state based on all possible actions: Q(1,5) = R(1,5) + 0.8 × max[Q(5,1), Q(5,4), Q(5,5)], where R(1,5) is 100 and 0.8 is gamma. As you can see here, the Q values are all initialized to zero, so for now the maximum of Q(5,1), Q(5,4) and Q(5,5) is 0, and the final value of Q(1,5) is 100 + 0.8 × 0 = 100. That's how we update our Q matrix: the entry at position (1,5), in the second row, gets updated to 100. That completes the first episode.

For the next episode, we start with a randomly chosen initial state. Let's assume the state is 3. From room 3 you can go to room 1, 2 or 4. Let's select room 1, because from our previous experience we have seen that room 1 is directly connected to room 5. From room 1, calculate the maximum Q value for this next state based on all possible actions: Q(3,1) = R(3,1) + 0.8 × max[Q(1,3), Q(1,5)] = 0 + 0.8 × 100 = 80. So the matrix Q gets updated. Now the next state, 1, becomes the current state. We repeat the inner loop of the Q-learning algorithm because state 1 is not the goal state. From 1 you can go either to 3 or to 5, so let's select 5, as that's our goal. From room 5 there are again three possible actions, whose Q values are still zero, so Q(1,5) stays at 100 and the Q matrix remains the same.

That is how you select random starting points, fill up the Q matrix, and see which path leads to the goal with the maximum reward points. Now let's code the same thing in Python. We import numpy as np and take the R matrix as we defined it earlier, where -1 marks the null values, 0 marks the moves with zero reward, and 100 is the reward for reaching the goal. Then we initialize the Q matrix to zero, set gamma to 0.8, and set the initial state to 1.

Here we have a function that returns all the available actions in the state given as an argument, so if we call it with the current state, we get the actions available in that state. We have another function, known as sample next action, which chooses at random which action to perform from all the available actions, and finally action holds the result of sample next action for the available actions. Then we have another function, update, which updates the Q matrix according to the selected path and the Q-learning formula.

Initially our Q matrix is all zeros, so we train it over 10,000 iterations and see what Q values come out. As the agent learns through more iterations, the Q matrix finally reaches its converged values. The Q matrix can then be normalized, that is, converted to percentages, by dividing all the non-zero entries by the highest number, which is 500 in this case. Once the matrix Q gets close enough to the state of convergence, the agent has learned the optimal paths to the goal state. So what we do next is divide Q by its maximum, Q / np.max(Q) * 100, so that we get a normalized matrix.
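The walkthrough above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code shown in the video: it uses plain lists instead of NumPy so that it runs without dependencies, the door layout in R is reconstructed from the narration (the 0 to 4 door in particular is assumed), and the helper names available_actions, sample_next_action, update, normalize and greedy_path are chosen here to echo the spoken description.

```python
import random

# Reward matrix R: rows are the current room (state), columns the room moved
# into (action). -1 = no door, 0 = door with no reward, 100 = door into goal.
# ASSUMPTION: layout reconstructed from the narration, not copied from the video.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GAMMA = 0.8
GOAL = 5
N = len(R)


def available_actions(state):
    """Return every room reachable from `state` (entries of R that are not -1)."""
    return [a for a in range(N) if R[state][a] >= 0]


def sample_next_action(actions, rng):
    """Choose one of the available actions uniformly at random."""
    return rng.choice(actions)


def update(Q, state, action):
    """Q(s, a) = R(s, a) + gamma * max over all actions of Q(next state, .)."""
    # The action is the room moved into, so the next state equals the action.
    Q[state][action] = R[state][action] + GAMMA * max(Q[action])


def train(iterations=10_000, seed=0):
    """Start from an all-zero Q and apply randomly sampled updates."""
    rng = random.Random(seed)
    Q = [[0.0] * N for _ in range(N)]
    for _ in range(iterations):
        state = rng.randrange(N)
        action = sample_next_action(available_actions(state), rng)
        update(Q, state, action)
    return Q


def normalize(Q):
    """Express every entry as a percentage of the largest entry."""
    top = max(max(row) for row in Q)
    return [[100 * q / top for q in row] for row in Q]


def greedy_path(Q, start, max_steps=10):
    """Follow the highest Q value from `start` until the goal is reached."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        row = Q[path[-1]]
        path.append(row.index(max(row)))
    return path


Q = train()
Q_percent = normalize(Q)
path = greedy_path(Q, start=2)
print([round(v) for v in Q[1]])
print(path)
```

Because Q(3,1) and Q(3,4) converge to the same value, the greedy path from room 2 may pass through either room 1 or room 4, which matches the two equally good routes described in the narration.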

Now, once the Q matrix gets close enough to the state of convergence, the agent has learned all the paths. The optimal path given by the Q-learning algorithm is: if it starts from 2, it goes to 3, then to 1, and then to 5. Starting at 2, it can also go to 3, then 4, then 5, which gives the same result. As you can see here, the output given by the Q-learning algorithm is the selected path 2, 3, 1, 5, starting from state 2. So this is exactly how a reinforcement learning algorithm works. It finds the optimal solution given the states, the actions and the rewards, and the various other definitions, or challenges, I would say. The main goal is to get the maximum reward from the environment, and that's how an agent learns on its own, going through millions and millions of iterations and learning which path gives what reward. So that's how the Q-learning algorithm works, and that's how it works in Python as well, as I showed you.

So now you have a clear idea of the different machine learning algorithms and how they work, the different phases of machine learning, the different applications of machine learning, how supervised learning works, how unsupervised learning works, how reinforcement learning works, what to choose in which scenario, and what the different algorithms are under each of these types of machine learning. Let's move forward to the next part of our session, which is understanding artificial intelligence, deep learning and machine learning.

Well, data science is something that has been there for ages. Data science is the extraction of knowledge from data by using scientific techniques and algorithms. People usually have a certain level of confusion when it comes to differentiating between the terms artificial intelligence, machine learning and deep learning, so don't worry, I'll clear all of these doubts for you. Artificial intelligence is a technique which enables machines to mimic human behavior. The idea behind artificial intelligence is fairly simple yet fascinating, which is to make intelligent machines that can take decisions on their own. For years it was thought that computers would never match the power of the human brain. Back then we did not have enough data and computational power, but now, with big data coming into existence and with the advent of GPUs, artificial intelligence is possible. Machine learning is a subset of artificial intelligence which uses statistical methods to enable machines to improve with experience, whereas deep learning is a subset of machine learning which makes the computation of multi-layer neural networks feasible. It uses neural networks to simulate human-like decision-making. As you can see, if we talk about the data science ecosystem, we have artificial intelligence, machine learning and deep learning, with deep learning, the innermost circle, being a part of machine learning as well as of artificial intelligence. But why was deep learning required?

For that, let's understand the need for deep learning. A step towards artificial intelligence was machine learning, and machine learning was a subset of AI. It deals with the extraction of patterns from large datasets. Handling large datasets was not a problem. What was a problem was that machine learning algorithms could not handle high-dimensional data, where we have a large number of inputs and outputs running to thousands of dimensions. Handling and processing such data becomes very complex and resource-exhaustive. This is also termed the curse of dimensionality. Another challenge faced by machine learning was specifying the features to be extracted. As we saw earlier, in all the algorithms we have discussed so far, we had to specify the features to be extracted. This plays an important role in predicting the outcome as well as in achieving better accuracy. Therefore, without feature extraction, the challenge for the programmer increases, as the effectiveness of the algorithm very much depends on how insightful the programmer is. This is where deep learning comes into the picture and comes to the rescue. Deep learning is capable of handling high-dimensional data and is also efficient in focusing on the right features on its own.

So what exactly is deep learning? Deep learning is a subset of machine learning, as I mentioned earlier, where similar machine learning algorithms are used to train deep neural networks so as to achieve better accuracy in those cases where the former was not performing up to the mark.

Basically, deep learning mimics the way our brain functions and learns from experience. As you know, our brain is made up of billions of neurons that allow us to do amazing things. Even the brain of a small kid is capable of solving complex problems which are very difficult to solve even using supercomputers. So how can we achieve the same functionality in programs? This is where we need to understand the artificial neuron and artificial neural networks. First of all, let's have a look at the different applications of deep learning.

We have automatic machine translation, object classification in photographs, automatic handwriting generation, character text generation, image caption generation, colorization of black and white images, automatic game playing and much more. Google Lens is a set of vision-based computing capabilities that allows your smartphone to understand what's going on in a photo, video or any live feed. For instance, point your phone at a flower and Google Lens will tell you on the screen which type of flower it is.

You can aim the camera at any restaurant sign to see the reviews and other recommendations. Now if we talk about machine translation, this is a task where you are given words in some language and you have to translate them into a desired language, say English. This kind of translation, when the words are read from an image, is a classic example of image recognition. The final application of deep learning which we have here is image colorization, that is, automatic colorization of black and white images. As you know, back in the 40s and 50s we did not have any color photographs. Through deep learning, by analyzing what shadows are present in the image and how the light is bouncing off the skin tone of the people, automatic colorization is now possible, and this is all possible because of deep learning.

Now deep learning studies the basic unit of the brain, a cell called a neuron. Let us understand the functionality of a biological neuron and how we mimic this functionality in the perceptron, or what we call an artificial neuron. As you can see here, we have the image of a biological neuron. It has a cell body, mitochondria and a nucleus, and we have the dendrites, the axon, the node of Ranvier, the Schwann cell and the synapse. We need not know about all of these. What we mostly need to know about is the dendrite, which receives signals from other neurons, the cell body, which sums up all the inputs, and the axon, which is used to transmit the signals to other cells. Now an artificial neuron, or

perceptron, is a linear model which is based upon the same principle and is used for binary classification. It models a neuron which has a set of inputs, each of which is given a specific weight, and the neuron computes some function on these weighted inputs and gives the output. It receives n inputs, one corresponding to each feature. It then sums up those inputs, applies a transformation and produces an output. It generally has two functions, the summation and the transformation, and the transformation is also known as the activation function. As you can see here, we have certain inputs, we have certain weights, we have the transfer function and then we have the activation function. The transfer function is nothing but the summation function here, and this is the schematic for a neuron in a neural network. This is how we mimic a biological neuron in terms of programming. Now the weight shows the effectiveness of a particular input: the greater the weight of an input, the more impact it will have on the neural network. On the other hand, bias is an additional parameter in the perceptron which is used to adjust the output along with the weighted sum of the inputs to the neuron, which helps the model best fit the given data.

Activation functions translate the inputs into outputs, using a threshold to produce an output. There are many functions that are used as activation functions, such as linear or identity, unit or binary step, sigmoid or logistic, tanh, ReLU and softmax. If we talk about the linear activation function, a linear transform is basically the identity function, where the dependent variable has a directly proportional relationship with the independent variable. In practical terms, it means that the function passes the signal through unchanged. Now the question arises of when to use a linear transform function. The simple answer is that when we want to solve a linear regression problem, we apply a linear transformation function.

Next in our list of activation functions we have the unit step. The output of a unit step function is either 1 or 0, depending on the threshold value we define. A step function with the threshold value 5 is shown here. If the value is less than 5, the output will be 0, whereas if the value is equal to or greater than 5, the output will be 1. This "equal to" is very important to consider here, because sometimes people put the "equal to" on the lower side. That's not how it is used. Rather, it is used on the upper side, where only if the value is greater than or equal to the threshold X will the output be 1. Now a

sigmoid function converts an independent variable of near-infinite range into simple probabilities between 0 and 1, and most of its output will be very close to either 0 or 1. If you have a look at the function here, we have 1 divided by 1 plus e raised to the power minus beta x, that is, 1 / (1 + e^(-βx)). I'm not going into the details of the mathematics of the sigmoid, but it is very much used to convert independent variables of very large range into values between 0 and 1. Now the question arises of when to use a sigmoid transformation function. When we want to map the input values to a value in the range of 0 to 1, where we know the output should lie only between these two numbers, we apply the sigmoid transformation function.

Tanh is a hyperbolic trigonometric function. Unlike the sigmoid function, the normalized range of tanh is -1 to 1. It is very similar to the sigmoid function, but the advantage of tanh is that it can deal more easily with negative numbers.

Next on our list we have ReLU. ReLU, or the rectified linear unit transform function, only activates a node if the input is above a certain quantity. While the input is below 0, the output is 0, but when the input rises above the threshold, which in this case is 0, it has a linear relationship with the dependent variable. This is very different from a normal linear transformation, because it has a threshold. The question arises here again of when to use a ReLU transformation function. When we want to map an input x to max(0, x), that is, map the negative inputs to 0 and output the positive inputs without any change, we apply a rectified linear unit, or ReLU, transformation function.

The final one we have is softmax. When we have four or five classes of outputs, the softmax function will give the probability distribution over them. It is useful for finding out the class which has the maximum probability, so softmax is a function you will often find at the output layer of a classifier. Suppose we have an input of, say, the letters of English words, and we want to classify which letter it is. For that case we are going to use the softmax function, because in the output we have a fixed set of classes. In English we have 26 classes, from A to Z, so in that case the softmax activation function is very important.
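To make the activation functions just described concrete, here is a minimal sketch using only Python's math module. The function names, the default beta of 1 and the sample inputs are illustrative assumptions rather than anything shown in the video.

```python
import math


def linear(x):
    """Identity: passes the signal through unchanged."""
    return x


def unit_step(x, threshold=0.0):
    """1 when x is greater than or equal to the threshold, otherwise 0."""
    return 1 if x >= threshold else 0


def sigmoid(x, beta=1.0):
    """1 / (1 + e^(-beta * x)), squashing any real input into (0, 1)."""
    z = beta * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent, with outputs between -1 and 1."""
    return math.tanh(x)


def relu(x):
    """max(0, x): negative inputs become 0, positive inputs pass unchanged."""
    return max(0.0, x)


def softmax(scores):
    """Turn a list of scores into a probability distribution over classes."""
    shift = max(scores)  # subtracting the maximum keeps exp() well behaved
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Step function with threshold 5: the "equal to" case belongs to the upper side.
print(unit_step(4.9, threshold=5), unit_step(5, threshold=5))
print(sigmoid(0.0), tanh(-5.0), relu(-3.0), relu(2.5))

probs = softmax([2.0, 1.0, 0.1])
print(probs.index(max(probs)))  # index of the most probable class
```

For the 26-letter example, softmax would simply receive a list of 26 scores and return 26 probabilities that sum to 1.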

An artificial neuron can be used to implement logic gates. I'm sure you must be familiar with the working of the OR gate, that is, the output is 1 if any of the inputs is 1. Therefore, a perceptron can be used as a separator, or a decision line, that divides the input set of the OR gate into two classes: the first class being the inputs having output 0, which lie below the decision line, and the second class being the inputs having output 1, which lie above the decision line or separator. Mathematically, a perceptron can be thought of as an equation of weights, inputs and bias. As you can see here, we have f(x) = w · x + b, the weight vector times the input vector plus the bias. So let's go ahead with our demo and understand how we can implement this perceptron example, the OR gate, using an artificial neuron, or perceptron. Here we are going to use TensorFlow along with Python, so let's understand what exactly TensorFlow is first, before going to the demo. Basically, TensorFlow is a deep learning framework by Google.

To understand it in a very easy way, let's understand the two terms in TensorFlow, which are the tensors and the flow. Starting with tensors: tensors are the standard way of representing data in deep learning, and they are just multi-dimensional arrays, an extension of two-dimensional tables, or matrices, to data with higher dimensions. As you can see here, first of all we have a tensor of dimension 6, which is 1D, then a tensor of dimension 6 by 4, which is 2D, and again a tensor of dimension 6 by 4 by 2, which is 3D. The number of dimensions is not restricted to three. We can have four dimensions or five dimensions. It depends upon the number of inputs, the number of classes or the parameters which we provide to a particular neural net or perceptron.

That brings us to the flow. In TensorFlow, the computation is approached as a data flow graph. So we have tensors, and then we have a flow. Taking the example here, we have the data, we do an addition, then we do a matrix multiplication, then we check the result. If it's good, then it's fine, and if the result is not good, then we again do some sort of matrix multiplication or addition, depending on the function we are using, and then finally we have the output. If you want to know more about TensorFlow, we have an entire playlist on TensorFlow and deep learning which you should see. I'll give the link to all of these videos in the description box.

So let's go ahead with our demo and understand how we can implement the OR gate using a perceptron. First of all, we import all the required libraries, and here I am going to import only one library, which is the TensorFlow library, so we import tensorflow as tf. The next step is to define vector variables for input and output. For that we need to create variables for storing the input, the output and the bias for the perceptron. As you can see here, we have the training input, and again we have the training output. What we do next is define the weight variable. Here we will define a tensor variable of shape 3 by 1 for our weights, and we will assign some random values to it initially.

We're going to use tf.Variable, and we're going to use tf.random_normal to assign random values to the 3 by 1 tensor. Next, we define placeholders for the input and output, so that they can accept external inputs at run time. The data type will be tf.float32. For x we use a dimension of 3, and for y a dimension of 1. As discussed earlier, the input received by a perceptron is first multiplied by the respective weights, and then all of these weighted inputs are summed together. This summed value is then fed to the activation function to obtain the final result of the OR gate perceptron. So this is the output we are defining here: it is tf.nn.relu, using the ReLU activation function, applied to the matrix multiplication of the inputs, which include the bias, and the weights. In this case I have used the ReLU function, but you are free to use any of the activation functions according to your needs.

Next, we calculate the cost, or error. The cost here is the squared error, summed over the training examples, which is nothing but the square of the difference between the perceptron output and the desired output, so the equation will be loss = tf.reduce_sum(tf.square(output - y)). The goal of a perceptron is to minimize the loss, or the cost, or the error. Here we are going to use the gradient descent optimizer, which will reduce the loss. Using some sort of optimizer is a very important part of any neural network. You can learn more about the gradient descent optimizer in other Edureka videos on deep learning and neural networks.

The next step is to initialize the variables. Variables are only defined with tf.Variable, so we need to initialize the variables we defined. For that we're going to use tf.global_variables_initializer, and we're going to create a tf.Session and run it with the initializer. Now that all the variables are initialized, coming to the last step, we need to train our perceptron, that is, update the values of the weights and the bias in successive iterations to minimize the error, or the loss. Here I will be training our perceptron for a hundred epochs.

As you can see here, for i in range(100), we run the session, feeding the training input as x and the training output as y, calculate the loss, and print the epoch and the loss. For the first iteration the loss was 2.07, and as the iterations increase, the loss decreases, because the gradient descent optimizer is learning from the data. Coming down to the hundredth and final epoch, we have a loss of 0.27. So we started with 2.07 and ended up with a loss of 0.27, which is very good.
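For readers who want to trace the arithmetic without installing TensorFlow, the same OR-gate training loop can be sketched in plain Python. This is not the TensorFlow code from the video: the fixed starting weights (0.5 each), the learning rate of 0.01 and the 0.5 cut-off used to read out a class are assumptions made here for reproducibility, so the loss values will differ from the 2.07 and 0.27 quoted above.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


# OR-gate training data. The third input is a constant 1 acting as the bias input.
train_in = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
train_out = [0, 1, 1, 1]

# ASSUMPTION: fixed starting weights; the video draws them at random.
w = [0.5, 0.5, 0.5]
learning_rate = 0.01  # ASSUMPTION: not stated in the narration
losses = []

for epoch in range(100):
    grad = [0.0, 0.0, 0.0]
    loss = 0.0
    for x, y in zip(train_in, train_out):
        z = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum of the inputs
        out = relu(z)                             # activation
        err = out - y
        loss += err ** 2                          # sum of squared errors
        slope = 1.0 if z > 0 else 0.0             # derivative of ReLU at z
        for i in range(len(w)):
            grad[i] += 2 * err * slope * x[i]
    losses.append(loss)
    # Gradient descent step: move each weight against its gradient.
    w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]

# Read out a class by thresholding the output at 0.5 (an illustrative choice).
predictions = [
    1 if relu(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0
    for x in train_in
]
print(losses[0], losses[-1])
print(predictions)
```

The loss list plays the role of the per-epoch printout in the demo: the first entry is the loss before any update and the last entry is the loss at the hundredth epoch.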

This is how a perceptron works on a given data set. As you saw earlier, we gave a set of inputs, we provided weights, we had a summation function, and then we used the ReLU activation function in the code to get the final output. Then we trained the model for a hundred iterations with the training data so as to minimize the loss, and the loss came down all the way from 2.07 to 0.27.

Well, if you think the perceptron solves all the problems of building a human brain, then you would be wrong. There are two major problems. The first is that a single-layer perceptron cannot classify data points that are not linearly separable. The second is that complex problems that involve a lot of parameters cannot be solved by a single-layer perceptron. Consider the example here and the complexity of the parameters involved in a decision taken by a marketing team.

As you can see here, for email, direct, paid, referral program or organic, we have a certain number of social media subcategories: Google, Facebook, LinkedIn and Twitter. Then we have the ad types, such as search ads, remarketing ads, interest ads and lookalike ads. The parameters to be considered are the customer acquisition cost, money spent, leads generated, customers generated and the time taken to become a customer. All of these problems cannot be solved by a single-layer perceptron. One neuron cannot take in so many inputs, and that is why more than one neuron is used to solve this kind of problem.

A neural network is really just a composition of perceptrons connected in different ways and operating on activation functions. For that we have three different terms in a neural network: the input layer, the hidden layers and the output layer. In the input layer we have input nodes, which provide information from the outside world to the network and are together referred to as the input layer. The hidden nodes perform computations and transfer information from the input nodes to the output nodes. A collection of hidden nodes forms a hidden layer, and in our image we have one, two, three, four hidden layers. Finally, the output nodes are collectively referred to as the output layer and are responsible for computation and for transferring information from the network to the outside world.

Now that you have an idea of how a perceptron behaves, the different parameters involved and the different layers of neural networks, let's continue this session and see how we can create our own neural network from scratch. In this image, as you can see, we are given a set of faces. First of all, the patterns of local contrast are computed in the input layer. Then in hidden layer 1 we get the face features, in hidden layer 2 we get the different features of the face, and finally we have the output layer.

Now if we talk about training the network's weights, we can estimate the weight values for our training data using the stochastic gradient descent optimizer, as I mentioned earlier. It requires two parameters, the learning rate and the number of epochs. As I mentioned earlier, the learning rate is used to limit the amount by which each weight is corrected each time it is updated, and the number of epochs is the number of times to run through the training data while updating the weights. In the previous example we had 100 epochs, so we ran through the training data a hundred times. These, along with the training data, will be the arguments to the function.

For data scientists, data analysts or machine learning engineers, working on the hyperparameters is the most important part, because anyone can do the coding. It's your experience and your way of thinking about the learning rate, the epochs, the model you are working with, the input data you are taking and how much time it will require to train, because time is limited. As you know, these hyperparameters are the things a data scientist will be guessing when creating a particular model, and they play a huge role in the model: even a slight difference in the learning rate or the epochs might change the model's training time, so that it takes longer to train when you have a large amount of data. These are all things that a data scientist or machine learning engineer keeps in mind while creating a model.

Let's create our own neural network, and here we are going to use the MNIST data set. The MNIST data set consists of 60,000 training samples and 10,000 testing samples of handwritten digit images. The images are of size 28 by 28 pixels, and the output can be any digit from 0 to 9. The task here is to train a model which can accurately identify the digit present in the image. So let's see how we can do this using TensorFlow and Python. Firstly, we use an import from the __future__ module to bring the print function from Python 3 into Python 2.6 and later. Let's continue with our code.

Next, from tensorflow.examples.tutorials.mnist we take the MNIST input data, which is already provided by TensorFlow in its example tutorials. This is only for the learning part, and later on you can use this data for more purposes in your own learning. Next, we create mnist using input_data.read_data_sets, with one_hot given as True here. Then we import tensorflow and matplotlib.

Next, we define the hyperparameters. As I mentioned earlier, we have a few hyperparameters, like the learning rate, the epochs and the batch size. The display step is not a very big hyperparameter to consider here. The learning rate we have given here is 0.001. The number of training epochs is 15. That is up to you, because the greater the number of epochs, the more time it will take for the model to train, and here you have to make a trade-off between the amount of time it takes for the model to train and the quality of the output it gives. Again, we have a batch size of 100. This is one of the most important hyperparameters to consider, because you cannot take all of the images at once and process them together, so you need to do it batch by batch, and for that we define a batch size of 100. So out of 60,000 images we take 100 at a time as a batch, and training runs for 15 epochs. The training set has 60,000 images, so you do the math on how many batches we will require per epoch, and we'll have 15 epochs. The next step is defining the hidden layers, the input and the classes.
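Picking up the batch arithmetic left as an exercise a moment ago, the numbers quoted in the narration work out as follows. The variable names are illustrative only.

```python
training_samples = 60_000  # MNIST training set size quoted above
batch_size = 100
training_epochs = 15

# Each epoch walks through the whole training set one batch at a time.
batches_per_epoch = training_samples // batch_size
# Every batch triggers one optimizer update, repeated for each epoch.
total_updates = batches_per_epoch * training_epochs

print(batches_per_epoch)  # 600
print(total_updates)      # 9000
```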

for input layers have taken 256 numbers these are the number of perceptron I

need or the number of features to be extracted in the first layer so this

number is arbitrary you can use it according to your requirements and your

needs so for simplicity I am using two bits X here and the same I'm going to

use for the hidden layer 2 now for the number of inputs I'm going to use 784

and that is why because as I discussed earlier the MST data has an image or the

shape 28 cross 28 which is 784 so in short we have 784 pixels to be

considered in a particular image and each pixel will provide immense amount

of data so I am taking a 784 input and number of output classes Here I am

defining ten because the output can either range from zero one two three

four five six seven eight and nine so the total number of classes or the

output classes here I'm going to use are ten and again we are going to create x

and y variables X for the input and Y for the output classes now as you can

Here we have the multilayer perceptron, in which we have defined all the hidden

layers and the output layer. Layer 1 first does the matrix multiplication of the

input with the weights and then adds the biases, and the result is passed through

the ReLU activation function before it is given to layer 2. So, as you can see

here, we have the ReLU activation function for layer 1. Layer 2 takes the output

of layer 1 as its input, with the weights h2 of hidden layer 2 and the biases b2:

it multiplies the layer 1 output by the weights, adds the biases, and then again

we have a ReLU activation function. The output of layer 2 is given to the output

layer. So, as you can see here, in the final output layer we have the matrix

multiplication of layer 2 with the weights of the output layer, plus the biases

of the output layer, and what we are going to do is return the output.
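
The forward pass just described can be sketched without any framework. This is not the TensorFlow code from the session; it is a plain-Python illustration with tiny layer sizes standing in for 784, 256, 256 and 10. The keys `h2` and `b2` follow the names mentioned above, while `h1`, `b1`, `out` and the helper functions are assumptions made for the example.

```python
import random


def relu(values):
    """ReLU activation: negative values become 0, positive values pass through."""
    return [max(0.0, v) for v in values]


def dense(x, weights, biases):
    """Compute x @ weights + biases for a single input vector.

    weights is a list of rows, one row per input feature.
    """
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
        for j in range(len(biases))
    ]


def multilayer_perceptron(x, weights, biases):
    layer_1 = relu(dense(x, weights["h1"], biases["b1"]))
    layer_2 = relu(dense(layer_1, weights["h2"], biases["b2"]))
    # The output layer has no activation here; it returns raw scores (logits).
    return dense(layer_2, weights["out"], biases["out"])


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Tiny stand-ins for the sizes 784, 256, 256 and 10 used in the session.
n_input, n_hidden_1, n_hidden_2, n_classes = 4, 3, 3, 2

weights = {
    "h1": random_matrix(rng, n_input, n_hidden_1),
    "h2": random_matrix(rng, n_hidden_1, n_hidden_2),
    "out": random_matrix(rng, n_hidden_2, n_classes),
}
biases = {
    "b1": [rng.gauss(0.0, 1.0) for _ in range(n_hidden_1)],
    "b2": [rng.gauss(0.0, 1.0) for _ in range(n_hidden_2)],
    "out": [rng.gauss(0.0, 1.0) for _ in range(n_classes)],
}

logits = multilayer_perceptron([0.5, -1.0, 0.25, 2.0], weights, biases)
print(len(logits))  # one raw score per output class
```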

So let's define the weights and the biases; here we are taking random values for

them. Next, we get the prediction from the multilayer perceptron using the input,

the weights and the biases. One more important thing we are going to do here is

define the cost. We are going to use tf.reduce_mean together with the

softmax_cross_entropy_with_logits function. Here we are using the Adam

optimizer rather than the gradient descent optimizer, with the learning rate

provided initially, and what we are going to do is minimize the cost.
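
To show what the cost above measures, here is a plain-Python sketch of softmax cross-entropy averaged over a batch. It mirrors the idea behind combining a mean with softmax cross-entropy on logits, but it is not TensorFlow's implementation, and the example logits and labels are made up for illustration. The optimizer step itself is not shown.

```python
import math


def softmax_cross_entropy_with_logits(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits).

    Uses the log-sum-exp trick so large logits do not overflow.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))


def mean_cost(batch_logits, batch_labels):
    """Average the per-example losses over the batch."""
    losses = [
        softmax_cross_entropy_with_logits(z, t)
        for z, t in zip(batch_logits, batch_labels)
    ]
    return sum(losses) / len(losses)


# Illustrative values only: two examples, three classes.
batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
batch_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

cost = mean_cost(batch_logits, batch_labels)
print(cost)
```

The first example is confidently right and contributes a small loss; the second spreads its probability evenly over three classes and contributes log(3).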

So again, we are going to initialize all the global variables, and we have two

lists, cost history and accuracy history, to store all the values as we train our

model. We create a session and then the training cycle: for each epoch in the

range of 15, we first initialize the average cost at zero, and the total batch

is the number of MNIST training examples divided by the batch size, which is 100.

We loop over all the batches and run the optimization, that is the

backpropagation, and the cost operation to get the loss value. Then we display

the logs for each epoch; that will show the epoch and the cost at each step. We

calculate the accuracy from the correct predictions and append the accuracy to

its list after every epoch, and we append the cost after every epoch as well,

because that is what we created the cost history and the accuracy history for.

Finally, we plot the cost history using matplotlib, and we plot the accuracy

history as well, and we are going to see how accurate our model is.

So let's train it now. As you can see, at the first epoch we have a cost of 188 and

the accuracy is 0.85. Just after the second epoch the cost has reduced from 188 to

42, and now it is 26. As you can see, the accuracy is increasing from 0.85 to

0.909. We have reached five epochs, and you can see the cost is diminishing at a

huge rate, which is very good. You can use different types of optimizers, be it

gradient descent or the Adam optimizer. I will not go into the details of the

optimization, because it would take another half an hour or an hour to explain

what exactly it is and how exactly it works. So, as you can see, by the tenth or

eleventh epoch we have a cost of 2.4 and the accuracy is 0.94. Let's wait a little

further until the 15th epoch is done.
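
While the remaining epochs run, the accuracy figure being printed can be sketched as follows: a prediction counts as correct when the class with the highest score matches the class marked in the label. The helper names and the four example rows are illustrative assumptions, not values from the session.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(batch_logits, batch_labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    correct = [
        argmax(z) == argmax(t) for z, t in zip(batch_logits, batch_labels)
    ]
    return sum(correct) / len(correct)


# Illustrative values only: four examples, three classes, one-hot labels.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
labels = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

print(accuracy(logits, labels))  # prints: 0.75 (3 of 4 predictions match)
```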

So, as you can see, in the 15th epoch we have a cost of 0.83 and the accuracy is

0.94. We started with a cost of 188 and an accuracy of 0.85, and we have reached

an accuracy of 0.94. As you can see, this is the graph of the cost: it started

from 188 and ended at 0.83. We also have the graph of the accuracy, which started

from 0.84 or 0.85 and went all the way to 0.94. As you can see here in the graph,

the 14th epoch reached an accuracy of 0.947, and in the 15th epoch we came to an

accuracy of 0.94. Now one might ask: the accuracy was higher in that particular

epoch, so why has it decreased? Another important aspect to consider here is the

cost. In general, the lower the cost, the more accurate your model will be, so

the goal is to minimize the cost, which will in turn tend to increase the accuracy.
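
The link between cost and accuracy is a tendency rather than a guarantee, which is one way to read the small dip in accuracy between epochs 14 and 15. The sketch below uses invented probabilities for a two-class problem (they are not taken from the session's run) to show a case where the cost goes down while the accuracy also goes down.

```python
import math


def mean_cross_entropy(p_correct):
    """Average of -log(p), where p is the probability given to the true class."""
    return sum(-math.log(p) for p in p_correct) / len(p_correct)


def accuracy(p_correct):
    """With two classes, a prediction is right when the true class gets more than 0.5."""
    return sum(p > 0.5 for p in p_correct) / len(p_correct)


# Invented probabilities assigned to the true class for four examples.
epoch_a = [0.51, 0.51, 0.51, 0.05]  # three barely right, one badly wrong
epoch_b = [0.95, 0.95, 0.45, 0.45]  # two confidently right, two narrowly wrong

for name, probs in (("A", epoch_a), ("B", epoch_b)):
    print(name, round(mean_cross_entropy(probs), 3), accuracy(probs))
```

Epoch B has the lower cost because its mistakes are mild and its correct answers are confident, yet it gets fewer examples right than epoch A. Over a full training run the two measures usually improve together, but small opposite moves from one epoch to the next are normal.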

Finally, the accuracy here is about 0.94, which is very good. Now this was all

about deep learning, neural networks and TensorFlow: how to create a perceptron

or a deep neural network, what the different hyperparameters involved are, and

how a neuron works. So let's have a look at the companies hiring these data

professionals in the data science environment. We have companies all the way

from startups to big giants. The major companies we can see here are Dropbox,

Adobe, IBM, Walmart, Uber, Chase, LinkedIn and Red Hat, and there are many more

companies. As I mentioned

earlier, the demand for these professionals is high but the number of people

applying is low, because you need a certain level of experience to understand how

things work. You need to understand machine learning to understand deep

learning, and you need to understand all the statistics and probability, and

that is not an easy task. So I would say you require at least 3 to 6 months of

rigorous training, with a minimum of one to two years of practical

implementation and project work, to go into a data science career, if you think

that's the career you want. Edureka, as you know, provides a Data Science

Master Program, and we have a Machine Learning Master Program as well. As you can

see, in the Data Science Master Program we have Python Statistics, R Statistics,

Data Science using R, Python for Data Science, Apache Spark and Scala, AI and

Deep Learning with TensorFlow, and Tableau. So guys, as you can see here, we have

12 courses in this master program, with 250 hours of interactive learning via

capstone projects, and as you can see here we have a certain discount going on.

The hike in salary you get is much higher if you go for data science rather than

any other program. So you can see we have Python Statistics, R Statistics, Data

Science using Python, Python for Data Science, and Apache Spark and Scala, which

is a very important part of data science; you also need to know about the Hadoop

ecosystem. We have Deep Learning with TensorFlow and Tableau, and this is a 31-week

course. As I mentioned earlier, it's not an easy task, and you do not become a data

scientist in one month or in two months. You require a lot of training and a lot

of practice to become a data scientist, a machine learning engineer or even a

data analyst, because there is a vast list of topics and areas that you need

to cover. Once you cover all of these topics, what you need to do is select the

area in which you want to work and the kind of data you are going to be handling,

whether it is text data, medical records, or video, audio or images for

processing. It is not an easy task to become a data scientist, so you need a very

good and well-chosen path of learning to become a data scientist. So guys, that's

it for this session. I

hope you enjoyed the session and got to know about data science and its

different aspects, how it works all the way from statistics and probability

through machine learning and deep learning, and finally coming to AI. So this was

the path of data science, and I hope you enjoyed this session. If you have any

queries regarding this session or any other session, please feel free to mention

them in the comment section below, and we'll happily answer all of your queries.

Till then, thank you and happy learning. I hope you have

enjoyed listening to this video. Please be kind enough to like it, and you can

comment any of your doubts and queries and we will reply to them at the earliest.

Do look out for more videos in our playlist and subscribe to the edureka!

channel to learn more. Happy learning!
