Data Distribution

Basics-of-Predictive-Modeling

Data is generated at an exponential rate, and we do not have the time, money or ability to examine all of it to understand its generation process. The main reason for understanding the data generation process is to make an educated guess before making a decision. For example, before writing a book you should know how many potential buyers there are for it. Without knowing the number of buyers you may under- or over-supply the market, which is a production risk. To estimate this figure you need data on the likes and dislikes (attributes) of potential buyers. You could run a survey asking customers how likely they are to buy the book, or you may already have this information in a database. You cannot survey every customer, as that would require a lot of time, resources and money, so you collect a sample and, based on that sample, you infer the average number of customers who will buy the book. This survey collection is your sampling process, which should be free of bias and representative of the population.

Even after collecting sample data there is always, in the real world (the data generating process), some form of randomness or uncertainty: different samples from the same data source (population) give different results. The challenge is to understand and handle this randomness and uncertainty in sampling so we can extract trends for inference, in our case how many customers will buy this book. In short, we do not know exactly what the data will do; there is a range of outcomes, each more or less likely. This range of outcomes, the possible data values and how often they occur, can be represented as a data distribution, e.g. with histograms, dot plots or box plots. Sketching these plots for the sample data helps us understand the distribution, which in turn helps us choose the right statistical tool for inference. If you conduct an analysis that assumes the data follow a normal distribution when, in fact, the data are non-normal, your results will be inaccurate. To avoid such errors you must determine the distribution of your data. There are many different data distributions; some common ones are:

  • Normal Distribution
  • Exponential Distribution
  • Gamma Distribution
  • Binomial Distribution
  • Weibull Distribution

Inferential Figure-7

Understanding the data distribution helps us find the correct statistical tools for inferring population parameters, but it does not tell us what the data can do. It tells us what the data has done and what distribution it has. To know what the data can do we need to understand the fixed nature of the underlying variables, by which I mean dealing with the uncertainty within the data. We achieve this by representing the data as a probability distribution: a function that describes the likelihood of all the possible values the random variable can take on. A random variable is the outcome of an experiment (i.e. a random process) expressed as a number, and we write it with a capital letter, e.g. X or Y.

Long story short: to do inferential statistics you need to understand, or fit, the distribution of your sample data, and based on that you can overlay a probability distribution. Histograms and box plots can be quite useful in suggesting the shape of a probability distribution, but there are more sophisticated ways to identify it; some data are more random in nature, and then you need to formally fit a distribution. I am not covering distribution fitting here but will do so in the next blog; the topic deserves a separate article. By fitting the data distribution you come up with a model that can extrapolate your findings to the population. If your model is not correct, the consequences can be serious, such as failing to complete tasks or projects on time (losing substantial time and money), or wrong engineering designs that damage expensive equipment. In some areas, such as hydrology, using appropriate distributions can be even more critical.

Remember that statistical inference is based on the laws of probability. Probability deals with predicting the likelihood of future events by understanding the randomness of data, while statistics involves analysing the frequency of past events using data samples.

There are two main classes of probability distributions:

  • Discrete Probability Distribution
  • Continuous Probability Distribution

Discrete Probability Distribution

A discrete probability distribution applies to scenarios where the set of possible outcomes comes from a counting process, such as the number of phone calls received in an hour, a coin toss, or a roll of a die.

A probability distribution for a discrete random variable is a mutually exclusive list of all possible numerical outcomes of the random variable, together with the probability of occurrence associated with each outcome.

Remember that the probability distribution of a discrete variable is called a probability mass function. The binomial and Poisson distributions are used for discrete random variables.

Binomial Distribution

The binomial distribution is a discrete probability distribution. It counts the number of successes in yes/no-type experiments. It has two parameters: the number of times an experiment is run (n) and the probability of a success (p). Examples are:

  • Tossing a coin 10 times and counting the number of heads (n=10, p=1/2)
  • Rolling a die 10 times and counting the number of sixes (n=10, p=1/6)
  • Suppose 5% of a certain population of people have green eyes. 500 people are picked randomly. The number of green-eyed people will follow a binomial distribution (n=500, p=0.05).

If the random variable is discrete, the corresponding probability distribution function is also discrete. In probability we look at the chance of our expected result: if a coin is tossed two times, the number of heads obtained can be 0, 1 or 2, and the probability of each of these possibilities can be written as

Probability Distribution = {p(x1), p(x2),….., p (xn)}

The sample space has size 2² = 4

You may wonder why 2². For that you need to recall your high-school permutations and combinations. I can give you a little brush-up here, but I can't cover the whole topic, as it would require a lot of ground.

So why 2²? By a basic counting rule, the number of arrangements of n distinct objects with repetition allowed, taken r at a time, is n^r. Here n is the number of objects (2: H and T) and r is the number of tosses or trials, so 2² = 4.
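If you want to see this in R, a quick sketch with expand.grid() enumerates all 2² arrangements of H and T over two tosses:

# enumerate all arrangements of H/T over two tosses: 2^2 = 4 outcomes
outcomes <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"))
outcomes        # the four arrangements
nrow(outcomes)  # 4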

{0 heads}= {TT}
{1 head} = {TH, HT}
{2 head} = {HH}

Outcome | Number of Heads
TT      | 0
TH      | 1
HT      | 1
HH      | 2

What we do here is find the relative frequency of each outcome. A relative frequency distribution is obtained by dividing the frequency in each class by the total number of values; from this a percentage distribution can be obtained by multiplying the relative frequencies by 100. What you are seeing here is that we are representing our data in the form of a probability distribution.

Number of Heads | Relative Frequency | Percentage
0               | 1/4                | 25%
1               | 2/4                | 50%
2               | 1/4                | 25%
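The same relative frequencies can be computed directly in R:

heads <- c(0, 1, 1, 2)         # number of heads for TT, TH, HT, HH
table(heads) / length(heads)   # relative frequencies: 0.25, 0.50, 0.25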

So here we are assigning our desired outcome, the number of heads, to probabilities. We are describing the randomness of the variable in terms of what the data could do instead of what the data did: a probability pattern that captures the fixed nature of the underlying variable. We are defining not just the outcomes of the data but their behaviour as well. The outcome variable, the number of heads, is a random variable; in short, we assign each possible outcome a probability. A very compact definition of a probability distribution is

“A function f(x) that assigns a probability to each outcome for a discrete random variable”

Inferential Figure-2

The toss example above has a very small sample space of only 4 outcomes. For larger numbers of trials we use the following binomial distribution formula.

Inferential Figure-3

The binomial formula is P(X) = n! / (X!(n−X)!) · p^X (1−p)^(n−X), where X is the number of successes we are interested in, n is the number of trials, and p is the probability of success on a single trial. In many situations p is 0.5; for example, the chance of a coin coming up heads is 50:50, i.e. p = 0.5.

Use of the binomial distribution requires three assumptions:

  1. Each replication of the process results in one of two possible outcomes (success or failure),
  2. The probability of success is the same for each replication, and
  3. The replications are independent, meaning here that a success in one outcome does not influence the probability of success in another

Let's take a very simple example: exactly 2 heads when tossing 2 coins simultaneously.

The sample space has size 2² = 4

{0 heads}= {TT}
{1 head} = {TH, HT}
{2 head} = {HH}

X=2
n=2
p=0.5

Inferential Figure-4

Now, exactly 2 heads when tossing 3 coins simultaneously.

The sample space has size 2³ = 8

{0 heads}= {TTT}
{1 head}  = {THT, TTH, HTT}
{2 heads} = {HTH, HHT, THH}
{3 heads} = {HHH}

Outcome | Number of Heads
TTT     | 0
THT     | 1
TTH     | 1
HTT     | 1
HTH     | 2
HHT     | 2
THH     | 2
HHH     | 3

Number of Heads | Relative Frequency | Percentage
0               | 1/8                | 12.5%
1               | 3/8                | 37.5%
2               | 3/8                | 37.5%
3               | 1/8                | 12.5%

Inferential Figure-5

P(2) = 3! / (2! (3 − 2)!) × (0.5)² × (1 − 0.5)^(3−2)
     = 3 × 0.25 × 0.5
     = 0.375 = 37.5%

So the probability of getting exactly two heads is 37.5%, whether you use the traditional table above or the statistically robust binomial distribution formula.
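As a quick check, R's built-in dbinom() gives the same probabilities for both worked examples:

# dbinom(x, size, prob) returns the binomial probability of exactly x successes
dbinom(2, size = 2, prob = 0.5)   # 2 heads in 2 tosses: 0.25
dbinom(2, size = 3, prob = 0.5)   # 2 heads in 3 tosses: 0.375, i.e. 37.5%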

Using that formula, you do not need to:

  • Enumerate samples and desired outcomes: the combination term n!/(x!(n−x)!) already counts the arrangements, and for 2 heads in 3 tosses it equals 3.
  • Build long tables of individual probabilities, which in practical problems have many outcomes and are cumbersome to calculate.
  • Calculate the probability of each outcome separately; you only need the probabilities of success and failure, p and (1−p).

As you are all aware, we use the sample mean and variance to describe the centre and dispersion of a sample. In the same way we can use the mean and variance of a random variable to describe the centre and dispersion of a probability distribution. The mean µ of a probability distribution is the expected value of its random variable.

To calculate the expected value of a discrete random variable we multiply each outcome X by its corresponding probability P(X) and then sum these products.

Expected Value-1

E[X]= 0 * 1/8 + 1 * 3/8 + 2 * 3/8 + 3 * 1/8

E[X]= 1.5

The interpretation of this number is that if you toss 3 coins simultaneously many times, the average number of heads will be 1.5; you can never observe 1.5 heads in a single toss, it is a long-run average.
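The same calculation in R, using dbinom() for the probabilities:

x <- 0:3                              # possible numbers of heads in 3 tosses
p <- dbinom(x, size = 3, prob = 0.5)  # probabilities 1/8, 3/8, 3/8, 1/8
sum(x * p)                            # expected value E[X] = 1.5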

The expected value is numerically the same as the average value, but it is a prediction for a specific future occurrence rather than a generalization across multiple occurrences.

The distinction is subtle but important:

  1. The average value is a statistical generalization of multiple occurrences of an event (such as the mean time you waited at the checkout the last 10 times you went shopping, or indeed the mean time you will wait at the checkout the next 10 times you go shopping).
  2. The expected value refers to a single event that will happen in the future (such as the amount of time you expect to wait at the checkout the next time you go shopping – there is a 50% chance it will be longer or shorter than this). The expected value is numerically the same as the average value, but it is a prediction for a specific future occurrence rather than a generalization across multiple occurrences

The average may behave differently with regard to estimating the "actual average" of the underlying distribution; for instance, you can study how the sample average behaves in the limit as the sample size goes to infinity. The expected value, by contrast, is tied to a distribution with a given parameter, a distribution that can go on to generate samples with different sample averages.

Logistic regression analysis is often used to investigate the relationship between a binary response variable and a set of explanatory (independent) variables. A binary response consists, for example, of success and failure. In disease studies the outcome is coded Y = 1 if the disease is present and Y = 0 otherwise.

We use the binomial distribution in a logistic regression model, e.g.

# logistic regression: binary response Y, all other columns as predictors
model <- glm(Y ~ ., data = TrainData, family = binomial)
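For a fully runnable sketch, here is the same call on simulated data; TrainData and Y above are just placeholder names for your own training set and binary outcome:

set.seed(1)
# simulate two predictors and a 0/1 outcome whose log-odds depend on them
TrainData <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
TrainData$Y <- rbinom(200, size = 1, prob = plogis(0.5 * TrainData$x1 - TrainData$x2))
model <- glm(Y ~ ., data = TrainData, family = binomial)
summary(model)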

Continuous Probability Distribution

A continuous probability distribution applies to scenarios where the possible outcomes take values in a continuous range, e.g. the real numbers. Such variables, like the temperature on a given day, are typically described by probability density functions.

In the discrete case above, tossing a coin 10 times, the binomial distribution easily gives the probability of any particular number of heads in the 10 trials. With a continuous random variable, however, you cannot find the probability of one specific outcome: the set of possible values is uncountable, so the probability of any single exact value is zero.

Another example: suppose you want to check a claim about bottled mineral water (in litres) from a shipment lot. The probability that a randomly sampled bottle contains exactly 1 litre is 0 percent; it might hold 0.99 litres or 1.1 litres, and another bottle from the same lot might hold 1.2 litres or 1.1 litres. So asking for exactly 1 litre from a randomly chosen bottle gives probability 0; asking for the probability that a randomly selected bottle holds between 0.99 and 1.01 litres makes more sense.

That is, if we let X denote the volume in litres of a randomly selected water bottle, then asking for P(0.99 < X < 1.01) makes sense. We need the probability that the continuous random variable X falls in some interval (a, b), and for that we use a probability density function ("p.d.f.").

The PDF itself is not a probability; rather, the probability that X falls in a small interval around a value is approximately the PDF at that value times the width of the interval, and the probability of any interval is the area under the PDF over that interval.
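As a minimal sketch, assuming (purely for illustration) that fill volumes are roughly normal with mean 1 litre and standard deviation 0.005 litres (made-up numbers), the interval probability is one call to pnorm():

# P(0.99 < X < 1.01) under an assumed Normal(mean = 1, sd = 0.005) fill volume
pnorm(1.01, mean = 1, sd = 0.005) - pnorm(0.99, mean = 1, sd = 0.005)  # ~0.954
dnorm(1, mean = 1, sd = 0.005)  # density at exactly 1 litre (~80): a density, not a probability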

There are many types of continuous probability distributions; the one I will discuss here is the normal distribution.

Normal Distribution

The normal (or Gaussian) distribution is a very commonly encountered continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Many quantities roughly follow this pattern:

  • Heights of people
  • Blood pressure
  • IQ scores
  • Salaries
  • Growth rates

Firstly, nothing in real life exactly matches the normal distribution. The reason so many things come close is partly the Central Limit Theorem, which says that if you average enough unrelated things you get (approximately) the normal distribution. I will cover the Central Limit Theorem in detail in the next blog; it deserves a dedicated article, a nice gift from the mathematicians.
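A quick illustration of that averaging idea in R: averages of uniform random numbers, which individually look nothing like a bell curve, pile up into one:

set.seed(123)
# average 30 unrelated uniform draws, repeated 10,000 times
sample_means <- replicate(10000, mean(runif(30)))
hist(sample_means, breaks = 50,
     main = "Averages of 30 uniform draws", xlab = "sample mean")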

Secondly, the problem with your sample data is that you do not know its true distribution, and you cannot judge it reliably from the sample alone; many candidate distributions could mimic your sample and not all of them can be verified. This is why, leaning on the Central Limit Theorem, you can often get away with treating sample-based quantities such as the sample mean as approximately normally distributed.

As George Box famously noted:  “…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” (JASA, 1976, Vol. 71, 791-799)  Therefore, the normality assumption will never be exactly true when one is working with real data

So I am wrapping up here; that is enough theory for now. I will cover the normal distribution in more depth, and how to make inferences based on it, with some real-life examples using R.

 


Law Of Large Numbers


In the previous blog we covered sample statistics, i.e. the mean and the standard deviation; knowing these two properties gives us some univariate predictive power. I would recommend reading that article first: Standard Deviation In Light of Inferential Statistics.

When you flip a fair coin, the probability of heads (and of tails) is P = 0.5, or 50%. But that does not guarantee that flipping the same coin 10 times will give exactly 5 heads and 5 tails; it could be 3 heads and 7 tails, or 4 and 6, or 5 and 5. We cannot expect exactly a 50% proportion of heads in 10 flips, because the outcome of one flip does not depend on another. However, if we keep flipping the coin indefinitely, the accumulated proportion of heads gets closer and closer to 0.5, or 50%.

The Law of Large Numbers states that the sample mean approaches the population mean as we increase the sample size.

OR

From Wolfram Alpha:

A “law of large numbers” is one of several theorems expressing the idea that as the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero.

Let’s visualize this phenomena in R

In probability you consider success as a probability e.g. here our success is to get heads which is equal to p for each trial, regardless of the outcomes of the other trials

Try with 10 times tossing coins

set.seed(12345)
n <- 10                              # number of coin tosses
# sample 0 (tail) or 1 (head) n times with replacement
x <- sample(0:1, n, replace = TRUE)
# cumulative number of heads after each toss
cs <- cumsum(x)
# running proportion of heads after each toss
pr <- cs / (1:n)
# plot the running proportion, y axis between 0 and 1
plot(pr, ylim = c(0, 1), type = "l")
# reference line at 0.50 or 50%
lines(c(0, n), c(0.50, 0.50), col = "red")
# display each result
round(cbind(x, cs, pr), 5)[1:n, ]

Law of Large No 1

Law of Large No 2

Law of Large No 3

If you look at the graph, the proportion of heads is far from a stable 50/50. Up to the second toss there are no heads, so the proportion is 0; at the third toss one head appears, so the proportion is 1/3 = 0.33; and from the 4th to the 9th toss there are no further heads, which shrinks the proportion to about 11%.

If the law of large numbers holds, the observed proportion should approach the expected probability, so this time I am going to flip the coin 10,000 times.

set.seed(12345)
n <- 10000                           # number of coin tosses
# sample 0 (tail) or 1 (head) n times with replacement
x <- sample(0:1, n, replace = TRUE)
# cumulative number of heads after each toss
cs <- cumsum(x)
# running proportion of heads after each toss
pr <- cs / (1:n)
# plot the running proportion, y axis between 0 and 1
plot(pr, ylim = c(0, 1), type = "l")
# reference line at 0.50 or 50%
lines(c(0, n), c(0.50, 0.50), col = "red")
# display the final result
round(cbind(x, cs, pr), 5)[n, ]

Law of Large No 4

The graph above shows that as the number of coin flips increases, the proportion of heads smooths out towards 50%.

Let’s try all different trial for 10, 100, 1000 and 10000 turn by turn and observe the result. Here I am showing code for only 10000 number of samples but the result of each outcome is in following table. You can use 10, 100, 1000 to see how the result probability converging to 50% gradually

set.seed(12345)
n <- 10000                           # number of coin tosses (try 10, 100, 1000, 10000)
# sample 0 (tail) or 1 (head) n times with replacement
x <- sample(0:1, n, replace = TRUE)
# cumulative number of heads after each toss
cs <- cumsum(x)
# running proportion of heads after each toss
pr <- cs / (1:n)
# plot the running proportion, y axis between 0 and 1
plot(pr, ylim = c(0, 1), type = "l")
# reference line at 0.50 or 50%
lines(c(0, n), c(0.50, 0.50), col = "red")
# display the final result
round(cbind(x, cs, pr), 5)[n, ]

I have run the code for 10, 100, 1000 and 10000.

Law of Large No 5

You can see in the table above that the more tosses in a trial, the closer the observed proportion gets to the expected probability. The 10-toss trial deviates from the expected probability more than the 100-toss trial, the 100-toss trial more than the 1000-toss trial, and the 1000-toss trial more than the 10000-toss one. The 10000-toss trial converges most closely to the expected 50/50.

What you have observed from the table is that the more trials are performed, the nearer the results get to the statistical truth. Results are random in the short run but approach the expected value in the long run.

Remember that this phenomenon is not limited to probabilities; it also applies to the mean and to other sample statistics. This is how an insurance company knows how much to charge for car insurance. First they find how many accidents occur on average during a specific period; in layman's terms, they discover that five out of 150 people will suffer a serious and expensive injury during a given year. If the company is only able to insure 10 or 25 people, it faces far greater risk than if it can insure all 150 people. But how do they know that 5 out of 150 people will suffer a serious and expensive injury? By using the law of large numbers: they take larger and larger samples and watch the outcome smooth out, visually and numerically.

If you have ever traded online you may recognise the law of large numbers in profit and loss. Please do not take my statement for granted for real trading; you could explore this phenomenon in a demo account. The idea is that a single trade should be insignificant, but by placing many trades you can expect productive trades in the long run.

With the above understanding, the next blogs will be much easier to follow:

Normal Distribution

Central Limit Theorem

 

Standard Deviation in Light of Inferential Statistics

As all of us are aware, the standard deviation is nothing but a tool for finding how spread out our data set is around its mean. Is that simple? It was not for me when I first read that statement, so let's build some intuition about the standard deviation using a simple example.

Example 1

I like cricket, which is why I am presenting this example; it works just as well with any example from your own favourite game. Suppose you have two cricket batsmen and their batting scores are as in the table below. Be mindful that these scores belong to one tournament, e.g. World Cup semi-finals from 1999-2015.

Standard Deviation Figure 1

All the above scores are arbitrary

I like both batsmen and they are both great across their playoff seasons; they belong to the same team. If you look at the mean, you will find both have the same average score of 85, so this statistic alone does not let us compare their overall capability. Going by the mean we would say both are at the same level in every situation; the standard deviation tells us when, and in what situation, each of them is the better choice.

Standard Deviation Figure 2

Based on the figure above we can say that Misbah is the more consistent player. This spread (variability/variance) of the data tells us about its consistency or inconsistency and is measured by the standard deviation: it tells you how far each batsman's scores lie from their mean of 85.

We use the following formula to find the standard deviation of data

Standard Deviation Figure 3

Standard Deviation 3a

But for a sample's standard deviation we use the following formula. Why? See the next post in this series.

Standard Deviation Figure 3b

Here x in our example is each batsman's score.

Standard Deviation Figure 4

Afridi: SD = √(250/4) = 7.90

Standard Deviation Figure 5

Misbah: SD = √(90/4) = 4.74

As you can see, Misbah is more consistent around his average score than Afridi: his standard deviation is 4.74, while Afridi's is 7.90.
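You can reproduce this in R; the scores below are stand-ins (the real ones are in the table figure above), chosen only to be consistent with a mean of 85 and the standard deviations computed here:

# illustrative scores only, matching a mean of 85 and the SDs above
afridi <- c(75, 80, 85, 90, 95)
misbah <- c(79, 82, 85, 88, 91)
mean(afridi); mean(misbah)   # both 85
sd(afridi)                   # 7.91 (sd() uses the n-1 sample formula)
sd(misbah)                   # 4.74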

Let’s suppose in one of playoff season (World Cup Semi Final) their team requires 79 runs and team Captain has option whether to send Afridi or Misbah for the batting

If he sends Afridi, the one-standard-deviation range around his average score is 77.1 to 92.9 (85 ± 7.90); if he sends Misbah, the range is 80.26 to 89.74 (85 ± 4.74).

In Afridi's case the lower end of the range, 77.1, makes chasing the required 79 runs a bit risky; in Misbah's case both the lower end, 80.26, and the upper end, 89.74, cover the required 79 runs. Based on these statistics and this situation, Misbah is the better candidate. In short, the standard deviation gives us univariate predictive power for comparing data; we can say it helps us generate a hypothesis from the given data.

So far we have discussed the standard deviation in the context of descriptive statistics; a descriptive statistic is nothing but an absolute numerical measure describing features of a given data set.

The standard deviation is also used for inferential questions. An inferential question is nothing but a proposed hypothesis that can be answered by analysing different sets of data in the form of samples. First I will explain what inferential analysis is, and then how the SD is used in doing inference.

Here I have described the predictive power of the standard deviation for a specific situation. I am not saying that the standard deviation tells you which batsman has more variation in general, only which one has more variation for this specific setting. As a simple example, if you have two different data sets with different standard deviations, you cannot conclude that the one with the larger standard deviation is more spread out in any meaningful sense; for instance, if you compare the standard deviations of two different grades for a class of students, you cannot say one has more spread than the other based on magnitude alone.

Standard Deviation in Light of Inferential Statistics

Inference is one of the most important types of data analysis: it enables you to make statements about data you have not observed. In short, it describes the data generation process of a population using a sample from it. How the population behaves is the cornerstone of inferential statistics, e.g. the average salary in a specific region, or the variation (standard deviation) in cholesterol levels of men and women in a specific city. Statistics such as an average salary or a cholesterol variance do not make sense by themselves; you have to understand where they come from, and for that you would need the whole population's data, which is impossible to collect. If you cannot collect the whole population's data you cannot describe the population directly; it is a dog-chasing-its-tail situation. Thankfully, the sampling process lets us make statements about the population anyway. We do inferential analysis using samples, and the operations we perform on the sample data are called sample statistics, i.e. the average, the standard deviation, etc. The outcome of these operations is not treated as an absolute property of the population as it is in descriptive analysis; in inferential analysis it is called a point estimate of the unknown population parameter of interest (point estimate and sample statistic mean the same thing here).

These operations are performed again and again on different samples of data, and each sample gives a different value of the sample statistic. This variation from one sample to another is called sampling variability or sampling variation, and it is what the standard error quantifies. Quantifying how sample statistics vary gives us a way to estimate the margin of error associated with our point estimate.

Remember that there is a difference between the standard deviation and the standard error. The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the POPULATION mean: it measures the variability of the means of samples of the same size taken from the same population, i.e. it quantifies the variability of the estimate. The standard deviation of a sample, on the other hand, is the degree to which individuals within the sample differ from the sample mean. The standard deviation of the sampling distribution is the same thing as the standard error.

Standard Deviation Figure 6

As the figure above shows, different samples drawn randomly from the same population will in general have different sample means, so there is a distribution of sample means (with its own mean and standard deviation) called the sampling distribution, and the standard deviation of these sample means is the standard error. See the figure below for the sampling distributions of different sample statistics.

Standard Deviation Figure 7
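A small simulation sketch in R (with a made-up normal population) shows the distinction between the two quantities:

set.seed(42)
# a made-up population and 5,000 repeated samples of size 30 from it
population <- rnorm(100000, mean = 50, sd = 10)
sample_means <- replicate(5000, mean(sample(population, 30)))
sd(population)     # standard deviation of individuals, ~10
sd(sample_means)   # standard error of the mean, ~10 / sqrt(30) ≈ 1.83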

Now you have understood so far

What is Standard Deviation?

What is Sampling Distribution?

What is Standard Error or Sampling Variability?

In the next blogs we will discuss the remaining pillars of inferential statistics:

Law Of Large Numbers

Data Distribution

Central Limit Theorem

Normal Distribution

Confidence Intervals

Hypothesis testing

Significance & Confidence


 

How Multicollinearity Affects Your Model?

In inferential analysis you are answering questions about the likelihood of certain events over a certain time, e.g. predicting what types of people will eat a good-quality diet during the next year. After answering this question with your model, you communicate your findings to your stakeholders, i.e. what factors make these people eat a fresh, high-quality diet; those factors help your stakeholders take measurable decisions. If you cannot identify the factors precisely with your model, there will be no action in the business and you may be asked "SO WHAT?" What business problem is it going to solve?

Regression models fail to deliver good intuition about a data set when there is high correlation between the independent variables. High correlation between the dependent and independent variables is good, but not between the independent variables themselves; moderate correlation among independent variables is fine.

One of the assumptions of the regression model is that none of the independent variables is constant and that there are no exact linear relationships among the independent variables. Highly correlated independent variables are close to being linearly related to each other, and the presence of such variables in a model is called multicollinearity. It is one of the problems in regression modelling that can make your model less interpretable and can inflate the standard errors of your predictors' coefficient estimates. There is always some correlation between independent variables; that is the reason multiple regression came into existence, and this correlation helps infer causality in cases where simple regression (with one independent variable) cannot. But the same correlation becomes harmful when it is high.

Here I aim only to describe how misinterpretation and inflated standard errors of the coefficient estimates arise from multicollinearity. We now know that multicollinearity arises when two predictor variables in a model are highly correlated, for instance when high income influences certain people to buy a fresh, high-quality diet and, at the same time, greater age does too. If your model cannot detect which of the two is really driving people to buy the fresh diet, your business stakeholders cannot pinpoint where to act to increase sales. In short, with high multicollinearity it is very difficult to draw ceteris paribus conclusions about how income or age affects buying a fresh diet.

The standard error of a coefficient tells us how much sampling variation there would be if we were to re-sample and re-estimate the coefficients of the model. So why do highly correlated variables (multicollinearity) inflate the standard errors of their coefficient estimates? The more highly correlated the independent variables are, the more difficult it is to determine how much of the variation in the outcome is due to each of them individually. In our example, if high income and age are highly correlated (which means they carry very similar information), it is difficult to determine whether income or age is responsible for the variation in buying good-quality food, and as a result the standard errors for both variables become very large.

I will take an example data set from the wine industry, where many different variables could be used to predict wine price, e.g. average growing season temperature (AGST), harvest rain, winter rain, the age of the wine, the population of France where the wine is produced, etc.

I will use R here to illustrate multicollinearity and its effect on the standard errors of the estimated coefficients.

First Picture

Here I am going to check which predictor variables are highly correlated (remember, correlated predictor variables are just the fancy name for multicollinearity) using data visualization, a part of the exploratory data analysis toolkit. Visualization is the most important tool in exploratory data analysis because the information conveyed by graphs makes patterns very intuitive to recognise. Highly correlated variables are those whose correlation is close to 1 in absolute value.

To start, I will check the correlation between AGST and HarvestRain.

Second Picture

Third Picture

It is hard to see any linear relationship between AGST and HarvestRain visually, and it turns out that the correlation between these two variables is close to 0, hence essentially no correlation.

Fourth Picture

Rather than checking the correlation between predictor variables one pair at a time, we can compute the correlations between all the variables in a data set in one go.
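In R this is a single call to cor(); here I am assuming the data have already been read into a data frame called wine with the column names shown in the output pictures:

# full correlation matrix of the (assumed) wine data frame
round(cor(wine[, c("Price", "WinterRain", "AGST", "HarvestRain", "Age", "FrancePop")]), 2)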

Fifth Picture

Here you can see from the correlation matrix above that Age and FrancePop are highly negatively correlated, with a correlation close to -1. We can confirm this by plotting them against each other.
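A quick scatter plot (same assumed wine data frame) makes the relationship obvious:

# scatter plot of the two suspect predictors
plot(wine$FrancePop, wine$Age, xlab = "FrancePop", ylab = "Age")
cor(wine$Age, wine$FrancePop)   # close to -1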

Sixth Picture

Seventh Picture

Hence the variables FrancePop and Age are strongly correlated and will cause a multicollinearity problem: if both are part of the model, it will be misleading to interpret which one is actually predicting wine price.

In this situation we need to drop one of the redundant variables to make our model precisely interpretable. Before removing it, though, we should see how multicollinearity inflates the standard errors of the coefficient estimates of these two variables.

Let's create a model that includes all the independent variables.
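A sketch of the corresponding R call (column names assumed as above), whose summary output is shown in the picture below:

# model with all candidate predictors
model1 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)
summary(model1)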

Eighth Picture

As you can see from the highlighted output, the standard errors of Age and FrancePop are noisy: 0.07900 for Age and 0.0001667 for FrancePop. These two variables are doing the same job of predicting wine price, which is redundant and adds extra noise; that noise makes the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret, and coefficients may have the "wrong" sign or implausible magnitudes, e.g. -4.953e-05.

The raw magnitude of the standard error can easily trick you: FrancePop's standard error is smaller than Age's, so FrancePop looks more reliable, but comparing magnitudes is not the way to find out which variable is less unstable. The tool for identifying the more problematic of two correlated variables is the VIF (Variance Inflation Factor).

The variable with the highest VIF is the best candidate for removal from the model. A common threshold for the VIF is 5, though it depends on your model's usage and application (see the linked VIF Threshold discussion for details); in common practice, if correlated variables have a VIF greater than 5, the one with the larger VIF is the one to remove.
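In R the VIFs can be computed with the vif() function from the car package (assuming the model1 fit from the sketch above):

# variance inflation factors for the full model
library(car)
vif(model1)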

Ninth Picture

As you can see, Age and FrancePop both have VIFs greater than 5, but Age's VIF is lower than FrancePop's. We also expect Age to be significant, since older wines are typically more expensive, so Age makes more intuitive sense to keep in our model.
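So we refit without FrancePop; a sketch of the call whose output appears in the next picture:

# refit dropping FrancePop, then compare the standard errors with model1
model2 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model2)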

Tenth Picture

Now compare the standard error of Age with its value in the previous model, when FrancePop was included:

0.07900 > 0.00809

The standard error of Age has decreased, and not only Age's: the other independent variables' standard errors decreased as well. AGST's standard error is now 0.0987011, while with the two correlated variables (Age + FrancePop) in model1 it was 1.030e-01 = 0.1030; likewise HarvestRain's decreased from 8.751e-04 = 0.0008751 to 0.0008538. Notice the other effects too: Age is more significant in model2 than in model1 (with FrancePop), and the overall p-value of the model is now 0.0000002036, compared with 0.000001044 in model1.

Let's see what happens if we instead drop Age and keep FrancePop.

Eleventh Picture

As you can see, dropping Age and keeping only FrancePop makes the WinterRain variable less significant.

In all the workarounds above, notice that the R-squared value stays at almost 82%: with Age and FrancePop, with only Age, or with only FrancePop, R-squared remains about 82%. R-squared tells us how well the model fits the data, but it does not tell us which model gives accurate predictions; to measure the accuracy of each model we would need to run cross-validation.

Caveats

  • Dropping an important variable can bias your regression model, i.e. you may trade a multicollinearity problem for an omitted-variable problem, where the dropped variable's effect ends up in the error term and the error term becomes correlated with the included predictors. See the figure below, where bias is the gap between the expected value of the estimated parameter E(B^) and the population parameter (true B); if there is no gap, you have an unbiased estimator in your model.

12th Picture

          If E(B^) is not equal to (true B), the estimator is biased.

13th Picture

  • An inflated standard error for an estimated coefficient widens its confidence interval; it does not make the coefficient estimate biased.
  • Do not blindly put too many variables into your model; use EDA to check for variables that measure the same thing.

Conclusion

In short, multicollinearity hurts the interpretability of the overall model by inflating the standard error of each affected coefficient estimate, which often results in non-significant p-values, and one of the correlated coefficients may point in a nonsensical direction.

Multicollinearity can occur if you accidentally include the same variable twice, e.g. height in inches and height in feet. Another common error occurs when one variable is computed from another (e.g. family income = wife's income + husband's income). EDA is a crucial step before creating any model: it can reveal problems with your data set and tell you whether the business question you are trying to solve can actually be answered by the data you have.