Data generation is growing exponentially, and we don't have the time, money, or ability to examine all of this data to understand its generation process. The main reason for understanding a data generation process is to make an educated guess before making a decision. For example, before writing a book you should know how many potential buyers it might have. Without knowing the number of buyers, you may under- or over-supply the market, which is a production risk. To estimate this figure you need data on potential buyers' preferences: you might survey some customers about their likelihood of buying the book, or this information may already exist in a database. You can't survey every customer, as that would require a lot of time, resources, and money, so you collect a sample and, based on that sample data, infer the average number of customers who will buy the book. This survey-collection process is your sampling, which should be free of bias and representative of the population.

After collecting sample data, there is always some randomness or uncertainty in the real-world data generating process: different samples from the same data source (population) give different results. The challenge is to understand and handle this randomness and uncertainty in sampling so we can extract specific trends for inference; in our case, how many customers will buy this book. In short, we don't know exactly what the data will do; there is a range of outcomes, each more or less likely. This range of outcomes, the possible data values and how often they occur, can be represented as a data distribution, e.g. with histograms, dot plots, or box plots. By sketching these plots for sample data we can understand the distribution of the data, which helps us choose the right statistical analysis tool for inference. If you conduct an analysis that assumes the data follow a normal distribution when, in fact, the data are non-normal, your results will be inaccurate. To avoid error you must determine the distribution of your data. There are many different data distributions; some common ones are:

- Normal Distribution
- Exponential Distribution
- Gamma Distribution
- Binomial Distribution
- Weibull Distribution
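As a rough sketch of what these distributions look like in practice, here is how you might draw samples from several of them using only Python's standard library; the parameter values below are arbitrary illustrations, not recommendations:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N = 10_000
# Draw N samples from each distribution (parameter values chosen arbitrarily)
samples = {
    "normal":      [random.gauss(mu=0, sigma=1) for _ in range(N)],
    "exponential": [random.expovariate(lambd=1.0) for _ in range(N)],
    "gamma":       [random.gammavariate(alpha=2.0, beta=2.0) for _ in range(N)],
    # binomial(n=10, p=0.5) simulated as the number of successes in 10 coin flips
    "binomial":    [sum(random.random() < 0.5 for _ in range(10)) for _ in range(N)],
    "weibull":     [random.weibullvariate(alpha=1.0, beta=1.5) for _ in range(N)],
}

for name, xs in samples.items():
    mean = sum(xs) / N
    print(f"{name:12s} sample mean = {mean:.2f}")
```

Plotting a histogram of each list (e.g. with a plotting library of your choice) is the quickest way to see how differently shaped these distributions are.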

Understanding the data distribution helps us find the correct statistical tools for inferring population parameters, but it does not tell us what the data can do. It tells us what the data has done and what distribution it has. To know what the data can do, we need to understand the fixed nature of the underlying variables; by "fixed nature" I mean dealing with the uncertainty within the data. This is achievable by representing the data in the form of a probability distribution, a function that describes the likelihood of all the possible values that the random variable can take on. A random variable is the outcome of an experiment (i.e. a random process) expressed as a number; we write it with capital letters, e.g. X or Y.

Long story short, to do inferential statistics you need to understand or fit the distribution of your sample data, and based on that you can overlay it with a probability distribution. Histograms and box plots can be quite useful in suggesting the **shape** of a probability distribution, but there are more sophisticated ways to understand the distribution of data; when data are more random in nature, one needs to fit a distribution to them. I am not covering distribution fitting here but will cover it in the next blog; the topic deserves a separate article. By fitting a data distribution you come up with a model that can extrapolate your findings to the population. If your model is not correct, the consequences can be serious: inability to complete tasks or projects on time, leading to substantial loss of time and money; wrong engineering designs resulting in damage to expensive equipment; and so on. In some specific areas, such as hydrology, using appropriate distributions can be even more critical.

Remember that statistical inference is based on the laws of probability. **Probability** deals with predicting the likelihood of future events by understanding the randomness of data, while statistics involves the analysis of the frequency of past events using data samples.

There are mainly two classes of probability distributions:

- Discrete Probability Distribution
- Continuous Probability Distribution

__Discrete Probability Distribution__

A discrete probability distribution applies to scenarios where the set of possible outcomes comes from a counting process, such as the number of phone calls received in an hour, a coin toss, or a roll of a die.

A probability distribution for a discrete random variable is a mutually exclusive listing of all possible numerical outcomes of the random variable, with the probability of occurrence associated with each outcome.

Remember, we refer to the probability distribution of a discrete variable as a probability mass function. The binomial and Poisson distributions are used for discrete random variables.

**Binomial Distribution**

The **binomial distribution** is a discrete probability distribution. It counts the number of successes in yes/no-type experiments. It has two parameters: the number of times an experiment is run (*n*) and the probability of a success (*p*). Examples:

- Tossing a coin 10 times and counting the number of heads (n=10, p=1/2)
- Rolling a die 10 times and counting the number of sixes (n=10, p=1/6)
- Suppose 5% of a certain population of people have green eyes. 500 people are picked randomly. The number of green-eyed people will follow a binomial distribution (n=500, p=0.05).

If the random variable is discrete, then the corresponding probability distribution function will also be discrete. In probability we look at the chance of our expected result: if a coin is tossed twice, the number of heads obtained can be 0, 1, or 2, and the probabilities of each of these possibilities can be shown as

Probability Distribution = {p(x1), p(x2), …, p(xn)}

The sample space has size 2² = 4

You may wonder why 2². For that you need to recall permutations and combinations from high school math. I can give you a little brush-up here, but I can't cover the whole topic, as there is a lot to it.

So why 2²? According to one of the permutation rules, the number of arrangements of N distinct objects, with repetition allowed, taken r at a time is N^r. Here N is the number of objects being arranged (2, for H and T) and r is the number of tosses or trials.

{0 heads}= {TT}

{1 head} = {TH, HT}

{2 heads} = {HH}
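The N^r counting rule and the outcome sets above can be checked directly by enumerating every toss sequence, for example:

```python
from itertools import product
from collections import Counter

# Enumerate every sequence of 2 coin tosses: N = 2 objects (H, T), r = 2 positions
outcomes = ["".join(seq) for seq in product("HT", repeat=2)]
print(outcomes)     # the full sample space, of size 2**2 = 4

# Group the outcomes by number of heads
heads = Counter(seq.count("H") for seq in outcomes)
print(dict(heads))  # how many sequences give 0, 1, and 2 heads
```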

| Outcome | Number of Heads |
| --- | --- |
| TT | 0 |
| TH | 1 |
| HT | 1 |
| HH | 2 |

What we do here is find the relative frequency of each outcome. A relative frequency distribution is obtained by dividing the frequency in each class by the total number of values; a percentage distribution can then be obtained by multiplying each relative frequency by 100. What you are seeing is that we are representing our data in the form of a probability distribution.

| Number of Heads | Relative Frequency | Percentage |
| --- | --- | --- |
| 0 | 1/4 | 25% |
| 1 | 2/4 | 50% |
| 2 | 1/4 | 25% |
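As a small sketch, the relative frequency and percentage columns above can be computed from the raw counts:

```python
# Counts of each number of heads when tossing 2 coins (from the outcome table above)
counts = {0: 1, 1: 2, 2: 1}
total = sum(counts.values())  # 4 equally likely outcomes

for n_heads, freq in counts.items():
    rel = freq / total        # relative frequency = class frequency / total
    print(f"{n_heads} heads: relative frequency {rel}, percentage {rel * 100:.0f}%")
```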

So here we are assigning probabilities to our desired outcome, heads: we are describing the randomness of the variable in terms of what the data could do instead of what the data did, a probability pattern that tells us the fixed nature of the underlying variable. We are not defining just the outcome of the data but its behavior as well. The outcome variable, number of heads, is a random variable; in short, we are expressing each outcome of the variable in terms of a probability distribution. A very compact definition of a probability distribution is:

“A function f(x) that assigns a probability to each outcome for a discrete random variable”

The toss example above has a very small sample space of only 4 outcomes. For larger numbers of trials we use the binomial distribution formula:

P(X = x) = [n! / (x!(n - x)!)] × p^x × (1 - p)^(n - x)

Here X is the expected outcome (the number of successes), n is the number of trials, and p is the probability of success on a single trial. In many situations p will be 0.5; for example, the chance of a coin coming up heads is 50:50, i.e. p = 0.5.

Use of the binomial distribution requires three assumptions:

- Each replication of the process results in one of two possible outcomes (success or failure),
- The probability of success is the same for each replication, and
- The replications are independent, meaning here that a success in one outcome does not influence the probability of success in another
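Under those assumptions, the formula can be sketched directly with Python's standard library; `math.comb(n, x)` computes the arrangement count n!/(x!(n-x)!):

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 2 heads in 2 tosses of a fair coin
print(binomial_pmf(2, n=2, p=0.5))   # 0.25
```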

Let's take a very simple example: exactly 2 heads when tossing 2 coins simultaneously.

The sample space has size 2² = 4

{0 heads}= {TT}

{1 head} = {TH, HT}

{2 heads} = {HH}

X=2

n=2

p=0.5

Now for exactly 2 heads when tossing 3 coins simultaneously:

The sample space has size 2^3 = 8

{0 heads}= {TTT}

{1 head} = {THT, TTH, HTT}

{2 heads} = {HTH, HHT, THH}

{3 heads} = {HHH}

| Outcome | Number of Heads |
| --- | --- |
| TTT | 0 |
| THT | 1 |
| TTH | 1 |
| HTT | 1 |
| HTH | 2 |
| HHT | 2 |
| THH | 2 |
| HHH | 3 |

| Number of Heads | Relative Frequency | Percentage |
| --- | --- | --- |
| 0 | 1/8 | 12.5% |
| 1 | 3/8 | 37.5% |
| 2 | 3/8 | 37.5% |
| 3 | 1/8 | 12.5% |

P(2) = [3! / (2!(3 - 2)!)] × (0.5)² × (1 - 0.5)^(3-2) = 3 × 0.25 × 0.5 = 0.375 = 37.5%
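That hand calculation can be double-checked in code with the standard library:

```python
from math import comb

n, x, p = 3, 2, 0.5                         # 3 tosses, exactly 2 heads, fair coin
prob = comb(n, x) * p**x * (1 - p)**(n - x)  # comb(3, 2) = 3 arrangements
print(f"P(2 heads in 3 tosses) = {prob} = {prob * 100}%")   # 0.375 = 37.5%
```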

As you can see, the probability of getting two heads is 37.5% both the traditional way, via the table, and the statistically robust way, via the binomial distribution formula.

Using the formula, you don't need to:

- Enumerate samples and desired outcomes; the permutation count is already built into the binomial formula as n!/x!(n-x)!, which gives the 3 arrangements for 2 heads here.
- Build long tables of individual probabilities, which becomes cumbersome when the number of possible outcomes in practice is large.
- Calculate the probability of each outcome; you only need the probabilities of success and failure, p and (1 - p).

As you are probably aware, we use the sample mean and variance to describe the center and dispersion of a sample. In the same way, we can use the mean and variance of a random variable to describe the center and dispersion of a probability distribution. The mean µ of a probability distribution is the expected value of its random variable.

To calculate the expected value of a discrete random variable, we multiply each outcome X by its corresponding probability P(X) and then sum these products:

E[X]= 0 * 1/8 + 1 * 3/8 + 2 * 3/8 + 3 * 1/8

E[X]= 1.5

The interpretation of this number is that if you toss 3 coins simultaneously many times, the average outcome will be 1.5 heads.
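A quick sketch confirming E[X] from the distribution above, and comparing it with a simulated long-run average:

```python
import random

pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}    # number of heads in 3 fair-coin tosses
expected = sum(x * p for x, p in pmf.items())
print(expected)                            # 1.5

# Long-run average of many simulated 3-coin tosses approaches E[X]
random.seed(0)
trials = 100_000
avg = sum(sum(random.random() < 0.5 for _ in range(3)) for _ in range(trials)) / trials
print(round(avg, 3))                       # close to 1.5
```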

The expected value is numerically the same as the average value, but the two are interpreted differently.

The distinction is subtle but important:

- The average value is a statistical generalization of multiple occurrences of an event (such as the mean time you waited at the checkout the last 10 times you went shopping, or indeed the mean time you will wait at the checkout the next 10 times you go shopping).
- The expected value refers to a single event that will happen in the future (such as the amount of time you expect to wait at the checkout the next time you go shopping – there is a 50% chance it will be longer or shorter than this). The expected value is numerically the same as the average value, but it is a prediction for a specific future occurrence rather than a generalization across multiple occurrences

This average may have different properties with regard to estimating the "actual average" of the underlying distribution; for instance, you may consider how the sample average behaves in the limit as the sample size goes to infinity. The expected value, by contrast, is functionally tied to a distribution with given parameters, a distribution that can go on to generate samples with different sample averages.
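The distinction can be illustrated by simulation: the expected value is fixed by the distribution, while sample averages fluctuate from sample to sample. A minimal sketch, again using a fair 3-coin toss:

```python
import random

random.seed(1)

def sample_average(n_tosses=3, sample_size=20):
    """Average number of heads across sample_size repetitions of n_tosses fair coins."""
    return sum(
        sum(random.random() < 0.5 for _ in range(n_tosses))
        for _ in range(sample_size)
    ) / sample_size

expected = 3 * 0.5          # fixed by the distribution: n * p = 1.5
averages = [sample_average() for _ in range(5)]
print("expected value:", expected)
print("five sample averages:", averages)   # each fluctuates around 1.5
```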

Logistic regression analysis is often used to investigate the relationship between a binary response variable and a set of explanatory, or independent, variables. A binary response consists of, for example, success and failure. In disease studies the outcome is denoted Y = 1 if disease is present and Y = 0 otherwise.

We use the binomial distribution in a logistic regression model, e.g. in R:

`Model <- glm(Y ~ ., data = TrainData, family = binomial)`

__Continuous Probability Distribution__

A continuous probability distribution applies to scenarios where the set of possible outcomes can take on values in a continuous range (e.g. the real numbers), such as the temperature on a given day; such variables are typically described by probability density functions.

With the discrete variable above, tossing a coin 10 times, you can easily use the binomial distribution to find the probability of getting heads n times in 10 trials. But with a continuous random variable you can't find the probability of one specific exact outcome; the possible outcomes are uncountable.

Another example: suppose you want to check the claim that a bottle of mineral water sampled from a shipment lot contains exactly 1 liter. The probability of exactly 1 liter is 0 percent; the bottle might hold 0.99 liters or 1.1 liters, and another bottle from the same lot might hold 1.2 liters or 1.1 liters. So the probability of getting exactly 1 liter of water from a randomly chosen bottle is 0 percent. In this case, asking for the probability that a randomly selected bottle holds between 0.99 and 1.01 liters makes more sense.

That is, if we let *X* denote the volume in liters of a randomly selected water bottle, asking for *P*(0.99 < *X* < 1.01) makes sense. We need to find the probability that the continuous random variable *X* falls in some interval (*a*, *b*). For that we use a probability density function ("p.d.f.").

The pdf is, loosely speaking, proportional to the probability that the random variable takes a value near a particular point; the probability for an interval is the area under the pdf over that interval.
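As a sketch, suppose (purely as an illustrative assumption) that the fill volume X is normal with mean 1 liter and standard deviation 0.01 liters. Then P(0.99 < X < 1.01) is the area under the pdf between those bounds, which we can get from the normal CDF via `math.erf`:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 1.0, 0.01   # assumed parameters, for illustration only
prob = normal_cdf(1.01, mu, sigma) - normal_cdf(0.99, mu, sigma)
print(round(prob, 4))   # ≈ 0.6827, the familiar ±1 sigma band

# The probability of any single exact value, e.g. exactly 1 liter, is zero:
print(normal_cdf(1.0, mu, sigma) - normal_cdf(1.0, mu, sigma))   # 0.0
```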

There are many types of continuous probability density functions; here I will discuss the normal distribution.

**Normal Distribution**

The **normal** or **Gaussian** **distribution** is a very commonly encountered continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Many quantities follow this type of pattern:

- Heights of people
- Blood pressure
- IQ scores
- Salaries
- Growth rates

Firstly, nothing in real life exactly matches the normal distribution. But many things come close, partly due to the Central Limit Theorem, which says that if you average enough unrelated things, you get the normal distribution. I will cover the Central Limit Theorem in detail in the next blog; this theorem deserves a dedicated article of its own, a nice gift from mathematicians.

Secondly, the problem with your sample data is that you do not know its distribution; you can't judge the population distribution from your sample alone. You could fit many different distributions to your sample, but not all of them can be verified. This is why, based on the Central Limit Theorem, you can often fairly approximate the sampling distribution of statistics like the mean as a normal distribution.
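A quick sketch of the Central Limit Theorem at work: averages of many independent uniform(0, 1) draws, which are themselves far from normal, pile up in a bell shape around the true mean of 0.5:

```python
import random

random.seed(2024)

# Each data point is the average of 50 uniform(0, 1) draws
n_means, n_draws = 5_000, 50
means = [sum(random.random() for _ in range(n_draws)) / n_draws for _ in range(n_means)]

grand_mean = sum(means) / n_means
spread = (sum((m - grand_mean) ** 2 for m in means) / n_means) ** 0.5
print(f"mean of sample means ≈ {grand_mean:.3f}")   # close to 0.5
print(f"std of sample means  ≈ {spread:.3f}")       # close to sqrt(1/12)/sqrt(50) ≈ 0.041
```

A histogram of `means` would look approximately normal even though each underlying draw is uniform.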

As George Box famously noted: "…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." (JASA, 1976, Vol. 71, 791-799). Therefore, the normality assumption will never be exactly true when working with real data.

So I am wrapping up here; that's enough theory for now. In the next post I will cover the normal distribution in more detail and show how to make inferences based on it, with some real-life examples using R.