Descriptive Statistics Using Python

Descriptive statistics summarize a dataset by computing a handful of statistical values for its numeric variables. Our first step with any new dataset is to compute these values and understand what they say. They also help us decide which statistical tests are appropriate for the data. Let's look at them in more detail.

Descriptive statistics cover the following measures:

1. Central tendency: mean, median, mode

2. Dispersion: variance, standard deviation, range, interquartile range (IQR)

3. Skewness: symmetry of the data about the mean

4. Kurtosis: peakedness of the data around the mean

Note: I have not given the mathematical formulas for these values.

pandas provides built-in functions that compute these values for any dataset. Let's go through each measure and its business use.

import pandas as pd

import numpy as np

np.random.seed(10)

data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

print(data)
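Before computing each measure separately, it can help to see several of them at once. The sketch below rebuilds the same DataFrame and calls `describe()`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column. The variable name `summary` is only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
data = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))

# One row per statistic, one column per variable
summary = data.describe()
print(summary)
```

`describe()` does not report mode, skewness, or kurtosis, so those are still computed individually in the sections that follow.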

1. Calculating Central Tendency

data['A'].mean()

data['A'].median()

data['A'].mode()

# mean: the average of the given numeric values

# median: the middle value of the sorted values

# mode: the most frequently occurring value; if every value is unique, as with random floats, pandas returns all of them
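The difference between the three measures is easier to see on a small hand-made sample than on random floats. The sketch below uses a hypothetical list of customer ages, invented for illustration, with one extreme value.

```python
import pandas as pd

# Hypothetical customer ages (illustrative data, not from the article)
ages = pd.Series([22, 25, 25, 27, 30, 31, 95])

mean_age = ages.mean()      # pulled upward by the single extreme value 95
median_age = ages.median()  # middle value of the sorted data
mode_age = ages.mode()      # Series holding the most frequent value(s)

print(mean_age, median_age, mode_age.tolist())
```

Here the mean is about 36.4 while the median is 27 and the mode is 25, so a single outlier moves the mean much more than the median.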

Why do we need to calculate the mean, median, and mode?

These values help us identify our potential customers or target audience. If a new customer or client arrives whose age is close to the average age of your best customers, you can put extra effort into winning their business.

Example: in a TV commercial, the actors are usually in the same age group as the customers the product targets.


2. Dispersion

Dispersion describes the variation present in a variable, that is, how close to or far from the mean its values lie.

Variance: the average squared deviation from the mean.

Standard deviation: the square root of the variance, expressed in the same units as the data.

Range: the difference between the maximum and minimum values.

Interquartile range (IQR): the difference between Q3 and Q1, where Q3 is the third quartile and Q1 is the first quartile.

data['A'].var()

data['A'].std()

data['A'].max() - data['A'].min()

data['A'].quantile([.25, .5, .75])

data['A'].quantile(.75) - data['A'].quantile(.25)  # IQR = Q3 - Q1
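As a worked example of these definitions, the sketch below computes each dispersion measure on a small made-up sample whose values are easy to check by hand. One detail worth knowing: pandas `var()` and `std()` divide by n - 1 by default (`ddof=1`), so the population variance needs `ddof=0`. The sample values and variable names are illustrative.

```python
import pandas as pd

# Small illustrative sample with mean 5
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()            # divides by n - 1 (pandas default, ddof=1)
population_var = values.var(ddof=0)  # divides by n
sample_std = values.std()            # square root of the sample variance
value_range = values.max() - values.min()

q1, q3 = values.quantile([0.25, 0.75])  # first and third quartiles
iqr = q3 - q1

print(sample_var, population_var, sample_std, value_range, iqr)
```

The squared deviations from the mean sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7.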

Why do we need to calculate the dispersion of a variable?

The dispersion of a variable shows the range of your customers' requirements: you learn the highest and lowest levels of what customers need. Understanding this variation helps you maintain your inventory accordingly.

3. Skewness

Skewness measures how symmetric the data are about the mean. Symmetry means that observations are distributed equally above and below the mean.

skewness = 0: the data are symmetric about the mean.

skewness = Negative: the data are not symmetric, and the left tail of the density plot is longer than the right tail.

skewness = Positive: the data are not symmetric, and the right tail of the density plot is longer than the left tail.

We can find the skewness of a variable with the function below.

data['A'].skew()
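The sign of the skewness is easy to mix up, so here is a minimal sketch on three made-up samples: a symmetric one, one with a single large value stretching the right tail, and its mirror image. The sample values are invented for illustration.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])  # one large value stretches the right tail
left_tailed = -right_tailed                       # mirror image: long left tail

skew_symmetric = symmetric.skew()
skew_right = right_tailed.skew()  # positive
skew_left = left_tailed.skew()    # negative, same magnitude

print(skew_symmetric, skew_right, skew_left)

# A long tail pulls the mean toward it, away from the median
print(right_tailed.mean() > right_tailed.median())
print(left_tailed.mean() < left_tailed.median())
```

Both comparisons print `True`: with a long right tail the mean sits above the median, and with a long left tail it sits below.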

4. Kurtosis

Kurtosis is often described as the peakedness (or flatness) of a density plot compared with a normal distribution. If you research the definition further, you will come across the names of Dr. Westfall and Dr. Donald Wheeler and their definitions. Dr. Wheeler defines kurtosis as follows: “The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution.” In other words, kurtosis measures the tail heaviness of a distribution. The values below refer to excess kurtosis, which is what pandas returns: a normal distribution scores 0.

kurtosis = 0: the peakedness of the graph equals that of a normal distribution.

kurtosis = Negative: the graph is less peaked than a normal distribution (a flat plot).

kurtosis = Positive: the graph is more peaked than a normal distribution (a more peaked plot).

We can find the kurtosis of a variable with the function below.

data['A'].kurt()
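To see the three cases with actual numbers, the sketch below draws large samples from a normal, a uniform, and a Laplace distribution and computes their kurtosis with pandas, which reports excess kurtosis (normal = 0). The choice of distributions, the seed, and the sample size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 100_000                     # large sample so the estimates are stable

samples = pd.DataFrame({
    "normal": rng.normal(size=n),           # reference: excess kurtosis near 0
    "uniform": rng.uniform(-1, 1, size=n),  # light tails: negative
    "laplace": rng.laplace(size=n),         # heavy tails: positive
})

excess_kurtosis = samples.kurt()  # one value per column
print(excess_kurtosis)
```

The theoretical excess kurtosis is -1.2 for the uniform distribution and 3 for the Laplace distribution, so the sample values should land close to those figures. With only 10 observations, as in column A above, the estimate is much noisier.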

Let's look at a graph of the variable and interpret the skewness and peakedness of the distribution from it.

import seaborn as sns

sns.histplot(data['A'], kde=True, stat='density')  # histogram plus density curve; replaces the deprecated sns.distplot

Density plot of variable ‘A’

In the graph above, the longer tail is on the left side, so the distribution is left-skewed (it has negative skewness). The histogram rises above the density line, which indicates a flat plot, so the kurtosis of this distribution is negative. If the line were above the histogram, the kurtosis would be taken as positive. This is only a rough visual reading; data['A'].skew() and data['A'].kurt() give the actual values.

How do we make decisions from this graph?

Negative skewness: the long tail lies below the mean, so a group of customers prefers products priced well below the mean, even though most observations sit at or above it. Keep enough stock of below-mean-price products to serve that tail.

Positive kurtosis: the data are more peaked than a normal distribution. This suggests your product falls in a premium category with a limited customer range, and you may be able to sell it without offering a discount. With negative kurtosis, you may have to discount your product or service in order to sell it.

Please share your feedback on this article.


About the Author

Alok Ranjan
Alok is an analytics enthusiast. He holds an MBA in Finance and a B.E. in Computer Science. He has years of experience in analytics and is Co-founder and COO of Nikhil Analytics. He has previously worked with companies such as SLK Soft, Fifth-Third Bank, and Bank of Maharashtra.
