An outlier is an observation in a random sample from a population that lies an abnormal distance from the other values. In simple terms, it is left to the analyst to decide what counts as abnormal. Before abnormal observations can be singled out, normal observations must first be characterized.
Outliers are observations with a unique combination of characteristics that makes them distinctly different from the other observations. An outlier is typically an unusually high or low value on a single variable, or a unique combination of values across several variables, that makes the observation stand out from the others.
Assume we sample 20 individuals to determine the average household income, and the responses range between $20,000 and $100,000 with an average of $45,000. Now suppose a 21st person has an income of $1 million. If we include this value in the analysis, the average income rises to more than $90,000. A single observation has thus roughly doubled the average for this one variable.
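The arithmetic behind this example can be checked in a few lines. This is only a sketch: the individual incomes are not given, so it works from the stated average and sample size, and the variable names are illustrative.

```python
# Total income of the first 20 respondents, recovered from the stated average.
n_respondents = 20
average_income = 45_000
total_20 = n_respondents * average_income  # 900,000

# Add the 21st respondent, who earns $1 million, and recompute the mean.
extreme_income = 1_000_000
mean_21 = (total_20 + extreme_income) / (n_respondents + 1)

print(f"Mean without the extreme value: {average_income:,.0f}")
print(f"Mean with the extreme value:    {mean_21:,.0f}")
```

The second mean comes out slightly above $90,000, which matches the figure quoted above.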
1. Univariate Detection:
The univariate identification of outliers examines the distribution of observations for each variable in the analysis and selects as outliers those cases falling at the outer ranges (high or low) of the distribution. Two thresholds are established: mean + 3*std and mean - 3*std, where std is the standard deviation. Any value above the upper threshold is termed a high outlier, and any value below the lower threshold is termed a low outlier. Equivalently, we can convert the data values to standard scores, which have a mean of 0 and a standard deviation of 1 (a method known as standardization), and flag the values whose standard score is above 3 or below -3. Here one data frame is created
After executing this code, we find that only one record, 70.204032, has a value greater than mean + 3*std, so it is an outlier in the data frame.
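The three-standard-deviation rule can be sketched as follows. The data here are made up for illustration and are not the data frame used above; the sample standard deviation (ddof=1) is an assumption, since the text does not say which version was used.

```python
import numpy as np

# Hypothetical sample: 19 ordinary readings and one extreme reading (70).
values = np.array(
    [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
     12, 10, 11, 13, 12, 11, 12, 10, 13, 70],
    dtype=float,
)

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation (an assumption)

upper = mean + 3 * std  # threshold for high outliers
lower = mean - 3 * std  # threshold for low outliers

# Standard scores: mean 0 and standard deviation 1 after standardization.
z_scores = (values - mean) / std

high_outliers = values[values > upper]
low_outliers = values[values < lower]

print("High outliers:", high_outliers)
print("Low outliers:", low_outliers)
```

Note that with very small samples no value can reach a standard score of 3, so the rule is only meaningful once there are enough observations.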
2. Bivariate Detection:
a) Examining Relationship between Variables
To find the relationship between two numeric variables, we can use a scatterplot. Cases that fall markedly outside the range of the other observations appear as isolated points in the scatterplot. To assist in determining the expected range of observations in this two-dimensional portrayal, an ellipse representing a bivariate normal distribution's confidence region (typically set at the 90% or 95% level) is superimposed on the scatterplot. This ellipse provides a graphical portrayal of the confidence limits and facilitates the identification of outliers. Suppose I take a data set with the variables weight and horsepower and want to examine the relationship between them graphically, using a scatterplot in Python as given below:
import matplotlib.pyplot as plt
In this graph, the extreme points that lie far from the main cloud of observations are termed outliers.
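The confidence ellipse described above has a simple numeric counterpart: a point lies outside the 95% ellipse when its squared Mahalanobis distance from the centre exceeds the chi-square quantile with two degrees of freedom. The sketch below assumes synthetic weight and horsepower values (not the data set used above) and one injected extreme point.

```python
import numpy as np

# Synthetic data: horsepower grows roughly linearly with weight (an assumption).
rng = np.random.default_rng(0)
weight = rng.normal(3000, 500, 100)
horsepower = 0.04 * weight + rng.normal(0, 10, 100)

# Inject one hypothetical car that is light but very powerful.
weight = np.append(weight, 2000.0)
horsepower = np.append(horsepower, 300.0)

points = np.column_stack([weight, horsepower])
center = points.mean(axis=0)
cov = np.cov(points, rowvar=False)

# Squared Mahalanobis distance of every point from the centre.
diff = points - center
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# For two variables the chi-square quantile has the closed form -2*ln(1 - p).
confidence = 0.95
threshold = -2.0 * np.log(1.0 - confidence)

outside_ellipse = d2 > threshold
print("Points outside the 95% ellipse:", int(outside_ellipse.sum()))
print("Injected point flagged:", bool(outside_ellipse[-1]))
```

Because the ellipse is a 95% region, a few ordinary points are expected to fall outside it as well; the isolated points far beyond the boundary are the ones worth inspecting.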
b) Examining Group Differences Between Variables
To perform a graphical analysis of a metric variable for each group (category) of a nonmetric variable, we use a boxplot. The upper and lower quartiles of the data distribution form the upper and lower boundaries of the box, with the box length being the distance between the 25th percentile and the 75th percentile. The box contains the middle 50 percent of the data values, and the larger the box, the greater the spread of the observations (measured here by the interquartile range rather than the standard deviation). The median is depicted as a solid line within the box. If the median lies near one end of the box, skewness in the opposite direction is indicated. The lines extending from each box (called whiskers) represent the distance to the smallest and largest observations that lie within 1.5 interquartile ranges of the box.
Observations that lie more than 1.5 interquartile ranges beyond the box are termed outliers. Here,
any value above Q3 + 1.5*IQR is termed a high outlier and
any value below Q1 - 1.5*IQR is termed a low outlier, where
Q1 is the lower quartile,
Q3 is the upper quartile and
IQR is the interquartile range, which is the difference between the upper and lower quartiles (Q3 - Q1).
Here I have created random variables with the seed set to 10 so that the same result can be reproduced again and again. Whenever we use a random function without a fixed seed, we get new values every time, which can produce different results.
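import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)  # fix the seed so the same values are generated on every run
array1 = np.random.normal(100, 10, 200)  # mean 100, std 10, 200 values
array2 = np.random.normal(90, 20, 200)  # mean 90, std 20, 200 values
data = [array1, array2]
res = plt.boxplot(data)  # one box per array; points beyond the whiskers are outliers
plt.show()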
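The same fences that the boxplot draws can be computed directly. This is a minimal sketch on a small made-up sample; the function name iqr_outliers and the multiplier argument k are illustrative, not part of any library.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the low/high fences and the values that fall beyond them."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    low = values[values < low_fence]          # low outliers
    high = values[values > high_fence]        # high outliers
    return low_fence, high_fence, low, high

# Hypothetical sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
low_fence, high_fence, low, high = iqr_outliers(sample)

print("Fences:", low_fence, high_fence)
print("Low outliers:", low)
print("High outliers:", high)
```

Unlike the mean and standard deviation, the quartiles are barely affected by the extreme value itself, which is why this rule still works on small samples.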
3. Multivariate Methods:
Isolation Forest is similar in principle to Random Forest and is built on decision trees. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. It is an unsupervised algorithm and therefore does not need labels to identify outliers or anomalies. The PyOD Isolation Forest module is a wrapper around the scikit-learn Isolation Forest with more functionality. Here I have worked on Superstore data and implemented the Isolation Forest algorithm to find outliers and calculate their anomaly scores.
1) The data is partitioned randomly and recursively, and each partitioning is represented as a tree; many such trees together form the forest. This is the training stage, where the user defines the subsample size and the number of trees. Here we have trained Isolation Forest using the Sales data.
2) The end of a tree is reached once the recursive partitioning of the data is finished. The path taken to reach an outlier is expected to be far shorter than the path to a normal observation.
3) The path length is averaged over all trees and normalized to calculate the anomaly score.
4) Outliers are judged on the basis of this score, which determines which regions are outlier regions and which are not.
5) From the graph, we can see the regions where outliers fall.
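Steps 2 to 4 can be written compactly. The formula below is the anomaly score as defined in the original Isolation Forest paper by Liu, Ting and Zhou; it is added here for reference and is not taken from the analysis above. Here h(x) is the path length of observation x in one tree, E[h(x)] is its average over all trees, and n is the subsample size.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772156649
```

A score close to 1 means a short average path and therefore a likely outlier, while a score well below 0.5 indicates a normal observation. Note that scikit-learn reports the score with the opposite sign and shifts it, so that in the output of decision_function negative values indicate outliers.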
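import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Assumes df is the Superstore data frame with a numeric 'Sales' column.
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(df['Sales'].values.reshape(-1, 1))  # train before scoring
xx = np.linspace(df['Sales'].min(), df['Sales'].max(), len(df)).reshape(-1, 1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)  # -1 marks outliers, 1 marks inliers
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier == -1, alpha=.4, label='outlier region')
plt.legend()
plt.xlabel('Sales')
plt.ylabel('anomaly score')
plt.show()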
According to the above visualization, we can conclude that Sales values exceeding 1000 are considered outliers.
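Since the Superstore data is not included here, the following self-contained sketch applies the same idea to synthetic sales figures. The numbers, the random seeds and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic sales: 300 typical orders plus three very large ones (hypothetical).
rng = np.random.default_rng(42)
typical = rng.normal(200, 50, 300)
extreme = np.array([4800.0, 5000.0, 5200.0])
sales = np.concatenate([typical, extreme]).reshape(-1, 1)

# Fit the forest; random_state is fixed only to make the run repeatable.
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(sales)

scores = model.decision_function(sales)  # negative values indicate outliers
labels = model.predict(sales)            # -1 for outliers, 1 for inliers

flagged = sales[labels == -1].ravel()
print("Number of flagged values:", flagged.size)
print("Largest flagged values:", np.sort(flagged)[-3:])
```

With the default settings the model may also flag some values in the tails of the typical group, so the flagged set should be reviewed rather than removed blindly.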
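- If the dataset has only one or two observations with outlier values, we can remove those observations.
df = df.drop([row index], axis=0)  # drop the outlying rows by index label; axis=1 would drop a column
- If the dataset has more than two observations that are outliers, we can replace them with the median.
# 'column_name' is a placeholder; 70.204032 is a rounded display value, so an exact match may fail
df['column_name'] = df['column_name'].replace(70.204032, df['column_name'].median())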
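Both treatments can be sketched with a boolean mask, which avoids matching on an exact floating-point value. The data frame, the column name value and the three-standard-deviation rule used to build the mask are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data frame with one extreme reading (70.0).
df = pd.DataFrame({
    "value": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 13.0, 11.0,
              12.0, 10.0, 11.0, 13.0, 12.0, 11.0, 12.0, 10.0, 13.0, 70.0]
})

# Flag outliers with the mean +/- 3*std rule from the univariate section.
z_scores = (df["value"] - df["value"].mean()) / df["value"].std()
is_outlier = z_scores.abs() > 3

# Option 1: remove the outlying observations (rows).
dropped = df[~is_outlier]

# Option 2: replace the outlying values with the column median.
replaced = df.copy()
replaced.loc[is_outlier, "value"] = df["value"].median()

print("Rows before and after dropping:", len(df), len(dropped))
print("Maximum after median replacement:", replaced["value"].max())
```

Both options return new data frames and leave the original untouched, so the effect of each treatment can be compared before one is chosen.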
Anirban De holds a PGDM in Big Data Analytics and has 1.5 years of experience. He is currently working as an Analyst Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.