Hot Topics in Analytics

# How to do Linear Regression using Python

Before starting our Python code for linear regression, let us first understand: What is linear regression? Why do we need it? When can we use it?

Linear regression is used to predict a target variable from given feature variables using a best-fit model (or equation).

We need linear regression to forecast future values of the target variable, or to estimate the target variable for a new value of a feature variable.

Below is a graphical representation of a linear regression model. The blue line is the best-fit linear regression line, and the black dots are the individual observations from the dataset.

The linear model will be of the form

Y = W*X + B

Here,

Y: target

W: Weight

X: feature

B: Bias

In the case of multiple features, we will get one weight corresponding to each feature.
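As a quick illustration of this (on synthetic data, not the housing dataset used below), a model fitted on three features has three weights in `coef_` plus one intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 observations, 3 features
# target built from known weights (2.0, -1.0, 0.5) and bias 3.0, no noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 3.0

model = LinearRegression().fit(X, y)
print(model.coef_)       # one weight per feature -> array of length 3
print(model.intercept_)  # the bias term B, here 3.0
```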

There are a few assumptions for linear regression. These assumptions are as follows:

1. Linear relationship between X and Y

2. X and Y must be multivariate normal

3. Homoscedasticity (constant variance): residuals should have the same variance at every level of prediction

Residual = actual - predicted

4. No multicollinearity: feature variables should not be correlated among themselves

5. No autocorrelation: autocorrelation refers to the degree of correlation between values of the same variable. Residual values should not be correlated with each other.
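A minimal sketch of checking one of these assumptions, the no-multicollinearity condition, using a pandas correlation matrix on synthetic data (the feature names `a`, `b`, `c` are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # b is almost a copy of a -> collinear
c = rng.normal(size=200)                 # independent feature

features = pd.DataFrame({'a': a, 'b': b, 'c': c})
corr = features.corr()
print(corr.round(2))
# an off-diagonal |correlation| close to 1 signals multicollinearity
```

Here the `a`/`b` correlation is near 1, so one of the two features should be dropped before fitting.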

Now we are going to perform regression. Here are the steps; they can help you develop your own code and experiment with different parts of the program.

from sklearn import datasets

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

boston_data = datasets.load_boston() # note: load_boston was removed in scikit-learn 1.2, so this requires an older version

boston_data.DESCR # this works only for built-in datasets

boston_data.keys()

X = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)

Y = pd.DataFrame(boston_data.target,columns=['MEDV'])

X.columns

X.shape

X.isnull().sum()

train_x,test_x,train_y,test_y=train_test_split(X,Y,test_size=0.3, random_state=10)
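To see what `train_test_split` does with `test_size=0.3`, here is a small self-contained check on synthetic data (shapes only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
Y = np.arange(50)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.3, random_state=10)
print(train_x.shape, test_x.shape)  # (35, 2) (15, 2): a 70/30 split
```

`random_state` fixes the shuffle, so the same rows land in the same split on every run.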

model1=LinearRegression().fit(train_x,train_y)

### Calculate R-Squared value

R-squared gives the percentage of variation in the target variable explained by the derived equation. You can treat this percentage as the accuracy of predictions from the derived model.

model1.score(train_x,train_y) # 0.75

### Interpretation of R-Squared value

R-squared >= 0.7: good fit model (accepted)

R-squared >= 0.85: best fit model (accepted)

R-squared < 0.5: poor fit model (rejected)

If your derived model has an R-squared above 0.7, it can be accepted for prediction.
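What `model1.score` returns can be computed by hand as 1 - SS_res/SS_tot; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.normal(size=(80, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=80)  # noisy linear relation

model = LinearRegression().fit(x, y)
pred = model.predict(x)

ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, model.score(x, y))   # the two values agree
```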

### Predicting target value for test data

pred_y = model1.predict(test_x)

### Calculating Mean Square Error (MSE)

mean_squared_error(test_y,pred_y) # 29.511

Mean squared error gives the average squared error in predicting the target value. Values closer to 0 are preferred.
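`mean_squared_error` is just the average of the squared residuals; a minimal check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])

mse_manual = np.mean((actual - predicted) ** 2)  # (0.25 + 0 + 1) / 3
print(mse_manual, mean_squared_error(actual, predicted))  # both 0.41666...
```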

In case you want to perform regression using only one feature, the above code will not work directly. So here is slightly modified code for bivariate regression.

# Bivariate regression

# I want to find cost of house based on number of rooms.

# target variable — cost_of_house

# feature — no_of_rooms

x = np.array(X['RM'])

y = np.array(Y['MEDV'])

# x and y need to be reshaped into 2-D arrays, because selecting a single column returns a 1-D Series.

x=x.reshape(len(x),1)

y=y.reshape(len(y),1)

model2=LinearRegression().fit(x,y)

model2.coef_ # 9.10

model2.intercept_ # -34.67

# your equation will be in the form of

# y = 9.10*x+(-34.67)

# finding y for given x

x=6.575

y = 9.10*x+(-34.67)

print(x,y)
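Rather than plugging values into the equation by hand, `model2.predict` gives the same answer. Here is a self-contained sketch with a synthetic model (the coefficients below are chosen to mirror the ones above, not taken from the housing data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 9.10 * x + (-34.67)  # exact line, no noise

model = LinearRegression().fit(x, y)
w, b = model.coef_[0][0], model.intercept_[0]

x_new = 6.575
manual = w * x_new + b                       # plugging into y = w*x + b by hand
via_predict = model.predict([[x_new]])[0][0] # letting the model do it
print(manual, via_predict)                   # identical predictions
```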

I know many of you want to know how to find significance of each variable used in model. In my next articles I will cover this.

Happy Learning,

Alok Ranjan