Learn Multiple Regression using R


The term “multiple” in multiple regression means that more than one independent variable (numerical or categorical) is used to make predictions for a single dependent variable. When there is more than one dependent variable, the model is instead called multivariate regression.

For example, if there are independent variables X1, X2, ..., Xm and only one dependent variable Y, then the equation for the multiple regression model is

Y = a + b1*X1 + b2*X2 + b3*X3 + ... + bm*Xm + e

where Y is the continuous dependent variable and the Xi are the predictors (independent variables) of the multiple regression model. Here a is the intercept, b1, b2, ..., bm are the coefficients (weights) of X1, X2, ..., Xm respectively, and e is the error term.
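
A minimal sketch of fitting such an equation in R is shown below; the data frame df and the variables y, x1 and x2 are placeholder names, not taken from the dataset used later in this article.

#Hypothetical sketch: fit a multiple regression and read off a, b1 and b2
fit <- lm(y ~ x1 + x2, data = df)
coef(fit)       #intercept a followed by the coefficients b1 and b2
summary(fit)    #coefficients with their p-values and R-squared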

ASSUMPTIONS OF MULTIPLE REGRESSION

There are some assumptions that we have to consider before running a multiple regression. You should check and validate these assumptions before forming the regression equation. They are listed here with their details, and a combined R sketch of how to check them follows the list.

  • Linear relationship

Multiple regression needs the relationship between the independent variables and the dependent variable to be linear. We can assess linearity with scatter plots between the dependent variable and each independent variable. An outlier in an independent variable may distort the apparent linearity between the dependent variable and that variable, so you should check for outliers before interpreting the scatter plots.

  • Multivariate normality

The residuals of the regression equation should be normally distributed. A residual is calculated by subtracting the predicted value of the dependent variable from the observed value. You can check this assumption by viewing the Normal Q-Q plot of the fitted regression; normality can also be checked with a goodness-of-fit test on the residuals.

  • No or little Multicollinearity

Multicollinearity exists when there is significant correlation between the independent variables of the regression equation, and its presence makes the coefficient estimates unreliable and the equation prone to overfitting. For a well-fitted multiple regression equation there should be no or little multicollinearity. It can be checked with the correlation matrix of the independent variables, the Variance Inflation Factor (VIF) and the Condition Index.

  • No auto-correlation

The fourth assumption of multiple regression analysis is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other; in other words, the value of y at x+1 is not independent of the value of y at x. This typically happens with stock prices or other time-series data, where the dependent variable depends on its own previous values.

  • Homoscedasticity

The last assumption of linear regression analysis is homoscedasticity (which means equal variance), also called homogeneity of variance. The variance of the residuals should not change with the fitted values: residuals should be spread similarly across all values of the independent variables. Heteroscedasticity is the opposite of homoscedasticity and is present when the spread of the residuals (also termed the error term) differs across values of the independent variables.
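
A minimal sketch of how these assumptions can be checked in R is given below. It assumes a fitted model object named fit and that the add-on packages car and lmtest are installed; the object name is an assumption, so adapt it to your own model.

#Sketch of assumption checks for a fitted lm model named fit (hypothetical name)
library(car)        #provides vif() and durbinWatsonTest()
library(lmtest)     #provides bptest()

plot(fit, which = 1)           #residuals vs fitted values: look for non-linear patterns
plot(fit, which = 2)           #Normal Q-Q plot of residuals: check normality
shapiro.test(residuals(fit))   #goodness-of-fit test for normality of residuals
vif(fit)                       #values well above 5-10 suggest multicollinearity
durbinWatsonTest(fit)          #test for autocorrelation of residuals
bptest(fit)                    #Breusch-Pagan test for heteroscedasticity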

Variable Selection Methods

There are multiple variable selection methods available for regression analysis. Here we cover the Forward Selection Method, the Backward Elimination Method and the Stepwise Selection Method. Details of these methods follow, with a small R sketch of the underlying F-test decisions after the list.

  • Forward Selection Method

The simplest and easiest method of variable selection for a multiple regression equation is the Forward Selection Method. In this method, variables from the list of independent variables are added to the equation one at a time. Variables continue to be added as long as the p-value of the entering variable is below a pre-set level. It can be useful to set this level slightly above the usual 0.05, for example 0.10 or 0.15.

The model begins with the most significant variable and continues adding variables one by one until none of the remaining variables is significant at the pre-set p-value.

  • Backward Elimination Method

In the Forward Selection Method, each newly added variable may make already included variables non-significant. To handle such cases we can use the Backward Elimination Method.

Under the Backward Elimination Method, we start with a model that includes all variables of interest and then drop the least significant variable one at a time. We continue this process until all remaining variables are statistically significant.

Backward elimination has drawbacks as well: sometimes variables are dropped that would be significant if added back to the final reduced model. This suggests using a compromise between the forward and backward methods, which is the Stepwise Selection Method.

  • Stepwise Selection Method

Stepwise selection is a variable selection method that allows moves in either direction, dropping or adding variables at the various steps.

The method proceeds as for forward selection, with the addition that at each step, after the inclusion of a new variable, F values are calculated for all variables already in the equation as though each were the most recently entered. Any variable with a non-significant F statistic is removed from the model, i.e. we accept the hypothesis that its regression coefficient, in the equation containing all variables included up to that step, is zero. When the F value testing the inclusion of a new variable becomes non-significant, the procedure ends and that variable and all remaining candidates are excluded from the final model.

This method permits the identification of variables which have become redundant through the subsequent inclusion of other variables into the regression equation, and may counteract the corresponding drawback of the forward selection procedure.
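
As a rough sketch of how these F-test based decisions can be carried out by hand in R, base R's add1() and drop1() functions report the relevant tests; the data frame df and the variables y, x1, x2 and x3 below are placeholder names.

#Hypothetical sketch: manual forward/backward steps using F tests
null_model <- lm(y ~ 1, data = df)        #intercept-only model
full_scope <- y ~ x1 + x2 + x3            #largest model considered

add1(null_model, scope = full_scope, test = "F")   #which variable is most worth adding?
m1 <- update(null_model, . ~ . + x2)               #suppose x2 was the most significant
drop1(m1, test = "F")                              #re-test variables already in the model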

MULTIPLE REGRESSION IN R

#Multiple Regression Analysis – Here we have data on Walmart customers. We are going to create a regression equation to predict Customer_Satisfaction for Walmart customers using the walmart reg.csv data.

#Setting working directory path

setwd("C:/Users/m/Desktop/Multiple Regression")

dir()      #this will show all files of working directory

#Reading CSV data into R using read.csv

data_set<-read.csv("walmart reg.csv")

#Opening the data frame in a new window

View(data_set)

#this will display structure of given data frame

str(data_set)

#In a regression equation, the first step is to check the linear relationship of each expected independent variable with the dependent variable

#To check the linear relationship between the dependent variable and the other variables using the plot function

plot(Customer_Satisfaction~.,data=data_set)

#You have to press Enter to see the plots one by one; in case you want all plots on one page you can use plot(data_set)

#Identifying outliers for each variable of the data frame

boxplot(data_set)

#Small circles at the top or bottom of a boxplot indicate that the variable has outliers

#Separating columns with outliers

data_outlier<-data_set[,c(3,4,5,8,10,11,12,14)]

boxplot(data_outlier)

#Identify Outlier values for each column

outlier_values <- boxplot.stats(data_outlier[,1])$out  # outlier values.

boxplot(data_outlier[,1], boxwex=0.1)

mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)

#(Boxplot figures shown for Order_Billing, Delivery_Speed, Complaint_Resolution and Warranty_Claims)

#Replacing outlier with mean of variable

#Custom Function to replace outlier value with mean

outlier_rem<-function(y,z)
{
  y_mean<-round(mean(y),1)   #mean of the variable, rounded to one decimal
  y[y %in% z]<-y_mean        #replace outlier values with the mean
  return(y)
}

 

n_col<- ncol(data_outlier)

for (i in 1:n_col)
{
  data_outlier[,i]<-outlier_rem(data_outlier[,i], boxplot.stats(data_outlier[,i])$out)
}

#As we can see there are no outliers left in the data. In case you find more outliers, you can cap them at nearby quantile values; a sketch of this follows the boxplot below.

boxplot(data_outlier)
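
#In case you prefer capping instead of replacing with the mean, a minimal sketch of
#quantile capping (winsorizing) is given below; the function name and the 5th/95th
#percentile cut-offs are illustrative choices only

cap_outlier<-function(y, lower=0.05, upper=0.95)
{
  q <- quantile(y, probs=c(lower, upper), na.rm=TRUE)
  y[y < q[1]] <- q[1]     #cap low outliers at the lower quantile
  y[y > q[2]] <- q[2]     #cap high outliers at the upper quantile
  return(y)
}

#Example use: data_outlier[,1]<-cap_outlier(data_outlier[,1])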

#Combining with original dataset

data_set_1<-cbind(data_set[,c(1,2,6,7,9,13)],data_outlier)

 

# Finding Correlation and performing Correlation test

#Performing Correlation matrix between Customer_Satisfaction and other variables

cor_matrix<-cor(data_set_1$Customer_Satisfaction,data_set_1[,-1])

cor_matrix

#Here you can check correlation of each variable with Customer_Satisfaction

#Product_Quality  Advertising  Product_Line  Competitive_Pricing  Price_Flexibility  E_Commerce
#      0.5210519    0.3535415     0.6463371           -0.2820634         0.03182443    0.202262

#Technical_Support  Complaint_Resolution  Salesforce_Image  Warranty_Claims  Packaging
#        0.1880519              0.587448         0.4675346        0.2624168   0.245244

#Order_Billing  Delivery_Speed
#    0.4492715       0.6115337
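
#If you prefer not to classify the correlations by eye, one optional way is to sort
#them by absolute value (this assumes cor_matrix was computed as above)

sort(abs(cor_matrix[1,]), decreasing=TRUE)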

#Identify weak, moderate and strong correlation

#Weakly correlated variables are –

#Competitive_Pricing, Price_Flexibility, E_Commerce, Technical_Support,

#Warranty_Claims, Packaging

 

#Moderately correlated variables are –

#Advertising, Salesforce_Image, Order_Billing, Delivery_Speed

 

#Strongly correlated variables are –

#Product_Quality, Product_Line, Complaint_Resolution

#Drop the weakly correlated variables and use the moderately and strongly correlated ones for the correlation test

#Perform the correlation test for the moderately and strongly correlated variables using the cor.test function

#Hypotheses for the correlation test:

#Null Hypothesis      : Correlation (rho) is equal to zero

#Alternate Hypothesis : Correlation (rho) is not equal to zero

 

#Look at the p-value: a p-value less than 0.05 indicates the correlation test is significant at the 5% significance level.

 

cor_pvalue<-cor.test(data_set_1$Customer_Satisfaction,data_set_1$Advertising)

cor_pvalue$p.value

#2.816837e-07, which is less than 0.05, so the correlation is significant.

 

cor_pvalue<-cor.test(data_set_1$Customer_Satisfaction,data_set_1$Order_Billing)

cor_pvalue$p.value

#2.50605e-11, which is less than 0.05, so the correlation is significant.

 

cor_pvalue<-cor.test(data_set_1$Customer_Satisfaction,data_set_1$Delivery_Speed)

cor_pvalue$p.value

#6.695041e-22, which is less than 0.05, so the correlation is significant.

 

#Similarly, perform the correlation test for the other variables and check their p-values; a compact looping sketch follows.
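
#Optional sketch: get all the remaining p-values at once (this assumes data_set_1
#contains Customer_Satisfaction and only numeric candidate variables)

vars<-setdiff(names(data_set_1), "Customer_Satisfaction")

p_values<-sapply(vars, function(v) cor.test(data_set_1$Customer_Satisfaction, data_set_1[[v]])$p.value)

p_values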

 

#Subsetting dataset with those variables which are either moderate or strong correlation

#with dependent variables in equation

 

data_set_2<-subset(data_set_1, select=c("Customer_Satisfaction",
                                        "Product_Quality", "Advertising",
                                        "Product_Line", "Complaint_Resolution",
                                        "Salesforce_Image", "Order_Billing",
                                        "Delivery_Speed"))

View(data_set_2)

#Variable selection for the regression equation

#Initial step – regression equation with no variables, only the intercept

initial_step<-lm(Customer_Satisfaction~1,data=data_set_2)

initial_step

 

#Final step – regression equation with all variables

final_step<-lm(Customer_Satisfaction~.,data=data_set_2)

final_step

 

#perform forward selection using step function

FORWARD_METHOD<-step(initial_step, scope=list(lower=initial_step, upper=final_step), direction="forward")

 

#Backward  Elimination Method

BACKWARD_METHOD<-step(final_step, direction="backward")

 

#Step_wise Selection Method

STEP_WISE<-step(initial_step, scope=list(upper=final_step), direction="both")

 

#Another method

#You can use the stepAIC function (from the MASS package) to identify independent variables for prediction

#stepAIC for building the model

model1<-lm(Customer_Satisfaction ~ .,data=data_set_2)

model1

 

library(MASS)      #stepAIC is provided by the MASS package

stepAIC_model<-stepAIC(model1, direction="both")
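
#To inspect whichever model the selection procedure returns (stepAIC_model here; the same
#works for FORWARD_METHOD, BACKWARD_METHOD or STEP_WISE), look at its summary and coefficients

summary(stepAIC_model)     #coefficients, p-values and R-squared of the selected model

coef(stepAIC_model)        #the fitted regression coefficients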

Model validations will be covered in upcoming articles.

 


Compiled By:

About Alok:

Alok is an Analytics enthusiast. He holds an MBA in Finance and a B.E. in Computer Science. He has years of experience in Analytics and is also Co-founder and COO of Nikhil Analytics. He has previously worked with companies like SLK Soft, Fifth-Third Bank, Bank of Maharashtra etc.

Assisted By:

About Saugata:

Saugata holds an MBA in Finance and a BCA. He is currently working as an Analyst Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.

About the Author

Dyuti is an Analytics enthusiast. She holds an MBA in Finance and a B.E. in Computer Science. She has years of experience in the field of Analytics and is also the Co-founder and CEO of Nikhil Analytics. She has previously worked with companies like HCL Technologies, Deutsche Bank, WNS, Reliance Capital etc.
