Learn Multiple Regression using R:
The term “multiple” in multiple regression means that more than one independent variable (numerical or categorical) is used to predict a single dependent variable. When there is more than one dependent variable, the model is instead called multivariate regression.
For example, if there are several independent variables X1, X2, ..., Xm and only one dependent variable Y, then the equation for the multiple (multi-variable) regression is
Y = a + b1*X1 + b2*X2 + ... + bm*Xm
where Y is a continuous dependent variable and each Xi is a predictor (independent variable) of the multiple regression model. Here a is the intercept, and b1, b2, ..., bm are the coefficients (weights) of X1, X2, ..., Xm respectively.
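As a minimal sketch of this equation in R, the block below fits a multiple regression with lm() and rebuilds one prediction by hand from the intercept a and the weights b. The built-in mtcars data set and the chosen columns are stand-ins assumed for illustration; they are not the data used later in this article.

```r
# Stand-in data: built-in mtcars (an assumption, not the article's data).
# Y = mpg, X1 = wt, X2 = hp, X3 = disp
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

coefs <- coef(fit)
a <- coefs["(Intercept)"]          # intercept
b <- coefs[c("wt", "hp", "disp")]  # weights b1, b2, b3

# Rebuild the prediction for the first row by hand: a + b1*X1 + b2*X2 + b3*X3
x_first <- unlist(mtcars[1, c("wt", "hp", "disp")])
y_hat_manual <- unname(a + sum(b * x_first))

# The same prediction from predict()
y_hat_lm <- unname(predict(fit, newdata = mtcars[1, ]))

print(c(manual = y_hat_manual, lm = y_hat_lm))
```

Both numbers should agree, which shows that the fitted model is just the weighted sum in the equation above.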
ASSUMPTIONS OF MULTIPLE REGRESSION
There are some assumptions to consider before going for multiple regression. You have to keep these assumptions in mind and validate them before forming the regression equation. All of them are listed here with their details.
- Linear relationship
Multiple regression needs the relationship between the dependent variable and each independent variable to be linear. We can inspect this relationship with scatter plots of the dependent variable against each independent variable. An outlier in any independent variable may affect the linearity between the dependent variable and that variable, so you should check for outliers before drawing the scatter plots.
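A rough sketch of this check, assuming the built-in mtcars data as a stand-in and hypothetical choices of dependent variable and predictors: a scatter plot matrix shows the pairwise relationships, and boxplot.stats() lists the values that the boxplot rule flags as outliers.

```r
# Stand-in data: built-in mtcars (an assumption, not the article's data).
dependent  <- "mpg"
predictors <- c("wt", "hp", "disp")

# Scatter plot matrix of the dependent variable against the predictors.
pairs(mtcars[, c(dependent, predictors)])

# Values flagged as outliers by the boxplot rule, one entry per predictor.
outliers <- lapply(mtcars[, predictors], function(x) boxplot.stats(x)$out)
print(outliers)
```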
- Multivariate normality
The residuals of the regression equation should be normally distributed. A residual is calculated by subtracting the predicted value of the dependent variable from the observed value. You can check this assumption by viewing the Normal Q-Q plot of the residuals. Normality can also be checked with a goodness-of-fit test on the residuals.
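The residual check can be sketched as follows, again on the stand-in mtcars data (an assumption). The Shapiro-Wilk test is used here as one possible normality test; the text above does not name a specific test.

```r
# Stand-in data and model (assumptions for illustration).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual = observed dependent value - predicted dependent value
res_manual <- mtcars$mpg - fitted(fit)

# Normal Q-Q plot: points close to the line suggest normal residuals.
qqnorm(res_manual)
qqline(res_manual)

# Shapiro-Wilk normality test; a small p-value suggests non-normal residuals.
sw <- shapiro.test(res_manual)
print(sw$p.value)
```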
- No or little Multicollinearity
Multicollinearity in the data makes the coefficient estimates of the regression equation unstable and hard to interpret. Multicollinearity exists when there is significant correlation among the independent variables of the regression equation. A well-fitted multiple regression equation should have little or no multicollinearity. It can be checked with the correlation matrix of the independent variables, the Variance Inflation Factor (VIF) and the Condition Index.
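A small sketch of two of these checks in base R, assuming mtcars as stand-in data and a hypothetical set of predictors. The VIF of a predictor is computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining predictors.

```r
# Stand-in data and predictor names (assumptions for illustration).
predictors <- c("wt", "hp", "disp")

# Correlation matrix between the independent variables.
cor_matrix <- cor(mtcars[, predictors])
print(round(cor_matrix, 2))

# VIF for each predictor: regress it on the other predictors.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A VIF close to 1 indicates little multicollinearity for that predictor; larger values indicate that the predictor is largely explained by the others.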
- No auto-correlation
The fourth assumption of multiple regression analysis is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other; in other words, when the value of y at x+1 is not independent of the value of y at x. This typically happens with stock prices or other time series data, where the value of the dependent variable depends on its own previous values.
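One common way to look for autocorrelation, not named in the text above, is the Durbin-Watson statistic. The sketch below computes it by hand on the built-in yearly longley data, which is assumed here only as a time-ordered stand-in. Values near 2 suggest little first-order autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation.

```r
# Stand-in time-ordered data and model (assumptions for illustration).
fit <- lm(Employed ~ GNP + Population, data = longley)
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the sum of squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Correlation between each residual and the previous one.
lag1_cor <- cor(e[-1], e[-length(e)])

print(c(durbin_watson = dw, lag1_correlation = lag1_cor))
```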
- Homoscedasticity
The last assumption of linear regression analysis is homoscedasticity (which means "same variance"), also called homogeneity of variance. This means that the variance of the residuals should not change systematically with the fitted values or with the values of the independent variables; the spread of the residuals should be the same across all of them. Heteroscedasticity is the opposite of homoscedasticity: it is present when the variance of the residuals (also termed the error term) differs across different values of the independent variables.
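A simple visual sketch of this check plots residuals against fitted values; a funnel shape would hint at heteroscedasticity. The data (mtcars) and the model are stand-ins assumed for illustration, and the variance ratio at the end is only a rough heuristic, not a formal test.

```r
# Stand-in data and model (assumptions for illustration).
fit <- lm(mpg ~ wt + hp, data = mtcars)
fitted_values <- fitted(fit)
res <- residuals(fit)

# Residuals vs fitted values: the vertical spread should look roughly constant.
plot(fitted_values, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Rough heuristic: compare residual variance for low and high fitted values.
low_half <- fitted_values <= median(fitted_values)
variance_ratio <- var(res[!low_half]) / var(res[low_half])
print(variance_ratio)
```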
Variable Selection Methods
Several variable selection methods are available for regression analysis. Here we cover the Forward Selection Method, the Backward Elimination Method and the Stepwise Selection Method. Details of these methods are as follows.
- Forward Selection Method
The simplest method of variable selection for a multiple regression equation is the Forward Selection Method. In this method, variables from the list of independent variables are added to the equation one at a time. Variables keep being added as long as the p-value of the candidate variable is below a pre-set level. This entry level is often set somewhat above the conventional 0.05, for example at 0.10 or 0.15.
The model begins with the variable that is most significant and continues adding variables one by one until none of the remaining variables is significant at the pre-set p-value.
- Backward Elimination Method
In the Forward Selection Method, each newly added variable may make variables that are already included non-significant. To handle such cases we can use the Backward Elimination Method.
Under the Backward Elimination Method, we start with a model that includes all variables of interest and then drop the least significant variable, one at a time. We continue this process until all remaining variables are statistically significant.
Backward elimination has drawbacks as well. Sometimes variables are dropped that would be significant if added to the final reduced model. This suggests using a compromise between the forward and backward methods, and that is the Stepwise Selection Method.
- Stepwise Selection Method
Stepwise selection is a variable selection method that allows moves in either direction, dropping or adding variables at the various steps.
The method proceeds as for forward selection, with the addition that at each step, after the inclusion of a new variable, F values are calculated for all variables already included in the equation as though they were the most recently entered. Any variable which leads to a non-significant F statistic here is removed from the model, i.e. the hypothesis that its regression coefficient is zero, in the equation containing all variables up to and including that associated with the step under consideration, is not rejected. When an F value testing for inclusion of a variable becomes non-significant, the procedure ends and that variable and all succeeding variables are excluded from the final model.
This method permits the identification of variables which have become redundant through the subsequent inclusion of other variables into the regression equation, and so may counteract this weakness of the forward selection procedure.
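The three methods can be sketched with the step() function from base R. Note that step() compares models by AIC rather than by the p-value or F-test thresholds described above, so it is a related but not identical procedure. The mtcars data and the candidate variables are stand-ins assumed for illustration.

```r
# Stand-in data and candidate variables (assumptions for illustration).
initial_step <- lm(mpg ~ 1, data = mtcars)                           # intercept only
final_step   <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars) # all candidates
scope <- list(lower = formula(initial_step), upper = formula(final_step))

# Forward: start from the intercept-only model and add variables.
forward <- step(initial_step, scope = scope, direction = "forward", trace = 0)

# Backward: start from the full model and drop variables.
backward <- step(final_step, direction = "backward", trace = 0)

# Stepwise: start from the intercept-only model, allow adding and dropping.
stepwise <- step(initial_step, scope = scope, direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(stepwise))
```

Setting trace = 0 only hides the step-by-step log; leaving it at the default prints each addition or removal, which is useful for seeing how the model was built.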
MULTIPLE REGRESSION IN R
#Multiple Regression Analysis – Here we have data on Walmart customers. We are going to create a regression equation to predict Customer_Satisfaction for Walmart customers using the walmart reg.csv data.
#Setting working directory path
dir() #this will show all files of working directory
#Reading CSV data into R using read.csv
data_set <- read.csv("walmart reg.csv")
#Opening data frame in new window
View(data_set)
#this will display structure of given data frame
str(data_set)
#In a regression equation, the first step is to check the linear relation of every candidate independent variable with the dependent variable.
#Use the plot function to look for a linear relationship between the dependent variable and the other variables
#You have to press Enter to see the plots one by one. In case you want all plots on one page you can use plot(data_set)
#Identifying outliers for each variable of the data frame
#A small circle at the top or bottom of a boxplot indicates that the variable has outliers
#Separating columns with outliers
#Identify outlier values for each column
outlier_values <- boxplot.stats(data_outlier[,1])$out # outlier values.
mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)
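#Replacing outlier with mean of variable
#Custom function to replace outlier values with the mean (y_mean is assumed to be the plain column mean)
outlier_rem <- function(y, z) {
  y_mean <- mean(y)
  y[y %in% z]<-y_mean
  y
}
n_col <- ncol(data_outlier)
for (i in 1:n_col)
  data_outlier[,i] = outlier_rem(data_outlier[,i],boxplot.stats(data_outlier[,i])$out)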
#As we can see, there are no outliers in the data now. In case you find more outliers, you can cap them at nearby quantile values.
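Capping at quantiles, as mentioned in the comment above, could look like the sketch below. The function name cap_outliers, the 5% and 95% limits and the sample vector are all hypothetical choices made for illustration.

```r
# Hypothetical helper: cap values below/above the chosen quantiles.
cap_outliers <- function(x, lower = 0.05, upper = 0.95) {
  limits <- unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
  # Raise values below the lower limit, then lower values above the upper limit.
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up example vector with one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
capped <- cap_outliers(x)
print(capped)
```

Unlike replacement by the mean, capping keeps the extreme observations at the edge of the distribution instead of moving them to its centre.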
#Combining with original dataset
# Finding Correlation and performing Correlation test
#Performing Correlation matrix between Customer_Satisfaction and other variables
#Here you can check correlation of each variable with Customer_Satisfaction
#Product_Quality Advertising Product_Line Competitive_Pricing Price_Flexibility E_Commerce
#0.5210519 0.3535415 0.6463371 -0.2820634 0.03182443 0.202262
#Technical_Support Complaint_Resolution Salesforce_Image Warranty_Claims Packaging
#0.1880519 0.587448 0.4675346 0.2624168 0.245244 0.4492715
#Identify weak, moderate and strong correlation
#Variables with weak correlation –
#Competitive_pricing, Price_Flexibility, E_Commerce, Technical_Support
#Variables with moderate correlation –
#Advertising, Salesforce_Image, Order_Billing, Delivery_Speed
#Variables with strong correlation –
#Product_Quality, Product_Line, Complaint_resolution
#Drop weak correlation variables and use moderate and strong correlation for correlation test
#perform correlation test for moderate and strong correlation using cor.test function
Test of Hypothesis Assumption for Correlation Test
Null Hypothesis : Correlation (rho) is equal to zero
Alternate Hypothesis : Correlation (rho) is not equal to zero
#Look at the p-value: a p-value less than 0.05 indicates that the correlation is significant at the 5% significance level.
#2.816837e-07 is less than 0.05, so the correlation is significant.
#2.50605e-11 is less than 0.05, so the correlation is significant.
#6.695041e-22 is less than 0.05, so the correlation is significant.
Similarly, perform the correlation test for the other variables and check their p-values.
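A minimal sketch of running cor.test for several variables at once, using the built-in mtcars data as a stand-in (an assumption; the column names below are not from the Walmart data):

```r
# Stand-in data: built-in mtcars; mpg plays the role of the dependent variable.
ct <- cor.test(mtcars$mpg, mtcars$wt)
print(ct$estimate)  # sample correlation
print(ct$p.value)   # p-value for H0: rho = 0

# Repeat the test for several candidate variables and collect the p-values.
vars <- c("wt", "hp", "qsec")
p_values <- sapply(vars, function(v) cor.test(mtcars$mpg, mtcars[[v]])$p.value)

# TRUE where the null hypothesis (rho = 0) is rejected at the 5% level.
significant <- p_values < 0.05
print(significant)
```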
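#Subsetting dataset with those variables which have either moderate or strong correlation
#with the dependent variable in the equation
#Variables Selection for regression equation
#Initial step – regression equation with no variable but only intercept
initial_step <- lm(Customer_Satisfaction ~ 1, data= data_set_1)
#Final step – regression equation with all variables
final_step <- lm(Customer_Satisfaction ~ ., data= data_set_1)
#perform forward selection using step function (result name FORWARD_METHOD is assumed)
FORWARD_METHOD<- step(initial_step, scope = list(upper=final_step), data= data_set_1, direction="forward")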
#Backward Elimination Method
BACKWARD_METHOD<- step(final_step, data= data_set_1, direction="backward")
#Step_wise Selection Method
STEP_WISE<-step(initial_step, scope = list(upper=final_step), data= data_set_1, direction="both")
#You can use the stepAIC function (from the MASS package) to identify independent variables for prediction
# StepAIC for building model
model1<-lm(Customer_Satisfaction ~ .,data=data_set_2)
Model validations will be covered in upcoming articles.
Alok is an Analytics enthusiast. He holds an MBA in Finance and a B.E. in Computer Science. He has years of experience in Analytics and is also Co-founder and COO of Nikhil Analytics. He has previously worked with companies like SLK Soft, Fifth-Third Bank, Bank of Maharashtra etc.
Saugata holds an MBA in Finance and a BCA. He is currently working as an Analyst Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.