Ridge and Lasso regression are two of the most widely used techniques for building parsimonious models from datasets with a great number of features.
Lasso and Ridge regression are also known as regularization methods, which means they are used to make a model perform better.
From the model built here, I found that the price of a diamond increases based on its quality and its features.
Let us talk about Regularization:-
In Machine Learning, many models become overfitted, which leads to wrong analysis of the target score. The reason behind this is the extra features, because of which a model fails to predict the true result conclusively. The model also cannot recognise this error on its own; the negative impact is that it fits the training data too closely and performs poorly on unseen data, which is what we call overfitting.
WHY PERFORM THE REGULARIZATION METHOD:-
This overfitting of models, which causes noisy results while predicting, can be solved by the regularization method. Whenever there is a big dataset with many columns and thousands of records, we can apply the two methods called Lasso and Ridge regression to get better-fitted models.
These two regressions reduce the error by inflicting a penalty on the magnitude of the coefficients of the features.
Both have different characteristics, and each gives a positive and meaningful result for our models.
Lasso regression reduces the magnitude of the high coefficients and eliminates the unsought features by shrinking their coefficients all the way to 0. Thus it gives a better result while predicting.
Ridge regression, on the other hand, cannot reduce a coefficient to exactly 0. But in Ridge regression, when a feature is less relevant to the target variable, its coefficient contracts towards 0, which reduces the model complexity and the effect of multi-collinearity.
What is Lasso Regression?
Lasso stands for Least Absolute Shrinkage and Selection Operator. It performs both regularization and variable selection, enhancing the prediction accuracy and the interpretability of the statistical model produced by this course of action.
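In symbols (the standard textbook formulation, added here for reference), Lasso minimizes the usual least-squares cost plus an L1 penalty on the coefficients, with $\lambda \ge 0$ controlling the penalty strength:

$$\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Because the penalty uses absolute values, the optimum can set some coefficients exactly to 0, and that is what performs the variable selection.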
What is Ridge Regression?
Ridge regression is a modification of the cost function: a penalty equal to the square of the magnitude of the coefficients is added to the ordinary least-squares cost.
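In the same notation (again the standard formulation), Ridge penalizes the square of each coefficient instead of its absolute value:

$$\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The squared penalty shrinks all coefficients towards 0 but never sets them exactly to 0; in scikit-learn the parameter alpha plays the role of $\lambda$.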
Let us execute these two methods in Python code.
Step 1:- Importing the required packages.
Step 2:- Extracting the dataset from the given path.
Step 3:- Checking the correlation and dropping the unwanted columns which have no relation with the target.
Step 4:- Checking for missing values.
Step 5:- Visualizing the dataset with the help of a heat map.
Step 6:- Checking the unique values of the text features.
Step 7:- Converting the categorical values to numeric values using LabelEncoder.
Step 8:- Scaling the data.
Step 9:- Declaring the features in X and the target in y, and splitting them into training and test data.
Step 10:- Tuning the models using GridSearchCV.
Step 11:- Fitting the models and printing the best parameters, R-squared scores, MSE, and coefficients for both Lasso and Ridge regression.
Step 12:- Finally, predicting the price of a diamond.
I have taken the Diamond dataset from an Internet source and am performing this task to show you how it actually works.
#Importing the required packages to perform this task
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
#LABEL ENCODING AND SCALING THE DATA
from sklearn.preprocessing import LabelEncoder, StandardScaler
Then I extracted the data from the path; the Diamond dataset contains 53940 rows and 11 columns.
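A minimal sketch of the loading and correlation steps, assuming the CSV is saved locally as diamonds.csv (the actual path is not given here) and a recent pandas version:

#Loading the dataset (file name assumed)
diamond = pd.read_csv('diamonds.csv')
print(diamond.shape)
#OUTPUT:- (53940, 11)
#Checking how each numeric column correlates with the price
print(diamond.corr(numeric_only=True)['price'])
#Visualizing the correlations with a heat map
sns.heatmap(diamond.corr(numeric_only=True), annot=True)
plt.show()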
#I have checked the correlation among the columns
#'Unnamed: 0' is just a row index, and 'table' and 'depth' show little relation with the price, so we drop all three
diamond.drop(['Unnamed: 0', 'table', 'depth'], inplace=True, axis=1)
#When inplace=True is used, the operation is performed on the data itself and nothing is returned.
Now, after dropping those columns, the dataset contains 53940 rows × 8 columns.
While checking for missing values, we found no missing values in any of the columns.
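The check itself is one line (a small sketch):

#Counting the missing values in every column; each count comes back as 0 here
print(diamond.isnull().sum())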
#Printing unique values of the text features
categorical_features = ['cut', 'color', 'clarity']
for col in categorical_features:
    print(col, ':', diamond[col].unique())

#Now transforming the categorical values to numerical values using LabelEncoder
le = LabelEncoder()
for col in categorical_features:
    diamond[col] = le.fit_transform(diamond[col])
diamond.head()
OUTPUT:-
   carat  cut  color  clarity  price     x     y     z
0   0.23    2      1        3    326  3.95  3.98  2.43
1   0.21    3      1        2    326  3.89  3.84  2.31
2   0.23    1      1        4    327  4.05  4.07  2.31
3   0.29    3      5        5    334  4.20  4.23  2.63
4   0.31    1      6        3    335  4.34  4.35  2.75
Then scaling of the data was performed using StandardScaler().
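Here is a minimal sketch of this step, assuming the features and the target were scaled with separate StandardScaler objects; the target's scaler is kept in the variable scaler so that the predicted prices can be inverse-transformed back to the original scale later:

#Declaring the features in X and the target in y
X = diamond.drop('price', axis=1)
y = diamond[['price']]   #kept 2-D so the scaler can invert predictions later
#Scaling features and target separately
feature_scaler = StandardScaler()
X = feature_scaler.fit_transform(X)
scaler = StandardScaler()
y = scaler.fit_transform(y)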
So, we will split the data into train and test sets, build Ridge and Lasso models, and choose the regularization parameter with the help of GridSearchCV.
For that, we have to define the set of candidate alpha parameters for GridSearchCV. In this case, the model with the highest R-squared score will give us the best parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
parameters = {'alpha': np.concatenate((np.arange(0.1, 2, 0.1), np.arange(2, 5, 0.5), np.arange(5, 26, 1)))}
lasso = linear_model.Lasso()
gridlasso = GridSearchCV(lasso, parameters, scoring='r2')
Now we will see the Lasso Score:-
# Fit models and print the best parameters, R-squared scores, MSE, and coefficients for Lasso Regression
gridlasso.fit(X_train, y_train)
print("lasso best parameters:", gridlasso.best_params_)
#OUTPUT:- lasso best parameters: {'alpha': 0.1}
print("lasso score:", gridlasso.score(X_test, y_test))
#OUTPUT:- lasso score: 0.7873154467246168
print("lasso MSE:", mean_squared_error(y_test, gridlasso.predict(X_test)))
#OUTPUT:- lasso MSE: 0.21231853566115141
print("lasso best estimator coef:", gridlasso.best_estimator_.coef_)
#OUTPUT:- lasso best estimator coef:
#[0.         0.60436848 0.11552185 0.         0.03079737 0.         0.        ]
Here the coefficients that became 0 are excluded, which means only 3 of the features are relevant to the model. Thus Lasso gives a better result while predicting by selecting only the useful features.
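To see which features survive, we can map the coefficients back to the column names (a small sketch; the feature order is assumed to follow the columns of X):

#Pairing each coefficient with its feature name and keeping the non-zero ones
feature_names = ['carat', 'cut', 'color', 'clarity', 'x', 'y', 'z']
for name, coef in zip(feature_names, np.ravel(gridlasso.best_estimator_.coef_)):
    if coef != 0:
        print(name, round(coef, 4))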
Now we will see the Ridge Score:-
ridge = linear_model.Ridge()
gridridge = GridSearchCV(ridge, parameters, scoring='r2')
# Fit models and print the best parameters, R-squared scores, MSE, and coefficients for Ridge Regression
gridridge.fit(X_train, y_train)
print("ridge best parameters:", gridridge.best_params_)
#OUTPUT:- ridge best parameters: {'alpha': 15.0}
print("ridge score:", gridridge.score(X_test, y_test))
#OUTPUT:- ridge score: 0.8775879993978104
print("ridge MSE:", mean_squared_error(y_test, gridridge.predict(X_test)))
#OUTPUT:- ridge MSE: 0.12220133674473616
print("ridge best estimator coef:", gridridge.best_estimator_.coef_)
#OUTPUT:- ridge best estimator coef: [[ 2.585893   -0.17002585  0.09953217 -0.24939873  0.07572918  0.0159031  -0.0668639 ]]
Here we see that some coefficient values come close to 0, but none of them are removed, so all of the features stay relevant to the target variable.
Predicting the Diamond price:-
pred_y = gridridge.predict(X_test)
#Converting the scaled predictions back to the original price scale
final_price = np.abs(scaler.inverse_transform(pred_y))
print(final_price)
#OUTPUT:-
[[1384.04819166]
[4748.55934856]
[ 485.29958189]
…
[4200.8463314 ]
[9523.46959716]
[3782.0781901 ]]
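As a quick sanity check (a small sketch using the same target scaler), the actual test prices can be brought back to the original scale and compared side by side with the predictions:

#Actual prices for the same test rows, inverse-transformed from the scaled target
actual_price = scaler.inverse_transform(y_test)
print(np.column_stack((final_price[:5], actual_price[:5])))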
Conclusion:-
On analyzing the Diamond dataset, we found that Ridge regression gives a better accuracy of about 88% (test R-squared), whereas Lasso regression gives about 78%. Hence, to predict the price of the diamonds, Ridge regression is the more suitable method here.
Written by:
Ashrayas Kumar
Ashrayas Kumar holds a B.Tech in Electrical Engineering from ITER College, Bhubaneswar. He is currently working as a Data Scientist with NikhilGuru Consulting Service LLP (Nikhil Analytics), Bangalore.