XGBoost (eXtreme Gradient Boosting) is a supervised machine learning algorithm used for both regression and classification. It is an implementation of gradient-boosted decision trees designed for speed and performance.
XGBoost was developed by Tianqi Chen and is laser-focused on computational speed and model performance. It supports all key variations of the technique, and much of its appeal comes from the speed gained through careful engineering of the implementation, including the following:
- Parallelization of tree construction, using all CPU cores during training.
- Distributed computing for training very large models on a cluster of machines.
- Out-of-core computing for vast datasets that do not fit into memory.
- Cache optimization of data structures and algorithms to make the best use of the hardware.
Gradient boosting implementations are commonly slow because the process is sequential in nature: each tree must be constructed and added to the model before the next can be built. The focus on performance in the development of XGBoost has made it one of the best predictive modeling algorithms, able to harness the full capability of the hardware platform, including machines rented in the cloud.
1. XGBoost is a powerful machine learning algorithm especially where speed and accuracy are concerned.
2. We need to consider the different parameters and the values to be specified while implementing an XGBoost model.
3. The XGBoost model requires parameter tuning to improve and fully leverage its advantages over other algorithms.
4. If things don’t go your way in predictive modeling, use XGBoost.
5. It is a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.
6. To improve the model, parameter tuning is a must.
How to install XGBoost?
To install this package with conda, run:
conda install -c anaconda py-xgboost
Alternatively, install it with pip:
pip install xgboost
How does XGBoost Work?
XGBoost belongs to a family of boosting algorithms. Boosting is a sequential process: trees are grown one after another, each using information from the previously grown trees. The process gradually learns from the data and tries to improve its predictions in subsequent iterations. Let's take a classification example:
The figure above shows four classifiers (Boxes 1–4) trying to separate the + and – classes as homogeneously as possible.
Box 1: The first classifier creates a vertical line at D1. It says that anything to the left of D1 is + and anything to the right of D1 is –. But this classifier misclassifies three + points.
Box 2: The next classifier tries to correct the previous mistakes. It gives more weight to those three misclassified + points and creates a vertical line at D2. Again it says anything to the right of D2 is – and anything to the left is +. This time three – points are classified incorrectly.
Box 3: The next classifier increases the weight of the three misclassified – points and creates a horizontal line at D3, but it still fails to classify all the points correctly.
Box 4: Since Boxes 1, 2, and 3 are weak classifiers, they are combined to create the strong classifier of Box 4. It is a weighted combination of the weak classifiers and classifies all the points correctly.
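The four boxes can be sketched as a toy AdaBoost-style loop over decision stumps (single axis-aligned splits). This is an illustration of the reweighting idea only: XGBoost itself fits trees to gradients of a loss function rather than reweighting points, but the sequential-correction intuition is the same. The data below is a made-up 1-D example that no single split can classify correctly.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: a single axis-aligned split, like D1/D2/D3 above."""
    return np.where(polarity * X[:, feature] <= polarity * threshold, 1, -1)

def fit_stump(X, y, w):
    """Pick the split with the lowest *weighted* error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                err = np.sum(w * (stump_predict(X, feature, threshold, polarity) != y))
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

# Toy 1-D data: + for x in 1..3, - for 4..7, + for 8..10 --
# no single split gets everything right.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

w = np.full(len(y), 1.0 / len(y))         # Box 1: start with equal weights
stumps, alphas = [], []
for _ in range(4):                        # four weak classifiers, as in the boxes
    err, feature, threshold, polarity = fit_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # vote of this weak learner
    pred = stump_predict(X, feature, threshold, polarity)
    w *= np.exp(-alpha * y * pred)        # up-weight the misclassified points
    w /= w.sum()
    stumps.append((feature, threshold, polarity))
    alphas.append(alpha)

# Box 4: the strong classifier is a weighted vote of the weak ones.
scores = sum(a * stump_predict(X, f, t, p) for a, (f, t, p) in zip(alphas, stumps))
accuracy = float(np.mean(np.sign(scores) == y))
print("training accuracy:", accuracy)
```

After four rounds the weighted vote separates all three segments, even though every individual stump misclassifies part of the data.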
XGBoost Algorithm in Python
General Approach for Parameter Tuning
We will use an approach similar to that of GBM here:
1. Choose a relatively high learning rate and determine the optimum number of trees for it. Generally a learning rate of 0.1 works, but anywhere between 0.05 and 0.3 should work for different problems. XGBoost has a very useful function called "cv" which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the decided learning rate and number of trees. Note that we can choose different parameters to define a tree, and I'll take up an example here.
3. Tune the regularization parameters (lambda, alpha), which can help reduce model complexity and enhance performance.
4. Lower the learning rate and decide the optimal parameters.
Advantages of XGBoost:
1. Parallel Computing: XGBoost is enabled with parallel processing, which means that by default it uses all the cores of your machine when it runs.
2. Regularization: This is the biggest advantage of XGBoost. Regularization is a technique used to avoid overfitting in linear and tree-based models.
3. Enabled Cross-Validation: Usually we use external packages such as caret in R to obtain CV results, but XGBoost comes with an internal CV function.
4. Handling Missing Values: XGBoost has an in-built routine to handle missing values. By default it treats NaN as missing, and the user can supply a different placeholder value through the `missing` parameter. When XGBoost encounters a missing value at a node, it tries both branch directions and learns which path to take for missing values in the future.
5. High Flexibility: XGBoost allows users to define custom optimization objectives and evaluation criteria. This adds a whole new dimension to the model and there is no limit to what we can do.
6. Save and Reload: XGBoost gives us a feature to save our data matrix and model and reload it later.
7. Tree Pruning: Unlike GBM, where splitting stops as soon as a negative loss reduction is encountered, XGBoost grows the tree up to max_depth and then prunes backward until the improvement in the loss function is below a threshold.
8. Built-in Cross-Validation: XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run. This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
9. Continue on Existing Model: The user can start training an XGBoost model from the last iteration of a previous run.
Example of XGBoost:
Let us take the dataset "Churn_Modelling".
This data set contains details of a bank's customers; the target variable is a binary variable reflecting whether the customer left the bank (closed their account) or continues to be a customer.
'Age', 'CreditScore', and 'Balance' seem to follow a nearly normal distribution. 'Balance' has a large skew at low balances, which will throw off averages or medians calculated on this field. The other fields are mostly uniform in distribution. 'Exited' and 'NumOfProducts' are different, though; they look like decaying exponential functions.
Next, we plot correlations to see which features most strongly correlate with 'Exited', and to check whether any features are collinear or show other interesting correlations in the dataset. Here we see that 'Age', 'Balance', 'IsActiveMember', 'Gender', and 'NumOfProducts' are most correlated with 'Exited' (all still under +/- 0.3, so they are light correlations).
After analyzing the Churn Modelling classification data set, we can conclude that the model we created using XGBoost is 86.65% accurate. That means the model does a very good job of predicting whether a customer left the bank (closed their account) or continues to be a customer. A small number of observations are predicted as customers who will leave the bank, while the majority are predicted to remain customers.
Niharika Priyadarshini:
Niharika Priyadarshini holds a B.Tech in Computer Science. She is currently working as a Data Scientist Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.