Introduction to Decision tree:
Decision tree is a tree model to make different predictions. It features upside down tree.
A Decision tree splits the data into multiple sets. After that each of the sets are further splited into different subsets to conclude at decision. It is a very natural decision making process asking a series of questions in a nested if-then-else structure. On each node you ask a question to further split the data held by the node. If the test passes, you go left; otherwise you go right.
If a test splits data into more than two partitions, this is called as “Multiway decision tree”.
Let us consider a very basic example that uses titanic data set for predicting whether a passenger will survive or not. Let us take three attributes from the data set namely age, sex and sibsp.
A decision tree is drawn upside down with its root at the top. In the image above bold text represents a condition/ internal node, based on which tree splits into branches/ edges. The end of the branch that doesn’t split anymore is the decision/leaf, in this case whether the passenger survived or died, represented as red and green respectively.
The feature importance is clear and relations can be viewed easily. This methodology is more commonly known as learning decision tree from data and above tree is called Classification tree as the target is to classify passenger as survived or died. Regression trees are represented in the same manner, just each leaf represents a linear regression model that predicts continuous values like price of a house. In general, Decision Tree algorithms are referred to as CART or Classification and Regression Trees.
Algorithm used for Decision tree construction:
Concept of Homogeneity:
What Decision tree construction algorithm will try to do is to create a split in such a way that the homogeneity of different pieces must be as high as possible.
If the above fig denotes a branch after a split and H be its homogeneity. The decision tree checks whether H > some threshold then there is no further split. If the H < threshold then there will be further split. This process will be continued where the label is sufficiently homogeneous then a leaf is created.
So we go step by step, picking an attribute and splitting the data such that the homogeneity increases after every split. We stop splitting when the resulting leaves are sufficiently homogeneous.
There are various ways to quantify homogeneity, such as the Gini Index, Information gain, Entropy etc.
The homogeneity measure used in building decision tree in CART is Gini Index.
Gini Index uses the probability of finding a data point with one label as an indicator for homogeneity. If the dataset is completely homogeneous, then the probability of finding a data point with one of the labels is 1 and the probability of finding a data point with the other label is 0. Gini index is defined as,
Note that the Gini Index is maximum when pi=1 for exactly one of the classes and all others are zero. So, higher the Gini Index higher the homogeneity. In a Gini based decision tree algorithm, we therefore find the split that maximizes the weighted sum(weighted by the size of the partition) of the Gini indices of the two partitions created by the split.
Let us see an example to illustrate about Gini Index. We want to build a model for whom among the people in an organization plays football. Each employee has two explanatory attributes Gender and Age. The target attribute is whether they play football. The below figure illustrates the dataset, the number against P and N indicate the number of employees who play football and those who doesn’t respectively.
1. Split on gender:
The two partitions will have 10/500 and 300/500 as the probabilities of finding a football player respectively. Each partition is half the total population.
2. Split on Age:
The two partitions will have 260/700 and 50/250 as the probabilities, and 700 and 300 as the sizes respectively, giving us a Gini Index of:
Therefore we would first need to split on the gender. This split gives a higher Gini Index for the partitions. Gini Index can only be used on classification problems where the target attribute is categorical.
Entropy and Information Gain:
Entropy quantifies the degree of disorder in data. Entropy is always positive number between 0 and 1. Another interpretation of Entropy is in terms of information content. A completely homogeneous data has no information content in it whereas a dataset with a lot of disorder has a lot of latent information waiting to be learnt. Assume the dataset consists of only categorical attributes, both the explanatory variables and the class variable. In terms of probabilities of finding data points belonging to various classes, entropy for a dataset D is defined as
Notice that the entropy is 0 if and only if for some i pi=1 and all other pi=0 i.e., when the dataset is completely homogeneous. Consider a k-valued attribute A of the dataset. Suppose we partition the dataset into groups where each group DA=I consists of all the data points for which the attribute A has value i. The weighted average entropy if we partition the dataset based on the values of A is
This is also the expected entropy of the partition if the dataset is split on the different values of attribute A. This corresponds to a multiway split i.e., partitioning the dataset into groups, each of which is filtered on one value of the splitting attribute. Entropy based algorithms therefore, at each state, find the attribute on which the data needs to be split to make the entropy of the partition minimum. In practice a slightly modified measure called Information Gain is used. Information Gain, denoted Gain(D,A), is the expected reduction in entropy for the collection of data points D if we filter on a specific value of the attribute A. Information Gain is a direct homogeneity measure, higher the information gain, higher the homogeneity achieved.
Information Gain, Gain(D,A) is defined as,
In entropy based methods, the partition is carried out on the attribute that maximizes the information gain.
For example we can find the information gain for the football example, which we taken above.
The entropy of that data is calculated as,
Next we will find out the entropy after partitioning.
1. Split on gender: the two partitions will have 10/500 and 300/500 as the probabilities of finding a football player respectively. Each partition is half the total population. Therefore
2. Split on age: the two partitions will have 260/700 and 50/250 as the probabilities and 700 and 300 as the sizes respectively, giving us entropy for the partition as:
Gender split gives much information gain. So we must split on gender.
Splitting by R2:
Gini Index, Entropy and Information gain are used for the splitting where target variable is discrete. When the target variable is continuous then we use R2 value for splitting, here the leaf nodes are the regression models.
Split the data such that the R2 if the partitions obtained after splitting is greater than that of the original or parent data set. In other words, the fit of the model should be as good as possible after splitting.
When to stop splitting?
As a problem usually has a large set of features, it results in large number of split, which in turn gives a huge tree. Such trees are complex and can lead to overfitting. So, we need to know when to stop?
Here we have two methods namely Truncation and Pruning.
Stop the tree while it is still growing so that it may not end up with leaves containing very low data points. One way to do this is to set a minimum number of training inputs to use on each leaf. For example we can use a minimum of 10 passengers to reach a decision (died or survived), and ignore any leaf that takes less than 10 passengers. Another way is to set maximum depth of your model. Maximum depth refers to the length of the longest path from a root to a leaf.
Let the tree grow to any complexity. Then, cut the branches of the tree in a bottom-up fashion, starting from the leaves.
Advantages and Disadvantages of Decision Tress:
- Simple to understand, interpret, visualize.
- Decision trees implicitly perform variable screening or feature selection.
- Can handle both numerical and categorical data. Can also handle multi-output problems.
- Decision trees require relatively little effort from users for data preparation. Nonlinear relationships between parameters do not affect tree performance.
- Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
- Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
Decision Tree using R:
In R we use rpart() function to build a decision tree.
rpart (formula, data, parms=list(), control=rpart.control())
formula Is the formula to show the relation between dependent and independent variables
data is the name of the dataset
method Specifies whether it is classification or regression tree.
parm Specifies the homogeneity measure. “information” for information gain, “gini” for gini index.
control a list of options that control details of rpart algorithm.
The options are,
minsplit This is the minimum size of partition for splitting. If you set the minsplit=30, the node with less than 30 observations will not be split further.
cp Complexity parameter. This is minimum percentage reduction in the error after a split, i.e., the error should reduce by at least the cp; otherwise, the split will not be attempted. For example, if you set the cp=0.02, a node will be split only if the error after the split reduces by atleast 2%.
maxdepth This is the maximum depth to which a tree can grow.
minbucket This is the minimum number of observations a terminal node should have.
Let us take the dataset “readingSkills” in “party” package.
Let us observe first few columns to understand about the dataset.
We can also see the structure of the data set. It has 200 observations with variables.
Here we are going to build a decision tree such that we are going to predict whether the given age, shoe size and score of a person will be a native speaker or not.
Now we divide the data into train and test data and use the train data to build the model and test data to validate the model.
We need following libraries to build a decision tree using rpart().
Build a decision tree.
Plot the decision tree.
Now the model is built, we will try to predict the dependent variable nativespeaker using the above model for the test data.
Now we have to compare the actual and the predicted values. So we build a confusion matrix and find the accuracy.
From the confusion matrix the accuracy can be measured as (number of observations correctly predicted)/(Total number of observations).
The accuracy value we get is nearly 74.7%.
Truncation and Pruning using “Complexity Parameter”:
Complexity parameter is the minimum reduction of error after a split. We will introduce a cp of 0.01 for the above tree model.
In the similar way pruning can be done after constructing the tree.
Here in this example we cannot find any difference in the decision tree constructed originally and the tree which is truncated and pruned because the number of variables is very less here. If it is a complex tree we would have found the difference in the decision trees.
Now we will evaluate the model using ROC curve. For that we need ROCR library.
First we have to predict the values of dependent variable “nativeSpeaker” for the test data in terms of probabilities. After prediction we get the probabilities of both “no” and “yes” for each column. But as we are predicting “yes” probabilities we have to take the probabilities of “yes” column. After that we have to make prediction object using the predicted probabilities and the actual values of the “nativeSpeaker” column of the test data. Using this prediction object we find the performance and plot it.
The curve shows good performance that the area under the curve is greater than 0.6.
We can even measure the area under the curve.
We find that the area under the performance curve is 0.74 which is a good indicator that the decision tree which we have built is a good model.
Note: If the data set is large then there will be better accuracy.
About Bhagavathi :
Bhagavathi is B.Tech (Electronics and Communication Engineering). Currently she is working as Analyst Intern with Nikhil Guru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.