
The Fundamental Tool in Machine Learning: Decision Trees

Updated: Aug 5

Decision trees[1] are a fundamental tool in the field of machine learning, serving as both a powerful analytical method and an educational example for understanding more complex algorithms. Their simplicity and intuitive structure make them an excellent starting point for anyone looking to delve into the world of data analysis and predictive modeling.


One of the primary reasons decision trees are so valuable is that they provide a clear and straightforward way to visualize decision-making processes. Each node in the tree represents a decision point based on a specific feature of the data, and the branches indicate the possible outcomes of these decisions. This transparent structure not only helps in understanding how the model works but also in identifying the most important features influencing the predictions.


Decision trees are also commonly used as a baseline model in machine learning projects. Their simplicity allows them to serve as a quick, easily trained baseline, providing a point of reference against which more complex models can be compared. By starting with a decision tree, data scientists can quickly establish a performance benchmark and then explore whether more sophisticated techniques, such as neural networks or gradient boosting, offer significant improvements.


Moreover, the concept of decision trees extends into more advanced techniques. One notable example is the Random Forest algorithm, which is essentially an ensemble of decision trees. Random Forests leverage the power of multiple trees to improve predictive accuracy and control overfitting. By aggregating the results of numerous trees, a Random Forest model can capture a broader range of patterns in the data, leading to more robust and reliable predictions. This ensemble method is widely regarded as one of the best techniques in machine learning due to its versatility and high performance across various types of data and problems.
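To give a feel for how little code this takes in practice, here is a minimal sketch using scikit-learn's RandomForestClassifier. The dataset (scikit-learn's built-in iris data) and the hyperparameter values are illustrative placeholders, not recommendations for any particular problem.

# A minimal sketch: training a Random Forest (an ensemble of decision trees)
# with scikit-learn. Dataset and hyperparameters are illustrative only.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of decision trees whose predictions are aggregated.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))

Increasing n_estimators adds more trees to the ensemble, which generally stabilizes predictions at the cost of training time.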


A Practical Example


Decision trees are not only useful for understanding algorithms and providing baseline models but also for approximating complex decision-making criteria in real-world scenarios. Let's explore how a decision tree can be employed to approximate a neighbor's decision on whether to go for a walk on Saturday mornings based on weather conditions.


Imagine a neighbor who sometimes decides to go for a walk on Saturday mornings, but the exact criteria for this decision are unknown. We suspect that the decision is influenced by the weather. Our goal is to develop a function, "Walk," that approximates this decision-making process based on observable weather variables.

To achieve this, we define the function "Walk" as follows:


Walk = f(Sky, Temperature, Humidity, Wind) → {yes, no}
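Read as code, the same mapping could be sketched (purely as an illustration) with a Python signature; the attribute values used in the type hints are the ones introduced in the next section.

from typing import Literal

Sky = Literal["sunny", "cloudy", "rainy"]
Temperature = Literal["hot", "mild", "cold"]
Humidity = Literal["normal", "high"]
Wind = Literal["weak", "strong"]

def walk(sky: Sky, temperature: Temperature, humidity: Humidity, wind: Wind) -> Literal["yes", "no"]:
    """Approximation of the neighbor's (unknown) decision rule, to be learned from data."""
    ...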


Data Collection and Variables


We start by gathering data over several weeks, recording whether the neighbor went for a walk and noting the weather conditions on each Saturday morning. The variables we record include:


  • Sky: {sunny, cloudy, rainy}

  • Temperature: {hot, mild, cold}

  • Humidity: {normal, high}

  • Wind: {weak, strong}



Example of data collected

Problem Framework


In the context of machine learning, we define our problem using the following components:


  • Instance Set (X): The set of all days we have recorded. Each instance (day) has attributes: Sky, Temperature, Humidity, and Wind.

  • Target Function (f): The ideal function we aim to approximate, which maps the weather conditions to the neighbor's decision (yes or no).

  • Hypothesis Space (H): The set of all possible functions that could map our instance set X to the target set Y. For example, any decision tree that could model the neighbor's decision.

  • Training Data: The recorded observations, denoted as



{(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}

where x(i) represents the weather conditions on the i-th Saturday and y(i) represents whether the neighbor went for a walk (see the code sketch after this list).

  • Output of the Learning Algorithm: The learning algorithm outputs a hypothesis (or model) h ∈ H that approximates the function f. In our case, the model would be the function "Walk".
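To make this concrete, here is one possible in-code representation of the training data, matching the notation above: each instance x(i) is a mapping from attribute names to observed values and each label y(i) is the neighbor's decision. The two rows shown are illustrative placeholders, not the actual recorded observations.

# Each training example pairs the observed weather x(i) with the decision y(i).
# These two rows are illustrative placeholders only.
training_data = [
    ({"Sky": "sunny", "Temperature": "mild", "Humidity": "normal", "Wind": "weak"}, "yes"),
    ({"Sky": "rainy", "Temperature": "cold", "Humidity": "high", "Wind": "strong"}, "no"),
]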


Learning Algorithm

The learning algorithm takes the training data as input and outputs a hypothesis h ∈ H. This hypothesis is a model that approximates the target function f. In our case, the decision tree algorithm will analyze the training data and produce a model that predicts whether the neighbor will go for a walk based on the weather conditions.


Model Output


The output of our learning algorithm is a decision tree that serves as the function "Walk." This tree will help us predict the neighbor's walking behavior given new weather conditions. By splitting the data based on the values of Sky, Temperature, Humidity, and Wind, the decision tree identifies patterns and relationships that approximate the neighbor's decision-making criteria.


Constructing a Decision Tree: Step-by-Step Guide


Building a decision tree involves systematically selecting attributes and placing them in the tree to effectively classify the instances based on the target variable. Let’s delve into the process of constructing a decision tree using our example of predicting a neighbor's decision to go for a walk based on weather conditions.


Step-by-Step Tree Construction


1. Selecting the Root Node

The construction of a decision tree begins with choosing an attribute to serve as the root node. This is the starting point of our tree, where the initial decision is made.


  • Root Node: In our example, we start with the attribute Sky because it has a significant influence on the decision.

2. Adding Branches

From the root node, branches represent the different possible values of the chosen attribute.


  • Branches from Sky: The attribute Sky has three possible values: sunny, cloudy, and rainy. We create branches for each of these values.

3. Creating Nodes and Leaves

At each branch, we further divide the data by selecting additional attributes, continuing this process until we reach a point where no further splitting is needed. These endpoints are called leaves, which provide the final classification.


  • Leaf Node for Cloudy:

    • For the branch where Sky = cloudy, the outcome is always yes (Y = yes). This branch directly leads to a leaf node because the decision is definitive.

  • Node for Sunny:

    • For the branch where Sky = sunny, we consider the next attribute Humidity.

      • If Humidity = normal, the neighbor goes for a walk (Y = yes).

      • If Humidity = high, the neighbor does not go for a walk (Y = no).

  • Node for Rainy:

    • For the branch where Sky = rainy, we consider the attribute Wind.

      • If Wind = weak, the neighbor goes for a walk (Y = yes).

      • If Wind = strong, the neighbor does not go for a walk (Y = no).



Example of the resulting tree
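Because every split above is explicit, the resulting model can also be written out directly as nested conditionals. The sketch below simply transcribes the tree (Sky at the root, Humidity under sunny, Wind under rainy); it is the learned function "Walk" expressed as code.

def walk(sky, temperature, humidity, wind):
    # Temperature is accepted but never used: the tree above does not split on it.
    if sky == "cloudy":                               # pure leaf: always a walk
        return "yes"
    if sky == "sunny":                                # second split on Humidity
        return "yes" if humidity == "normal" else "no"
    return "yes" if wind == "weak" else "no"          # sky == "rainy": split on Wind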

Algorithm: A Detailed Approach


Constructing a decision tree involves a series of methodical steps that can be represented through pseudocode, providing a clear understanding of the logical flow. This algorithm selects the best attributes, creates nodes and branches, and classifies instances effectively until the data is sufficiently classified.


function BuildDecisionTree(instances, attributes):
    if instances are perfectly classified:
        return Leaf Node with classification
    if no more attributes:
        return Leaf Node with majority classification
    A ← selectBestAttribute(attributes, instances)
    Tree ← createNode(A)
    for each value v of A:
        subset ← instances where A = v
        if subset is empty:
            ChildNode ← Leaf Node with majority classification of instances
        else:
            ChildNode ← BuildDecisionTree(subset, attributes - A)
        addBranch(Tree, v, ChildNode)
    return Tree

function selectBestAttribute(attributes, instances):
    bestAttribute ← null
    bestScore ← -infinity
    for each attribute in attributes:
        score ← calculateScore(attribute, instances)
        if score > bestScore:
            bestScore ← score
            bestAttribute ← attribute
    return bestAttribute

The primary function in our algorithm is BuildDecisionTree, which takes two parameters: the set of instances (data points) and the list of available attributes. The process begins by checking if the instances are perfectly classified. If they are, a leaf node with the appropriate classification is returned, signifying the end of that branch. If no more attributes are available and the instances are not perfectly classified, a leaf node with the majority classification of the instances is created.


If neither of these base cases is met, the algorithm selects the best attribute for splitting the data. This is done using the selectBestAttribute function, which evaluates each attribute based on a scoring method (such as Information Gain or Gini Index) and chooses the one that best classifies the data. The selected attribute becomes the decision attribute for the current node, and a new tree node is created.


For each possible value of the chosen attribute, the algorithm creates a branch and identifies the subset of instances that correspond to that value. If this subset is empty, a leaf node with the majority classification of the original instances is created. If the subset is not empty, the BuildDecisionTree function is called recursively to construct the subtree for that subset. This process of selecting attributes, creating branches, and recursively building subtrees continues until all instances are sufficiently classified.


The helper function selectBestAttribute plays a crucial role in this process. It iterates over each attribute, calculates a score based on how well that attribute classifies the instances, and selects the attribute with the highest score. The specific method for calculating this score is implemented in the calculateScore function, which could use measures such as Information Gain or the Gini Index to evaluate the effectiveness of each attribute.


By systematically selecting the best attributes and creating branches for all possible values, the decision tree algorithm ensures that the data is split in a way that maximizes classification accuracy. This step-by-step approach results in a robust predictive model that can effectively classify new instances based on the learned decision criteria.
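For readers who prefer something runnable, below is one possible Python rendering of the pseudocode, kept deliberately close to its structure. Instances follow the (features, label) representation sketched earlier, domains lists the possible values of each attribute, and the scoring function here is only a stand-in (majority-class agreement); a real implementation would normally plug in Information Gain or the Gini Index, described next.

from collections import Counter

def majority_label(instances):
    # Most frequent label among (features, label) pairs.
    return Counter(label for _, label in instances).most_common(1)[0][0]

def calculate_score(attribute, instances):
    # Placeholder score: fraction of instances that fall into the majority class
    # of their branch after splitting on `attribute`. Swap in Information Gain
    # or the Gini Index for a standard implementation.
    branches = {}
    for features, label in instances:
        branches.setdefault(features[attribute], []).append(label)
    hits = sum(Counter(labels).most_common(1)[0][1] for labels in branches.values())
    return hits / len(instances)

def select_best_attribute(attributes, instances):
    return max(attributes, key=lambda a: calculate_score(a, instances))

def build_decision_tree(instances, attributes, domains):
    labels = {label for _, label in instances}
    if len(labels) == 1:                  # instances are perfectly classified
        return labels.pop()
    if not attributes:                    # no more attributes to split on
        return majority_label(instances)
    best = select_best_attribute(attributes, instances)
    tree = {best: {}}
    for value in domains[best]:
        subset = [(f, l) for f, l in instances if f[best] == value]
        if not subset:                    # empty branch: use the parent's majority
            tree[best][value] = majority_label(instances)
        else:
            remaining = [a for a in attributes if a != best]
            tree[best][value] = build_decision_tree(subset, remaining, domains)
    return tree

Called as build_decision_tree(training_data, ["Sky", "Temperature", "Humidity", "Wind"], domains), with domains mapping each attribute to its possible values, this returns a nested dictionary in which each key is a decision attribute, each branch is one of its values, and each leaf is "yes" or "no".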


Selecting the Best Attribute 


When constructing a decision tree, one of the key tasks is to determine the best attribute to use at each node to split the data. The goal is to increase the homogeneity of the resulting subnodes, or in other words, to increase their purity with respect to the target variable. There are several methods to achieve this, two of the most common being the Gini Impurity and Information Gain.


Gini Impurity


It is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the set. The Gini Impurity for a set of items is calculated as follows:



Gini(D) = 1 − Σ_{i=1}^{C} (p_i)^2

where:


  • D is the dataset,

  • C is the number of classes,

  • p_i is the probability of choosing an element of class i.

The Gini Impurity score ranges from 0 (perfect purity) to 0.5 (maximum impurity for a binary classification).


When a node is split on an attribute, the Gini Impurity of the resulting subnodes is calculated and the weighted average of these impurities is used to determine the overall impurity of the split. The attribute that results in the lowest Gini Impurity is chosen as the best attribute for that node.
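As a small illustration of both steps (the impurity of a single node, then the weighted impurity of a split), the sketch below computes the Gini Impurity from lists of class labels; the example labels are invented for demonstration.

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum over classes of (p_i)^2
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(subsets):
    # Weighted average of the subnodes' impurities; the lowest value wins.
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

print(gini(["yes", "yes", "yes", "no", "no"]))                # 0.48 before the split
print(gini_of_split([["yes", "yes", "yes"], ["no", "no"]]))   # 0.0: both subnodes are pure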


Information Gain


Information Gain[2], on the other hand, is based on the concept of entropy from information theory[3]. Entropy is a measure of uncertainty or impurity in the dataset. In the context of a decision tree, entropy quantifies the disorder or impurity in a collection of instances.


For a binary classification problem, the entropy E(S) of a set S is defined as:



E(S) = −p_1 log2(p_1) − p_2 log2(p_2)

where:


  • p_1 is the proportion of instances in class 1,

  • p_2 is the proportion of instances in class 2.

Information Gain measures the reduction in entropy achieved by partitioning the dataset according to an attribute. It is calculated as:



Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)

where:

  • S is the original dataset,

  • A is the attribute,

  • S_v is the subset of S for which attribute A has value v,

  • Values(A) are the possible values of attribute A,

  • E(S) is the entropy of the original set S,

  • E(S_v) is the entropy of the subset S_v,

  • |S_v|/|S| is the weight of the subset S_v.

The attribute with the highest Information Gain is chosen as the best attribute for splitting the node, as it reduces the uncertainty the most, creating the most homogeneous subgroups.
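These quantities are straightforward to compute. The sketch below implements the entropy E(S) and Gain(S, A) for instances in the (features, label) representation used earlier; it is an illustrative helper rather than a full library.

import math
from collections import Counter

def entropy(labels):
    # E(S) = -sum over classes of p * log2(p); classes absent from S contribute 0.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, attribute):
    # Gain(S, A) = E(S) - sum over values v of |S_v|/|S| * E(S_v)
    labels = [label for _, label in instances]
    n = len(instances)
    branches = {}
    for features, label in instances:
        branches.setdefault(features[attribute], []).append(label)
    weighted = sum(len(sub) / n * entropy(sub) for sub in branches.values())
    return entropy(labels) - weighted

Computing information_gain for each candidate attribute and picking the largest value is exactly the selection rule described above.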


Overfitting 


It is a common problem in decision tree models. It occurs when the model learns the training data too well, capturing not only the underlying patterns but also the noise and random fluctuations. This results in a model that performs exceptionally well on the training data but poorly on new, unseen data.


It happens primarily due to the complexity of decision trees. As the tree grows, it creates many branches to perfectly classify the training data. This complexity allows the tree to capture noise and outliers, making it overly specialized to the training data. When the dataset is small or has a lot of variability, the tree is more likely to overfit because it tries to account for every detail and anomaly in the limited data.


The main issue is poor generalization. An overfitted model fails to generalize to new data, leading to inaccurate predictions. This happens because the model has learned the specific details and noise of the training data, which do not apply to unseen data. Additionally, overfitted models tend to be more complex and harder to interpret, making them less useful in practical applications.



Example of prediction on new data when you have an overfitting tree and when you don't.

Solutions


Several strategies can be employed to mitigate overfitting in decision trees. One effective approach is to define stopping criteria for the tree's growth. By setting conditions under which the tree construction will halt, such as when further splits do not significantly improve the purity of the subnodes or when the number of instances in a node falls below a certain threshold, we can prevent the tree from becoming overly complex.


Another method to control overfitting is to limit the maximum depth of the tree. By capping the depth, we restrict the tree's ability to capture noise and outliers, maintaining a balance between capturing important patterns and avoiding overfitting. This ensures the model remains simpler and more generalizable.


Pruning is another technique used to combat overfitting. Pruning can be done in two ways: pre-pruning and post-pruning. Pre-pruning, also known as early stopping, involves halting the tree growth early, before it becomes too complex. Post-pruning, on the other hand, involves building the full tree and then removing branches that do not contribute significantly to the model's performance. Post-pruning is typically guided by evaluating the performance of the tree on a separate validation dataset. By removing unnecessary branches, the model becomes less complex and more generalized.


Finally, post-pruning rules can be employed. After constructing the decision tree, it can be converted into a set of rules. These rules are then optimized based on their accuracy. By simplifying the rules and removing those that do not add significant value, the model's generalization ability can be improved. This process helps in refining the model to ensure it captures only the essential patterns in the data, leading to better performance on unseen data.
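In practice, libraries expose these controls as hyperparameters. As an illustrative (not prescriptive) example, scikit-learn's DecisionTreeClassifier lets you cap the depth, require a minimum number of instances per split or leaf, and apply cost-complexity post-pruning; the values below are placeholders chosen only to show the mechanism, and would normally be tuned on a validation set.

from sklearn.tree import DecisionTreeClassifier

# Each argument limits tree growth in one of the ways described above.
tree = DecisionTreeClassifier(
    max_depth=4,            # cap the maximum depth of the tree
    min_samples_split=10,   # do not split nodes with fewer than 10 instances
    min_samples_leaf=5,     # require at least 5 instances in every leaf
    ccp_alpha=0.01,         # strength of cost-complexity post-pruning
)
# tree.fit(X_train, y_train)   # X_train / y_train: your own training data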


Advantages and Disadvantages


As with any machine learning method, decision trees have their strengths and weaknesses, so let's review them.


Advantages


One of the most significant advantages of decision trees is their interpretability. The structure of a decision tree is straightforward and intuitive, making it easy to understand and explain. This interpretability aligns closely with human decision-making processes, as decision trees mimic the way humans think through choices by breaking down decisions into a series of simple, sequential questions.


Decision trees are also useful for identifying the importance of various variables. In datasets with hundreds of variables, decision trees can highlight which variables play a crucial role in the decision-making process. This capability is particularly valuable in fields like medicine, finance, and marketing, where understanding the impact of different factors is essential.


Another advantage is their flexibility with different types of inputs. Decision trees can handle both numerical and categorical data, making them versatile tools for various applications.


Disadvantages 


Despite their many advantages, decision trees also have notable disadvantages. One major drawback is that they often achieve lower predictive accuracy than more complex models such as ensemble methods, neural networks, or support vector machines, which can be a significant limitation in tasks that demand high accuracy.


Another critical disadvantage is their high variance. Decision trees are sensitive to small changes in the data; even a minor modification can drastically alter the structure of the tree. This sensitivity can lead to instability, making the model unreliable in certain situations. To mitigate this issue, techniques such as pruning, ensemble methods like Random Forests, or boosting can be employed, but these add layers of complexity and reduce the simplicity and interpretability that are the hallmarks of decision trees.

Conclusion


Decision trees offer several benefits, including interpretability, alignment with human decision-making, the ability to identify important variables, and versatility with different types of inputs. However, they also have limitations, such as lower precision compared to other models and high variance due to sensitivity to data changes. Understanding these pros and cons is crucial for effectively applying decision trees in various real-world scenarios and leveraging their strengths while addressing their weaknesses.


References


[2] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.

[3] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.


