ALGORITHM ALLEY

Building Decision Trees with the ID3 Algorithm

Andrew Colin

Andrew is a mathematician and proprietary trader in the Treasury, Commonwealth Bank of Australia, Sydney. He can be reached at 100036.1647@compuserve.com.

From network routers to games, decision making is a crucial part of many programs. In this column, Andrew deals with decision making at two levels. Most obviously, he is interested in constructing binary decision trees. Perhaps more interestingly, the algorithm Andrew discusses uses a metric to decide how to build the tree. This particular metric is grounded in the mathematics of information theory, but other decision algorithms use more ad hoc measurements, such as those used to evaluate moves in game-playing programs. If you'd like to write an article about an interesting metric or other decision method, contact me at Dr. Dobb's Journal.

--Tim Kientzle

A decision tree is a series of nested If/Then rules that users apply to data to elicit a result. For instance, a decision tree to assist in the diagnosis of automobile engine trouble might start out with general questions ("Does the engine start?") and use the outcome to determine which subsequent questions are asked:

1. Did the engine turn over? No.

2. Is the battery charged? Yes.

3. Is there a loose battery connection? Yes.

A decision tree is usually a straightforward tool to use. However, inducing rules from sets of past decisions to build such decision trees is a difficult problem, especially when we want to reach a useful answer in as few questions as possible.

In this article, I'll present one way of building such decision trees using the ID3 algorithm, originally developed by J. Ross Quinlan of the University of Sydney. This algorithm was first presented in 1975 (J.R. Quinlan, Machine Learning, vol 1. no. 1) and has proven to be so powerful that it has found its way into a number of commercial rule-induction packages. However, the underlying concepts are simple enough for ID3 to be implemented in only a few pages of code.

How does ID3 work? Suppose you have the animals ostrich, raven, albatross, and koala. For each one, you know the following attribute values: warm blooded, has feathers, has fur, can swim underwater, lays eggs. Based on these attributes, you are asked to find a set of rules to determine whether the animal lays eggs. In this case, a classifying attribute is readily apparent: If the animal has feathers, then it lays eggs. Consequently, the decision tree has just one question: "Does the animal have feathers?" If you add dolphin and crocodile to the list, this rule is incorrect, since the crocodile does not have feathers, but does lay eggs. By inspecting Table 1, you can amend this to the rules in Figure 1(a).

However, there are a number of ways to elicit the correct answer from a dataset. In Figure 1(b), the first question in the tree ("Does the animal have fur?") is poorly chosen, since it gives you little information about the answer. Clearly, some questions are more useful in this context than others. For more realistic datasets--where there are tens of attributes and thousands of samples--a well-chosen set of questions is of critical importance to the effectiveness of the tree.

The ID3 algorithm searches through the attributes of a dataset for the one that conveys the most information about the desired target. If the attribute classifies the target perfectly, then you stop. Otherwise, the process is repeated recursively on the two subsets generated by setting this "best" attribute to their on/off states.

Information theory provides one way to measure the amount of information conveyed by a particular attribute. The best attribute is the one with the lowest "negentropy," as defined in Figure 2. The ID3 algorithm picks the attribute with the lowest total negentropy.

To illustrate, I'll calculate the negentropies generated by Table 1. For example, there are five animals in the sample that are warm blooded, of which three lay eggs. The negentropy of the "on" (true) attribute is NE_on = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970951. There is one animal in the sample that is not warm blooded but does lay eggs. The negentropy of the "off" (false) attribute is NE_off = -(1/1) log2(1) - 0 = 0. The combined negentropy is the weighted sum (5 * NE_on + 1 * NE_off) / (5 + 1) = 0.809125. Table 2 shows the negentropies for all of the attributes.

None of the attributes has a negentropy of zero, so none classifies all cases correctly. However, the "feathers" attribute has the lowest negentropy and hence conveys the most information. This is what we use as the root attribute for our decision tree.

The process is now repeated recursively. All animals that have feathers lay eggs, so the negentropy for this subset is zero and there are no further questions to ask. When the animal does not have feathers, you compute the negentropies using only those animals to produce Table 3. The attribute to use for the subtree is therefore "warm blooded," which has the lowest negentropy in this sample. The zero value indicates that all samples are now classified correctly.

The algorithm can also be used for real-valued, rather than binary-valued, attributes. At the outset of processing, you scan all of the variables to select an attribute i and a value r so that the condition Attribute(i) >= r yields the lowest negentropy possible. The "yes/no" question is now of the form "Is attribute greater than or equal to r?" If the negentropy arising from this partition is zero, then the data was classified perfectly, and the calculation ends. Otherwise, the dataset is partitioned into two sets--one for which the first condition is true, and the other for false.

The process is repeated recursively over the two subsets until all negentropies are zero (perfect classification) or until no more splitting is possible and the negentropy is nonzero. In this case, classification is not possible and you must solicit more data from the user. Such a situation could occur when all attributes in two records have the same values, but the outcomes differ. More data must then be supplied to the program to allow it to discriminate between the two cases.

Tree Pruning

Most real-world datasets do not have the convenient properties shown here. Instead, noisy data with measurement errors or incorrect classification for some examples can lead to very bushy trees in which the rule tree has many special cases to classify small numbers of uninteresting samples.

One way to address this problem is to use "rule-tree pruning." Instead of stopping when the negentropy reaches zero, you stop when it reaches some sufficiently small value, indicating that you are near the end of a branch. This pruning leaves a small number of examples incorrectly classified, but the overall structure of the decision tree will be preserved. Finding the exact, nonzero cutoff value will be a matter of experimentation.

Implementing Decision Trees

In implementing the decision tree described here, I've represented it within the program (see Listing One, as a binary tree, constructed of NODE structures pointing to other NODEs or to NULLs for terminal nodes.

Rather than copying the data table for each partition, I pass the partially formed data tree to the routine that calculates negentropy, allowing the program to exclude records that are not relevant for that part of the tree. Negentropy of a partition is calculated in routine negentropy (see Listing Two), which is called for all attribute/threshold combinations by routine ID3 (Listing Three).

The ability to use real-valued as well as binary-valued attributes comes at a price. To ensure the correct value of r, we scan through all attribute values in the dataset--a process that can be quite computationally intensive for large datasets.

No claims are made for the efficiency of this implementation. For cases where many sample attribute values are the same, or where a mixture of real-valued and binary-valued attributes is to be considered, the user is probably better advised to sort the attributes into a list and to eliminate repeated values. I've also not considered the case where a question can have more than two outcomes.

Two illustrative datasets are available electronically; see "Availability," page 3. The first is a set of sample data from a botanical classification problem, in which a type of flower, an iris, is to be classified into one of two subgenera (Virgin, Setosa) according to the dimensions of sample pistils and stamens. The data is taken from M. James' book Classification Algorithms (Collins, 1985). Figure 3 shows the resulting decision tree.

The second dataset is a torture test for the algorithm. Given the attributes {John, Paul, George, Ringo}, which are random numbers between 0 and 1, and a target attribute that takes random values from {0, 1}, the program returns a large and complex tree that classifies 100 examples. On a 486/75, the calculation took about 220 seconds to run to completion. Most real-world datasets will produce much simpler decision trees than Listing Four.

The code (available electronically) includes the files ID3.C, ID3.H, PROTO.H. To build this program with Borland's Turbo C compiler, enter tcc id3.c at the DOS prompt. The code has also been compiled under Watcom C and Microsoft Visual C, so it should be fairly portable. To run the program, two datasets are required. The first contains the sample data in ASCII form, with values for the target attribute in the last column (0 or 1). The second dataset contains the names of the attributes, again in ASCII. For the iris dataset, these files have the names IRIS.DAT and IRIS.TAG. Enter id3 iris at the command prompt to run the program.

Conclusion

ID3 is a conceptually simple but powerful classification algorithm. Used in conjunction with other statistical and machine-learning techniques, this algorithm will form a valuable addition to your armory of data exploration tools.

Table 1: Various animal attributes.

Animal   Warm blooded  Feathers  Fur  Swims  Lays eggs

Ostrich       Yes         Yes    No    No       Yes

Crocodile     No          No     No    Yes      Yes

Raven         Yes         Yes    No    No       Yes

Albatross     Yes         Yes    No    No       Yes

Dolphin       Yes         No     No    Yes      No

Koala         Yes         No     Yes   No       No

Table 2: Negentropies for each attribute from Table 1.

Attribute    on_ctr  Hits  Off_ctr  Hits  Negentropy

Warmblooded    5       3      1       1    0.809125

Feathers       3       3      3       2    0.459148

Fur            1       0      5       4    0.601607

Swims          2       1      4       3    0.874185

Table 3: Negentropies of each attribute for animals that do not have feathers.

Attribute    on_ctr  Hits  Off_ctr  Hits  Negentropy

Warmblooded    2       0      1       0    0.0

Feathers       0       0      3       1    0.918296

Fur            1       0      2       1    0.666666

Swims          2       1      1       0    0.666666

Figure 1: Two different decision trees to answer the question: Does the animal lay eggs?

(a)  Does animal have feathers?
        Yes: Lays eggs (raven, albatross, ostrich)
        No : Is animal warmblooded?
               Yes: Does not lay eggs (koala, dolphin)
               No: Lays eggs (crocodile)
(b)  Does animal have fur?
        Yes: Does not lay eggs (koala)
        No:  Does animal have feathers?
                Yes: Lays eggs (raven, albatross, ostrich)
                No: Is animal warmblooded?
                        Yes: Does not lay eggs (dolphin)
                        No: Lays eggs (crocodile)

Figure 2: Definition of negentropy. p(ON) and p(OFF) are the measured probabilities of an answer being true or false.

If p(ON) and p(OFF) are both nonzero:
      -p(ON) log2 p(ON) - p(OFF)
                          log2 p(OFF)
Otherwise:
      0

Figure 3: Decision tree for irises.

Is the pistil width >= 1.80?
       Yes: class Setosa
      No : Is the pistil length >= 5.60?
               Yes: class Setosa
            No : class Virgin

Listing One

typedef struct node {
   UINT idx;            /* ID code for attribute */
   REAL threshold;      /* Numerical threshold for attribute test */
   struct node *on;     /* Address of 'on' node */
   struct node *off;        /* Address of 'off' node */
   struct node *parent;     /* Addess of parent node */
} NODE;

Listing Two

NEGENTROPY negentropy ( REAL **data, UINT n_samples, NODE *local, UINT target)
{
   /* Calculates the entropy of classification of an attribute, given a table 
    * of attributes already used, the attribute on which splitting is to be
    * taken, and the target attribute. Entropy is calculated in bits, so logs
    * are taken to base 2 by dividing by LN_2.
    * The returned value always lies in the (closed) range [0, 1].
    */
   NEGENTROPY ret_val;
   NODE *_node, *_parent;
   UINT on_ctr, off_ctr, p1, p2, i, _match;
   REAL p_on, p_off, negentropy_on, negentropy_off;
   on_ctr = off_ctr = p1 = p2 = 0;
   /* Scan through all supplied data samples */
   for (i=0; i<n_samples; i++) {
      /* If pattern matches the current position in the decision tree, then use
       * this vector. The match is determined by passing up the decision tree
       * and checking whether 'data[idx] >= threshold' matches at each step, 
       * where idx and threshold are taken from each node in turn.
       */
      _match = 1;
      _node = local;
      while (_node->parent != NULL) { /* If at root node, all entries match */
         _parent = _node->parent;
         if (_node == _parent->on) { /* if parent node is ON */
            if (data[i][_parent->idx] < _parent->threshold)
               _match = 0;
     }
         else
         if (_node == _parent->off) { /* if parent node is OFF */
            if (data[i][_parent->idx] >= _parent->threshold)
              _match = 0;
     }
      _node = _parent;
      }
      if (_match) {
         if (data[i][local->idx] >= local->threshold) {
            on_ctr++;
              if (data[i][target] >= 0.5)
               p1++;
         }
         else {
            off_ctr++;
            if (data[i][target] >= 0.5)
               p2++;
         }
      }
   }   /* for (i=0; i<n_samples; i++) */
   /* 1: Entropy of subtable with activation ON */
   /* We now have the numbers of samples that match this part of the decision
    * tree, & the number of samples for which the supplied condition are true.
    * From these quantities we can find the negentropy of this partition.
    */
   if (on_ctr > 0)
   {
      p_on  = (REAL) p1 / (REAL) on_ctr;
      p_off = 1 - p_on;
      negentropy_on = -entropy (p_on) - entropy (p_off);
   }
   else
      negentropy_on = 0.0;
   /* 2: Entropy of subtable with activation OFF */
   if (off_ctr > 0)
   {
      p_on  = (REAL) p2 / (REAL) off_ctr;
      p_off = 1 - p_on;
      negentropy_off = -entropy (p_on) - entropy (p_off);
   }
   else
      negentropy_off = 0.0;
   ret_val.ne = (negentropy_on * on_ctr + negentropy_off * off_ctr);
   ret_val.ne /= (on_ctr + off_ctr);
   /* If all values examined were the same, set 'ret_val.status' to
    * the target value since this will be an end-of-branch node
    */
   if ((p1 == on_ctr) && (p2 == off_ctr))
      ret_val.status = ON;
   else if ((p1 == 0) && (p2 == 0))
      ret_val.status = OFF;
   else
      ret_val.status = INACTIVE;
   return ret_val;
}

Listing Three

NODE* ID3 ( MATRIX *matrix, NODE* parent, UINT target, UINT state)
/* Routine to build a decision tree, based on Quinlan's ID3 algorithm. */
{
   NEGENTROPY negentropy_struct;
   NODE *node;
   UINT n_vars = matrix->width, n_samples = matrix->height, i, j, split;
   REAL **data = matrix->data;
   REAL best_threshold, min_negentropy, _negentropy;
   /* Allocate memory for this node */
   node = (NODE*) malloc (sizeof(NODE));
   if (!node)
      err_exit (__FILE__, __LINE__);
   /* Set up links in decision tree */
   node->parent = parent;  /* Set address of parent node */
   if (parent != NULL) /* parent to child; not relevant for root node */
   {
      /* Pass address of this node to the parent node */
      if (state == ON)
     parent->on = node;
      else
      if (state == OFF)
     parent->off = node;
   }
   /* Select attribute with lowest negentropy for splitting. Scan through
    * ALL attributes (except target) and ALL data samples. This is inefficient
    * for data sets with repeated values, but will do for illustrative purposes
    */
   min_negentropy = 1.0;
   for (i=0; i<n_vars; i++) {
      for (j=0; j<n_samples; j++) {
         if (i != target) {
            /* Set trial values for this node... */
            node->idx = i;
            node->threshold = data[j][i];
            /* ...and calculate the negentropy of this partition */
            negentropy_struct = negentropy (data, n_samples, node, target);
            _negentropy = negentropy_struct.ne;
        /* If this negentropy is lower than any other, retain the
               index and threshold for future use */
            if (_negentropy < min_negentropy) {
           min_negentropy = _negentropy;
               split = i;
               best_threshold = data[j][i];
            }
         } /*if (i != target)*/
      } /*for (j=0; j<n_samples; j++)*/
   } /*for (i=0; i<n_vars; i++)*/
   /* Save the combination of best attribute and threshold value */
   node->idx = split;
   node->threshold = best_threshold;
   /* If the negentropy routine found itself at an end-of-branch
    * for the decision tree, the 'status' flag in 'negentropy_struct'
    * is set to ON or OFF and the node labelled accordingly. Otherwise,
    * ID3 continues to call itself until all end-of-branch nodes are found.
    */
   if  (negentropy_struct.status != INACTIVE) {
      node->on = node->off = NULL;
      node->idx = negentropy_struct.status;
   }
   else
   {
      node->on  = ID3 (matrix, node, target, ON);
      node->off = ID3 (matrix, node, target, OFF);
   }
   return node;
}

Listing Four

Welcome to ID3
Last compiled on Dec 28 1995, 09:15:09
if { Ringo >= 0.33 then if { George >= 0.63 then if { Paul >= 0.58 then 
if { George >= 0.99 then OFF else ON }  else if { Ringo >= 0.77 then 
if { Paul >= 0.13 then ON else OFF }  else if { John >= 0.52 then 
if { Ringo >= 0.57 then ON else if { John >= 0.79 then ON 
else OFF } }  else OFF }  }  }  else ON }  else if { George >= 0.34 
then if { John >= 0.52 then if { Paul >= 0.43 then if { George >= 0.76 
then OFF else if { John >= 0.65 then ON else if { John >= 0.52 
then if { Paul >= 0.79 then ON else OFF }  else ON }  }  }  else OFF }  
else OFF }  else if { John >= 0.49 
then ON else if { Ringo >= 0.08 then if { John >= 0.01 
then ON else OFF }  else OFF }  }  }  }

Back to Table of Contents