Binary Trees

Volume Number:		6
Issue Number:		8
Column Tag:		Language Translation

Binary Trees

By Clifford Story, Mouunt Prospect, IL

Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.

A. Introduction

This article, Part V, I’m going to take an excursion away from language translation and develop some support routines. In particular, I’m going to create routines to handle binary trees; later, I’ll use these routines to implement a symbol table for the new Canon tool and the fabled Inline tool.

There are a couple of “Why?” questions to answer. First question, why fill MacTutor with a long article on something that is taught in every college computer department in the U.S.? Ah, but not all of us have computer degrees; I don’t, for example. And while you can find the subject in various books, I couldn’t find a decent explanation of balanced binary trees -- I had to figure it out from hints I gleaned from Knuth’s discussion (Searching and Sorting, pages 451 -- 457). Second question, why a binary tree? Aren’t symbol tables usually done with hash tables? Well, yes, this is my impression, and Aho and Ullman state that hashing is “[t]he most efficient and commonly used method” for maintaining symbol tables (The Theory of Parsing, Translation, and Compiling, volume II, page 793). So why a binary tree? I’m sorry, I haven’t got an answer for that one...

If you should ever write your own compiler, I guess you should use a hash table. After all, Aho & Ullman’s opinion carries a lot of weight. In the meantime, binary trees are fun, so sit up and enjoy!

B. Simple Binary Trees

A binary tree is a tree with at most two sub-trees at each node. These are commonly called the left and right sub-trees. Searching a binary tree is very simple: take the left branch if the search key is less than the node’s value, and the right branch if it is greater. Keep going until you get either an exact match (in which case the search succeeds), or a nil pointer (in which case the search fails).

The example program I have written implements search and insertion, as well as a recursive dump routine. It does not do deletions. I haven’t any use for deletions in a symbol table, so I have left that part as an exercise...

B(1). Node Structure

Each node has four fields: the node’s value or key, a pointer to the left sub-tree, a pointer to the right sub-tree, and the data. Here is the node structure I will use in my example:

/* 1 */

typedef struct node
 {
 char   *key;
 struct node*left;
 struct node*right;
 char   *data;
 } node;

The data for this example is some arbitrary character string.

I’ll going to create new nodes with the Mac trap NewPtr. I would usually use an array of nodes with a free list, and allocate them myself; I’m using NewPtr in this case for simplicity’s sake.

/* 2 */

// createnode
 
node *createnode(char *thekey, 
 char *thedata)
 {
 
 node   *thenode;
 int    thelength;
 
 thenode = (node *)
 NewPtr(sizeof(node));
 if (thenode == nil)
 return(nil);
 
 thelength = 1 + strlen(thekey);
 thenode->key = (char *)
 NewPtr(thelength);
 if (thenode->key == nil)
 {
 DisposPtr((Ptr)thenode);
 return(nil);
 }
 BlockMove((Ptr)thekey, 
 (Ptr)thenode->key, thelength);
 
 thenode->left = nil;
 thenode->right = nil;
 
 thelength = 1 + strlen(thedata);
 thenode->data = (char *)
 NewPtr(thelength);
 if (thenode->data == nil)
 {
 DisposPtr((Ptr)thenode->key);
 DisposPtr((Ptr)thenode);
 return(nil);
 }
 BlockMove((Ptr)thedata, 
 (Ptr)thenode->data, thelength);
 
 return(thenode);
 
 }

Finally, I’ll start off the table with a head node, a dummy node with no data that points to the true root of the tree. That way, I can create the tree before any insertions; I don’t have to special-case inserting data into a nil tree.

B(2). Finding a Node by Key

As I said above, looking up a node by key is easy. The routine below is passed a pointer to the root node of the tree, and the search key. It starts with the root and follows the algorithm, until either it finds the node it is looking for and falls out of the loop, or finds a nil pointer and returns it.

/* 3 */

// lookup
 
node *lookup(node *thetable, 
 char *thekey)
 {
 
 node   *thenode;
 int    compare;
 
 thenode = thetable;
 
 while (compare = strcmp(
 thekey, thenode->key))
 {
 if (compare < 0)
 thenode = thenode->left;
 else
 thenode = thenode->right;
 if (thenode == nil)
 return(nil);
 }
 
 return(thenode);
 
 }

B(3). Inserting a Node

This routine inserts a new node into the tree. First, it performs a search not unlike that in the “lookup” routine, except that it returns nil (indicating failure) if the search is successful -- i.e., a duplicate key exists. If, on the other hand, it finds a nil pointer, then that is the place for the new node, and the routine inserts it.

/* 4 */

// insert
 
node *insert(node *thetable, 
 char *thekey, char *thedata)
 {
 
 node   *thenode;
 int    compare;
 node   *thechild;
 
 thenode = thetable;
 
 while (compare = strcmp(
 thekey, thenode->key))
 {
 
 if (compare < 0)
 {
 
 thechild = thenode->left;
 if (thechild == nil)
 {
 thenode->left = createnode(
 thekey, thedata);
 return(thenode->left);
 }
 
 }
 else
 {
 
 thechild = thenode->right;
 if (thechild == nil)
 {
 thenode->right = createnode(
 thekey, thedata);
 return(thenode->right);
 }
 
 }
 
 thenode = thechild;
 
 }
 
 return(nil);
 
 }

B(4). Dumping and Destroying

Tree structures are natural candidates for recursive routines, and here are two. The “dump” routine displays the entire tree, in key order. The “destroy” routine disposes of all the dynamic memory used by the tree.

/* 5 */

// dump
 
void dump(node *thetable)
 {
 
 if (thetable == nil)
 return;
 
 dump(thetable->left);
 
 printf(“%10s - %10s - %10s\n”,
 thetable->key,
 thetable->left ? 
 thetable->left->key : “nil”,
 thetable->right ? 
 thetable->right->key : “nil”);
 
 dump(thetable->right);
 
 }
 
// destroy
 
void destroy(node *thetable)
 {
 
 if (thetable == nil)
 return;
 
 destroy(thetable->left);
 destroy(thetable->right);
 
 DisposPtr((Ptr)thetable->key);
 DisposPtr((Ptr)thetable->data);
 DisposPtr((Ptr)thetable);
 
 }

B(5). A Test Shell

/* 6 */

// Binary.c
// --------
// Binary tree search and insertion
 
#pragma load “managers”
 
// Constants and Macros
 
#define nil 0
 
// Types
 
typedef enum {false, true} logical;

typedef struct node
 {
 char   *key;
 struct node*left;
 struct node*right;
 char   *data;
 } node;
 
// Prototypes
 
 void initmac();
 node *createnode(char *thekey, 
 char *thedata);
 node *insert(node *thetable, 
 char *thekey, char *thedata);
 node *lookup(node *thetable, 
 char *thekey);
 void dump(node *thetable);
 void destroy(node *thetable);
 
// main
 
main()
 {
 
 node   *thetable;
 char   thestring[256];
 char   command;
 char   thekey[256];
 char   thedata[256];
 node   *thenode;
 
 initmac();
 
 thetable = createnode(“”, “”);
 if (thetable == nil)
 {
 printf(“\tUnable to “
 “create table\n”);
 exit(2);
 }
 
 printf(“? “);
 while (true)
 {
 
 gets(thestring);
 sscanf(&thestring[2], 
 “%c %s %[^\n]”, 
 &command, thekey, thedata);
 
 switch (command)
 {
 
 case ‘A’:
 case ‘a’:
 if (insert(thetable, thekey, 
 thedata) == nil)
 printf(“\tKey “
 “already exists “
 “in table\n”);
 break;
 
 case ‘D’:
 case ‘d’:
 dump(thetable->right);
 break;
 
 case ‘F’:
 case ‘f’:
 thenode = lookup(
 thetable, thekey);
 if (thenode == nil)
 {
 printf(“\tKey not “
 “found in table\n”);
 break;
 }
 printf(“\tData is “%s”\n”, 
 thenode->data);
 break;
 
 case ‘Q’:
 case ‘q’:
 destroy(thetable);
 exit(0);
 
 }
 
 printf(“? “);
 
 }
 
 }
 
// initmac
 
void initmac()
 {
 
 InitGraf((Ptr)&qd.thePort);
 SetFScaleDisable(true);
 
 InitCursorCtl(nil);
 
 }

C. Balanced Binary Trees

Simple binary trees are pretty and elegant but they aren’t perfect. Remember, I’ve promised a Canon tool will come out of all this, right? Aren’t most Canon dictionaries in alphabetical order? What happens to a binary tree when you insert already ordered data?

If you insert already ordered data into a simple binary tree, you get a degenerate tree. All the insertions go to the right; all the left pointers are nil. What is left is a linked list. Lookups in a linked list are o(N), instead of the o(log N) we would expect from a binary tree. (This means that the number of comparisons is proportional to the number of nodes, rather than to the log of the number of nodes. o(log N) is much faster than o(N).)

One solution to this problem is to use a balanced binary tree. This involves a little more work, and a more complicated insertion algorithm, but it restores the o(log N) running time.

C(1). What is a Balanced Tree?

The height of a node is the number of steps to its most distant descendant. The height of a tree (or a sub-tree) is the height of its root node. A binary tree is balanced if for every node in the tree, the height of the left sub-tree is within 1 of the height of the right sub-tree.

Here’s another way to say the same thing: each node in a binary tree has a balance factor, which is equal to the height of the left sub-tree minus the height of the right sub-tree. A binary tree is balanced if every balance factor is either 0, 1 or -1.

It’s pretty easy to see that a balanced tree is not going to exhibit the sort of degenerate behavior that simple binary trees can fall into. It’s also easy to see that inserting new nodes into the tree can throw it out of balance, which is why insertion is a bit more complex than in the simple case.

C(2). Searching, Dumping, Destroying

Balancing a binary tree requires a minor change to the node structure. We have to add a “balance” field, to keep track of the balance factor:

/* 7 */

typedef struct node
 {
 char   *key;
 struct node*left;
 struct node*right;
 int    balance;
 char   *data;
 } node;

The “createnode” routine initializes this new field to zero.

Searching, dumping and destroying the tree are identical to the routines for the simple binary tree, already shown above.

C(3). Balancing the Tree

We do insertions just the same way as for a simple binary tree, until we unbalance the tree. Then we have to re-balance it. We know the tree is out of balance if we get a node with a balance factor of ±2. Fortunately, balancing a tree turns out to be a localized operation.

Suppose that as we search the tree for the place to insert the new node, H is the last node we pass with a balance factor of ±1. Then the tree is imbalanced after the insertion if and only if H’s balance factor becomes ±2.

Proof: if H’s balance factor is ±2, then the tree is imbalanced by definition. On the other hand, if the tree is imbalanced, where is the ±2 node? It isn’t below H, since those nodes all had zero balance factors and can now be at most ±1 (in fact, they are all ±1; none remains at zero). If H’s balance factor isn’t ±2 after the insertion, then either it stayed ±1 or it went to zero. In the first case, the heights of H’s subtrees didn’t change. In the second, the height of the shorter tree increased. In both cases, the height of H itself remained the same, so the balance factors of all the nodes above H remained the same. Thus, if H isn’t the ±2 node, then the tree is balanced, contrary to assumption.

What’s more, the process of re-balancing H’s sub-tree will leave the height of the sub-tree the same as its height before the insertion. So after balancing the sub-tree, the entire tree will be balanced. This means that we need only concern ourselves with a handful of nodes around H -- four nodes, at most.

C(3)(a). Left-left insertion

The insertion algorithm divides into four cases. After finding node H, we move either left or right to continue searching for the place to insert the new node, and after the next node, we again move left or right. So the “left-left” case is the one in which we took two left branches after finding H.

C(3)(a)(i). Before Inserting the New Node

Here’s the immediate situation before inserting the new node:

Figure 1: Left-left case before insertion.

The ovals represent nodes; the capital letters are node names, and the numbers after the colons are the balance factors. The rectangles represent sub-trees; the lower-case letters are their names, and the expressions under the horizontal lines are tree heights. Node P is the parent of node H, the high node, which is in turn the parent of L, the low node. The new node will go into sub-tree a.

P’s balance factor is irrelevant. H’s is ±1 by definition (this is the same H as was discussed earlier, the last node with a non-zero balance factor), and L’s is zero for the same reason (definition of H). Since we’re adding the new node to the left sub-tree of H, which will increase H’s balance factor and unbalance the tree, H’s balance factor must be +1, not -1.

L’s balance factor is zero, so a and b have the same height, call it h. Then the height of L is h+1. H’s balance factor is +1, so c’s height is one less than L’s, or h.

Also, note that the height of H is h+1. And listing the parts of the sub-tree in order of keys, we get: a, L, b, H, c, since L, a and b are to the left of H and c to the right, and L is between a and b. These facts are important; I have already promised that inserting the new node and re-balancing the tree will not change the sub-tree’s height, and the order of the nodes must be preserved by any balancing transformation.

C(3)(a)(ii). After Inserting the New Node

Now we insert the new node in sub-tree a:

Figure 2: Left-left case after insertion.

The new node increases the height of a by 1, to h+1. The height of b remains h, so L’s balance factor is now +1. The height of L goes to h+2, which makes H’s balance factor +2, since the height of c is unchanged. The tree is now out of balance.

C(3)(a)(iii). After Re-Balancing the Tree

At this point, we need to perform some transformation on the tree that will bring it back into balance. I’m going to just pull it out of the hat:

Figure 3: Left-left case after transformation.

What we did was switch b to H, and lift sub-tree a, the big one, higher. The heights of a, b and c haven’t changed, and it’s easy to show that the new balance factors are correct. The important question is, is this still a properly-ordered binary tree?

Recall the order of nodes before the insertion: a, L, b, H, c. Has the order changed? Now a is to the left of L, and H, b and c to the right; H is between b and c: a, L, b, H, c. The order is the same.

Next, I promised that the height of the sub-tree would be the same after re-balancing as before insertion. The height before was h+1; it’s easy to see that it’s h+1 now. This fact allows us to ignore all the nodes above the sub-tree.

C(3)(b). Left-right insertion

A left-right insertion looks like a left-left insertion, except that the new node goes in sub-tree b instead of sub-tree a:

Figure 4: Left-right case after insertion.

Unfortunately, we can’t employ the same trick. Shifting b to H and raising L will result in exactly the same picture as above, only reversed horizontally. We have to take a different view of the situation:

Figure 5: Left-right case before insertion.

C is the root of our former sub-tree b, which is now split into sub-trees b and c; the former sub-tree c is now sub-tree d. The height of C is the same as the height of the old b, or h, so the heights of the new b and c are both h-1 (they’re equal because the balance factor of C is zero, by definition of H).

The new node goes into either b or c. It doesn’t matter much which; this will only affect balance factors after the re-balancing, and we can handle it then. So the tree after insertion looks like this:

Figure 6: Left-right case after insertion.

(Of course, the new node goes in only one of b and c, not in both! I had to draw it somehow...) The new heights and balance factors are easy to confirm.

And the balancing transformation is:

Figure 7: Left-right case after transformation.

L’s balance factor will be either 0 or +1, and H’s either -1 or 0, depending on whether the new node went into b or c. We have to check to make sure that neither the node order nor the height have changed (they haven’t).

C(3)(c). Right-right insertion

The right-right insertion is just the reverse of the left-left, so I’ll content myself with showing the pictures, and refer you back to left-left insertion for explanation.

Figure 8: Right-right case before insertion.

Figure 9: Right-right case after insertion.

Figure 10: Right-right case after transformation.

C(3)(d). Right-left insertion

The right-left insertion is just the reverse of the left-right, so I’ll content myself with showing the pictures, and refer you back to left-right insertion for explanation.

Figure 11: Right-left case before insertion.

Figure 12: Right-left case after insertion.

Figure 13: Right-left case after transformation.

C(4). The Insertion Algorithm

The above exhausts all possible cases when an insertion unbalances the tree. What about insertions that leave the tree balanced? We can’t simply ignore these, since they change balance factors as well.

Suppose a node has a balance factor of zero (which means its two sub-trees have the same height), and one sub-tree gets higher. There are two results: the node’s balance factor changes to +1 (if the left sub-tree grew) or -1 (if the right sub-tree grew), and the height of the node increases by 1. This second result will force a re-evaluation of the node’s parent, and the process will continue back up the search path so long as balance factors are zero. So we can see inserting the new node as the first child of its parent sets off a chain reaction back up the tree.

What if a node has a balance factor of ±1, and one sub-tree gets higher? If the sub-tree that already is higher is the one to grow, then we’ve unbalanced the tree. If the shorter sub-tree gets higher, then the node’s balance factor goes to zero. The first case we handled above; neither case changes the height of the node, so both kill the chain reaction.

So the high-level view of the insertion algorithm is this: First, insert the new node in the tree. Then, back up along the search path for as long as the nodes have zero balance factors, changing those balance factors to +1 or -1 according to whether the path went left or right when leaving the node. When a non-zero balance factor finally appears, increment or decrement it, according to the same test. If the result is zero, we’re done. If it’s ±2, the tree is unbalanced, so apply the appropriate balancing transformation.

C(5). The “insert” Routine

The following routine uses one cute trick: it keeps a stack of search directions in an int variable, and returns it as its result. This requires some explanation.

The integer has 32 bits. I divide it into 8 4-bit fields. The routine is called recursively going down the search path; it returns values going back up. The value it returns is the value it got, shifted left by one field, and with the direction from “parent” in the rightmost field: 1 for left, 2 for right.

The function result thus shows the direction from “parent” in the rightmost field, the direction from parent’s child in the next field, the direction from parent’s grandchild in the next field, and that’s as far as we need to go.

There are two special values: zero means “no further adjustment of balance factors is necessary”, and 4 means “error”, which in this context means a duplicate key (there are other possible errors but I cleaned out the error-detection code for simplicity’s sake).

/* 8 */

// insert
 
unsigned int insert(node *parent, 
 char *thekey, char *thedata)
 {
 
 int    compare;
 node   *high;
 unsigned int    result;
 node   *low;
 node   *child;
 
// if thekey matches this node, 
// then return an error
// (duplicate keys)
 
 compare = strcmp(thekey, parent->key);
 if (compare == 0)
 return(4);
 
// if there’s a slot for the new node, 
// then create it and return
// 1 if it is to the left of 
// parent and 2 if to the right
// else determine high, the next step 
// in the search path
 
 if (compare < 0)
 {
 high = parent->left;
 if (high == nil)
 {
 parent->left = createnode(
 thekey, thedata);
 return(1);
 }
 }
 else
 {
 high = parent->right;
 if (high == nil)
 {
 parent->right = createnode(
 thekey, thedata);
 return(2);
 }
 }
 
// now continue the search by 
//  calling “insert” recursively
// if the result is 0 (no balancing 
// needed at this level) or
// the result is 4 (duplicate 
// key), return
 
 result = insert(high, 
 thekey, thedata);
 if ((result == 0) || (result == 4))
 return(result);
 
// the low 4 bits of “result” indicate 
// the direction of the search path 
// leaving “high”;
// increment high’s balance if the path 
// goes left, and decrement it if the 
// path goes right
 
 if (result % 16 == 1)
 high->balance++;
 else
 high->balance--;
 
// if the balance is now zero, no further 
// correction of balances is needed, 
// so return
// if it’s ±1, push the direction away 
// from “parent” onto the “result” 
// stack and return it
// if it’s ±2, continue to balancing 
// transformation
 
 switch (high->balance)
 {
 case 0:
 return(0);
 case 1:
 case -1:
 return((result << 4) 
 + ((compare < 0) ? 1 : 2));
 case 2:
 case -2:
 break;
 }
 
// use the direction away from “high” to 
// find “low”, then switch on that 
// direction and the next one
// the easiest way to follow these 
// transformations is to refer to the 
// pictures
 
 low = (result % 16 == 1) ? 
 high->left : high->right;
 switch (result % 256)
 {
 
 case 0x11: // left-left case
 
 high->left = low->right;
 low->right = high;
 if (compare < 0)
 parent->left = low;
 else
 parent->right = low;
 high->balance = 0;
 low->balance = 0;
 return(0);
 
 case 0x12: // right-left case
 
 child = low->left;
 high->right = child->left;
 low->left = child->right;
 if (compare < 0)
 parent->left = child;
 else
 parent->right = child;
 child->left = high;
 child->right = low;
 child->balance = 0;
 low->balance = 0;
 high->balance = 0;
 result &= 0xF00;
 if (result == 0x100)
 low->balance = -1;
 else if (result == 0x200)
 high->balance = 1;
 return(0);
 
 case 0x21: // left-right case
 
 child = low->right;
 low->right = child->left;
 high->left = child->right;
 if (compare < 0)
 parent->left = child;
 else
 parent->right = child;
 child->left = low;
 child->right = high;
 child->balance = 0;
 low->balance = 0;
 high->balance = 0;
 result &= 0xF00;
 if (result == 0x100)
 high->balance = -1;
 else if (result == 0x200)
 low->balance = 1;
 return(0);
 
 case 0x22: // right-right case
 
 high->right = low->left;
 low->left = high;
 if (compare < 0)
 parent->left = low;
 else
 parent->right = low;
 high->balance = 0;
 low->balance = 0;
 return(0);
 
 }
 
 }

D. Conclusion

Before I depart, let me mention the existence of B-trees, which are sort of generalized binary trees. B-trees have several keys at each node, and several plus one pointers. For example, a 2-3 tree has two keys and three pointers per node. But not all of these keys need be used; a node in a 2-3 tree may have one or two keys. This makes the tree structure flexible enough to add the requirement that all leaves (nodes with no children) be the same distance from the root.

I think there are also B*-trees, or am I confusing this with C* algebras? Anyway, B-trees are a hot topic these days but I have no experience with them.

Next time, I’ll return to lexical analysis and do the new Canon tool. See you then.

Software Updates via MacUpdate

Latest Forum Discussions

Price Scanner via MacPrices.net

Jobs Board

SPREAD THE WORD:
Slashdot
Digg
Del.icio.us
Reddit
Newsvine