Binary Trees
Volume Number: 6
Issue Number: 8
Column Tag: Language Translation
By Clifford Story, Mount Prospect, IL
Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.
A. Introduction
In this article, Part V, I'm going to take an excursion away from language translation and develop some support routines. In particular, I'm going to create routines to handle binary trees; later, I'll use these routines to implement a symbol table for the new Canon tool and the fabled Inline tool.
There are a couple of "Why?" questions to answer. First question: why fill MacTutor with a long article on something that is taught in every college computer department in the U.S.? Ah, but not all of us have computer degrees; I don't, for example. And while you can find the subject in various books, I couldn't find a decent explanation of balanced binary trees -- I had to figure it out from hints I gleaned from Knuth's discussion (Sorting and Searching, pages 451 -- 457). Second question: why a binary tree? Aren't symbol tables usually done with hash tables? Well, yes, this is my impression, and Aho and Ullman state that hashing is "[t]he most efficient and commonly used method for maintaining symbol tables" (The Theory of Parsing, Translation, and Compiling, volume II, page 793). So why a binary tree? I'm sorry, I haven't got an answer for that one...
If you should ever write your own compiler, I guess you should use a hash table. After all, Aho & Ullman's opinion carries a lot of weight. In the meantime, binary trees are fun, so sit up and enjoy!
B. Simple Binary Trees
A binary tree is a tree with at most two sub-trees at each node. These are commonly called the left and right sub-trees. Searching a binary tree is very simple: take the left branch if the search key is less than the node's value, and the right branch if it is greater. Keep going until you get either an exact match (in which case the search succeeds), or a nil pointer (in which case the search fails).
The example program I have written implements search and insertion, as well as a recursive dump routine. It does not do deletions. I haven't any use for deletions in a symbol table, so I have left that part as an exercise...
B(1). Node Structure
Each node has four fields: the node's value or key, a pointer to the left sub-tree, a pointer to the right sub-tree, and the data. Here is the node structure I will use in my example:
/* 1 */
typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    char *data;
} node;
The data for this example is some arbitrary character string.
I'm going to create new nodes with the Mac trap NewPtr. I would usually use an array of nodes with a free list, and allocate them myself; I'm using NewPtr in this case for simplicity's sake.
/* 2 */
// createnode
node *createnode(char *thekey,
                 char *thedata)
{
    node *thenode;
    int thelength;

    // allocate the node itself
    thenode = (node *)
        NewPtr(sizeof(node));
    if (thenode == nil)
        return(nil);

    // copy the key, including its terminating null
    thelength = 1 + strlen(thekey);
    thenode->key = (char *)
        NewPtr(thelength);
    if (thenode->key == nil)
    {
        DisposPtr((Ptr)thenode);
        return(nil);
    }
    BlockMove((Ptr)thekey,
        (Ptr)thenode->key, thelength);

    thenode->left = nil;
    thenode->right = nil;

    // copy the data, including its terminating null
    thelength = 1 + strlen(thedata);
    thenode->data = (char *)
        NewPtr(thelength);
    if (thenode->data == nil)
    {
        DisposPtr((Ptr)thenode->key);
        DisposPtr((Ptr)thenode);
        return(nil);
    }
    BlockMove((Ptr)thedata,
        (Ptr)thenode->data, thelength);

    return(thenode);
}
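The alternative mentioned above, an array of nodes with a free list, might look something like the sketch below. It is not the code from this article: the names POOLSIZE, initpool, allocnode and freenode are invented for illustration, it uses standard C rather than the Memory Manager, and it leaves storage for the key and data strings to the caller. Unused nodes are chained together through their right pointers.

```c
#include <assert.h>
#include <stddef.h>

#define POOLSIZE 64            /* hypothetical capacity of the pool */

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    char *data;
} node;

static node pool[POOLSIZE];    /* every node lives in this array */
static node *freelist = NULL;  /* first unused node, or NULL */

/* chain all the nodes in the pool into the free list */
void initpool(void)
{
    int i;

    for (i = 0; i < POOLSIZE - 1; i++)
        pool[i].right = &pool[i + 1];
    pool[POOLSIZE - 1].right = NULL;
    freelist = &pool[0];
}

/* take a node off the free list; NULL means the pool is exhausted */
node *allocnode(void)
{
    node *thenode = freelist;

    if (thenode == NULL)
        return NULL;
    freelist = thenode->right;
    thenode->key = NULL;
    thenode->left = NULL;
    thenode->right = NULL;
    thenode->data = NULL;
    return thenode;
}

/* put a node back at the front of the free list */
void freenode(node *thenode)
{
    assert(thenode >= pool && thenode < pool + POOLSIZE);
    thenode->right = freelist;
    freelist = thenode;
}
```

Call initpool once before the first allocnode. Because freenode pushes onto the front of the list, the most recently freed node is the next one handed out.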
Finally, I'll start off the table with a head node, a dummy node with no data that points to the true root of the tree. That way, I can create the tree before any insertions; I don't have to special-case inserting data into a nil tree.
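Here is a small sketch of the head-node idea, on the assumption that the head node's key is the empty string. Every non-empty key then compares greater than the head's key, so the whole tree hangs off the head's right pointer. The names makenode, createtable and trueroot are invented for this illustration, and makenode uses malloc and stores the caller's pointers instead of copying the strings with NewPtr and BlockMove.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    char *data;
} node;

/* stand-in for createnode: malloc, and no copying of the strings */
node *makenode(char *thekey, char *thedata)
{
    node *thenode = (node *) malloc(sizeof(node));

    if (thenode == NULL)
        return NULL;
    thenode->key = thekey;
    thenode->left = NULL;
    thenode->right = NULL;
    thenode->data = thedata;
    return thenode;
}

/* an empty table is just a head node with an empty key */
node *createtable(void)
{
    return makenode("", "");
}

/* the true root of the tree, or NULL while the table is empty */
node *trueroot(node *thetable)
{
    assert(thetable != NULL);
    return thetable->right;
}
```

Since strcmp of any non-empty key against "" is positive, a search that starts at the head node always steps right first, and an insertion into an empty table is the ordinary "found a nil right pointer" case.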
B(2). Finding a Node by Key
As I said above, looking up a node by key is easy. The routine below is passed a pointer to the root node of the tree, and the search key. It starts with the root and follows the algorithm, until either it finds the node it is looking for and falls out of the loop, or finds a nil pointer and returns it.
/* 3 */
// lookup
node *lookup(node *thetable,
             char *thekey)
{
    node *thenode;
    int compare;

    thenode = thetable;
    // strcmp returns zero on an exact match, which ends the loop
    while ((compare = strcmp(
        thekey, thenode->key)) != 0)
    {
        if (compare < 0)
            thenode = thenode->left;
        else
            thenode = thenode->right;
        if (thenode == nil)
            return(nil);
    }
    return(thenode);
}
B(3). Inserting a Node
This routine inserts a new node into the tree. First, it performs a search not unlike that in the lookup routine, except that it returns nil (indicating failure) if the search is successful -- i.e., a duplicate key exists. If, on the other hand, it finds a nil pointer, then that is the place for the new node, and the routine inserts it.
/* 4 */
// insert
node *insert(node *thetable,
             char *thekey, char *thedata)
{
    node *thenode;
    int compare;
    node *thechild;

    thenode = thetable;
    while ((compare = strcmp(
        thekey, thenode->key)) != 0)
    {
        if (compare < 0)
        {
            thechild = thenode->left;
            if (thechild == nil)
            {
                thenode->left = createnode(
                    thekey, thedata);
                return(thenode->left);
            }
        }
        else
        {
            thechild = thenode->right;
            if (thechild == nil)
            {
                thenode->right = createnode(
                    thekey, thedata);
                return(thenode->right);
            }
        }
        thenode = thechild;
    }
    // exact match: the key is already in the table
    return(nil);
}
B(4). Dumping and Destroying
Tree structures are natural candidates for recursive routines, and here are two. The dump routine displays the entire tree, in key order. The destroy routine disposes of all the dynamic memory used by the tree.
/* 5 */
// dump
void dump(node *thetable)
{
    if (thetable == nil)
        return;
    dump(thetable->left);
    // print this node's key, then the keys of its two children
    printf("%10s - %10s - %10s\n",
        thetable->key,
        thetable->left ?
            thetable->left->key : "nil",
        thetable->right ?
            thetable->right->key : "nil");
    dump(thetable->right);
}

// destroy
void destroy(node *thetable)
{
    if (thetable == nil)
        return;
    // dispose of both sub-trees before the node itself
    destroy(thetable->left);
    destroy(thetable->right);
    DisposPtr((Ptr)thetable->key);
    DisposPtr((Ptr)thetable->data);
    DisposPtr((Ptr)thetable);
}
B(5). A Test Shell
/* 6 */
// Binary.c
// --------
// Binary tree search and insertion

#pragma load "managers"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Constants and Macros
#define nil 0

// Types
typedef enum {false, true} logical;

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    char *data;
} node;

// Prototypes
void initmac();
node *createnode(char *thekey,
                 char *thedata);
node *insert(node *thetable,
             char *thekey, char *thedata);
node *lookup(node *thetable,
             char *thekey);
void dump(node *thetable);
void destroy(node *thetable);

// main
main()
{
    node *thetable;
    char thestring[256];
    char command;
    char thekey[256];
    char thedata[256];
    node *thenode;

    initmac();

    // the head node: empty key, empty data
    thetable = createnode("", "");
    if (thetable == nil)
    {
        printf("\tUnable to create table\n");
        exit(2);
    }

    printf("? ");
    while (true)
    {
        gets(thestring);
        sscanf(&thestring[2],
            "%c %s %[^\n]",
            &command, thekey, thedata);
        switch (command)
        {
            case 'A':
            case 'a':
                if (insert(thetable, thekey,
                        thedata) == nil)
                    printf("\tKey already exists in table\n");
                break;
            case 'D':
            case 'd':
                // skip the head node when dumping
                dump(thetable->right);
                break;
            case 'F':
            case 'f':
                thenode = lookup(
                    thetable, thekey);
                if (thenode == nil)
                {
                    printf("\tKey not found in table\n");
                    break;
                }
                printf("\tData is %s\n",
                    thenode->data);
                break;
            case 'Q':
            case 'q':
                destroy(thetable);
                exit(0);
        }
        printf("? ");
    }
}

// initmac
void initmac()
{
    InitGraf((Ptr)&qd.thePort);
    SetFScaleDisable(true);
    InitCursorCtl(nil);
}
C. Balanced Binary Trees
Simple binary trees are pretty and elegant but they aren't perfect. Remember, I've promised a Canon tool will come out of all this, right? Aren't most Canon dictionaries in alphabetical order? What happens to a binary tree when you insert already ordered data?
If you insert already ordered data into a simple binary tree, you get a degenerate tree. All the insertions go to the right; all the left pointers are nil. What is left is a linked list. Lookups in a linked list are O(N), instead of the O(log N) we would expect from a binary tree. (This means that the number of comparisons is proportional to the number of nodes, rather than to the log of the number of nodes. O(log N) is much faster than O(N).)
One solution to this problem is to use a balanced binary tree. This involves a little more work, and a more complicated insertion algorithm, but it restores the O(log N) running time.
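The degenerate case is easy to demonstrate. The sketch below is a stripped-down illustration, not the article's code: it has no head node and no data field, it uses malloc, it does not copy keys, and the names plaininsert, height and build are made up. Height here means the number of steps from the root to its most distant descendant, so an empty tree is given height -1 by assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
} node;

/* plain, unbalanced insertion; returns the root of the sub-tree */
/* (keys are not copied, and duplicate keys are ignored) */
node *plaininsert(node *root, char *thekey)
{
    int compare;

    if (root == NULL)
    {
        root = (node *) malloc(sizeof(node));
        if (root == NULL)
            return NULL;
        root->key = thekey;
        root->left = NULL;
        root->right = NULL;
        return root;
    }
    compare = strcmp(thekey, root->key);
    if (compare < 0)
        root->left = plaininsert(root->left, thekey);
    else if (compare > 0)
        root->right = plaininsert(root->right, thekey);
    return root;
}

/* steps from root to its most distant descendant; -1 if empty */
int height(node *root)
{
    int l, r;

    if (root == NULL)
        return -1;
    l = height(root->left);
    r = height(root->right);
    return 1 + (l > r ? l : r);
}

/* build a tree from count keys, inserted in the order given */
node *build(char *keys[], int count)
{
    node *root = NULL;
    int i;

    assert(keys != NULL && count >= 0);
    for (i = 0; i < count; i++)
        root = plaininsert(root, keys[i]);
    return root;
}
```

Given the seven keys a through g in sorted order, build produces a tree of height 6, which is a chain of right pointers. Given the same keys in the order d, b, f, a, c, e, g, the height is 2.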
C(1). What is a Balanced Tree?
The height of a node is the number of steps to its most distant descendant. The height of a tree (or a sub-tree) is the height of its root node. A binary tree is balanced if for every node in the tree, the height of the left sub-tree is within 1 of the height of the right sub-tree.
Here's another way to say the same thing: each node in a binary tree has a balance factor, which is equal to the height of the left sub-tree minus the height of the right sub-tree. A binary tree is balanced if every balance factor is either 0, 1 or -1.
It's pretty easy to see that a balanced tree is not going to exhibit the sort of degenerate behavior that simple binary trees can fall into. It's also easy to see that inserting new nodes into the tree can throw it out of balance, which is why insertion is a bit more complex than in the simple case.
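The definitions above translate almost directly into code. This sketch computes heights and balance factors from scratch (the slow way, with no stored balance field), just to pin the definitions down. The names height, balancefactor and isbalanced are illustrative, and giving an empty sub-tree a height of -1 is an assumption that keeps a leaf at height 0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
} node;

/* steps from root to its most distant descendant; -1 if empty */
int height(node *root)
{
    int l, r;

    if (root == NULL)
        return -1;
    l = height(root->left);
    r = height(root->right);
    return 1 + (l > r ? l : r);
}

/* height of the left sub-tree minus height of the right sub-tree */
int balancefactor(node *thenode)
{
    assert(thenode != NULL);
    return height(thenode->left) - height(thenode->right);
}

/* true if every balance factor in the tree is 0, 1 or -1 */
int isbalanced(node *root)
{
    int factor;

    if (root == NULL)
        return 1;
    factor = balancefactor(root);
    if (factor < -1 || factor > 1)
        return 0;
    return isbalanced(root->left) && isbalanced(root->right);
}
```

A node with only a left child has balance factor +1 and is still balanced. Hang that node as the only child of another node and the new root's factor becomes +2, so the tree is out of balance.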
C(2). Searching, Dumping, Destroying
Balancing a binary tree requires a minor change to the node structure. We have to add a balance field, to keep track of the balance factor:
/* 7 */
typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    int balance;
    char *data;
} node;
The createnode routine initializes this new field to zero.
Searching, dumping and destroying the tree are identical to the routines for the simple binary tree, already shown above.
C(3). Balancing the Tree
We do insertions just the same way as for a simple binary tree, until we unbalance the tree. Then we have to re-balance it. We know the tree is out of balance if we get a node with a balance factor of ±2. Fortunately, balancing a tree turns out to be a localized operation.
Suppose that as we search the tree for the place to insert the new node, H is the last node we pass with a balance factor of ±1. Then the tree is imbalanced after the insertion if and only if H's balance factor becomes ±2.
Proof: if H's balance factor is ±2, then the tree is imbalanced by definition. On the other hand, if the tree is imbalanced, where is the ±2 node? It isn't below H, since the nodes on the search path below H all had zero balance factors and can now be at most ±1 (in fact, they are all ±1; none remains at zero). If H's balance factor isn't ±2 after the insertion, then either it stayed ±1 or it went to zero. In the first case, the heights of H's sub-trees didn't change. In the second, the height of the shorter sub-tree increased. In both cases, the height of H itself remained the same, so the balance factors of all the nodes above H remained the same. Thus, if H isn't the ±2 node, then the tree is balanced, contrary to assumption.
What's more, the process of re-balancing H's sub-tree will leave the height of the sub-tree the same as its height before the insertion. So after balancing the sub-tree, the entire tree will be balanced. This means that we need only concern ourselves with a handful of nodes around H -- four nodes, at most.
C(3)(a). Left-left insertion
The insertion algorithm divides into four cases. After finding node H, we move either left or right to continue searching for the place to insert the new node, and after the next node, we again move left or right. So the left-left case is the one in which we took two left branches after finding H.
C(3)(a)(i). Before Inserting the New Node
Here's the immediate situation before inserting the new node:
Figure 1: Left-left case before insertion.
The ovals represent nodes; the capital letters are node names, and the numbers after the colons are the balance factors. The rectangles represent sub-trees; the lower-case letters are their names, and the expressions under the horizontal lines are tree heights. Node P is the parent of node H, the high node, which is in turn the parent of L, the low node. The new node will go into sub-tree a.
P's balance factor is irrelevant. H's is ±1 by definition (this is the same H as was discussed earlier, the last node with a non-zero balance factor), and L's is zero for the same reason (definition of H). Since we're adding the new node to the left sub-tree of H, which will increase H's balance factor and unbalance the tree, H's balance factor must be +1, not -1.
L's balance factor is zero, so a and b have the same height, call it h. Then the height of L is h+1. H's balance factor is +1, so c's height is one less than L's, or h.
Also, note that the height of H is h+2. And listing the parts of the sub-tree in order of keys, we get: a, L, b, H, c, since L, a and b are to the left of H and c to the right, and L is between a and b. These facts are important; I have already promised that inserting the new node and re-balancing the tree will not change the sub-tree's height, and the order of the nodes must be preserved by any balancing transformation.
C(3)(a)(ii). After Inserting the New Node
Now we insert the new node in sub-tree a:
Figure 2: Left-left case after insertion.
The new node increases the height of a by 1, to h+1. The height of b remains h, so L's balance factor is now +1. The height of L goes to h+2, which makes H's balance factor +2, since the height of c is unchanged. The tree is now out of balance.
C(3)(a)(iii). After Re-Balancing the Tree
At this point, we need to perform some transformation on the tree that will bring it back into balance. I'm going to just pull it out of the hat:
Figure 3: Left-left case after transformation.
What we did was switch b to H, and lift sub-tree a, the big one, higher. The heights of a, b and c haven't changed, and it's easy to show that the new balance factors are correct. The important question is, is this still a properly-ordered binary tree?
Recall the order of nodes before the insertion: a, L, b, H, c. Has the order changed? Now a is to the left of L, and H, b and c to the right; H is between b and c: a, L, b, H, c. The order is the same.
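Next, I promised that the height of the sub-tree would be the same after re-balancing as before insertion. The height before was h+2 (H sat one level above L, whose height was h+1); after re-balancing, L is the root, with a of height h+1 on its left and H of height h+1 on its right, so the height is h+2 again. This fact allows us to ignore all the nodes above the sub-tree.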
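As a stand-alone function, the left-left transformation might be written like this. It is a sketch for illustration: the name rotateleftleft is invented, and instead of patching the parent's pointer itself, it returns the new root of the sub-tree and leaves it to the caller to store that pointer where H used to hang.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    int balance;
    char *data;
} node;

/* single rotation for the left-left case */
/* high is H (balance +2); its left child is L (balance +1) */
/* returns L, the new root of the sub-tree */
node *rotateleftleft(node *high)
{
    node *low;

    assert(high != NULL && high->left != NULL);
    low = high->left;
    high->left = low->right;    /* sub-tree b moves from L to H */
    low->right = high;          /* H becomes the right child of L */
    high->balance = 0;
    low->balance = 0;
    return low;
}
```

The smallest instance has h = -1, that is, a, b and c all empty before the insertion: H has only a left child L, and the new node becomes L's left child. After the rotation L is the root, with the new node on its left and H on its right, and the key order (new node, L, H) is unchanged.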
C(3)(b). Left-right insertion
A left-right insertion looks like a left-left insertion, except that the new node goes in sub-tree b instead of sub-tree a:
Figure 4: Left-right case after insertion.
Unfortunately, we can't employ the same trick. Shifting b to H and raising L will result in exactly the same picture as above, only reversed horizontally. We have to take a different view of the situation:
Figure 5: Left-right case before insertion.
C is the root of our former sub-tree b, which is now split into sub-trees b and c; the former sub-tree c is now sub-tree d. The height of C is the same as the height of the old b, or h, so the heights of the new b and c are both h-1 (they're equal because the balance factor of C is zero, by definition of H).
The new node goes into either b or c. It doesn't matter much which; this will only affect balance factors after the re-balancing, and we can handle it then. So the tree after insertion looks like this:
Figure 6: Left-right case after insertion.
(Of course, the new node goes in only one of b and c, not in both! I had to draw it somehow...) The new heights and balance factors are easy to confirm.
And the balancing transformation is:
Figure 7: Left-right case after transformation.
L's balance factor will be either 0 or +1, and H's either -1 or 0, depending on whether the new node went into b or c. We have to check to make sure that neither the node order nor the height has changed (they haven't).
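The left-right transformation can be sketched as a stand-alone function in the same way. Note one substitution: to decide which of L and H ends up with a non-zero balance factor, this sketch looks at C's balance factor after the insertion (+1 if the new node went into b, -1 if it went into c, 0 if C is itself the new node) rather than at a record of the search path. The name rotateleftright is invented, and the caller is expected to store the returned pointer where H used to hang.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    int balance;
    char *data;
} node;

/* double rotation for the left-right case */
/* high is H (balance +2), its left child is L (balance -1), */
/* and L's right child is C; returns C, the new sub-tree root */
node *rotateleftright(node *high)
{
    node *low;
    node *child;
    int oldbalance;

    assert(high != NULL && high->left != NULL);
    low = high->left;
    assert(low->right != NULL);
    child = low->right;
    oldbalance = child->balance;

    low->right = child->left;   /* sub-tree b moves to L */
    high->left = child->right;  /* sub-tree c moves to H */
    child->left = low;
    child->right = high;

    child->balance = 0;
    low->balance = 0;
    high->balance = 0;
    if (oldbalance == 1)        /* new node in b: H is short on the left */
        high->balance = -1;
    else if (oldbalance == -1)  /* new node in c: L is short on the right */
        low->balance = 1;
    return child;
}
```

With h = 0, sub-trees a and d are single nodes and C is a leaf before the insertion. If the new node becomes C's left child, the rotation leaves C as root with balance 0, L with children a and the new node (balance 0), and H with only d on its right (balance -1).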
C(3)(c). Right-right insertion
The right-right insertion is just the reverse of the left-left, so I'll content myself with showing the pictures, and refer you back to left-left insertion for explanation.
Figure 8: Right-right case before insertion.
Figure 9: Right-right case after insertion.
Figure 10: Right-right case after transformation.
C(3)(d). Right-left insertion
The right-left insertion is just the reverse of the left-right, so I'll content myself with showing the pictures, and refer you back to left-right insertion for explanation.
Figure 11: Right-left case before insertion.
Figure 12: Right-left case after insertion.
Figure 13: Right-left case after transformation.
C(4). The Insertion Algorithm
The above exhausts all possible cases when an insertion unbalances the tree. What about insertions that leave the tree balanced? We can't simply ignore these, since they change balance factors as well.
Suppose a node has a balance factor of zero (which means its two sub-trees have the same height), and one sub-tree gets higher. There are two results: the node's balance factor changes to +1 (if the left sub-tree grew) or -1 (if the right sub-tree grew), and the height of the node increases by 1. This second result will force a re-evaluation of the node's parent, and the process will continue back up the search path so long as balance factors are zero. So we can see inserting the new node as the first child of its parent sets off a chain reaction back up the tree.
What if a node has a balance factor of ±1, and one sub-tree gets higher? If the sub-tree that already is higher is the one to grow, then we've unbalanced the tree. If the shorter sub-tree gets higher, then the node's balance factor goes to zero. The first case we handled above; neither case changes the height of the node, so both kill the chain reaction.
So the high-level view of the insertion algorithm is this: First, insert the new node in the tree. Then, back up along the search path for as long as the nodes have zero balance factors, changing those balance factors to +1 or -1 according to whether the path went left or right when leaving the node. When a non-zero balance factor finally appears, increment or decrement it, according to the same test. If the result is zero, we're done. If it's ±2, the tree is unbalanced, so apply the appropriate balancing transformation.
C(5). The insert Routine
The following routine uses one cute trick: it keeps a stack of search directions in an int variable, and returns it as its result. This requires some explanation.
The integer has 32 bits. I divide it into eight 4-bit fields. The routine is called recursively going down the search path; it returns values going back up. The value it returns is the value it got, shifted left by one field, and with the direction from parent in the rightmost field: 1 for left, 2 for right.
The function result thus shows the direction from parent in the rightmost field, the direction from parent's child in the next field, the direction from parent's grandchild in the next field, and that's as far as we need to go.
There are two special values: zero means no further adjustment of balance factors is necessary, and 4 means error, which in this context means a duplicate key (there are other possible errors but I cleaned out the error-detection code for simplicity's sake).
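The field arithmetic described above can be tried in isolation. In this sketch, the names LEFT, RIGHT, FIELDBITS, pushdirection and fieldat are invented for illustration; only the encoding itself (4-bit fields, 1 for left, 2 for right, newest direction in the rightmost field) comes from the description.

```c
#include <assert.h>

#define LEFT      1
#define RIGHT     2
#define FIELDBITS 4

/* shift the stack left by one field and put the new direction */
/* in the rightmost field */
unsigned int pushdirection(unsigned int stack, unsigned int direction)
{
    assert(direction == LEFT || direction == RIGHT);
    return (stack << FIELDBITS) + direction;
}

/* read one field of the stack; field 0 is the rightmost */
unsigned int fieldat(unsigned int stack, int index)
{
    assert(index >= 0 && index < 8);
    return (stack >> (FIELDBITS * index)) & 0xF;
}
```

Suppose the new node is the right child of its parent, and that parent is the left child of its own parent. The first return value is 2, and pushing a left direction onto it gives 0x21: the rightmost field says "left", the next field says "right". That is why the expression result % 16 picks out the most recent direction and result % 256 picks out the last two.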
/* 8 */
// insert
unsigned int insert(node *parent,
                    char *thekey, char *thedata)
{
    int compare;
    node *high;
    unsigned int result;
    node *low;
    node *child;

    // if thekey matches this node,
    //   then return an error
    //   (duplicate keys)
    compare = strcmp(thekey, parent->key);
    if (compare == 0)
        return(4);

    // if there's a slot for the new node,
    //   then create it and return
    //   1 if it is to the left of
    //   parent and 2 if to the right
    // else determine high, the next step
    //   in the search path
    if (compare < 0)
    {
        high = parent->left;
        if (high == nil)
        {
            parent->left = createnode(
                thekey, thedata);
            return(1);
        }
    }
    else
    {
        high = parent->right;
        if (high == nil)
        {
            parent->right = createnode(
                thekey, thedata);
            return(2);
        }
    }

    // now continue the search by
    //   calling insert recursively
    // if the result is 0 (no balancing
    //   needed at this level) or
    //   the result is 4 (duplicate
    //   key), return
    result = insert(high,
        thekey, thedata);
    if ((result == 0) || (result == 4))
        return(result);

    // the low 4 bits of result indicate
    //   the direction of the search path
    //   leaving high;
    // increment high's balance if the path
    //   goes left, and decrement it if the
    //   path goes right
    if (result % 16 == 1)
        high->balance++;
    else
        high->balance--;

    // if the balance is now zero, no further
    //   correction of balances is needed,
    //   so return
    // if it's ±1, push the direction away
    //   from parent onto the result
    //   stack and return it
    // if it's ±2, continue to balancing
    //   transformation
    switch (high->balance)
    {
        case 0:
            return(0);
        case 1:
        case -1:
            return((result << 4)
                + ((compare < 0) ? 1 : 2));
        case 2:
        case -2:
            break;
    }

    // use the direction away from high to
    //   find low, then switch on that
    //   direction and the next one
    // the easiest way to follow these
    //   transformations is to refer to the
    //   pictures
    low = (result % 16 == 1) ?
        high->left : high->right;
    switch (result % 256)
    {
        case 0x11:      // left-left case
            high->left = low->right;
            low->right = high;
            if (compare < 0)
                parent->left = low;
            else
                parent->right = low;
            high->balance = 0;
            low->balance = 0;
            return(0);
        case 0x12:      // right-left case
            child = low->left;
            high->right = child->left;
            low->left = child->right;
            if (compare < 0)
                parent->left = child;
            else
                parent->right = child;
            child->left = high;
            child->right = low;
            child->balance = 0;
            low->balance = 0;
            high->balance = 0;
            result &= 0xF00;
            if (result == 0x100)
                low->balance = -1;
            else if (result == 0x200)
                high->balance = 1;
            return(0);
        case 0x21:      // left-right case
            child = low->right;
            low->right = child->left;
            high->left = child->right;
            if (compare < 0)
                parent->left = child;
            else
                parent->right = child;
            child->left = low;
            child->right = high;
            child->balance = 0;
            low->balance = 0;
            high->balance = 0;
            result &= 0xF00;
            if (result == 0x100)
                high->balance = -1;
            else if (result == 0x200)
                low->balance = 1;
            return(0);
        case 0x22:      // right-right case
            high->right = low->left;
            low->left = high;
            if (compare < 0)
                parent->left = low;
            else
                parent->right = low;
            high->balance = 0;
            low->balance = 0;
            return(0);
    }

    // not reached: the four cases above
    //   cover every ±2 balance
    return(0);
}
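When experimenting with a routine like this one, it helps to have an independent check that the stored balance fields still agree with the real shape of the tree. The following is a debugging aid sketched for that purpose, not part of the article's program. The names BADTREE, checkbalance and verifytable are invented, and it assumes the head-node convention, with the true root hanging off the head's right pointer.

```c
#include <assert.h>
#include <stddef.h>

#define BADTREE (-2)    /* hypothetical "tree is invalid" marker */

typedef struct node
{
    char *key;
    struct node *left;
    struct node *right;
    int balance;
    char *data;
} node;

/* returns the height of the sub-tree (-1 if empty), or BADTREE */
/* if any stored balance is wrong or lies outside -1..+1 */
int checkbalance(node *root)
{
    int l, r;

    if (root == NULL)
        return -1;
    l = checkbalance(root->left);
    r = checkbalance(root->right);
    if (l == BADTREE || r == BADTREE)
        return BADTREE;
    if (root->balance != l - r || l - r < -1 || l - r > 1)
        return BADTREE;
    return 1 + (l > r ? l : r);
}

/* stop the program if the tree below the head node is invalid */
void verifytable(node *thetable)
{
    assert(thetable != NULL);
    assert(checkbalance(thetable->right) != BADTREE);
}
```

Calling verifytable on the head node after every insertion during testing would catch a transformation that sets a balance factor incorrectly, at the cost of walking the whole tree each time.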
D. Conclusion
Before I depart, let me mention the existence of B-trees, which are sort of generalized binary trees. B-trees have several keys at each node, and several plus one pointers. For example, a 2-3 tree has two keys and three pointers per node. But not all of these keys need be used; a node in a 2-3 tree may have one or two keys. This makes the tree structure flexible enough to add the requirement that all leaves (nodes with no children) be the same distance from the root.
I think there are also B*-trees, or am I confusing this with C* algebras? Anyway, B-trees are a hot topic these days but I have no experience with them.
Next time, I'll return to lexical analysis and do the new Canon tool. See you then.