Using Flex and Bison
Volume Number: 16 (2000)
Issue Number: 7
Column Tag: Programming
by Aaron Montgomery
Scanners, Parsers and Abstract Syntax Trees, oh my!
What Are Flex and Bison (And Why Do I Need Them)?
Flex and bison are code-generating tools designed to aid in compiler development. In particular, flex (short for fast lexical analyzer) will take a sequence of characters and break it apart into tokens (words). It could be used to write code that breaks this article into a sequence of words and punctuation. Bison (whose name is a play on yacc, short for Yet Another Compiler Compiler) will group the words into sentences, the sentences into paragraphs and the paragraphs into sections. But wait, you protest, I don't write compilers. That's okay, there are still a number of reasons to learn a little about flex and bison. First, you may find that writing (or even understanding the code behind) a simple compiler may make the syntax errors coming from other compilers more understandable. There are also situations where you might want to be able to analyze a sequence of tokens even if the final output is not compiled code. For example, if you are transferring data over the Internet in the form of strings, you can use flex and bison to create a parser to decode the information.
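As a rough illustration of what a scanner does, the following hand-written C++ sketch (this is not flex output, just an imitation of its job) breaks a string into words and punctuation tokens:

```cpp
#include <cctype>
#include <string>
#include <vector>

// A hand-written sketch of what a flex-generated scanner does:
// walk the input and emit one token (word or punctuation) at a time.
std::vector<std::string> Tokenize(const std::string& input)
{
    std::vector<std::string> tokens;
    std::string::size_type i = 0;
    while (i < input.size()) {
        if (std::isspace(static_cast<unsigned char>(input[i]))) {
            ++i;                          // skip whitespace between tokens
        } else if (std::isalpha(static_cast<unsigned char>(input[i]))) {
            std::string word;             // a maximal run of letters is a word
            while (i < input.size() &&
                   std::isalpha(static_cast<unsigned char>(input[i])))
                word += input[i++];
            tokens.push_back(word);
        } else {
            // anything else (punctuation) becomes a one-character token
            tokens.push_back(std::string(1, input[i++]));
        }
    }
    return tokens;
}
```

A flex-generated scanner does essentially this, except that the patterns come from your specification file and each match returns an integer token code rather than a string.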
My own personal experience with flex and bison began in a course in compiler design at the University of Wisconsin-Madison and I am currently working on reimplementing an interpreter for the language isetl in ANSI C++. This recent work has required that I brush up on my skills with flex and bison (in the Metrowerks' CodeWarrior environment) and doing a small project with a toy language similar to isetl seemed like a good first step.
This article will present a small simple language called Miniset and then present the steps I followed to develop a parser for the language. It is highly unlikely that you have ever used (or even heard of) Miniset, largely because I made it up last month in order to practice with flex and bison before working on isetl. The final result will be an application which simply echoes out the source you type in (although it will make it conform to my stylistic conventions). This echo application is probably a good first step in any parser's development since it allows you to confirm that your parser is correctly interpreting the source. The next step would be to write the code to implement the Miniset source, but that would be another article.
This article will assume some understanding of C++ although bison and flex can be used even if you intend to do all your programming in C. You are also expected to have a basic understanding of the CodeWarrior IDE (in particular, how to adjust project settings in preference panels). While I will spend some time on the peculiarities of using flex and bison in the CodeWarrior environment, much of the article remains valid in other environments (or even other operating systems). A basic understanding of Miniset may come in handy in places and there is a short overview of the language included with the project file. This article does not assume that you have any previous experience with either flex or bison. After reading the article, you will hopefully have enough understanding of flex and bison to determine if they suit your needs. In addition, you will have seen some of the issues that can arise as you try to build a parser or a scanner with these tools. This article does not intend to provide a complete introduction to these tools. If you decide that you are going to use flex or bison, then I would strongly encourage you to obtain a copy of the book lex & yacc as well as download the bison and flex documentation (references available at end of article).
How Flex and Bison Work
Before discussing this particular project, an overview of how flex and bison operate is probably appropriate. I will call the input file for flex a "scanner specification" and it is conventional for it to have an l suffix. I will call the input file for bison a "parser specification" and it is conventional for it to have a y suffix. When you tell flex or bison to process its input file, it will produce a file containing C/C++ source code. I will call these files source files in order to distinguish them from the specification files. Your own code will probably call the function yyparse() created by bison from your parser specification. This function will in turn call yylex() created by flex from your scanner specification. You can use yylex() independent of yyparse() and you can supply your own yylex() to be used with a bison generated yyparse().
When you call yyparse() from your code, the parser will repeatedly call yylex() to obtain tokens (words and punctuation) from the input stream. With each new token, the parser will decide whether to push the token onto its internal stack or to reduce a sequence of tokens on the stack to create a new language part to push onto the stack. These new parts of the language are referred to as "non-terminals". One of the non-terminals is the "start-state". When this non-terminal is matched, yyparse() will return. This is probably best explained with the following example.
Assume that the language knows that a noun-phrase consists of an article ("the" or "a") followed by a noun. That a sentence consists of a noun-phrase, a verb, a noun-phrase and a period. Furthermore, assume that the parser is attempting to match a sentence. The non-terminals here are the noun-phrase and the sentence, the tokens are articles, nouns, verbs and periods. The start-state is a complete sentence. If the input stream consists of "The man ate a worm." then a call to yyparse() will result in the sequence of events depicted in Figure 1.
Figure 1. A sample parsing.
- The parser will obtain the first token from the scanner obtaining "The", an article. The parser will push the "The" on the stack (Figure 1-a).
- The parser will obtain another token from the scanner obtaining "man", a noun. Since there is an article on the stack, the parser will pop the article off the stack and replace it with a noun-phrase "The man" (Figure 1-b).
- The parser will obtain another token from the scanner obtaining "ate", a verb. There is no rule reducing "noun-phrase verb" so it pushes "ate" on the stack (Figure 1-c).
- The parser will obtain another token from the scanner obtaining "a", an article. There is no rule reducing "noun-phrase verb article" and so it pushes the "a" on the stack (Figure 1-d).
- The parser will obtain another token from the scanner obtaining "worm", a noun. The parser can now reduce the "article noun" on the top of the stack to a noun-phrase and places the noun-phrase on the stack (Figure 1-e).
- The parser will obtain another token from the scanner obtaining ".", a period. The parser can now reduce the "noun-phrase verb noun-phrase period" to a sentence and places this on the stack (Figure 1-f).
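In bison's notation (the token and rule names here are my own invention for this toy example, not part of Miniset), the grammar driving Figure 1 might be written:

```yacc
sentence
	:
	noun_phrase T_VERB noun_phrase T_PERIOD
	;
noun_phrase
	:
	T_ARTICLE T_NOUN
	;
```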
The parser has now matched a sentence and returns. When the parser returns, its internal stack vanishes and so you are responsible for finding some method of returning the information you need from that stack. The most common method (the one used in this article) is to use a global pointer to the needed information. You can save this information because bison will execute a block of C/C++ code that you provide each time it matches a pattern. Frequently, these blocks of C code will be used to produce something called an "abstract syntax tree" or "AST" for short. Abstract syntax trees are data structures designed to mirror the syntactic structure of the pattern being parsed. For example, the abstract syntax tree from the sentence above might look like the one in Figure 2 (note, we don't need to have a period included in the tree since it occurs for all sentences).
Figure 2. An abstract syntax tree.
Bison facilitates this construction by assigning to each token or non-terminal a value and the block of C/C++ code for a rule reduction can use the values of the tokens and non-terminals in the pattern to set the value of the non-terminal being reduced. You will be introduced to the details of this in the description of Parser 3 below.
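As a sketch of how the tree in Figure 2 might be represented in C++ (these struct names are illustrative only; they are not the classes used in the project):

```cpp
#include <string>

// Illustrative AST nodes for the sentence example.  The project's own
// classes (CNode and its subclasses) are organized along the same lines.
struct NounPhrase {
    std::string article;
    std::string noun;
};

struct Sentence {
    NounPhrase  subject;
    std::string verb;
    NounPhrase  object;
};

Sentence MakeExample()
{
    // "The man ate a worm." with the period omitted, as in Figure 2.
    return Sentence{ {"The", "man"}, "ate", {"a", "worm"} };
}
```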
Installing Flex & Bison for CodeWarrior
The flex and bison plug-ins do not come packaged with CodeWarrior, so the first thing you will need to do is get them (a URL is provided at the end of this article). After decompressing the archives, you should place the flex and bison folders into the CodeWarrior Plugins folder of CodeWarrior. You now need to make sure that the file mappings are correct. In the project settings, make sure that flex is associated with the file suffix l with the precompile flag set, and that bison is associated with the file suffix y with the precompile flag set. If you don't set the precompile flag, CodeWarrior will complain of an error (from the old source file) and then show you the source where the error no longer exists.
If you intend to use them only to generate PPC code, then you are set. If you plan to generate code for a 68K processor, you will need to change the parser skeleton file called bison.simple which is in the Bison plug-in folder. This change is needed to work around a limitation in CodeWarrior's implementation of the alloca() function in the 68K environment. I reported this to Metrowerks and received a work-around. You should replace the file in the Bison plug-in folder with the file included with the project. The alloca() limitation only affects 68K development and there is a work-around, so I suspect (even hope, to be honest) that it sits low on Metrowerks' priority list.
Another problem is that it is easy to edit the source file (for example after a compile error or while debugging) and forget to change the specification file. If you do this, flex and bison will overwrite your changes the next time you run them. This can be particularly frustrating when you infrequently change the specification files. You can fix the error in the source file, leave the files untouched for a month, then rebuild the project and (mysteriously) a bug which you thought you had fixed a month ago will reappear.
The Project
The project builds a parser for a language called Miniset. Miniset is a small language handling the implementation of sets of positive integers. It is complicated enough to present some common issues in writing a parser and a scanner, but simple enough that the flex and bison specification files are manageable. Since Miniset is intended to be an interpreted language, the parser will assume that it should read each statement you type into the text window, execute the statement (in our case, just echo it back to the screen) and then wait for you to type in the next statement. I have tried to document the less obvious language constructs in the parser specification files. If you are interested in learning more about the language, you should read the short introduction to Miniset included with the project and work with the output application of the Parser 8 target. This output application will read what you have typed and echo it out to the screen. If you end the session with the quit command, it will then echo your code out to a file called miniset one, then read it back in (reparsing it) and then echo it out to a file called miniset two. This file creation is done as a check on the internal consistency of the parser (both files should be identical).
The project has eight targets (aptly named Parser 1, Parser 2, ... Parser 8). They each present a different stage in the process of building the final application (Parser 8). This article will attempt to lead you through the sequence I used when building the parser and you can follow along with the files. Due to length issues, it would be inappropriate to document every difference between the project files, but you can use a diff tool (like CodeWarrior's Compare Files...) to quickly identify the changes between the stages of the project. Here is a general overview of the eight targets:
- Parser 1: initial parser (bison) specification, lots of warnings generated by bison
- Parser 2: rewrote portions of the parser specification to eliminate the warnings
- Parser 3: added C++ code, resulting source file compiles
- Parser 4: initial scanner (flex) specification, there are some problems while scanning
- Parser 5: fixed the scanning problems, some bugs in the parser specification are now visible
- Parser 6: fixed the bugs discovered in Parser 5
- Parser 7: added better error handling
- Parser 8: repackaged the scanner and parser as a C++ class
Parser 1
Writing the parser specification
The primary purpose of Parser 1 is to introduce you to the notation used by bison. If you already have some familiarity with such specifications, you might be able to identify the mistakes in the file as you read through it. Hopefully, by Parser 6 we will have resolved all such problems. The project consists of a single file (parser1.y) which is the first attempt to produce a bison input file for Miniset.
Bison specification files have three sections (separated by lines containing %%). The first section of a bison file (from the start of the file to the first %%) is called the Definitions Section.
parser1.y (Definitions Section)
//punctuation
%token T_COMMA, T_SEMICOLON
%token T_OPEN_PAREN, T_CLOSE_PAREN
%token T_END
//code omitted
%token T_MULTIPLY
%start nt_program
%%
This section provides information that is placed at the top of the resulting source file or which is needed by bison to create the parser. Lines beginning with whitespace will be copied verbatim into the resulting source file near the top, but you should restrict yourself to comments on such lines because there is no guarantee about their exact position in the resulting source file. Similarly, lines contained between a %{ and a %} are also copied verbatim into the source file.
The other lines beginning with a % are information that bison needs to construct the parser. Currently there are two types in the file: %token lines and the %start line. The lines beginning with %token list the tokens that the scanner returns to the parser. It is conventional to name these with all uppercase letters (I add the T_ prefix as well). Assuming the Generate Header option is selected in the Bison preference panel, bison will generate a header file which #defines each token to be a unique integer. The scanner function, yylex(), should use these tokens as return values. The ordering is unimportant for now (however, it is set up to make the transition to Parser 2 easier).
The %start line tells the parser that it is attempting to parse something called an nt_program. Non-terminals are conventionally written in lower case (I add the nt_ prefix as well). The specification defines the non-terminals in the next section: the Rules Section.
The section between the first %% and the second %% is called the Rules Section. This section describes the non-terminals in the grammar (those things which are built from tokens and other non-terminals). The Rules Section in parser1.y is long and I have omitted a number of rules. The lines displayed below are representative of what you find in a rules section and they contain all the code of interest (including the bugs).
parser1.y (Rules Section)
nt_program
:
nt_statement_list nt_quit_statement
;
nt_quit_statement
:
T_QUIT T_SEMICOLON
;
nt_statement_list
:
/* empty */
|
nt_statement
|
nt_statement_list nt_statement
;
//code omitted
nt_number
:
T_NUMBER_LITERAL
|
T_NUMBER_IDENTIFIER
|
nt_number T_MULTIPLY nt_number
|
nt_number T_ADD nt_number
|
T_OPEN_PAREN nt_number T_CLOSE_PAREN
|
T_NUMBER_IDENTIFIER T_OPEN_BRACE
nt_expression_list T_CLOSE_BRACE
;
//code omitted
A rule describes to the parser when it can do a reduction. The result of the reduction (a non-terminal) is given first followed by a colon (:), then there are a sequence of patterns separated by vertical bars (|) and finally a semicolon (;). After each pattern, there is an optional block of code. Some of the blocks in parser1.y contain comments about the Miniset grammar.
It is conventional to have the start state as the first rule in the list and write the specification from the top down. The first rule given above tells the parser that an nt_program is an nt_statement_list followed by an nt_quit_statement.
The second rule describes an nt_quit_statement as the token T_QUIT followed by the token T_SEMICOLON. In other words, a quit statement looks like:
quit;
The third rule tells the parser that an nt_statement_list can be empty (the first possibility; empty reductions are typically listed first and indicated by /* empty */), can consist of a single nt_statement (the second possibility), or can be built recursively as an nt_statement_list followed by an nt_statement. Notice that the recursion could have been presented as nt_statement nt_statement_list. Although both presentations are equivalent, the first form is preferred since the second would require all the nt_statements to be pushed onto the stack before a reduction could occur. The first form allows the nt_statement_list to be placed on the top of the stack first and then allows it to "gobble up" each following nt_statement as it appears in the input stream.
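The two forms side by side, in the notation of parser1.y (shown separately here only for contrast; a real specification would contain one or the other):

```yacc
//left recursion: the list reduces early and stays small on the stack
nt_statement_list
	:
	nt_statement_list nt_statement
	;

//right recursion: every nt_statement is stacked before any reduction
nt_statement_list
	:
	nt_statement nt_statement_list
	;
```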
The third (and final) section occurs after the second %% and consists of C/C++ code that bison will place in the resulting source file. This section typically includes any code for additional functions needed. Currently we don't require any and so this section is empty.
Processing the parser specification
Now issue the Make command. If this is the first time you make this project, you should get a message that you need to add a file to the project and then a message stating that the file you should add cannot be found. Just issue the Make command a second time and everything will be all right. You will get similar messages on the later targets as well. Occasionally, I have also received an "unknown error" message which can usually be solved by trashing any previously created parser files (in this case, parser1.tab.h, parser1.tab.c and parser1.output).
More importantly, you should get a warning message indicating that there are a number of conflicts (170 to be exact). A shift/reduce conflict occurs when the parser cannot decide if it should shift a token onto the stack or reduce to a non-terminal. A reduce/reduce conflict occurs when the parser cannot decide which reduction rule to use to reduce the stack. You should attempt to remove all of these conflicts from your parser specification, although this may be difficult (or even impossible). In any case, you should identify each of them and confirm that the parser takes the correct action in each case.
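The classic textbook example of a shift/reduce conflict (not part of Miniset; the token names here are illustrative) is the "dangling else":

```yacc
nt_statement
	:
	T_IF nt_boolean T_THEN nt_statement
	|
	T_IF nt_boolean T_THEN nt_statement T_ELSE nt_statement
	;
```

After matching T_IF nt_boolean T_THEN nt_statement, a parser that sees T_ELSE can either shift it (attaching the else to the nearest if) or reduce (attaching it to an outer if). Bison resolves shift/reduce conflicts in favor of the shift, which happens to be the behavior most languages want here, but it still reports the conflict so that you can check.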
Now that we know we have errors, we will adjust some settings to get clearer messages. In the Bison preference panel, you should check the Verbose checkbox. This will result in more informative warning messages (containing information about the "states" in which the bugs occur). You will now want to Touch the specification file and issue the Make command again. During this build, you will receive more explicit warnings about the shift/reduce conflicts. The warning messages grow because each new warning message is appended onto the previous warning. This means that you will need to use the horizontal scrollbar to see all of the messages. You should write down all the states that are listed as having conflicts (or if you are lazy like I am, just write down a few and go work with them first).
Checking the Verbose checkbox also causes the file parser1.output to be created in the same folder as parser1.y. The parser1.output file contains a description of the parser in terms of a finite-state-machine. You don't need to know much about the theory behind this, but the basic principle is that as each token is processed, the parser moves from one state to another (these are the states referred to in the warning messages). The parser1.output file describes these movements but it can be intimidating at first. In the next section we walk through some of the entries to determine the problems with Parser 1.
Parser 2
Adjusting the parser specification
Open the file parser1.output (it's pretty scary, you might want to close the file, take a deep breath and then open it again). At the top of the file are all of the reduction rules that are used as well as information about where each token is used in the file. Following this information are the definitions of each state. You should look for the definitions of the states with conflicts. Because each state definition begins with a line containing the word "state" followed by the state number, you can use the regular expression facility of the Find dialog to find a state's definition quickly. For example, to find the definition of state 47, search for ^state 47 with Regexp checked in the Find dialog. The symbol ^ indicates that you are looking for a match at the beginning of a line. Using the list you made at the end of the previous section, you should look at what is happening in the states with conflicts. Your exact state numbers may vary from those presented in this article, but I suspect they do not. We start with the problem in state 0.
parser1.output (state 0)
state 0
T_OPEN_PAREN shift, and go to state 1
T_IF shift, and go to state 2
T_FOR shift, and go to state 3
//code omitted
T_OPEN_PAREN [reduce using rule 3 (nt_statement_list)]
T_IF [reduce using rule 3 (nt_statement_list)]
T_FOR [reduce using rule 3 (nt_statement_list)]
//code omitted
When presented with a T_OPEN_PAREN, the parser has the choice of shifting the token and going to a new state (line 3 in the listing above) or reducing the empty stack using rule 3 (line 9 in the listing above). You should inspect the states to which the parser moves after it does the shift. You can use regular expressions to find state 1 (or just scroll down the screen a little).
parser1.output (state 1)
state 1:
nt_set -> T_OPEN_PAREN . nt_set T_CLOSE_PAREN
nt_number -> T_OPEN_PAREN . nt_number T_CLOSE_PAREN
nt_boolean -> T_OPEN_PAREN . nt_boolean T_CLOSE_PAREN
The non-terminal to the left of the arrow (->) is the pattern the parser is attempting to match. The period (.) in the listing indicates the current location of the parser in this attempt. Items after the arrow and to the left of the period are on the stack and items to the right are things that the parser would expect to see in the input stream. In this case, the parser is trying to match an nt_set, an nt_number or an nt_boolean. It will have a T_OPEN_PAREN on the stack and is expecting either an nt_set, nt_number or nt_boolean. The parser is attempting to match an expression and in Miniset, an expression followed by a semicolon is a valid statement (causing the value of the expression to be printed). From these clues, you can infer that the parser is attempting to build an nt_statement on the stack.
The alternative to this shift is to reduce the stack using rule 3. At the top of the file parser1.output you will find a list of rules. Rule 3 is the following:
parser1.output (rule 3)
rule 3 nt_statement_list -> /* empty */
The reduction would use the empty stack as the start of an nt_statement_list. Since we want to allow an empty statement list, we will want to allow the parser to reduce with rule 3 and then allow the nt_statement_list on the top of the stack to append the first nt_statement. This changes the rule for nt_statement_list to the following:
parser2.y
nt_statement_list
:
/* empty */
|
nt_statement_list nt_statement
;
This problem isn't limited to nt_statement_lists and I have also adjusted the rules for nt_parameter_list, nt_number_list and nt_expression_list in parser2.y.
If you look at all the states with conflicts that are not caused by lists, it becomes clear that all of them involve operators. An example of this type of conflict can be seen in state 90.
parser1.output (state 90)
state 90
nt_number -> nt_number . T_MULTIPLY nt_number (rule 68)
nt_number -> nt_number . T_ADD nt_number (rule 69)
nt_number -> nt_number T_ADD nt_number . (rule 69)
T_ADD shift, and go to state 58
T_MULTIPLY shift, and go to state 59
T_ADD [reduce using rule 69 (nt_number)]
T_MULTIPLY [reduce using rule 69 (nt_number)]
$default reduce using rule 69 (nt_number)
Conflicts only occur when a reduction is possible, which means the trouble is with the third rule (line 5) of this state: the stack contains an expression like "2 + 3" and a reduction is possible. The conflict arises if either a "+" or a "*" is seen next; in that case, the parser could shift the new token onto the stack (eventually reducing the expression from right to left) instead of reducing immediately. Shifting would be the correct behavior if the new token were a "*", since multiplication should be performed before addition. On the other hand, if the next token were a "+", then the addition already on the stack should be reduced first (since addition is done from left to right).
The appropriate behavior depends on the precedence of the operators involved. One solution to this problem is to introduce some new non-terminals for factors and terms and then rewrite the operator rules to reflect when shifts and reduces should be performed. Fortunately, this problem is common enough (and the introduction of new non-terminals confusing enough) that bison provides a simpler solution.
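For reference, the rewrite with extra non-terminals would look roughly like this (we will not use it; nt_term and nt_factor are names invented for this sketch):

```yacc
nt_number
	:
	nt_term
	|
	nt_number T_ADD nt_term
	;
nt_term
	:
	nt_factor
	|
	nt_term T_MULTIPLY nt_factor
	;
nt_factor
	:
	T_NUMBER_LITERAL
	|
	T_OPEN_PAREN nt_number T_CLOSE_PAREN
	;
```

Here multiplication binds more tightly than addition because a nt_term must be fully reduced before it can take part in an addition, but the grammar has grown three rules where one would do.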
In addition to the %token declarations we used in parser1.y, you can declare operator tokens using %left, %right or %nonassoc. These identifiers determine whether the operators associate to the left (meaning a + b + c = (a + b) + c), to the right (meaning a + b + c = a + (b + c)), or whether it is an error to omit explicit grouping (%nonassoc). In addition to indicating the associativity of an operator, the %left, %right and %nonassoc declarations provide the operators with a precedence level based on their order of definition. The sooner the operator is defined, the lower its precedence. We therefore change the operator token definitions in parser2.y to the following:
parser2.y
//operations
%nonassoc T_GETS
%nonassoc T_EQUAL, T_LESSTHAN
%nonassoc T_UNION, T_INTERSECT
%nonassoc T_NOT, T_AND, T_OR
%left T_ADD
%left T_MULTIPLY
The T_GETS operator (assignment) is given the lowest precedence so that it will always be executed last, and T_EQUAL and T_LESSTHAN are given the second lowest precedence. Operators on one type of variable won't conflict with operators for a different type of variable, so the placement of the set and boolean operators relative to the number operators is unimportant. The set and boolean operators are declared non-associative (meaning that a Miniset programmer must always provide explicit grouping). Finally, the addition and multiplication operators are declared as left associative with addition having a lower precedence level than multiplication.
Processing the parser specification
When you issue the Make command (with the Verbose option), there are 20 warnings, one for each place a precedence rule is used. You should check the parser2.output file to confirm that the correct action is being taken in each case. At the top of parser2.output is a listing of where precedence rules were used.
parser2.output (top of file)
Conflict in state 49 between rule 72 and token T_EQUAL
resolved as reduce.
Conflict in state 49 between rule 72 and token T_AND
resolved as an error.
Conflict in state 49 between rule 72 and token T_OR
resolved as an error.
//code omitted
state 49
nt_boolean -> nt_boolean . T_AND nt_boolean (rule 70)
nt_boolean -> nt_boolean . T_OR nt_boolean (rule 71)
nt_boolean -> T_NOT nt_boolean . (rule 72)
nt_boolean -> nt_boolean . T_EQUAL nt_boolean (rule 75)
T_AND error (nonassociative)
T_OR error (nonassociative)
$default reduce using rule 72 (nt_boolean)
The conflicts occur in the third rule reduction (line 5 of state 49) where a reduction is possible. If an equal sign is seen, a reduction is made (since the equality operator has the second lowest precedence). If either of the other boolean operators is seen, the parser signals an error, indicating that the input stream contains a sequence of tokens which cannot occur in the language. This is the correct action here because the boolean operators require explicit grouping. In Parser 7, some use is made of bison's special error token (which you don't define yourself) to allow for better error handling.
Now that the only conflicts have been explicitly handled by the parser specification, issuing the Make command without the Verbose option will result in no messages.
Parser 3
Adding the C/C++ code
We are now ready to start adding the C/C++ code. Up until now we have been ignoring the files created by the bison preprocessor. The output files will have the same root name, but will have the file suffix y replaced by the file suffix tab.c. The goal of Parser 3 is to write a parser specification (parser3.y) from which bison will create a source file (named parser3.tab.c) which will compile. We need to add the C/C++ code to be executed as rules are reduced to parser3.y. In addition, we will need the header files for any classes we use (although we won't actually need the source files for these classes until Parser 4 where we try to link the project).
The definition section undergoes a fairly significant change.
parser3.y (Definition Section)
%{
#include "parser.h"
#include <alloca.h>
namespace {
inline int yylex(void) { return 0; }
}
%}
%union {
CStatementList* StatementList;
CStatementNode* Statement;
CElseIfList* ElseIfList;
CElseIfNode* ElseIf;
CElseNode* Else;
//code omitted
}
//code omitted
%type<ElseIfList> nt_elseif_list
%type<ElseIf> nt_elseif
%type<Else> nt_else
%type<IdentifierList> nt_parameter_list
%type<Identifier> nt_parameter
%type<NumberList> nt_number_list
%type<Number> nt_number
%type<Boolean> nt_boolean
// code omitted
%token<Identifier> T_SET_IDENTIFIER
// code omitted
The code between the %{ and the %} is C/C++ code which will be placed at the top of the source file. We need to include the header parser.h (which contains our class headers) as well as the header alloca.h (which contains the memory routine alloca() used by the parser). We also define a stub function yylex() (in Parser 4, we will have flex generate this function).
Recall that the values of the tokens and non-terminals are available from within the code-block associated with a reduction to a non-terminal. These values are stored in a union data-type and the %union declaration determines the members (and types) of that union. In our case, these are all pointers to C++ objects. I will talk about these classes (very briefly) in the discussion of Parser 4, for now it suffices to say that they are all subclasses of an abstract class called CNode.
Next comes a sequence of %type declarations. Frequently in the rule reductions for a non-terminal, you will assign a value to the non-terminal (so that it can be used in later reductions). These %type declarations describe to which member of the union the parser will assign the value. You will only need a %type declaration for those parts of the grammar whose values you use in some rule reduction. If a token has a value associated with it, you identify its associated union member in the %token declaration (with the same syntax as the %type declarations).
Because all of the types are subclasses of the class CNode, one might be tempted to use polymorphism to reduce the union to just a CNode*. However, doing this would require the class constructors to accept a generic CNode* and then downcast it in the constructor. Taking the time to correctly describe the union (as well as specify the %type declarations below) will allow more rigorous static type checking.
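A minimal sketch of such a hierarchy (the names follow the article's conventions, but the bodies here are illustrative; the real project classes carry considerably more state):

```cpp
#include <string>

// Sketch of the CNode strategy: one abstract base class, one subclass
// per kind of non-terminal in the grammar.
class CNode {
public:
    virtual ~CNode() {}
    virtual std::string Kind() const = 0;
};

// Corresponds to the non-terminal nt_statement.
class CStatementNode : public CNode {
public:
    virtual std::string Kind() const { return "statement"; }
};

// nt_assign_statement corresponds to a subclass of CStatementNode.
class CAssignNode : public CStatementNode {
public:
    virtual std::string Kind() const { return "assign"; }
};
```

Keeping the union typed with the specific subclass pointers (rather than a single CNode*) means the compiler, not a run-time downcast, checks that each rule hands the right kind of node to each constructor.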
The rules section also sees substantial changes. As described in the general introduction to flex and bison, we attempt to generate an abstract syntax tree reflecting the structure of the input stream. My general strategy is to build a C++ class hierarchy matching the grammatical hierarchy. This means that if there is a non-terminal called nt_statement, then I create a class called CStatementNode, similarly, the non-terminal nt_assign_statement corresponds to a subclass of CStatementNode (called CAssignNode). Almost all code blocks are then simply passing the values from the left side of the reduction to the constructor of the class corresponding to the non-terminal being reduced.
The value of the non-terminal being reduced can be accessed as the "variable" $$. The values of the non-terminals and tokens on the right are accessed through the "variables" $1, $2, $3, and so forth. All non-terminals and tokens on the right are assigned a $ variable (even those which do not have an associated union member). However, you should only use those non-terminals and tokens for which you have provided a type (and bison will complain if you try to do otherwise). If no code block is given, the default action is
$$ = $1;
This action assigns the non-terminal the value of the first item on the right side of the reduction. I tend to write this rule explicitly where I will use the value of the result and let bison provide it where I do not (e.g., in the reduction for nt_end_if in the code below). The following code is a sample of what the new code blocks look like.
parser3.y (Rules Section)
nt_program
:
nt_statement_list nt_quit_statement
{
gProgramP = $1->Append($2);
YYACCEPT;
}
;
//code omitted
nt_assign_statement
:
T_SET_IDENTIFIER T_GETS nt_set T_SEMICOLON
{
$$ = new CSetAssignNode($1, $3);
}
|
T_NUMBER_IDENTIFIER T_GETS nt_number T_SEMICOLON
{
$$ = new CNumberAssignNode($1, $3);
}
|
//code omitted
;
//code omitted
nt_end_if
:
T_END
|
T_END T_IF
;
//code omitted
nt_number_list
:
/* empty */
{
$$ = new CNumberList();
}
|
nt_number_list T_COMMA nt_number
{
$$ = $1->Append($3);
}
;
//code omitted
The nt_program rule appends the nt_quit_statement to a global CStatementList* (a global variable is the standard way in which information from yyparse() reaches the outside world) and then uses the macro YYACCEPT which causes the parser to return a value indicating that the input was acceptable.
The nt_assign_statement rule reductions demonstrate the usual rule structure. We pass the right-side non-terminal and token values to the constructor of an object for the left side. This makes writing the class constructors exceptionally easy: they are almost completely dictated by your parser specification.
The section of code for an nt_number_list shows how lists are generated. The list is created with a default constructor (from an /* empty */ non-terminal) and then following nodes are appended to the list.
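The list pattern relies on Append() returning the list itself, so the reduction can write $$ = $1->Append($3). A hedged sketch of such a list class (the real CNumberList also takes ownership of its nodes, which is omitted here):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A minimal stand-in for the article's number node.
class CNumberNode {
public:
    explicit CNumberNode(long inValue) : mValue(inValue) {}
    long Value() const { return mValue; }
private:
    long mValue;
};

// Append returns the list itself; the reduced non-terminal reuses the
// same list object rather than allocating a new one per element.
class CNumberList {
public:
    CNumberList() {}
    explicit CNumberList(CNumberNode* inFirst) { mItems.push_back(inFirst); }
    CNumberList* Append(CNumberNode* inItem)
    {
        mItems.push_back(inItem);
        return this;
    }
    std::size_t Size() const { return mItems.size(); }
private:
    std::vector<CNumberNode*> mItems;   // ownership handling omitted
};
```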
One thing to notice is that there are a lot of new statements and not a single delete. The CNode class and its subclasses take ownership of pointers passed to them and guarantee that these nodes will be deleted (through the use of an auto_ptr style template class). Furthermore, CNode is designed so that it cannot be allocated on the stack and so the call to delete will always be to a pointer originally created with new. This type of care is necessary in generating parsers with flex and bison because you don't typically know how the flow of control moves through the resulting source code. And because your parser may be required to deal with hundreds (or thousands) of tokens and non-terminals in a single run, losing even a little memory for each one can result in severe memory problems.
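One way to forbid stack allocation (an assumption for illustration, not necessarily the article's exact mechanism) is to make the destructor protected: stack instances and direct delete both fail to compile, and destruction goes through a member function that is guaranteed to be matched with new.

```cpp
#include <cassert>

// A protected destructor forbids both stack instances and direct delete
// from outside the class; callers must go through Destroy().
class CNode {
public:
    CNode() { ++sLiveCount; }
    void Destroy() { delete this; }   // legal: a member may invoke the dtor
    static int sLiveCount;            // tracks live nodes, for demonstration
protected:
    virtual ~CNode() { --sLiveCount; }
};
int CNode::sLiveCount = 0;

class CNumberNode : public CNode {
public:
    explicit CNumberNode(long inValue) : mValue(inValue) {}
    long Value() const { return mValue; }
protected:
    ~CNumberNode() {}   // keep the destructor protected in subclasses too
private:
    long mValue;
};

// CNumberNode theNode(2);   // would not compile: destructor is protected
```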
Processing the parser specification
When you issue the Make command, the parser3.y specification should generate a parser3.tab.c source file and this file will compile without error. You should get a link error (__start hasn't been defined since we don't have a main() function). One thing to be careful about here is that it is very easy to try and fix your C/C++ errors in the compiler Errors & Warnings Window. This is a mistake since the next time you process the parser specification, you will overwrite the file with your fixes. Personally, I do tend to fix the compile errors in the Errors & Warnings Window (making a note of what I fix) and once I get a clean compile, I go back and fix the code in the specification. This is because (being really impatient) I don't want to wait while bison runs after each fix to see if the fix was correct. You just need to remember to go back and change the parser specification before you run bison again. Now that we appear to have a source file for the parser, it is time to try to build the first application.
Parser 4
The Abstract Syntax Tree
There's nothing terribly tricky in the CNode subclasses. You might want to peruse some of the source files to convince yourself that they do what I say they are going to do. The classes are really skeletal in the sense that they provide no facilities to interpret the Miniset language. If we were going to implement the interpreter, I would have added Execute() methods to the CStatementNodes and Value() methods to the CBooleanNodes, CNumberNodes and CSetNodes.
Adjusting the parser specification
There are very few changes to the parser specification. The first is to provide access to the scanner created by flex. Unfortunately, all the source files generated by flex will have the same name unless we enable the Gen. prefix from file name option in the Flex preference panel, which causes the files to be named uniquely. The downside is that the scanner functions are also named uniquely, so we need to provide some glue.
parser4.y (Definitions Section)
namespace {
inline int yylex(void) { return scanner4lex(); }
}
This glue is needed because yyparse() calls yylex().
The second change is to have the parser print out any top-level statement it encounters just before appending it to the statement list being created.
parser4.y (Rules Section)
//code omitted
nt_statement_list
:
/* empty */
{
$$ = new CStatementList();
}
|
nt_statement_list nt_statement
{
$$ = $1->Append($2);
std::cout << std::endl;
$2->PrettyPrint(std::cout, 1);
std::cout << std::endl;
}
;
//code omitted
The last changes are to the user routine section. First, we add the definition of the global variable gProgramP which is used to pass the result of the call to yyparse() to the outside world. We also add a very rudimentary error routine called yyerror() (which will be required by both the parser and the scanner even if you never call it in your own code). The error routine just prints out a short (and not very informative) error. Better error handling is implemented in Parser 7.
Writing the scanner specification
The scanner specification (in scanner4.l) is all that remains to discuss. The flex plug-in will take a scanner specification and generate the code needed for the function yylex() (or in our case scanner4lex()). The specification has the same layout as the parser specification with a Definitions Section, a Rules Section and a User Routine Section. These sections serve the same purposes as the sections in the parser specification.
scanner4.l (Definitions Section)
%{
#include "parser.h"
#include "parser4.tab.h"
#ifdef yywrap
# undef yywrap
#endif
inline int yywrap(void) { return 0; };
#define YY_NEVER_INTERACTIVE 1
#ifdef YY_INPUT
# undef YY_INPUT
#endif
#define YY_INPUT(inBuffer, outResult, inMaxSize) \
((outResult) = FillBuffer((inBuffer), (inMaxSize)))
namespace {
const long theBufferSize = 8*1024;
char theBuffer[theBufferSize];
long theCharCount;
inline int FillBuffer(char* inBuffer, int inMaxSize)
{
if (0 == theCharCount)
{
std::cin.getline(theBuffer, theBufferSize);
// -1 since getline() consumes the \n
// but does not store it in theBuffer
theCharCount = std::cin.gcount() - 1;
}
int theFillSize = ( (inMaxSize < theCharCount) ?
inMaxSize : theCharCount );
if (theFillSize > 0)
{
std::memcpy(inBuffer, theBuffer, theFillSize);
theCharCount -= theFillSize;
// memmove, not memcpy: the regions overlap
std::memmove(theBuffer, theBuffer + theFillSize,
theCharCount);
}
return theFillSize;
}
}
%}
DIGIT ([0-9])
UCHAR ([A-Z])
LCHAR ([a-z])
USCR ([_])
CHAR ([A-Za-z_])
WS ([\t\n\r ])
In the Definitions Section, we start by defining some of the macros and functions needed by the scanner. The yywrap() function is defined to return 0, indicating that the scanner can always expect more input (there will be no end-of-file since the user supplies input from the console window). The YY_INPUT macro fills flex's internal buffer from the stream; the FillBuffer() function (in an unnamed namespace) serves this purpose for us by calling getline() on std::cin.
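The essential behavior of FillBuffer() is that one stored line is handed to flex in pieces no larger than inMaxSize until the line is drained. Here is a console-free re-creation of that chunking logic (the line comes from a string instead of std::cin so it can be exercised directly; memmove is used for the overlapping copy):

```cpp
#include <cassert>
#include <cstring>
#include <string>

namespace {
    const long theBufferSize = 8 * 1024;
    char theBuffer[theBufferSize];
    long theCharCount = 0;
    std::string theFakeLine;   // stands in for the line read from std::cin

    int FillBuffer(char* inBuffer, int inMaxSize)
    {
        if (0 == theCharCount) {
            // in the real scanner this is where getline() runs
            theCharCount = theFakeLine.size();
            std::memcpy(theBuffer, theFakeLine.data(), theCharCount);
        }
        int theFillSize = ((inMaxSize < theCharCount) ?
                           inMaxSize : theCharCount);
        if (theFillSize > 0) {
            std::memcpy(inBuffer, theBuffer, theFillSize);
            theCharCount -= theFillSize;
            // memmove: source and destination overlap
            std::memmove(theBuffer, theBuffer + theFillSize, theCharCount);
        }
        return theFillSize;
    }
}
```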
The remaining lines declare abbreviations for commonly used character classes (digits, uppercase characters, lowercase characters, underscores, all identifier characters and whitespace). A complete description of regular expressions is beyond the scope of this article, but a quick overview of those used here:
- Characters in square braces [ ] indicate one of the included characters
- Characters followed by a * indicate zero or more occurrences
- Characters followed by a + indicate one or more occurrences
- A period (.) indicates any character other than a newline
Each token is associated with a block of C/C++ code. Just as in the parser specification, this code will be executed when the token is matched.
scanner4.l (Rules Section)
"," { return T_COMMA; }
";" { return T_SEMICOLON; }
"(" { return T_OPEN_PAREN; }
")" { return T_CLOSE_PAREN; }
end { return T_END; }
if { return T_IF; }
elif |
elseif { return T_ELSE_IF; }
//code omitted
{DIGIT}+ {
yylval.Number = new CNumberLiteralNode(yytext);
return T_NUMBER_LITERAL;
}
{UCHAR}{CHAR}* {
yylval.Identifier = new CIdentifierNode(yytext);
return T_SET_IDENTIFIER;
}
//code omitted
{WS} { /*ignore whitespace*/ }
"//". { /*ignore comments*/ }
. { yyerror("Bad Character"); }
In general, the C/C++ code simply returns the token type to the parser, as in the case of a comma, a semicolon, etc. Symbols are enclosed in double-quotes to remove any special meanings. If the action for a pattern is a single vertical bar (|), then the action for the following pattern is used. For example, both "elif" and "elseif" produce the token T_ELSE_IF.
The rules for number literal tokens and identifiers need to do a little more work. These tokens provide more information than just the type of the token scanned; the parser will also need to know the value of the token. In each case, we construct a node of the appropriate type and assign it to the variable yylval. The type of this variable is the union defined in the file parser4.y, and we need to assign the information to the appropriate union member. The scanner variable yytext contains the text of the token scanned. You will need to copy yytext if you wish to use it outside of the action block.
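The copy matters because yytext is a buffer the scanner reuses for every token. A hedged sketch of an identifier node making that copy in its constructor (the class is simplified; the real nodes also live in the CNode hierarchy):

```cpp
#include <cassert>
#include <cstring>
#include <string>

// The constructor copies the characters out of the scanner's buffer
// into a std::string the node owns, so later tokens cannot clobber it.
class CIdentifierNode {
public:
    explicit CIdentifierNode(const char* inText) : mName(inText) {}
    const std::string& Name() const { return mName; }
private:
    std::string mName;   // owns its own copy of the token text
};
```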
Finally, white space and comments are ignored. The scanner will scan them, but without a return statement in the code block, the parser will never see them. The last rule matches any other token and calls the function yyerror (defined in parser4.y) which will print an error message and halt the parse.
Writing main
Ignoring the debugging code, main is short and simple. We parse the user input until it completes a program or there is a parse error. If a program is completed, we echo it out, otherwise, we print "Program Incomplete".
main.cp
int main(void)
{
//Debugging code omitted
std::cout << "Enter some miniset code:" << std::endl;
std::cout << std::endl;
int theResult = yyparse();
if (0 == theResult)
{
std::cout << std::endl;
std::cout << "" << std::endl;
std::cout << "Program Complete" << std::endl;
std::cout << "" << std::endl;
std::cout << std::endl;
gProgramP->PrettyPrint(std::cout, 1);
}
else
{
std::cout << std::endl;
std::cout << "" << std::endl;
std::cout << "Program Incomplete" << std::endl;
std::cout << "" << std::endl;
std::cout << std::endl;
}
}
Running Parser 4
We now have something we can build and run, so issue the Run command. The target already has a number of debugging settings selected. In particular, Generate debug scanner is selected in the Flex preference panel and Generate Debug Parser is selected in the Bison preference panel. When the parser asks if you want to create a debug parser, answer with anything other than y. We will use a debug parser in a moment, but not quite yet.
The output in parentheses is feedback from the scanner. Start with a simple assignment statement, for example, type
x <- 2;
and press return. The parser should accept this as a complete statement, but instead the scanner complains of an "empty buffer" and requires you to type the next line before the first statement is echoed.
If you want more feedback, quit Parser 4, restart Parser 4 and type y at the start of the session. This will turn on the debugging code in the parser and now you will be told what the parser is doing as well as the scanner. If you do this, then you will see that the parser gets the 2 as a number literal, but then before the parser sees the semicolon, the scanner responds with the "empty buffer" message. The problem is that the scanner needs to know that the token (starting with the semicolon) is finished before it can decide what to do next. Unfortunately, when the scanner tries to read ahead, it finds that the buffer is empty and so it calls our FillBuffer() routine which waits for another line of input. This is why we will need to type a new line before the semicolon gets processed.
Parser 5
Adjusting the scanner specification
The new line at the end of the command should be scanned as white space, but we have discarded it when we called getline() in the FillBuffer() routine. We simply need to add it back into theBuffer manually.
scanner5.l (FillBuffer)
inline int FillBuffer(char* inBuffer, int inMaxSize)
{
if (0 == theCharCount)
{
std::cin.getline(theBuffer, theBufferSize - 1);
theCharCount = std::cin.gcount();
theBuffer[theCharCount - 1] = '\n';
theBuffer[theCharCount] = '\0';
}
int theFillSize = ( (inMaxSize < theCharCount) ?
inMaxSize : theCharCount );
if (theFillSize > 0)
{
std::memcpy(inBuffer, theBuffer, theFillSize);
theCharCount -= theFillSize;
// memmove, not memcpy: the regions overlap
std::memmove(theBuffer,
theBuffer + theFillSize, theCharCount);
}
return theFillSize;
}
Running Parser 5
That's enough work on the scanner. At this point, I believe we are down to one bug in the parser. Because we will be debugging the parser, the Verbose option is set in the Bison preference panel so that we can use the parser5.output file if we need it (I predict we will). Issue the Run command, type y at the first prompt, and try to define a non-empty set:
X <- { 1, 2, 3};
This will be unsuccessful, and the parser will give you more information than you probably want. Here are the few lines of most interest.
Parser 5 console output
Shifting token 277 (T_OPEN_SET), Entering state 8
Reducing via rule 58 (line 427), -> nt_number_list
state stack now 0 1 12 44 8
Entering state 42
Reading a token: Next token is 279 (T_NUMBER_LITERAL)
!Urk: parse error
The error occurs when the parser has reduced an /* empty */ to an nt_number_list and sees the first number (in this case a T_NUMBER_LITERAL). This causes a parse error and the stack is popped since we have no error handling mechanisms. The important information is that the parser was in state 42 when the T_NUMBER_LITERAL was scanned. We would expect that a T_NUMBER_LITERAL would be valid at this point, so let's see what state 42 thinks is valid.
parser5.output
state 42
nt_set -> T_OPEN_SET nt_number_list . T_CLOSE_SET
(rule 53)
nt_number_list -> nt_number_list . T_COMMA nt_number
(rule 59)
T_COMMA shift, and go to state 71
T_CLOSE_SET shift, and go to state 72
The first possibility in state 42 expects a T_CLOSE_SET and in this case the parser will reduce an nt_set (in our case an empty set). You can check that this works correctly in Parser 5. In the second case, the parser expects a T_COMMA followed by an nt_number in which case it will reduce an nt_number_list. Since we want the list to be extended with the next number, this is the case on which we will focus.
First, we confirm that we have found the problem by testing the statement:
X <- { ,1, 2, 3 };
Parser 5 accepts this as valid, but then echoes:
X <- { 1, 2, 3 };
You can confirm that the parser exhibits the same behavior with parameter lists and expression lists but behaves correctly with regard to statement lists. The problem is that the first number in a number list is different from every other number in the list because it is not preceded by a comma (equivalently, the last number is different because it is not followed by a comma). In this way, the comma is a different type of punctuation than the semicolon (the former is a separator and the latter is a terminator). We will now make the parser specification appreciate this fact.
Parser 6
Adjusting the parser specification
One method for solving this problem is to create a new non-terminal and expand the rules for these lists. The new non-terminal is nt_nonempty_number_list, and the parser can identify whether you want the empty or non-empty version by checking the look-ahead token (if it is an nt_number, the list is non-empty). The specifications for nt_parameter_list and nt_expression_list are adjusted in a similar manner.
parser6.y (Rules Section)
nt_number_list
:
/* empty */
{
$$ = new CNumberList();
}
|
nt_nonempty_number_list
{
$$ = $1;
}
;
nt_nonempty_number_list
:
nt_number
{
$$ = new CNumberList($1);
}
|
nt_nonempty_number_list T_COMMA nt_number
{
$$ = $1->Append($3);
}
;
Running Parser 6
Congratulations, you have now worked through the process of constructing a parser. You can verify that all the lists act as expected and that all the other bugs we hunted down earlier have been correctly resolved (e.g., operator precedence). The next two targets add some amenities, but the hard work has been completed.
Parsers 7 & 8
At this point, I've run out of words (or more correctly, the column has run out of space). Parsers 7 and 8 are slightly more complex than Parser 6 and provide some extra amenities. If I have intrigued you with the idea of writing your own parser, you may want to pick up Levine's book or the bison and flex manuals and see if you can understand what I have done in these last two targets. I describe some of the added features next.
Parser 7
Parser 7 recovers from syntax errors more gracefully than just aborting. It will discard everything back to the last complete statement and then ask the user to continue from that point. While this would probably be unacceptable if Miniset were designed for batch compilation, the strategy is appropriate for an interpreted language.
Parser 7 also adds some rudimentary error handling. In particular, it catches the mistake of using the wrong type on the left side of an assignment statement. For example, I frequently find myself typing:
x <- { 1, 2, 3};
but in Miniset this is a type error (x is a number variable because it starts with a lowercase letter).
Parser 8
Parser 8 doesn't enhance the user's interface with Miniset, but it does help the programming interface. Parser 8 encapsulates the whole parsing mechanism into a C++ class. The class is implemented in the files CParser.cp, scanner8.l and parser8.y with all interfaces in the header file CParser.h. Although all methods of the CParser class are static, we could set up flex and bison to avoid the use of global variables and make separate CParser objects. This would be important if Miniset were being run on a server with each user working with a separate CParser instantiation. Parser 8 also allows the user to take input from a file instead of from the keyboard. This is used to create a consistency check on the scanner and parser. Once a program is completed it is printed to a file and then reparsed and reprinted to a second file. The resulting two files should be identical.
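The final comparison step of that consistency check can be as simple as reading both pretty-printed files and confirming they are byte-for-byte identical. A hedged sketch (the function name and file names are made up for illustration):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read both files completely and compare their contents; returns false
// if either file cannot be opened.
bool FilesMatch(const std::string& inPathA, const std::string& inPathB)
{
    std::ifstream theFileA(inPathA.c_str());
    std::ifstream theFileB(inPathB.c_str());
    if (!theFileA || !theFileB) return false;
    std::ostringstream theTextA, theTextB;
    theTextA << theFileA.rdbuf();
    theTextB << theFileB.rdbuf();
    return theTextA.str() == theTextB.str();
}
```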
Where To Go Now
If this intrigues you, the first thing to get is the flex and bison plug-ins at <http://www.ambrosiaSW.com/~fprefect/plugins.html>
You can get bison documentation at: <http://www.gnu.org/manual/bison/index.html>
and I found flex documentation at: <http://www.delorie.com/gnu/docs/flex/>
If you've got some money, the best introduction to flex and bison is probably the book lex & yacc by John R. Levine, Tony Mason and Doug Brown. O'Reilly & Associates, Inc. printed the second edition in 1992 (with minor corrections in 1995).
You might also want to look into the Purdue Compiler Construction Tool Set (PCCTS). These tools are an alternative to flex and bison and you can find information about them at <http://www.antlr.org/>
Even if you stick with flex and bison, the section "Exactly 1800 Words On Languages and Parsing" in Language Translation Using PCCTS and C++ (free for download at the URL given above) is worth reading if you are interested in some of the theory behind grammars.
Aaron Montgomery wastes (excuse me, spends) a lot of time thinking about things like this when he really ought to be preparing his calculus lecture notes. Currently, he is finishing up a 3 year Visiting Assistant Professorship at Purdue University North Central and moving on to an Assistant Professorship at Central Washington University starting in Fall of 2000. He hopes to use the money he gets from this article to buy a bunch of Rokenbok remote control vehicles and building sets.