Parsing with YACC
Volume Number: 6
Issue Number: 4
Column Tag: Language Translation
Introduction to Parsing and Inside YACC
By Clifford Story, Mount Prospect, IL
Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.
Part I. Introduction to Parsing
A. Introduction
This is the first of a series of articles on language translation. This is commonly called parsing, and I will no doubt lapse from time to time into this usage, but it is better to reserve the word for one phase of translation.
The techniques I will discuss are used in compilers and interpreters, of course, but also in many more-common circumstances. For example, when you add a formula to a spreadsheet, the application must translate the formula into a form it can use. Many database programs let you specify formats for input fields -- these must translate your format, and then use the result to translate the input. Consider the expression search in MPW. I guarantee you there's a translator operating there. And, of course, my own program Idealiner uses a translator when you specify topic numbers.
All the code examples in this series of articles will be MPW tools, to minimize interface considerations. I doubt if you'll see a single Macintosh trap before the tenth article. I will focus instead on the internals of language translation.
Language translation can be divided into three stages: first, the LEXICAL ANALYZER (also called the scanner) divides the input text into individual symbols, pieces like identifiers, operators and constants; then the PARSER interprets the sequence of symbols according to the language's grammar; and lastly the CODE GENERATOR responds in an appropriate fashion to the statements the parser has processed. These three phases operate simultaneously, the parser calling the lexical analyzer for input, and then calling the code generator for output.
A(1). Parsing
The first three parts in the series will focus on parsing.
The first part of the article (this very one) will introduce parsing and the parser-generator YACC (Yet Another Compiler Compiler). YACC converts a set of GRAMMAR RULES into C source for a parser that will accept that grammar. You still have to write the lexical analyzer and the code generator (although YACC provides a framework for code generation) but YACC takes care of the most tedious part of the job.
The second part will go beneath the surface and explore the inner workings of YACC. This will be somewhat technical, and may seem unnecessary, but it will provide an essential foundation for later topics. The article will cover deriving parse tables from YACC output, parsing expressions by hand (using the parse tables), and YACC's debug output.
The next article will discuss ambiguities, association and error detection. An understanding of the first two topics is essential to writing correct and efficient grammars. The third topic covers detecting and reporting errors by type and location.
A(2). Lexical Analysis
With the fourth article, I will return to the first stage of language translation, lexical analysis. The article will begin with a discussion of state machines, then present a simple but effective table-driven analyzer.
The fifth article will be an excursion into a seemingly unrelated field: hash tables and binary trees. The idea is to develop some tools to increase the power of the lexical analyzer in the following article.
The sixth article will extend the analyzer by adding a symbol table. The routines developed in the fifth article will give us a way to save symbols; the example program will be an improved version of the MPW Canon tool.
A(3). And a Useful Example
Now, I'm pretty sure of the preceding, because I've already written the articles! What follows is a forecast of what I'll do next.
The plan is to build an MPW tool that will preprocess either Pascal or C source files and convert inline assembly code into Pascal inline routines or C direct functions, as appropriate. That is, you will be able to write assembly language instructions, and the preprocessor will convert your assembly into machine language. Your assembler routines will be declared with high-level headers, a la Lightspeed C, and you will be able to refer to routine arguments and local variables by name, rather than indexing off the stack, a real convenience. I'm going to write this tool because, damn it, I want to use it!
This will be a major project: I'll need to parse Pascal and C well enough to find routines, and I'll need to parse assembler completely. I'll then need to write a more-or-less complete assembler. It could take six or eight columns.
B. Grammar Descriptions
Here's a reference for this topic: Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman (Addison-Wesley 1986). You may have heard of some of these guys. I will refer to it as ASU.
A computer language is composed of a set of symbols (the words of the language), and a set of grammar rules that determine how these words are formed into statements (the sentences of a computer language). As I said earlier, the lexical analyzer is in charge of extracting the words from the input; it is the job of the parser to make meaningful sentences from these words.
A bit of terminology: classes of symbols (such as integer or identifier) are called TOKENS, and specific instances of tokens (such as 252 or sysbeep) are called LEXEMES. The lexical analyzer will typically determine both the type and value of a symbol that it reads; the former is the token classification and the latter is the lexical value. The parser cares about tokens; the code generator cares about values.
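The distinction can be sketched in C. This is only an illustration, not code from the Hex tool: the names token_class, symbol and classify() are made up for the example, and the rules for what counts as an integer or an identifier are assumptions.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token classes; the names are illustrative only. */
enum token_class { TOK_INTEGER, TOK_IDENTIFIER, TOK_OTHER };

/* A scanned symbol: the token class plus the lexeme it was read from. */
struct symbol {
    enum token_class token;   /* what the parser looks at */
    const char *lexeme;       /* the text as it appeared in the input */
    long value;               /* lexical value (integers only) */
};

/* Classify one lexeme that contains no white space. */
struct symbol classify(const char *lexeme)
{
    struct symbol s;
    const char *p = lexeme;

    s.lexeme = lexeme;
    s.value = 0;
    s.token = TOK_OTHER;

    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        if (*p == '\0') {                 /* digits all the way through */
            s.token = TOK_INTEGER;
            s.value = strtol(lexeme, NULL, 10);
        }
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        if (*p == '\0')
            s.token = TOK_IDENTIFIER;
    }
    return s;
}
```

Here classify("252") and classify("sysbeep") yield different token classes, while the lexeme field keeps the text that was actually read.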
I guarantee that I will use the word token to mean both token and lexeme. The meaning should be clear from the context.
B(1). Grammar Rules
A grammar rule is called a PRODUCTION. It is a substitution rule; the single symbol of the left-hand side (often, but not by me, abbreviated LHS) produces the expansion on the right-hand side (RHS). Here's an example:
stmt -> ID '=' expr ';'
A stmt can be expanded to (that's what the "->" means) a series of symbols consisting of an ID, an "=" sign, an expr and a semicolon. Symbols within quotation marks are explicit, and must appear as written; other symbols are token classes.
Recursive definition is permitted. Here's an example:
expr -> expr '+' NUM
Applying this rule to the first, we discover that:
stmt -> ID '=' expr '+' NUM ';'
stmt -> ID '=' expr '+' NUM '+' NUM ';'
And so on. It is possible to determine a complete language such as Pascal or C with relatively few such productions, even though there are infinitely-many legal statements in the language.
Now, everyone can make up his own conventions, of course, but I will distinguish two kinds of non-explicit symbols by showing one in all-caps and one in all-lower-case. All-caps symbols are defined by the lexical analyzer, not by the parser. Thus, they will not appear on the left-hand side of any production. These symbols are called TERMINALS, because they terminate expansion. When you expand to a terminal, you can expand no further. The lower-case symbols are, surprise!, NON-TERMINALS. They are defined by the parser rather than the lexical analyzer, appear somewhere as the left-hand side of one or more productions, and do not terminate expansion. These distinctions are important!
As the above examples suggest, several productions can share the same left-hand side:
expr -> expr '+' NUM
expr -> NUM
This pair of productions expands to arbitrary sums. Just start with the first production and substitute the first production into it to add another term, or the second to terminate the expansion.
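As a rough sketch of that expansion, the following C routine applies the recursive production a chosen number of times and then finishes with the second production. The function name expand_sum() and its calling convention are invented for this illustration.

```c
#include <string.h>

/*
 * Expand "expr" by applying  expr -> expr '+' NUM  `times` times,
 * then finishing with  expr -> NUM.  The resulting sentence is
 * written to `out`.  Returns 0 on success, -1 if `out` is too small.
 */
int expand_sum(int times, char *out, size_t size)
{
    size_t used;
    int i;

    if (size < sizeof "NUM")
        return -1;
    strcpy(out, "NUM");              /* the closing expr -> NUM */
    used = strlen(out);

    for (i = 0; i < times; i++) {
        /* each use of expr -> expr '+' NUM leaves one " + NUM" behind */
        if (used + strlen(" + NUM") + 1 > size)
            return -1;
        strcpy(out + used, " + NUM");
        used += strlen(" + NUM");
    }
    return 0;
}
```

Calling expand_sum(2, buf, sizeof buf) fills buf with a sum of three terms; expand_sum(0, ...) gives the single NUM.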
B(2). An Example Grammar
This will be fun. If you don't want to mess with abstract symbols, just skip this whole section; the result is all you need. My development of the grammar won't be very theoretical, though.
A parser reads the input symbols from left to right, until it has read the right-hand side of a production. Then it REDUCES the input, running the production backwards and replacing the right-hand side with the left-hand side. This is the point at which code is generated, and expressions evaluated: when a reduction occurs.
So to see if a grammar does what we want, we can start with a test statement that should be accepted by the grammar, and see what happens when we play parser.
The grammar we are shooting for will accept simple algebraic expressions, involving addition, subtraction, multiplication and division of numbers.
B(2)(a). First Try
Earlier, we saw productions that can expand to arbitrary sums:
expr -> expr '+' NUM
expr -> NUM
We can add a few more to get the other three usual operators:
/* 1 */
(1) expr -> expr '+' NUM
(2) expr -> expr '-' NUM
(3) expr -> expr '*' NUM
(4) expr -> expr '/' NUM
(5) expr -> NUM
Let's try a simple test:
     NUM + NUM * NUM
  -> expr + NUM * NUM      (rule 5; remember that we read from the left)
  -> expr * NUM            (rule 1)
  -> expr                  (rule 3)
The addition was the first thing to go; therefore, it was performed first. This is contrary to our expectations, I hope (my past as a math teacher shows itself!). The grammar won't work. We need to make sure that multiplication and division are performed before addition and subtraction.
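To make the problem concrete, here is a small C sketch that plays parser for the first-try grammar, reducing the moment a right-hand side is complete. It is not how a YACC-generated parser is built; because this grammar never needs more than "expr operator NUM" on the stack, the sketch collapses the stack into a running value and a pending operator. The function name eval_first_try() is made up, and numbers are read as decimal for readability.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Evaluate a space-separated expression using only the productions of
 * the first-try grammar.  Returns 0 and stores the result in *result,
 * or -1 if no production fits the input.
 */
int eval_first_try(const char *text, long *result)
{
    char buf[128];
    char *tok;
    long expr = 0;      /* value of the expr currently on the stack */
    int have_expr = 0;  /* is there an expr on the stack yet? */
    char op = 0;        /* operator shifted after the expr, if any */

    if (strlen(text) >= sizeof buf)
        return -1;
    strcpy(buf, text);

    for (tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
        char *end;
        long num = strtol(tok, &end, 10);
        int is_num = (end != tok && *end == '\0');

        if (is_num && !have_expr) {
            expr = num;             /* reduce by rule 5: expr -> NUM */
            have_expr = 1;
        } else if (is_num && op != 0) {
            /* stack holds expr op NUM: reduce by rule 1, 2, 3 or 4 */
            if (op == '+') expr += num;
            else if (op == '-') expr -= num;
            else if (op == '*') expr *= num;
            else if (num == 0) return -1;   /* refuse to divide by zero */
            else expr /= num;
            op = 0;
        } else if (!is_num && have_expr && op == 0
                   && strchr("+-*/", tok[0]) != NULL && tok[1] == '\0') {
            op = tok[0];            /* shift the operator */
        } else {
            return -1;              /* no production fits */
        }
    }
    if (!have_expr || op != 0)
        return -1;
    *result = expr;
    return 0;
}
```

Given "2 + 3 * 4", this routine adds first and then multiplies, which is exactly the misbehavior seen in the trace.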
B(2)(b). Second Try
So let's introduce another non-terminal. We want to cluster products together, to ensure that they are evaluated first, so let's make them a separate non-terminal:
/* 2 */
(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' NUM
(5) term -> term '/' NUM
(6) term -> NUM
Now try the test string again:
     NUM + NUM * NUM
  -> term + NUM * NUM      (rule 6)
  -> expr + NUM * NUM      (rule 3)
  -> expr + term * NUM     (rule 6)
Oops, it looks like we have a choice here: rule 1 or rule 4. We really don't, though; reducing by rule 1 would leave us with expr * NUM, and we don't have any rule with expr * in it.
  -> expr + term           (rule 4)
  -> expr                  (rule 1)
So the multiplication happened first, just like we wanted.
B(2)(c). Third Try
Suppose, however, that we wanted the addition to occur first. Then we need to add some parentheses:
/* 3 */
(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' NUM
(5) term -> term '/' NUM
(6) term -> '(' expr ')'
(7) term -> NUM
Now we treat an addition or subtraction that occurs within parentheses as if the sum or difference was a term (rule 6). And that should do it.
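For readers who would like to see the finished grammar at work before YACC enters the picture, here is a hand-written evaluator for it. Be warned that this is a recursive-descent sketch, a different technique from the table-driven parser YACC produces: each non-terminal becomes a C function, and the left-recursive productions are turned into loops. All the names are invented for the example, and numbers are decimal.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *cursor;   /* next unread character of the input */
static int failed;           /* set when the input does not fit the grammar */

static long parse_expr(void);

static void skip_spaces(void)
{
    while (*cursor == ' ')
        cursor++;
}

/* NUM: an unsigned decimal integer */
static long parse_num(void)
{
    char *end;
    long value;

    skip_spaces();
    if (!isdigit((unsigned char)*cursor)) {
        failed = 1;
        return 0;
    }
    value = strtol(cursor, &end, 10);
    cursor = end;
    return value;
}

/* term -> term '*' NUM | term '/' NUM | '(' expr ')' | NUM */
static long parse_term(void)
{
    long value;

    skip_spaces();
    if (*cursor == '(') {
        cursor++;
        value = parse_expr();
        skip_spaces();
        if (*cursor == ')')
            cursor++;
        else
            failed = 1;
    } else {
        value = parse_num();
    }
    for (;;) {
        char op;
        long rhs;

        skip_spaces();
        op = *cursor;
        if (failed || (op != '*' && op != '/'))
            return value;
        cursor++;
        rhs = parse_num();      /* the grammar allows only NUM here */
        if (op == '*')
            value *= rhs;
        else if (rhs != 0)
            value /= rhs;
        else
            failed = 1;         /* refuse to divide by zero */
    }
}

/* expr -> expr '+' term | expr '-' term | term */
static long parse_expr(void)
{
    long value = parse_term();

    for (;;) {
        char op;
        long rhs;

        skip_spaces();
        op = *cursor;
        if (failed || (op != '+' && op != '-'))
            return value;
        cursor++;
        rhs = parse_term();
        value = (op == '+') ? value + rhs : value - rhs;
    }
}

/* Returns 0 and stores the value, or -1 if the text is not an expr. */
int eval_third_try(const char *text, long *result)
{
    cursor = text;
    failed = 0;
    *result = parse_expr();
    skip_spaces();
    return (failed || *cursor != '\0') ? -1 : 0;
}
```

One thing the sketch makes visible: because rules 4 and 5 put NUM on the right of the operator, the grammar as written accepts "( 2 + 3 ) * 4" but not "2 * ( 3 + 4 )".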
C. YACC: Yet Another Compiler Compiler
YACC is a UNIX tool that builds parsers from grammars. We can take the grammar just developed, supplement it with a minimal bit of C code, and YACC will write the complete parser for us. Or, if the parser is only a small part of a big program, YACC will write that small part, and we can then link it to the main program. In either case, YACC saves us from the unbearable tedium of computing parse tables by hand.
A YACC input file is divided into three parts: the declaration part, the grammar part, and the program part. (Remind you of anything? My past as a COBOL programmer coming out!) The three sections are separated by lines beginning:
%%
YACC writes a C source file as its output; it will also write a file of parser information if you desire it. We'll look at that next time.
C(1). The Declaration Section
The declarations include a set of C-style declarations that are used by YACC to build the parser. For now, there are only two sorts of declarations that concern us.
The first is the %token declaration. YACC will automatically recognize one-character tokens like '+' and '-'. All other terminal tokens must be declared with the %token statement, e.g.,
%token NUM
Non-terminals do not need to be declared; they are implicitly declared in the grammar rules.
The second declaration type lets us pass regular C declarations through YACC to the C compiler. These look like this:
/* 4 */
%{
#define blip 12
%}
Anything between the %{ and the %} is passed through unchanged, and written by YACC to the C source file.
It is customary to include a #define for YYSTYPE. What is YYSTYPE? It is the type of the parser's stack. For now, there's no need to worry about it. It is int by default, and int will work fine for what we're doing this month. Later, after I've discussed how the parser operates, and we know what sort of things go on the stack, we'll come back to it.
C(2). The Grammar Section
The grammar section includes all the grammar productions, like those we discussed earlier. They are written in a somewhat non-standard format (taking ASU's notation as the standard, which I think is reasonable). They also provide a framework for code generation.
C(2)(a). Production Rule Format
The arrow in productions is replaced with a colon, and productions with the same left-hand side are combined into one, with the right-hand sides separated with vertical bars, |. The last right-hand side is terminated with a semicolon. The usual practice is to format the productions like this:
/* 5 */
expr    : expr '+' term
        | expr '-' term
        | term
        ;
term    : term '*' NUM
        | term '/' NUM
        | '(' expr ')'
        | NUM
        ;
in place of:
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' NUM
term -> term '/' NUM
term -> '(' expr ')'
term -> NUM
C(2)(b). Code Generation
You can follow each right-hand side with some pseudo-C code that the parser will then call when the production rule is executed (i.e., if you read that stuff about reductions, when the input is reduced by that production).
Here's an example. Suppose you have the production rules:
expr    : expr '+' term
        | expr '-' term
        | term
        ;
The question is, what is the value of the expr? In the first case, it's the value of the first token of the right-hand side plus the value of the third; in the second, it's the first minus the third; and in the third, it's just the value of the first token. So we write:
/* 6 */
expr    : expr '+' term
            {
                $$ = $1 + $3;
            }
        | expr '-' term
            {
                $$ = $1 - $3;
            }
        | term
            {
                $$ = $1;
            }
        ;
This isn't hard to figure out; $$ is the value of the left-hand side, $1 is the value of the first token of the right-hand side, $3 the value of the third token. Using those symbols, you just write straight C code. You are not limited to a single line, and you can call routines that you have written elsewhere. Don't forget the braces or the semicolons.
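One way to picture what the $ symbols stand for is a stack of values kept alongside the symbols being parsed. The sketch below is my own illustration of that idea, not MACYACC's generated code; the struct value_stack and the routine reduce_binary() are hypothetical names.

```c
/* Hypothetical value stack; `top` indexes the last value pushed. */
struct value_stack {
    int values[32];
    int top;
};

/*
 * Reduce by a three-symbol production such as  expr : expr '+' term.
 * $1 and $3 are the first and third values of the right-hand side;
 * the result ($$) replaces all three on the stack.
 * Any op other than '+' is treated as '-'.
 */
void reduce_binary(struct value_stack *s, char op)
{
    int v1 = s->values[s->top - 2];                 /* $1 */
    int v3 = s->values[s->top];                     /* $3 */
    int result = (op == '+') ? v1 + v3 : v1 - v3;   /* $$ */

    s->top -= 2;                    /* pop the right-hand side ... */
    s->values[s->top] = result;     /* ... and push the left-hand side */
}
```

The middle slot belongs to the operator token; its value ($2) is simply never used by these productions.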
C(3). The Program Section
The program section is made up of straight C code, and is copied unaltered into the C source file. While YACC requires that you supply some routines to it, you can put them in another file, and this section can be completely empty. In simpler cases, however, the program section allows you to write your entire program in the YACC source file.
Here's what YACC requires: a lexical analyzer, called yylex(), and an error routine, yyerror(). Common sense requires a main() routine as well.
The prototype for yylex() is:
/* 7 */
int yylex();
The parser routine, called yyparse(), will call yylex() whenever it needs input. yylex() reads the next token, sets the global variable yylval to the value of the token (optional; this is the $ value used in the production code), and returns the token type. You might wonder where it reads the token from, since it hasn't any arguments. The answer is, it uses global variables of some sort, such as a global string or a global file reference, that are set up by the main() routine.
The error routine is:
/* 8 */
void yyerror(char *message);
The message is generated by the parser when it detects an error; yyerror()'s job is to notify the user that something has gone wrong. Of course, you don't have to live with YACC's default error messages, which are rather unilluminating; you can call yyerror() yourself. More on this in a future article.
D. An MPW Hex Calculator
Now for some real code. Our example program is an MPW tool, a hex calculator. It will evaluate expressions using +, -, * and /, and will also properly evaluate expressions with parentheses in them. The name of the tool is Hex; you can invoke it with an expression, e.g.:
Hex "2 - ( 3 - 4 )"
in which case it will evaluate, print the result, and exit. Notice that in this case, the expression must be in quotation marks, or MPW will treat each token separately. Note also that the tokens must be separated by spaces. This is to simplify the lexical analyzer; we will relax this requirement in a future version.
The tool may also be invoked without an expression to evaluate. It will then go into a loop, reading in expressions and evaluating them, e.g.:
Hex
? 2 - ( 3 - 4 )
= 3
? 64 * 8
= 320
?
The loop ends on a blank line.
Here are the input globals and the code for the main() routine:
/* 9 */
char *input;
char *token;

void main(int argc, char *argv[])
{
    char thestring[256];

    if (argc < 1)
        printf("\tImpossible error!\n");
    else if (argc > 2)
        printf("\tHey!  One at a time!\n");
    else if (argc == 2)
    {
        input = argv[1];
        yyparse();
    }
    else
    {
        printf("? ");
        while (strlen(gets(thestring)) > 2)
        {
            input = &thestring[2];
            yyparse();
            printf("? ");
        }
    }
}
input is the input buffer, and holds the expression to evaluate. token is filled by yylex() with the current token. It's useful for debugging and error reporting. Finally, in the read loop, notice that the gets() routine will read the "? " prompt as well as the expression, this being MPW, which is why input points to thestring[2].
D(1). The Lexical Analyzer and Error Routines
This is about as simple as a lexical analyzer can be. The strtok() routine will return the tokens in the input string, so long as they're separated by spaces, or newline at the end of input. Then if sscanf() can read a hex number, that's what the token must be; if it isn't a number, it must be an operator or a parenthesis, so return the first character.
This routine is so simple it is vulnerable to pathological input -- please don't TRY to break it! Where's the challenge? This is just a stopgap, good enough to serve until we get a real lexical analyzer.
/* 10 */
int yylex()
{
    if (input == 0)
        token = strtok(0, " ");
    else
    {
        token = strtok(input, " ");
        input = 0;
    }
    if (token == 0)
        return('\n');
    else if (sscanf(token, "%x", &yylval) == 1)
        return(NUM);
    else
        return(token[0]);
}
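As an example of the pathological input mentioned above: a token such as 3+4, typed without spaces, is read by sscanf() as the number 3 and the rest is silently dropped. A stricter check is easy to sketch with strtoul(). This is not part of the Hex tool; parse_hex_token() is a hypothetical helper shown only to illustrate the idea.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Return 1 if `tok` is a hex number with nothing left over, storing
 * its value in *value; otherwise return 0, in which case *value is
 * not meaningful.  Unlike sscanf("%x"), trailing junk is rejected.
 */
int parse_hex_token(const char *tok, unsigned long *value)
{
    char *end;

    if (!isxdigit((unsigned char)tok[0]))
        return 0;                   /* rules out "", "+", "-5", " 7" */
    *value = strtoul(tok, &end, 16);
    return *end == '\0';            /* the whole token must be consumed */
}
```

With this helper, "64" is accepted as the value 100 (decimal), while "3+4" and "+" are both rejected as numbers.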
The error routine is even simpler. It just prints out the parser's default error message, which is the blindingly helpful "syntax error". We do add the current token, which may be helpful.
/* 11 */
#define yyerror(x)                           \
    {                                        \
        printf("\t%s [%s]\n", x, token);     \
        return(0);                           \
    }
(This is divided into separate lines to fit in Mac Toot's narrow columns; that's not the way we'd write it in C, of course!) Note the return statement. yyerror is called from within the parser routine yyparse(), so that's what we're returning from. The effect is to abort the translation.
D(2). The Grammar
The grammar we will use is the same as that developed earlier. There's one additional production: we have to give YACC a START SYMBOL, which is the left-hand side of the first production. In this case, "prob" is short for "problem".
prob -> expr '\n'
The newline is a single-character token returned by yylex() to signal the end of input. So if the entire input string is an expression, we've got a complete problem. Any other occurrence of a newline is an error.
/* 12 */
prob -> expr '\n'
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' NUM
term -> term '/' NUM
term -> '(' expr ')'
term -> NUM
D(3). The Value Calculation and Output
Here's the actual YACC input file, declaration and grammar sections. The declaration section consists of a single %token declaration, making NUM a terminal symbol. The grammar section includes all the productions listed above, each with some associated code.
/* 13 */
%token NUM

%%

prob    : expr '\n'
            {
                printf("\t= %X\n", $1);
                return(0);
            }
        ;

expr    : expr '+' term
            {
                $$ = $1 + $3;
            }
        | expr '-' term
            {
                $$ = $1 - $3;
            }
        | term
            {
                $$ = $1;
            }
        ;

term    : term '*' NUM
            {
                $$ = $1 * $3;
            }
        | term '/' NUM
            {
                $$ = $1 / $3;
            }
        | '(' expr ')'
            {
                $$ = $2;
            }
        | NUM
            {
                $$ = $1;
            }
        ;
D(4). The Make File
YACC goes right into your make file, just like any other MPW tool. Hmm. I thought I said that YACC was a UNIX tool... The truth is that there is an MPW version of YACC available, called MACYACC (I have renamed the tool on my system to make typing easier).
#14
Hex.c ƒ Hex.make Hex.y
    Yacc -VHex.out Hex.y
Hex.c.o ƒ Hex.make Hex.c
    C -r Hex.c
Hex ƒ Hex.make Hex.c.o
    Link -w -t MPST -c 'MPS ' ∂
        Hex.c.o ∂
        {Libraries}Interface.o ∂
        {CLibraries}CRuntime.o ∂
        {CLibraries}StdCLib.o ∂
        {CLibraries}CSANELib.o ∂
        {CLibraries}Math.o ∂
        {CLibraries}CInterface.o ∂
        {Libraries}ToolLibs.o ∂
        -o Hex
E. Review of Abraxas MACYACC
MACYACC is a port of Abraxas' PCYACC to the Macintosh. It functions as an MPW tool; you can call it from a make file just like a compiler. Abraxas sells it in two versions: The Personal Version for $139.00 includes the YACC tool itself and a small set of examples; the Professional Version for $395.00 includes that, plus MACLEX (a lexical analyzer generator) and two floppies of big examples, including grammars for C++, Hypertalk, SQL, Pascal, K & R C, ANSI C, and so on.
I won't hide the ball; my recommendation is that you buy the Personal Version and skip the Professional Version.
E(1). Buy the Personal Version
Here's why you should buy the Personal Version: because it works. You can just stick it in your make file and forget it.
The guys at Abraxas are apparently not too attuned to the Mac and MPW, so the port from the PC is pretty brutal. There are a lot of irritating little things, non-standard behavior and so forth. For example, to get a list of the command-line options, you don't type "help yacc"; no, you invoke the tool itself without any options. The tool, when called with arguments, writes out a few lines of copyright information and the name of the grammar file it is processing, contrary to the MPW rule that a compiler should run silently. You can't quiet it by directing standard output to Dev:Null, because it writes error messages to standard output. Abraxas converted the PCYACC manual by simply replacing all occurrences of "PCYACC" with "MACYACC", which provides a lot of laughs. And so on.
The Fully-Worked-Out example requires Quick C to compile. Quick C requires Windows. The example is a graphics program; it requires a CGA monitor to run. Like I said, a lot of laughs...
Somehow, though, I find myself comforted by all that evidence of MACYACC's lowly origin. Why? Because it means that those nice PC programmers have already debugged the thing for us! The internal code changes between the two versions were probably minimal and I doubt if the guts of the program were touched at all. So what has worked on the PC in the past should continue to work on the Mac. (Abraxas tells me that PCYACC has been out for five years, and MACYACC for two. Lotsa time for a shakedown.)
I did some comparisons. I converted the same grammar files with MACYACC and the Unix YACC from the Apollo at work. The parse tables were identical. In fact, I took an example from ASU and submitted it to both YACCs; all three parse tables were identical, except that ASU's had error conditions in some places where the two YACC tables had reductions, which means that the YACC tables would take longer to catch some errors (more on this in a later article).
So I conclude: if you plan to write parsers, MACYACC is a whole lot easier than deriving parse tables by hand. (This is the voice of experience talking! I derived the parse table for my Idealiner parser by hand.)
E(2). Complaints with MACYACC
Now here are some complaints:
(1) The -t and -T command line options are very badly explained; I had to get Abraxas to tell me just what they are supposed to do. They cause the YACC-generated parser to write a special dump file. Unfortunately, this file is created untyped; you have to set its type to TEXT before you can even read it. And then it's not too useful (see next month's installment).
(2) MPW 2.0.2 is included on the disk. Or is it? Only the MPW Shell is there! This is totally weird. No tools, no startup script, nothing but the shell. C'mon, guys, all or nothing! I don't know what you can accomplish with just the shell. (Of course, people likely to be interested in Yacc are also likely to have MPW already.)
(3) The manual isn't the worst I've ever seen but you'll do a lot better if you don't rely on it as a tutorial. If you can get a hold of the YACC chapter from a UNIX system manual, use that instead. It's a lot clearer. With luck, this series of articles will repair some of the deficiencies of the manual.
One problem with YACC is that it came out of UNIX, and so is filled with obscure identifiers. What, for example, are yyval and yylval? I was not able to find the answer in the MACYACC manual. (yylval is a global used to return the value of the token read by the function yylex() -- you gotta know that! yyval, I'm pretty sure, is the value of the left-hand side of the last reduction.)
When TML came out with their MPW Pascal, they included four floppies (with the complete MPW system) and an inch of documents in a custom cardboard box, and priced the thing at $125.00. That means that they probably got about $70.00 from dealers. Abraxas, as far as I know, sells only directly to customers. For twice the price, they should be able to come up with a comparable package; most importantly, they should write a useful manual, and get a Mac programmer to do a real port.
On the other hand, they haven't many Mac customers, and so they haven't felt inclined to spend much effort on the Mac product. Or is it the other way around?
E(3). Skip the Professional Version
For an extra $256.00, you can get the Professional Version: the Personal Version, plus two disks of example grammars and a third with MACLEX. Is it worth the money? No.
These three disks come with no written manual. Most examples have sketchy read-me files instead. Not sufficient. You are on your own if you want to use them.
Abraxas itself admits that Lex is useless; they recommend building lexical analyzers by hand, as you can do a better job that way. This is, apparently, the prevailing wisdom; I found the same advice in a UNIX book.
As for the example grammars: These are direct ports from the PC. Many of the source files do not even have Macintosh file types! A few examples present the grammar only; most include enough code to get a syntax checker going but, of course, omit code generation. So the grammars are what counts. But how much is a grammar description of an existing language really worth?
Consider: a few years ago, Apple put out a wall poster with the complete syntax of Pascal. (If anyone has one of those posters and is tired of it, I'd be delighted to take it off his hands...) You could take that poster and write a grammar description in about ten minutes. So why should you pay $256.00?
Well, Abraxas also includes grammars to languages that are not so well-documented. You won't find Hypertalk on any poster, for example, and if you want to see its grammar, Abraxas presents it for you. Whether that is worth $256.00 is a judgment call; in my judgment, it is not.
I think that if Abraxas cleaned up the code (i.e., did a complete port) and wrote a manual, they could justify selling the Professional Version for, say, $195.00. Without a manual, and with code designed for the PC, $395.00 is way out of line.
Parse Table Correction
By Kirk Chase, Editor, MacTutor
Unfortunately, Clifford Story's second part in his Language Translation Series was missing the parse table. Here it is for you now: