Lexical Analysis
Volume Number: 6
Issue Number: 5
Column Tag: Language Translation
Ambiguities and Lexical Analysis
By Clifford Story, Mount Prospect, IL
Part III. Ambiguities, Etc.
A. Introduction
This is the third part in my series on Language Translation. Language translation has three phases: lexical analysis, parsing, and code generation. The first two parts have dealt with building parsers using YACC; this is the final installment on that subject.
That will take up only about half the article. The remainder will begin the new topic of lexical analysis by presenting a skeleton filter tool.
B. Parsing
I now conclude parsing with a couple of miscellaneous topics. The first is a simple way to handle ambiguities in a grammar. Then comes the horrible question of error detection and reporting.
B(1). Ambiguities
Parser ambiguities, you may recall, are spots in the parse table where YACC's generation algorithm would place both a shift and a reduction, or two reductions. A naive grammar would generate an ambiguity when considering the input string
2 + 3 * 4
since it isn't clear whether this means (2 + 3) * 4 or 2 + (3 * 4).
We faced this problem in the first part and solved it by re-writing the grammar to eliminate the ambiguity. YACC offers another way of handling ambiguities (also known as shift/reduce and reduce/reduce conflicts).
B(1)(a). YACC's Default Rule
First, something you should know: YACC will work through any conflicts that it finds and produce a non-ambiguous grammar for you. It resolves conflicts automatically, and will only issue warnings, not errors.
Unfortunately, it does this by making assumptions about what you intended, and these assumptions need not be correct. You can get a grammar that simply does not do what you meant it to do, despite the lack of error messages. In my opinion, this is a design flaw in YACC; conflicts should be fatal errors.
Even worse, MACYACC (see the first part for a review) uses an inordinately complicated rule for resolving conflicts. Unix YACC uses a very simple rule, so you can guess what it's going to do. With MACYACC, no such luck.
Therefore, YOU SHOULD ALWAYS REACT TO CONFLICT WARNINGS AS IF THEY WERE FATAL ERRORS!
B(1)(b). Operator Precedence
Remember, conflicts arise when the parser doesn't know which operation to perform first, as in the "2 + 3 * 4" example. We intelligent humans know that multiplication comes before addition: multiplication has a higher precedence than addition. The conflict is easily and correctly resolved if YACC knows the order of precedence among all operators in the grammar.
All that need be done is tell YACC:
/* 1 */
%left '+' '-'
%left '*' '/'
Multiplication and division have the same precedence (because they are on the same line), and higher precedence than addition and subtraction (because their line comes after addition and subtraction's line). That solves the problem (read on for an explanation of the "left" business).
But what about strings like
2 - 3 + 4
Is that 2 - (3 + 4) or (2 - 3) + 4? The above rules don't seem to say, since subtraction and addition have the same precedence.
B(1)(c). What is Associativity?
Good question. In a string of the form
... op ID op ...
where op is an operator and ID an identifier, the op to the left of the ID will have a higher precedence than the op to the right if op is left-associative; and conversely. Thus,
2 - 3 - 4
is conventionally -5 because subtraction is left-associative. If it were right-associative, 2 - 3 - 4 would equal 2 - (3 - 4), or 3. Just about everything is left-associative.
So associativity defines order of evaluation among operators on the same level of operator precedence. And the line
%left '+' '-'
says that addition and subtraction are left-associative among one another.
YACC also includes the keywords %right, which means just what you might think, and %nonassoc, which means constructs of the form
... op ID op ...
are illegal (think of logical operators).
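To make the difference between the two groupings concrete, here is a small C sketch that evaluates a chain of subtractions both ways. It has nothing to do with YACC's internals; the function names subtract_left and subtract_right are made up for this illustration.

```c
#include <assert.h>

/* Evaluate v[0] - v[1] - ... - v[n-1] grouping from the left:
   ((v[0] - v[1]) - v[2]) - ...   This is the grouping %left asks for. */
int subtract_left(const int *v, int n)
{
    int result, i;

    assert(n > 0);
    result = v[0];
    for (i = 1; i < n; i++)
        result = result - v[i];
    return result;
}

/* The same chain grouped from the right:
   v[0] - (v[1] - (v[2] - ...))   This is the grouping %right would give. */
int subtract_right(const int *v, int n)
{
    int result, i;

    assert(n > 0);
    result = v[n - 1];
    for (i = n - 2; i >= 0; i--)
        result = v[i] - result;
    return result;
}
```

Called on the three values 2, 3, 4, the first routine yields -5 and the second yields 3, matching the 2 - 3 - 4 discussion above.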
B(1)(d). The Lonely Minus...
There's one more thing to worry about: unary operators. The classic example is the minus sign. Do you remember my asking why a hex calculator was easier to write than a decimal calculator? The answer is that a hex calculator doesn't have to deal with unary minus...
The problem arises because the minus sign is used for both unary and binary operators (negation and subtraction). When we assign it a precedence in the %left statement, we're thinking about subtraction, so we give it a lower precedence than multiplication and division. But negation should have the same precedence as multiplication and division.
The solution is to give a grammar rule a precedence. This can be done with the %prec keyword:
expr : '-' expr %prec '*'
gives this rule the same precedence as multiplication. Thus, unary minus will have the right precedence, and everything is at last conflict-free.
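For readers who want to see precedence and associativity at work outside of YACC, the following is a hand-written sketch of a different technique, precedence climbing. It is not what YACC generates. The level numbers are my own assumption, chosen to mirror the three %left lines of the example that follows: binary + and - lowest, * and / next, unary minus highest. All names (evaluate, parse_expr, and so on) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

static const char *cursor;      /* next unread character of the input */

static void skip_blanks(void)
{
    while (*cursor == ' ')
        cursor++;
}

/* Binary operator levels; a higher number binds tighter.
   0 means "not a binary operator". */
static int precedence(char op)
{
    switch (op)
    {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static int parse_expr(int min_prec);

/* A primary is a number, a parenthesized expression, or a unary minus.
   Unary minus parses its operand at level 3, above '*' and '/'. */
static int parse_primary(void)
{
    int value = 0;

    skip_blanks();
    if (*cursor == '-')
    {
        cursor++;
        return -parse_expr(3);
    }
    if (*cursor == '(')
    {
        cursor++;
        value = parse_expr(1);
        skip_blanks();
        if (*cursor == ')')
            cursor++;
        return value;
    }
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* Consume operators whose level is at least min_prec. Parsing the right
   operand at prec + 1 is what makes each level left-associative. */
static int parse_expr(int min_prec)
{
    int left = parse_primary();

    for (;;)
    {
        char op;
        int prec, right;

        skip_blanks();
        op = *cursor;
        prec = precedence(op);
        if (prec == 0 || prec < min_prec)
            return left;
        cursor++;
        right = parse_expr(prec + 1);
        switch (op)
        {
        case '+': left += right; break;
        case '-': left -= right; break;
        case '*': left *= right; break;
        default:
            assert(right != 0);   /* this sketch does not handle x / 0 */
            left /= right;
            break;
        }
    }
}

/* Evaluate a whole expression held in a C string. */
int evaluate(const char *text)
{
    cursor = text;
    return parse_expr(1);
}
```

With this in place, evaluate("2 + 3 * 4") gives 14 and evaluate("-2 * 3") gives -6. The sketch does no error checking at all; it only shows how the precedence table drives the grouping.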
B(1)(e). An Example
%token NUM
%left '+' '-'
%left '*' '/'
%left MINUS
%%
prob : expr '\n'
{
printf("\t= %d\n", $1);
return(0);
}
;
expr : expr '+' expr
{
$$ = $1 + $3;
}
| expr '-' expr
{
$$ = $1 - $3;
}
| expr '*' expr
{
$$ = $1 * $3;
}
| expr '/' expr
{
$$ = $1 / $3;
}
| '-' expr %prec MINUS
{
$$ = - $2;
}
| '(' expr ')'
{
$$ = $2;
}
| NUM
{
$$ = $1;
}
;
%%
/***************************************/
#include <stdio.h>
#include <ctype.h>
#include <string.h>
/***************************************/
char *input;
char *token;
/***************************************/
#define yyerror(x) \
{ \
printf("\t%s [%s]\n", x, token); \
return(0); \
}
int yyparse();
int yylex();
/***************************************/
void main(int argc, char *argv[])
{
char thestring[256];
if (argc < 1)
printf("\tImpossible error!\n");
else if (argc > 2)
printf("\tHey! One at a time!\n");
else if (argc == 2)
{
input = argv[1];
yyparse();
}
else
{
printf("? ");
while (strlen(gets(thestring)) > 2)
{
input = &thestring[2];
yyparse();
printf("? ");
}
}
}
/***************************************/
int yylex()
{
token = strtok(input, " ");
input = 0;
if (token == 0)
return('\n');
else if (sscanf(token, "%d", &yylval) == 1)
return(NUM);
else
return(token[0]);
}
/***************************************/
B(2). Catching Errors
You may have noticed that Pascal compilers write real error messages, while C compilers write error messages more cryptic than C itself. There's a reason for this: Pascal compilers typically use recursive-descent parsers, while C compilers typically use table-driven parsers. Recursive-descent parsers are the sort of parsers a person might naturally write: get the next token; if it's a "+", do this, else get the next token and do that, and so on. A location in the code of the parser corresponds to a particular grammar structure in the input string, so it's easy to insert appropriate error messages. Table-driven parsers make things more difficult. Ever seen an error message to the effect of "Need an lval"? Terseness like that is a sign of the parser, not the language. (The "lval" there is an lvalue, something that can stand on the left of an assignment; despite the resemblance, it is not related to the YACC global yylval.)
So writing meaningful error messages in a YACC-generated parser is a real problem.
B(2)(a). Semantic Errors
Semantic errors are easy. These are illegal operations, like division by zero, and other violations of data types (overflow, writing past the end of an array, and so forth).
The decimal calculator provides the opportunity to divide by zero. Let's catch this error, and issue an error message instead. Division occurs in only one grammar rule:
/* 3 */
| expr '/' expr
{
$$ = $1 / $3;
}
Just insert an operand check in the generated code:
/* 4 */
| expr '/' expr
{
if ($3 != 0)
$$ = $1 / $3;
else
{
printf("Divide by zero!\n");
return(0);
}
}
I'm using a cheap trick here; the return(0) means abort. This is OK, because the calculator evaluates one expression at a time, and an error should cause an abort. But if this were a compiler, detecting a single error should not kill the compile; I want to know about ALL the errors.
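One way to keep going after a semantic error, sketched here under my own assumptions, is to report the problem, count it, and substitute a harmless placeholder value so that evaluation can continue. The names record_error, checked_divide, and errors_seen are hypothetical, and the placeholder value of zero is arbitrary.

```c
#include <assert.h>
#include <stdio.h>

static int error_count = 0;   /* hypothetical tally; not a YACC variable */

/* Report one error and remember that it happened. */
void record_error(const char *message)
{
    assert(message != NULL);
    fprintf(stderr, "\t%s\n", message);
    error_count++;
}

/* Divide, but on a zero divisor report the error and return a
   placeholder instead of aborting. */
int checked_divide(int numerator, int denominator)
{
    if (denominator == 0)
    {
        record_error("Divide by zero!");
        return 0;
    }
    return numerator / denominator;
}

/* The caller checks this once at the end to decide whether the
   result can be trusted. */
int errors_seen(void)
{
    return error_count;
}
```

In the grammar, the division action would then shrink to a single assignment from checked_divide($1, $3), and the top-level rule would consult errors_seen() before printing a result.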
B(2)(b). Illegal Characters
The language's character set has nothing to do with the grammar, and hence nothing to do with the parser. It is entirely in the control of the lexical analyzer. And, of course, it's easy for the lexical analyzer to catch illegal characters. But how can it report them?
Here's my solution. First, I declare a new token at the top of the input file:
/* 5 */
%token ILLEGAL
None of the grammar rules use this token, so should the lexical analyzer return it, the parser must sense an error.
Now, to get an error message out of this! I'm going to continue to declare the yyerror routine as a macro, so it can use the parser's local variables and also abort the parser. Then I'm going to create a new error routine that yyerror will call, passing a couple of those interesting locals. The two I want are tmpstate, the current state of the parser, and pcyytoken, the type of the last token returned by yylex(). The declarations look like this:
/* 6 */
#define yyerror(x) \
{ \
errordisplay(tmpstate, pcyytoken); \
return(0); \
}
void errordisplay(int state,
int tokentype);
Then, in the errordisplay() routine, if tokentype equals ILLEGAL, I print an appropriate message.
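On the lexical analyzer's side, the check might look something like the sketch below. The token numbers are stand-ins that I have assumed for illustration (the real ones come from the file YACC generates), and classify is a hypothetical helper, not part of the calculator as published.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-ins for the numbers YACC would assign to the %token names;
   the actual values are an assumption of this sketch. */
enum { NUM = 257, ILLEGAL = 258 };

/* Classify one blank-delimited token. Anything outside the
   calculator's character set is reported as ILLEGAL. */
int classify(const char *token)
{
    assert(token != NULL);
    if (isdigit((unsigned char)token[0]))
        return NUM;
    /* strchr would also match the terminating '\0', so rule out
       the empty string first */
    if (token[0] != '\0' && strchr("+-*/()", token[0]) != NULL)
        return token[0];          /* operators stand for themselves */
    return ILLEGAL;               /* no grammar rule mentions this */
}
```

Because no rule uses ILLEGAL, returning it forces the parser into its error path, where errordisplay() can recognize the token type.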
B(2)(c). Syntactic Errors
Errors of syntax are the hard ones (and my development of this topic has not been helped by my outliner's just destroying my first approach to it. Especially since I wrote the outliner... A word of advice to aspiring Mac programmers: do not neglect the grow zone routine.) The problem is that these errors are detected by a table that we didn't write, instead of nice readable code that we did.
But we can still zero in on the specific error. We begin with the .out file. If there's an error in state 0, we know why: the parser was expecting a number, a "-", or a "(", but it got something else instead. So instead of saying just "syntax error", we say "Expecting a number, '-' or '('!". We can do the same for each state, and write errordisplay() as follows:
/* 7 */
void errordisplay(int state,
int tokentype)
{
if (tokentype == ILLEGAL)
printf("\tIllegal character!\n");
else switch (state)
{
case 0:
case 3:
case 4:
case 7:
case 8:
case 9:
case 10:
printf("\tExpecting a number, '-' or '('!\n");
break;
case 2:
printf("\tExpecting an operator or end of input!\n");
break;
case 12:
printf("\tExpecting an operator or ')'!\n");
break;
case 13:
case 14:
printf("\tExpecting a '*' or '/'!\n");
break;
default:
printf("\tImpossible error! State = %d, token = %d\n",
state, tokentype);
break;
}
}
All right, but we can do better than that. For example, since the parser is in state 0 only when it is reading the first token, an error in state zero means that the token is wrong for the beginning of input, so we might write something more to the point, like "An expression must begin with a number, '-' or '('!". Similarly, we can look closer at the other states and get a better idea of just what is going on in each. We can use the token type to further focus on the error. And so on.
And we can also keep track of just where in the input we are, so we can point to the location of the error:
? 2 - 3 4
        ^
Expecting an operator or end of input!
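Producing that pointer line is mostly a matter of remembering the column at which the offending token began. Here is a minimal sketch, assuming the lexical analyzer has recorded that column (counting from 0 and including the width of the prompt); caret_line and show_error are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a marker line: `column` blanks followed by '^'.
   Returns 0 on success, -1 if the result would not fit in `out`. */
int caret_line(int column, char *out, size_t outsize)
{
    assert(out != NULL);
    if (column < 0 || (size_t)column + 2 > outsize)
        return -1;
    memset(out, ' ', (size_t)column);
    out[column] = '^';
    out[column + 1] = '\0';
    return 0;
}

/* Echo the input line, point at the error, then print the message. */
void show_error(const char *line, int column, const char *message)
{
    char marker[256];

    printf("%s\n", line);
    if (caret_line(column, marker, sizeof marker) == 0)
        printf("%s\n", marker);
    printf("%s\n", message);
}
```

For the display above, the call would be show_error("? 2 - 3 4", 8, "Expecting an operator or end of input!"), since the stray 4 sits in column 8.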
B(3). Last Words on YACC
And that's it for YACC. Which is not to say that nothing in the remainder of the series will rely on YACC; on the contrary. But I will assume that my audience is now familiar with the tool and introduce grammar descriptions and such without apology.
The next topic is lexical analysis. And since I've got some space left this month, I'll launch into it with a skeleton filter program.
C. Lexical Analysis
Next time, I will move on to lexical analysis, and replace my calculator example with a file filter. Filter programs are somewhat unusual in the Macintosh world, so perhaps a definition is appropriate: a filter program is one that reads a file, massages it in some way, and writes the result. Such programs are common under Unix, where simple programs can be strung together in batch files with IO redirection and piping to create much more powerful utilities. For an example, see the discussion of Steve Johnson's spell utility on page 139 of Jon Bentley's Programming Pearls.
The calculator example I've used so far is not a filter, since it works from direct user input on the command line. It looks like the rest of this series will use filters, however; first with the lexical analysis examples, and then with the inline assembler. What I'm going to do now is develop a basic identity filter, to settle some issues once and for all, so I can then ignore them and concentrate on language translation.
C(1). Command Line
The first problem is reading the command line. I want the tool to read either one or more files named on the command line, or standard input if there aren't any named input files. I want it to write to a named output file, or to standard output if none is named. And I want to be able to set a language type (for reasons that won't become clear until next month) with command line options.
Recall that MPW passes the command line as an array of strings. The first string, argv[0], is the name of the tool, and the rest are the individual arguments.
C(1)(a). Input Files
Input files are specified on the command line by name alone, with no special flags. If a name appears unaccompanied by any flag, it is by default an input file. The tool can read arbitrarily many input files; if none are specified, then it reads standard input (which can come from IO redirection).
So I'm going to have an integer variable called input which I will initialize to the standard input unit. Then I'll walk through the argument list, and if I find an input file, I will open it (using the input variable for its unit), append it to my input buffer, and close it. If, after reading the entire command line, input is still equal to standard input, then I know that no input files were named, and so I'll read standard input into the buffer.
C(1)(b). Options
Options are command arguments that begin with a hyphen (this doesn't have to be so; I have written a tool with an almost natural-language command line, but -options are customary and easy to parse).
There are two kinds of options: the output file, and true options (the name of the output file isn't a true option, of course, but it's specified with option syntax).
C(1)(b)(i). Output File
The tool can write one output file, or write to standard output. A named output file is specified with a -o option followed by the name of the file.
As with input, I have an integer variable called output, initially set to the standard output unit. If an output file is named on the command line, then I open it, and set output to its unit number. If output isn't standard output, then I know an output file is already open, so I print a warning and ignore the new file.
C(1)(b)(ii). Language
I won't be using the language option this month but I might as well get it in here anyway. The language can be either Pascal or C. Pascal is the default; it is reset by the first named input file with either a .p or .c extension, and this can be overridden with either a -p or -c command option.
First, I'll declare a special type, codetype:
/* 8 */
typedef enum
{
nocode,
pascalcode,
ccode
} codetype;
and a variable, language of type codetype, initially nocode. This indicates that the language has not been set.
As I walk through the command line, if I find an input file, language is still nocode, and the file name ends in either .p or .c, then I'll set language accordingly. Thus, only the first such file can set the language.
If, on the other hand, I find a -p or -c option, I will set language accordingly, regardless of any previous setting. The options override filename conventions. (I don't check for multiple options; the last one controls.)
Finally, if language is still nocode after reading the entire command line, then I set it to the default, Pascal.
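Pulled out of the command-line loop, the three rules can be sketched as one small routine. This is only an illustration: choose_language and language_from_name are hypothetical names, and the sketch skips the argument after -o on the assumption that it names the output file.

```c
#include <assert.h>
#include <string.h>

typedef enum { nocode, pascalcode, ccode } codetype;

/* Return the language implied by a file name's last two characters,
   or nocode if they are neither ".p" nor ".c". */
static codetype language_from_name(const char *name)
{
    size_t length = strlen(name);

    if (length < 2)
        return nocode;
    if (strcmp(name + length - 2, ".p") == 0)
        return pascalcode;
    if (strcmp(name + length - 2, ".c") == 0)
        return ccode;
    return nocode;
}

/* Walk the command line once. An option always sets the language;
   a file name sets it only while it is still nocode; Pascal is the
   fallback when nothing set it. */
codetype choose_language(int argc, char *argv[])
{
    codetype language = nocode;
    int index;

    assert(argc >= 1);
    for (index = 1; index < argc; index++)
    {
        if (argv[index][0] == '-')
        {
            switch (argv[index][1])
            {
            case 'P': case 'p': language = pascalcode; break;
            case 'C': case 'c': language = ccode; break;
            case 'O': case 'o': index++; break;  /* skip output file name */
            default: break;
            }
        }
        else if (language == nocode)
            language = language_from_name(argv[index]);
    }
    return (language == nocode) ? pascalcode : language;
}
```

So a command line naming foo.c alone selects C, while foo.c followed by -p selects Pascal, because the option overrides the file name.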
C(2). IO Buffering
In the interests of speed, I'll buffer both input and output. If you don't think this makes a difference, just re-write the tool without buffering!
Input buffering is easy: I just read input, in 1K chunks, into a single buffer, which I can re-size as necessary to accommodate the amount to read. The MPW interface doesn't provide any way to get the file size before reading it (not surprising, I guess; what's the size of standard input?).
Output buffering is a bit more complex. This is an identity filter, so I just copy the input without modification to the output buffer. When the output buffer fills up, I write 1K of it to the output file, and shift what's left to the front of the buffer.
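The flush-and-shift idea can be tried out in portable C before committing to the MPW version. The sketch below is an assumption-laden stand-in: it uses memmove in place of BlockMove, and a function pointer called sink in place of write(), so that it can be exercised without a file. All of the names here (outbuf, put_bytes, finish, counting_sink) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NOMINAL_SIZE 1024    /* flush threshold */
#define TRUE_SIZE    1200    /* real capacity, leaving some slack */

/* A sink takes `count` bytes and returns how many it accepted, or a
   negative number on error. It stands in for write(). */
typedef int (*sink_fn)(const char *data, int count);

typedef struct
{
    char    data[TRUE_SIZE];
    int     used;
    sink_fn sink;
} outbuf;

/* Write the first `count` bytes and slide the remainder to the front.
   Returns the new fill level, or a negative error flag. */
static int flush_bytes(outbuf *buffer, int count)
{
    int written = buffer->sink(buffer->data, count);

    if (written < 0)
        return written;
    buffer->used -= written;
    memmove(buffer->data, buffer->data + written, (size_t)buffer->used);
    return buffer->used;
}

/* Append bytes one at a time, flushing 1K whenever the nominal size
   is reached. Returns 0 on success, -1 on error or overflow. */
int put_bytes(outbuf *buffer, const char *bytes, int count)
{
    int i;

    assert(buffer != NULL && buffer->sink != NULL);
    for (i = 0; i < count; i++)
    {
        if (buffer->used >= TRUE_SIZE)
            return -1;
        buffer->data[buffer->used++] = bytes[i];
        if (buffer->used >= NOMINAL_SIZE
            && flush_bytes(buffer, NOMINAL_SIZE) < 0)
            return -1;
    }
    return 0;
}

/* Write whatever remains. Returns the number of bytes still
   unwritten (0 when all went out), or a negative error flag. */
int finish(outbuf *buffer)
{
    return flush_bytes(buffer, buffer->used);
}

/* Demonstration sink: accepts everything and just counts bytes. */
int bytes_sunk = 0;

int counting_sink(const char *data, int count)
{
    (void)data;
    bytes_sunk += count;
    return count;
}
```

The slack between the nominal and true sizes matters little for an identity filter, which adds one character at a time, but it leaves room for a filter that emits several characters before the next fullness check.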
C(3). The Program
Here's the code. The file "managers" is a load file containing all the C include files; this makes compilation of the program faster.
/* 9 */
// Filter.c - Skeleton MPW filter tool
#pragma load "managers"
// Constants and Macros
#define nil 0
#define stdinfd 0
#define stdoutfd 1
#define stderrfd 2
#define stdunit(x) (((x) >= stdinfd) && ((x) <= stderrfd))
#define notstdunit(x) ((x) > stderrfd)
#define nombuffsize 1024
#define truebuffsize 1200
// Types
typedef enum {false, true} logical;
typedef enum
{
nocode,
pascalcode,
ccode
} codetype;
// Prototypes
void initmac();
int openoutput(char *thename, int output);
int readinput(int input, Handle inbuffer, int buffersize);
int filter(char *inbuffer, int buffersize, int output,
codetype language);
int writeoutput(int output, char *outbuffer, int buffersize);
// main
// ----
// the main routine reads and
// interprets the command line,
// concatenates input files into an
// input buffer, opens the output
// file, and calls the filter
// routine to write the output.
int main(int argc, char *argv[])
{
int index;
int input;
int output;
codetype language;
Handle inbuffer;
int buffersize;
char *thetail;
initmac();
// input is the fd of the input file, initially stdin
// output is the fd of the output file, initially stdout
// language is the language to parse, initially unknown
input = stdinfd;
output = stdoutfd;
language = nocode;
// inbuffer is the input buffer, initially empty but able to grow
// buffersize is the size of inbuffer
inbuffer = NewHandle(0);
buffersize = 0;
// command line interpreter: loop through command options
for (index = 1; index < argc; index++)
{
if (argv[index][0] == '-')
{
switch (argv[index][1])
{
// -p and -c options set language
// type; these override any previous setting
case 'P':
case 'p':
language = pascalcode;
break;
case 'C':
case 'c':
language = ccode;
break;
// -o names the output file; make sure a name actually follows it
case 'O':
case 'o':
if (++index >= argc)
{
fprintf(stderr, "Error - Missing output file name!\n");
exit(2);
}
output = openoutput(argv[index], output);
if (output < 0)
{
fprintf(stderr, "Error - Unable to open output file %s!\n",
argv[index]);
exit(2);
}
break;
default:
fprintf(stderr, "Error - Unknown option %s\n",
argv[index]);
exit(2);
break;
}
}
else
{
// if language has not changed since
// initialization, set language
// according to file name (the first
// input file thus determines language type)
if (language == nocode && strlen(argv[index]) >= 2)
{
thetail = argv[index] + strlen(argv[index]) - 2;
if (strcmp(thetail, ".p") == 0)
language = pascalcode;
else if (strcmp(thetail, ".c") == 0)
language = ccode;
}
// open the input file (after this step,
// input will NOT contain a standard
// unit number) and read it into the input buffer
input = open(argv[index], O_RDONLY);
if (input < 0)
{
fprintf(stderr, "Error - Unable to open input file %s!\n",
argv[index]);
exit(2);
}
buffersize = readinput(input, inbuffer, buffersize);
if (buffersize < 0)
{
fprintf(stderr, "Error - Reading from %s!\n",
argv[index]);
exit(2);
}
close(input);
}
}
// if input is still a standard unit
// number, then no input file was
// opened, and input must be from standard input
if (stdunit(input))
{
buffersize = readinput(input,
inbuffer, buffersize);
if (buffersize < 0)
{
fprintf(stderr, "Error - Reading from standard input!\n");
exit(2);
}
}
// if language is still unknown, set it to Pascal
if (language == nocode)
language = pascalcode;
// the routine filter does the real work of the program
HLock(inbuffer);
filter(*inbuffer, buffersize, output, language);
HUnlock(inbuffer);
// wrapup: close output first if the program opened it
DisposHandle(inbuffer);
if (notstdunit(output))
close(output);
exit(0);
}
// initmac
// ------
// initialize any necessary managers and whatnot.
void initmac()
{
InitGraf((Ptr)&qd.thePort);
SetFScaleDisable(true);
InitCursorCtl(nil);
}
// openoutput
// ----------
// open the output file. returns the
// fd or, if an error occurs, the
// error flag.
int openoutput(char *thename, int output)
{
FInfo theinfo;
// if output is not a standard unit,
// then an output file must already be open
if (notstdunit(output))
{
fprintf(stderr, "Warning - additional output file %s ignored!\n",
thename);
return(output);
}
// open the output file for writing
// (O_WRONLY), creating it if
// necessary (O_CREAT) and
// zeroing it otherwise (O_TRUNC)
output = open(thename, O_WRONLY + O_CREAT + O_TRUNC);
if (output < 0)
return(output);
// if the file was created by open, it
// will be untyped, so set the type to 'TEXT' and the creator to 'MPS '
if (getfinfo(thename, 0, &theinfo))
{
fprintf(stderr, "Warning - unable to get info for output file %s!\n",
thename);
return(output);
}
theinfo.fdType = 'TEXT';
theinfo.fdCreator = 'MPS ';
if (setfinfo(thename, 0, &theinfo))
fprintf(stderr, "Warning - unable to set info for output file %s!\n",
thename);
return(output);
}
// readinput
// --------
// this routine appends an input file
// to the input buffer and returns
// the new size of the buffer or, if
// a read error occurs, the error flag.
int readinput(int input, Handle inbuffer, int buffersize)
{
int readsize;
SetHandleSize(inbuffer, buffersize + 1024);
HLock(inbuffer);
while ((readsize = read(input,
*inbuffer + buffersize, 1024)) > 0)
{
buffersize += readsize;
HUnlock(inbuffer);
SetHandleSize(inbuffer, buffersize + 1024);
HLock(inbuffer);
}
// unlock the buffer before returning, even after a read error
HUnlock(inbuffer);
if (readsize < 0)
return(readsize);
SetHandleSize(inbuffer, buffersize + 1024);
return(buffersize);
}
// filter
// ------
// this routine does the main work of
// the program, which in this case
// consists of simply writing the
// input buffer to the output file.
int filter(char *inbuffer, int buffersize, int output,
codetype language)
{
#pragma unused(language)
int inposition;
int outposition;
char outbuffer[truebuffsize];
unsigned char thechar;
int writesize;
// inposition keeps track of the current position in the
// input buffer, initially at the beginning
// outposition keeps track of the current position in the
// output buffer, initially at the beginning
inposition = 0;
outposition = 0;
while (inposition < buffersize)
{
// copy input to the output buffer, one character at a time
thechar = *(inbuffer + inposition++);
outbuffer[outposition++] = thechar;
// when the output buffer fills up, write it to output
if (outposition >= nombuffsize)
outposition = writeoutput(
output, outbuffer,
outposition);
if (outposition < 0)
return(outposition);
}
// write whatever is left in the buffer
// directly to output
writesize = write(output, outbuffer, outposition);
return(writesize);
}
// writeoutput
// ----------
// this routine flushes the output
// buffer by writing it to the output
// file. It returns the new size of
// the buffer or, if a write error
// occurs, the error flag.
int writeoutput(int output, char *outbuffer, int buffersize)
{
int writesize;
writesize = write(output, outbuffer, nombuffsize);
if (writesize < 0)
return(writesize);
buffersize -= writesize;
BlockMove(outbuffer + writesize, outbuffer, buffersize);
return(buffersize);
}
D. Conclusion
Next time I'll expand the Filter tool by adding a state machine to control the transfer of data from the input buffer to the output buffer. I'll assume you've got the code just above, and won't repeat it, except for the filter routine. So hang on to this issue of MacTutor!