2008 Oct 18
The parser generator bison parses only LALR(1) context free languages, which means it must be possible to tell how to parse any portion of an input string to the parser with a just a single token of lookahead (a restriction which is LR(1), left recursive) {see sec 5.7 p64}.
Language design and parser generation has the following steps.
The input file to bison is organized as follows.
%{ C Declarations %} Bison Declarations %% Grammar Rules %% Additional C code
The C Declarations may define types and variables that used in the action. This section may also use C preprocessor #define or #include statements. The contents of this section are copied to the beginning of the output parser file (before yyparse.
The Bison Declarations define the names of the terminal and nonterminal symbols, and can also specify operator precedence and the data types of the symbols. In simple grammars the default declarations may be enough. Each terminal symbol that is not a single character constant or literal must be declared here. If ther are multiple declarations then the first one has the lowest precedence and the last has the highest precedence.
The Grammar Rules section contains only one or more grammar rules. There must be at least one rule.
The Additional C code can contain any C code needed (e.g. the definition of the lexcial analyzer yylex, plus functions called by rule action. Even main can be defined here. This section is copied to the end of the parser file, after yyparse. Be sure to avoid variable or function names starting with yy or YY.
bison and yacc input files usually have the extension .y. If the input is rpcalc.y the output file name is rpcalc.tab.c.
A bison rule defines a token, or symbol, each of which has a type and a value. A token can be terminal, i.e. made up of characters from the base alphabet, or nonterminal, i.e. composed of other tokens. Internally a token is represented by a unique number that the parser assigns to it. This number code is what yylex returns. In the grammar rules only the symbolic name is used; you never need to know the actual code value.
By convention symbol names are lower case and must start with a letter or underscore or period. Subsequent characters in the name can also be digits. Periods cannot be used in terminal names.
There are two ways of writing terminal rules:
%toke
.The value returned by yylex is always one of the terminal symbols (or 0 for end of input). Thus the grammar rules must be consistent with this. Each named token type becomes a C macro in the parser file so yylex can use the name to stand for the code. This is why periods do not work in terminal symbol names.
If yylex is defined in a separate file, it is necessary to arrange for token macro definitions to be available there. Use the '-d' option when running bison so that it will write these macro definitions to a separate header file "name.tab.h", which can be included in other source files as needed.
Example: a bison rule for the C return statement (nonterminal) would be:
stmt: RETURN expr ';' ;
3.1 Notation: (used in actions)
$$ : value of the symbol being recognized $1 : value of the 1st item of the rule $2 : value of the 2nd item of the rule, ... $n : where n is 0 or negative referes to tokens and groupings on the stack before those that match the current rule. This is risky, unles you reliably know the context in which it occurs and can guarantee the information is still on the stack. Example: goo: expr bar '+' expr { ... } | expr bar '-' expr { ... } ; gum: /* empty */ { previous_expr = $0 } ; which works as long asgum
is used only as shown above, then always $0 refers to theexpr
which precedesgum
in the definition ofgoo
.
3.2 Syntax General Form:
A bison grammar rule has the following general form:
result components... ;
where result is the nonterminal symbol defined by the rule (or rule alternative) and components are a sequence of terminal and nonterminal symbols that define this rule (or rule alternative), e.g.
Whitespace in rules has no significance in a rule except to separate symbols. Any amount of white space is equivalent. A semicolon ends the rule.exp: exp '+' exp ;
Within the components can be actions that specify the semantics of the rule. An action is C statements contained within curly braces:
{ C statements }
It is good practice to write a comment /* empty */
in each
rule (alternative) with an empty component.
In order to be useful a parser must produce some output. A bison grammar rule can have an action made up of C statements. The action defines the symantics of a rule. Each time the rule is matched the action code is executed. Usually this action is to compute the value of the whole (i.e. of the rule token being defined and matched) from the value of its parts (in the RHS). For example:
expr: expr '+' expr { $$ = $1 + $3; } ;
The parse created by bison groups tokens according to the input rules. The tokens come from the lexical analyzer that must be supplied (usually in C) along with the rule file input. The generated bison parser is C code, with the top level function yyparse implements the langauge described in the input file. When there is no action present, by default bison does: { $$ = $1; }.
If there is only a single data type then, the $$ and $n always have
that data type. If %union
is used to specify a set of data types,
then it is necessary to declare the type for each terminal and
nonterminal symbol that can have a value.
If %union
has been used to specify a variety of data types
(replacing the use of the YYSTYPE macro), then it is necessary to declare a
choice of these types for each terminal or nonterminal symbol
that can have a value. Then each time a $$ or $n is used in action
its type is determine by the type of the symbol it refers to.
Alternatives for a symbol can be separated by the | character:
a: b | c ;
or can be combined by separated instances of the rule:
a: b ; a: c ;
A rule is recursive when its result nonterminal appears also on its right hand side (in one or more components. This can be left recursion or right recursion when the result nonterminal appears first (left) or last (right). Examples:
expseq_l: exp | expseq_l ',' exp // left recursion ; expseq_r: exp | exp ', 'expseq_r // right recursion ;
The advantage of the preferred left recursion is that it can parse a sequence of any number of items with a bounded stack size. Right recursion uses a stack proportional to the number of items in the sequence, because all items must be shifted onto the stack before a rule can be applied even once.
Indirect or mutual recursion occurs when the result of a rule does not directly appear on the right hand side, but does appear afet rone or more steps of rules. For example:
expr: primary | primary '+' primary ; primary: constant | '(' expr ')' ;
The above is an example of mutual recursion since each of the rules refers to the other.
This function is not a complete C program. It is necessary to have additional functions. One is the lexical analyzer, yylex, another is the error reporting function yyerror which the parser calls when an error is detected. A main function must also be provided which calls yyparse. All variables and function names in the bison parser start with yy or YY, except for the one defined by the rules.
The function yyerrror is called if a syntax error in the input is
detected (i.e. no rule match exists), and then by default yyparse exits
with a non zero exit code. The bison language has the symbol
error
which is a reserved predefined terminal symbol reserved
for error recovery, and shoul not be used for any other purpose.
In particular, yylex should never return this value.
If there is a parse error, then error becomes the current lokkahead token. Action associated with error are then executed and the lookahead token is changed back to the token that originally caused the syntax error (see p69). As an example, here is a modified rule:
stmt: RETURN expr ';' | error ';' { yyerrok; } ;
yyerrok is a macro defined by bison whose meaning is error recovery is complete.
Bison - The YACC Compatible Parser Generator, December 1992, Bison Version 1.20;
by Charles Donnelly and Richard Stallman, Published by the Free Software Foundation;
ISBN 1-882114-30-2.
2006-2008