bison

2008 Oct 18


The parser generator bison parses only LALR(1) context free languages, which means it must be possible to tell how to parse any portion of an input string to the parser with a just a single token of lookahead (a restriction which is LR(1), left recursive) {see sec 5.7 p64}.


1. Using Bison

Language design and parser generation has the following steps.

  1. Create an input file which specifies the grammar and action
  2. Create a lexical analyzer to supply tokens to the parser yyparse
  3. Write a function to call the parser (directly or indirectly from main)
  4. Write error reporting functions
  5. Run bison on the file in (1)
  6. Compile the code output from bison along with other source files.
  7. Link the objects files into an executable program

2. Bison Input File

The input file to bison is organized as follows.

%{
   C Declarations
%}

Bison Declarations

%%
   Grammar Rules
%%

Additional C code

The C Declarations may define types and variables that used in the action. This section may also use C preprocessor #define or #include statements. The contents of this section are copied to the beginning of the output parser file (before yyparse.

The Bison Declarations define the names of the terminal and nonterminal symbols, and can also specify operator precedence and the data types of the symbols. In simple grammars the default declarations may be enough. Each terminal symbol that is not a single character constant or literal must be declared here. If ther are multiple declarations then the first one has the lowest precedence and the last has the highest precedence.

The Grammar Rules section contains only one or more grammar rules. There must be at least one rule.

The Additional C code can contain any C code needed (e.g. the definition of the lexcial analyzer yylex, plus functions called by rule action. Even main can be defined here. This section is copied to the end of the parser file, after yyparse. Be sure to avoid variable or function names starting with yy or YY.

bison and yacc input files usually have the extension .y. If the input is rpcalc.y the output file name is rpcalc.tab.c.


3. Rule Syntax

A bison rule defines a token, or symbol, each of which has a type and a value. A token can be terminal, i.e. made up of characters from the base alphabet, or nonterminal, i.e. composed of other tokens. Internally a token is represented by a unique number that the parser assigns to it. This number code is what yylex returns. In the grammar rules only the symbolic name is used; you never need to know the actual code value.

By convention symbol names are lower case and must start with a letter or underscore or period. Subsequent characters in the name can also be digits. Periods cannot be used in terminal names.

There are two ways of writing terminal rules:

  1. A named token type is written with an identifier, and by convention is usually all upper case. Each such name must be defined with a bison declaration such as %toke.

  2. A character token type (or literal token) is written in C like notation, e.g. '+'. It does not need to be declared unless its type, assoicativity or precedence needs to be explicitly specified.

    By convention, the character code is used as the token type code. Many other users expect this (so it is not advisable to change it). All the usual C escape sequences are available. Since 0 is used by yylex to mean end of input the NUL character cannot be used.

The value returned by yylex is always one of the terminal symbols (or 0 for end of input). Thus the grammar rules must be consistent with this. Each named token type becomes a C macro in the parser file so yylex can use the name to stand for the code. This is why periods do not work in terminal symbol names.

If yylex is defined in a separate file, it is necessary to arrange for token macro definitions to be available there. Use the '-d' option when running bison so that it will write these macro definitions to a separate header file "name.tab.h", which can be included in other source files as needed.

Example: a bison rule for the C return statement (nonterminal) would be:

stmt: RETURN expr ';'
      ;

3.1 Notation: (used in actions)

$$ : value of the symbol being recognized
$1 : value of the 1st item of the rule
$2 : value of the 2nd item of the rule, ...
$n : where n is 0 or negative referes to tokens and groupings on
     the stack before those that match the current rule. This is
     risky, unles you reliably know the context in which it occurs
     and can guarantee the information is still on the stack. Example:

     goo:   expr bar '+' expr { ... }
          | expr bar '-' expr { ... }
          ;
     gum:   /* empty */
            { previous_expr = $0 }
          ;

     which works as long as gum is used only as shown above,
     then always $0 refers to the expr which precedes gum in
     the definition of goo.

3.2 Syntax General Form:

A bison grammar rule has the following general form:

result components... ;

where result is the nonterminal symbol defined by the rule (or rule alternative) and components are a sequence of terminal and nonterminal symbols that define this rule (or rule alternative), e.g.

exp: exp '+' exp ;
Whitespace in rules has no significance in a rule except to separate symbols. Any amount of white space is equivalent. A semicolon ends the rule.

Within the components can be actions that specify the semantics of the rule. An action is C statements contained within curly braces:

{ C statements } 

It is good practice to write a comment /* empty */ in each rule (alternative) with an empty component.


3.3 Semantic Action

In order to be useful a parser must produce some output. A bison grammar rule can have an action made up of C statements. The action defines the symantics of a rule. Each time the rule is matched the action code is executed. Usually this action is to compute the value of the whole (i.e. of the rule token being defined and matched) from the value of its parts (in the RHS). For example:

expr: expr '+' expr { $$ = $1 + $3; }
      ;

The parse created by bison groups tokens according to the input rules. The tokens come from the lexical analyzer that must be supplied (usually in C) along with the rule file input. The generated bison parser is C code, with the top level function yyparse implements the langauge described in the input file. When there is no action present, by default bison does: { $$ = $1; }.


3.4 Data Types of Values in Actions

If there is only a single data type then, the $$ and $n always have that data type. If %union is used to specify a set of data types, then it is necessary to declare the type for each terminal and nonterminal symbol that can have a value.

If %union has been used to specify a variety of data types (replacing the use of the YYSTYPE macro), then it is necessary to declare a choice of these types for each terminal or nonterminal symbol that can have a value. Then each time a $$ or $n is used in action its type is determine by the type of the symbol it refers to.


3.5 Alternatives

Alternatives for a symbol can be separated by the | character:

a:   b
   | c
   ;

or can be combined by separated instances of the rule:

a: b ;
a: c ;

3.6 Recursive Rules

A rule is recursive when its result nonterminal appears also on its right hand side (in one or more components. This can be left recursion or right recursion when the result nonterminal appears first (left) or last (right). Examples:

expseq_l:   exp
          | expseq_l ',' exp // left recursion
          ;

expseq_r:   exp
          | exp ', 'expseq_r // right recursion
          ;

The advantage of the preferred left recursion is that it can parse a sequence of any number of items with a bounded stack size. Right recursion uses a stack proportional to the number of items in the sequence, because all items must be shifted onto the stack before a rule can be applied even once.

Indirect or mutual recursion occurs when the result of a rule does not directly appear on the right hand side, but does appear afet rone or more steps of rules. For example:

expr:      primary
         | primary '+' primary
         ;
primary:   constant
         | '(' expr ')'
         ;

The above is an example of mutual recursion since each of the rules refers to the other.


3.7 Error Handling

This function is not a complete C program. It is necessary to have additional functions. One is the lexical analyzer, yylex, another is the error reporting function yyerror which the parser calls when an error is detected. A main function must also be provided which calls yyparse. All variables and function names in the bison parser start with yy or YY, except for the one defined by the rules.

The function yyerrror is called if a syntax error in the input is detected (i.e. no rule match exists), and then by default yyparse exits with a non zero exit code. The bison language has the symbol error which is a reserved predefined terminal symbol reserved for error recovery, and shoul not be used for any other purpose. In particular, yylex should never return this value.

If there is a parse error, then error becomes the current lokkahead token. Action associated with error are then executed and the lookahead token is changed back to the token that originally caused the syntax error (see p69). As an example, here is a modified rule:

stmt:   RETURN expr ';'
      | error ';' { yyerrok; }
      ;

yyerrok is a macro defined by bison whose meaning is error recovery is complete.


4. Book Reference

Bison - The YACC Compatible Parser Generator, December 1992, Bison Version 1.20;
by Charles Donnelly and Richard Stallman, Published by the Free Software Foundation;
ISBN 1-882114-30-2.


2006-2008