next up previous contents
Next: Derivations and Parse Trees Up: Grammars for Programming Languages Previous: Grammars for Programming Languages   Contents

BNF Notation

One of the earliest forms of notation used for describing the syntax of a programming language was Backus-Naur Form (BNF); this is basically just a variant of a context-free grammar, with the symbol ``::='' used in place of ``$\rightarrow$'' to mean ``is defined as''. Additionally, non-terminals are usually written in angle-brackets and terminals in quotes.

For example, we could describe a block of statements in Pascal as:

$\langle$ block $\rangle$ ::= ``BEGIN'' $\langle$ opt-stats$\rangle$ ``END''.
$\langle$ opt-stats $\rangle$ ::= $\langle$ stats-list $\rangle$ $\mid$ $\varepsilon$.
$\langle$ stats-list $\rangle$ ::= $\langle$ statement $\rangle$ $\mid$ $\langle$ statement $\rangle$ ``;'' $\langle$ stats-list $\rangle$.

An extension of this notation, known as extended BNF (EBNF), allows us to use regular- expression-like constructs on the right-hand-side of the rules: for any sequence of symbols $\alpha$, we write ``[$\alpha$]'' to say that $\alpha$ is optional, and ``{$\alpha$}'' to denote 0 or more repetitions of $\alpha$ (ie. the Kleene closure of $\alpha$).

Thus we could re-write the above rules in EBNF as:

$\langle$ block $\rangle$ ::= ``BEGIN'' [$\langle$ stats-list $\rangle$] ``END''.
$\langle$ stats-list $\rangle$ ::= {$\langle$ statement $\rangle$ ";"} $\langle$ statement $\rangle$.

Using EBNF has the advantage of simplifying the grammar by reducing the need to use recursive rules.

We could describe the complete syntax of a programming language using EBNF (or equivalently a CFG), but it tends to make things easier if we break it up into two sections:

  1. The lexical syntax, which defines which patterns of characters make up which words. This can be described using regular expressions

  2. The context-free syntax, which defines which sequences of words which make up phrases from the programming language; we describe this using a context-free grammar.

Thus the first task in writing a parser for a language is breaking up its syntax into the ``regular'' and ``context-free'' part. The choice is arbitrary but a simple guideline is that:

characters from the input should not appear in the context-free syntax;
these should be tokenised by the lexical phase.
As well as modularising our task, this approach will facilitate debugging in parser generators such as yacc.


next up previous contents
Next: Derivations and Parse Trees Up: Grammars for Programming Languages Previous: Grammars for Programming Languages   Contents
James Power 2002-11-29