One of the earliest forms of notation used for describing the syntax
of a programming language was Backus-Naur Form (BNF); this is
basically just a variant of a context-free grammar, with the symbol
``::='' used in place of ``'' to mean ``is defined as''.
Additionally, non-terminals are usually written in angle-brackets and
terminals in quotes.
For example, we could describe a block of statements in Pascal as:
block | ::= | ``BEGIN'' opt-stats ``END''. |
opt-stats | ::= | stats-list . |
stats-list | ::= | statement statement ``;'' stats-list . |
An extension of this notation, known as extended BNF (EBNF),
allows us to use regular- expression-like constructs on the
right-hand-side of the rules: for any sequence of symbols , we
write ``[]'' to say that is optional, and
``{}'' to denote 0 or more repetitions of (ie. the
Kleene closure of ).
Thus we could re-write the above rules in EBNF as:
block | ::= | ``BEGIN'' [ stats-list ] ``END''. |
stats-list | ::= | { statement ";"} statement . |
Using EBNF has the advantage of simplifying the grammar by reducing
the need to use recursive rules.
We could describe the complete syntax of a programming language using EBNF (or equivalently a CFG), but it tends to make things easier if we break it up into two sections:
Thus the first task in writing a parser for a language is breaking up its syntax into the ``regular'' and ``context-free'' part. The choice is arbitrary but a simple guideline is that: