Lexical analysis, also called scanning, is the phase of the compilation
process that deals with the actual program being compiled, character by
character. The higher-level parts of the compiler call the lexical analyzer
with the command "get the next word from the input", and it is the scanner's
job to sort through the input characters and find this word; a rough sketch
of this interface is given after the list below.
The types of "words" commonly found in a program are:
- programming language keywords, such as if, while, struct, int etc.
- operator symbols like =, +, -, &&, !, <= etc.
- other special symbols like: ( ), { }, [ ], ;, & etc.
- constants like 1, 2, 3, 'a', 'b', 'c', "any quoted string" etc.
- variable and function names (called identifiers) such as x, i, t1 etc.
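As a rough illustration, each of these categories can be given a numeric
code (a "token", as described below), and the scanner exposed as a single
function that returns the code of the next word it finds. The following C
sketch is purely illustrative; the names are our own, not part of any
particular compiler:

/* Illustrative codes, one per category of word. */
enum token {
    IF_TOKEN, WHILE_TOKEN,          /* keywords              */
    ASSIGN_TOKEN, PLUS_TOKEN,       /* operator symbols      */
    LPAREN_TOKEN, SEMI_TOKEN,       /* other special symbols */
    NUM_TOKEN, STRING_TOKEN,        /* constants             */
    IDENT_TOKEN,                    /* identifiers           */
    EOF_TOKEN                       /* end of the input      */
};

/* The higher-level parts of the compiler repeatedly call this to
   "get the next word from the input". */
enum token get_next_token(void);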
Some languages (such as C) are case sensitive, in that they differentiate
between, e.g., if and IF; thus the former would be a keyword, the latter a
variable name.
Also, most languages insist that identifiers cannot be any of the keywords,
or contain operator symbols (some versions of Fortran do not, which makes
their lexical analysis quite difficult).
In addition to the basic grouping process, lexical analysis usually
performs the following tasks:
- Since there are only a finite number of types of words, instead of passing the actual word
to the next phase we can save space by passing a suitable
representation. This representation is known as a token.
- If the language isn't case sensitive, we can eliminate differences of case at this point
by using just one token per keyword, irrespective of case; e.g.
#define IF_TOKEN    1
#define WHILE_TOKEN 2
.....
if we meet "IF", "If", "iF", "if", ... then return IF_TOKEN
if we meet "WHILE", "While", "WHile", ... then return WHILE_TOKEN
- We can pick out mistakes in the lexical syntax of the program, such as the use of a character
which is not valid in the language. (Note that we do not worry about the
combination of patterns; e.g. the sequence of characters "+*" would be
returned as PLUS_TOKEN, MULT_TOKEN, and it would be up to the next phase to
see that these should not follow in sequence.)
- We can eliminate pieces of the program that are no longer relevant, such as spaces, tabs, carriage-returns (in most languages), and comments; a short sketch of this is given below.
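For instance, a scanner for a C-like language might simply loop past
whitespace and comments before it starts to build the next token. The
following C sketch assumes the usual getc/ungetc style of character input
and standard (non-nesting) block comments; it is illustrative rather than
taken from any real compiler:

#include <stdio.h>

/* Read and discard blanks, tabs, newlines and block comments, then
   return the first character that could start a token (or EOF). */
int skip_irrelevant(FILE *in)
{
    for (;;) {
        int c = getc(in);
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;                        /* plain whitespace          */
        if (c == '/') {
            int next = getc(in);
            if (next == '*') {               /* start of a block comment  */
                int prev = 0;
                while ((c = getc(in)) != EOF && !(prev == '*' && c == '/'))
                    prev = c;
                continue;                    /* comment fully consumed    */
            }
            if (next != EOF)
                ungetc(next, in);            /* only one char pushed back */
            return '/';                      /* a lone '/' is an operator */
        }
        return c;                            /* start of the next token   */
    }
}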
In order to specify the lexical analysis process, we need some method of
describing which patterns of characters correspond to which words.
James Power
2002-11-29