Lexical analysis, also called scanning, is the phase of the compilation
process that deals with the actual program being compiled, character by
character. The higher-level parts of the compiler call the lexical analyzer
with the command "get the next word from the input", and it is the scanner's
job to sort through the input characters and find this word; a rough sketch
of this interface is given after the list below.
The types of "words" commonly found in a program are:
- programming language keywords, such as if, while, struct, int etc.
- operator symbols like =, +, -, &&, !, <= etc.
- other special symbols like: ( ), { }, [ ], ;, & etc.
- constants like 1, 2, 3, 'a', 'b', 'c', "any quoted string" etc.
- variable and function names (called identifiers) such as x, i, t1 etc.
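As a rough illustration, each of these categories can be given a numeric
code (a "token", as described below), and the scanner exposed as a single
function that returns the code of the next word it finds. The following C
sketch is purely illustrative; the names are our own, not part of any
particular compiler:

/* Illustrative codes, one per category of word. */
enum token {
    IF_TOKEN, WHILE_TOKEN,          /* keywords              */
    ASSIGN_TOKEN, PLUS_TOKEN,       /* operator symbols      */
    LPAREN_TOKEN, SEMI_TOKEN,       /* other special symbols */
    NUM_TOKEN, STRING_TOKEN,        /* constants             */
    IDENT_TOKEN,                    /* identifiers           */
    EOF_TOKEN                       /* end of the input      */
};

/* The higher-level parts of the compiler repeatedly call this to
   "get the next word from the input". */
enum token get_next_token(void);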
Some languages (such as C) are case sensitive, in that they differentiate
between, e.g., if and IF; thus the former would be a keyword, the latter a
variable name.
Also, most languages insist that identifiers cannot be any of the keywords,
or contain operator symbols (some versions of Fortran do not, which makes
their lexical analysis quite difficult).
In addition to the basic grouping process, lexical analysis usually
performs the following tasks:
- Since there are only a finite number of types of words, instead of passing the actual word
to the next phase we can save space by passing a suitable
representation. This representation is known as a token.
- If the language isn't case sensitive, we can eliminate differences of case at this point
by using just one token per keyword, irrespective of case; e.g.
#define IF_TOKEN    1
#define WHILE_TOKEN 2
.....
if we meet "IF", "If", "iF", "if", ... then return IF_TOKEN
if we meet "WHILE", "While", "WHile", ... then return WHILE_TOKEN
- We can pick out mistakes in the lexical syntax of the program, such as the use of a character
which is not valid in the language. (Note that we do not worry about the
combination of patterns; e.g. the sequence of characters "+*" would be
returned as PLUS_TOKEN, MULT_TOKEN, and it would be up to the next phase to
see that these should not follow in sequence.)
- We can eliminate pieces of the program that are no longer relevant, such as spaces, tabs, carriage-returns (in most languages), and comments; a short sketch of this is given below.
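For instance, a scanner for a C-like language might simply loop past
whitespace and comments before it starts to build the next token. The
following C sketch assumes the usual getc/ungetc style of character input
and standard (non-nesting) block comments; it is illustrative rather than
taken from any real compiler:

#include <stdio.h>

/* Read and discard blanks, tabs, newlines and block comments, then
   return the first character that could start a token (or EOF). */
int skip_irrelevant(FILE *in)
{
    for (;;) {
        int c = getc(in);
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;                        /* plain whitespace          */
        if (c == '/') {
            int next = getc(in);
            if (next == '*') {               /* start of a block comment  */
                int prev = 0;
                while ((c = getc(in)) != EOF && !(prev == '*' && c == '/'))
                    prev = c;
                continue;                    /* comment fully consumed    */
            }
            if (next != EOF)
                ungetc(next, in);            /* only one char pushed back */
            return '/';                      /* a lone '/' is an operator */
        }
        return c;                            /* start of the next token   */
    }
}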
In order to specify the lexical analysis process, we need some method of
describing which patterns of characters correspond to which words.
James Power
2002-11-29