Regular Expressions

Next: Non-Deterministic Finite-State Automata Up: REGULAR LANGUAGES Previous: REGULAR LANGUAGES Contents

Regular Expressions

Regular expressions are used to define patterns of characters; they are used in UNIX tools such as awk, grep, vi and, of course, lex.

A regular expression is just a form of notation, used for describing sets of words. For any given set of characters $\Sigma$ , a regular expression over $\Sigma$ is defined by:

The empty string, $\varepsilon$ , which denotes a string of length zero, and means ``take nothing from the input''. It is most commonly used in conjunction with other regular expressions eg. to denote optionality.
Any character in $\Sigma$ may be used in a regular expression. For instance, if we write a as a regular expression, this means ``take the letter a from the input''; ie. it denotes the (singleton) set of words {``a''}
The union operator, ``|'', which denotes the union of two sets of words. Thus the regular expression a|b denotes the set {``a'', ``b''}, and means ``take either the letter a or the letter b from the input''
Writing two regular expressions side-by-side is known as concatenation; thus the regular expression ab denotes the set {``ab''} and means ``take the character a followed by the character b from the input''.
The Kleene closure of a regular expression, denoted by ``*'', indicates zero or more occurrences of that expression. Thus a* is the (infinite) set { $\varepsilon$ , ``a'', ``aa'', ``aaa'', ...} and means ``take zero or more as from the input''.

Brackets may be used in a regular expression to enforce precedence or increase clarity.

Next: Non-Deterministic Finite-State Automata Up: REGULAR LANGUAGES Previous: REGULAR LANGUAGES Contents

James Power 2002-11-29