A regular expression matcher

Regular expressions (regexps) are a both a theoretical model of computation and a practical basis for language processing (e.g. the Unix command-line tools grep and awk, scripting languages such as bash and perl and the lex and alex compiler-building tools).

This tutorial we are going to build a simple regexp matcher in Haskell. The key to expressing matching in an elegant and compositional way is to use a combination of algebraic datatypes and programming with continuations. Note, however, that the emphasis of this tutorial is on clarity, not optimization; real-world code should rely on an industrial-strengh
library such as regexp-pcre.

This tutorial was inspired by Olivier Danvy's Defunctionalization at Work (a BRICS technical report).

Regular expressions

I define UNIX as "30 definitions of regular expressions living under one roof." --- Don Knuth

Although practical tools adopt extended definitions of regular expressions, we will consider only simple regexps built as follows:

0 that matches nothing (the empty language);
1 (or epsilon) that matches the empty string;
a single character c matching itself;
the union (+) of two regexps;
the concatenation (.) of two regexps;
the zero-or-more repetition (*) of a regexp (also called Kleene closure).

Concatenation, union and repetition are standard in practical tools such as grep. Constants 0 and 1 are necessary to ensure good algebraic properties (every language recognized by an automaton can be represented by a regexp) and sometimes ommitted in practice: e.g. in grep the empty string is represented by ^$ but there is no representation for the empty language.

Instead of re-using some existing type to encode regexps (e.g. strings) we begin by defining a custom recursive datatype. This will make it easier to process regexps and define the matching algorithm.

data Regexp = Zero                  -- empty
            | One                   -- epsilon
            | Lit Char              -- single character
            | Plus Regexp Regexp    -- union (+)
            | Cat  Regexp Regexp    -- concatenation (.)
            | Many Regexp           -- repetition (*)

Some examples of regexp together with a description of what they match:

Lit 'a'                        -- an 'a'
Plus (Lit 'a') (Lit 'b')       -- an 'a' or a 'b'
Cat (Many (Lit 'a')) (Lit 'b') -- b, ab, aab, aaab, ...

Note that any value of the Regexp data type (except for non-terminating ones) corresponds to a valid regular expression. This means that ill-formed regexps are avoided simply by type checking!

However, writting complex regexps this way is verbose and error-prone. We therefore introduce some shortcuts.

First, we define two infix operators for union and concatenation of regexps. We can also use some algebraic properties such as 0+e = e+0 = e and 1.e = 1.e = e to do some automatic simplification. We also define a "smart" constructor for repetition.

infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

many :: Regexp -> Regexp 
many Zero     = One
many One       = One
many (Many e)  = Many e
many e         = Many e

Second, we employ the OverloadedStrings GHC extension to automatically convert any string literal to a regexp. For example, the string "abc" is converted to a concatenation of characters:

Cat (Lit 'a') (Cat (Lit 'b') (Lit 'c'))

(To see how this is achived open the full code window on the runnable example at the end.)

Using these operators we can write regexps quite succintly:

"ab" 
"a"<+>"b"
"ab" <> many ("a"<+>"b")

Matching

Our objective is to define a matching function, i.e. a function that takes a regexp and a string and yields a boolean.

match :: Regexp -> String -> Bool

However, if we try to define the match function directly by recursion on the Regexp data type we run into problems. For example, to match the concatenation of two regexps, we would have to split the input string:

match (Cat e1 e2) cs = match e1 cs1 && match e2 cs2
    -- incomplete: missing some cs1, cs2 such that cs=cs1++cs2

The problem is that match does not yield how much of the input string was matched.

One solution would be to have match yield a pair e.g. of a boolean and a string. But another more elegant solution is to define a "worker" function taking an extra parameter called the continuation that specifies what to do with the non-matched part of the string; in this case, the continuation is a function from strings to booleans (the result of matching). For redability we define a type synomym for continuations;

type Cont = String -> Bool   -- type for continuations

We can now define the worker function accept by case-analysis on the regexp; note that accept (unlike match) is higher-order because the continuation argument k is a function. The top-level function match simply calls accept with a continuation that checks for the empty string (i.e. the null function from the standard Prelude).

accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Many e) cs k     = acceptMany e cs k
  where 
     acceptMany e cs k 
       = k cs || accept e cs (\cs' -> cs'/=cs && acceptMany e cs' k)

The case of Zero always fails while One success and applies the continuation to the remaining string. The case for single character checks the start of the string and applies the continuation. The case for union is trivial. Concatenation is more interesting: we simply call accept for the first regexp e1 with a continuation that calls accept for e2 (and then proceeds to the original continuation). Finally, we use an auxiliary definition acceptMany for matching the repetition.

The following example allows experimenting with the matcher; try changing the string or the regexp and re-running the program.

{-# LANGUAGE OverloadedStrings #-}
import GHC.Exts (IsString(..))

data Regexp = Zero                  -- empty
            | One                   -- epsilon
            | Lit Char              -- single character
            | Plus Regexp Regexp    -- union (+)
            | Cat  Regexp Regexp    -- concatenation (.)
            | Many Regexp           -- repetition (*)
            deriving Show

infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

many :: Regexp -> Regexp 
many Zero     = One
many One       = One
many (Many e)  = Many e
many e         = Many e

type Cont= String -> Bool

accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Many e) cs k     = acceptMany e cs k
  where 
     acceptMany e cs k 
       = k cs || accept e cs (\cs' -> cs'/=cs && acceptMany e cs' k)


match :: Regexp -> String -> Bool
match re s = accept re s null

instance IsString Regexp where
  fromString cs = foldr ((<>) . Lit) One cs

-- show
main = print (match ("ab" <> many "ba") "abbaba")
-- /show

Exercises

Define auxiliary functions plus and option for the + and ? operators in grep, i.e. one-or-more times repetition and optional matching.
Here are one-line solutions that re-use the previous functions.
```
plus, option :: Regexp -> Regexp
plus e = e <> many e
option e = "" <+> e
```
The function acceptMany for handling repetition includes a condition cs'/=cs to check that characters are consumed in the recursive case. However, checking list inequality requires traversing the full list in the worst-case. Find a way to avoid this inefficient check.
It is sufficient to pair each character with its index and replace the check that the inequality on indices rather than string. The type of continuations and the worker function need to be modified; the relevant change is the case of the repetition operator:
```
type Cont = [(Int,Char)] -> Bool   -- new type for continuations

accept :: Regexp -> [(Int,Char)] -> Cont -> Bool
accept (Many e) ics k  = acceptMany e ics k
   where 
     acceptMany e ics k 
        = k ics || accept e ics (\ics' -> fst ics'/=fst ics && acceptMany e ics' k)
-- other cases are straightforward

match :: Regexp -> String -> Bool
match re cs = accept re (zip [0..] cs) null
```

Sections

Regular expressions

Matching

Exercises