*Regular expressions* (regexps) are a both a theoretical model of computation and a practical basis for language processing (e.g. the Unix command-line tools `grep` and `awk`, scripting languages such as `bash` and `perl` and the `lex` and `alex` compiler-building tools). This tutorial we are going to build a simple regexp matcher in Haskell. The key to expressing matching in an elegant and compositional way is to use a combination of *algebraic datatypes* and programming with *continuations*. Note, however, that the emphasis of this tutorial is on clarity, *not* optimization; real-world code should rely on an industrial-strengh library such as `regexp-pcre`. This tutorial was inspired by Olivier Danvy's [Defunctionalization at Work](http://www.brics.dk/RS/01/23) (a BRICS technical report). ## Regular expressions > I define UNIX as "30 definitions of regular expressions living under > one roof." --- Don Knuth Although practical tools adopt extended definitions of regular expressions, we will consider only simple *regexps* built as follows: - 0 that matches nothing (the empty language); - 1 (or *epsilon*) that matches the empty string; - a single character *c* matching itself; - the union (+) of two *regexp*s; - the concatenation (.) of two *regexp*s; - the zero-or-more repetition (\*) of a *regexp* (also called *Kleene closure*). Concatenation, union and repetition are standard in practical tools such as `grep`. Constants 0 and 1 are necessary to ensure good algebraic properties (every language recognized by an automaton can be represented by a regexp) and sometimes ommitted in practice: e.g. in `grep` the empty string is represented by `^$` but there is no representation for the empty language. Instead of re-using some existing type to encode regexps (e.g. strings) we begin by defining a custom recursive datatype. This will make it easier to process regexps and define the matching algorithm. ```haskell data Regexp = Zero -- empty | One -- epsilon | Lit Char -- single character | Plus Regexp Regexp -- union (+) | Cat Regexp Regexp -- concatenation (.) | Many Regexp -- repetition (*) ``` Some examples of regexp together with a description of what they match: ```haskell Lit 'a' -- an 'a' Plus (Lit 'a') (Lit 'b') -- an 'a' or a 'b' Cat (Many (Lit 'a')) (Lit 'b') -- b, ab, aab, aaab, ... ``` Note that *any* value of the `Regexp` data type (except for non-terminating ones) corresponds to a valid regular expression. This means that ill-formed regexps are avoided simply by type checking! However, writting complex regexps this way is verbose and error-prone. We therefore introduce some shortcuts. First, we define two infix operators for union and concatenation of regexps. We can also use some algebraic properties such as `0+e = e+0 = e` and `1.e = 1.e = e` to do some automatic simplification. We also define a "smart" constructor for repetition. ```haskell infixl 6 <+> infixl 7 <> (<+>) :: Regexp -> Regexp -> Regexp Zero <+> e = e e <+> Zero = e e1 <+> e2 = Plus e1 e2 (<>) :: Regexp -> Regexp -> Regexp Zero <> _ = Zero _ <> Zero = Zero One <> e = e e <> One = e e1 <> e2 = Cat e1 e2 many :: Regexp -> Regexp many Zero = One many One = One many (Many e) = Many e many e = Many e ``` Second, we employ the `OverloadedStrings` GHC extension to automatically convert any string literal to a regexp. For example, the string `"abc"` is converted to a concatenation of characters: ```haskell Cat (Lit 'a') (Cat (Lit 'b') (Lit 'c')) ``` (*To see how this is achived open the full code window on the runnable example at the end.*) Using these operators we can write regexps quite succintly: ``` haskell "ab" "a"<+>"b" "ab" <> many ("a"<+>"b") ``` ## Matching Our objective is to define a matching function, i.e. a function that takes a regexp and a string and yields a boolean. ```haskell match :: Regexp -> String -> Bool ``` However, if we try to define the `match` function directly by recursion on the `Regexp` data type we run into problems. For example, to match the concatenation of two regexps, we would have to split the input string: ```haskell match (Cat e1 e2) cs = match e1 cs1 && match e2 cs2 -- incomplete: missing some cs1, cs2 such that cs=cs1++cs2 ``` The problem is that `match` does not yield how much of the input string was matched. One solution would be to have `match` yield a pair e.g. of a boolean and a string. But another more elegant solution is to define a "worker" function taking an extra parameter called the *continuation* that specifies what to do with the non-matched part of the string; in this case, the continuation is a function from strings to booleans (the result of matching). For redability we define a type synomym for continuations; ```haskell type Cont = String -> Bool -- type for continuations ``` We can now define the worker function `accept` by case-analysis on the regexp; note that `accept` (unlike `match`) is higher-order because the continuation argument `k` is a function. The top-level function `match` simply calls `accept` with a continuation that checks for the empty string (i.e. the `null` function from the standard Prelude). ```haskell accept :: Regexp -> String -> Cont -> Bool -- worker function accept Zero cs k = False accept One cs k = k cs accept (Lit c) (c':cs) k = c==c' && k cs accept (Lit c) [] k = False accept (Cat e1 e2) cs k = accept e1 cs (\cs' -> accept e2 cs' k) accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k accept (Many e) cs k = acceptMany e cs k where acceptMany e cs k = k cs || accept e cs (\cs' -> cs'/=cs && acceptMany e cs' k) ``` The case of `Zero` always fails while `One` success and applies the continuation to the remaining string. The case for single character checks the start of the string and applies the continuation. The case for union is trivial. Concatenation is more interesting: we simply call `accept` for the first regexp *e1* with a continuation that calls `accept` for *e2* (and then proceeds to the original continuation). Finally, we use an auxiliary definition `acceptMany` for matching the repetition. The following example allows experimenting with the matcher; try changing the string or the regexp and re-running the program. ```active haskell {-# LANGUAGE OverloadedStrings #-} import GHC.Exts (IsString(..)) data Regexp = Zero -- empty | One -- epsilon | Lit Char -- single character | Plus Regexp Regexp -- union (+) | Cat Regexp Regexp -- concatenation (.) | Many Regexp -- repetition (*) deriving Show infixl 6 <+> infixl 7 <> (<+>) :: Regexp -> Regexp -> Regexp Zero <+> e = e e <+> Zero = e e1 <+> e2 = Plus e1 e2 (<>) :: Regexp -> Regexp -> Regexp Zero <> _ = Zero _ <> Zero = Zero One <> e = e e <> One = e e1 <> e2 = Cat e1 e2 many :: Regexp -> Regexp many Zero = One many One = One many (Many e) = Many e many e = Many e type Cont= String -> Bool accept :: Regexp -> String -> Cont -> Bool -- worker function accept Zero cs k = False accept One cs k = k cs accept (Lit c) (c':cs) k = c==c' && k cs accept (Lit c) [] k = False accept (Cat e1 e2) cs k = accept e1 cs (\cs' -> accept e2 cs' k) accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k accept (Many e) cs k = acceptMany e cs k where acceptMany e cs k = k cs || accept e cs (\cs' -> cs'/=cs && acceptMany e cs' k) match :: Regexp -> String -> Bool match re s = accept re s null instance IsString Regexp where fromString cs = foldr ((<>) . Lit) One cs -- show main = print (match ("ab" <> many "ba") "abbaba") -- /show ``` ## Exercises 1. Define auxiliary functions `plus` and `option` for the `+` and `?` operators in `grep`, i.e. one-or-more times repetition and optional matching. @@@ Here are one-line solutions that re-use the previous functions. ```haskell plus, option :: Regexp -> Regexp plus e = e <> many e option e = "" <+> e ``` @@@ 2. The function `acceptMany` for handling repetition includes a condition `cs'/=cs` to check that characters are consumed in the recursive case. However, checking list inequality requires traversing the full list in the worst-case. Find a way to avoid this inefficient check. @@@ It is sufficient to pair each character with its *index* and replace the check that the inequality on indices rather than string. The type of continuations and the worker function need to be modified; the relevant change is the case of the repetition operator: ```haskell type Cont = [(Int,Char)] -> Bool -- new type for continuations accept :: Regexp -> [(Int,Char)] -> Cont -> Bool accept (Many e) ics k = acceptMany e ics k where acceptMany e ics k = k ics || accept e ics (\ics' -> fst ics'/=fst ics && acceptMany e ics' k) -- other cases are straightforward match :: Regexp -> String -> Bool match re cs = accept re (zip [0..] cs) null ``` @@@