
# 7. Tokenizer: Higher Order Functions

2 Jul 2013

Previously we implemented a single-character proof-of-concept tokenizer. What we really need is the ability to recognize multi-character tokens such as identifiers and numbers. But before we get there, let's explore some new functional techniques.

## Higher Order Functions and Lambdas

It's a good programming practice to separate independent concerns. The code we came up with so far was a mixture of two such concerns:

1. Traversing a data structure
2. Performing an operation on each element

Try to identify the two in our implementation of `tokenize`:

```haskell
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
    | elem c "+-*/" = TokOp (operator c) : tokenize cs
    | isDigit c  = TokNum (digitToInt c) : tokenize cs
    | isAlpha c  = TokIdent [c]          : tokenize cs
    | isSpace c  = tokenize cs
    | otherwise  = error $ "Cannot tokenize " ++ [c]
```

These kinds of traversals are so common that functional languages abstract them into higher order functions. A higher order function is a function that takes another function as an argument.

This higher order functional approach has been so successful that it was eventually adopted by imperative languages. The C++ STL is based on the separation of concerns and higher order functions: traversal is abstracted into iterators, and generic algorithms accept functions (function pointers, function objects, or lambdas) as arguments. If you're familiar with the STL algorithms `std::transform`, `std::copy_if`, and `std::remove_if`, you'll have no problem understanding what follows.

Let me start with the function `map` defined in the Prelude:

``map :: (a -> b) -> [a] -> [b]``

The first argument to `map` is a function `(a -> b)`. Notice that the parentheses around this argument are mandatory, because the arrow associates to the right. The second argument is a list of `a`, and the result is a list of `b`. The type signature pretty much says it all: `map` applies the function `(a -> b)` to every element of a list to produce a new list. The implementation is pretty straightforward too:

```haskell
import Data.Char -- for the example
import Prelude hiding (map)
-- show
map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (a : as) = f a : map f as

main = print $ map toUpper "hello world!"
```

Notice that `map` is generic in both type variables, `a` and `b`.
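Because the two type variables are independent, the output list may hold a different element type than the input. A quick illustration (not part of the tokenizer), using `digitToInt` from `Data.Char`:

```haskell
import Data.Char (digitToInt)

-- map can change the element type: Char in, Int out
main :: IO ()
main = print (map digitToInt "123")  -- prints [1,2,3]
```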

### Map and Filter in Action

Now suppose we have a helper function `tokenizeChar` that turns a single character into a token. We can rewrite our tokenizer using `map`:

```haskell
tokenize :: String -> [Token]
tokenize str = map tokenizeChar str
```

This can be simplified further using currying ("divide" both sides by `str`):

```haskell
tokenize :: String -> [Token]
tokenize = map tokenizeChar
```
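This kind of partial application isn't specific to `tokenize`; fixing the first argument of `map` yields a new function of lists. A tiny sketch (the name `shout` is made up for illustration):

```haskell
import Data.Char (toUpper)

-- shout is map with its first argument fixed;
-- the String argument is left implicit
shout :: String -> String
shout = map toUpper

main :: IO ()
main = putStrLn (shout "hello")  -- prints HELLO
```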

Here's the function `tokenizeChar`:

```haskell
tokenizeChar :: Char -> Token
tokenizeChar c | elem c "+-*/" = TokOp (operator c)
               | isDigit c  = TokNum (digitToInt c)
               | isAlpha c  = TokIdent [c]
               | isSpace c  = TokSpace
               | otherwise  = error $ "Cannot tokenize " ++ [c]
```

Notice that the skipping of whitespace cannot be done using `map`, because `map` cannot change the shape of the list (its length, in this case). That's why I had to introduce a new token, `TokSpace`, to replace whitespace characters.

Fortunately, there is another higher-order function, `filter`, for removing stuff from a list:

``filter :: (a -> Bool) -> [a] -> [a]``

It takes a predicate function, `(a -> Bool)`, and applies it to each element of the list. If the predicate returns `True`, the element is kept in the list; otherwise it's removed. `filter` is defined in the Prelude, but we could just as well have implemented it ourselves -- it's that easy:

```haskell
import Data.Char
import Prelude hiding (filter)
-- show
filter :: (a -> Bool) -> [a] -> [a]
filter _ [] = []
filter p (a : as) = if p a
                    then a : filter p as
                    else filter p as

main = putStrLn $ filter isDigit "1x+3y"
```

We are now ready to implement the function `deSpace` that filters out `TokSpace` tokens.

```haskell
deSpace :: [Token] -> [Token]
deSpace = filter notSpace

notSpace :: Token -> Bool
notSpace t = t /= TokSpace
```

(This definition will only work with tokens whose type is `deriving Eq`. Can you tell why? Recall that `/=` means not-equal.)

Since the auxiliary function `notSpace` is so simple, we can inline it using a lambda -- an anonymous function:

```haskell
deSpace :: [Token] -> [Token]
deSpace = filter (\t -> t /= TokSpace)
```

The syntax for lambdas is very simple. You start with a backslash, which looks a little like the Greek letter lambda, λ; follow it with the list of arguments; an arrow `->`; and the body of the function -- an expression. Lambda syntax doesn't support multiple patterns or guards, but otherwise lambdas are just like definitions of regular functions. Lambdas are so useful that they are now part of C++11 and Java 8.
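As a small illustration (toy examples, not part of the tokenizer), here are lambdas passed directly to `map` and `filter`:

```haskell
main :: IO ()
main = do
    -- a lambda with one argument, squaring each element
    print (map (\x -> x * x) [1, 2, 3])       -- [1,4,9]
    -- a lambda used as a predicate
    print (filter (\x -> x > 2) [1, 2, 3, 4]) -- [3,4]
```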

Here's the final version of the single-character tokenizer:

```haskell
import Data.Char

data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
           | TokSpace
    deriving (Show, Eq)

operator :: Char -> Operator
operator c | c == '+' = Plus
           | c == '-' = Minus
           | c == '*' = Times
           | c == '/' = Div

tokenize :: String -> [Token]
tokenize = map tokenizeChar

tokenizeChar :: Char -> Token
tokenizeChar c | elem c "+-*/" = TokOp (operator c)
               | isDigit c  = TokNum (digitToInt c)
               | isAlpha c  = TokIdent [c]
               | isSpace c  = TokSpace
               | otherwise  = error $ "Cannot tokenize " ++ [c]

deSpace :: [Token] -> [Token]
deSpace = filter (\t -> t /= TokSpace)

main = print $ deSpace $ tokenize " 1 + 4 / x "
```

Did you notice something interesting? This program doesn't exhibit any recursion at all. Recursion is hidden in the implementation of higher order functions.

A good program in any language should take advantage of higher order abstractions. Haskell makes this easier, but even in imperative languages it pays to raise the level of abstraction. For instance, the use of plain loops is now actively discouraged in modern C++ in favor of range-based `for` loops or STL algorithms like `for_each`, `transform`, `accumulate`, etc.; all of which were made considerably easier to use by lambdas.

Ex 1. Implement the function `toInts` from the previous tutorial using `map`. This function takes a string of digits and creates a list of `Int`s corresponding to these digits.

```haskell
import Data.Char -- (digitToInt)

toInts :: String -> [Int]
toInts = undefined

main = print $ toInts "30750"
```

Ex 2. Implement function `squares` that takes a list of integers and returns the list of their squares. Use higher order functions and lambdas.

```haskell
squares :: [Int] -> [Int]
squares = undefined

main = print $ squares [1..10]
```

Ex 3. Implement function `inCircle2` that takes a list of 2-D points and returns only those that fit inside the circle of radius 2.

```haskell
type Point = (Double, Double)

inCircle2 :: [Point] -> [Point]
inCircle2 = undefined

main = print $ inCircle2 [(0, 0), (2, -2), (1, -1), (1.9, 0.1), (10, 1)]
```

## Tokenizing Identifiers

Unfortunately, recognizing multi-character tokens can't be reduced to `map` and `filter`. The former doesn't modify the shape of the list; the latter does, but only by deleting elements. So let's go back to the recursive version of the tokenizer and let's start with multi-character identifiers.

It's easy to recognize the start of an identifier: it must be an alphabetic character. We even have a predicate `isAlpha` just for that purpose. This first character can be followed by zero or more alphanumeric characters. There is another predicate, `isAlphaNum`, for that purpose. Notice that the tokenizer has to change its behavior while processing an identifier: digits inside an identifier are not treated as numbers, as they would normally be.
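To get a feel for the two predicates, here's a quick check (a toy example):

```haskell
import Data.Char (isAlpha, isAlphaNum)

main :: IO ()
main = do
    print (isAlpha 'x')     -- True
    print (isAlpha '2')     -- False: a digit is not alphabetic
    print (isAlphaNum '2')  -- True: but it is alphanumeric
```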

First, let's modify the appropriate part of `tokenize` to look for more than one character:

```haskell
tokenize (c : cs)
    ...
    | isAlpha c = identifier c cs
    ...
```

The new function `identifier` takes the already recognized alphabetic character, plus the rest of the input for further processing. We'll run the input through a helper function, `alnums`, that consumes and aggregates alphanumeric characters. What should we do if there's still more input after the run of alphanumeric characters? We need the function `alnums` to return the leftover input together with the list of recognized characters -- we'll return a pair of lists:

``alnums :: String -> (String, String)``

So here's the implementation of `identifier` using `alnums`:

```haskell
identifier c cs = let (str, cs') = alnums cs in
                  TokIdent (c:str) : tokenize cs'
```

We have to store (and pattern match) the pair returned by `alnums` in temporary variables, because we'll do different things with different components of the pair. Defining local variables (local binding) is done using the `let`/`in` expression. The `let` part defines local variables, and the `in` part is an expression that uses those variables. Variables bound in `let` can be pattern matched. One `let` expression may contain multiple definitions:

```
let (pattern1) = expr1
    (pattern2) = expr2
    var        = expr3
in
    expression
```

It's important to understand that `let` is not a statement -- it's an expression. It has a value: the value defined by the `in` expression.

The visibility of local variables is restricted to the `let`/`in` scope.
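To make this concrete, here's a minimal sketch of `let`/`in` used as an expression, including a pattern-matched binding:

```haskell
main :: IO ()
main = do
    -- the whole let/in is an expression; this one has the value 25
    print (let x = 5 in x * x)
    -- multiple bindings; the first pattern matches against a pair
    print (let (a, b) = (1, 2)
               c = a + b
           in c * 10)  -- prints 30
```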

### Mutual Recursion

Notice what `identifier` does with the unconsumed input, `cs'`. It recursively calls our main `tokenize` function. We could have done the same trick as with `identifier` -- returning both the result and the unprocessed string -- but that would unnecessarily complicate the code for processing other tokens. What I've done instead is to use mutual recursion, where multiple functions recurse into each other.
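As a self-contained illustration of mutual recursion (a classic toy example, unrelated to the tokenizer):

```haskell
-- isEven and isOdd call each other; recursion bottoms out at 0
isEven, isOdd :: Int -> Bool
isEven 0 = True
isEven n = isOdd (n - 1)

isOdd 0 = False
isOdd n = isEven (n - 1)

main :: IO ()
main = print (map isEven [0 .. 4])  -- [True,False,True,False,True]
```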

### Accumulating

The remaining task is to implement the helper function `alnums`. This function should accumulate alphanumeric characters into a list. As usual, we will recurse into the input list, but this time we have to carry along the accumulator -- the list of alphanumerics that were recognized so far. The standard trick is to define an auxiliary recursive function, let's call it `als`, that takes the accumulator along with the rest of the input. To start the recursion, we will pass an empty accumulator to this function.

```haskell
alnums :: String -> (String, String)
alnums str = als "" str
```

Let's first consider the conditions that terminate recursion. One is the end of input, and the other is a non-alphanumeric character. In both cases `als` should return the current accumulator paired with the rest (if any) of the input.

When `als` recognizes an alphanumeric character, `c`, it will hold on to it and immediately call `als` with the rest of the input. This call will return the list of alphanumeric characters, `acc'`, and the unused tail of the input, `cs'`. We prepend the retained character to the front of the accumulator, and pair it with the new tail: `(c:acc', cs')`. Here it is:

```haskell
import Data.Char

alnums :: String -> (String, String)
alnums str = als "" str
  where
    als acc [] = (acc, [])
    als acc (c : cs) | isAlphaNum c =
                         let (acc', cs') = als acc cs
                         in (c:acc', cs')
                     | otherwise = (acc, c:cs)

main = print $ alnums "R2D2+C3Po"
```

Instead of making `als` a top-level function, I introduced a new construct, `where`. Whatever is defined inside the `where` block -- it could be definitions of functions or variables -- is visible to the main body of the function, but not outside of it. Here, the function `als` is defined in the `where` clause. (Conversely, the arguments to the main function are accessible inside the `where` clause, but we're not using this property in this example.)

I could have used `let` instead of `where` to define `als`, but I think it would be less readable. See for yourself:

```haskell
import Data.Char

alnums :: String -> (String, String)
alnums str =
    let
        als acc [] = (acc, [])
        als acc (c : cs) | isAlphaNum c =
                             let (acc', cs') = als acc cs
                             in (c:acc', cs')
                         | otherwise = (acc, c:cs)
    in
        als "" str

main = print $ alnums "R2D2+C3Po"
```

In general, though, the two constructs are not interchangeable: `let`/`in` can be used anywhere an expression is expected, whereas `where` is tied to a function definition. Moreover, if you have a function defined using multiple patterns and guards, the definitions in the `where` clause are visible across all bodies of the function, whereas a `let` is local to each body.
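Here's a small sketch (a hypothetical `classify` function) showing a single `where` binding shared across all guarded bodies:

```haskell
-- 'half' is defined once in the where clause, yet every
-- guard and every body of classify can refer to it
classify :: Int -> String
classify n
    | n < half  = "small"
    | n == half = "middle"
    | otherwise = "big"
    where half = 50

main :: IO ()
main = mapM_ (putStrLn . classify) [10, 50, 90]
```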

### Folding

Traversal with accumulation is also a very common pattern and is encapsulated in `foldl` and `foldr` (fold left and fold right) -- higher order functions defined in the Prelude. For instance, here's the signature of `foldl`:

``foldl :: (a -> b -> a) -> a -> [b] -> a``

`a` is the type of the accumulator and `[b]` is the input list type. `foldl` traverses the list from left to right, calling the function `(a -> b -> a)` with two arguments: the current accumulator and the current element. The function returns the new accumulator, which is then used in the next iteration.
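For instance, summing a list is a one-liner with `foldl` (a sketch; the Prelude of course already provides `sum`):

```haskell
-- the accumulator starts at 0; (+) combines it with each element
total :: [Int] -> Int
total = foldl (+) 0

main :: IO ()
main = print (total [1, 2, 3, 4])  -- prints 10
```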

Ex 4. Use `foldl` to calculate the sum of squares of a list of integers.

```haskell
squares :: [Int] -> Int
squares = foldl (\acc x -> ???) 0

main = print $ squares [3, 4, 5]
```

Ex 5. The accumulator in `foldl` can also be a list. With this in mind, implement function `rev` that reverses a list.

```haskell
rev :: [a] -> [a]
rev = foldl ???

main = print $ rev "spot on"
```

Ex 6. Just as a proof of concept, implement a version of `alnums` using `foldl`, even though it's going to be awkward and inefficient.

```haskell
import Data.Char

alnums :: String -> (String, String)
alnums str = undefined

main = do
    print $ alnums "R2D2+C3Po"
    print $ alnums "a14"
```

## Tokenizing Numbers

The tokenizer for numbers is very similar to the tokenizer for identifiers. As before, we'll use mutual recursion:

```haskell
tokenize (c : cs)
    ...
    | isDigit c = number c cs

number c cs =
    let (digs, cs') = digits cs in
    TokNum (read (c : digs)) : tokenize cs'
```

I used `read` to convert a string to an `Int`.
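`read` is polymorphic in its result type, so the context (or an explicit annotation) must pin the type down. A quick sketch:

```haskell
main :: IO ()
main = do
    -- an annotation tells read which Read instance to use
    print (read "42" :: Int)       -- 42
    -- here the addition forces the result to be numeric
    print (read "123" + 1 :: Int)  -- 124
```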

The function `digits` is the equivalent of `alnums`, except that it gathers digits rather than alphanumeric characters:

```haskell
digits :: String -> (String, String)
digits str = digs "" str
  where
    digs :: String -> String -> (String, String)
    digs acc [] = (acc, [])
    digs acc (c : cs) | isDigit c =
                          let (acc', cs') = digs acc cs
                          in (c:acc', cs')
                      | otherwise = (acc, c:cs)
```

In fact, if you compare the two functions, they only differ in one place: The predicate `isAlphaNum` is replaced by `isDigit`. This clearly calls for refactoring: We should make the predicate an argument to a more general (higher order) function. Let's call this function `span`:

```haskell
import Data.Char
import Prelude hiding (span)
-- show
span :: (a -> Bool) -> [a] -> ([a], [a])
span pred str =
    let -- define a helper function 'spanAcc'
        spanAcc acc [] = (acc, [])
        spanAcc acc (c : cs) | pred c =
                                 let (acc', cs') = spanAcc acc cs
                                 in (c:acc', cs')
                             | otherwise = (acc, c:cs)
    in
        spanAcc [] str

main = print $ span isAlphaNum "R2D2 + C3Po"
```

`span` is so useful that it's included in the Prelude. We'll use it in the final version of the tokenizer, which follows.

## The Tokenizer

```haskell
import Data.Char

data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
    deriving (Show, Eq)

operator :: Char -> Operator
operator c | c == '+' = Plus
           | c == '-' = Minus
           | c == '*' = Times
           | c == '/' = Div

tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
    | elem c "+-*/" = TokOp (operator c) : tokenize cs
    | isDigit c = number c cs
    | isAlpha c = identifier c cs
    | isSpace c = tokenize cs
    | otherwise = error $ "Cannot tokenize " ++ [c]

identifier c cs = let (str, cs') = span isAlphaNum cs in
                  TokIdent (c:str) : tokenize cs'

number c cs =
    let (digs, cs') = span isDigit cs in
    TokNum (read (c : digs)) : tokenize cs'

main = do
    print $ tokenize "12 + 24 / x1"
```

In the next installment we are going to implement the parser. We'll be using the results of the simple exercise below, so try to work on it, or just peek at the solution.

Ex 7. Extend the tokenizer above to recognize more tokens: `LParen` and `RParen` corresponding to `(` and `)`; as well as `TokAssign` for `=`.