6. Tokenizer: Function Types

We haven't gotten too far in our implementation of the symbolic calculator yet, but we've already learned a lot. We know how to work with list, and Strings in particular, and we have defined the Token data type. It's time to start implementing the tokenizer function:

tokenize :: String -> [Token]

Categorizing Characters

An imperative programmer would implement the tokenizer as a loop for processing consecutive characters in the string.

An object-oriented programmer would implement a tokenizer as a stateful object with a getter that returns the current token and an incrementer that moves to the next token, consuming one or more characters from the string in the process.

A functional programmer looks at the tokenizer as a function that picks the first character of the string, categorizes it, turns it into a token, and then tokenizes the rest of the string. (We'll tackle multi-character tokens in the next tutorial.)

The application of the tokenizer to the rest of the string is the standard recursive step in the algorithm.

Here's a very simple tokenizer that recognizes digits and non-digit characters:

import Data.Char

data Token = Digit | Alpha
    deriving (Show, Eq)

tokenize :: String -> [Token]
tokenize (c : rest) = 
    if isDigit c
    then Digit : tokenize rest
    else Alpha : tokenize rest
tokenize [] = []

main = print $ tokenize "passwd123"

I used the library function isDigit. This function is not defined in the Prelude, it's defined in a different library called Data.Char. I had to import it explicitly using the import statement at the top of the file -- somewhat analogous to C's or Java's #include statement.

Data.Char defines several useful functions, like isSpace, isAlpha, isAlphaNum, and many more. You can look them up in Haskell's Hoogle database.

We could have easily implemented isDigit from scratch using the Prelude function elem, which tests whether its first argument is an element of the second argument (which is a list):

isDigit :: Char -> Bool
isDigit c = elem c "0123456789"

main = print $ isDigit '3'

Here we are testing whether c is an element of the list of characters (string) "0123456789".

We could have also implemented the function elem from scratch, except that, up to now, I've been avoiding functions that require more than one argument. That's because I wanted to defer the explanation of currying until you get comfortable with single-argument functions.

Currying

Defining a multi-argument function is easy -- it's the type signature that requires some getting used to.

So, ignoring type signatures for a moment, here's the recursive implementation of isElem:

isElem c (d : rest) = if c == d 
                      then True 
                      else isElem c rest
isElem _ [] = False

main = do
    print $ isElem '3' "abc"
    print $ isElem '3' "123"

Nothing surprising here. You just specify more than one argument, and you can do pattern matching on each of them.

The fun part is that Haskell allows you to call a function using fewer arguments than there are formal parameters in its definition. In our example, it's okay to call isElem with just one argument, say, character '3'. What you get back is not a Bool but something that expects one more argument, a list, to produce a Bool. In other words you get a function [Char]->Bool.

isElem c (d : rest) = if c == d 
                      then True 
                      else isElem c rest
isElem _ [] = False
-- show
is3elem :: [Char] -> Bool
is3elem = isElem '3'

main = print $ is3elem "123"

Let me repeat this, because it's very important: we have curried the two-argument function isElem by providing the first argument, '3'. The result is a function that expects a list of Char (the second argument to isElem). We are storing this curried function in the variable is3elem. We can then call is3elem with a list of Char and get back a Bool.

By the way, we could have defined is3elem using standard function definition syntax:

isElem c (d : rest) = if c == d 
                      then True 
                      else isElem c rest
isElem _ [] = False
-- show
is3elem' :: [Char] -> Bool
is3elem' str = isElem '3' str

main = print $ is3elem' "123"

It almost looks like the first definition was obtained by "dividing" both sides of the second definition by str. After such simplification, the only trace of the [Char] argument is in the signature of is3elem. So if you see a function definition that is missing some arguments that are specified in its signature, you're looking at a curried definition. You'll see a lot of code written in this style, which has its own name: point-free notation. We'll talk more about it in the future.

Back to isElem: What's its type signature? By our reasoning, we can look at it as a function that takes a Char and returns a function ([Char]->Bool). Being able to return a function from a function is one of the perks of "functions being first-class citizens in Haskell." (The others are storing functions in variables, which we've already seen, and passing functions as arguments to other functions, which we'll see soon.)

Indeed, this is a valid signature of the funtion isElem:

isElem :: Char -> ([Char] -> Bool)

However, since the arrow -> is right associative, the parentheses are not necessary and are usually omitted, as in:

isElem :: Char -> [Char] -> Bool

One more observation: isElem, as well as elem, will work not only for Char, but for any type that supports equality comparison. In particular, since we defined Token as deriving Eq, it supports equality comparison and can be used with elem.

data Token = Digit | Alpha
    deriving (Show, Eq)

main = print $ elem Digit [Digit, Alpha]

We'll come back to this when we study type classes.

To summarize: multi-argument functions can always be curried, and this is reflected in the way their type signatures are written. For instance, the signature:

f :: a -> b -> c -> d

tells us that f is a function of three arguments of types a, b, and c, returning d. You can also treat it as a function of two arguments a and b returning a function (c ->d). Or a one-argument function taking a returning a two argument function (b -> c -> d).

Currying is extremely useful and it's a pity that most other languages don't support it out of the box. Multi-argument functions in those languages are written and behave as if they were always taking tuples: In Haskell, the syntax f (x, y, z) would be interpreted as a function taking a tuple of three elements. In fact such function are sometimes called uncurried.

Parentheses and commas in the function syntax impede currying. Scala has special syntax for curriable functions: multiple pairs of parentheses. But you have to anticipate the need for currying when defining the function -- the default doesn't support it.

Tokenizing Operators: Guards

The tokenizer has to recognize operators, identifiers, and numbers. Let's start with operators. We could categorize them using a bunch of nested if/then/else clauses, but that would be awkward. Let's use a different mechanism: guards. Just like you can split a function definition by patterns, you can further specialize it based on more general predicates (Boolean expressions). Let's define a function operator that converts a character to Operator:

data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

operator :: Char -> Operator
operator c | c == '+' = Plus
           | c == '-' = Minus
           | c == '*' = Times
           | c == '/' = Div

main = print $ operator '*'

There are four bodies of the function operator corresponding to four different guards (a body of a function starts after the equal sign). For example, the guard c == '+' corresponds to the body Plus, etc. Guards are placed after vertical bars. They are tested in order of appearance.

What happens when no guard is satisfied? You can try it by calling operator with the "wrong" character. You'll get a runtime error PatternMatchFail and the program will abort. This might be good enough if all you need is a bona fide assertion. In general, these kinds of non-exhaustive patterns are to be avoided. You can always end your list of guards with otherwise (which is syntactic sugar for True) and provide a default body. You'll see an example of this later.

Single Character Tokenizer

Before we implement full blown tokenizers for numbers and identifiers, let's first tackle a simplified problem. Let's restrict numbers to single digits, and identifiers to single characters. This way we'll only need to process one character at a time and we can use simple recursion. Here's our recursive skeleton:

tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) = ... : tokenize cs

We'll have to categorize the current character, convert it to a token, and then tokenize the rest of the string (here I'm pattern matching the string as a list of characters). The result is the current token prepended to the rest of tokens.

We'll do categorization using guards. Let's start with operators:

tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) 
    | elem c "+-*/" = TokOp (operator c) : tokenize cs
    | otherwise  = error $ "Cannot tokenize " ++ [c]

The guard checks if c is an element of the list of four characters "+-*/". If it is, it constructs a Token using the TokOp constructor, passing it the result of the call to operator c (the function we defined earlier). This token is combined using : with the list returned by the recursive call, tokenize cs.

I also added the catch all guard that calls error. error is a function that takes a String, displays it, and aborts the program. (In order to satisfy the type checker, error is polymorphic in its return type. In this example, the type checker expects a list of Tokens; so that's the type it will pick for the return type of error.)

I constructed the string by concatenating "Cannot tokenize " with a single character string [c]. I couldn't use c directly, because the concatenation operator ++ expects a list of Char, not a Char. So I created a one-element list [c].

The tokenization of one-character numbers and identifiers is pretty simple:

| isDigit c  = TokNum (digitToInt c) : tokenize cs
| isAlpha c  = TokIdent [c]          : tokenize cs

Here, I converted a digit to an integer using digitToInt, and a single character to a string using [c].

Finally, our tokenizer should also be able to discard white space between tokens:

| isSpace c  = tokenize cs

We are ready to test our first tokenizer:

import Data.Char

data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
    deriving (Show, Eq)

operator :: Char -> Operator
operator c | c == '+' = Plus
           | c == '-' = Minus
           | c == '*' = Times
           | c == '/' = Div

tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) 
    | elem c "+-*/" = TokOp (operator c) : tokenize cs
    | isDigit c  = TokNum (digitToInt c) : tokenize cs
    | isAlpha c  = TokIdent [c]          : tokenize cs
    | isSpace c  = tokenize cs
    | otherwise  = error $ "Cannot tokenize " ++ [c]

main = print $ tokenize " 1 + 4 / x "

Next time we'll work on tokenizing multi-character numbers and identifiers and learn about mutual recursion.

Ex 1. Rewrite the implementation of Fibonacci numbers using guards instead of the if statement (it should become much closer to the mathematical definition):

-- Old definition:
-- fib n = if n > 2 then fib (n - 1) + fib (n - 2) else 1
fib n | n == 1    = ...
      | ...       = ...
      | otherwise = ...

main = print (fib 20)

fib :: Int -> Int
fib n | n == 1    = 1
      | n == 2    = 1
      | otherwise = fib (n-1) + fib (n-2)

main = print (fib 20)

Ex 2. Implement function cat that concatenates two lists.

cat :: [a] -> [a] -> [a]
cat = undefined

main = putStrLn $ cat "Hello " "World!"

cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys

main = putStrLn $ cat "Hello " "World!"

Ex 3. Use cat from previous exercise and currying to define a function pig that prepends "pig" to any string.

cat :: [a] -> [a] -> [a]
cat = undefined

pig :: String -> String
pig = undefined

main = putStrLn $ pig "sty"

cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys

pig :: String -> String
pig = cat "pig"

main = putStrLn $ pig "sty"

Ex 4. Implement function toInts that takes a number in the form of a string and returns a list of its digits as integers.

import Data.Char

toInts :: String -> [Int]
toInts = undefined

main = print $ toInts "2013"

import Data.Char

toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs

main = print $ toInts "2013"

Ex 5. Implement function sumDig that takes a number in the form of a string and calculates the sum of its digits. Make use of the function from the previous exercise.

import Data.Char

toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs

sumDig :: String -> Int
sumDig = undefined

main = print $ sumDig "30750"

Define an auxiliary recursive function acc that takes the sum-so-far and the remaining digits, and returns the total. Call it with appropriate arguments.

acc :: Int -> [Int] -> Int

import Data.Char

toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs

sumDig :: String -> Int
sumDig str = acc 0 (toInts str)

acc :: Int -> [Int] -> Int
acc a [] = a
acc a (i:is) = acc (a + i) is

main = print $ sumDig "30750"

Sections

Categorizing Characters

Currying

Tokenizing Operators: Guards

Single Character Tokenizer