We haven't gotten too far in our implementation of the symbolic calculator yet, but we've already learned a lot. We know how to work with list, and `String`s in particular, and we have defined the `Token` data type. It's time to start implementing the tokenizer function:
``` haskell
tokenize :: String -> [Token]
```
## Categorizing Characters
An imperative programmer would implement the tokenizer as a loop for processing consecutive characters in the string.
An object-oriented programmer would implement a tokenizer as a stateful object with a getter that returns the current token and an incrementer that moves to the next token, consuming one or more characters from the string in the process.
A functional programmer looks at the tokenizer as a function that picks the first character of the string, categorizes it, turns it into a token, and then tokenizes the rest of the string. (We'll tackle multi-character tokens in the next tutorial.)
The application of the tokenizer to the rest of the string is the standard recursive step in the algorithm.
Here's a very simple tokenizer that recognizes digits and non-digit characters:
``` active haskell
import Data.Char
data Token = Digit | Alpha
deriving (Show, Eq)
tokenize :: String -> [Token]
tokenize (c : rest) =
if isDigit c
then Digit : tokenize rest
else Alpha : tokenize rest
tokenize [] = []
main = print $ tokenize "passwd123"
```
I used the library function `isDigit`. This function is not defined in the Prelude, it's defined in a different library called `Data.Char`. I had to import it explicitly using the `import` statement at the top of the file -- somewhat analogous to C's or Java's `#include` statement.
`Data.Char` defines several useful functions, like `isSpace`, `isAlpha`, `isAlphaNum`, and many more. You can look them up in Haskell's [Hoogle database](http://hackage.haskell.org/package/base-4.6.0.1/docs/Data-Char.html).
We could have easily implemented `isDigit` from scratch using the Prelude function `elem`, which tests whether its first argument is an element of the second argument (which is a list):
``` active haskell
isDigit :: Char -> Bool
isDigit c = elem c "0123456789"
main = print $ isDigit '3'
```
Here we are testing whether `c` is an element of the list of characters (string) "0123456789".
We could have also implemented the function `elem` from scratch, except that, up to now, I've been avoiding functions that require more than one argument. That's because I wanted to defer the explanation of currying until you get comfortable with single-argument functions.
## Currying
Defining a multi-argument function is easy -- it's the type signature that requires some getting used to.
So, ignoring type signatures for a moment, here's the recursive implementation of `isElem`:
``` active haskell
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
main = do
print $ isElem '3' "abc"
print $ isElem '3' "123"
```
Nothing surprising here. You just specify more than one argument, and you can do pattern matching on each of them.
The fun part is that Haskell allows you to call a function using fewer arguments than there are formal parameters in its definition. In our example, it's okay to call `isElem` with just one argument, say, character `'3'`. What you get back is not a `Bool` but something that expects one more argument, a list, to produce a `Bool`. In other words you get a function `[Char]->Bool`.
``` active haskell
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
-- show
is3elem :: [Char] -> Bool
is3elem = isElem '3'
main = print $ is3elem "123"
```
Let me repeat this, because it's very important: we have curried the two-argument function `isElem` by providing the first argument, `'3'`. The result is a function that expects a list of `Char` (the second argument to `isElem`). We are storing this curried function in the variable `is3elem`. We can then call `is3elem` with a list of `Char` and get back a `Bool`.
By the way, we could have defined `is3elem` using standard function definition syntax:
``` active haskell
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
-- show
is3elem' :: [Char] -> Bool
is3elem' str = isElem '3' str
main = print $ is3elem' "123"
```
It almost looks like the first definition was obtained by "dividing" both sides of the second definition by `str`. After such simplification, the only trace of the `[Char]` argument is in the signature of `is3elem`. So if you see a function definition that is missing some arguments that are specified in its signature, you're looking at a curried definition. You'll see a lot of code written in this style, which has its own name: *point-free notation*. We'll talk more about it in the future.
Back to `isElem`: What's its type signature? By our reasoning, we can look at it as a function that takes a `Char` and returns a function `([Char]->Bool)`. Being able to return a function from a function is one of the perks of "functions being first-class citizens in Haskell." (The others are storing functions in variables, which we've already seen, and passing functions as arguments to other functions, which we'll see soon.)
Indeed, this is a valid signature of the funtion `isElem`:
``` haskell
isElem :: Char -> ([Char] -> Bool)
```
However, since the arrow `->` is right associative, the parentheses are not necessary and are usually omitted, as in:
``` haskell
isElem :: Char -> [Char] -> Bool
```
One more observation: `isElem`, as well as `elem`, will work not only for `Char`, but for any type that supports equality comparison. In particular, since we defined `Token` as `deriving Eq`, it supports equality comparison and can be used with `elem`.
``` active haskell
data Token = Digit | Alpha
deriving (Show, Eq)
main = print $ elem Digit [Digit, Alpha]
```
We'll come back to this when we study type classes.
To summarize: multi-argument functions can always be curried, and this is reflected in the way their type signatures are written. For instance, the signature:
``` haskell
f :: a -> b -> c -> d
```
tells us that `f` is a function of three arguments of types `a`, `b`, and `c`, returning `d`. You can also treat it as a function of two arguments `a` and `b` returning a function `(c ->d)`. Or a one-argument function taking `a` returning a two argument function `(b -> c -> d)`.
Currying is extremely useful and it's a pity that most other languages don't support it out of the box. Multi-argument functions in those languages are written and behave as if they were always taking tuples: In Haskell, the syntax `f (x, y, z)` would be interpreted as a function taking a tuple of three elements. In fact such function are sometimes called *uncurried*.
Parentheses and commas in the function syntax impede currying. Scala has special syntax for curriable functions: multiple pairs of parentheses. But you have to anticipate the need for currying when defining the function -- the default doesn't support it.
## Tokenizing Operators: Guards
The tokenizer has to recognize operators, identifiers, and numbers. Let's start with operators. We could categorize them using a bunch of nested if/then/else clauses, but that would be awkward. Let's use a different mechanism: guards. Just like you can split a function definition by patterns, you can further specialize it based on more general predicates (Boolean expressions). Let's define a function `operator` that converts a character to `Operator`:
``` active haskell
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
main = print $ operator '*'
```
There are four bodies of the function `operator` corresponding to four different guards (a body of a function starts after the equal sign). For example, the guard `c == '+'` corresponds to the body `Plus`, etc. Guards are placed after vertical bars. They are tested in order of appearance.
What happens when no guard is satisfied? You can try it by calling `operator` with the "wrong" character. You'll get a runtime error `PatternMatchFail` and the program will abort. This might be good enough if all you need is a bona fide assertion. In general, these kinds of non-exhaustive patterns are to be avoided. You can always end your list of guards with `otherwise` (which is syntactic sugar for `True`) and provide a default body. You'll see an example of this later.
## Single Character Tokenizer
Before we implement full blown tokenizers for numbers and identifiers, let's first tackle a simplified problem. Let's restrict numbers to single digits, and identifiers to single characters. This way we'll only need to process one character at a time and we can use simple recursion. Here's our recursive skeleton:
``` haskell
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) = ... : tokenize cs
```
We'll have to categorize the current character, convert it to a token, and then tokenize the rest of the string (here I'm pattern matching the string as a list of characters). The result is the current token prepended to the rest of tokens.
We'll do categorization using guards. Let's start with operators:
``` haskell
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
```
The guard checks if `c` is an element of the list of four characters `"+-*/"`. If it is, it constructs a `Token` using the `TokOp` constructor, passing it the result of the call to `operator c` (the function we defined earlier). This token is combined using `:` with the list returned by the recursive call, `tokenize cs`.
I also added the catch all guard that calls `error`. `error` is a function that takes a `String`, displays it, and aborts the program. (In order to satisfy the type checker, `error` is polymorphic in its return type. In this example, the type checker expects a list of `Token`s; so that's the type it will pick for the return type of `error`.)
I constructed the string by concatenating `"Cannot tokenize "` with a single character string `[c]`. I couldn't use `c` directly, because the concatenation operator `++` expects a list of `Char`, not a `Char`. So I created a one-element list `[c]`.
The tokenization of one-character numbers and identifiers is pretty simple:
``` haskell
| isDigit c = TokNum (digitToInt c) : tokenize cs
| isAlpha c = TokIdent [c] : tokenize cs
```
Here, I converted a digit to an integer using `digitToInt`, and a single character to a string using `[c]`.
Finally, our tokenizer should also be able to discard white space between tokens:
``` haskell
| isSpace c = tokenize cs
```
We are ready to test our first tokenizer:
``` active haskell
import Data.Char
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
data Token = TokOp Operator
| TokIdent String
| TokNum Int
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| isDigit c = TokNum (digitToInt c) : tokenize cs
| isAlpha c = TokIdent [c] : tokenize cs
| isSpace c = tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
main = print $ tokenize " 1 + 4 / x "
```
Next time we'll work on tokenizing multi-character numbers and identifiers and learn about mutual recursion.
**Ex 1.** Rewrite the implementation of Fibonacci numbers using guards instead of the `if` statement (it should become much closer to the mathematical definition):
``` active haskell
-- Old definition:
-- fib n = if n > 2 then fib (n - 1) + fib (n - 2) else 1
fib n | n == 1 = ...
| ... = ...
| otherwise = ...
main = print (fib 20)
```
@@@ Show solution
``` active haskell
fib :: Int -> Int
fib n | n == 1 = 1
| n == 2 = 1
| otherwise = fib (n-1) + fib (n-2)
main = print (fib 20)
```
@@@
**Ex 2.** Implement function `cat` that concatenates two lists.
``` active haskell
cat :: [a] -> [a] -> [a]
cat = undefined
main = putStrLn $ cat "Hello " "World!"
```
@@@ Show hint
You want to create a list whose first element is the first element of the first list (if any) followed by *the rest of the first list concatenated with the second list*.
@@@
@@@ Show solution
``` active haskell
cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys
main = putStrLn $ cat "Hello " "World!"
```
@@@
**Ex 3.** Use `cat` from previous exercise and currying to define a function `pig` that prepends "pig" to any string.
``` active haskell
cat :: [a] -> [a] -> [a]
cat = undefined
pig :: String -> String
pig = undefined
main = putStrLn $ pig "sty"
```
@@@ Show solution
``` active haskell
cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys
pig :: String -> String
pig = cat "pig"
main = putStrLn $ pig "sty"
```
@@@
**Ex 4.** Implement function `toInts` that takes a number in the form of a string and returns a list of its digits as integers.
``` active haskell
import Data.Char
toInts :: String -> [Int]
toInts = undefined
main = print $ toInts "2013"
```
@@@ Show solution
``` active haskell
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
main = print $ toInts "2013"
```
@@@
**Ex 5.** Implement function `sumDig` that takes a number in the form of a string and calculates the sum of its digits. Make use of the function from the previous exercise.
``` active haskell
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
sumDig :: String -> Int
sumDig = undefined
main = print $ sumDig "30750"
```
@@@ Show hint
Define an auxiliary recursive function `acc` that takes the sum-so-far and the remaining digits, and returns the total. Call it with appropriate arguments.
``` haskell
acc :: Int -> [Int] -> Int
```
@@@
@@@ Show solution
``` active haskell
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
sumDig :: String -> Int
sumDig str = acc 0 (toInts str)
acc :: Int -> [Int] -> Int
acc a [] = a
acc a (i:is) = acc (a + i) is
main = print $ sumDig "30750"
```
@@@