We haven't gotten too far in our implementation of the symbolic calculator yet, but we've already learned a lot. We know how to work with list, and `String`s in particular, and we have defined the `Token` data type. It's time to start implementing the tokenizer function: ``` haskell tokenize :: String -> [Token] ``` ## Categorizing Characters An imperative programmer would implement the tokenizer as a loop for processing consecutive characters in the string. An object-oriented programmer would implement a tokenizer as a stateful object with a getter that returns the current token and an incrementer that moves to the next token, consuming one or more characters from the string in the process. A functional programmer looks at the tokenizer as a function that picks the first character of the string, categorizes it, turns it into a token, and then tokenizes the rest of the string. (We'll tackle multi-character tokens in the next tutorial.) The application of the tokenizer to the rest of the string is the standard recursive step in the algorithm. Here's a very simple tokenizer that recognizes digits and non-digit characters: ``` active haskell import Data.Char data Token = Digit | Alpha deriving (Show, Eq) tokenize :: String -> [Token] tokenize (c : rest) = if isDigit c then Digit : tokenize rest else Alpha : tokenize rest tokenize [] = [] main = print $ tokenize "passwd123" ``` I used the library function `isDigit`. This function is not defined in the Prelude, it's defined in a different library called `Data.Char`. I had to import it explicitly using the `import` statement at the top of the file -- somewhat analogous to C's or Java's `#include` statement. `Data.Char` defines several useful functions, like `isSpace`, `isAlpha`, `isAlphaNum`, and many more. You can look them up in Haskell's [Hoogle database](http://hackage.haskell.org/package/base-4.6.0.1/docs/Data-Char.html). We could have easily implemented `isDigit` from scratch using the Prelude function `elem`, which tests whether its first argument is an element of the second argument (which is a list): ``` active haskell isDigit :: Char -> Bool isDigit c = elem c "0123456789" main = print $ isDigit '3' ``` Here we are testing whether `c` is an element of the list of characters (string) "0123456789". We could have also implemented the function `elem` from scratch, except that, up to now, I've been avoiding functions that require more than one argument. That's because I wanted to defer the explanation of currying until you get comfortable with single-argument functions. ## Currying Defining a multi-argument function is easy -- it's the type signature that requires some getting used to. So, ignoring type signatures for a moment, here's the recursive implementation of `isElem`: ``` active haskell isElem c (d : rest) = if c == d then True else isElem c rest isElem _ [] = False main = do print $ isElem '3' "abc" print $ isElem '3' "123" ``` Nothing surprising here. You just specify more than one argument, and you can do pattern matching on each of them. The fun part is that Haskell allows you to call a function using fewer arguments than there are formal parameters in its definition. In our example, it's okay to call `isElem` with just one argument, say, character `'3'`. What you get back is not a `Bool` but something that expects one more argument, a list, to produce a `Bool`. In other words you get a function `[Char]->Bool`. ``` active haskell isElem c (d : rest) = if c == d then True else isElem c rest isElem _ [] = False -- show is3elem :: [Char] -> Bool is3elem = isElem '3' main = print $ is3elem "123" ``` Let me repeat this, because it's very important: we have curried the two-argument function `isElem` by providing the first argument, `'3'`. The result is a function that expects a list of `Char` (the second argument to `isElem`). We are storing this curried function in the variable `is3elem`. We can then call `is3elem` with a list of `Char` and get back a `Bool`. By the way, we could have defined `is3elem` using standard function definition syntax: ``` active haskell isElem c (d : rest) = if c == d then True else isElem c rest isElem _ [] = False -- show is3elem' :: [Char] -> Bool is3elem' str = isElem '3' str main = print $ is3elem' "123" ``` It almost looks like the first definition was obtained by "dividing" both sides of the second definition by `str`. After such simplification, the only trace of the `[Char]` argument is in the signature of `is3elem`. So if you see a function definition that is missing some arguments that are specified in its signature, you're looking at a curried definition. You'll see a lot of code written in this style, which has its own name: *point-free notation*. We'll talk more about it in the future. Back to `isElem`: What's its type signature? By our reasoning, we can look at it as a function that takes a `Char` and returns a function `([Char]->Bool)`. Being able to return a function from a function is one of the perks of "functions being first-class citizens in Haskell." (The others are storing functions in variables, which we've already seen, and passing functions as arguments to other functions, which we'll see soon.) Indeed, this is a valid signature of the funtion `isElem`: ``` haskell isElem :: Char -> ([Char] -> Bool) ``` However, since the arrow `->` is right associative, the parentheses are not necessary and are usually omitted, as in: ``` haskell isElem :: Char -> [Char] -> Bool ``` One more observation: `isElem`, as well as `elem`, will work not only for `Char`, but for any type that supports equality comparison. In particular, since we defined `Token` as `deriving Eq`, it supports equality comparison and can be used with `elem`. ``` active haskell data Token = Digit | Alpha deriving (Show, Eq) main = print $ elem Digit [Digit, Alpha] ``` We'll come back to this when we study type classes. To summarize: multi-argument functions can always be curried, and this is reflected in the way their type signatures are written. For instance, the signature: ``` haskell f :: a -> b -> c -> d ``` tells us that `f` is a function of three arguments of types `a`, `b`, and `c`, returning `d`. You can also treat it as a function of two arguments `a` and `b` returning a function `(c ->d)`. Or a one-argument function taking `a` returning a two argument function `(b -> c -> d)`. Currying is extremely useful and it's a pity that most other languages don't support it out of the box. Multi-argument functions in those languages are written and behave as if they were always taking tuples: In Haskell, the syntax `f (x, y, z)` would be interpreted as a function taking a tuple of three elements. In fact such function are sometimes called *uncurried*. Parentheses and commas in the function syntax impede currying. Scala has special syntax for curriable functions: multiple pairs of parentheses. But you have to anticipate the need for currying when defining the function -- the default doesn't support it. ## Tokenizing Operators: Guards The tokenizer has to recognize operators, identifiers, and numbers. Let's start with operators. We could categorize them using a bunch of nested if/then/else clauses, but that would be awkward. Let's use a different mechanism: guards. Just like you can split a function definition by patterns, you can further specialize it based on more general predicates (Boolean expressions). Let's define a function `operator` that converts a character to `Operator`: ``` active haskell data Operator = Plus | Minus | Times | Div deriving (Show, Eq) operator :: Char -> Operator operator c | c == '+' = Plus | c == '-' = Minus | c == '*' = Times | c == '/' = Div main = print $ operator '*' ``` There are four bodies of the function `operator` corresponding to four different guards (a body of a function starts after the equal sign). For example, the guard `c == '+'` corresponds to the body `Plus`, etc. Guards are placed after vertical bars. They are tested in order of appearance. What happens when no guard is satisfied? You can try it by calling `operator` with the "wrong" character. You'll get a runtime error `PatternMatchFail` and the program will abort. This might be good enough if all you need is a bona fide assertion. In general, these kinds of non-exhaustive patterns are to be avoided. You can always end your list of guards with `otherwise` (which is syntactic sugar for `True`) and provide a default body. You'll see an example of this later. ## Single Character Tokenizer Before we implement full blown tokenizers for numbers and identifiers, let's first tackle a simplified problem. Let's restrict numbers to single digits, and identifiers to single characters. This way we'll only need to process one character at a time and we can use simple recursion. Here's our recursive skeleton: ``` haskell tokenize :: String -> [Token] tokenize [] = [] tokenize (c : cs) = ... : tokenize cs ``` We'll have to categorize the current character, convert it to a token, and then tokenize the rest of the string (here I'm pattern matching the string as a list of characters). The result is the current token prepended to the rest of tokens. We'll do categorization using guards. Let's start with operators: ``` haskell tokenize :: String -> [Token] tokenize [] = [] tokenize (c : cs) | elem c "+-*/" = TokOp (operator c) : tokenize cs | otherwise = error $ "Cannot tokenize " ++ [c] ``` The guard checks if `c` is an element of the list of four characters `"+-*/"`. If it is, it constructs a `Token` using the `TokOp` constructor, passing it the result of the call to `operator c` (the function we defined earlier). This token is combined using `:` with the list returned by the recursive call, `tokenize cs`. I also added the catch all guard that calls `error`. `error` is a function that takes a `String`, displays it, and aborts the program. (In order to satisfy the type checker, `error` is polymorphic in its return type. In this example, the type checker expects a list of `Token`s; so that's the type it will pick for the return type of `error`.) I constructed the string by concatenating `"Cannot tokenize "` with a single character string `[c]`. I couldn't use `c` directly, because the concatenation operator `++` expects a list of `Char`, not a `Char`. So I created a one-element list `[c]`. The tokenization of one-character numbers and identifiers is pretty simple: ``` haskell | isDigit c = TokNum (digitToInt c) : tokenize cs | isAlpha c = TokIdent [c] : tokenize cs ``` Here, I converted a digit to an integer using `digitToInt`, and a single character to a string using `[c]`. Finally, our tokenizer should also be able to discard white space between tokens: ``` haskell | isSpace c = tokenize cs ``` We are ready to test our first tokenizer: ``` active haskell import Data.Char data Operator = Plus | Minus | Times | Div deriving (Show, Eq) data Token = TokOp Operator | TokIdent String | TokNum Int deriving (Show, Eq) operator :: Char -> Operator operator c | c == '+' = Plus | c == '-' = Minus | c == '*' = Times | c == '/' = Div tokenize :: String -> [Token] tokenize [] = [] tokenize (c : cs) | elem c "+-*/" = TokOp (operator c) : tokenize cs | isDigit c = TokNum (digitToInt c) : tokenize cs | isAlpha c = TokIdent [c] : tokenize cs | isSpace c = tokenize cs | otherwise = error $ "Cannot tokenize " ++ [c] main = print $ tokenize " 1 + 4 / x " ``` Next time we'll work on tokenizing multi-character numbers and identifiers and learn about mutual recursion. **Ex 1.** Rewrite the implementation of Fibonacci numbers using guards instead of the `if` statement (it should become much closer to the mathematical definition): ``` active haskell -- Old definition: -- fib n = if n > 2 then fib (n - 1) + fib (n - 2) else 1 fib n | n == 1 = ... | ... = ... | otherwise = ... main = print (fib 20) ``` @@@ Show solution ``` active haskell fib :: Int -> Int fib n | n == 1 = 1 | n == 2 = 1 | otherwise = fib (n-1) + fib (n-2) main = print (fib 20) ``` @@@ **Ex 2.** Implement function `cat` that concatenates two lists. ``` active haskell cat :: [a] -> [a] -> [a] cat = undefined main = putStrLn $ cat "Hello " "World!" ``` @@@ Show hint You want to create a list whose first element is the first element of the first list (if any) followed by *the rest of the first list concatenated with the second list*. @@@ @@@ Show solution ``` active haskell cat :: [a] -> [a] -> [a] cat [] ys = ys cat (x : xs) ys = x : cat xs ys main = putStrLn $ cat "Hello " "World!" ``` @@@ **Ex 3.** Use `cat` from previous exercise and currying to define a function `pig` that prepends "pig" to any string. ``` active haskell cat :: [a] -> [a] -> [a] cat = undefined pig :: String -> String pig = undefined main = putStrLn $ pig "sty" ``` @@@ Show solution ``` active haskell cat :: [a] -> [a] -> [a] cat [] ys = ys cat (x : xs) ys = x : cat xs ys pig :: String -> String pig = cat "pig" main = putStrLn $ pig "sty" ``` @@@ **Ex 4.** Implement function `toInts` that takes a number in the form of a string and returns a list of its digits as integers. ``` active haskell import Data.Char toInts :: String -> [Int] toInts = undefined main = print $ toInts "2013" ``` @@@ Show solution ``` active haskell import Data.Char toInts :: String -> [Int] toInts [] = [] toInts (c : cs) = digitToInt c : toInts cs main = print $ toInts "2013" ``` @@@ **Ex 5.** Implement function `sumDig` that takes a number in the form of a string and calculates the sum of its digits. Make use of the function from the previous exercise. ``` active haskell import Data.Char toInts :: String -> [Int] toInts [] = [] toInts (c : cs) = digitToInt c : toInts cs sumDig :: String -> Int sumDig = undefined main = print $ sumDig "30750" ``` @@@ Show hint Define an auxiliary recursive function `acc` that takes the sum-so-far and the remaining digits, and returns the total. Call it with appropriate arguments. ``` haskell acc :: Int -> [Int] -> Int ``` @@@ @@@ Show solution ``` active haskell import Data.Char toInts :: String -> [Int] toInts [] = [] toInts (c : cs) = digitToInt c : toInts cs sumDig :: String -> Int sumDig str = acc 0 (toInts str) acc :: Int -> [Int] -> Int acc a [] = a acc a (i:is) = acc (a + i) is main = print $ sumDig "30750" ``` @@@