We haven't gotten too far in our implementation of the symbolic calculator yet, but we've already learned a lot. We know how to work with list, and
Strings in particular, and we have defined the
Token data type. It's time to start implementing the tokenizer function:
tokenize :: String -> [Token]
An imperative programmer would implement the tokenizer as a loop for processing consecutive characters in the string.
An object-oriented programmer would implement a tokenizer as a stateful object with a getter that returns the current token and an incrementer that moves to the next token, consuming one or more characters from the string in the process.
A functional programmer looks at the tokenizer as a function that picks the first character of the string, categorizes it, turns it into a token, and then tokenizes the rest of the string. (We'll tackle multi-character tokens in the next tutorial.)
The application of the tokenizer to the rest of the string is the standard recursive step in the algorithm.
Here's a very simple tokenizer that recognizes digits and non-digit characters:
import Data.Char data Token = Digit | Alpha deriving (Show, Eq) tokenize :: String -> [Token] tokenize (c : rest) = if isDigit c then Digit : tokenize rest else Alpha : tokenize rest tokenize  =  main = print $ tokenize "passwd123"
I used the library function
isDigit. This function is not defined in the Prelude, it's defined in a different library called
Data.Char. I had to import it explicitly using the
import statement at the top of the file -- somewhat analogous to C's or Java's
Data.Char defines several useful functions, like
isAlphaNum, and many more. You can look them up in Haskell's Hoogle database.
We could have easily implemented
isDigit from scratch using the Prelude function
elem, which tests whether its first argument is an element of the second argument (which is a list):
isDigit :: Char -> Bool isDigit c = elem c "0123456789" main = print $ isDigit '3'
Here we are testing whether
c is an element of the list of characters (string) "0123456789".
We could have also implemented the function
elem from scratch, except that, up to now, I've been avoiding functions that require more than one argument. That's because I wanted to defer the explanation of currying until you get comfortable with single-argument functions.
Defining a multi-argument function is easy -- it's the type signature that requires some getting used to.
So, ignoring type signatures for a moment, here's the recursive implementation of
isElem c (d : rest) = if c == d then True else isElem c rest isElem _  = False main = do print $ isElem '3' "abc" print $ isElem '3' "123"
Nothing surprising here. You just specify more than one argument, and you can do pattern matching on each of them.
The fun part is that Haskell allows you to call a function using fewer arguments than there are formal parameters in its definition. In our example, it's okay to call
isElem with just one argument, say, character
'3'. What you get back is not a
Bool but something that expects one more argument, a list, to produce a
Bool. In other words you get a function
isElem c (d : rest) = if c == d then True else isElem c rest isElem _  = False -- show is3elem :: [Char] -> Bool is3elem = isElem '3' main = print $ is3elem "123"
Let me repeat this, because it's very important: we have curried the two-argument function
isElem by providing the first argument,
'3'. The result is a function that expects a list of
Char (the second argument to
isElem). We are storing this curried function in the variable
is3elem. We can then call
is3elem with a list of
Char and get back a
By the way, we could have defined
is3elem using standard function definition syntax:
isElem c (d : rest) = if c == d then True else isElem c rest isElem _  = False -- show is3elem' :: [Char] -> Bool is3elem' str = isElem '3' str main = print $ is3elem' "123"
It almost looks like the first definition was obtained by "dividing" both sides of the second definition by
str. After such simplification, the only trace of the
[Char] argument is in the signature of
is3elem. So if you see a function definition that is missing some arguments that are specified in its signature, you're looking at a curried definition. You'll see a lot of code written in this style, which has its own name: point-free notation. We'll talk more about it in the future.
isElem: What's its type signature? By our reasoning, we can look at it as a function that takes a
Char and returns a function
([Char]->Bool). Being able to return a function from a function is one of the perks of "functions being first-class citizens in Haskell." (The others are storing functions in variables, which we've already seen, and passing functions as arguments to other functions, which we'll see soon.)
Indeed, this is a valid signature of the funtion
isElem :: Char -> ([Char] -> Bool)
However, since the arrow
-> is right associative, the parentheses are not necessary and are usually omitted, as in:
isElem :: Char -> [Char] -> Bool
One more observation:
isElem, as well as
elem, will work not only for
Char, but for any type that supports equality comparison. In particular, since we defined
deriving Eq, it supports equality comparison and can be used with
data Token = Digit | Alpha deriving (Show, Eq) main = print $ elem Digit [Digit, Alpha]
We'll come back to this when we study type classes.
To summarize: multi-argument functions can always be curried, and this is reflected in the way their type signatures are written. For instance, the signature:
f :: a -> b -> c -> d
tells us that
f is a function of three arguments of types
d. You can also treat it as a function of two arguments
b returning a function
(c ->d). Or a one-argument function taking
a returning a two argument function
(b -> c -> d).
Currying is extremely useful and it's a pity that most other languages don't support it out of the box. Multi-argument functions in those languages are written and behave as if they were always taking tuples: In Haskell, the syntax
f (x, y, z) would be interpreted as a function taking a tuple of three elements. In fact such function are sometimes called uncurried.
Parentheses and commas in the function syntax impede currying. Scala has special syntax for curriable functions: multiple pairs of parentheses. But you have to anticipate the need for currying when defining the function -- the default doesn't support it.
The tokenizer has to recognize operators, identifiers, and numbers. Let's start with operators. We could categorize them using a bunch of nested if/then/else clauses, but that would be awkward. Let's use a different mechanism: guards. Just like you can split a function definition by patterns, you can further specialize it based on more general predicates (Boolean expressions). Let's define a function
operator that converts a character to
data Operator = Plus | Minus | Times | Div deriving (Show, Eq) operator :: Char -> Operator operator c | c == '+' = Plus | c == '-' = Minus | c == '*' = Times | c == '/' = Div main = print $ operator '*'
There are four bodies of the function
operator corresponding to four different guards (a body of a function starts after the equal sign). For example, the guard
c == '+' corresponds to the body
Plus, etc. Guards are placed after vertical bars. They are tested in order of appearance.
What happens when no guard is satisfied? You can try it by calling
operator with the "wrong" character. You'll get a runtime error
PatternMatchFail and the program will abort. This might be good enough if all you need is a bona fide assertion. In general, these kinds of non-exhaustive patterns are to be avoided. You can always end your list of guards with
otherwise (which is syntactic sugar for
True) and provide a default body. You'll see an example of this later.
Before we implement full blown tokenizers for numbers and identifiers, let's first tackle a simplified problem. Let's restrict numbers to single digits, and identifiers to single characters. This way we'll only need to process one character at a time and we can use simple recursion. Here's our recursive skeleton:
tokenize :: String -> [Token] tokenize  =  tokenize (c : cs) = ... : tokenize cs
We'll have to categorize the current character, convert it to a token, and then tokenize the rest of the string (here I'm pattern matching the string as a list of characters). The result is the current token prepended to the rest of tokens.
We'll do categorization using guards. Let's start with operators:
tokenize :: String -> [Token] tokenize  =  tokenize (c : cs) | elem c "+-*/" = TokOp (operator c) : tokenize cs | otherwise = error $ "Cannot tokenize " ++ [c]
The guard checks if
c is an element of the list of four characters
"+-*/". If it is, it constructs a
Token using the
TokOp constructor, passing it the result of the call to
operator c (the function we defined earlier). This token is combined using
: with the list returned by the recursive call,
I also added the catch all guard that calls
error is a function that takes a
String, displays it, and aborts the program. (In order to satisfy the type checker,
error is polymorphic in its return type. In this example, the type checker expects a list of
Tokens; so that's the type it will pick for the return type of
I constructed the string by concatenating
"Cannot tokenize " with a single character string
[c]. I couldn't use
c directly, because the concatenation operator
++ expects a list of
Char, not a
Char. So I created a one-element list
The tokenization of one-character numbers and identifiers is pretty simple:
| isDigit c = TokNum (digitToInt c) : tokenize cs | isAlpha c = TokIdent [c] : tokenize cs
Here, I converted a digit to an integer using
digitToInt, and a single character to a string using
Finally, our tokenizer should also be able to discard white space between tokens:
| isSpace c = tokenize cs
We are ready to test our first tokenizer:
import Data.Char data Operator = Plus | Minus | Times | Div deriving (Show, Eq) data Token = TokOp Operator | TokIdent String | TokNum Int deriving (Show, Eq) operator :: Char -> Operator operator c | c == '+' = Plus | c == '-' = Minus | c == '*' = Times | c == '/' = Div tokenize :: String -> [Token] tokenize  =  tokenize (c : cs) | elem c "+-*/" = TokOp (operator c) : tokenize cs | isDigit c = TokNum (digitToInt c) : tokenize cs | isAlpha c = TokIdent [c] : tokenize cs | isSpace c = tokenize cs | otherwise = error $ "Cannot tokenize " ++ [c] main = print $ tokenize " 1 + 4 / x "
Next time we'll work on tokenizing multi-character numbers and identifiers and learn about mutual recursion.
Ex 1. Rewrite the implementation of Fibonacci numbers using guards instead of the
if statement (it should become much closer to the mathematical definition):
-- Old definition: -- fib n = if n > 2 then fib (n - 1) + fib (n - 2) else 1 fib n | n == 1 = ... | ... = ... | otherwise = ... main = print (fib 20)
fib :: Int -> Int fib n | n == 1 = 1 | n == 2 = 1 | otherwise = fib (n-1) + fib (n-2) main = print (fib 20)
Ex 2. Implement function
cat that concatenates two lists.
cat :: [a] -> [a] -> [a] cat = undefined main = putStrLn $ cat "Hello " "World!"
You want to create a list whose first element is the first element of the first list (if any) followed by the rest of the first list concatenated with the second list.
cat :: [a] -> [a] -> [a] cat  ys = ys cat (x : xs) ys = x : cat xs ys main = putStrLn $ cat "Hello " "World!"
Ex 3. Use
cat from previous exercise and currying to define a function
pig that prepends "pig" to any string.
cat :: [a] -> [a] -> [a] cat = undefined pig :: String -> String pig = undefined main = putStrLn $ pig "sty"
cat :: [a] -> [a] -> [a] cat  ys = ys cat (x : xs) ys = x : cat xs ys pig :: String -> String pig = cat "pig" main = putStrLn $ pig "sty"
Ex 4. Implement function
toInts that takes a number in the form of a string and returns a list of its digits as integers.
import Data.Char toInts :: String -> [Int] toInts = undefined main = print $ toInts "2013"
import Data.Char toInts :: String -> [Int] toInts  =  toInts (c : cs) = digitToInt c : toInts cs main = print $ toInts "2013"
Ex 5. Implement function
sumDig that takes a number in the form of a string and calculates the sum of its digits. Make use of the function from the previous exercise.
import Data.Char toInts :: String -> [Int] toInts  =  toInts (c : cs) = digitToInt c : toInts cs sumDig :: String -> Int sumDig = undefined main = print $ sumDig "30750"
Define an auxiliary recursive function
acc that takes the sum-so-far and the remaining digits, and returns the total. Call it with appropriate arguments.
acc :: Int -> [Int] -> Int
import Data.Char toInts :: String -> [Int] toInts  =  toInts (c : cs) = digitToInt c : toInts cs sumDig :: String -> Int sumDig str = acc 0 (toInts str) acc :: Int -> [Int] -> Int acc a  = a acc a (i:is) = acc (a + i) is main = print $ sumDig "30750"