## Introduction
[Tagsoup](http://hackage.haskell.org/package/tagsoup-0.12.8) is a package for parsing strings of xml into a list of tag elements.
```haskell
-- | A single HTML element. A whole document is represented by a list of @Tag@.
-- There is no requirement for 'TagOpen' and 'TagClose' to match.
data Tag str =
TagOpen str [Attribute str] -- ^ An open tag with 'Attribute's in their original order
| TagClose str -- ^ A closing tag
| TagText str -- ^ A text node, guaranteed not to be the empty string
| TagComment str -- ^ A comment
| TagWarning str -- ^ Meta: A syntax error in the input file
| TagPosition !Row !Column -- ^ Meta: The position of a parsed element
deriving (Show, Eq, Ord, Data, Typeable)
parseTags :: StringLike str => str -> [Tag str]
```
There are functions for splitting a list of `Tag`s into sections whose first item matches a predicate; but there are no functions for searching/filtering the xml objects; there is module which converts a list of tags into a tree of tags, but it is provisional and has limited support for searching.
This tutorial will show how to create a simple, yet effective, library for searching/filtering a list of xml objects. I will first show a novel technique for transforming a list of `Tag`s into a list of objects using a generalization of `unfoldr`. Then I will create a small DSL for predicates over these objects; this DSL will make use of Applicatives.
## Objects
First I'll need a type to represent xml objects.
```haskell
data Object = Object {
srcFile :: FilePath,
kind :: String,
attrs :: [Attribute String],
text :: String,
kids :: [Object]
}
deriving (Eq, Show)
```
I include a source file field in anticipation of reading in xml from several files.
I will also have a few convenience functions for `Object`s.
```haskell
emptyObj = Object {
srcFile = "",
kind = "",
attrs = [],
text = "",
kids = []
}
addKid :: Object -> Object -> Object
addKid obj kid = obj{kids = kids obj ++ [kid]}
getAttr :: String -> Object -> Maybe String
getAttr nm = lookup nm . attrs
showObj :: Object -> String
showObj obj = "("++kind obj++
", attrs="++show (attrs obj)++
", text="++text obj++
", kids=["++intercalate "," (map showObj $ kids obj)++"])"
```
I will parse a list of `Tag`s into an `Object` and the remaining list of `Tag`s
I would like to structure this function in a manner similar to `unfoldr`
```haskell
unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
```
where I have a function to produce the next `Object` and the rest of the stream, and I use a combinator like `unfoldr` to handle the recursion and produce all the `Object`s. However, I can not use `unfoldr`; since it doesn't return the rest of the stream when it is done, I can not use it to handle the recursive calls parse the child `Object`s. So I will use a generalization of `unfoldr`
```haskell
unfoldGen :: (a -> Either c (b, a)) -> a -> ([b], c)
unfoldGen f = either ([],) (\(v, x') -> first (v:) (unfoldGen f x')) . f
```
where `either` is from `Data.Either` and `first` is from `Control.Arrow`.
I can now use a function `nextObj` of type
```haskell
nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String])
```
with `unfoldGen` to turn a list of `Tag`s into a list of `Object`s. The `Either Object String` is necessary in `nextObj` since there can be both text and objects inside an object.
```haskell
nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String])
nextObj file [] = Left []
nextObj file (t:ts) = case t of
TagOpen nm ats | nm /= "?xml" -> Right (Left obj, popTagClose ts')
where (res, ts') = unfoldGen (nextObj file) ts
obj = Object{srcFile = file,
kind = nm,
attrs = ats,
text = concat $ rights res,
kids = lefts res
}
-- silently ignore orphaned tags
popTagClose = drop 1 . dropWhile (not . isTagCloseName nm)
TagClose _ -> Left (t:ts)
TagText txt | all isSpace txt -> nextObj file ts -- ignore just whitespace text
| otherwise -> Right (Right txt, ts)
_ -> nextObj file ts -- ignore all other TagSoup tags
```
I can now write `tagsToObjs` to transform a list of `Tag`s to a list of `Object`s
```haskell
tagsToObjs :: FilePath -> [Tag String] -> [Object]
tagsToObjs file = lefts . fst . unfoldGen (nextObj file)
```
I can now put everything together to get a complete program.
```active haskell
--/show
{-# LANGUAGE TupleSections #-}
import Control.Applicative
import Control.Arrow
import Data.Char
import Data.Either
import Data.List
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Debug.Trace
import System.FilePath
import Text.HTML.TagSoup
data Object = Object {
srcFile :: FilePath,
kind :: String,
attrs :: [Attribute String],
text :: String,
kids :: [Object]
}
deriving (Eq, Show)
emptyObj = Object {
srcFile = "",
kind = "",
attrs = [],
text = "",
kids = []
}
addKid :: Object -> Object -> Object
addKid obj kid = obj{kids = kids obj ++ [kid]}
getAttr :: String -> Object -> Maybe String
getAttr nm = lookup nm . attrs
showObj :: Object -> String
showObj obj = "("++kind obj++
", attrs="++show (attrs obj)++
", text="++text obj++
", kids=["++intercalate "," (map showObj $ kids obj)++"])"
unfoldGen :: (a -> Either c (b, a)) -> a -> ([b], c)
unfoldGen f = either ([],) (\(v, x') -> first (v:) (unfoldGen f x')) . f
nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String])
nextObj file [] = Left []
nextObj file (t:ts) = case t of
TagOpen nm ats | nm /= "?xml" -> Right (Left obj, popTagClose ts')
where (res, ts') = unfoldGen (nextObj file) ts
obj = Object{srcFile = file,
kind = nm,
attrs = ats,
text = concat $ rights res,
kids = lefts res
}
-- silently ignore orphaned tags
popTagClose = drop 1 . dropWhile (not . isTagCloseName nm)
TagClose _ -> Left (t:ts)
TagText txt | all isSpace txt -> nextObj file ts -- ignore just whitespace text
| otherwise -> Right (Right txt, ts)
_ -> nextObj file ts -- ignore all other TagSoup tags
tagsToObjs :: FilePath -> [Tag String] -> [Object]
tagsToObjs file = lefts . fst . unfoldGen (nextObj file)
-- show Parse a file into a list of Objects.
getFileObjs :: FilePath -> IO [Object]
getFileObjs file = tagsToObjs file . parseTags . T.unpack <$> T.readFile file
conv = "boa viagem"
test = "eu vou emboraboa viagema deus"
main = do
putStrLn $ "tags=\n" ++ show (parseTags $ test ++ conv)
putStrLn ""
mapM_ (putStrLn . ("obj=\n"++) . showObj) $ tagsToObjs "" $ parseTags $ conv ++ test
putStrLn "\nok"
```
## Predicates
Now that I can create a list of `Object`s, it'd be nice to search and filter the objects.
I'll start with the simplest way to get a list of sub-`Object`s satisfying a given predicate.
```haskell
findObjs :: (Object -> Bool) -> Object -> [Object]
findObjs p obj | p obj = obj : objs
| otherwise = objs
where objs = concatMap (findObjs p) (kids obj)
```
Now I can write pretty much any kind of search I want.
Here are some basic handy predicates.
```haskell
anyObj = const True
kindIs k = (== k) . kind
textVal p = p . text
letVal getVal useVal obj = useVal (getVal obj) obj
```
Here are some basic building blocks for predicates on attributes.
```haskell
genAttr p = isJust . find p . attrs
-- Note: Attributes are just pairs in TagSoup
p ! q = genAttr (\(k,v) -> p k && q v)
attrName p = genAttr (p . fst)
attrVal p = genAttr (p . snd)
```
I can also define compound predicates.
```haskell
(<&&> :: (Object -> Bool) -> (Object -> Bool) -> Object -> Bool
p1 <&&> p2 = \obj -> p1 obj && p2 obj
(<||> :: (Object -> Bool) -> (Object -> Bool) -> Object -> Bool
p1 <||> p2 = \obj -> p1 obj || p2 obj
```
However, it is slightly inconvenient to have to use named variables when using compound predicates. So I will use the fact that `((->) a)` is an instance of `Applicative` to tidy up my predicates.
```haskell
type ObjPred = Object -> Bool
(<&&>) :: ObjPred -> ObjPred -> ObjPred
p1 <&&> p2 = (&&) <$> p1 <*> p2
infixl 6 <&&>
(<||>) :: ObjPred -> ObjPred -> ObjPred
p1 <||> p2 = (||) <$> p1 <*> p2
infixl 5 <||>
```
Note that I can compose with regular `Boolean` functions.
```haskell
not . (kindIs "a" <&&> not . attrName (== "bad") :: ObjPred
```
I will now add predicates on ancestors and descendants of a given object; i.e. I'd like to be able to find an object which has a descendant satisfying some property. In order to do this, I will have to have access to both parents and descendants of a given object. One way to accomplish this is to add a parent field to `Object`s. However, this will complicate object traversal and serialization (which I mightbe interested in). So, I will opt for adding the traversal history to the find function.
I will start by redefining `ObjPred` to take some state (the path to the current object).
```haskell
data ObjPredState = ObjPredState{curObj :: Object, prevObjs :: [Object]}
type ObjPred = ObjPredState -> Bool
```
and I will use a few convenience functions on `ObjPredState`.
```haskell
newObjPredState :: Object -> ObjPredState
newObjPredState obj = ObjPredState{curObj = obj, prevObjs = []}
nextObjPredStates :: ObjPredState -> [ObjPredState]
nextObjPredStates objPredSt = map modOPS (kids obj)
where obj = curObj objPredSt
modOPS o = ObjPredState{curObj = o, prevObjs = obj : prevObjs objPredSt}
prevObjPredState :: ObjPredState -> Maybe ObjPredState
prevObjPredState objPredSt = case prevObjs objPredSt of
[] -> Nothing
(x:xs) -> Just ObjPredState{curObj = x, prevObjs = xs}
```
Then I can redefine `findObjs` to keep track of the paths.
```haskell
findObjs' :: ObjPred -> Object -> [ObjPredState]
findObjs' p = go . newObjPredState where
go opSt | p opSt = opSt : opSts
| otherwise = opSts
where opSts = concatMap go $ nextObjPredStates opSt
findObjs :: ObjPred -> Object -> [Object]
findObjs p = map curObj . findObjs' p
```
I have separated out `findObjs'` from `findObjs` since `findObjs'` is more basic and might be needed to write some predicates (i.e. those which require looking at the path to this object).
Now I can write the following predicates.
```haskell
hasParent :: ObjPred -> ObjPred
hasParent p = maybe False p . prevObjPredState
hasAncestor :: ObjPred -> ObjPred
hasAncestor p = maybe False checkParent . prevObjPredState
where checkParent opSt = p opSt || hasAncestor p opSt
hasChild :: ObjPred -> ObjPred
hasChild p = any p . nextObjPredStates
hasDescendant :: ObjPred -> ObjPred
hasDescendant p opSt = or $ p opSt : map (hasDescendant p) (nextObjPredStates opSt)
```
And make the following convenience infix operators.
```haskell
x ~~> y = x <&&> hasChild y
infixl 7 ~~>
x *~~> y = x <&&> hasDescendant y
infixl 7 *~~>
x +~~> y = x <&&> hasChild (hasDescendant y)
infixl 7 *~~>
x ~/~> y = x <&&> not . hasChild y
infixl 7 ~/~>
x *~/~> y = x <&&> not . hasDescendant y
infixl 7 *~/~>
x +~/~> y = x <&&> not . hasChild (hasDescendant y)
infixl 7 *~/~>
```
And of course the corresponding ones for ancestors can be defined.
Now I can write predicates such as
```haskell
kindIs "a" +~~> kindIs "b" +~/~> (== "name") ! ('&' `elem`)
```
to specify an `a` which has a proper descendant `b` which doesn't have a proper descendant with a `name` attribute whose value contains an `&`.