## Introduction [Tagsoup](http://hackage.haskell.org/package/tagsoup-0.12.8) is a package for parsing strings of xml into a list of tag elements. ```haskell -- | A single HTML element. A whole document is represented by a list of @Tag@. -- There is no requirement for 'TagOpen' and 'TagClose' to match. data Tag str = TagOpen str [Attribute str] -- ^ An open tag with 'Attribute's in their original order | TagClose str -- ^ A closing tag | TagText str -- ^ A text node, guaranteed not to be the empty string | TagComment str -- ^ A comment | TagWarning str -- ^ Meta: A syntax error in the input file | TagPosition !Row !Column -- ^ Meta: The position of a parsed element deriving (Show, Eq, Ord, Data, Typeable) parseTags :: StringLike str => str -> [Tag str] ``` There are functions for splitting a list of `Tag`s into sections whose first item matches a predicate; but there are no functions for searching/filtering the xml objects; there is module which converts a list of tags into a tree of tags, but it is provisional and has limited support for searching. This tutorial will show how to create a simple, yet effective, library for searching/filtering a list of xml objects. I will first show a novel technique for transforming a list of `Tag`s into a list of objects using a generalization of `unfoldr`. Then I will create a small DSL for predicates over these objects; this DSL will make use of Applicatives. ## Objects First I'll need a type to represent xml objects. ```haskell data Object = Object { srcFile :: FilePath, kind :: String, attrs :: [Attribute String], text :: String, kids :: [Object] } deriving (Eq, Show) ``` I include a source file field in anticipation of reading in xml from several files. I will also have a few convenience functions for `Object`s. ```haskell emptyObj = Object { srcFile = "", kind = "", attrs = [], text = "", kids = [] } addKid :: Object -> Object -> Object addKid obj kid = obj{kids = kids obj ++ [kid]} getAttr :: String -> Object -> Maybe String getAttr nm = lookup nm . attrs showObj :: Object -> String showObj obj = "("++kind obj++ ", attrs="++show (attrs obj)++ ", text="++text obj++ ", kids=["++intercalate "," (map showObj $ kids obj)++"])" ``` I will parse a list of `Tag`s into an `Object` and the remaining list of `Tag`s I would like to structure this function in a manner similar to `unfoldr` ```haskell unfoldr :: (b -> Maybe (a, b)) -> b -> [a] ``` where I have a function to produce the next `Object` and the rest of the stream, and I use a combinator like `unfoldr` to handle the recursion and produce all the `Object`s. However, I can not use `unfoldr`; since it doesn't return the rest of the stream when it is done, I can not use it to handle the recursive calls parse the child `Object`s. So I will use a generalization of `unfoldr` ```haskell unfoldGen :: (a -> Either c (b, a)) -> a -> ([b], c) unfoldGen f = either ([],) (\(v, x') -> first (v:) (unfoldGen f x')) . f ``` where `either` is from `Data.Either` and `first` is from `Control.Arrow`. I can now use a function `nextObj` of type ```haskell nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String]) ``` with `unfoldGen` to turn a list of `Tag`s into a list of `Object`s. The `Either Object String` is necessary in `nextObj` since there can be both text and objects inside an object. ```haskell nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String]) nextObj file [] = Left [] nextObj file (t:ts) = case t of TagOpen nm ats | nm /= "?xml" -> Right (Left obj, popTagClose ts') where (res, ts') = unfoldGen (nextObj file) ts obj = Object{srcFile = file, kind = nm, attrs = ats, text = concat $ rights res, kids = lefts res } -- silently ignore orphaned tags popTagClose = drop 1 . dropWhile (not . isTagCloseName nm) TagClose _ -> Left (t:ts) TagText txt | all isSpace txt -> nextObj file ts -- ignore just whitespace text | otherwise -> Right (Right txt, ts) _ -> nextObj file ts -- ignore all other TagSoup tags ``` I can now write `tagsToObjs` to transform a list of `Tag`s to a list of `Object`s ```haskell tagsToObjs :: FilePath -> [Tag String] -> [Object] tagsToObjs file = lefts . fst . unfoldGen (nextObj file) ``` I can now put everything together to get a complete program. ```active haskell --/show {-# LANGUAGE TupleSections #-} import Control.Applicative import Control.Arrow import Data.Char import Data.Either import Data.List import Data.Maybe import qualified Data.Text as T import qualified Data.Text.IO as T import Debug.Trace import System.FilePath import Text.HTML.TagSoup data Object = Object { srcFile :: FilePath, kind :: String, attrs :: [Attribute String], text :: String, kids :: [Object] } deriving (Eq, Show) emptyObj = Object { srcFile = "", kind = "", attrs = [], text = "", kids = [] } addKid :: Object -> Object -> Object addKid obj kid = obj{kids = kids obj ++ [kid]} getAttr :: String -> Object -> Maybe String getAttr nm = lookup nm . attrs showObj :: Object -> String showObj obj = "("++kind obj++ ", attrs="++show (attrs obj)++ ", text="++text obj++ ", kids=["++intercalate "," (map showObj $ kids obj)++"])" unfoldGen :: (a -> Either c (b, a)) -> a -> ([b], c) unfoldGen f = either ([],) (\(v, x') -> first (v:) (unfoldGen f x')) . f nextObj :: FilePath -> [Tag String] -> Either [Tag String] (Either Object String, [Tag String]) nextObj file [] = Left [] nextObj file (t:ts) = case t of TagOpen nm ats | nm /= "?xml" -> Right (Left obj, popTagClose ts') where (res, ts') = unfoldGen (nextObj file) ts obj = Object{srcFile = file, kind = nm, attrs = ats, text = concat $ rights res, kids = lefts res } -- silently ignore orphaned tags popTagClose = drop 1 . dropWhile (not . isTagCloseName nm) TagClose _ -> Left (t:ts) TagText txt | all isSpace txt -> nextObj file ts -- ignore just whitespace text | otherwise -> Right (Right txt, ts) _ -> nextObj file ts -- ignore all other TagSoup tags tagsToObjs :: FilePath -> [Tag String] -> [Object] tagsToObjs file = lefts . fst . unfoldGen (nextObj file) -- show Parse a file into a list of Objects. getFileObjs :: FilePath -> IO [Object] getFileObjs file = tagsToObjs file . parseTags . T.unpack <$> T.readFile file conv = "boa viagem" test = "eu vou emboraboa viagema deus" main = do putStrLn $ "tags=\n" ++ show (parseTags $ test ++ conv) putStrLn "" mapM_ (putStrLn . ("obj=\n"++) . showObj) $ tagsToObjs "" $ parseTags $ conv ++ test putStrLn "\nok" ``` ## Predicates Now that I can create a list of `Object`s, it'd be nice to search and filter the objects. I'll start with the simplest way to get a list of sub-`Object`s satisfying a given predicate. ```haskell findObjs :: (Object -> Bool) -> Object -> [Object] findObjs p obj | p obj = obj : objs | otherwise = objs where objs = concatMap (findObjs p) (kids obj) ``` Now I can write pretty much any kind of search I want. Here are some basic handy predicates. ```haskell anyObj = const True kindIs k = (== k) . kind textVal p = p . text letVal getVal useVal obj = useVal (getVal obj) obj ``` Here are some basic building blocks for predicates on attributes. ```haskell genAttr p = isJust . find p . attrs -- Note: Attributes are just pairs in TagSoup p ! q = genAttr (\(k,v) -> p k && q v) attrName p = genAttr (p . fst) attrVal p = genAttr (p . snd) ``` I can also define compound predicates. ```haskell (<&&> :: (Object -> Bool) -> (Object -> Bool) -> Object -> Bool p1 <&&> p2 = \obj -> p1 obj && p2 obj (<||> :: (Object -> Bool) -> (Object -> Bool) -> Object -> Bool p1 <||> p2 = \obj -> p1 obj || p2 obj ``` However, it is slightly inconvenient to have to use named variables when using compound predicates. So I will use the fact that `((->) a)` is an instance of `Applicative` to tidy up my predicates. ```haskell type ObjPred = Object -> Bool (<&&>) :: ObjPred -> ObjPred -> ObjPred p1 <&&> p2 = (&&) <$> p1 <*> p2 infixl 6 <&&> (<||>) :: ObjPred -> ObjPred -> ObjPred p1 <||> p2 = (||) <$> p1 <*> p2 infixl 5 <||> ``` Note that I can compose with regular `Boolean` functions. ```haskell not . (kindIs "a" <&&> not . attrName (== "bad") :: ObjPred ``` I will now add predicates on ancestors and descendants of a given object; i.e. I'd like to be able to find an object which has a descendant satisfying some property. In order to do this, I will have to have access to both parents and descendants of a given object. One way to accomplish this is to add a parent field to `Object`s. However, this will complicate object traversal and serialization (which I mightbe interested in). So, I will opt for adding the traversal history to the find function. I will start by redefining `ObjPred` to take some state (the path to the current object). ```haskell data ObjPredState = ObjPredState{curObj :: Object, prevObjs :: [Object]} type ObjPred = ObjPredState -> Bool ``` and I will use a few convenience functions on `ObjPredState`. ```haskell newObjPredState :: Object -> ObjPredState newObjPredState obj = ObjPredState{curObj = obj, prevObjs = []} nextObjPredStates :: ObjPredState -> [ObjPredState] nextObjPredStates objPredSt = map modOPS (kids obj) where obj = curObj objPredSt modOPS o = ObjPredState{curObj = o, prevObjs = obj : prevObjs objPredSt} prevObjPredState :: ObjPredState -> Maybe ObjPredState prevObjPredState objPredSt = case prevObjs objPredSt of [] -> Nothing (x:xs) -> Just ObjPredState{curObj = x, prevObjs = xs} ``` Then I can redefine `findObjs` to keep track of the paths. ```haskell findObjs' :: ObjPred -> Object -> [ObjPredState] findObjs' p = go . newObjPredState where go opSt | p opSt = opSt : opSts | otherwise = opSts where opSts = concatMap go $ nextObjPredStates opSt findObjs :: ObjPred -> Object -> [Object] findObjs p = map curObj . findObjs' p ``` I have separated out `findObjs'` from `findObjs` since `findObjs'` is more basic and might be needed to write some predicates (i.e. those which require looking at the path to this object). Now I can write the following predicates. ```haskell hasParent :: ObjPred -> ObjPred hasParent p = maybe False p . prevObjPredState hasAncestor :: ObjPred -> ObjPred hasAncestor p = maybe False checkParent . prevObjPredState where checkParent opSt = p opSt || hasAncestor p opSt hasChild :: ObjPred -> ObjPred hasChild p = any p . nextObjPredStates hasDescendant :: ObjPred -> ObjPred hasDescendant p opSt = or $ p opSt : map (hasDescendant p) (nextObjPredStates opSt) ``` And make the following convenience infix operators. ```haskell x ~~> y = x <&&> hasChild y infixl 7 ~~> x *~~> y = x <&&> hasDescendant y infixl 7 *~~> x +~~> y = x <&&> hasChild (hasDescendant y) infixl 7 *~~> x ~/~> y = x <&&> not . hasChild y infixl 7 ~/~> x *~/~> y = x <&&> not . hasDescendant y infixl 7 *~/~> x +~/~> y = x <&&> not . hasChild (hasDescendant y) infixl 7 *~/~> ``` And of course the corresponding ones for ancestors can be defined. Now I can write predicates such as ```haskell kindIs "a" +~~> kindIs "b" +~/~> (== "name") ! ('&' `elem`) ``` to specify an `a` which has a proper descendant `b` which doesn't have a proper descendant with a `name` attribute whose value contains an `&`.