# Part V: Heaps of Performance

16 Nov 2014

### Sections

Back in part 3 I naïvely plodded ahead with implementing a nice fusion combinator to try to use inside of my matrix multiplication routine.

It was slow.

It was slow because I built nested trees of concatenations and merges, and then when pumping it stream fusion was never getting anything that it could unroll into a bigger loop. Moreover, when faced with a tree of merges, the `Stream` fusion framework was being forced to keep the associativity by which it was originally constructed, but there was no balance ensuring that tree was any good.

If you're just joining, you may want to go back and read parts 1,2,3, and 4, but like the last two parts, this one mostly stands alone if you ignore the motivation for the trick as a short practicum on how to bootstrap a datastructure and an introduction to pairing heaps and ephemeral steques.

# Work Smarter

Let's work smarter, not harder, by finding a better algorithm, rather than trying to run a dumb one fast.

To that end, I need to be able to deal with merging together streams and concatenating streams with a minimum of impact from concatenation on the cost of getting the next element.

We could of course merge streams by representing them as a heap.

Pick your favorite heap. Mine today for ease of exposition is a pairing heap. It is pretty awful when you want to actually analyze the runtime performance of your algorithm, but it has excellent amortized performance in practice and fits the form of what I'm going to put around it.

I'll deal with only non-empty heaps, as we can move all the reasoning for the empty case into `Maybe (Heap a)` and it admits a nicer recursive definition.

For simplicity, I'll just be using `Int` keys rather than Morton-ordered keys below.

A pairing heap is a heap built on a rose tree. That is to say any node in your heap can have any number of children, so long as you satisfy the heap property.

``data Heap a = Heap !Int a [Heap a] deriving Show``

A pairing heap provides us with a dead simple union for two heaps. Compare them, and then shove the one with the larger starting key one underneath the smaller. That it is.

``````mix :: Heap a -> Heap a -> Heap a
mix x@(Heap i a as) y@(Heap j b bs)
| i <= j    = Heap i a (y:as)
| otherwise = Heap j b (x:bs)``````

We can obviously grab the `top` element

``````top :: Heap a -> (Int, a)
top (Heap i a _) = (i, a)``````

and we can `pop` that element off the `Heap`:

``````pop :: Heap a -> Maybe (Heap a)
pop (Heap _ _ [])    = Nothing
pop (Heap _ _ (x:xs) = Just (merge x xs)

merge :: Heap a -> [Heap a] -> Heap a
merge x (y:ys) = case ys of
(z:zs) -> mix x y `mix` merge z zs
[]     -> mix x y
merge x [] = x``````

The complexity in analyzing the asymptotics of a pairing heap comes from the fact that `merge` might have to do a lot of work, but `merge` pairs up the heaps before doing it's work to get better balancing. Again, nothing in what I'm writing here cares about the pairing heap parts, other than it is easy to write and similar to the surrounding code we're adding, and it is the most obviously amenable heap to the transformation I'll be applying later.

`mix` now handles the behavior I need for merging streams/heaps, but I'm also going to need efficient concatenation. While `mix` is "technically correct" for concatenating two heaps that do not overlap in key space, it also isn't terribly good at it.

# Steques

A "stack-ended queue", or steque, is what Tarjan calls a structure that permits `cons` and `snoc` on both ends, but only efficient `head`/`tail`/`uncons` from one side.

This distinguishes it from a deque, which permits efficient `init`/`tail`/`unsnoc` as well.

The simplest implementation of a steque is

``data Steque a = Steque [a] [a] deriving (Show,Read)``

The first list is in order from left to right, and the second list is reversed.

Note: This isn't the most robust steque or deque in Okasaki's book. Its asymptotics are amortized, not worst case, and they only hold for ephemeral rather than persistent use. However, I only care about that ephemeral use case and so the reduction in book-keeping overhead is more valuable than slower execution to support operations I don't use!

``````instance Functor Steque where
fmap f (Steque fs rs) = Steque (fmap f fs) (fmap f rs)

instance Foldable Steque where
foldMap f (Steque fs rs) = foldMap f fs `mappend` getDual (foldMap (Dual . f) rs)``````

We can use a little bit of `lens` to make the `Traversable` instance easier, given the combinator `backwards` that can turn a `Traversal` around.

``````instance Traversable Steque where
traverse f (Steque fs rs) = Steque <\$> traverse f fs <*> backwards traverse f rs``````

and then we could quotient out the irrelevancies of how our elements are distributed between the two lists:

``````instance Eq a => Eq (Steque a) where
(==) = (==) `on` toList

instance Ord a => Ord (Steque a) where
compare = compare `on` toList``````

We could even get ambitious and define a bunch of other instances on `Steque`, just because we can:

``````instance Applicative Steque where
pure a = Steque [a] []
(<*>) = ap

return a = Steque [a] []
Steque fs bs >>= f = Steque (fs >>= toList . f) (bs >>= toListOf (backwards folded) . f)``````

... but we're getting distracted, we really just want to be able to `cons`/`head`/`tail`/`uncons`, and `snoc`.

To make this interesting, let us show how to implement these using some lesser understood parts of `lens`.

The `lens` package provides a common `Cons` class for dealing with `cons`-like behavior. It is tricky to instantiate correctly, because it is overloaded to permit usecases where you can merely `cons` and not `uncons` and vice-versa, but we can define an instance:

``````instance (Choice p, Applicative f) => Cons p f (Steque a) (Steque b) a b where
_Cons = prism (\(x,Steque fs bs) -> Steque (x:fs) bs) \$ \ (Steque fs bs) -> case fs of
x:xs -> Right (x,Steque xs bs)
[]   -> case reverse bs of
x:xs -> Right (x,Steque xs [])
[] -> Left (Steque [] [])``````

This gives us the following `Prism`:

``_Cons :: Prism (Steque a) (Steque b) (a, Steque a) (b, Steque b)``

We can now use it to `cons` and `uncons`, using the definitions from `Control.Lens.Cons`:

``````cons a as = _Cons # (a,as)
uncons as = as^?_Cons``````

That module also provides combinators that makes either a `Getter`, `Setter`, `Lens` or `Traversal` for each of `_head` and `_tail` depending on the restricted form of `_Cons` you chose.

``````_head = _Cons._1
_tail = _Cons._2``````

On the other side we only want to be able to `snoc`, not use `tail`, `init` or `unsnoc`, so let's define:

``````instance (p ~ Reviewed, f ~ Identity, a ~ b) => Snoc p f (Steque a) (Steque b) a b where
_Snoc = unto \$ \(Steque f b,x) -> Steque f (x:b)``````

Note: we could also support `_init` and `_tail` and `unsnoc`, but they will be O(n), even for ephemeral use, because we do not try to preserve any balance between the front and back lists.

`unto` is used to define a `Review`, which is to a `Prism` what a `Getter` is to a `Lens`. In practice it makes something that is like a `Prism` in that you can apply `#` to it, but nothing else.

This is enough to permit us to use the stock definition of `snoc`:

``snoc as a = _Snoc # (as,a)``

With all of that we can finally play with our `Steque`:

``````{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UndecidableInstances #-}
import Control.Applicative
import Control.Lens
import Control.Lens.Internal.Review
import Data.Foldable
import Data.Functor.Identity
import Data.Monoid
import Data.Function (on)

data Steque a = Steque [a] [a] deriving (Show,Read)

instance Eq a => Eq (Steque a) where
(==) = (==) `on` toList

instance Ord a => Ord (Steque a) where
compare = compare `on` toList

instance Functor Steque where
fmap f (Steque fs rs) = Steque (fmap f fs) (fmap f rs)

instance Foldable Steque where
foldMap f (Steque fs rs) = foldMap f fs `mappend` getDual (foldMap (Dual . f) rs)

instance Traversable Steque where
traverse f (Steque fs rs) = Steque <\$> traverse f fs <*> backwards traverse f rs

instance Applicative Steque where
pure a = Steque [a] []
(<*>) = ap

instance Alternative Steque where
Steque fs rs <|> Steque fs' rs' = Steque (fs ++ reverse rs ++ fs') rs'
empty = Steque [] []

return a = Steque [a] []
Steque fs bs >>= f = Steque (fs >>= toList . f) (bs >>= toListOf (backwards folded) . f)

mplus = (<|>)
mzero = Steque [] []

instance Monoid (Steque a) where
mappend = (<|>)
mempty = Steque [] []

instance (Choice p, Applicative f) => Cons p f (Steque a) (Steque b) a b where
_Cons = prism (\(x,Steque fs bs) -> Steque (x:fs) bs) \$ \ (Steque fs bs) -> case fs of
x:xs -> Right (x,Steque xs bs)
[]   -> case reverse bs of
x:xs -> Right (x,Steque xs [])
[] -> Left (Steque [] [])

instance (p ~ Reviewed, f ~ Identity, a ~ b) => Snoc p f (Steque a) (Steque b) a b where
_Snoc = unto \$ \(Steque f b,x) -> Steque f (x:b)

-- show
main = print \$ uncons (snoc (cons 1 (cons 2 (snoc (empty :: Steque Int) 4))) 5)
-- /show``````

Now, our definition of a `Steque` works passably well.

We can snoc, cons, uncons, head, or tail all in O(1) amortized time.

but the `Monoid`, `Alternative` and `MonadPlus` instances perform poorly, doing O(n) work because it has to glue together potentially rather long lists:

``Steque fs rs <|> Steque fs' rs' = Steque (fs ++ reverse rs ++ fs') rs'``

It is that `Monoid` performance that makes us sad, so let's do something about it.

# Bootstrapping

To solve that we turn to Chris Okasaki's Purely Functional Data Structures once again, this time to find bootstrapping.

Bootstrapping is a tool for defining a data structure with cheap concatenation by recursively storing it in itself.

``data Catenable f a = Empty | Cons a (f (Catenable f a))``

Here we could build a `Catenable Steque` out of our basic `Steque` following Okasaki's recipe. Here the vocabulary of Tarjan is more accurate and extensible. Tarjan calls a "catenable steque" a c-steque, while Okasaki uses the less informative name "catenable list."

As with the pairing `Heap`, recursion is simpler if we don't allow `Empty` catenable substructures, and I'm only going to need non-empty catenable structures, though, so `Catenable` simplifies to:

``data Catenable f a = Cons a (f (Catenable f a))``

which advanced Haskell programmers in the audience may recognize as my old friend the cofree comonad! In the interest of retaining some semblance of accessibility, I'm not going to work through this exercise in its full generality.

Now it is obvious how to concatenate a pair of catenable steques, we simply insert one into the other.

``````data CSteque a = Cons a (Steque (CSteque a))

instance Semigroup (CSteque a) where
CSteque a as <> bs = CSteque a (snoc as bs)``````

Hrmm. That sounds suspiciously like the operation that drove our pairing heap.

You can derive `cons` and `snoc` by just concatenating singleton catenable steques, and `uncons` simply has to put back the leftovers if any into the underlying `Steque`, which it can do because our `Steque` supports O(1) `cons`!

If we inline the definition of `Steque` into `CSteque` we get:

``data CSteque a = Cons a [CSteque a] [CSteque a]``

and that looks a lot like our pairing heap, just with different invariants.

Hrmm.

# Catenable Heaps

Finally, what we want looks kind of like a steque of heaps, but where the heaps can also act like steques of heaps, ad infinitum. We can get there by mashing all the parts we've described together to get:

``data Heap a = Heap {-# UNPACK #-} !Int a [Heap a] [Heap a] [Heap a]``

This type represents a form of catenable non-empty pairing heap. The three lists of heaps in turn are:

1.) the jumbled mess of heaps we haven't sorted relative to one another except partially via the heap property, followed by 2.) a list of heaps that do not overlap one another in key space in ascending order, followed by 3.) another list of heaps that do not overlap one another in key space in descending order.

Now we can concatenate in O(1)!

``````fby :: Heap a -> Heap a -> Heap a
fby (Heap i a as ls rs) r = Heap i a as ls (r:rs)``````

A number of operations remain trivial:

``````top :: Heap a -> (Key, a)
top (Heap i a _ _ _) = (i, a)

singleton :: Key -> a -> Heap a
singleton k v = Heap k v [] [] []

fromList :: [(Key,a)] -> Heap a
fromList ((k0,v0):xs) = Prelude.foldr (\(k,v) r -> mix (singleton k v) r) (singleton k0 v0) xs
fromList [] = error "empty Heap"

fromAscList :: [(Key,a)] -> Heap a
fromAscList ((k0,v0):xs) = Prelude.foldr (\(k,v) r -> fby (singleton k v) r) (singleton k0 v0) xs
fromAscList [] = error "empty Heap"``````

`pop` got a bit more complicated. We'll just pass it the 3 lists from our heap, as it doesn't use the key/value pair out front and we don't want to require the compiler to figure out to specialize on the constructor.

``````pop :: [Heap a] -> [Heap a] -> [Heap a] -> Maybe (Heap a)
pop (x:xs) ls     rs = Just \$ fbys (merge x xs) ls rs
pop []     (l:ls) rs = Just \$ fbys l ls rs
pop []     []     rs = case reverse rs of
f:fs -> Just (fbys f fs [])
[]   -> Nothing

fbys :: Heap a -> [Heap a] -> [Heap a] -> Heap a
fbys (Heap i a as [] []) ls' rs' = Heap i a as ls' rs'
fbys (Heap i a as ls []) ls' rs' = Heap i a as ls \$ rs' <> reverse ls'
fbys (Heap i a as ls rs) ls' rs' = Heap i a as ls \$ rs' <> reverse ls' <> rs``````

No-Prize #7

Computing with `fbys` in `pop` is somewhat unpleasant, but I don't have a cleaner way.

Is there something where I can preserve the correctness of this but not have to move content into the right hand reversed list and re-reverse it? Keep in mind it isn't sound to move it into the left list, because we don't know the length of the left list and can only move content left when the left list is empty.

We could probably insert a pair of other heaps into the reversed list by building up a chain of `fby`s popping something off of the `ls'` and smashing the other `ls'` and `rs'` underneath it by using this function recursively. What does that look like?

and `mix` also becomes more complex as it now has to deal with the extra concatenated components, pushing what it can into the nested pairing heap.

``````mix :: Heap a -> Heap a -> Heap a
mix x@(Heap i a as al ar) y@(Heap j b bs bl br)
| i <= j    = Heap i a (y:pops as al ar) [] []
| otherwise = Heap j b (x:pops bs bl br) [] []

merge :: Heap a -> [Heap a] -> Heap a
merge x (y:ys) = case ys of
(z:zs) -> mix x y `mix` merge z zs
[]     -> mix x y
merge x [] = x

pops :: [Heap a] -> [Heap a] -> [Heap a] -> [Heap a]
pops xs     []     [] = xs
pops (x:xs) ls     rs = [fbys (merge x xs) ls rs]
pops []     (l:ls) rs = [fbys l ls rs]
pops []     []     rs = case reverse rs of
f:fs -> [fbys f fs []]
_    -> [] -- caught above``````

`pops` is a slight optimization of `pop` that uses the fact that the result is winding up in a list subject to the heap property and need not be reduced all the way down to a `Maybe (Heap a)` just yet.

Finally, we can define a useful conversion to a `Stream` that can get some benefit out of stream fusion.

``````data HeapState a
= Start !(Heap a)
| Ready {-# UNPACK #-} !Key a !(Heap a)
| Final {-# UNPACK #-} !Key a
| Finished

streamHeapWith :: Monad m => (a -> a -> a) -> Maybe (Heap a) -> Stream m (Key, a)
streamHeapWith f h0 = Stream step (maybe Finished Start h0) Unknown where
step (Start (Heap i a xs ls rs))     = return \$ Skip \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Ready i a (Heap j b xs ls rs)) = return \$ case compare i j of
LT -> Yield (i, a)      \$ maybe (Final j b) (Ready j b) \$ pop xs ls rs
EQ | c <- f a b -> Skip \$ maybe (Final i c) (Ready i c) \$ pop xs ls rs
GT -> Yield (j, b)      \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Final i a) = return \$ Yield (i,a) Finished
step Finished    = return Done
{-# INLINE  step #-}
{-# INLINE  streamHeapWith #-}``````

This doesn't quite hit the stream fusion goal of being made of a single loop that can fuse into other loops, due to the recursive calls to `pop`. I leave it as an exercise for the reader to find a form that implements `pop` via iterative `Skip` steps and a more complex state -- In other words, I haven't bothered yet. ;)

Putting it all together yields:

``````-- show
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE PatternGuards #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies #-}

import Control.Applicative
import Control.Lens
import Data.Bits
import Data.Foldable
import Data.Monoid
import Data.Vector.Fusion.Stream.Size
import Data.Vector.Fusion.Util
import Data.Word
import Prelude

type Key = Int

-- | Bootstrapped _catenable_ non-empty pairing heaps
data Heap a = Heap {-# UNPACK #-} !Key a [Heap a] [Heap a] [Heap a]

-- | Append two heaps where we know every key in the first occurs before every key in the second
fby :: Heap a -> Heap a -> Heap a
fby (Heap i a as ls rs) r = Heap i a as ls (r:rs)

-- | Interleave two heaps making a new 'Heap'
mix :: Heap a -> Heap a -> Heap a
mix x@(Heap i a as al ar) y@(Heap j b bs bl br)
| i <= j    = Heap i a (y:pops as al ar) [] []
| otherwise = Heap j b (x:pops bs bl br) [] []

merge :: Heap a -> [Heap a] -> Heap a
merge x (y:ys) = case ys of
(z:zs) -> mix x y `mix` merge z zs
[]     -> mix x y
merge x [] = x

top :: Heap a -> (Key, a)
top (Heap i a _ _ _) = (i, a)

pop :: [Heap a] -> [Heap a] -> [Heap a] -> Maybe (Heap a)
pop (x:xs) ls     rs = Just \$ fbys (merge x xs) ls rs
pop []     (l:ls) rs = Just \$ fbys l ls rs
pop []     []     rs = case reverse rs of
f:fs -> Just (fbys f fs [])
[]   -> Nothing

singleton :: Key -> a -> Heap a
singleton k v = Heap k v [] [] []

fromList :: [(Key,a)] -> Heap a
fromList ((k0,v0):xs) = Prelude.foldr (\(k,v) r -> mix (singleton k v) r) (singleton k0 v0) xs
fromList [] = error "empty Heap"

fromAscList :: [(Key,a)] -> Heap a
fromAscList ((k0,v0):xs) = Prelude.foldr (\(k,v) r -> fby (singleton k v) r) (singleton k0 v0) xs
fromAscList [] = error "empty Heap"

-- * Internals

fbys :: Heap a -> [Heap a] -> [Heap a] -> Heap a
fbys (Heap i a as [] []) ls' rs' = Heap i a as ls' rs'
fbys (Heap i a as ls []) ls' rs' = Heap i a as ls \$ rs' <> reverse ls'
fbys (Heap i a as ls rs) ls' rs' = Heap i a as ls \$ rs' <> reverse ls' <> rs

pops :: [Heap a] -> [Heap a] -> [Heap a] -> [Heap a]
pops xs     []     [] = xs
pops (x:xs) ls     rs = [fbys (merge x xs) ls rs]
pops []     (l:ls) rs = [fbys l ls rs]
pops []     []     rs = case reverse rs of
f:fs -> [fbys f fs []]
_    -> [] -- caught above by the 'go as [] []' case

-- * Instances

instance Functor Heap where
fmap f (Heap k a xs ls rs) = Heap k (f a) (fmap f <\$> xs) (fmap f <\$> ls) (fmap f <\$> rs)

instance FunctorWithIndex Key Heap where
imap f (Heap k a xs ls rs) = Heap k (f k a) (imap f <\$> xs) (imap f <\$> ls) (imap f <\$> rs)

instance Foldable Heap where
foldMap f = go where
go (Heap _ a xs ls rs) = case pop xs ls rs of
Nothing -> f a
Just h  -> f a `mappend` go h
{-# INLINE foldMap #-}

instance FoldableWithIndex Key Heap where
ifoldMap f = go where
go (Heap i a xs ls rs) = case pop xs ls rs of
Nothing -> f i a
Just h  -> f i a `mappend` go h
{-# INLINE ifoldMap #-}

-- this linearizes the heap
instance Traversable Heap where
traverse f xs = fromAscList <\$> traverse (traverse f) (itoList xs)
{-# INLINE traverse #-}

instance TraversableWithIndex Key Heap where
itraverse f xs = fromAscList <\$> traverse (\(k,v) -> (,) k <\$> f k v) (itoList xs)
{-# INLINE itraverse #-}

data HeapState a
= Start !(Heap a)
| Ready {-# UNPACK #-} !Key a !(Heap a)
| Final {-# UNPACK #-} !Key a
| Finished

streamHeapWith :: Monad m => (a -> a -> a) -> Maybe (Heap a) -> Stream m (Key, a)
streamHeapWith f h0 = Stream step (maybe Finished Start h0) Unknown where
step (Start (Heap i a xs ls rs))     = return \$ Skip \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Ready i a (Heap j b xs ls rs)) = return \$ case compare i j of
LT -> Yield (i, a)      \$ maybe (Final j b) (Ready j b) \$ pop xs ls rs
EQ | c <- f a b -> Skip \$ maybe (Final i c) (Ready i c) \$ pop xs ls rs
GT -> Yield (j, b)      \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Final i a) = return \$ Yield (i,a) Finished
step Finished    = return Done
{-# INLINE  step #-}
{-# INLINE  streamHeapWith #-}

streamHeapWith0 :: Monad m => (a -> a -> Maybe a) -> Maybe (Heap a) -> Stream m (Key, a)
streamHeapWith0 f h0 = Stream step (maybe Finished Start h0) Unknown where
step (Start (Heap i a xs ls rs))     = return \$ Skip \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Ready i a (Heap j b xs ls rs)) = return \$ case compare i j of
LT -> Yield (i, a) \$ maybe (Final j b) (Ready j b) \$ pop xs ls rs
EQ -> case f a b of
Nothing -> Skip  \$ maybe Finished Start \$ pop xs ls rs
Just c  -> Skip  \$ maybe (Final i c) (Ready i c) \$ pop xs ls rs
GT -> Yield (j, b) \$ maybe (Final i a) (Ready i a) \$ pop xs ls rs
step (Final i a) = return \$ Yield (i,a) Finished
step Finished = return Done
{-# INLINE  step #-}
{-# INLINE  streamHeapWith0 #-}

-- /show
-- show
main = print \$ (fromList [(10,"hi"),(20,"there")] `mix` singleton 100 "hello") `fby` singleton 200 "goodbye"
-- show``````

I've left figuring out how to write `_Cons` and `_Snoc` as an exercise for the reader.

With that we can finally concatenate heaps and mix them together as needed, and the stream fusion combinators can be used after the fact to spit out a stream of values that have been merged by our desired concept of addition. What has happened is that we for the most part are able to punt considering anything later in the chain of concatenations until after we've fully explored the stuff nearer to hand.

In practice this made a couple of orders of magnitude difference in the performance of the resulting code, so all this theoretical nonsense paid off.

I'd like a version of this that let's me reach the asymptotic bounds on set union set and reached by Erik Demaine et al.'s tour de force Adaptive set intersections, unions, and differences, but I don't think we can get there given the extra constraints that we don't fully know the contents of the streams we're skipping over in advance.

August 23 2013

Edited September 14th, 2013: Mihály Bárász pointed out to me that I had over-simplified the pairing heap implementation. I've rectified that by replacing `foldl mix` with `merge` uniformly throughout the code above. This makes a noticeable improvement in performance.