Interactive code snippets not yet available for SoH 2.0, see our Status of of School of Haskell 2.0 blog post

Perf for low-level profiling

perf and low-level Haskell programs

TL;DR: perf can be particularly helpful when profiling performance critical sections of your programs. We show how through an example.

The perf_events framework in the linux kernel, together with the perf userland tools, can be used to profile applications without any instrumentation -- beyond compiling them with debug symbols. The debug symbols are specified in the DWARF format. Recently, support was added to GHC to support DWARF debug symbols, thanks to work by Peter Wortmann.

In this article we go through the process of understanding the performance of a low-level Haskell program -- a function computing the CRC32 digest of a ByteString. We compare a pure Haskell version, provided by the digest-pure package; to a version using a C function provided by zlib, provided by the digest package. After we understand why the Haskell version is slower we modify it to make it almost at fast as the C one.

Note that this document is meant to be a showcase of what can be done with perf, rather than a tutorial on how to make an Haskell program as fast as a C one. In this case, the shortest route would have been to directly replicate the behaviour of the C program in Haskell: we use perf to show how it's possible to understand precisely where and why time is being spent.

The setup

Building Haskell libraries with DWARF

Getting up to speed with perf and Haskell is easy but tedious and undocumented. First, GHC HEAD (7.11) is needed, for DWARF support. GHC 7.10.1 has partial DWARF support, but some features are broken, so GHC HEAD is recommended. The issues with GHC 7.10.1 are resolved in GHC 7.10.2, which will be released shortly.

Moreover, the RTS and the libraries built as part of the GHC build process should be built with debug symbols. This can be done by adding

GhcRtsHcOpts += -g
GhcLibHcOpts += -g

to your mk/build.mk file in the ghc source tree.

Once you have GHC 7.11, special care is needed to install every library with the right options when using cabal. Specifically, we want to pass the -g options to GHC to generate debug symbols. Moreover, we must stop cabal from stripping the object files from debug symbols. For example, to install the library digest we'll use

$ cabal install --disable-library-stripping --ghc-options="-g -rtsopts" digest

Debug symbols for non-Haskell code

Moreover, when debugging Haskell code involving foreign C code, we need to make sure that we have the debug symbols for the C libraries too. This can be accomplished in various ways -- on Debian systems debug symbols are packaged separately in -dbg packages. So if we wanted to properly debug a program using the digest library, which uses zlib, we would need to issue

$ sudo apt get install zlib1g-dbg

Keeping the source code for built library

Moreover, if we want to see source code in the perf tools, as we'll see in the following sections, we need to keep the Haskell source files where they were picked up to be built. The easiest way to do this is using cabal get, for example:

$ cabal get zlib
Unpacking to zlib-0.6.1.0/
$ cd zlib-0.6.1.0
$ cabal install --disable-library-stripping --ghc-options="-g -rtsopts"

The same applies for C libraries that we want to debug with source code support.

Building the example program

As an example, we will be testing the digest-pure library, implementing the CRC32 algorithm in Haskell; against the digest library, which also offers the CRC32 algorithm by binding to the zlib library.

First we'll build the zlib library manually to keep the source code in place, taking care of adding -g to the CFLAGS in the Makefile:

$ wget http://zlib.net/zlib-1.2.8.tar.gz
$ tar xzvf zlib-1.2.8.tar.gz
$ cd zlib-1.2.8
$ nano Makefile.in # Add -g to CFLAGS
$ ./configure
$ make
$ export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH

Then, we'll install the digest and digest-pure libraries, which we need for the benchmark:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.11.20150411
$ cabal get digest
Unpacking to digest-0.0.1.2/
$ cd digest-0.0.1.2/
$ cabal install --disable-library-stripping --ghc-options="-g -rtsopts"
$ cd ..
$ cabal get digest-pure
$ cd digest-pure-0.0.3/
$ cabal install --disable-library-stripping --ghc-options="-g -rtsopts"

Note how we're getting the packages using cabal get, so that the Haskell sources will be preserved in the locations where the objects were built from.

We'll also install mwc-random without bothering to keep the sources, since we won't need to inspect its functions (we use it to generate random data to run our benchmark on):

$ cabal install --disable-library-stripping --ghc-options="-g -rtsopts" mwc-random

Making the Haskell version fast

Comparing the C and Haskell version

We're going to use a simple benchmark to compare the Haskell and C function.

$ wget https://gist.github.com/bitonic/781ff80e6eb65eebf14a/raw/7cee087dbe12c6cf813667eccb0a00d721ecb9a6/crc32.hs
$ ghc -g -O2 crc32.hs

crc32.hs let us compare two functions with an identical interface, computing the CRC32 for a given ByteString:

crc32 :: ByteString -> Word32

digest implements this function by calling a foreign crc32 routine provided by zlib, while digest-pure implements it natively in Haskell.

Now we're ready to compare the two versions, one using the pure CRC32, and one using zlib. First we generate some random data to work on

$ ./crc32 generate

This will write a data file to disk, which we can use in all subsequent tests. Then we time each version:

$ time ./crc32 hask # Using digest-pure
./crc32 hask  34,55s user 0,01s system 99% cpu 34,599 total
$ time ./crc32 c # Using digest
./crc32 c  10,42s user 0,00s system 99% cpu 10,444 total

The Haskell version is more than 3 times slower than the C one. Let's try to find out why.

Finding out where time is spent

The first thing we'll do is understand where time is spent. We expect some CRC32 function to make up for most of the running time, but we'd like to verify this. Moreover, we'd like to know what exactly is taking time in the CRC32 function itself.

Let's run the two versions again, but this time using perf record:

$ perf record -o perf-hask.data ./crc32 hask
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.346 MB perf-hask.data (~15126 samples) ]
$ perf record -o perf-c.data ./crc32 c
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.110 MB perf-c.data (~4789 samples) ]

perf record will sample the execution with the aid of the perf_event_open linux syscall. Using perf report -i perf-hask.data we can inspect the result of the Haskell version. The main screen shows a breakdown of where time is spent in the executable:

Samples: 8K of event 'cpu-clock', Event count (approx.): 2233500000
  99,75%  crc32  crc32              [.] s4Tw_info
   0,10%  crc32  crc32              [.] c5gn_info
   0,03%  crc32  crc32              [.] c5g3_info
   0,02%  crc32  crc32              [.] c2k_info
   0,02%  crc32  crc32              [.] c7F_info
   0,01%  crc32  [kernel.kallsyms]  [k] __do_softirq
   0,01%  crc32  crc32              [.] allocatePinned
   0,01%  crc32  crc32              [.] c6le_info
   0,01%  crc32  crc32              [.] chg3_info
   0,01%  crc32  crc32              [.] 0x000000000041ca88
   0,01%  crc32  libc-2.19.so       [.] _dl_addr

Symbols like c5gn_info are generated by the GHC code generator. As we can see, almost all the time is spent in a single symbol, s4Tw_info. We can drill down into s4Tw_info, and "annotate" it, to get a precise breakdown of where time is spent in s4Tw_alone:

s4Tw_info  /home/francesco/crc32/crc32
       │    Disassembly of section .text:
       │
       │    000000000063a738 <s4Tw_info>:
       │        crc = flipAll prevCRC
       │        flipAll = xor 0xffffffff
       │
       │    -- | Unsafe array access.
       │    (!!!) :: (IArray a e, Ix i, Integral i) => a i e -> i -> e
       │    arr !!! i = unsafeAt arr $ fromIntegral i
  0,79 │ 0:┌─→add    $0x10,%r12
 10,33 │   │  cmp    0x358(%r13),%r12
       │   │↓ ja     27
  0,04 │   │  cmp    %rdi,%rsi
  0,01 │   │↓ jne    36
       │   │  movq   $0xa737a0,-0x8(%r12)
       │   │  mov    %r14,(%r12)
       │   │  lea    -0x7(%r12),%rbx
       │   │↓ jmpq   ffffffffff9c58c8
       │27:│  movq   $0x10,0x388(%r13)
       │   │↓ jmpq   ffffffffff9c58c8
       │   │  flipAll (tblEntry `xor` (crc `shiftR` 8))
       │   │  where
       │   │    -- Note: unsafe access is ok here, we guarantee that the value is within
       │   │    -- (0..255), crc32Table covers that range.
       │   │    tblEntry = crc32Table !!! ((crc `xor` (fromIntegral c)) .&. 0xff)
       │   │    crc = flipAll prevCRC
  0,76 │36:│  add    $0xfffffffffffffff0,%r12
  0,02 │   │  mov    $0xffffffff,%eax
 10,01 │   │  xor    %rax,%r14
  0,03 │   │  mov    $0xffffffff,%eax
  1,07 │   │  mov    %r14,%rcx
       │   │  shr    $0x8,%rcx
 10,55 │   │  xor    %rax,%rcx
  0,10 │   │  mov    0x4(%rbx),%rax
  1,03 │   │  movzbl (%rsi),%edx
  0,16 │   │  xor    %rdx,%r14
  9,69 │   │  and    $0xff,%r14d
  0,75 │   │  mov    0x10(%rax,%r14,4),%eax
 44,32 │   │  xor    %rcx,%rax
 10,31 │   │  inc    %rsi
  0,01 │   │  mov    %rax,%r14
       │   └──jmp    0

What this breakdown tells us is that a whopping 75% of the time is spent in xor and and instructions alone. Another 20% of the time is spent in looping code: a cmp at the beginning, and an inc at the end -- which keep track of an index in the ByteString we're digesting to find out when we're done traversing it. Moreover, we get the Haskell source code interleaved with the assembly deriving from it, which lets us understand what Haskell code parts of s4Tw_info correspond to. In this case, the part that eats almost all of the time is the section of the code corresponding to the main section of the CRC32 algorithm, which is what we expected. What the code does is get a value out of pre computed table, and then use it together with the current CRC to get the new value.

To get more information about the symbol, we can also use objdump on the binary to get more information about the "compile unit" s4Tw_info comes from:

$ objdump -Wi crc32
[...]
  Compilation Unit @ offset 0xdbd68:
   Length:        0x629 (32-bit)
   Version:       3
   Abbrev Offset: 0x6cb
   Pointer Size:  8
 <0><dbd73>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <dbd74>   DW_AT_name        : src/Data/Digest/Pure/CRC32.hs 
    <dbd92>   DW_AT_producer    : The Glorious Glasgow Haskell Compilation System 7.11.20150411 
    <dbdd0>   DW_AT_language    : 0x18  (Unknown: 18)
    <dbdd4>   DW_AT_comp_dir    : /tmp/digest-pure-0.0.3/
    <dbdec>   DW_AT_use_UTF8    : 255
    <dbded>   DW_AT_stmt_list   : 0x32790
[...]
 <1><dbf83>: Abbrev Number: 2 (DW_TAG_subprogram)
    <dbf84>   DW_AT_name        : !!!
    <dbf88>   DW_AT_MIPS_linkage_name: s4Tw_info
    <dbf92>   DW_AT_external    : 0
    <dbf93>   DW_AT_low_pc      : 0x63afc8
    <dbf9b>   DW_AT_high_pc     : 0x63b03a
    <dbfa3>   DW_AT_frame_base  : 1 byte block: 9c      (DW_OP_call_frame_cfa)
[...]

Which tells us the exact file the symbol comes from. The DWARF information also takes a stab at pairing the symbol with a name, although it is often difficult to do so: in this case it picked the inlined !!! function, which is misleading.

Check out this screencast for a taste of what interacting with perf on this data set looks like.

Now let's do the same with the C version, using perf report -i perf-c.data:

asciicast

In this case almost all the time is spent in libz.so crc32 function, which is again what we expect. However when drilling into the crc32 function, we get a much bigger disassemble output. While confusing at first, it is quite clear that all the time is spent in the mysterious DOLIT32, a rather long section of the code:

       │             DOLIT32;
  0,04 │ 70:   xor    (%r8),%edi
  1,29 │       mov    %rdi,%rax
       │       mov    %edi,%r11d
       │       movzbl %ah,%ebx
  2,31 │       movzbl %dil,%eax
  0,04 │       shr    $0x18,%edi
       │       mov    0xc00(%rcx,%rax,4),%eax
  3,33 │       xor    (%rcx,%rdi,4),%eax
  2,76 │       shr    $0x10,%r11d
  [...omitted code...]
       │       shr    $0x18,%edi
  1,17 │       shr    $0x10,%r11d
  0,08 │       movzbl %ah,%eax
  1,06 │       mov    (%rcx,%rdi,4),%edi
  4,20 │       xor    0xc00(%rcx,%rbx,4),%edi
  2,38 │       movzbl %r11b,%r11d
       │       xor    0x800(%rcx,%rax,4),%edi
  1,14 │       xor    0x400(%rcx,%r11,4),%edi

In fact, summing up the small percentages in DOLIT32 we find out that around 85% of the time is spent there. To find out what DOLIT32 does we need to consult the zlib source code. DOLIT32 is a macro repeating DOLIT4 8 times. DOLIT4 is a macro which computes the CRC32 4 bytes at a time, assuming little endianness. Equivalent macros for big endian architecture are present.

Why is the Haskell version is slower

The fact that zlib computes the CRC32 4 bytes at a time, and that it unrolls the main loop makes the code per byte needed to compute the CRC32 much smaller. The loop (from cmp to jmp) with DOLIT32 as a body uses around 110 instructions to compute the CRC32 for 32 bytes (≈3.5 instructions per byte), while the loop generated from the pure Haskell version uses 25 instructions for just one byte. This should amount to a seven fold increase in the instructions executed by the Haskell program.

We can easily verify this fact using perf. In this case we'll use perf stat, which gives us global statistics, since we know that one symbol is taking up all the time. Otherwise we could use perf record -e to record the events we're interested and get figures for the specific symbols.

$ perf stat ./crc32 hask

 Performance counter stats for './crc32 hask':

      35623,015084 task-clock (msec)         #    0,995 CPUs utilized
             4.982 context-switches          #    0,140 K/sec
               207 cpu-migrations            #    0,006 K/sec
             3.318 page-faults               #    0,093 K/sec
    94.594.036.631 cycles                    #    2,655 GHz
    44.565.899.687 stalled-cycles-frontend   #   47,11% frontend cycles idle
   <not supported> stalled-cycles-backend
   210.111.081.169 instructions              #    2,22  insns per cycle
                                             #    0,21  stalled cycles per insn
    30.020.583.035 branches                  #  842,730 M/sec
         1.138.765 branch-misses             #    0,00% of all branches

      35,788883623 seconds time elapsed

$ perf stat ./crc32 c

 Performance counter stats for './crc32 c':

      10690,308263 task-clock (msec)         #    0,995 CPUs utilized
             1.754 context-switches          #    0,164 K/sec
                63 cpu-migrations            #    0,006 K/sec
             3.314 page-faults               #    0,310 K/sec
    28.453.816.903 cycles                    #    2,662 GHz
    18.427.339.226 stalled-cycles-frontend   #   64,76% frontend cycles idle
   <not supported> stalled-cycles-backend
    31.027.193.409 instructions              #    1,09  insns per cycle
                                             #    0,59  stalled cycles per insn
       330.594.059 branches                  #   30,925 M/sec
           670.612 branch-misses             #    0,20% of all branches

      10,743639214 seconds time elapsed

The first thing to notice is the difference in the number of instructions. The Haskell program uses 201 billions, versus the 31 billions of the C program. A 6.5 difference, which is a factor close to our rough estimate of 7.

Note that the 6.5 difference in instructions does not translate to a 6.5 performance increase in the C program. This can be due to a variety of reasons, but the perf stat output already shows a factor: the C version spends more time stalled, as evident from the stalled-cycles-frontend hardware counter. This is most likely since the C code spends less time in the loop code, and more time fetching bytes from the input buffer and from the static tables used to compute the CRC32. In fact, the overall productivity is lower: 1 instruction per cycle in C versus more than 2 in Haskell. Thus, we only get a 3x speedup in C , as we have seen.

Making the Haskell version faster

Now that we know the main reason for the Haskell version being slower, we can make the Haskell version more similar to the C one. The first measure we adopted is to implement the CRC32 algorithm so that it consumes 4 bytes at a time. We choose to do this before loop unrolling because it has more possibilities to impact performance. For instance it makes the algorithm less sequential and increases the chances for pipelining, since we don't have to wait for each byte to continue.

To do so, we will split the input ByteString in three sections: an aligned vector of Word32s, and the remaining unaligned leading and trailing bytes. For example, for a ByteString of length 26 starting at address 9, we will have 3 leading bytes (up to address 12), 6 aligned words (up to address 38), and 2 trailing bytes. This will give us a memory-aligned section to perform the CRC32 word-by-word instead of byte-by-byte. The type signature for the function splitting the ByteString will be

getWord32Vector
  :: BS.ByteString
  -> (BS.ByteString -> V.Vector Word32 -> BS.ByteString -> IO a)
  -> IO a

We use a continuation-passing style since we need to manipulate the pointer underlying the ByteString to know where the aligned section of the memory starts, and the functions that let us do this use CPS themselves.

Once we have this function, the ugly part is over: we can very easily port the algorithm present in zlib to Haskell, but in a more functional style:

{-# NOINLINE crc32UpdateLittle #-}
crc32UpdateLittle
  :: Word32
  -- ^ The previous CRC
  -> BS.ByteString
  -- ^ The input to digest
  -> Word32
  -- ^ Resulting CRC
crc32UpdateLittle prevC bs =
  unsafePerformIO $ getWord32Vector bs $ \leading aligned trailing -> do
    let c0 = complement prevC
    let c1 = BS.foldl' word8Step c0 leading  -- Perform CRC32 on the leading bytes...
    let c2 = V.foldl' word32Step c1 aligned  -- ...on the aligned words...
    let c3 = BS.foldl' word8Step c2 trailing -- ...and on the trailing bytes.
    return $ complement c3
  where
    {-# INLINE word32Step #-}
    word32Step :: Word32 -> Word32 -> Word32
    word32Step c0 word =
      (crc32Table_3 !!! (c .&. 0xff)) `xor`
      (crc32Table_2 !!! ((c `unsafeShiftR` 8) .&. 0xff)) `xor`
      (crc32Table_1 !!! ((c `unsafeShiftR` 16) .&. 0xff)) `xor`
      (crc32Table_0 !!! (c `unsafeShiftR` 24))
      where
        c = c0 `xor` word

    {-# INLINE word8Step #-}
    word8Step :: Word32 -> Word8 -> Word32
    word8Step c byte =
      (crc32Table_0 !!! ((c `xor` fromIntegral byte) .&. 0xff)) `xor` (c `unsafeShiftR` 8)

The various crc32Table_n are various pre-computed tables needed to perform the algorithm, and !!! is an unsafe indexing operator (we know we're going to be within the bounds). The Little is to indicate that this is the algorithm for little-endian architectures. A matching crc32UpdateBig has to be defined for big-endian architectures.

You can find the full source in my fork of digest-pure.

After we recompile digest-pure again with the changes above, and recompile our crc32.hs benchmark, we can run it again using the Haskell version:

$ time ./crc32 hask
./crc32 hask  12,87s user 0,02s system 99% cpu 12,903 total

With the new algorithm we get a three fold speedup, and we're only 20% slower than the C version.

$ perf stat ./crc32 hask

 Performance counter stats for './crc32 hask':

      13025,715282 task-clock (msec)         #    0,993 CPUs utilized
             1.668 context-switches          #    0,128 K/sec
                61 cpu-migrations            #    0,005 K/sec
             3.340 page-faults               #    0,256 K/sec
    35.174.753.255 cycles                    #    2,700 GHz
    15.093.419.773 stalled-cycles-frontend   #   42,91% frontend cycles idle
   <not supported> stalled-cycles-backend  
    77.735.375.354 instructions              #    2,21  insns per cycle
                                             #    0,19  stalled cycles per insn
     2.541.638.382 branches                  #  195,125 M/sec
           956.494 branch-misses             #    0,04% of all branches

      13,121155788 seconds time elapsed

The number of instructions is reduced by almost three times, while the productivity stays the same -- which explains the threefold increase in performance. Note that we're still using more than twice the instructions as the C program, due to loop unrolling. However, loop unrolling seems to push the program towards being memory-bound and thus does not result in a great performance increase. We have not pursued this road further.

However, when analyzing the program through perf report, we get somewhat surprising results:

       │    crc32UpdateLittle :: Word32 -> BS.ByteString -> Word32
       │    crc32UpdateLittle prevC bs =
       │      unsafePerformIO $ getWord32Vector bs $ \leading aligned trailing -> do
       │        let c0 = complement prevC
       │        let c1 = BS.foldl' word8Step c0 leading
       │        let c2 = V.foldl' word32Step c1 aligned
  0,05 │ 9:   mov    0xe(%rbx),%rax
  0,03 │      mov    0x16(%rbx),%rax
  7,21 │      mov    0x1e(%rbx),%rax
  0,01 │      mov    0x26(%rbx),%rax
  0,06 │      mov    0x3e(%rbx),%rax
  0,02 │      mov    0x46(%rbx),%rcx
  7,27 │      mov    0x4e(%rbx),%rdx
  0,02 │      mov    0x56(%rbx),%rdi
  0,08 │      mov    0x36(%rbx),%r8
  0,03 │      mov    (%r8,%rsi,4),%r8d
  7,34 │      xor    %r8,%r14
  0,07 │      mov    %r14,%r8
  0,06 │      and    $0xff,%r8d
  0,09 │      mov    (%rax,%r8,4),%eax
 29,32 │      mov    %r14,%r8
  0,01 │      shr    $0x8,%r8
  0,02 │      and    $0xff,%r8d
       │      mov    (%rcx,%r8,4),%ecx
 10,83 │      mov    %r14,%r8
  0,00 │      shr    $0x10,%r8
  0,02 │      and    $0xff,%r8d
  0,01 │      mov    (%rdx,%r8,4),%edx
  8,89 │      shr    $0x18,%r14
  0,00 │      mov    (%rdi,%r14,4),%edi
  1,18 │      inc    %rsi
  0,00 │      xor    %rdi,%rdx
 13,49 │      xor    %rdx,%rcx
  6,75 │      xor    %rcx,%rax
  7,15 │      mov    %rax,%r14

mov instructions from register to register make up for more than 60% of the time spent in the critical section of the code, while we would expect most of the time to be spent xoring and anding. I have not investigated why this is the case, ideas welcome!

Acknowledgements

Thanks to Peter Wortmann and to Arash Rouhani for the very informative discussions on the status of DWARF in GHC. Also thanks to Niklas Hambüchen for reviewing a draft of this blogpost.