Monday, 25 June 2012

Slowdown with GHC when using multiple CPUs

Recently I was writing a tool to perform a simple calculation on several large fasta sequence files. When I had it working, I thought "cool, this is haskell, let's speed it up by simply running on multiple processors". Everything worked beautifully with a near linear speedup by increasing number of cores up to 4.

However, I later discovered a serious problem. I was running it on a shared server which had one of its 4 cores pegged at 100% by another user. I didn't notice this straight away, and ran my tool telling it to use all available cores, and instead of the nice speed up, it took significantly longer than using a single core!

I have since discovered that this problem is documented in the GHC docs:
Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.
I think this is a serious problem, because I just want to be able to provide a tool for people to use. It is not realistic to have the users of the tool estimate how many CPU cores to use, and in fact, it is very difficult to estimate since the tool might be long-running, and the load on the shared server quite variable.

I have since reproduced this problem with code taken directly from the haskell wiki
{-# LANGUAGE BangPatterns #-}
    import Data.Digest.Pure.MD5
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import Control.Concurrent
    import Control.Monad (replicateM_)
    main = do
        files <- getArgs
        str <- newEmptyMVar
        mapM_ (forkIO . hashAndPrint str) files
        printNrResults (length files) str
    printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)
    hashAndPrint str f = do
        bs <- L.readFile f
        let !h = show $ md5 bs
        putMVar str (f ++ ": " ++ h)
I compiled this using GHC v7.4.1 on Mac OSX 10.7, using the follow command
   ghc --make -rtsopts -threaded -O2 run.hs
I tested this on my desktop which has 4 real cores, with reading in 4 large files simultaneously. There was an almost linear speedup when using 2 or 4 cores:
  ./run +RTS -N1   : 20.4 sec
  ./run +RTS -N2   : 11.0 sec
  ./run +RTS -N4   : 6.7 sec

To demonstrate the problem, I then ran 2 different processes that would just keep 2 of my CPU cores busy. That is, for these tests 2 of my 4 CPU cores already had 100% load.
  ./run +RTS -N1   : 23.5 sec
  ./run +RTS -N2   : 14.1 sec
  ./run +RTS -N4   : 57.8 sec   <---- Blowout...

Notice that when told there are 4 available cores to run on, but in fact 2 of those cores were already busy, the program took over twice as long as when told to only use 1 core. Running the program with "-sstderr" typically showed "%GC time" at around 6%. In the final test case this increased to 52.4%.

This is quite a concern in practice since it is desirable to be able to distribute a binary that is always run with "+RTS -N" (telling GHC to automatically determine the number of capabilities). However, this will work very poorly on machines with unpredictable load, such as shared servers.

Output from ThreadScope

Below I show 2 screenshots from ThreadScope from running the test program with -N4.  First with all CPUs idle, then with 2 of the 4 CPUs already busy.

Clearly the issue is some interaction with garbage collection.  I'd love to know if this is a fundamental problem with the GHC RTS scheduler or if there is something that could be done to improve this situation.

Update (2012-07-02)

After posting to haskell-cafe, an explanation from Alexander Morozov suggested it was the parallel GC.  This has to stop all threads when doing GC, and if one of the threads can't complete GC during its time-slice then all threads have to block until that thread gets another time-slice to complete its GC.  He suggested using the single threaded GC which performed better on his machine.  Re-running my tests using (-qg)

Single threaded GC, with no load on CPUs:
  ./run +RTS -N1 -qg  : 18.7 sec
  ./run +RTS -N2 -qg  : 10.2 sec
  ./run +RTS -N4 -qg  : 8.0 sec

Single threaded GC, with 2 of 4 CPUs busy:
  ./run +RTS -N1 -qg  : 21.5 sec
  ./run +RTS -N2 -qg  : 11.8 sec
  ./run +RTS -N4 -qg  : 12.1 sec

These numbers are much better - so I think I'd be recommending single threaded GC on a machine with unpredictable load.

Sunday, 24 June 2012

Awesome shell feature : Process substitution

Just last week I was discussing with colleagues at the VBC that mkfifo is useful, but what you typically want is to have the fifo created and cleaned up for you by the shell.  It turns out this already exists in popular shells... How is it that I've never come across this in all my years using unix!?!

Working in bioinformatics we find many tools that cannot decompress sequence files automatically.   When the tool in question only takes a single sequence file we can typically use a standard shell pipe, for example :
    zcat seq.fa.gz | wc

However, when the tool takes multiple files we need another approach.   Imagine we have a tool, "seqs", which takes two file parameters "-file1" and "-file2" but we want to give it the compressed files without decompressing them to the filesystem.  We could do this using mkfifo :
    mkfifo pipe1 pipe2
    zcat seq1.fa.gz > pipe1 &
    zcat seq2.fa.gz > pipe2 &
    seqs -file1 pipe1 -file2 pipe2
    rm pipe1 pipe2

There is a much nicer way to do this using Process substitution
    seqs -file1 <(zcat seq1.fa.gz) -file2 <(zcat seq2.fa.gz)

It is also possible to use this for a programs output.  This is often useful in conjunction with the tee tool to send a programs output to several different programs.  Here's an example, from the tee manual that demonstrates download a file, and computing the sha1 hash and md5 hash at the same time :
    wget -O - \
      | tee >(sha1sum > dvd.sha1) \
            >(md5sum > dvd.md5) \
      > dvd.iso

Process substitution is supported by bash and zsh, but not by csh based shells.

One limitation of using process substitution this way is that it uses pipes, so if your tool tries to seek() around in the file it will fail.  zsh has a nice feature to overcome this limitation by creating a temporary file and deleting it afterwards : simple use =(...) instead of <(...).