Monday, 25 June 2012

Slowdown with GHC when using multiple CPUs

Recently I was writing a tool to perform a simple calculation on several large fasta sequence files. When I had it working, I thought "cool, this is haskell, let's speed it up by simply running on multiple processors". Everything worked beautifully with a near linear speedup by increasing number of cores up to 4.

However, I later discovered a serious problem. I was running it on a shared server which had one of its 4 cores pegged at 100% by another user. I didn't notice this straight away, and ran my tool telling it to use all available cores, and instead of the nice speed up, it took significantly longer than using a single core!

I have since discovered that this problem is documented in the GHC docs:
Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.
I think this is a serious problem, because I just want to be able to provide a tool for people to use. It is not realistic to have the users of the tool estimate how many CPU cores to use, and in fact, it is very difficult to estimate since the tool might be long-running, and the load on the shared server quite variable.

I have since reproduced this problem with code taken directly from the haskell wiki
{-# LANGUAGE BangPatterns #-}
    import Data.Digest.Pure.MD5
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import Control.Concurrent
    import Control.Monad (replicateM_)
    main = do
        files <- getArgs
        str <- newEmptyMVar
        mapM_ (forkIO . hashAndPrint str) files
        printNrResults (length files) str
    printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)
    hashAndPrint str f = do
        bs <- L.readFile f
        let !h = show $ md5 bs
        putMVar str (f ++ ": " ++ h)
I compiled this using GHC v7.4.1 on Mac OSX 10.7, using the follow command
   ghc --make -rtsopts -threaded -O2 run.hs
I tested this on my desktop which has 4 real cores, with reading in 4 large files simultaneously. There was an almost linear speedup when using 2 or 4 cores:
  ./run +RTS -N1   : 20.4 sec
  ./run +RTS -N2   : 11.0 sec
  ./run +RTS -N4   : 6.7 sec

To demonstrate the problem, I then ran 2 different processes that would just keep 2 of my CPU cores busy. That is, for these tests 2 of my 4 CPU cores already had 100% load.
  ./run +RTS -N1   : 23.5 sec
  ./run +RTS -N2   : 14.1 sec
  ./run +RTS -N4   : 57.8 sec   <---- Blowout...

Notice that when told there are 4 available cores to run on, but in fact 2 of those cores were already busy, the program took over twice as long as when told to only use 1 core. Running the program with "-sstderr" typically showed "%GC time" at around 6%. In the final test case this increased to 52.4%.

This is quite a concern in practice since it is desirable to be able to distribute a binary that is always run with "+RTS -N" (telling GHC to automatically determine the number of capabilities). However, this will work very poorly on machines with unpredictable load, such as shared servers.

Output from ThreadScope

Below I show 2 screenshots from ThreadScope from running the test program with -N4.  First with all CPUs idle, then with 2 of the 4 CPUs already busy.

Clearly the issue is some interaction with garbage collection.  I'd love to know if this is a fundamental problem with the GHC RTS scheduler or if there is something that could be done to improve this situation.

Update (2012-07-02)

After posting to haskell-cafe, an explanation from Alexander Morozov suggested it was the parallel GC.  This has to stop all threads when doing GC, and if one of the threads can't complete GC during its time-slice then all threads have to block until that thread gets another time-slice to complete its GC.  He suggested using the single threaded GC which performed better on his machine.  Re-running my tests using (-qg)

Single threaded GC, with no load on CPUs:
  ./run +RTS -N1 -qg  : 18.7 sec
  ./run +RTS -N2 -qg  : 10.2 sec
  ./run +RTS -N4 -qg  : 8.0 sec

Single threaded GC, with 2 of 4 CPUs busy:
  ./run +RTS -N1 -qg  : 21.5 sec
  ./run +RTS -N2 -qg  : 11.8 sec
  ./run +RTS -N4 -qg  : 12.1 sec

These numbers are much better - so I think I'd be recommending single threaded GC on a machine with unpredictable load.

1 comment:

  1. What does profiling the execution with shark or oprofile reveal?