|Version 27 (modified by chak, 6 years ago) (diff)|
Status of DPH Benchmarks
This page gives an overview of how well the benchmarks in the examples/ directory of package dph are currently working.
Overview over the benchmark programs
- Computes the sum of the squares from 1 to N using Int. There are two variants of this program: (1) "primitives" is directly coded against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. As a reference implementation, we have a sequential C program denoted by "ref C".
- Computes the dot product of two vectors of Doubles. There are two variants of this program: (1) "primitives" is directly coded against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. In addition to these two DPH variants of the dot product, we also have two non-DPH reference implementations: (a) "ref Haskell" is a Haskell program using imperative, unboxed arrays and and (b) "ref C" is a C implementation using pthreads.
- Multiplies a dense vector with a sparse matrix represented in the compressed sparse row format (CSR). There are three variants of this program: (1) "primitives" is directly coded against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. As a reference implementation, we have a sequential C program denoted by "ref C".
- The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of Bools. We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of Bool dependent on whether the hardware supports atomic writes of bytes. Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka Word8).
Execution on LimitingFactor (2x Quad-Core Xeon)
Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2 FB-DIMM; 256-bit-wide memory architecture; Mac OS X Server 10.5.6
Software spec: GHC 6.11 (from end of Feb 09); gcc 4.0.1
|SumSq, ref C||10M||9||–||–||–||–|
|DotP, primitives||100M elements||823/823/824||812/813/815||408/408/409||220/223/227||210/214/221|
|DotP, vectorised||100M elements||823/824/824||814/816/818||412/417/421||222/225/227||227/232/238|
|DotP, ref Haskell||100M elements||–||810||437||221||209|
|DotP, ref C||100M elements||–||458||235||210||210|
|SMVM, primitives||100kx100k @ density 0.001||119/119||254/254||154/154||90/90||67/67|
|SMVM, vectorised||100kx100k @ density 0.001||_|_||_|_||_|_||_|_||_|_|
|SMVM, ref C||100kx100k @ density 0.001||46||–||–||–||–|
All results are in milliseconds, and the triples report best/average/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against dph-seq and the columns marked "P=n" report times when linked against dph-par and run in parallel using the specified number of parallel OS threads.
Comments regarding SumSq
The versions compiled against dph-par are by factor of two slower than the ones linked against dph-seq. This is as the parallel versions needs to compute the length of the array to determine how to split the work.
However, found a number of general problems when working on this example:
- We need an extra -funfolding-use-threshold. We don't really want users having to worry about that.
- mapP (\x -> x * x) xs essentially turns into zipWithU (*) xs xs, which doesn't fuse with enumFromTo anymore. We have a rewrite rule in the library to fix that, but that's not general enough. We really would rather not vectorise the lambda abstraction at all.
- enumFromTo doesn't fuse due to excessive dictionaries in the unfolding of zipWithUP.
- Finally, to achieve the current result, we needed an analysis that avoids vectorising subcomputations that don't to be vectorised, and worse, that fusion has to turn back into their original form. In this case, the lambda abstraction \x -> x * x. This is currently implemented in a rather limited and ad-hoc way. We should implement this on the basis of a more general analysis.
Comments regarding DotP
Performance is memory bound, and hence, the benchmark stops scaling once the memory bus saturated. As a consequence, the wall-clock execution time of the Haskell programs and the C reference implementation are the same when all available parallelism is exploited. The parallel DPH library delivers the same single core performance as the sequential one in this benchmark.
Comments regarding smvm
There seems to be a fusion problem in DotP with dph-par (even if the version of zipWithSUP that uses splitSD/joinSD is used); hence the much lower runtime for "N=1" than for "sequential". The vectorised version runs out of memory; maybe because we didn't solve the bpermute problem, yet.
Execution on greyarea (1x UltraSPARC T2)
Hardware spec: 1x 1.4GHz UltraSPARC T2; 8 cores/processors with 8 hardware threads/core; 4MB on-die L2 cache per processor; FB-DIMM; Solaris 5.10
Software spec: GHC 6.11 (from end of Feb 09) with gcc 4.1.2 for Haskell code; gccfss 4.0.4 (gcc front-end with Sun compiler backend) for C code (as it generates code that is more than twice as fast for numeric computations than vanilla gcc)
|SumSq, ref C||10M||120||–||–||–||–||–||–||–|
|DotP, primitives||100M elements||937/937||934/934||474/474||238/238||120/120||65/65||38/38||28/28|
|DotP, vectorised||100M elements||937/937||942/942||471/471||240/240||118/118||65/65||43/43||29/29|
|DotP, ref Haskell||100M elements||–||934||467||238||117||61||65||36|
|DotP, ref C||100M elements||–||554||277||142||72||37||22||20|
|SMVM, primitives||100kx100k @ density 0.001||1112/1112||1926/1926||1009/1009||797/797||463/ 463||326/326||189/189||207/207|
|SMVM, vectorised||100kx100k @ density 0.001||_|_||_|_||_|_||_|_||_|_||_|_||_|_||_|_|
|SMVM, ref C||100kx100k @ density 0.001||600||–||–||–||–||–||–||–|
All results are in milliseconds, and the triples report best/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against dph-seq and the columns marked "P=n" report times when linked against dph-par and run in parallel using the specified number of parallel OS threads.
Comments regarding SumSq
The primitives scale nicely, but something is deeply wrong (lack of fusion, perhaps) with the vectorised version.
Comments regarding DotP
The benchmark scales nicely up to the maximum number of hardware threads. Memory latency is largely covered by excess parallelism. It is unclear why the Haskell reference implementation "ref Haskell" falls of at 32 and 64 threads. See also a comparison graph between LimitingFactor and greyarea.
Comments regarding smvm
As on LimitingFactor, but it scales much more nicely and improves until using four threads per core. This suggets that memory bandwidth is again a critical factor in this benchmark (this fits well with earlier observations on other architectures). Despite fusion problem with dph-par, the parallel Haskell program, using all 8 cores, still ends up three times faster than the sequential C program.
|Program||Sequential (manually vectorised)||Vectorised||Parallel|
|DotP||Order of mag. faster than list impl||Same performance as seq.||speedup of 2 for 2 CPUs, 4 threads|
|QuickSort||Slower than list (fusion)||Slower than seq. (why?)||speedup of 1.4 on 2 CPUs|
|SparseVector||Similar to DotP|
|Primes (Nesl)||15 x faster than list version||NYI||20 x slower than seq (fusion?)|
|BarnesHut||Small bug in alg||Working||See seq.|
- I only ran a first set of benchmarks when checking for what's there. I'll run the benchmarks properly as next step
- Fusion doesn't work well on parallel programs yet, so for all but simple examples, the parallel program performs worse than the sequential
- The compiler doesn't exploit all fusion opportunities for QSort and BarnesHut. Once this is fixed, they should run considerably faster.
- Interestingly, the automatically vectorised version of qsort is quite a bit faster than the hand-flattened. Need to find out why.