wiki:DataParallel/BenchmarkStatus

Version 12 (modified by chak, 5 years ago) (diff)

--

Status of DPH Benchmarks

This page gives an overview of how well the benchmarks in the examples/ directory of package dph are currently working.

Overview over the benchmark programs

DotP
Computes the dot product of two vectors of Doubles. There are two variants of this program: (1) "primitives" is directly coded against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. In addition to these two DPH variants of the dot product, we also have two non-DPH reference implementations: (a) "ref Haskell" is a Haskell program using imperative, unboxed arrays and and (b) "ref C" is a C implementation using pthreads.
SMVM
Multiplies a dense vector with a sparse matrix represented in the compressed sparse row format (CSR). There are three variants of this program: (1) "primitives" is directly coded against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser.

Execution on LimitingFactor (2x Quad-Core Xeon)

Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2; 256-bit-wide memory architecture; Mac OS X Server 10.5.6

Software spec: GHC 6.11 (from end of Feb 09); gcc 4.0.1

Program Problem size sequential 1 core 2 cores 4 cores 8 cores
DotP, primitives 100M elements 823/823/824 812/813/815 408/408/409 220/223/227 210/214/221
DotP, vectorised 100M elements 823/824/824 814/816/818 412/417/421 222/225/227 227/232/238
DotP, ref Haskell 100M elements 810 601 221 209
DotP, ref C 100M elements 458 235 210 210
SMVM, primitives ?? elems, density ??
SMVM, vectorised ?? elems, density ??

All results are in milliseconds, and the triples report best/average/worst execution case time (wall clock) of three runs. The column marked "sequential" reports times when linked against dph-seq and the columns marked "N cores" report times when linked against dph-par and run in parallel on the specified number of processor cores.

NB: At the moment, the CPU times goes up proportional to the number of cores added beyond four cores (hence, no wall clock speed up beyond four cores). Not clear where this comes from.


Program Sequential (manually vectorised) Vectorised Parallel
DotP Order of mag. faster than list impl Same performance as seq. speedup of 2 for 2 CPUs, 4 threads
QuickSort Slower than list (fusion) Slower than seq. (why?) speedup of 1.4 on 2 CPUs
SparseVector Similar to DotP
Primes (Nesl) 15 x faster than list version NYI 20 x slower than seq (fusion?)
Primes (Simon) NYI Working NYI
BarnesHut Small bug in alg Working See seq.

General remarks:

  • I only ran a first set of benchmarks when checking for what's there. I'll run the benchmarks properly as next step
  • Fusion doesn't work well on parallel programs yet, so for all but simple examples, the parallel program performs worse than the sequential
  • The compiler doesn't exploit all fusion opportunities for QSort and BarnesHut. Once this is fixed, they should run considerably faster.
  • Interestingly, the automatically vectorised version of qsort is quite a bit faster than the hand-flattened. Need to find out why.