
Status of DPH Benchmarks
This page gives an overview of how well the benchmarks in the dph-examples/ directory of package dph are currently working.
The benchmarks are run each night by DPH BuildBot. The results are posted to cvs-ghc and uploaded to http://log.ouroborus.net/limitingfactor/dph/. Check there for the latest numbers.
Key
<project>.<benchmark>.<version>.<parallelism>.<threads>
Project
 Either dph or repa. Repa programs use the same parallel array library as DPH, but do not go through the vectorising transform.
Version
 vectorised means it's been through the DPH vectorising transform.
 vector is a hand-written version using immutable Data.Vectors.
 vectormutable is a hand-written version using mutable Data.Vectors.
 vectorimmutable means the same as vector and is used when there is also a mutable version.
Parallelism
 Whether a benchmark is natively parallel or sequential.
 Parallel versions are also run single-threaded (N1), and sequential versions are also run with N4 so that they get the parallel GC.
 Parallel versions with N1 will tend to be slower than natively sequential versions due to overheads for supporting parallelism.
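The key format above can be taken apart mechanically. The following is a minimal sketch, not part of the benchmark harness: the type BenchKey and the functions splitOn and parseKey are hypothetical names introduced here for illustration, and it assumes that no field contains a dot.

```haskell
-- Hypothetical record for the five dot-separated fields of a benchmark key.
data BenchKey = BenchKey
  { project, benchmark, version, parallelism, threads :: String }
  deriving (Eq, Show)

-- Split a string on a separator character (Prelude only).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Accept a key only if it has exactly five fields.
parseKey :: String -> Maybe BenchKey
parseKey s = case splitOn '.' s of
  [p, b, v, par, t] -> Just (BenchKey p b v par t)
  _                 -> Nothing
```

For example, parseKey "dph.sumsq.vectorised.par.N4" yields a key whose version field is "vectorised" and whose threads field is "N4", while a string with fewer fields yields Nothing.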
Statically Nested
Statically nested parallelism is where the parallelism has a fixed, finite depth. For example: mapP f (filterP g xs). Statically nested programs are easier to vectorize than dynamically nested programs. At present, single-threaded statically nested programs should run as fast as equivalent Data.Vector programs. Parallel versions should display a good speedup.
 SumSquares

Computes the sum of the squares from 1 to N using Int. N = 100M.
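As a reference for what is being computed, here is a minimal list-based sketch. It is not the benchmark source (the benchmark versions use Data.Vector or DPH parallel arrays); the name sumSquares is introduced here for illustration.

```haskell
import Data.List (foldl')

-- Sum of the squares from 1 to n, accumulated strictly to avoid
-- building up thunks for large n.
sumSquares :: Int -> Int
sumSquares n = foldl' (+) 0 [x * x | x <- [1 .. n]]
```

In GHCi, sumSquares 10 evaluates to 385.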
name                           runtime  speedup  notes
dph.sumsq.vector.seq.N4        404ms    1
dph.sumsq.vectorised.seq.N4    434ms    0.93
dph.sumsq.vectorised.par.N1    443ms    0.91
dph.sumsq.vectorised.par.N2    222ms    1.82
dph.sumsq.vectorised.par.N4    111ms    3.63
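The speedup column appears to be the runtime of the first row (the vector.seq.N4 baseline, speedup 1) divided by the runtime of each row. That reading is inferred from the numbers rather than stated on this page. A small sketch, with the hypothetical name speedup:

```haskell
-- Speedup of a measured run relative to a reference run.
-- Both runtimes must be in the same unit (milliseconds here).
speedup :: Double -> Double -> Double
speedup reference measured = reference / measured
```

For example, speedup 404 222 is about 1.82, matching the par.N2 row.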
Summary: fine
Todo: Add the sequential C version.
 DotProduct

Computes the dot product of two vectors of Doubles. N = 10M.
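For reference, a minimal list-based sketch of the computation. It is not the benchmark source; the name dotProduct is introduced here for illustration, and if the inputs differ in length the extra elements of the longer one are ignored.

```haskell
-- Dot product: multiply element-wise, then sum.
dotProduct :: [Double] -> [Double] -> Double
dotProduct xs ys = sum (zipWith (*) xs ys)
```

In GHCi, dotProduct [1, 2, 3] [4, 5, 6] evaluates to 32.0.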
name                           runtime  speedup  notes
dph.sumsq.vector.seq.N4        68ms     1
dph.sumsq.vectorised.seq.N4    58ms     1.17     A
dph.sumsq.vectorised.par.N1    55ms     1.24
dph.sumsq.vectorised.par.N2    33ms     2.06
dph.sumsq.vectorised.par.N4    25ms     2.72
A: The vectorised version is faster than with Data.Vector. Why was this?
Summary: fine.
Todo: Add the sequential C version.
 SMVM
 Multiplies a dense vector with a sparse matrix represented in the compressed sparse row format (CSR).
Todo: Add this to the nightly run.
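To make the CSR layout concrete, here is a minimal sequential sketch. It is not the benchmark source: the type CSR, its field names, and the function smvm are hypothetical, and it assumes the usual three-array CSR layout (row pointers, column indices, values) with zero-based column indices. It uses Data.Array from the array package that ships with GHC.

```haskell
import Data.Array (Array, listArray, (!))

-- Hypothetical CSR representation: row i owns the entries at
-- positions rowPtr[i] .. rowPtr[i+1]-1 of colIdx and vals.
data CSR = CSR
  { rowPtr :: [Int]
  , colIdx :: [Int]
  , vals   :: [Double]
  }

-- Multiply a CSR matrix with a dense vector, one result per row.
smvm :: CSR -> Array Int Double -> [Double]
smvm (CSR ptrs cols vs) x = go ptrs (zip cols vs)
  where
    go (a : rest@(b : _)) entries =
      let (row, entries') = splitAt (b - a) entries
      in  sum [v * x ! c | (c, v) <- row] : go rest entries'
    go _ _ = []

-- Example: the 3x3 matrix [[1,0,2],[0,0,0],[0,3,0]].
exampleMatrix :: CSR
exampleMatrix = CSR [0, 2, 2, 3] [0, 2, 1] [1, 2, 3]

exampleVector :: Array Int Double
exampleVector = listArray (0, 2) [1, 2, 3]
```

Evaluating smvm exampleMatrix exampleVector gives [7.0, 0.0, 6.0]. The empty middle row shows why the row pointers are needed: rows may have different numbers of stored entries.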
Dynamically Nested
Dynamically nested programs have a recursive structure where each level of the recursion invokes more parallel computations. This is common for benchmarks that use divide-and-conquer style algorithms.
 Primes

The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of Bools. We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of Bool dependent on whether the hardware supports atomic writes of bytes. Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka Word8).
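The sieve structure itself can be sketched sequentially. The following is not the benchmark source and is not parallel, so it sidesteps the atomic-write question entirely. It expresses the "writes into an array of Bool" as a single accumArray call from the array package that ships with GHC; the names sieve and primesUpTo are introduced here for illustration.

```haskell
import Data.Array.Unboxed (UArray, accumArray, assocs)

-- Build the sieve for 2..n: every index starts as True and each
-- (m, False) pair strikes out a composite m.
sieve :: Int -> UArray Int Bool
sieve n = accumArray (\_ flag -> flag) True (2, n) strikes
  where
    -- Strike multiples of every i with i*i <= n. Striking from
    -- composite i as well is redundant but harmless.
    strikes =
      [ (m, False)
      | i <- takeWhile (\k -> k * k <= n) [2 ..]
      , m <- [i * i, i * i + i .. n]
      ]

-- Indices still marked True are the primes.
primesUpTo :: Int -> [Int]
primesUpTo n = [i | (i, True) <- assocs (sieve n)]
```

In GHCi, primesUpTo 30 evaluates to [2,3,5,7,11,13,17,19,23,29]. Several strikes may hit the same index, which is the situation where a parallel version would need concurrent writes to the same byte to be safe.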
 Quickhull
 Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divide-and-conquer algorithms with nested data parallelism. We have only a "vectorised" version of this benchmark and a sequential Haskell reference implementation, "ref Haskell", using vanilla lists.
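The divide-and-conquer structure can be illustrated with a small list-based sketch. This is written for illustration only and is not the "ref Haskell" implementation mentioned above; the names Point, Line, cross, hsplit and quickhull are assumptions of this sketch. Each call to hsplit keeps the points on one side of a line, picks the farthest one, and recurses on the two new lines, which is where the nested parallelism would arise.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

type Point = (Double, Double)
type Line  = (Point, Point)

-- Cross product of (b - a) and (p - a): positive when p lies to the
-- left of the directed line a -> b, and proportional to its distance.
cross :: Line -> Point -> Double
cross ((x1, y1), (x2, y2)) (x, y) =
  (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)

-- Hull points strictly left of the line, starting with its first endpoint.
hsplit :: [Point] -> Line -> [Point]
hsplit points line@(p1, p2)
  | null above = [p1]
  | otherwise  = hsplit above (p1, pm) ++ hsplit above (pm, p2)
  where
    above = [p | p <- points, cross line p > 0]
    pm    = maximumBy (comparing (cross line)) above

-- Split by the line through the lexicographically smallest and largest
-- points, then solve both sides.
quickhull :: [Point] -> [Point]
quickhull [] = []
quickhull points
  | lo == hi  = [lo]
  | otherwise = hsplit points (lo, hi) ++ hsplit points (hi, lo)
  where
    lo = minimum points
    hi = maximum points
```

For the unit square with one interior point, quickhull [(0,0), (1,0), (1,1), (0,1), (0.5,0.5)] evaluates to [(0.0,0.0),(0.0,1.0),(1.0,1.0),(1.0,0.0)]: the interior point is dropped and the corners come out in clockwise order.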
 Quicksort
 FIXME
Dynamically Nested with Algebraic Data Types
These programs also use user-defined algebraic data types. Vectorization of these programs is still a work in progress.
 BarnesHut
 This benchmark implements the Barnes-Hut algorithm to solve the n-body problem in two dimensions. Currently won't compile with vectorisation due to excessive inlining of dictionaries.
Execution on LimitingFactor (2x Quad-Core Xeon)
Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz front-side bus per processor; 800MHz DDR2 FB-DIMM; 256-bit-wide memory architecture; Mac OS X Server 10.5.6