|Version 55 (modified by 6 years ago) (diff),|
Status of DPH Benchmarks
Last updated: 2nd December 2010.
This page gives an overview of how well the benchmarks in the dph-examples/ directory of package dph are currently working.
- Evens: gets slower as the number of threads increases, probably because it's using a filtering operation.
- QuickHull: vectorised.par.N1 version is 6x slower than the immutable Data.Vector version in absolute terms. This may be related to the problem with Evens.
- QuickSort: vectorised.seq version doesn't compile due to a blow-up in SpecConstr.
- NBody: has a core-lint error due to a bug in the rule matcher. If you turn off -dcore-lint it segfaults when run. Before recent GHC changes it compiled (with core-lint error), but vectorised.par Barnes-Hut algorithm was 50x slower than the version using immutable Data.Vector.
- Benchmarks are currently being run with -fasm, and not via the LLVM backend. This will affect comparisons with C, but not with Data.Vector as it uses the same backend.
Todo: add Repa benchmarks.
Statically Nested Parallelism
Statically nested parallelism is where the parallelism has a fixed, finite depth. For example
mapP f (filterP g xs)
. Statically nested programs are easier to vectorize than dynamically nested programs. At present, single threaded statically nested programs should run as fast as equivalent Data.Vector programs. Parallel versions should display a good speedup.
Computes the sum of the squares from 1 to N using
Int. N = 100M.
name runtime speedup notes dph.sumsq.vector.seq.N4 404ms 1 dph.sumsq.vectorised.seq.N4 434ms 0.93 dph.sumsq.vectorised.par.N1 443ms 0.91 dph.sumsq.vectorised.par.N2 222ms 1.82 dph.sumsq.vectorised.par.N4 111ms 3.63
Todo: Add a sequential C version.
Computes the dot product of two vectors of
name runtime speedup notes dph.dotp.vector.seq.N4 68ms 1 dph.dotp.vectorised.seq.N4 58ms 1.17 A dph.dotp.vectorised.par.N1 55ms 1.24 dph.dotp.vectorised.par.N2 33ms 2.06 dph.dotp.vectorised.par.N4 25ms 2.72
A: The sequential vectorised version is faster than with Data.Vector. Why was this?
Todo: Add a sequential C version.
- Evens (SLOWDOWN)
Takes the even valued
Ints from a vector. N=10M.
name runtime speedup notes dph.evens.vectorised.seq.N4 1.075s 1 dph.evens.vectorised.par.N1 736ms 1.46 dph.evens.vectorised.par.N2 768ms 1.40 dph.evens.vectorised.par.N4 859ms 1.25
Status: Benchmark runs slower when number of threads increases. This benchmark invokes
packByTagdue to the filtering operation. This is probably affecting Quickhull as it also uses filtering.
Todo: Fix slowdown. Add a sequential C version.
- Multiplies a dense vector with a sparse matrix represented in the compressed sparse row format (CSR).
Todo: Add this to the nightly run.
Dynamically Nested Parallelism
Dynamically nested programs have a recursive structure where each level of the recursion invokes more parallel computations. This is common for benchmarks that use divide-and-conquer style algorithms.
The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of
Todo: We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of
Booldependent on whether the hardware supports atomic writes of bytes. Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka
- QuickSort (BROKEN)
- Sort a vector of doubles by recursively splitting the vector and sorting the two halves. This is a "fake" benchmark because we divide right down to two-point vectors and construct the result using copying append. A production algorithm would switch to an in-place sort once the size of the vector reaches a few thousand elements.
name runtime speedup notes dph.quicksort.vectorised.par.N1 428ms 1 dph.quicksort.vectorised.par.N2 400ms 1.07 dph.quicksort.vectorised.par.N4 392ms 1.09
Status: Sequential vectorised version does not compile due to a blowup in SpecConstr.
- Quickhull (SLOWLORIS)
- Given a set of points in the plane, compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divide-and-conquer algorithms with nested data parallelism.
name runtime speedup notes dph.quickhull.vector-immutable.seq.N4 0.166s 1 dph.quickhull.vectorised.seq.N4 0.677s 0.24 4x slower dph.quickhull.vectorised.par.N1 1.059s 0.15 6x slower dph.quickhull.vectorised.par.N2 0.809s 0.21 dph.quickhull.vectorised.par.N4 0.686s 0.24 dph.quickhull.vector-mutable.seq.N4 0.086s 1.93 A dph.quickhull.vector-forkIO.par.N4 0.064s 2.59 B dph.quickhull.c.seq 0.044s 3.77 C
A: Uses mutable Data.Vectors for intermediate buffers.
B: Uses mutable Data.Vectors, forkIO and atomicModifyIORef. Concurrent threads fill a shared output vector. Code is uglier than the C version.
C: Sequential C version with pre-allocated mutable intermediate buffers.
Status: Benchmark scales but single threaded vectorised.par version is 6x slower than slower than version using immutable Data.Vectors. QuickHull is based around filtering operations, so the fact that Evens is also slow is probably related.
Dynamically Nested Parallelism with Algebraic Data Types
These programs also use user defined algebraic data types. Vectorization of these programs is still a work in progress.
- This benchmark implements the Barnes-Hut algorithm to solve the n-body problem in two dimensions. Currently won't compile with vectorisation due to excessive inlining of dictionaries.
- Either dph or repa. Repa programs use the same parallel array library as DPH, but do not go through the vectorising transform.
- vectorised means it's been through the DPH vectorising transform.
- vector is a hand written version using immutable Data.Vectors
- vector-mutable is a hand written version using mutable Data.Vectors.
- vector-immutable means the same as vector and is used when there is also an mutable version.
- Whether a benchmark is natively parallel or sequential.
- Parallel versions are also run single threaded (with -N1) and sequential versions are also run with (-N4) so we get the parallel GC.
- Parallel versions with -N1 will tend to be slower than natively sequential versions due to overheads for supporting parallelism.
- BROKEN: Benchmark doesn't compile, or crashes when run.
- SLOWDOWN: Benchmark gets slower as number of threads increases.
- SLOWLORIS: Benchmark scales as the number of threads increases, but the absolute performance is not acceptable compared with equivalent versions using immutable Data.Vectors.
- 2x 3.0GHz Quad-Core Intel Xeon 5400
- 12MB (2x6MB) on-die L2 cache per processor
- independent 1.6GHz frontside bus per processor
- 800MHz DDR2 FB-DIMM
- 256-bit-wide memory architecture
- Mac OS X Server 10.5.6