Changes between Version 50 and Version 51 of DataParallel/BenchmarkStatus
 Timestamp:
 Dec 2, 2010 3:33:09 AM (7 years ago)
Legend:
 Unmodified
 Added
 Removed
 Modified

DataParallel/BenchmarkStatus
v50 v51 1 1 2 2 3 == Status of DPH Benchmarks == 3 = Status of DPH Benchmarks = 4 5 '''Last updated''': 2nd December 2010. 4 6 5 7 This page gives an overview of how well the benchmarks in the [http://darcs.haskell.org/packages/dph/dphexamples dphexamples/] directory of package dph are currently working. … … 7 9 The benchmarks are run each night by [http://darcs.haskell.org/packages/dph/dphbuildbot DPH BuildBot]. The results are posted to cvsghc and uploaded to [http://log.ouroborus.net/limitingfactor/dph/]. Check there for the latest numbers. 8 10 9 == Key == 11 !ToDo: Benchmarks are currently being run with fasm, and not via the LLVM backend. This will affect comparisons with C, but not with Data.Vector as it uses the same backend. 12 13 = Key = 10 14 <project>.<benchmark>.<version>.<parallelism>.<threads> 11 15 … … 24 28 * Parallel versions with N1 will tend to be slower than natively sequential versions due to overheads for supporting parallelism. 25 29 30 Status 31 * '''BROKEN''': Benchmark doesn't compile, or crashes when run. 32 * '''SLOWDOWN''': Benchmark gets slower as number of threads increases. 33 * '''SLOWLORIS''': Benchmark scales as the number of threads increases, but the absolute performance is not acceptable compared with equivalent versions using immutable Data.Vectors. 34 35 26 36 [[br]] 27 = = Statically Nested ==37 = Statically Nested = 28 38 Statically nested parallelism is where the parallelism has a fixed, finite depth. For example ``mapP f (filterP g xs)``. Statically nested programs are easier to vectorize than dynamically nested programs. At present, single threaded statically nested programs should run as fast as equivalent Data.Vector programs. Parallel versions should display a good speedup. 29 39 … … 39 49 40 50 '''Summary''': fine[[br]] 41 '''Todo''': Add thesequential C version.51 '''Todo''': Add a sequential C version. 42 52 43 53 [[br]] 44 54 [http://darcs.haskell.org/packages/dph/dphexamples/imaginary/DotProduct DotProduct]:: 45 55 Computes the dot product of two vectors of `Double`s. N=10M. 46 56 47 57  '''name'''  '''runtime'''  '''speedup'''  '''notes''' 48  dph. sumsq.vector.seq.N4  68ms  1  49  dph. sumsq.vectorised.seq.N4  58ms  1.17  A 50  dph. sumsq.vectorised.par.N1  55ms  1.24  51  dph. sumsq.vectorised.par.N2  33ms  2.06  52  dph. sumsq.vectorised.par.N4  25ms  2.72  58  dph.dotp.vector.seq.N4  68ms  1   59  dph.dotp.vectorised.seq.N4  58ms  1.17  A  60  dph.dotp.vectorised.par.N1  55ms  1.24   61  dph.dotp.vectorised.par.N2  33ms  2.06   62  dph.dotp.vectorised.par.N4  25ms  2.72   53 63 54 A: The vectorised version is faster than with Data.Vector. Why was this?64 A: The sequential vectorised version is faster than with Data.Vector. Why was this? 55 65 56 66 '''Summary''': fine.[[br]] 57 '''Todo''': Add thesequential C version.67 '''Todo''': Add a sequential C version. 58 68 69 [[br]] 70 [http://darcs.haskell.org/libraries/dph/dphexamples/imaginary/Evens/ Evens] '''(SLOWDOWN)''':: 71 Takes the even valued `Int`s from a vector. N=10M. 72 73  '''name'''  '''runtime'''  '''speedup'''  '''notes''' 74  dph.evens.vectorised.seq.N4  1.075s  1   75  dph.evens.vectorised.par.N1  736ms  1.46   76  dph.evens.vectorised.par.N2  768ms  1.40   77  dph.evens.vectorised.par.N4  859ms  1.25   78 79 '''Summary''': Benchmark runs slower when number of threads increases. This benchmark invokes {{{packByTag}}} due to the filtering operation. This is probably affecting Quickhull as it also uses filtering. [[br]] 80 '''Todo''': Fix slowdown. Add a sequential C version. 81 82 [[br]] 59 83 [http://darcs.haskell.org/packages/dph/examples/smvm/ SMVM]:: 60 84 Multiplies a dense vector with a sparse matrix represented in the ''compressed sparse row format (CSR).'' 61 85 62 86 '''Todo''': Add this to the nightly run. 63 87 64 88 [[br]] 65 = == Dynamically Nested ===89 = Dynamically Nested = 66 90 Dynamically nested programs have a recursive structure where each level of the recursion invokes more parallel computations. This is common for benchmarks that use divideandconquer style algorithms. 67 91 68 92 [http://darcs.haskell.org/packages/dph/examples/primes/ Primes]:: 69 The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s. We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes. '''Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`).'''93 The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s. 70 94 71 [http://darcs.haskell.org/packages/dph/examples/quickhull/ Quickhull]:: 72 Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divideandconquer algorithms with nested data parallelism. We have only a "vectorised" version of this benchmark and a sequential Haskell reference implementation, "ref Haskell", using vanilla lists. 95 '''Todo''': We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes. Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`). 96 97 [[br]] 98 [http://darcs.haskell.org/libraries/dph/dphexamples/spectral/QuickHull/ Quickhull] '''(SLOWLORIS)''':: 99 Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divideandconquer algorithms with nested data parallelism. 100 101 102  '''name'''  '''runtime'''  '''speedup'''  '''notes''' 103  dph.quickhull.vectorimmutable.seq.N4  0.166s  1   104  dph.quickhull.vectorised.seq.N4  0.677s  0.24  4x slower  105  dph.quickhull.vectorised.par.N1  1.059s  0.15  6x slower 106  dph.quickhull.vectorised.par.N2  0.809s  0.21   107  dph.quickhull.vectorised.par.N4  0.686s  0.24   108  dph.quickhull.vectormutable.seq.N4  0.086s  1.93   109  dph.quickhull.vectorforkIO.par.N4  0.064s  2.59   110  dph.quickhull.c.seq  0.044s  3.77   111 112 '''Status''': Benchmark scales but is 4x slower than version using immutable Data.Vectors. This benchmark is based around filtering operations, so the fact that Evens is also slow is probably related. 73 113 74 114 [http://darcs.haskell.org/packages/dph/examples/qsort/ Quicksort]:: … … 76 116 77 117 [[br]] 78 = == Dynamically Nested with Algebraic Data Types ===118 = Dynamically Nested with Algebraic Data Types = 79 119 These programs also use user defined algebraic data types. Vectorization of these programs is still a work in progress. 80 120 … … 82 122 This benchmark implements the BarnesHut algorithm to solve the ''n''body problem in two dimensions. '''Currently won't compile with vectorisation due to excessive inlining of dictionaries.''' 83 123 84 85 === Execution on !LimitingFactor (2x QuadCore Xeon) === 86 87 Hardware spec: 2x 3.0GHz QuadCore Intel Xeon 5400; 12MB (2x6MB) ondie L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2 FBDIMM; 256bitwide memory architecture; Mac OS X Server 10.5.6 124 [[br]] 125 = Benchmark machine = 126 * 2x 3.0GHz QuadCore Intel Xeon 5400 127 * 12MB (2x6MB) ondie L2 cache per processor 128 * independent 1.6GHz frontside bus per processor 129 * 800MHz DDR2 FBDIMM 130 * 256bitwide memory architecture 131 * Mac OS X Server 10.5.6