Changes between Version 50 and Version 51 of DataParallel/BenchmarkStatus


Ignore:
Timestamp:
Dec 2, 2010 3:33:09 AM (3 years ago)
Author:
benl
Comment:

Update status of quickhull

Legend:

Unmodified
Added
Removed
Modified
  • DataParallel/BenchmarkStatus

    v50 v51  
    11 
    22 
    3 == Status of DPH Benchmarks == 
     3= Status of DPH Benchmarks = 
     4 
     5'''Last updated''': 2nd December 2010. 
    46 
    57This page gives an overview of how well the benchmarks in the [http://darcs.haskell.org/packages/dph/dph-examples dph-examples/] directory of package dph are currently working. 
     
    79The benchmarks are run each night by [http://darcs.haskell.org/packages/dph/dph-buildbot DPH BuildBot]. The results are posted to cvs-ghc and uploaded to [http://log.ouroborus.net/limitingfactor/dph/]. Check there for the latest numbers. 
    810 
    9 == Key == 
     11!ToDo: Benchmarks are currently being run with -fasm, and not via the LLVM backend. This will affect comparisons with C, but not with Data.Vector as it uses the same backend. 
     12 
     13= Key = 
    1014<project>.<benchmark>.<version>.<parallelism>.<threads> 
    1115 
     
    2428 * Parallel versions with -N1 will tend to be slower than natively sequential versions due to overheads for supporting parallelism. 
    2529 
     30Status 
     31 * '''BROKEN''': Benchmark doesn't compile, or crashes when run. 
     32 * '''SLOWDOWN''': Benchmark gets slower as number of threads increases.  
     33 * '''SLOWLORIS''': Benchmark scales as the number of threads increases, but the absolute performance is not acceptable compared with equivalent versions using immutable Data.Vectors. 
     34 
     35  
    2636[[br]] 
    27 == Statically Nested == 
     37= Statically Nested = 
    2838Statically nested parallelism is where the parallelism has a fixed, finite depth. For example ``mapP f (filterP g xs)``. Statically nested programs are easier to vectorize than dynamically nested programs. At present, single threaded statically nested programs should run as fast as equivalent Data.Vector programs. Parallel versions should display a good speedup. 
    2939 
     
    3949 
    4050  '''Summary''': fine[[br]] 
    41   '''Todo''': Add the sequential C version. 
     51  '''Todo''': Add a sequential C version. 
    4252 
    43  
     53[[br]] 
    4454 [http://darcs.haskell.org/packages/dph/dph-examples/imaginary/DotProduct DotProduct]:: 
    4555  Computes the dot product of two vectors of `Double`s. N=10M. 
    4656 
    4757  || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    48   || dph.sumsq.vector.seq.N4 ||  68ms || 1 || || 
    49   || dph.sumsq.vectorised.seq.N4 || 58ms || 1.17 || A || 
    50   || dph.sumsq.vectorised.par.N1 || 55ms || 1.24 || || 
    51   || dph.sumsq.vectorised.par.N2 || 33ms || 2.06 || || 
    52   || dph.sumsq.vectorised.par.N4 || 25ms || 2.72 || || 
     58  || dph.dotp.vector.seq.N4 ||  68ms || 1 || || 
     59  || dph.dotp.vectorised.seq.N4 || 58ms || 1.17 || A || 
     60  || dph.dotp.vectorised.par.N1 || 55ms || 1.24 || || 
     61  || dph.dotp.vectorised.par.N2 || 33ms || 2.06 || || 
     62  || dph.dotp.vectorised.par.N4 || 25ms || 2.72 || || 
    5363  
    54   A: The vectorised version is faster than with Data.Vector. Why was this? 
     64  A: The sequential vectorised version is faster than with Data.Vector. Why was this? 
    5565 
    5666  '''Summary''': fine.[[br]] 
    57   '''Todo''': Add the sequential C version. 
     67  '''Todo''': Add a sequential C version. 
    5868 
     69[[br]] 
     70  [http://darcs.haskell.org/libraries/dph/dph-examples/imaginary/Evens/ Evens] '''(SLOWDOWN)''':: 
     71   Takes the even valued `Int`s from a vector. N=10M. 
     72 
     73  || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
     74  || dph.evens.vectorised.seq.N4 || 1.075s || 1 || || 
     75  || dph.evens.vectorised.par.N1 || 736ms ||  1.46 || || 
     76  || dph.evens.vectorised.par.N2 || 768ms ||  1.40 || || 
     77  || dph.evens.vectorised.par.N4 || 859ms ||  1.25 || || 
     78 
     79  '''Summary''': Benchmark runs slower when number of threads increases. This benchmark invokes {{{packByTag}}} due to the filtering operation. This is probably affecting Quickhull as it also uses filtering. [[br]]  
     80  '''Todo''': Fix slowdown. Add a sequential C version.  
     81 
     82 [[br]] 
    5983 [http://darcs.haskell.org/packages/dph/examples/smvm/ SMVM]:: 
    60   Multiplies a dense vector with a sparse matrix represented in the ''compressed sparse row format (CSR).''   
     84 Multiplies a dense vector with a sparse matrix represented in the ''compressed sparse row format (CSR).''   
    6185 
    6286  '''Todo''': Add this to the nightly run. 
    6387 
    6488[[br]] 
    65 === Dynamically Nested === 
     89= Dynamically Nested = 
    6690Dynamically nested programs have a recursive structure where each level of the recursion invokes more parallel computations. This is common for benchmarks that use divide-and-conquer style algorithms. 
    6791 
    6892 [http://darcs.haskell.org/packages/dph/examples/primes/ Primes]:: 
    69   The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s.  We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute.  The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes.  '''Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`).''' 
     93  The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s.   
    7094 
    71  [http://darcs.haskell.org/packages/dph/examples/quickhull/ Quickhull]:: 
    72   Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divide-and-conquer algorithms with nested data parallelism.  We have only a "vectorised" version of this benchmark and a sequential Haskell reference implementation, "ref Haskell", using vanilla lists. 
     95  '''Todo''': We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute.  The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes. Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`). 
     96 
     97[[br]] 
     98 [http://darcs.haskell.org/libraries/dph/dph-examples/spectral/QuickHull/ Quickhull] '''(SLOWLORIS)''':: 
     99  Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. This benchmark is interesting as it is the simplest code that exploits the ability to implement divide-and-conquer algorithms with nested data parallelism. 
     100 
     101 
     102  || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
     103  || dph.quickhull.vector-immutable.seq.N4 || 0.166s || 1 || || 
     104  || dph.quickhull.vectorised.seq.N4 || 0.677s ||  0.24 || 4x slower || 
     105  || dph.quickhull.vectorised.par.N1 || 1.059s ||  0.15 || 6x slower|| 
     106  || dph.quickhull.vectorised.par.N2 || 0.809s ||  0.21 || || 
     107  || dph.quickhull.vectorised.par.N4 || 0.686s ||  0.24 || || 
     108  || dph.quickhull.vector-mutable.seq.N4 || 0.086s ||  1.93 || || 
     109  || dph.quickhull.vector-forkIO.par.N4 || 0.064s ||  2.59 || || 
     110  || dph.quickhull.c.seq || 0.044s || 3.77 || || 
     111 
     112  '''Status''': Benchmark scales but is 4x slower than version using immutable Data.Vectors. This benchmark is based around filtering operations, so the fact that Evens is also slow is probably related. 
    73113 
    74114 [http://darcs.haskell.org/packages/dph/examples/qsort/ Quicksort]:: 
     
    76116 
    77117[[br]] 
    78 === Dynamically Nested with Algebraic Data Types === 
     118= Dynamically Nested with Algebraic Data Types = 
    79119These programs also use user defined algebraic data types. Vectorization of these programs is still a work in progress. 
    80120 
     
    82122  This benchmark implements the Barnes-Hut algorithm to solve the ''n''-body problem in two dimensions.  '''Currently won't compile with vectorisation due to excessive inlining of dictionaries.''' 
    83123 
    84  
    85 === Execution on !LimitingFactor (2x Quad-Core Xeon) === 
    86  
    87 Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2 FB-DIMM; 256-bit-wide memory architecture; Mac OS X Server 10.5.6 
     124[[br]] 
     125= Benchmark machine  = 
     126 * 2x 3.0GHz Quad-Core Intel Xeon 5400 
     127 * 12MB (2x6MB) on-die L2 cache per processor 
     128 * independent 1.6GHz frontside bus per processor 
     129 * 800MHz DDR2 FB-DIMM 
     130 * 256-bit-wide memory architecture 
     131 * Mac OS X Server 10.5.6