Changes between Version 30 and Version 31 of DataParallel/BenchmarkStatus

Mar 8, 2009 10:26:03 AM (9 years ago)



  • DataParallel/BenchmarkStatus

    v30 v31  
    1515 [ Primes]::
    1616  The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s.  We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute.  The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes.  '''Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`).'''
     17 [ Quickhull]::
     18  Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. There is only a vectorised version.  '''Currently doesn't work due to bugs in dph-par.  Also needs to get a wrapper using the new benchmark framework to generated test input and time execution.'''
     19 [ Quicksort]::
     20  FIXME
     21 [ ConComp]::
     22  Implementation of the Awerbuch-Shiloach and Hybrid algorithms for finding connected components in undirected graphs.  There is only a version directly coded against the array primitives.  '''Needs to be adapted to new benchmark framework.'''
     23 [ BarnesHut]::
     24  This benchmark implements the Barnes-Hut algorithm to solve the ''n''-body problem in two dimensions.  '''Currently won't compile with vectorisation due to excessive inlining of dictionaries.'''
    1827=== Execution on !LimitingFactor (2x Quad-Core Xeon) ===
    8695As on !LimitingFactor, but it scales much more nicely and improves until using four threads per core.  This suggets that memory bandwidth is again a critical factor in this benchmark (this fits well with earlier observations on other architectures).  Despite fusion problem with `dph-par`, the parallel Haskell program, using all 8 cores, still ends up three times faster than the sequential C program.
    88 ----
    90 || '''Program'''  || '''Sequential (manually vectorised) ''' || '''Vectorised'''         || ''' Parallel'''                     ||
    91 || DotP           ||Order of mag. faster than list impl      || Same performance as seq. || speedup of 2 for 2 CPUs, 4 threads  ||
    92 || !QuickSort     ||Slower than list (fusion)                || Slower than seq. (why?)  || speedup of 1.4 on 2 CPUs            ||
    93 || !SparseVector  ||Similar to DotP                          ||                          ||                                     ||
    94 || Primes (Nesl)  ||15 x faster than list version            || NYI                      || 20 x slower than seq (fusion?)      ||
    95 || Primes (Simon) ||NYI                                      || Working                  || NYI                                 ||
    96 || !BarnesHut     ||Small bug in alg                         || Working                  || See seq.                            || 
    99 General remarks:
    101  * I only ran a first set of benchmarks when checking for what's there. I'll run the benchmarks properly as next step
    103  * Fusion doesn't work well on parallel programs yet, so for all but simple examples, the parallel program performs worse than the sequential
    105  * The compiler doesn't exploit all fusion opportunities for QSort and !BarnesHut. Once this is fixed, they should run considerably faster.
    107  * Interestingly, the automatically vectorised version of qsort is quite a bit faster than the hand-flattened. Need to find out why.