Changes between Version 30 and Version 31 of DataParallel/BenchmarkStatus

Mar 8, 2009 10:26:03 AM (6 years ago)



  • DataParallel/BenchmarkStatus

    v30 v31  
    1515 [ Primes]:: 
    1616  The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of `Bool`s.  We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute.  The problem is that we need to make the representation of parallel arrays of `Bool` dependent on whether the hardware supports atomic writes of bytes.  '''Investigate whether any of the architectures relevant for DPH actually do have trouble with atomic writes of bytes (aka `Word8`).''' 
     17 [ Quickhull]:: 
     18  Given a set of points (in a plane), compute the sequence of points that encloses all points in the set. There is only a vectorised version.  '''Currently doesn't work due to bugs in dph-par.  Also needs to get a wrapper using the new benchmark framework to generated test input and time execution.''' 
     19 [ Quicksort]:: 
     20  FIXME 
     21 [ ConComp]:: 
     22  Implementation of the Awerbuch-Shiloach and Hybrid algorithms for finding connected components in undirected graphs.  There is only a version directly coded against the array primitives.  '''Needs to be adapted to new benchmark framework.''' 
     23 [ BarnesHut]:: 
     24  This benchmark implements the Barnes-Hut algorithm to solve the ''n''-body problem in two dimensions.  '''Currently won't compile with vectorisation due to excessive inlining of dictionaries.''' 
    1827=== Execution on !LimitingFactor (2x Quad-Core Xeon) === 
    8695As on !LimitingFactor, but it scales much more nicely and improves until using four threads per core.  This suggets that memory bandwidth is again a critical factor in this benchmark (this fits well with earlier observations on other architectures).  Despite fusion problem with `dph-par`, the parallel Haskell program, using all 8 cores, still ends up three times faster than the sequential C program. 
    88 ---- 
    90 || '''Program'''  || '''Sequential (manually vectorised) ''' || '''Vectorised'''         || ''' Parallel'''                     || 
    91 || DotP           ||Order of mag. faster than list impl      || Same performance as seq. || speedup of 2 for 2 CPUs, 4 threads  || 
    92 || !QuickSort     ||Slower than list (fusion)                || Slower than seq. (why?)  || speedup of 1.4 on 2 CPUs            || 
    93 || !SparseVector  ||Similar to DotP                          ||                          ||                                     || 
    94 || Primes (Nesl)  ||15 x faster than list version            || NYI                      || 20 x slower than seq (fusion?)      || 
    95 || Primes (Simon) ||NYI                                      || Working                  || NYI                                 || 
    96 || !BarnesHut     ||Small bug in alg                         || Working                  || See seq.                            ||   
    99 General remarks: 
    101  * I only ran a first set of benchmarks when checking for what's there. I'll run the benchmarks properly as next step 
    103  * Fusion doesn't work well on parallel programs yet, so for all but simple examples, the parallel program performs worse than the sequential 
    105  * The compiler doesn't exploit all fusion opportunities for QSort and !BarnesHut. Once this is fixed, they should run considerably faster. 
    107  * Interestingly, the automatically vectorised version of qsort is quite a bit faster than the hand-flattened. Need to find out why.