Changes between Version 18 and Version 19 of DataParallel/BenchmarkStatus


Timestamp:
Mar 5, 2009 5:03:27 AM (6 years ago)
Author:
chak
Comment:

--

  • DataParallel/BenchmarkStatus

    v18 v19  
    23 23 || DotP, ref Haskell || 100M elements || – || 810 || 437 || 221 || 209 || 
    24 24 || DotP, ref C || 100M elements || – || 458 || 235 || 210 || 210 || 
    25    || SMVM, primitives || ?? elems, density ?? ||  ||  ||  ||  ||  || 
    26    || SMVM, vectorised || ?? elems, density ?? ||  ||  ||  ||  ||  || 
       25 || SMVM, primitives || 100kx100k @ density 0.001 || 119/119 || 254/254 || 154/154 || 90/90 || 67/67 || 
       26 || SMVM, vectorised || 100kx100k @ density 0.001 || _|_ || _|_ || _|_ || _|_ || _|_ || 
       27 || SMVM, ref C || 100kx100k @ density 0.001 || 46 || – || – || – || – || 
    27 28 
    28 29 All results are in milliseconds, and the pairs report best/worst execution time (wall clock) of three runs.  The column marked "sequential" reports times when linked against `dph-seq`, and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads. 
    2930 
    30 ==== Observations regarding DotP ==== 
     31==== Comments regarding DotP ==== 
    3132 
    32 33 Performance is memory bound, and hence the benchmark stops scaling once the memory bus is saturated.  As a consequence, the wall-clock execution times of the Haskell programs and the C reference implementation are the same when all available parallelism is exploited.  The parallel DPH library delivers the same single-core performance as the sequential one in this benchmark. 
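To make concrete what the "DotP, ref Haskell" rows measure, the following is a minimal, hypothetical stand-in for the reference dot product — plain lists with a strict fold, not the actual unboxed-array code that produced the timings above:

```haskell
module Main where

import Data.List (foldl')

-- Dot product as one fused traversal: multiply pointwise, then sum
-- strictly.  A single pass over both inputs is what makes the real
-- benchmark memory-bound rather than compute-bound.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = foldl' (+) 0 (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])  -- 1*4 + 2*5 + 3*6 = 32.0
```

Each element is read once and used once, so there is no data reuse for caches to exploit; past the saturation point, adding cores only adds contention on the memory bus.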
     34 
     35==== Comments regarding smvm ==== 
     36 
     37 There seems to be a fusion problem in DotP with `dph-par` (even if the version of `zipWithSUP` that uses `splitSD/joinSD` is used); hence the much higher runtime for "P=1" than for "sequential".  The vectorised version runs out of memory; maybe because we haven't solved the `bpermute` problem yet. 
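For reference, the computation the SMVM rows benchmark can be sketched as follows — an illustrative compressed-sparse-row formulation over plain lists, not the dph primitives or the vectorised code measured above:

```haskell
module Main where

-- Each row stores only its non-zeroes as (column index, value) pairs.
type SparseRow    = [(Int, Double)]
type SparseMatrix = [SparseRow]

-- Sparse matrix-vector product: every row is a sparse dot product
-- against the dense vector, so SMVM stresses the same memory-bandwidth
-- limits as DotP, plus irregular (gather-style) access into v.
smvm :: SparseMatrix -> [Double] -> [Double]
smvm m v = [ sum [ x * (v !! i) | (i, x) <- row ] | row <- m ]

main :: IO ()
main = print (smvm [[(0, 1), (2, 2)], [(1, 3)]] [10, 20, 30])
  -- row 1: 1*10 + 2*30 = 70.0; row 2: 3*20 = 60.0
```

The irregular indexing into the dense vector is exactly the kind of gather that `bpermute` implements in the library, which is why the vectorised version hinges on it.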
    3338 
    3439=== Execution on greyarea (1x UltraSPARC T2) === 
     
    43 48 || DotP, ref Haskell || 100M elements || – || 934 || 467 || 238 || 117 || 61 || 65 || 36 || 
    44 49 || DotP, ref C || 100M elements || – || 554 || 277 || 142 || 72 || 37 || 22 || 20 || 
    45    || SMVM, primitives || ?? elems, density ?? ||  || || || || || || || || 
    46    || SMVM, vectorised || ?? elems, density ?? ||  || || || || || || || || 
       50 || SMVM, primitives || 100kx100k @ density 0.001 || 1112/1112 || 1926/1926 || 1009/1009 || 797/797 || 463/463 || 326/326 || 189/189 || 207/207 || 
       51 || SMVM, vectorised || 100kx100k @ density 0.001 ||  || || || || || || || || 
       52 || SMVM, ref C || 100kx100k @ density 0.001 || 600 || – || – || – || – || – || – || – || 
    4753 
    48 54 All results are in milliseconds, and the pairs report best/worst execution time (wall clock) of three runs.  The column marked "sequential" reports times when linked against `dph-seq`, and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads. 
    4955 
    50 ==== Observations regarding DotP ==== 
     56==== Comments regarding DotP ==== 
    5157 
    52 58 The benchmark scales nicely up to the maximum number of hardware threads.  Memory latency is largely covered by excess parallelism.  It is unclear why the Haskell reference implementation "ref Haskell" falls off at 32 and 64 threads. 
     59 
     60==== Comments regarding smvm ==== 
     61 
     62 As on !LimitingFactor, but it scales much more nicely and keeps improving up to four threads per core.  This suggests that memory bandwidth is again a critical factor in this benchmark (which fits well with earlier observations on other architectures).  Despite the fusion problem with `dph-par`, the parallel Haskell program, using all 8 cores, still ends up three times faster than the sequential C program. 
    5363 
    5464----