Changes between Version 47 and Version 48 of DataParallel/BenchmarkStatus


Timestamp: Dec 2, 2010 2:01:55 AM
Author: benl
Comment: Remove old benchmarking details

  • DataParallel/BenchmarkStatus

== Status of DPH Benchmarks ==

Removed: This page gives an overview of how well the benchmarks in the [http://darcs.haskell.org/packages/dph/examples/ examples/] directory of package dph are currently working.
Added: This page gives an overview of how well the benchmarks in the [http://darcs.haskell.org/packages/dph/dph-examples dph-examples/] directory of package dph are currently working.

=== Overview of the benchmark programs ===
     

=== Execution on !LimitingFactor (2x Quad-Core Xeon) ===

Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2 FB-DIMM; 256-bit-wide memory architecture; Mac OS X Server 10.5.6

Software spec: GHC 6.11 (from first week of Mar 09); gcc 4.0.1

|| '''Program''' || '''Problem size''' || '''sequential''' || '''P=1''' || '''P=2''' || '''P=4''' || '''P=8''' ||
|| !SumSq, primitives || 10M || 22 || 40 || 20 || 10 || 5 ||
|| !SumSq, vectorised || 10M || 22 || 40 || 20 || 10 || 5 ||
|| !SumSq, ref C || 10M || 9 || – || – || – || – ||
|| DotP, primitives || 100M elements || 823/823/824 || 812/813/815 || 408/408/409 || 220/223/227 || 210/214/221 ||
|| DotP, vectorised || 100M elements || 823/824/824 || 814/816/818 || 412/417/421 || 222/225/227 || 227/232/238 ||
|| DotP, ref Haskell || 100M elements || – || 810 || 437 || 221 || 209 ||
|| DotP, ref C || 100M elements || – || 458 || 235 || 210 || 210 ||
|| SMVM, primitives || 10kx10k @ density 0.1 || 119/119 || 111/111 || 78/78 || 36/36 || 21/21 ||
|| SMVM, vectorised || 10kx10k @ density 0.1 || 175/175 || 137/137 || 74/74 || 47/47 || 23/23 ||
|| SMVM, ref C || 10kx10k @ density 0.1 || 35 || – || – || – || – ||
|| SMVM, primitives || 100kx100k @ density 0.001 || 132/132 || 135/135 || 81/81 || 91/91 || 48/48 ||
|| SMVM, vectorised || 100kx100k @ density 0.001 || 182/182 || 171/171 || 93/93 || 89/89 || 53/53 ||
|| SMVM, ref C || 100kx100k @ density 0.001 || 46 || – || – || – || – ||

All results are in milliseconds, and the triples report the best/average/worst wall-clock execution time of three runs.  The column marked "sequential" reports times when linked against `dph-seq`, and the columns marked "P=n" report times when linked against `dph-par` and run in parallel with the specified number of OS threads.
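
For reference, the following is a minimal sketch of how such a best/average/worst triple can be measured.  It is an illustration only, not the harness actually used by these benchmarks, and `timeTriple` is a hypothetical helper name:

{{{
#!haskell
import Control.Exception (evaluate)
import Control.Monad     (replicateM)
import Data.Time.Clock   (diffUTCTime, getCurrentTime)

-- Run an action three times and return (best, average, worst)
-- wall-clock time in milliseconds.
timeTriple :: IO a -> IO (Double, Double, Double)
timeTriple action = do
    times <- replicateM 3 timeOnce
    return (minimum times, sum times / 3, maximum times)
  where
    timeOnce = do
        start <- getCurrentTime
        _     <- action
        end   <- getCurrentTime
        return (realToFrac (diffUTCTime end start) * 1000)

main :: IO ()
main = do
    (best, avg, worst) <- timeTriple (evaluate (sum [1 .. 1000000 :: Int]))
    putStrLn (show best ++ "/" ++ show avg ++ "/" ++ show worst)
}}}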

==== Comments regarding !SumSq ====

The versions compiled against `dph-par` are a factor of two slower than the ones linked against `dph-seq`.

However, we found a number of general problems while working on this example (the vectorised kernel is sketched after this list):
 * We need an extra `-funfolding-use-threshold` flag.  We don't really want users to have to worry about that.
 * `enumFromTo` doesn't fuse due to excessive dictionaries in the unfolding of `zipWithUP`.
 * `mapP (\x -> x * x) xs` essentially turns into `zipWithU (*) xs xs`, which doesn't fuse with `enumFromTo` anymore.  We have a rewrite rule in the library to fix that, but it is not general enough.  We would really rather not vectorise the lambda abstraction at all.
 * Finally, to achieve the current result, we needed an analysis that avoids vectorising subcomputations that don't need to be vectorised and that fusion would otherwise have to turn back into their original form (in this case, the lambda abstraction `\x -> x * x`).  This is currently implemented in a rather limited and ad hoc way; it should be based on a more general analysis.
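
For reference, the vectorised !SumSq kernel is essentially of the following shape.  This is a sketch only: module names, pragmas, and combinator locations follow the dph-examples style of the time and may not match the current library exactly.

{{{
#!haskell
{-# LANGUAGE ParallelArrays #-}
{-# OPTIONS_GHC -fvectorise #-}
module SumSqVectorised (sumSq) where

import Data.Array.Parallel
import Data.Array.Parallel.Prelude.Int
import qualified Prelude

-- Square every element of [1..n] and sum the results.  The 'mapP' of
-- the squaring lambda is exactly the piece that must fuse with
-- 'enumFromToP' and 'sumP' (see the comments above).
sumSq :: Int -> Int
sumSq n = sumP (mapP (\x -> x * x) (enumFromToP 1 n))
}}}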

==== Comments regarding DotP ====

Performance is memory bound; hence, the benchmark stops scaling once the memory bus is saturated.  As a consequence, the wall-clock execution times of the Haskell programs and of the C reference implementation are the same when all available parallelism is exploited.  In this benchmark, the parallel DPH library delivers the same single-core performance as the sequential one.
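
Modulo module plumbing, the vectorised dot-product kernel is essentially the following (again a sketch; module names may differ between dph versions).  Both combinators are intended to fuse into a single traversal, which is why the benchmark is limited by memory bandwidth rather than by compute.

{{{
#!haskell
{-# LANGUAGE ParallelArrays #-}
{-# OPTIONS_GHC -fvectorise #-}
module DotPVectorised (dotp) where

import Data.Array.Parallel
import Data.Array.Parallel.Prelude.Double
import qualified Prelude

-- Dot product of two parallel arrays: multiply pointwise, then sum.
dotp :: [:Double:] -> [:Double:] -> Double
dotp xs ys = sumP (zipWithP (*) xs ys)
}}}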

==== Comments regarding SMVM ====

"SMVM, vectorised" currently needs a lot of tinkering in the form of special rewrite rules and forced inlinings.  We need more expressive rewrite rules; in particular, rules that can express important rewrites for the replicate combinator in its various forms and that optimise the shape computations which enable other optimisations.
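
For context, the nested data-parallel source that gets vectorised here is essentially the textbook SMVM, where the sparse matrix is a nested array of (column index, value) pairs.  This is a sketch; module names may differ between dph versions.

{{{
#!haskell
{-# LANGUAGE ParallelArrays #-}
{-# OPTIONS_GHC -fvectorise #-}
module SMVMVectorised (smvm) where

import Data.Array.Parallel
import Data.Array.Parallel.Prelude.Double
import qualified Prelude

-- Sparse matrix-vector multiplication.  Vectorising the indexing
-- 'v !: i' inside the inner comprehension is what introduces the
-- replicate combinator mentioned above: conceptually, the dense
-- vector is replicated across all rows.
smvm :: [:[:(Int, Double):]:] -> [:Double:] -> [:Double:]
smvm m v = [: sumP [: x * (v !: i) | (i, x) <- row :] | row <- m :]
}}}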

Moreover, "SMVM, primitives" and "SMVM, vectorised" exhibit strange behaviour when going from 2 to 4 threads with the matrix of density 0.001.  This might be a scheduling problem.

=== Execution on greyarea (1x UltraSPARC T2) ===

Hardware spec: 1x 1.4GHz UltraSPARC T2; 8 cores per processor with 8 hardware threads per core; 4MB on-die L2 cache per processor; FB-DIMM; Solaris 5.10

Software spec: GHC 6.11 (from first week of Mar 09) with gcc 4.1.2 for Haskell code; gccfss 4.0.4 (gcc front-end with Sun compiler backend) for C code (as it generates code that is more than twice as fast for numeric computations as vanilla gcc)

|| '''Program''' || '''Problem size''' || '''sequential''' || '''P=1''' || '''P=2''' || '''P=4''' || '''P=8''' || '''P=16''' || '''P=32''' || '''P=64''' ||
|| !SumSq, primitives || 10M || 212/212 || 254/254 || 127/127 || 64/64 || 36/36 || 25/25 || 17/17 || 10/10 ||
|| !SumSq, vectorised || 10M || 212/212 || 254/254 || 128/128 || 64/64 || 32/32 || 25/25 || 17/17 || 10/10 ||
|| !SumSq, ref C || 10M || 120 || – || – || – || – || – || – || – ||
|| DotP, primitives || 100M elements || 937/937 || 934/934 || 474/474 || 238/238 || 120/120 || 65/65 || 38/38 || 28/28 ||
|| DotP, vectorised || 100M elements || 937/937 || 942/942 || 471/471 || 240/240 || 118/118 || 65/65 || 43/43 || 29/29 ||
|| DotP, ref Haskell || 100M elements || – || 934 || 467 || 238 || 117 || 61 || 65 || 36 ||
|| DotP, ref C || 100M elements || – || 554 || 277 || 142 || 72 || 37 || 22 || 20 ||
|| SMVM, primitives || 10kx10k @ density 0.1 || 1102/1102 || 1112/1112 || 561/561 || 285/285 || 150/150 || 82/82 || 63/70 || 54/100 ||
|| SMVM, vectorised || 10kx10k @ density 0.1 || 1784/1784 || 1810/1810 || 910/910 || 466/466 || 237/237 || 131/131 || 96/96 || 87/87 ||
|| SMVM, ref C || 10kx10k @ density 0.1 || 580 || – || – || – || – || – || – || – ||
|| SMVM, primitives || 100kx100k @ density 0.001 || 1112/1112 || 1299/1299 || 684/684 || 653/653 || 368/368 || 294/294 || 197/197 || 160/160 ||
|| SMVM, vectorised || 100kx100k @ density 0.001 || 1824/1824 || 2008/2008 || 1048/1048 || 1010/1010 || 545/545 || 426/426 || 269/269 || 258/258 ||
|| SMVM, ref C || 100kx100k @ density 0.001 || 600 || – || – || – || – || – || – || – ||

All results are in milliseconds, and the pairs report the best/worst wall-clock execution time of three runs.  The column marked "sequential" reports times when linked against `dph-seq`, and the columns marked "P=n" report times when linked against `dph-par` and run in parallel with the specified number of OS threads.

==== Comments regarding !SumSq ====

As on !LimitingFactor.

==== Comments regarding DotP ====

The benchmark scales nicely up to the maximum number of hardware threads.  Memory latency is largely hidden by the excess parallelism.  It is unclear why the Haskell reference implementation "ref Haskell" falls off at 32 and 64 threads.  See also [http://justtesting.org/post/83014052/this-is-the-performance-of-a-dot-product-of-two a comparison graph between LimitingFactor and greyarea].

==== Comments regarding SMVM ====

As on !LimitingFactor, but it scales much more nicely and keeps improving up to four threads per core.  This suggests that memory bandwidth is again a critical factor in this benchmark (which fits well with earlier observations on other architectures).

On this machine, "SMVM, primitives" and "SMVM, vectorised" also show the quirk when going from 2 to 4 threads.  This reinforces the suspicion that it is a scheduling problem.

=== Summary ===

The speedup relative to a sequential C program for !SumSq, DotP, and SMVM on both architectures is illustrated by [http://justtesting.org/post/85103645/these-graphs-summarise-the-performance-of-data two summary graphs].  In all cases, the data parallel Haskell program outperforms the sequential C program by a large margin on 8 cores.  The gray curve is a parallel C program computing the dot product using pthreads.  It clearly shows that the two Quad-Core Xeons with 8x1 threads are memory-limited for this benchmark, and that the C code is barely any faster on 8 cores than the Haskell code.

=== Regular, multidimensional arrays ===

First benchmark results for the multiplication of two dense matrices using `dph-seq` are summarised in [http://www.scribd.com/doc/22091707/Delayed-Regular-Arrays-Sep09 this comparison graph].