Changes between Version 65 and Version 66 of DataParallel/BenchmarkStatus


Timestamp:
Dec 2, 2010 7:43:28 AM
Author:
benl
Comment:

--

  • DataParallel/BenchmarkStatus

    v65 v66  
    2929   Matrix-Matrix multiplication. Size=1024x1024. 
    3030 
    31   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    32   || repa.mmult.c.seq ||  3.792s || 1 || A || 
    33   || repa.mmult.par.N4 || 2.147s || 1.77 || || 
     31  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     32  || repa.mmult.c.seq ||  3.792s || 1 || 1 || A || 
     33  || repa.mmult.par.N4 || 2.147s || 1.77 || 0.44 || || 
    3434  A: Straightforward C program using triple nested loops. A cache-friendly block-based version would be faster. 
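For reference, the computation being benchmarked can be sketched in plain Haskell (lists rather than Repa arrays; `mmult` is an illustrative name, not the benchmark's actual code):

```haskell
import Data.List (transpose)

-- Naive matrix multiplication over lists, mirroring the triple-nested-loop
-- structure of the C reference in note A. The Repa benchmark itself works
-- on unboxed arrays; this list version only sketches the algorithm.
type Matrix = [[Double]]

mmult :: Matrix -> Matrix -> Matrix
mmult a b = [ [ sum (zipWith (*) row col) | col <- bT ] | row <- a ]
  where bT = transpose b
```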
    3535 
     
    4141   Solves the Laplace equation in the 2D plane. Size=400x400. 
    4242 
    43   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    44   || repa.laplace.c.seq ||  1.299s || 1 || A || 
    45   || repa.laplace.par.N4 || 2.521s || 0.51 || || 
     43  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     44  || repa.laplace.c.seq ||  1.299s || 1 || 1 || A || 
     45  || repa.laplace.par.N4 || 2.521s || 0.51 || 0.13 || || 
    4646  A: Straightforward C program using triple nested loops. A cache-friendly block-based version would be faster. 
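The kernel under test is a four-point stencil; one Jacobi relaxation step can be sketched over lists (an illustrative sketch with assumed names, not the Repa implementation):

```haskell
-- One Jacobi relaxation step for the 2D Laplace equation: every interior
-- point becomes the average of its four neighbours, boundary values stay
-- fixed. List indexing is O(n) and only for illustration; the benchmark
-- iterates this step over unboxed Repa arrays.
relaxStep :: [[Double]] -> [[Double]]
relaxStep g =
  [ [ if interior i j
        then (at (i-1) j + at (i+1) j + at i (j-1) + at i (j+1)) / 4
        else at i j
    | j <- [0 .. w-1] ]
  | i <- [0 .. h-1] ]
  where
    h = length g
    w = length (head g)
    at i j = g !! i !! j
    interior i j = i > 0 && i < h-1 && j > 0 && j < w-1
```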
    4747 
     
    7777  Computes the sum of the squares from 1 to N using `Int`.  N = 100M. 
    7878 
    79   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    80   || dph.sumsq.vector.seq.N4 ||  404ms || 1 || || 
    81   || dph.sumsq.vectorised.seq.N4 || 434ms || 0.93 || || 
    82   || dph.sumsq.vectorised.par.N1 || 443ms || 0.91 || || 
    83   || dph.sumsq.vectorised.par.N2 || 222ms || 1.82 || || 
    84   || dph.sumsq.vectorised.par.N4 || 111ms || 3.63 || || 
     79  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     80  || dph.sumsq.vector.seq.N4 ||  404ms || 1 || 1 || || 
     81  || dph.sumsq.vectorised.seq.N4 || 434ms || 0.93 ||  || || 
     82  || dph.sumsq.vectorised.par.N1 || 443ms || 0.91 || 0.91 || || 
     83  || dph.sumsq.vectorised.par.N2 || 222ms || 1.82 || 0.91 || || 
     84  || dph.sumsq.vectorised.par.N4 || 111ms || 3.63 || 0.91 || || 
    8585 
    8686  '''Status''': fine[[br]] 
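The kernel here is small enough to state directly; a sketch over plain lists (the benchmarked versions fuse this into a single loop over unboxed values):

```haskell
-- Sum of the squares from 1 to n using Int, as in the sumsq benchmark.
sumSq :: Int -> Int
sumSq n = sum [ x * x | x <- [1 .. n] ]
```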
     
    9191  Computes the dot product of two vectors of `Double`s. N=10M. 
    9292 
    93   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    94   || dph.dotp.vector.seq.N4 ||  68ms || 1 || || 
    95   || dph.dotp.vectorised.seq.N4 || 58ms || 1.17 || A || 
    96   || dph.dotp.vectorised.par.N1 || 55ms || 1.24 || || 
    97   || dph.dotp.vectorised.par.N2 || 33ms || 2.06 || || 
    98   || dph.dotp.vectorised.par.N4 || 25ms || 2.72 || || 
     93  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     94  || dph.dotp.vector.seq.N4 ||  68ms || 1 || 1 || || 
     95  || dph.dotp.vectorised.seq.N4 || 58ms || 1.17 || || A || 
     96  || dph.dotp.vectorised.par.N1 || 55ms || 1.24 || 1.24 || || 
     97  || dph.dotp.vectorised.par.N2 || 33ms || 2.06 || 1.03 || || 
     98  || dph.dotp.vectorised.par.N4 || 25ms || 2.72 || 0.68 || || 
    9999  
    100100  A: The sequential vectorised version is faster than the Data.Vector version. Why is this? 
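The computation being compared, sketched over lists (both the Data.Vector and vectorised versions fuse the multiply and the sum into one pass):

```haskell
-- Dot product of two Double vectors, as in the dotp benchmark.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = sum (zipWith (*) xs ys)
```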
     
    107107   Takes the even valued `Int`s from a vector. N=10M. 
    108108 
    109   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    110   || dph.evens.vectorised.seq.N4 || 1.075s || 1 || || 
    111   || dph.evens.vectorised.par.N1 || 736ms ||  1.46 || || 
    112   || dph.evens.vectorised.par.N2 || 768ms ||  1.40 || || 
    113   || dph.evens.vectorised.par.N4 || 859ms ||  1.25 || || 
     109  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     110  || dph.evens.vectorised.seq.N4 || 1.075s || 1 || 1 || || 
     111  || dph.evens.vectorised.par.N1 || 736ms ||  1.46 || 1.46 || || 
     112  || dph.evens.vectorised.par.N2 || 768ms ||  1.40 || 0.70 || || 
     113  || dph.evens.vectorised.par.N4 || 859ms ||  1.25 || 0.31 || || 
    114114 
    115115  '''Status''': Benchmark runs slower when number of threads increases. This benchmark invokes {{{packByTag}}} due to the filtering operation. This is probably affecting Quickhull as it also uses filtering. [[br]]  
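For reference, the kernel is a simple filter; in the vectorised code this is the operation that compiles down to `packByTag`. A list sketch:

```haskell
-- Keep the even-valued Ints, as in the evens benchmark.
evens :: [Int] -> [Int]
evens = filter even
```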
     
    135135  Sort a vector of doubles by recursively splitting it and sorting the two halves. This is a naive benchmark used for regression testing only. We divide right down to two-point vectors and construct the result using copying append. A production algorithm would switch to an in-place sort once the size of the vector reaches a few thousand elements. N=100k. 
    136136 
    137   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' || 
    138   || dph.quicksort.vectorised.par.N1 || 428ms ||  1 || || 
    139   || dph.quicksort.vectorised.par.N2 || 400ms ||  1.07 || || 
    140   || dph.quicksort.vectorised.par.N4 || 392ms ||  1.09 || || 
     137  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     138  || dph.quicksort.vectorised.par.N1 || 428ms ||  1 || 1 || || 
     139  || dph.quicksort.vectorised.par.N2 || 400ms ||  1.07 || 0.54 || || 
     140  || dph.quicksort.vectorised.par.N4 || 392ms ||  1.09 || 0.27 || || 
    141141 
    142142  '''Status''': Sequential vectorised version does not compile due to a blowup in !SpecConstr. 
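The naive algorithm described above can be sketched over lists; as the description says, it rebuilds the result with copying append rather than sorting in place:

```haskell
-- Naive divide-and-conquer sort: split on a pivot, sort the two halves
-- recursively, and reconstruct with copying append (++). A production
-- algorithm would switch to an in-place sort for small inputs.
qsort :: [Double] -> [Double]
qsort []       = []
qsort (p : xs) = qsort [ x | x <- xs, x < p ]
                 ++ [p]
                 ++ qsort [ x | x <- xs, x >= p ]
```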
     
    147147 
    148148 
    149   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' 
    150   || dph.quickhull.vector-immutable.seq.N4 || 0.166s || 1 || || 
    151   || dph.quickhull.vectorised.seq.N4 || 0.677s ||  0.24 || 4x slower || 
    152   || dph.quickhull.vectorised.par.N1 || 1.059s ||  0.15 || 6x slower|| 
    153   || dph.quickhull.vectorised.par.N2 || 0.809s ||  0.21 || || 
    154   || dph.quickhull.vectorised.par.N4 || 0.686s ||  0.24 || || 
    155   || dph.quickhull.vector-mutable.seq.N4 || 0.086s ||  1.93 || A || 
    156   || dph.quickhull.vector-forkIO.par.N4 || 0.064s ||  2.59 || B || 
    157   || dph.quickhull.c.seq || 0.044s || 3.77 || C || 
     149  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' || 
     150  || dph.quickhull.vector-immutable.seq.N4 || 0.166s || 1 || 1 || || 
     151  || dph.quickhull.vectorised.seq.N4 || 0.677s ||  0.24 ||  || 4x slower || 
     152  || dph.quickhull.vectorised.par.N1 || 1.059s ||  0.15 || 0.15 || 6x slower || 
     153  || dph.quickhull.vectorised.par.N2 || 0.809s ||  0.21 || 0.11 || || 
     154  || dph.quickhull.vectorised.par.N4 || 0.686s ||  0.24 || 0.06 || || 
     155  || dph.quickhull.vector-mutable.seq.N4 || 0.086s ||  1.93 || || A || 
     156  || dph.quickhull.vector-forkIO.par.N4 || 0.064s ||  2.59 || 0.65 || B || 
     157  || dph.quickhull.c.seq || 0.044s || 3.77 || || C || 
    158158 
    159159 A: Uses mutable Data.Vectors for intermediate buffers.[[br]] 
     
    193193 * Parallel versions are also run single threaded (with -N1) and sequential versions are also run with (-N4) so we get the parallel GC. 
    194194 * Parallel versions with -N1 will tend to be slower than natively sequential versions due to overheads for supporting parallelism. 
     195 
     196Speedup 
     197 * Runtime of reference / runtime of benchmark. 
     198 * Measures how much faster a benchmark is relative to the reference. 
     199 
     200Relative Efficiency 
     201 * Speedup / number of threads. 
     202 * Indicates the communication overhead involved with running something in parallel. 
     203 * Can be > 1 if the parallel version running with a single thread is faster than the sequential reference version. 
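The two definitions above amount to a pair of one-line functions; a sketch with assumed parameter names (`refTime` for the reference runtime, `threads` for the thread count):

```haskell
-- Speedup: runtime of reference / runtime of benchmark.
speedup :: Double -> Double -> Double
speedup refTime time = refTime / time

-- Relative efficiency: speedup / number of threads.
efficiency :: Double -> Double -> Double
efficiency spd threads = spd / threads
```

For example, the mmult row: speedup 3.792 2.147 is about 1.77, and dividing by 4 threads gives an efficiency of about 0.44, matching the table.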
    195204 
    196205Status