Changes between Version 65 and Version 66 of DataParallel/BenchmarkStatus


Timestamp:
Dec 2, 2010 7:43:28 AM
Author:
benl
Comment:

--

  • DataParallel/BenchmarkStatus

    v65 v66  
    2929   Matrix-Matrix multiplication. Size=1024x1024.
    3030
    31   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    32   || repa.mmult.c.seq ||  3.792s || 1 || A ||
    33   || repa.mmult.par.N4 || 2.147s || 1.77 || ||
     31  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     32  || repa.mmult.c.seq ||  3.792s || 1 || 1 || A ||
     33  || repa.mmult.par.N4 || 2.147s || 1.77 || 0.44 || ||
    3434  A: Straightforward C program using triple nested loops. A cache-friendly block-based version would be faster.
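The reference here is the triple-nested-loop C program. The same O(n^3) algorithm can be sketched in plain Haskell for illustration (the name `mmult` and the list representation are assumptions; this is not the Repa version, which uses unboxed arrays):

```haskell
import Data.List (transpose)

-- Naive O(n^3) matrix multiply: each result element is the dot product
-- of a row of a with a column of b. Same algorithm as the triple-loop
-- C reference; a list-based sketch for illustration only.
mmult :: [[Double]] -> [[Double]] -> [[Double]]
mmult a b = [ [ sum (zipWith (*) row col) | col <- bT ] | row <- a ]
  where bT = transpose b
```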
    3535
     
    4141   Solves the Laplace equation in the 2D plane. Size=400x400.
    4242
    43   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    44   || repa.laplace.c.seq ||  1.299s || 1 || A ||
    45   || repa.laplace.par.N4 || 2.521s || 0.51 || ||
     43  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     44  || repa.laplace.c.seq ||  1.299s || 1 || 1 || A ||
     45  || repa.laplace.par.N4 || 2.521s || 0.51 || 0.13 || ||
    4646  A: Straightforward C program using triple nested loops. A cache-friendly block-based version would be faster.
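The computation being iterated is a four-point stencil: each interior grid point is replaced by the average of its neighbours, with the boundary held fixed. A minimal sketch of one relaxation step (the name `relax` and the list-of-lists grid are assumptions; the benchmark itself uses Repa arrays):

```haskell
-- One Jacobi relaxation step for the 2-D Laplace equation: each interior
-- point becomes the average of its four neighbours; boundary values are
-- held fixed. List-based sketch for illustration, not the Repa code.
relax :: [[Double]] -> [[Double]]
relax g = [ [ step i j | j <- [0 .. w - 1] ] | i <- [0 .. h - 1] ]
  where
    h = length g
    w = length (head g)
    step i j
      | i == 0 || j == 0 || i == h - 1 || j == w - 1 = g !! i !! j
      | otherwise = ( g !! (i-1) !! j + g !! (i+1) !! j
                    + g !! i !! (j-1) + g !! i !! (j+1) ) / 4
```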
    4747
     
    7777  Computes the sum of the squares from 1 to N using `Int`.  N = 100M.
    7878
    79   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    80   || dph.sumsq.vector.seq.N4 ||  404ms || 1 || ||
    81   || dph.sumsq.vectorised.seq.N4 || 434ms || 0.93 || ||
    82   || dph.sumsq.vectorised.par.N1 || 443ms || 0.91 || ||
    83   || dph.sumsq.vectorised.par.N2 || 222ms || 1.82 || ||
    84   || dph.sumsq.vectorised.par.N4 || 111ms || 3.63 || ||
     79  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     80  || dph.sumsq.vector.seq.N4 ||  404ms || 1 || 1 || ||
     81  || dph.sumsq.vectorised.seq.N4 || 434ms || 0.93 ||  || ||
     82  || dph.sumsq.vectorised.par.N1 || 443ms || 0.91 || 0.91 || ||
     83  || dph.sumsq.vectorised.par.N2 || 222ms || 1.82 || 0.91 || ||
     84  || dph.sumsq.vectorised.par.N4 || 111ms || 3.63 || 0.91 || ||
    8585
    8686  '''Status''': fine[[br]]
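For reference, the computation being measured is simply the sum of squares up to n, run here at n = 100M. A plain-list sketch (the name `sumSq` is an assumption; the benchmarked versions use Data.Vector and vectorised DPH code):

```haskell
-- Sum of the squares from 1 to n over Int, the computation the
-- dph.sumsq benchmarks measure at n = 100M. List sketch only.
sumSq :: Int -> Int
sumSq n = sum [ x * x | x <- [1 .. n] ]
```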
     
    9191  Computes the dot product of two vectors of `Double`s. N=10M.
    9292
    93   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    94   || dph.dotp.vector.seq.N4 ||  68ms || 1 || ||
    95   || dph.dotp.vectorised.seq.N4 || 58ms || 1.17 || A ||
    96   || dph.dotp.vectorised.par.N1 || 55ms || 1.24 || ||
    97   || dph.dotp.vectorised.par.N2 || 33ms || 2.06 || ||
    98   || dph.dotp.vectorised.par.N4 || 25ms || 2.72 || ||
     93  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     94  || dph.dotp.vector.seq.N4 ||  68ms || 1 || 1 || ||
     95  || dph.dotp.vectorised.seq.N4 || 58ms || 1.17 || || A ||
     96  || dph.dotp.vectorised.par.N1 || 55ms || 1.24 || 1.24 || ||
     97  || dph.dotp.vectorised.par.N2 || 33ms || 2.06 || 1.03 || ||
     98  || dph.dotp.vectorised.par.N4 || 25ms || 2.72 || 0.68 || ||
    9999 
    100100  A: The sequential vectorised version is faster than the Data.Vector version. Why is this?
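The measured computation is the standard dot product, run here at N = 10M. A plain-list sketch (the name `dotp` is an assumption; the benchmarked versions use Data.Vector and vectorised DPH code):

```haskell
-- Dot product of two Double vectors: pairwise multiply, then sum.
-- List sketch of what dph.dotp measures at N = 10M.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = sum (zipWith (*) xs ys)
```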
     
    107107   Takes the even valued `Int`s from a vector. N=10M.
    108108
    109   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    110   || dph.evens.vectorised.seq.N4 || 1.075s || 1 || ||
    111   || dph.evens.vectorised.par.N1 || 736ms ||  1.46 || ||
    112   || dph.evens.vectorised.par.N2 || 768ms ||  1.40 || ||
    113   || dph.evens.vectorised.par.N4 || 859ms ||  1.25 || ||
     109  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     110  || dph.evens.vectorised.seq.N4 || 1.075s || 1 || 1 || ||
     111  || dph.evens.vectorised.par.N1 || 736ms ||  1.46 || 1.46 || ||
     112  || dph.evens.vectorised.par.N2 || 768ms ||  1.40 || 0.70 || ||
     113  || dph.evens.vectorised.par.N4 || 859ms ||  1.25 || 0.31 || ||
    114114
    115115  '''Status''': Benchmark runs slower when number of threads increases. This benchmark invokes {{{packByTag}}} due to the filtering operation. This is probably affecting Quickhull as it also uses filtering. [[br]]
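The filtering operation can be written directly, or in the tag-then-pack style that vectorisation compiles it to, which is where the {{{packByTag}}} call mentioned above comes from. A sketch of both forms (the names `evens` and `evensPack` are assumptions; the real {{{packByTag}}} operates on unboxed arrays, not lists):

```haskell
-- Direct formulation of the benchmark: keep the even-valued Ints.
evens :: [Int] -> [Int]
evens = filter even

-- Roughly what vectorisation produces: compute a flag per element,
-- then pack out the flagged elements (the packByTag pattern).
evensPack :: [Int] -> [Int]
evensPack xs = [ x | (x, True) <- zip xs (map even xs) ]
```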
     
    135135  Sort a vector of doubles by recursively splitting it and sorting the two halves. This is a naive benchmark used for regression testing only. We divide right down to two-point vectors and construct the result using copying append. A production algorithm would switch to an in-place sort once the size of the vector reaches a few thousand elements. N=100k.
    136136
    137   || '''name''' || '''runtime''' || '''speedup''' || '''notes''' ||
    138   || dph.quicksort.vectorised.par.N1 || 428ms ||  1 || ||
    139   || dph.quicksort.vectorised.par.N2 || 400ms ||  1.07 || ||
    140   || dph.quicksort.vectorised.par.N4 || 392ms ||  1.09 || ||
     137  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     138  || dph.quicksort.vectorised.par.N1 || 428ms ||  1 || 1 || ||
     139  || dph.quicksort.vectorised.par.N2 || 400ms ||  1.07 || 0.54 || ||
     140  || dph.quicksort.vectorised.par.N4 || 392ms ||  1.09 || 0.27 || ||
    141141
    142142  '''Status''': Sequential vectorised version does not compile due to a blowup in !SpecConstr.
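The naive algorithm described above can be sketched as follows (the name `qsort` and the list representation are assumptions; the benchmark works on vectors of `Double`s, and as noted a production version would switch to an in-place sort for small inputs):

```haskell
-- Naive divide-and-conquer sort matching the benchmark description:
-- split around a pivot, sort the two parts recursively, and rebuild
-- the result with copying append. Regression-test sketch only.
qsort :: [Double] -> [Double]
qsort xs
  | length xs <= 1 = xs
  | otherwise      = qsort smaller ++ equal ++ qsort larger
  where
    p       = xs !! (length xs `div` 2)
    smaller = [ x | x <- xs, x < p ]
    equal   = [ x | x <- xs, x == p ]
    larger  = [ x | x <- xs, x > p ]
```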
     
    147147
    148148
    149   || '''name''' || '''runtime''' || '''speedup''' || '''notes'''
    150   || dph.quickhull.vector-immutable.seq.N4 || 0.166s || 1 || ||
    151   || dph.quickhull.vectorised.seq.N4 || 0.677s ||  0.24 || 4x slower ||
    152   || dph.quickhull.vectorised.par.N1 || 1.059s ||  0.15 || 6x slower||
    153   || dph.quickhull.vectorised.par.N2 || 0.809s ||  0.21 || ||
    154   || dph.quickhull.vectorised.par.N4 || 0.686s ||  0.24 || ||
    155   || dph.quickhull.vector-mutable.seq.N4 || 0.086s ||  1.93 || A ||
    156   || dph.quickhull.vector-forkIO.par.N4 || 0.064s ||  2.59 || B ||
    157   || dph.quickhull.c.seq || 0.044s || 3.77 || C ||
     149  || '''name''' || '''runtime''' || '''speedup''' || '''efficiency''' || '''notes''' ||
     150  || dph.quickhull.vector-immutable.seq.N4 || 0.166s || 1 || 1 || ||
     151  || dph.quickhull.vectorised.seq.N4 || 0.677s ||  0.24 ||  || 4x slower ||
     152  || dph.quickhull.vectorised.par.N1 || 1.059s ||  0.15 || 0.15 || 6x slower ||
     153  || dph.quickhull.vectorised.par.N2 || 0.809s ||  0.21 || 0.11 || ||
     154  || dph.quickhull.vectorised.par.N4 || 0.686s ||  0.24 || 0.06 || ||
     155  || dph.quickhull.vector-mutable.seq.N4 || 0.086s ||  1.93 || || A ||
     156  || dph.quickhull.vector-forkIO.par.N4 || 0.064s ||  2.59 || 0.65 || B ||
     157  || dph.quickhull.c.seq || 0.044s || 3.77 || || C ||
    158158
    159159 A: Uses mutable Data.Vectors for intermediate buffers.[[br]]
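For orientation, the QuickHull recursion that all of these versions implement can be sketched in plain Haskell: split the points by the line through the two extreme points, then recurse on the point farthest from each line. The names `quickhull`, `cross`, and `go` are assumptions; this list-based sketch is for illustration and is not the vectorised benchmark code:

```haskell
import Data.List (maximumBy)
import Data.Ord  (comparing)

type Point = (Double, Double)

-- 2-D cross product: positive when p lies to the left of the
-- directed line from a to b, and proportional to p's distance from it.
cross :: Point -> Point -> Point -> Double
cross (ax, ay) (bx, by) (px, py) =
  (bx - ax) * (py - ay) - (by - ay) * (px - ax)

-- QuickHull: take the two extreme points, then for each side of the
-- line between them recurse on the farthest point, keeping only the
-- points outside the current edge.
quickhull :: [Point] -> [Point]
quickhull ps
  | length ps < 3 = ps
  | otherwise     = go mn mx ps ++ go mx mn ps
  where
    mn = minimum ps
    mx = maximum ps
    go a b pts = case [ p | p <- pts, cross a b p > 0 ] of
      []   -> [a]
      left -> go a far left ++ go far b left
        where far = maximumBy (comparing (cross a b)) left
```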
     
    193193 * Parallel versions are also run single-threaded (with -N1), and sequential versions are also run with -N4 so that we get the parallel GC.
    194194 * Parallel versions run with -N1 will tend to be slower than natively sequential versions, due to the overhead of supporting parallelism.
     195
     196Speedup
     197 * Runtime of reference / runtime of benchmark.
     198 * Measures how much faster a benchmark is relative to the reference.
     199
     200Relative Efficiency
     201 * Speedup / number of threads.
     202 * Indicates the overhead involved in running the computation in parallel.
     203 * Can be > 1 if the parallel version running with a single thread is faster than the sequential reference version.
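The two derived columns in the tables above follow directly from these definitions. A minimal sketch (the names `speedup` and `efficiency` are assumptions), using the dph.dotp.vectorised.par.N4 row as a worked example:

```haskell
-- Speedup: runtime of the reference version divided by the runtime
-- of the benchmark. E.g. for dotp on 4 threads: 68ms / 25ms = 2.72.
speedup :: Double -> Double -> Double
speedup refTime benchTime = refTime / benchTime

-- Relative efficiency: speedup divided by the number of threads.
-- E.g. 2.72 / 4 = 0.68, as in the dotp table above.
efficiency :: Double -> Int -> Double
efficiency sp threads = sp / fromIntegral threads
```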
    195204
    196205Status