Changes between Version 2 and Version 3 of DataParallel/Benchmarks


Ignore:
Timestamp:
Mar 19, 2007 10:19:28 AM (7 years ago)
Author:
chak
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • DataParallel/Benchmarks

    v2 v3  
    66 
    77The speedup for the Xeon box and the !SunFire are in [attachment:speedup-colour.png] and the speedup for our 8x dualcore Opteron NUMA box is in [attachment:serenity-all-speedup-colour.png].  The speedup of on the NUMA machine is limited by the memory bandwidth for smvm.  When we only use one core per CPU, the benchmark scales much better.  Moreover, the memory traffic/compute ratio is slightly more favourable when processing arrays of `Float`s than when processing arrays of `Double`s. 
     8 
     9=== Connected components in undirected graphs === 
     10 
     11This is based on http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#concomp. Currently, two of the tree algorithms described there are implemented:  
     12Awerbuch-Shiloach and the hybrid algorithm. The random mate algorithm needs a data-parallel random number genenerator and is left for later. 
     13 
     14The algorithms are interesting because they actually don't do a lot of computations; they mostly filter and copy edges. Thus, it is perhaps not unreasonable to expect that if they scale well, then so will more computation-intensive ones. Also, they should make any inefficiencies introduced by missed fusion opportunities quite obvious. 
     15 
     16Both algorithms take the number of nodes ({{{n :: Int}}}) and an array of edges ({{{es :: Int :*: Int}}}) and yield an array of {{{Int}}}s of length {{{n}}} where each node is assigned a number between {{{0}}} and {{{n-1}}}. Nodes which are assigned the same number are connected. The algorithms are described and compared in http://citeseer.ifi.unizh.ch/greiner94comparison.html. At the moment, I have only benchmarked them for a random graph with 1000000 nodes and 40000 edges; eventually, I'll add benchmarks for other kinds of graphs described in the paper. The benchmarks have been run on a dual Intel Xeon 2.8 GHz with two cores per processor which effectively gives us 4 processors overall. 
     17 
     18The parallel versions are currently very much slower than the sequential ones, particularly so for Awerbuch-Shiloach. This is because fusion doesn't work for the parallel algorithms at the moment. 
     19 
     20No benchmarks on 4 processors yet, as the 4th PE is currently busy. 
     21 
     22==== Awerbuch-Shiloach ==== 
     23 
     24The sequential code is in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/AwShU.hs and the parallel in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/AwShUP.hs. 
     25 
     26|| '''Version''' || '''Threads''' || '''Time (ms)''' || '''Speedup''' || 
     27|| sequential    ||               ||        1600     ||               || 
     28|| parallel      ||       1       ||       29800     ||               || 
     29||               ||       2       ||       16800     ||      1.8      || 
     30||               ||       3       ||       12800     ||      2.3      || 
     31||               ||       4       ||       ''???''   ||      ''???''  || 
     32 
     33==== Hybrid ==== 
     34 
     35The sequential code is in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/HybU.hs and the parallel in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/HybUP.hs. 
     36 
     37|| '''Version''' || '''Threads''' || '''Time (ms)''' || '''Speedup''' || 
     38|| sequential    ||               ||       1850      ||               || 
     39|| parallel      ||       1       ||       7450      ||               || 
     40||               ||       2       ||       4600      ||      1.6      || 
     41||               ||       3       ||       3800      ||      2.0      || 
     42||               ||       4       ||       ''???''   ||      ''???''  || 
     43 
     44I haven't completely parallelised this one yet (it's only a matter of implementing some parallel combinators).