Changes between Version 14 and Version 15 of DataParallel/BenchmarkStatus


Timestamp: Mar 5, 2009 1:03:33 AM
Author: chak
Comment: --

  • DataParallel/BenchmarkStatus

=== Execution on !LimitingFactor (2x Quad-Core Xeon) ===

-Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2; 256-bit-wide memory architecture; Mac OS X Server 10.5.6
+Hardware spec: 2x 3.0GHz Quad-Core Intel Xeon 5400; 12MB (2x6MB) on-die L2 cache per processor; independent 1.6GHz frontside bus per processor; 800MHz DDR2 FB-DIMM; 256-bit-wide memory architecture; Mac OS X Server 10.5.6

Software spec: GHC 6.11 (from end of Feb 09); gcc 4.0.1

-|| '''Program''' || '''Problem size''' || '''sequential''' || '''1 core''' || '''2 cores''' || '''4 cores''' || '''8 cores''' ||
+|| '''Program''' || '''Problem size''' || '''sequential''' || '''P=1''' || '''P=2''' || '''P=4''' || '''P=8''' ||
|| DotP, primitives || 100M elements || 823/823/824 || 812/813/815 || 408/408/409 || 220/223/227 || 210/214/221 ||
|| DotP, vectorised || 100M elements || 823/824/824 || 814/816/818 || 412/417/421 || 222/225/227 || 227/232/238 ||
…
|| SMVM, vectorised || ?? elems, density ?? ||  ||  ||  ||  ||  ||

-All results are in milliseconds, and the triples report best/average/worst execution case time (wall clock) of three runs.  The column marked "sequential" reports times when linked against `dph-seq` and the columns marked "N cores" report times when linked against `dph-par` and run in parallel on the specified number of processor cores.
+All results are in milliseconds, and the triples report best/average/worst execution time (wall clock) of three runs.  The column marked "sequential" reports times when linked against `dph-seq` and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads.

==== Observations regarding DotP ====

Performance is memory bound, and hence, the benchmark stops scaling once the memory bus is saturated.  As a consequence, the wall-clock execution time of the Haskell programs and the C reference implementation is the same when all available parallelism is exploited.  The parallel DPH library delivers the same single-core performance as the sequential one in this benchmark.
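For orientation, the kernel behind the DotP rows is a plain dot product. A minimal sketch in ordinary Haskell (lists here, whereas the benchmarked DPH versions use parallel arrays and their `sumP`/`zipWithP` combinators):

```haskell
-- Illustrative sketch only, not the benchmarked code: the DPH versions
-- are written over parallel arrays, roughly
--   dotp :: [:Double:] -> [:Double:] -> Double
--   dotp xs ys = sumP (zipWithP (*) xs ys)
-- which the vectoriser compiles to flat loops over unboxed arrays.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = sum (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])  -- prints 32.0
```

Because the kernel does one multiply and one add per pair of 8-byte loads, throughput is dictated by the memory system rather than the ALUs, which is why the table plateaus once the bus is saturated.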
+=== Execution on greyarea (1x UltraSPARC T2) ===
+
+Hardware spec: 1x 1.4GHz UltraSPARC T2; 8 cores/processor with 8 hardware threads/core; 4MB on-die L2 cache per processor; FB-DIMM; Solaris 5.10
+
+Software spec: GHC 6.11 (from end of Feb 09); gccfss 4.0.4 (gcc front-end with Sun compiler backend)
+
+|| '''Program''' || '''Problem size''' || '''sequential''' || '''P=1''' || '''P=2''' || '''P=4''' || '''P=8''' || '''P=16''' || '''P=32''' || '''P=64''' ||
+|| DotP, primitives || 100M elements || 937/937 || 934/934 || 474/474 || 238/238 || 120/120 || 65/65 || 38/38 || 28/28 ||
+|| DotP, vectorised || 100M elements || || || || || || || || ||
+|| DotP, ref Haskell || 100M elements || – || || || || || || || ||
+|| DotP, ref C || 100M elements || – || || || || || || || ||
+|| SMVM, primitives || ?? elems, density ?? || || || || || || || || ||
+|| SMVM, vectorised || ?? elems, density ?? || || || || || || || || ||
+
+All results are in milliseconds, and the pairs report best/worst execution time (wall clock) of three runs.  The column marked "sequential" reports times when linked against `dph-seq` and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads.
+
+==== Observations regarding DotP ====
+
+The benchmark scales nicely up to the maximum number of hardware threads.  Memory latency is largely covered by excess parallelism.
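To illustrate why excess parallelism can cover memory latency: each of the T2's 64 hardware threads can make progress while others stall on loads, provided the work is split into enough independent pieces. A hedged sketch of such work splitting using GHC's spark primitives (`par`/`pseq` from `GHC.Conc`); this is not how `dph-par` is implemented (it uses a fixed gang of worker threads over unboxed arrays), and the names below are ours, but it shows the general shape of a chunked parallel reduction:

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical sketch (not from the DPH library): split the input into
-- p chunks, compute a partial dot product per chunk, and spark the
-- partial sums so idle hardware threads can pick them up.
chunkedDotp :: Int -> [Double] -> [Double] -> Double
chunkedDotp p xs ys = parSum (map partial (chunks n (zip xs ys)))
  where
    n = max 1 ((length xs + p - 1) `div` p)   -- chunk size, rounded up
    partial = sum . map (uncurry (*))         -- sequential work per chunk
    chunks _ [] = []
    chunks k zs = let (h, t) = splitAt k zs in h : chunks k t
    parSum []       = 0
    parSum (v : vs) = let rest = parSum vs
                      in rest `par` (v `pseq` v + rest)

main :: IO ()
main = print (chunkedDotp 4 [1 .. 8] [1 .. 8])  -- prints 204.0
```

Built with `ghc -threaded` and run with `+RTS -N8`, the sparks may be evaluated by up to eight OS threads; without `-threaded` they are simply evaluated sequentially, which mirrors the `dph-seq` versus `dph-par` columns above.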

----