Changes between Version 37 and Version 38 of DataParallel/Regular

Jan 20, 2010 12:36:31 PM (6 years ago)



  • DataParallel/Regular

    v37 v38  
     1=DArrays - Haskell Support for Array Computations =
    23The library provides a layer on top of DPH unlifted arrays to support multi-dimensional arrays, and shape polymorphic
    422423=== Performance of Matrix-Matrix Multiplication ===
    424 We measured the performance of the two matrix multiplication implementations and compared their
    425 performance to C. Both matrices contain (size * size) elements. As we can see, the first version is significantly slower.
     425The following table contains the running times of `mmMult1` and `mmMult2`, applied to two matrices with `size * size` elements each. As mentioned before, `mmMult2` is faster than `mmMult1`, as `replicate` can be implemented more efficiently than the general permutation that results from the element-wise index computation in `mmMult1`. This is the case for most problems: if it is possible to use collection-oriented operations, then doing so will lead to more efficient code. We can also see that using `forceDArray` for improved locality has a big impact on performance (we have O(size*size*size) memory accesses, and creating the transposed matrix has a memory overhead of only O(size*size)). `mmMult1` without the
     426transposed matrix is about as fast as `mmMult2` without `forceDArray` (times omitted). We can also see that the speedup on two processors is close to the optimal speedup of 2.
     428To get an idea of the absolute performance of DArrays, we compared it to two C implementations. The first (handwritten) is a straightforward C implementation with three nested loops, with the iterations re-arranged to get better performance, which has a similar effect on performance as the `forceDArray`/`transpose` step. The second implementation uses the matrix-matrix multiplication operation provided by the MacOS Accelerate library. We can see that, for reasonably large arrays, DArrays is about a factor of 3 slower than the C implementation if run sequentially.
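The difference between element-wise index computation and a transpose-first formulation can be sketched with plain Haskell lists. This is only an illustration of the idea, not the DArray API: the `Matrix` type and the names `mmMultIndexed` and `mmMultTransposed` are hypothetical stand-ins and do not correspond to `mmMult1`, `mmMult2`, or `forceDArray`.

```haskell
import Data.List (transpose)

-- A matrix as a list of rows. This is a stand-in for a two-dimensional
-- DArray, not the library's own type (assumption for illustration).
type Matrix = [[Double]]

-- Index-based version: every result element is computed by looking up
-- individual elements of both matrices by position. Each access to b
-- walks down a column, so consecutive accesses are far apart.
mmMultIndexed :: Matrix -> Matrix -> Matrix
mmMultIndexed a b =
  [ [ sum [ (a !! i !! k) * (b !! k !! j) | k <- [0 .. inner - 1] ]
    | j <- [0 .. cols - 1] ]
  | i <- [0 .. rows - 1] ]
  where
    rows  = length a
    inner = length b
    cols  = case b of
              []      -> 0
              (r : _) -> length r

-- Transpose-first version: b is transposed once (O(size*size) extra
-- memory), after which every dot product traverses two rows in order.
mmMultTransposed :: Matrix -> Matrix -> Matrix
mmMultTransposed a b = [ [ sum (zipWith (*) row col) | col <- bT ] | row <- a ]
  where
    bT = transpose b
```

Both functions compute the same product. The second pays for one extra matrix up front so that the O(size*size*size) element accesses become sequential traversals, which is the same trade-off the transposition step makes.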
    427431  ----------------------------------------------------------------------