Opened 8 years ago

Closed 8 years ago

#3586 closed bug (fixed)

Initialisation of unboxed arrays is too slow

Reported by: simonpj Owned by: simonmar
Priority: high Milestone: 6.12.2
Component: libraries (other) Version: 6.10.4
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

On the Clean mailing list, Philippos Apolinarius [phi500ac@…] describes a small array program (using the ST monad) that runs 400x more slowly in Haskell than Clean. I can understand a bit slower (e.g. maybe they do a better job of array-bound checks), but 400x seems large. This ticket is an invitation for someone to investigate.

He writes: I wrote a very simple program to check whether Haskell improved its array processing libraries or not. Here is how to compile and run the program arr.hs in Haskell (I have used the GHC compiler):

>ghc -O arr.hs -o arr.exe

$ time arr.exe +RTS -K32000000
2.8e8

real    0m3.938s
user    0m0.031s
sys     0m0.000s

The same program in Clean:

C:\Clean 2.2\exemplos\console>arraytest.exe
280000000
Execution: 0.01  Garbage collection: 0.01  Total: 0.03

C:\Clean 2.2\exemplos\console>arraytest.exe
280000000
Execution: 0.01  Garbage collection: 0.01  Total: 0.03

This means that Clean is 390 times faster than Haskell on this particular problem. These results make me wonder whether Haskell is safer than Clean. It turns out that Haskell checks for out-of-range indices at runtime, exactly like Clean. Larger problems make the difference between Clean and Haskell even worse. For instance, neural networks like the one described in Schmidtt et al. run 400 times faster in Clean.

Below you will find the array examples. You can check that Clean is really much faster than Haskell. I wonder why the Benchmarks Game site does not report such a large difference between Haskell and Clean performance. I believe that the people who wrote the Haskell benchmarks for the Benchmarks Game site cheated by using foreign pointers to access arrays.

-- arr.hs
import Control.Monad.ST
import Data.Array.ST
main = print $ runST
          (do arr <- newArray (1,2000000) 137.0 :: ST s (STArray s Int Double)
              a <- readArray arr 1
              b <- readArray arr 1
              fn 2000000 arr 0.0)


fn i a acc | i < 1 = return acc
fn i a acc = do
             b <- readArray a i
             writeArray a i (b+3.0)
             c <- readArray a i
             fn (i-1) a (acc+c)

Clean version

module arraytest
import StdEnv
fn i a acc | i<1 = acc
fn i a=:{[i]=b} acc
  # a= {a&[i]= b+3.0}
  # (c, a)= a![i]
  = fn (i-1) a (c+acc)
  
Start= fn 2000000 vt 0.0
where
   vt:: .{#Real}
   vt = createArray 2000001 137.0

Attachments (1)

faster-newArray-for-STU.dpatch (7.1 KB) - added by int-e 8 years ago.
Proposed fix; note that there are two patches in the bundle.


Change History (23)

comment:1 Changed 8 years ago by dons

Isn't the problem here that you're using unboxed arrays in Clean, but boxed arrays in Haskell?

Secondly, there's a space leak in the fn function, and a bunch of things that aren't in the Clean code. Fixing things:

Correcting the space leak:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.ST
import Data.Array.ST
main = print $ runST
          (do arr <- newArray (1,2000000) 137.0 :: ST s (STArray s Int Double)
              a <- readArray arr 1
              b <- readArray arr 1
              fn 2000000 arr 0.0)


fn i !a !acc | i < 1 = return acc
fn i a acc = do
             b <- readArray a i
             writeArray a i (b+3.0)
             c <- readArray a i
             fn (i-1) a (acc+c)

Results in:

$ time ./B +RTS -sstderr
./B +RTS -sstderr 
2.8e8

./B +RTS -sstderr  2.95s user 0.07s system 99% cpu 3.034 total

3.034s

Where 98.6% of the time is spent in GC (this is why you don't use boxed mutable arrays when you can use unboxed ones).

Increasing the GC allocation area improves performance 20-fold:

$ time ./B +RTS -sstderr -A200M
./B +RTS -sstderr -A200M 
2.8e8

./B +RTS -sstderr -A200M  0.06s user 0.12s system 100% cpu 0.176 total

0.176s using boxed arrays. Switching to an STUArray and adding some type annotations:

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS -fexcess-precision #-}

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base

main = print $ runST
          (do arr <- newArray (1,2000000) 137.0 :: ST s (STUArray s Int Double)
              fn 2000000 arr 0.0 )


fn :: Int -> STUArray s Int Double -> Double -> ST s Double
fn i !a !acc | i < 1 = return acc
fn i a acc = do
     b <- unsafeRead a i
     unsafeWrite a i (b+3.0)
     c <- unsafeRead a i
     fn (i-1) a (c + acc)


./B +RTS -sstderr  0.26s user 0.01s system 97% cpu 0.283 total

So still not great. Perhaps someone can look into it further?

The original bug report is a duplicate of the boxed, mutable arrays ticket.

comment:2 Changed 8 years ago by dons

How I'd write it:

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS -fvia-C -optc-O3 -fexcess-precision -optc-msse3 #-}

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base

main = print $ runST
          (do arr <- newArray (1,2000000) 137.0 :: ST s (STUArray s Int Double)
              go arr 2000000 0.0 )


go :: STUArray s Int Double -> Int -> Double -> ST s Double
go !a i !acc
    | i < 1     = return acc
    | otherwise = do
         b <- unsafeRead a i
         unsafeWrite a i (b+3.0)
         c <- unsafeRead a i
         go a (i-1) (c+acc)

Which yields this inner loop for 'go':

Main_zdwa_info:
        leaq    16(%r12), %rax
        cmpq    %r15, %rax
        movq    %rax, %r12
        ja      .L4
        testq   %rdi, %rdi
        jle     .L5
        leaq    16(%rsi,%rdi,8), %rdx
        leaq    -16(%rax), %r12
        leaq    -1(%rdi), %rdi
        movsd   (%rdx), %xmm0
        addsd   .LC0(%rip), %xmm0
        addsd   %xmm0, %xmm5
        movsd   %xmm0, (%rdx)
        jmp     Main_zdwa_info

comment:3 Changed 8 years ago by dons

Rewriting to use uvector:

import Data.Array.Vector

main = print . sumU . mapU (+3) $ replicateU 2000000 (137.0 :: Double)

Compiling:

$ ghc -O2 --make Z.hs -fvia-C -optc-O3 -fforce-recomp -optc-msse4

$ time ./Z
2.8e8
./Z  0.01s user 0.00s system 108% cpu 0.012 total

Which I believe is both purely functional and twice as fast as the Clean code.

comment:4 Changed 8 years ago by dons

The uvector version generates an inner loop of:

Main_zdwfold_info:
        cmpq    $2000000, %rsi
        je      .L8
.L4:
        addsd   .LC0(%rip), %xmm5
        leaq    1(%rsi), %rsi
        jmp     Main_zdwfold_info

Fusion FTW.

comment:5 Changed 8 years ago by dons

The problem in the STUArray version is newArray_, which falls back on the generic newArray, which fills the array from a list. It does not fuse.

Rewriting the STUArray version to initialize the array directly, we get results identical to Clean's.

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS -fvia-C -optc-O3 -fexcess-precision -optc-msse3 #-}

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base

main = print $ runST
          (do a <- unsafeNewArray_ (1,2000000)

              let init i | i >= 2000000 = return ()
                         | otherwise    = unsafeWrite a i 137 >> init (i+1)
              init 0

              go a 2000000 0.0 )


go :: STUArray s Int Double -> Int -> Double -> ST s Double
go !a i !acc
    | i < 1     = return acc
    | otherwise = do
         b <- unsafeRead a i
         unsafeWrite a i (b+3.0)
         c <- unsafeRead a i
         go a (i-1) (c+acc)

$ ghc -O2 --make B.hs -fvia-C -optc-O3 -optc-msse4
[1 of 1] Compiling Main             ( B.hs, B.o )
Linking B ...

$ time ./B
2.79999863e8
./B  0.01s user 0.02s system 108% cpu 0.031 total

Recommendation: custom newArray_ methods in the Data.Array.Base module that fill directly with a tail recursive loop.

comment:6 Changed 8 years ago by int-e

Apparently, newArray isn't getting inlined as intended, and thus isn't getting specialized for the STUArray types. Simply moving the implementation of newArray to its own top-level function (which is marked INLINE) helps a lot.
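A minimal sketch of that refactoring, not the actual patch (the name newArrayImpl and the exact body are illustrative): the default-method body moves into a small top-level worker marked INLINE, which GHC can then inline and specialize at each use site.

```haskell
import Data.Array.Base (MArray, safeRangeSize, unsafeNewArray_, unsafeWrite)
import Data.Ix (Ix)

-- Top-level worker carrying the default-method body: allocate the array,
-- then fill every element with the initial value via a tail-recursive loop.
{-# INLINE newArrayImpl #-}
newArrayImpl :: (MArray a e m, Ix i) => (i, i) -> e -> m (a i e)
newArrayImpl (l, u) e = do
    marr <- unsafeNewArray_ (l, u)
    let n = safeRangeSize (l, u)
        go i | i >= n    = return marr
             | otherwise = unsafeWrite marr i e >> go (i + 1)
    go 0
```

With the worker at the top level, an INLINE pragma on it is honoured reliably, whereas inlining of default-method bodies was fragile at the time (see Simon's comments below in the thread).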

Changed 8 years ago by int-e

Proposed fix; note that there are two patches in the bundle.

comment:7 Changed 8 years ago by int-e

Addendum: in my tests, the code generated for forM_ [0..n-1] ... is just as fast as a hand-written loop, if compiled with optimisations enabled.
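As a concrete illustration of that point (a self-contained sketch, not code from the patch; fill is a made-up name), the forM_ style of initialisation looks like this:

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray)
import Data.Array.Base (unsafeNewArray_, unsafeWrite)

-- Allocate without initialisation, then fill with forM_; compiled with -O
-- this should produce a loop comparable to the hand-written one above.
fill :: Int -> Double -> ST s (STUArray s Int Double)
fill n e = do
    marr <- unsafeNewArray_ (1, n)
    forM_ [0 .. n - 1] $ \i -> unsafeWrite marr i e
    return marr
```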

comment:8 Changed 8 years ago by simonpj

Adrian Hey points out that #650 (a long-standing performance bug in GC of mutable arrays) may be relevant here.

Simon

comment:9 in reply to:  8 Changed 8 years ago by int-e

Replying to simonpj:

Adrian Hey points out that #650 (a long-standing performance bug in GC of mutable arrays) may be relevant here.

Yes, that explains why most of the time is spent doing GC in the boxed-array version when the heap size isn't increased. It does not explain why switching to unboxed arrays makes the program slower. (We have tracked down the reason for that; see earlier comments.)

comment:10 Changed 8 years ago by simonmar

Component: Compiler → libraries (other)
Milestone: 6.12.2
Priority: normal → high
Summary: Slow array code → Initialisation of unboxed arrays is too slow

Summary so far:

  • boxed mutable arrays suffer from #650 (no sub-array write barrier in the GC)
  • apparently newArray isn't getting inlined or specialised for STUArray, so initialisation of unboxed arrays goes very slowly. That should be fixable.

comment:11 Changed 8 years ago by dons

Assuming int-e's patch is correct, that looks like a good approach. Float out the initializer and switch to forM_, yielding proper specialization.

Who can go ahead and apply the patch?

comment:12 Changed 8 years ago by simonpj

Concerning int-e's patch, why do you say that forM_ is better than sequence_? That should not happen. I've just tried it with this module:

module Seq where

import Control.Monad
import Data.Array.Base

newArrayImplSeq (l,u) initialValue
  = do let n = safeRangeSize (l,u)
       marr <- unsafeNewArray_ (l,u)
       sequence_ [unsafeWrite marr i initialValue | i <- [0 .. n - 1]]
       return marr

newArrayImplFor (l,u) initialValue
  = do let n = safeRangeSize (l,u)
       marr <- unsafeNewArray_ (l,u)
       forM_ [0 .. n - 1] (\i -> unsafeWrite marr i initialValue)
       return marr


doSeq :: Monad m => (Int -> m ()) -> Int -> m ()
doSeq f n = sequence_ [f i | i <- [1..n]]

doFor :: Monad m => (Int -> m ()) -> Int -> m ()
doFor f n = forM_ [1..n] f

Both versions yield identical code. So why do you think one is better than t'other?

Secondly, it's absolutely true that GHC's current inlining of default methods is flaky, but that is something I'm fixing as part of a long-running project called "the never-ending INLINE patch". When I'm done, your work-around shouldn't be necessary (although it's not harmful). Stay tuned.

Simon

comment:13 in reply to: 12 Changed 8 years ago by int-e

Replying to simonpj:

Both versions yield identical code. So why do you think one is better than t'other?

I'm afraid I must have imagined that. In any case, I'm unable to reproduce my measurements, although I remember a distinct decrease in runtime with a variant of dons' test case, from 0.090 to 0.060 seconds, over multiple measurements. Perhaps I mixed up user and total execution time? Oh well.

So that part of the patch should be left out.

On to more serious matters: the explicitly inlined worker trick works for GHC 6.10.4, but is not sufficient for the 6.12 RC, so that will need a different fix. One idea that works is to add

newArray = newArrayImpl

to all MArray (STUArray ...) ... instances. Of course, getting inlining for default methods to work would be preferable. In fact, for default methods it would be nice to have a special form of the SPECIALIZE pragma as well: one that duplicates the definition for all instances and optimizes it in that context.
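The shape of that per-instance workaround, shown on a toy class rather than the real array library (Describe and describeImpl are illustrative names):

```haskell
-- Toy class standing in for MArray: the default method delegates to a
-- top-level worker, and each instance additionally pins the method to
-- that worker, so GHC can inline and specialize it per instance.
class Describe a where
    describe :: a -> String
    describe = describeImpl

{-# INLINE describeImpl #-}
describeImpl :: Describe a => a -> String
describeImpl _ = "some value"

instance Describe Int where
    describe = describeImpl   -- explicit pin, mirroring newArray = newArrayImpl

instance Describe Bool where
    describe = describeImpl
```

The explicit `describe = describeImpl` in each instance is redundant with the class default, but it gives the optimiser a per-instance definition to specialize, which is exactly what the default method was failing to provide.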

comment:14 Changed 8 years ago by Philippos

I have a genetic programming code that I am trying to translate to Haskell. I applied the suggestions received here to a slightly more realistic situation, where the array ar contains a tree population. The function fn updates the array randomly, in order to simulate crossover and mutation of individuals in the population. The result seems to be very good, i.e., Haskell is 3 to 5 times slower than Clean. I would rather not use unsafe features, so I would appreciate it if someone could substitute something else for (unsafePerformIO (randomList (1, 2000000))). I wonder whether a member of this forum could make the gap between Clean and Haskell even smaller. Here are the programs:

-- Haskell version
import Control.Monad.ST
import Data.Array.ST
import System.Random
import System.IO.Unsafe (unsafePerformIO)

data Op = AND | OR | NOT
data Tree = L Double | T Op [Tree]

main = print $ runST
          (do arr <- newArray (1,2000000) (L 0.0) :: ST s (STArray s Int Tree)
              go arr 2000000 0.0 (unsafePerformIO (randomList (1, 2000000))))

go :: STArray s Int Tree -> Int -> Double -> [Int] -> ST s Double
go a i acc (x1:x2:xs)
  | i < 1 = return acc
  | otherwise = do
      b <- readArray a x1
      writeArray a x2 (setDouble (getDouble b + 3.0))
      c <- readArray a i
      go a (i-1) (acc + getDouble c) xs

-- What I really need is a random index in Haskell.
             
getDouble (L r)= r
getDouble _ = 0.0

setDouble r= L r

randomList r = 
 do 
   g <- getStdGen
   return $ randomRs r g
                
-- ghc -O2 --make bingo.hs -fvia-C -optc-O3 -optc-msse3 -o bingo.exe

-- bingo.exe +RTS -sstderr -K100m -H100


// Clean version
module boxed;
import StdEnv, MersenneTwister, StdTime;

::Op = AND | OR | NOT;
::Tree= L Real | T Op [Tree];


fn i a acc xs | i<1 = acc;
fn i a acc [x1:x2:xs]
   # (i1, i2)= (ri x1, ri x2);
   # (b, a)= a![i1];
   # a= {a&[i2]= setDouble ((getDouble b)+3.0)};
   # (c, a)= a![i];
   = fn (i-1) a ((getDouble c)+acc) xs;

ri x= (abs x) rem 2000000;

getDouble (L r)= r;
getDouble _ = 0.0;

setDouble r= L r;

Start w
  #	(ct,w)= getCurrentTime w;
  	seed= 1+ct.seconds;
  	xs= genRandInt seed;
  = fn (2000000-1) vt 0.0 xs;
 where{
    vt:: *{Tree};
    vt = createArray 2000000 (L 137.0);}

comment:15 Changed 8 years ago by dons

Hello Philippos, you should take up these general design questions on the haskell-cafe@… mailing list, not on the bug tracker.

Some quick points: since you use the Mersenne twister in Clean, you should use it in Haskell too: http://hackage.haskell.org/package/mersenne-random. Also, you should never need to increase the stack size with the -K option. Finally, it is faster to thread a seed through to the random generator than to generate an infinite list of randoms.
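The last point, threading a generator instead of materialising an infinite list, can be sketched with System.Random's pure API (illustrative only; sumRandoms is a made-up name, and the translation below uses mersenne-random instead):

```haskell
import System.Random (StdGen, mkStdGen, randomR)

-- Thread the generator state through the loop explicitly; each call to
-- randomR returns a value together with the next generator state.
sumRandoms :: Int -> StdGen -> Double -> Double
sumRandoms 0 _ acc = acc
sumRandoms n g acc =
    let (x, g') = randomR (0, 1) g   -- x :: Double drawn from [0,1]
    in sumRandoms (n - 1) g' (acc + x)
```

No list of randoms is ever built, so nothing needs to be retained by the GC between iterations.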

Here's how I would translate your Clean program to Haskell:

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS -funbox-strict-fields #-}

--
-- Compile with: ghc -O2 -fvia-C -optc-O3 --make A.hs
-- Run with : ./A +RTS -H200M
--
-- $ time ./A +RTS -H200M
-- 2.77000034e8
-- ./A +RTS -H200M  0.62s user 0.10s system 96% cpu 0.751 total
--

import Control.Monad.ST
import Control.Monad
import Data.Array.ST
import Data.Array.Base
import System.Random.Mersenne

data Op = And | Or | Not

data Tree = L !Double | T !Op [Tree]

main = do
    g  <- getStdGen
    print $ runST $ do
        let l = 0
            u = 2000000 - 1
        vt <- unsafeNewArray_ (l,u)
        forM_ [l..u] $ \i -> unsafeWrite vt i (L 137)
        fn u vt 0 g

fn :: Int -> STArray s Int Tree -> Double -> MTGen -> ST s Double
fn i a !acc g | i < 1     = return acc
fn i a  acc g = do
    i1 <- ri `fmap` unsafeIOToST (random g) :: ST s Int
    i2 <- ri `fmap` unsafeIOToST (random g) :: ST s Int
    b <- a `readArray` i1
    writeArray a i2 (set (get b + 3))
    c <- a `readArray` i1
    fn (i-1) a (get c + acc) g

ri :: Int -> Int
ri x = (abs x) `rem` 2000000

get :: Tree -> Double
get (L r) = r
get _     = 0.0

set :: Double -> Tree
set r = L r

Your Haskell version completes for me in just over 2 minutes. The above Haskell version completes in 0.7s. I'd imagine this is very competitive with the Clean implementation. I give detailed instructions on how to compile this for best results.

comment:16 Changed 8 years ago by dons

And how to run it:

$ ghc A.hs -O2 -fvia-C -optc-O3 --make
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

$ time ./A +RTS -H200M
2.7700091e8
./A +RTS -H200M  0.63s user 0.12s system 93% cpu 0.806 total

comment:17 Changed 8 years ago by dons

And an analysis by int-e of how different optimizations affect performance: http://www.imn.htwk-leipzig.de/~felgenha/test3586/summary. Worth a read to see why I went straight to the fast version.

comment:18 in reply to:  13 Changed 8 years ago by simonpj

Replying to int-e:

On to more serious matters: the explicitly inlined worker trick works for GHC 6.10.4, but is not sufficient for the 6.12 RC, so that will need a different fix. One idea that works is to add

newArray = newArrayImpl

to all MArray (STUArray ...) ... instances. Of course, getting inlining for default methods to work would be preferable. In fact, for default methods it would be nice to have a special form of the SPECIALIZE pragma as well: one that duplicates the definition for all instances and optimizes it in that context.

There's an underlying problem, which is the fragility of inlining of default methods. I'm in the midst of fixing that (the long-awaited INLINE patch), so I propose not to contort the libraries in the meantime. I'll ping you when I commit the patch so that you can check that it does improve matters.

The only downside is that this perf boost won't get into 6.12, which is too bad.

Simon

comment:19 Changed 8 years ago by simonmar

I suggest that:

  • after the INLINE patch lands in HEAD, test that it fixes this issue
  • for 6.12.2 (or 6.12.1 if there's time), apply the workaround of extracting newArray to a top-level function

comment:20 Changed 8 years ago by simonmar

Owner: set to simonmar

I have verified that 6.13.20091111 doesn't have this bug, and I'm testing a patch for 6.12.1 to work around it.

comment:21 Changed 8 years ago by simonmar

Type of failure: Runtime performance bug

comment:22 Changed 8 years ago by simonmar

Resolution: fixed
Status: new → closed

Workaround applied in 6.12.1:

Thu Nov 12 05:38:07 PST 2009  Simon Marlow <marlowsd@gmail.com>
  * Apply a workaround for #3586
  This is a pretty severe performance bug in newArray and newArray_,
  fixed in HEAD, but needs a workaround in STABLE.  The workaround from
  the ticket didn't work, but a slight modification of it did: I had to
  disable the INLINE pragma on the newArray default method.

    M ./libraries/tarballs/array-0.3.0.0.tar.gz