Opened 7 years ago

Last modified 19 months ago

#1216 new bug

Missed opportunity for let-no-esape

Reported by: claus Owned by: simonpj
Priority: normal Milestone: 7.6.2
Component: Compiler Version: 6.6
Keywords: Cc: Bulat.Ziganshin@…, claus.reinke@…, daniel.is.fischer@…, sveina@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

readArray/writeArray call GHC.Arr.index, which seems inexplicably slow
for 2d arrays. inexplicably, because simply copying the default implementation
of index from GHC.Arr into the local module can speed things up considerably.

originally raised in this thread:
http://www.haskell.org/pipermail/haskell-cafe/2007-March/023394.html

shortened example or matrix/vector-multiplication attached. comment out
the first line of myindex to use the local copy. this results in a speedup
from 20s to 13s (time ./Index 100000) on my system, not to mention the
difference in space usage (a factor of 1000 in allocation, according to
+RTS -sstderr..).

Attachments (1)

Index.hs (1.9 KB) - added by claus 7 years ago.

Download all attachments as: .zip

Change History (25)

Changed 7 years ago by claus

comment:1 Changed 7 years ago by guest

  • Cc Bulat.Ziganshin@… added

comment:2 Changed 7 years ago by igloo

  • Milestone set to 6.6.2

We should look into this for 6.6.2.

comment:3 Changed 7 years ago by guest

  • Cc Bulat.Ziganshin@… added; Bulat.Ziganshin@… removed

comment:4 Changed 6 years ago by simonmar

  • Milestone changed from 6.6.2 to 6.8 branch
  • Owner set to simonpj

I've investigated 6.8.1's performance on this example. In summary:

  • the performance difference between calling index and using the inline copy still exists
  • there is virtually no different in allocation between the two, i.e. the "leaky" aspect of the original report doesn't exist
  • Specialisation is working properly: we're calling the specialised version of index for (Int,Int).
  • The difference appears to be inlining + case-of-case. Although we're calling the specialised index, it isn't inlined. We can force it to be inlined with -finline-use-threshold=30, but that doesn't help. It looks like case-of-case isn't being applied after inlining.
  • something else is strange: there's a specialised function named $sa_something in the output, but it isn't referenced from anywhere.

Simon: could you take a look? The code is in ~simonmar/scratch/1216.hs.

comment:5 Changed 6 years ago by simonmar

  • Milestone changed from 6.8 branch to 6.8.3

comment:6 Changed 6 years ago by simonpj

A Friday afternoon, so I thought I'd look at this. The problem is that GHC.Arr.$windex1, which is called from the inner loop has this inlining:

$windex1 :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Base.Int
  {- Arity: 3 Strictness: LLL
     Unfolding: (\ ww :: GHC.Prim.Int# ww1 :: GHC.Prim.Int# ww2 :: GHC.Prim.Int# ->
                 case @ GHC.Base.Int GHC.Prim.<=# ww ww2 of wild {
                   GHC.Base.False
                   -> GHC.Arr.indexError2
                        (GHC.Base.I# ww, GHC.Base.I# ww1)
                        (GHC.Base.I# ww2)
                        GHC.Arr.lvl3
                   GHC.Base.True
                   -> case @ GHC.Base.Int GHC.Prim.<=# ww2 ww1 of wild1 {
                        GHC.Base.False
                        -> GHC.Arr.indexError2
                             (GHC.Base.I# ww, GHC.Base.I# ww1)
                             (GHC.Base.I# ww2)
                             GHC.Arr.lvl3
                        GHC.Base.True -> GHC.Base.I# (GHC.Prim.-# ww2 ww) } }) -}

Note the allocation of the I# box! Why? Becuase GHC.Arr.indexError is not marked as a bottoming function. Why not? Because it looks like this:

[Arity 1
 Str: DmdType L]
GHC.Arr.indexError =
  \ (@ a_aRh) (@ b_aRi) ($dShow_aRD [ALWAYS Just L] :: GHC.Show.Show a_aRh) ->
    let {
      shows1_X1Am [ALWAYS Just L] :: a_aRh -> GHC.Show.ShowS
      [Str: DmdType {aRD->U(C(S)AA)}]
      shows1_X1Am = GHC.Show.shows @ a_aRh $dShow_aRD } in
    let {
      shows2_X1Ap [ALWAYS Just L] :: a_aRh -> GHC.Show.ShowS
      [Str: DmdType {aRD->U(C(S)AA)}]
      shows2_X1Ap = GHC.Show.shows @ a_aRh $dShow_aRD } in
    \ (rng_ah3 [ALWAYS Just L] :: (a_aRh, a_aRh))
      (i_ah5 [ALWAYS Just L] :: a_aRh)
      (tp_ah7 [ALWAYS Just S] :: GHC.Base.String) ->
      GHC.Err.error @ b_aRi (...stuff...)

Note the stupid arity and that the strictness analyser is not spotting the bottom result.

I'll think about how to fix this stupidity.

Simon

comment:7 Changed 6 years ago by simonmar

  • Type changed from bug to run-time performance bug

comment:8 Changed 6 years ago by igloo

  • Milestone changed from 6.8.3 to 6.10.1

comment:9 Changed 6 years ago by simonmar

  • Architecture changed from Unknown to Unknown/Multiple

comment:10 Changed 6 years ago by simonmar

  • Operating System changed from Unknown to Unknown/Multiple

comment:11 Changed 6 years ago by igloo

  • Milestone changed from 6.10.1 to 6.10.2

comment:12 Changed 5 years ago by igloo

  • Milestone changed from 6.10.2 to 6.12.1

comment:13 Changed 4 years ago by simonmar

  • Type of failure set to Runtime performance bug

comment:14 Changed 4 years ago by daniel.is.fischer

  • Cc Bulat.Ziganshin@… claus.reinke@… daniel.is.fischer@… added; Bulat.Ziganshin@… removed

Still slow and also overallocating in 6.10.3:

./indexL 100000 +RTS -sstderr                                             
array (1,40) [snip]                                       
   7,132,475,396 bytes allocated in the heap                                           
         935,808 bytes copied during GC                                                
          31,748 bytes maximum residency (1 sample(s))                                 
          22,992 bytes maximum slop                                                    
               1 MB total memory in use (0 MB lost due to fragmentation)               

  Generation 0: 13604 collections,     0 parallel,  0.20s,  0.16s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   11.34s  ( 11.42s elapsed)
  GC    time    0.20s  (  0.16s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   11.55s  ( 11.58s elapsed)

  %GC time       1.8%  (1.4% elapsed)

  Alloc rate    628,705,232 bytes per MUT second

  Productivity  98.2% of total user, 97.9% of total elapsed

dafis@linux-mkk1:~/Haskell/CafeTesting> ./indexH +RTS -sstderr -RTS 100000
./indexH 100000 +RTS -sstderr
array (1,40) [snip]
       9,800,880 bytes allocated in the heap
          11,908 bytes copied during GC
          31,696 bytes maximum residency (1 sample(s))
          22,992 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    18 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.80s  (  3.80s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.80s  (  3.80s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    2,581,735 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

Both compiled with -O2.

comment:15 Changed 4 years ago by igloo

  • Milestone changed from 6.12.1 to 6.12 branch

comment:16 Changed 4 years ago by guest

  • Cc sveina@… added

Problem still exists on 6.12-rc2

The memory leak appears to have gone away again, but there's a factor of two difference in speed between the two versions of the code.

comment:17 Changed 4 years ago by simonpj

  • Summary changed from indexing 2d arrays is slow and leaky. why? to Missed opportunity for let-no-esape

I've finally gotten around to this, but only in HEAD. It's all to do with inlining array indexing, and doing so without causing grotesque code bloat from copied error messages.

The key things are

  • Adding INLINE pragmas to index in GHC.Arr.
  • Floating out error message code (the bottom-extraction patch)

Results.

           index is    Allocation      Elapsed time
--------------------------------------------------------
GHC 6.10   local            19M          1.68s
           library      14,000M          8.73s

HEAD       local            16M          1.78s
           library          18M          1.52s 

By "index is local" I mean that we use the locally-defined myIndex. By "library" I mean that myIndex is set equal to index so we use the overloaded index from GHC.Arr. (I think the allocation is higher than in the figures above because I'm measuring on a 64-bit machine.)

So now the naive code is fastest. I'm not quite sure why the 'local' version in HEAD is slower than the 'library' version, but I've spent enough time on this.
Wed Dec 16 17:04:41 GMT 2009 simonpj@…

The library patch is this:

  * Mark 'index' as INLINE in GHC.Arr
  
  This makes indexing much faster. See Trac #1216

However I'm leaving the ticket open because I see that the inner loop shows the same phenomenon as #3458: we could greatly reduce allocation by pushing the binding for a local function to nearer its call site (in the scrutinee of a case).

I'll re-title the ticket though.

Simon

comment:18 Changed 4 years ago by igloo

  • Milestone changed from 6.12 branch to 6.12.3

comment:19 Changed 4 years ago by igloo

  • Milestone changed from 6.12.3 to 6.14.1
  • Priority changed from normal to low

comment:20 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:21 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:22 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:23 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1
  • Priority changed from low to normal

comment:24 Changed 19 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2
Note: See TracTickets for help on using tickets.