Opened 6 years ago

Closed 2 years ago

#5916 closed bug (fixed)

runST isn't free

Reported by: tibbe Owned by:
Priority: normal Milestone: 8.0.1
Component: Compiler Version: 7.4.1
Keywords: Cc: carter.schonwald@…, wren@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: simplCore/should_run/runST
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

While optimizing some code I discovered that runST isn't free. I had a function on the form:

f ... = ...let x = runST in (f x)...

Manually transforming it into

f ... = runST (g ...)
 where g ... = do
    ...
    x <- ...
    g x
    ...

(The real example is at https://github.com/tibbe/unordered-containers/commit/58b7815a057b3575c58b746d5971d59d806b1333)

improved performance quite a bit on this, already highly tuned, function. Unfortunately, combining all the calls to runST into one and pulling them "up" is not all good. The code is now less modular, has a somewhat over specified evaluation order, and generally looks more imperative.

The cause of the decrease in performance is that runSTRep cannot be inlined, which causes allocation of closures inline in the code (where none should be necessary.)

The comment next to runSTRep explains why it's implemented the way it is, but I wonder if matters could be improved by making it a primop. If we create a fresh state token every time, instead of reusing realWorld#, it should be impossible for mutable structures to let-float and become CAFs (which is what runSTRep tries to avoid.)

Change History (25)

comment:1 Changed 6 years ago by simonmar

difficulty: Unknown

I chatted with tibbe a little on IRC about this, and one conclusion is that we should probably try harder to inline runSTRep. It has to be inlined very late in order to avoid bad things happening; indeed we used to try to do that, but it didn't work (see the comment on runSTRep). If we did this, then the allocation of the function closure for passing to runST could disappear, which might be critical in an inner loop.

comment:2 Changed 6 years ago by simonpj

I can't make sense of the exmaple in the intro. Could you make a tiny test case that demonstrates the problem?

comment:3 Changed 6 years ago by tibbe

I'm afraid I'm too stuck in my current use case to think of examples from some other domain. This is not a standalone test case, but hopefully it at least makes the problem clear. Imagine we want to write a function:

-- | /O(n)/ Update the element at the given position in this array.
update :: Array e -> Int -> e -> Array e

where Array is some simple wrapper around Array#.

We can write this function using another function, update', in ST.

-- | /O(n)/ Update the element at the given position in this array.
update' :: Array e -> Int -> e -> ST s (Array e)
update' ary idx b = do
    mary <- thaw ary 0 count  -- Copies all elements into a mutable array
    write mary idx b          -- Update the one element
    unsafeFreeze mary         -- Return as an immutable array
  where !count = length ary

update ary idx b = runST (update' ary idx b)

Now, say we have a function that calls update many times in a loop:

data RoseTree a = Rose a (Array (Rose a)) | Nil

insert :: a -> RoseTree a -> RoseTree a
insert x Nil = Rose x ...
insert x (Rose y subtrees) 
    | x == y    = Rose x subtrees
    | otherwise = Rose y $ update subtrees idx (insert x subtreeToUpdate)
  where
    idx = subtreeIndex x y  -- Pick the right subtree to update
    subtreeToUpdate = index subtrees idx
    
index :: Array a -> Int -> a
index = ...

(If you find insert hard to understand, it might be helpful to consider what insert would look like if we were using a tuple of fixed size (e.g. 2 for a binary tree) instead of an Array.)

insert will end up calling runST many times. If we want to avoid the cost of runST at each level of the tree, we could structure the code like so:

insert :: a -> RoseTree a -> RoseTree a
insert x0 t0 = runST (go x0 t0)
    go x Nil = return $ Rose x ...
    go x (Rose y subtrees) 
        | x == y    = return $ Rose x subtrees
        | otherwise = do
            st <- go x subtreeToUpdate
            ary <- update' subtrees idx st
            return $ Rose y ary
      where
        idx = subtreeIndex x y  -- Pick the right subtree to update
        subtreeToUpdate = index subtrees idx

N.B. We have replace the call to update, which embeds a call to runST, with a call to update', which doesn't.

We have now reduced the number of calls to runST from O(log n) to 1, which turns out to be a performance improvement since runST allocates.

comment:4 Changed 6 years ago by simonpj

Thanks for the example. What I don't understand is why does runST allocate. That's the example I'd like to see. In your code

update ary idx b
  = runST (update' ary idx b)

  = {- inline runST; g is a coercion between ST and STRep -}
     runSTRep (update' ary idx b) |> g)

  = {- inline the wrapper on update'
      runSTRep ($wupdate' ary idx b)

  = {- inline runSTRep, admittedly late -}
     case $wupdate' ary idx b realWorld# of (# s', _ #) -> s'

No allocation. So where am I wrong? A tiny repro case of this phenomenon would be great.

I understand about not wanting to take the runST outwards... apart from anything else it makes it sequential and it should not be!

comment:5 Changed 6 years ago by simonpj

This is wrong I think, runSTRep is NOINLINE:

    {-# NOINLINE runSTRep #-}
    runSTRep :: (forall s. STRep s a) -> a
    runSTRep st_rep = case st_rep realWorld# of
        (# _, r #) -> r

We thus need to create a closure containing the arguments to $wupdate' and give it as an argument to runSTRep.

comment:6 Changed 6 years ago by simonpj

Conclusion: inline runSTRep, but very late. The comments suggest this is difficult. Perhaps we could arrange it so full laziness (which is what causes the wrong let floating according to the comment) happens before the last inliner phase. But this feels a bit brittle still, we have to be really careful so any optimizations we run won't result in things becoming shared CAFs.

comment:7 Changed 6 years ago by tibbe

I locally defined my own versions of runST and runSTRep, which I inlined in the last phase. I looked at the core and validated that in my particular case nothing bad had happened. This let me measure the actual performance difference: 7% on my particular benchmark. In addition it let me move back to nice, clean code.

comment:8 Changed 6 years ago by pcapriotti

Milestone: 7.6.1

comment:9 Changed 5 years ago by igloo

Milestone: 7.6.17.6.2

comment:10 Changed 5 years ago by carter

Cc: carter.schonwald@… added

comment:11 Changed 4 years ago by nomeata

Just noting down some idée fixe:

The reason we are not inlining runSTRep is that we don’t want to expressions using the realWorld# token to become shared, right?

How about runSTRep gets a special inlining that mentions all free variables of its argument, using a special token of type stWorld# :: a -> RealWorld#. So runSTRep (f x (z x)) would get unfolded to

case f x (z x) (stWorld# (f,x,z)) of (# _, r #) -> r

The variables mentioned in stWorld#’s argument prevent this from being floated out, and I believe it ouldn’t that solve tibbe’s problem, right? It might also enable more CPR-related optimizations.

When generating STG, the argument of stWorld# would of course be ignored.

(In fact, it seems it would be correct ot have runSTRep e = case e (stWorld# e) of (# _, #r) -> r, which could be formulated as a plain RULE, but that would blow up the core expressions during compilation too much, I’d expect.)

Last edited 4 years ago by nomeata (previous) (diff)

comment:12 Changed 4 years ago by nomeata

Using

{-# INLINE runSTRep #-}
runSTRep :: (forall s. STRep s a) -> a
runSTRep st_rep = case st_rep (stWorld# st_rep) of
                        (# _, r #) -> r

{-# NOINLINE stWorld# #-}
stWorld# :: a -> State# RealWorld
stWorld# _ = realWorld#

it validates. But do we actually have a test for this?

Here would be one:

import Control.Monad.ST
import Data.STRef

main =
    let f n = runST $ do
        ref <- newSTRef 0
        modifySTRef ref (+n)
        readSTRef ref
    in print (f 1 + f 2)

returns 3 in master, returns 4 if I inline runSTRep, and returns 3 with the stWorld#-trick.

So unless someone says that this is not worth it (which would be strange, given that this bug is about an unexpected real-world performance issue), I suggest to give runSTRep a custom unfolding in MkId that replaces runSTRep e by case e (stWorld# $(all free local variables of e)) of (# _, r #) -> r.

comment:13 Changed 4 years ago by Joachim Breitner <mail@…>

In e17a62c14b6a1fce681b06ebd0cc11223841f92f/testsuite:

Test that runST is not inlined prematurely

This resulted form a discussion about #5916.

comment:14 Changed 4 years ago by nomeata

Turns out I spoke too soon (and that the test suite really does not cover this):

This code

main =
    let f () = runST $ do
        ref <- newSTRef 0
        modifySTRef ref (+1)
        readSTRef ref
    in print (f () + f ())

will be broken by my approach (output 3 instead of 2), as there are no free variables that differing – obvious, now that I see it.

Too bad, was hoping to produce something usable today... at least I’ll add a possibly useful testcase.

comment:15 Changed 4 years ago by simonpj

I think the Right Thing here is to add a case to CorePrep that (in effect) inlines runSTRep on the spot. At that point there are no further transformations to worry about.

Also it occurs to me that the C-- code for these two code fragments will be nearly identical

(1)   case st_rep s of (# _, r #) -> r
(2)   st_rep s

The latter will be a tail call; the former will push a return frame, call, and then simply return after that. So we could maybe code-gen (1) just like (2).

comment:16 in reply to:  15 Changed 4 years ago by simonmar

Replying to simonpj:

Also it occurs to me that the C-- code for these two code fragments will be nearly identical

(1)   case st_rep s of (# _, r #) -> r
(2)   st_rep s

The latter will be a tail call; the former will push a return frame, call, and then simply return after that. So we could maybe code-gen (1) just like (2).

Not exactly: (1) evaluates r, but (2) doesn't.

comment:17 Changed 4 years ago by WrenThornton

Cc: winterkoninkje@… added

comment:18 Changed 4 years ago by WrenThornton

Cc: wren@… added; winterkoninkje@… removed

comment:19 Changed 3 years ago by thoughtpolice

Milestone: 7.6.27.10.1

Moving to 7.10.1.

comment:20 Changed 3 years ago by thomie

Type of failure: None/UnknownRuntime performance bug

comment:21 Changed 3 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:22 Changed 2 years ago by simonpj

See also #10678

comment:23 Changed 2 years ago by rwbarton

Test Case: simplCore/should_run/runST

comment:24 Changed 2 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:25 Changed 2 years ago by bgamari

Resolution: fixed
Status: newclosed

The new runRW# primop is just what the doctor ordered. See 351de169e14ad9277aaca653df4a3753c151f7bb

Note: See TracTickets for help on using tickets.