Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#10678 closed bug (fixed)

integer-gmp's runS seems unnecessarily expensive

Reported by: rwbarton Owned by:
Priority: normal Milestone: 8.0.1
Component: Compiler (CodeGen) Version: 7.10.1
Keywords: Cc: hvr, bgamari, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1103
Wiki Page:

Description (last modified by bgamari)

integer-gmp uses an unsafePerformIO-like operation to work with mutable BigNats (unsafePerformIO and even the IO type are not yet available, since integer-gmp is a dependency of base):

type S s a = State# s -> (# State# s, a #)

-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}

It's tempting to think of such an operation as "free" like an unsafeCoerce, but it is actually somewhat expensive.

Consider plusBigNat for instance. (Most BigNat operations have a similar structure.)

plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
  | isTrue# (eqBigNatWord# x 0##) = y
  | isTrue# (eqBigNatWord# y 0##) = x
  | isTrue# (nx# >=# ny#) = go x nx# y ny#
  | True                  = go y ny# x nx#
  where
    go (BN# a#) na# (BN# b#) nb# = runS $ do
        mbn@(MBN# mba#) <- newBigNat# na#
        (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
        case c# of
              0## -> unsafeFreezeBigNat# mbn
              _   -> unsafeSnocFreezeBigNat# mbn c#

    nx# = sizeofBigNat# x
    ny# = sizeofBigNat# y

The assembly for go begins

00000000000001d0 <integerzmgmp_GHCziIntegerziType_zdwgo_info>:
 1d0:   49 83 c4 28             add    $0x28,%r12
 1d4:   4d 3b a5 58 03 00 00    cmp    0x358(%r13),%r12
 1db:   77 26                   ja     203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
 1dd:   49 c7 44 24 e0 00 00    movq   $0x0,-0x20(%r12)
 1e4:   00 00 
                        1e2: R_X86_64_32S       .text+0x38
 1e6:   4d 89 74 24 e8          mov    %r14,-0x18(%r12)
 1eb:   49 89 7c 24 f0          mov    %rdi,-0x10(%r12)
 1f0:   49 89 74 24 f8          mov    %rsi,-0x8(%r12)
 1f5:   4d 89 04 24             mov    %r8,(%r12)
 1f9:   4d 8d 74 24 e1          lea    -0x1f(%r12),%r14
 1fe:   e9 00 00 00 00          jmpq   203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
                        1ff: R_X86_64_PC32      integerzmgmp_GHCziIntegerziType_runS_info-0x4
 203:   ...                     ; heap overflow

This allocates a 5-word closure (containing a#, na#, b#, nb#) whose code is at .text+0x38 and passes it to runS, which does some stg_ap-style gymnastics to call back into the closure, which in turn reads its free variables back from the heap and finally does the real work. Altogether that's around two dozen extra instructions compared with calling directly from go into the argument of runS.

The old integer-gmp somehow avoided this particular overhead by instead using the implicit "unsafePerformIO" of a foreign import prim which performed both the allocation and the addition. Is this overhead a necessary consequence of doing the work in multiple steps in Haskell?
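
For comparison, the old integer-gmp primitive looked roughly like this (paraphrased from GHC.Integer.GMP.Prim; the exact name and type are from memory and may differ slightly):

-- The Cmm code behind this import allocates the result array, calls
-- GMP's mpn_add, and returns the frozen array, all in one step, so no
-- Haskell-side closure is ever built.
foreign import prim "integer_cmm_plusIntegerzh" plusInteger#
  :: Int# -> ByteArray# -> Int# -> ByteArray# -> (# Int#, ByteArray# #)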

I understand that we cannot allow everything to be inlined and, for example, the newBigNat# to be shared between a plusBigNat and minusBigNat with the same arguments. But once runS has done its job of keeping the newBigNat#/c_mpn_add/unsafeFreeze* together, it would be nice to eliminate it completely in the backend when compiling go, or any inlined version of go.

I'm not sure whether this should be fixed in the code generator or in integer-gmp itself. I'm also aware that this is a tricky subject but haven't really done my homework on the related tickets, so I might be missing something important!

Attachments (1)

runRW-nofib.txt (189.6 KB) - added by rwbarton 3 years ago.
nofib results


Change History (19)

comment:1 Changed 3 years ago by rwbarton

I note that bytestring and text have the same issue (using unsafeDupablePerformIO and runSTRep respectively in the role of runS), so this may be a known problem.
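
For reference, those combinators looked roughly like this at the time (paraphrased from GHC.IO and GHC.ST of that era):

-- bytestring's escape hatch:
unsafeDupablePerformIO :: IO a -> a
unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE unsafeDupablePerformIO #-}

-- text's runST eventually goes through this:
runSTRep :: (forall s. STRep s a) -> a
runSTRep st_rep = case st_rep realWorld# of (# _, r #) -> r
{-# NOINLINE runSTRep #-}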

comment:2 Changed 3 years ago by simonpj

Reid, indeed it is known: #5916. (If you find any other tickets about this, please add them.)

I think the best approach would be as in ticket:5916#comment:16:

  • Inline runS very late, in CorePrep

Some thoughts about this.

  • We'd want a single, magical function which has this behaviour. Things like unsafeDupablePerformIO should call runS.
  • Now I think about it, I wonder if the magical function should instead be this:
    runRW# :: (State# s -> (# State# s, a #)) -> (# State# s, a #)
    {-# NOINLINE runRW# #-}
    runRW# f = f realWorld#
    
    That is, all runRW# does is to apply its argument to a fresh state token.

Now we can define your runS thus:

  runS f = case runRW# f of (# _, r #) -> r

I think we can inline runS freely, which is good because it means that more code is exposed to the optimiser (in particular that case).
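
For instance (a sketch of the case-of-case win this enables; BN# stands in for any single-constructor result):

  -- before inlining runS:
  case runS m of BN# ba# -> e

  -- after inlining runS and case-of-case:
  case runRW# m of (# _, r #) -> case r of BN# ba# -> e

Now the simplifier can cancel the inner case against whatever constructor m's result exposes.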

  • Moreover, if we do this, I think we no longer need lazy, because runRW#'s strictness signature won't be strict in the r part; see Note [unsafeDupablePerformIO has a lazy RHS] in GHC.IO.
  • If CorePrep sees runRW# (f a b), it can generate f a b realWorld#. But if it sees runRW# (\s.e), it should generate e[realWorld#/s]; that is, it should do the beta-reduction on the fly. That's slightly annoying because CorePrep doesn't currently carry around a substitution, but I suppose you could literally call substExpr; this doesn't happen much.
  • Alternatively maybe this could be done in the code generator, effectively treating runRW as a primop. That would be a good plan, except that by the time we get to STG the program is in ANF, so instead of runRW# (f a b) we'd have
    let g = f a b in runRW# g
    
    and before we see the runRW# we'll have generated code to allocate a closure for g. So we'd need to allow STG syntax (for specified primops) to have expressions, not just atoms, as the arguments to the primop.

This might actually be a Jolly Good Thing. At the moment catch# (\s -> e1) (\x s -> e2) allocates two closures before it gets to the catch#, which is pretty stupid since the first thing we do after pushing the exception-catch frame is to execute e1.
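
A rough STG-level sketch of that situation (pseudocode, not actual compiler output):

  -- today: ANF forces both closures to be allocated before the primop runs
  let f = \s -> e1
      h = \x s -> e2
  in catch# f h

  -- with expression arguments to selected primops, the code generator could
  -- push the catch frame and enter e1 directly, allocating no closures:
  catch# (\s -> e1) (\x s -> e2)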

I think changes like this would make modest but significant improvements to many programs. If you'd like to dig into it, I'd gladly help.

Simon

comment:3 Changed 3 years ago by rwbarton

I am working on a patch that adds such a runRW# function with special treatment in CorePrep, since it seems easy and most of the work could be reused for the special primop approach anyway. No benchmarks yet, but it does produce the expected Cmm without the floating-out problems of inlinePerformIO in my simple test, and it passes validate (which might not greatly stress this case, but the new runRW# is being used from integer-gmp and, I guess, also from bytestring via unsafeDupablePerformIO).

comment:4 Changed 3 years ago by simonpj

Great! Thank you.

comment:5 Changed 3 years ago by rwbarton

Results are mildly promising so far but I ran into an unexpected snag related to CPR analysis that is undoing some of the gains. See #10694.

Changed 3 years ago by rwbarton

Attachment: runRW-nofib.txt added

nofib results

comment:6 Changed 3 years ago by rwbarton

With #10694 fixed I am pretty happy with the results so far. Allocations are down for Integer and ByteString users as expected. Module size, compile allocations and compile time are also slightly down on average. The variation in program runtime seems to be due to a combination of noise and #8279. I suspect real world modern Haskell programs may gain more on average due to preferring ByteString and Text over String. The only programs in nofib which use ByteString (none use Text) are three shootout programs that have been highly optimized by hand. Maybe I should try fibon?

For a microbenchmark

import qualified Data.ByteString as B

f :: B.ByteString -> B.ByteString
f s = case B.uncons s of
  Just (c, s') -> B.snoc s' c
  Nothing -> B.empty

allocations are down from 136 bytes to 96 bytes and runtime from 16ns to 13ns (when s is a 9-byte string). I got roughly similar results from an integer-gmp benchmark (repeatedly adding 1 to a large Integer).
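
A sketch of that Integer microbenchmark (the loop shape is assumed, not taken verbatim from the ticket):

import Data.List (foldl')

-- Each (+) on a large positive Integer goes through plusBigNat,
-- and hence through runS/runRW#.
main :: IO ()
main = print (foldl' (+) (2 ^ 4096 :: Integer) (replicate 1000000 1))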

There is more room for improvement, though. Both these microbenchmarks allocate a boxed heap value inside the runRW#, only to immediately unbox it outside the runRW#. Some kind of CPR analysis + w/w-type transformation could eliminate these intermediate allocations. I implemented this transformation manually in bytestring and it shaved off another 40 bytes of allocation (indeed the size of a ByteString heap object) and ~10% of the runtime. However it would be much nicer for GHC to do this automatically. Need to think more on the best way to accomplish this.

comment:7 Changed 3 years ago by simonpj

Good! Can you give a small example of your last para? I bet it's something like

case runRW# (\s -> (# s, I# x #)) of (# _, I# y #) -> blah

I bet that you can make something similar happen with catch#. A new ticket, perhaps.

comment:8 Changed 3 years ago by hvr

Cc: bgamari added

Interesting: there's some (albeit not perfect) overlap between the nofib cases improved here and the regressed nofib cases noted in CompilerPerformance#Comparinginteger-gmp0.5and1.0.

E.g.

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
     bernouilli          -0.3%     -8.2%     -2.4%     -2.4%      0.0%
          fasta          -0.3%     -5.1%     -0.1%     -0.1%      0.0%
          kahan          -0.2%    -11.4%     +0.3%     +0.3%      0.0%
      primetest          -0.3%    -18.5%     -3.5%     -3.6%      0.0%
            rsa          -0.3%    -19.8%     0.060     0.060      0.0%
            scs          -0.4%     -3.2%     -2.0%     -2.0%    +14.3%
         symalg          -0.3%     -2.1%     0.024     0.024      0.0%
      wave4main          -0.2%     -4.1%     -2.1%     -2.1%     -5.3%

vs (source)

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
     bernouilli         -68.7%    +15.3%     0.092     0.092      0.0%
          fasta         -76.4%     -0.1%     -0.6%     -0.7%      0.0%
          kahan         -74.0%    +39.9%     0.131     0.131      0.0%
      primetest         -67.5%    +49.9%     0.050     0.050      0.0%
            rsa         -67.9%    +53.4%     0.012     0.012      0.0%
            scs         -60.2%     +6.8%     +1.2%     +1.2%      0.0%
         symalg         -63.1%     +9.5%     0.005     0.005      0.0%
      wave4main         -69.5%     +0.0%     0.109     0.109      0.0%

comment:9 in reply to:  7 Changed 3 years ago by rwbarton

Replying to simonpj:

Good! Can you give a small example of your last para? I bet it's something like

case runRW# (\s -> (# s, I# x #)) of (# _, I# y #) -> blah

Right, so in integer-gmp, where I previously had

runS :: S RealWorld a -> a      -- type S s a = State# s -> (# State# s, a #)
runS m = case runRW# m of (# _, a #) -> a

when a = BigNat (data BigNat = BN# ByteArray#), I can instead use

runS_BigNat :: S RealWorld BigNat -> BigNat
runS_BigNat m = case runRW# (\s -> case m s of (# s', BN# ba #) -> (# s', ba #)) of
  (# _, ba #) -> BN# ba

The idea is

  • m usually simplifies to something like
    \s0 -> case newByteArray# n s0 of
      (# s1, arr #) -> case someGmpFunction arr s1 of
        s2 -> case unsafeFreezeByteArray# arr s2 of
          (# s3, farr #) -> (# s3, BN# farr #)
    
    so in the "inner case" in runS_BigNat, the constructor BN# will cancel
  • Now the caller of runS_BigNat (say, plusBigNat) has the CPR property because the BN# constructor is visible outside the runRW#
  • Its caller in turn (say, plusInteger) looks like
    plusInteger (Jp# x) (Jp# y) = Jp# (plusBigNat x y)
    
    but Jp#'s argument is unboxed for size reasons
    data Integer = ... | Jp# {-# UNPACK #-} !BigNat | ...
    
    so Jp# (plusBigNat x y) would have meant extracting the field from a BN# and repacking it into a Jp#. But by calling the unboxed worker we can avoid ever allocating an intermediate BN# constructor.

Note that the transformation runS -> runS_BigNat is always valid for any m, even though it looks like a nested-CPR situation; the reason is that the context of the runRW# is strict in the second component of the result pair anyway. To determine when the transformation is a good idea for a particular argument m, we can use regular CPR analysis on the function \s -> case m s of (# _, r #) -> r.

comment:10 Changed 3 years ago by simonpj

Interesting. To do this less manually we would need two things:

  • The expression runRW e has the CPR property if e rw has the nested CPR property.
  • That might lead us to w/w a function like plusBigNat which in turn would lead to expressions like
       case (runRW e) of BN# farr -> farr
    
    we would need some kind of special case to push the case inside the runRW to get
      runRW (\s -> case e s of (# s', r #) ->
                   case r of BN# farr -> (# s', farr #))
    
    That does not look too hard.

It'd be good to resurrect nested CPR (there's a ticket for that, #1600).

Simon

comment:11 Changed 3 years ago by simonpj

Anyway let's land runRW.

comment:12 Changed 3 years ago by bgamari

Milestone: 8.0.1

comment:13 Changed 3 years ago by bgamari

Differential Rev(s): Phab:D1103
Status: new → patch

comment:14 Changed 3 years ago by bgamari

Description: modified (diff)

comment:15 Changed 3 years ago by simonmar

Cc: simonmar added

comment:16 Changed 3 years ago by Ben Gamari <ben@…>

In 351de169/ghc:

New magic function for applying realWorld#

Test Plan: validate

Reviewers: goldfire, erikd, rwbarton, simonpj, austin, simonmar, hvr

Reviewed By: simonpj

Subscribers: simonmar, thomie

Differential Revision: https://phabricator.haskell.org/D1103

GHC Trac Issues: #10678

comment:17 Changed 3 years ago by bgamari

Resolution: fixed
Status: patch → closed