Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#10678 closed bug (fixed)

integer-gmp's runS seems unnecessarily expensive

Reported by: rwbarton Owned by:
Priority: normal Milestone: 8.0.1
Component: Compiler (CodeGen) Version: 7.10.1
Keywords: Cc: hvr, bgamari, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1103
Wiki Page:

Description (last modified by bgamari)

integer-gmp uses an unsafePerformIO-like operation to work with mutable BigNats (unsafePerformIO and even the IO type are not yet available, since integer-gmp is a dependency of base):

type S s a = State# s -> (# State# s, a #)

-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}

It's tempting to think of such an operation as "free" like an unsafeCoerce, but it is actually somewhat expensive.

Consider plusBigNat for instance. (Most BigNat operations have a similar structure.)

plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
  | isTrue# (eqBigNatWord# x 0##) = y
  | isTrue# (eqBigNatWord# y 0##) = x
  | isTrue# (nx# >=# ny#) = go x nx# y ny#
  | True                  = go y ny# x nx#
  where
    go (BN# a#) na# (BN# b#) nb# = runS $ do
        mbn@(MBN# mba#) <- newBigNat# na#
        (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
        case c# of
              0## -> unsafeFreezeBigNat# mbn
              _   -> unsafeSnocFreezeBigNat# mbn c#

    nx# = sizeofBigNat# x
    ny# = sizeofBigNat# y

The assembly for go begins

00000000000001d0 <integerzmgmp_GHCziIntegerziType_zdwgo_info>:
 1d0:   49 83 c4 28             add    $0x28,%r12
 1d4:   4d 3b a5 58 03 00 00    cmp    0x358(%r13),%r12
 1db:   77 26                   ja     203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
 1dd:   49 c7 44 24 e0 00 00    movq   $0x0,-0x20(%r12)
 1e4:   00 00 
                        1e2: R_X86_64_32S       .text+0x38
 1e6:   4d 89 74 24 e8          mov    %r14,-0x18(%r12)
 1eb:   49 89 7c 24 f0          mov    %rdi,-0x10(%r12)
 1f0:   49 89 74 24 f8          mov    %rsi,-0x8(%r12)
 1f5:   4d 89 04 24             mov    %r8,(%r12)
 1f9:   4d 8d 74 24 e1          lea    -0x1f(%r12),%r14
 1fe:   e9 00 00 00 00          jmpq   203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
                        1ff: R_X86_64_PC32      integerzmgmp_GHCziIntegerziType_runS_info-0x4
 203:   ...                     ; heap overflow

This allocates a 5-word closure (containing a#, na#, b#, nb#) whose code is at .text+0x38 and passes it to runS, which does some stg_ap-style gymnastics to call back into the closure, which in turn reads its free variables back from the heap and finally does the real work. Altogether that's around two dozen extra instructions compared with calling directly from go into the argument of runS.

The old integer-gmp somehow avoided this particular overhead by instead using the implicit "unsafePerformIO" of a foreign import prim which performed both the allocation and the addition. Is this overhead a necessary consequence of doing the work in multiple steps in Haskell?
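
For comparison, the old integer-gmp primitive looked roughly like this (paraphrased from GHC.Integer.GMP.Prim; the exact name and type are from memory and may differ slightly):

-- The Cmm code behind this import allocates the result array, calls
-- GMP's mpn_add, and returns the frozen array, all in one step, so no
-- Haskell-side closure is ever built.
foreign import prim "integer_cmm_plusIntegerzh" plusInteger#
  :: Int# -> ByteArray# -> Int# -> ByteArray# -> (# Int#, ByteArray# #)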

I understand that we cannot allow everything to be inlined and, for example, the newBigNat# to be shared between a plusBigNat and minusBigNat with the same arguments. But once runS has done its job of keeping the newBigNat#/c_mpn_add/unsafeFreeze* together, it would be nice to eliminate it completely in the backend when compiling go, or any inlined version of go.

I'm not sure whether this should be fixed in the code generator or in integer-gmp itself. I'm also aware that this is a tricky subject but haven't really done my homework on the related tickets, so I might be missing something important!

Attachments (1)

runRW-nofib.txt (189.6 KB) - added by rwbarton 3 years ago.
nofib results


Change History (19)

comment:1 Changed 3 years ago by rwbarton

I note that bytestring and text have the same issue (using unsafeDupablePerformIO and runSTRep respectively in the role of runS), so this may be a known problem.
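
For reference, those combinators looked roughly like this at the time (paraphrased from GHC.IO and GHC.ST of that era):

-- bytestring's escape hatch:
unsafeDupablePerformIO :: IO a -> a
unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE unsafeDupablePerformIO #-}

-- text's runST eventually goes through this:
runSTRep :: (forall s. STRep s a) -> a
runSTRep st_rep = case st_rep realWorld# of (# _, r #) -> r
{-# NOINLINE runSTRep #-}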

comment:2 Changed 3 years ago by simonpj

Reid, indeed it is known: #5916. (If you find any other tickets about this, please add them.)

I think the best approach would be as in ticket:5916#comment:16:

  • Inline runS very late, in CorePrep

Some thoughts about this.

  • We'd want a single, magical function which has this behaviour. Things like unsafeDupablePerformIO should call runS.
  • Now I think about it, I wonder if the magical function should instead be this:
    runRW# :: (State# s -> (# State# s, a #)) -> (# State# s, a #)
    {-# NOINLINE runRW# #-}
    runRW# f = f realWorld#
    
    That is, all runRW# does is to apply its argument to a fresh state token.

Now we can define your runS thus:

  runS f = case runRW# f of (# _, r #) -> r

I think we can inline runS freely, which is good because it means that more code is exposed to the optimiser (in particular that case).
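
For instance (a sketch of the case-of-case win this enables; BN# stands in for any single-constructor result):

  -- before inlining runS:
  case runS m of BN# ba# -> e

  -- after inlining runS and case-of-case:
  case runRW# m of (# _, r #) -> case r of BN# ba# -> e

Now the simplifier can cancel the inner case against whatever constructor m's result exposes.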

  • Moreover, if we do this, I think we no longer need lazy, because runRW#'s strictness signature won't be strict in the r part; see Note [unsafeDupablePerformIO has a lazy RHS] in GHC.IO.
  • If CorePrep sees runRW# (f a b), it can generate f a b realWorld#. But if it sees runRW# (\s.e), it should generate e[realWorld#/s]; that is, it should do the beta-reduction on the fly. That's slightly annoying because CorePrep doesn't currently carry around a substitution, but I suppose you could literally call substExpr; this doesn't happen much.
  • Alternatively maybe this could be done in the code generator, effectively treating runRW as a primop. That would be a good plan, except that by the time we get to STG the program is in ANF, so instead of runRW# (f a b) we'd have
    let g = f a b in runRW# g
    
    and before we see the runRW# we'll have generated code to allocate a closure for g. So we'd need to allow STG syntax (for specified primops) to have expressions, not just atoms, as the arguments to the primop.

This might actually be a Jolly Good Thing. At the moment catch# (\s -> e1) (\x s -> e2) allocates two closures before it gets to the catch#, which is pretty stupid since the first thing we do after pushing the exception-catch frame is to execute e1.
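
A rough STG-level sketch of that situation (pseudocode, not actual compiler output):

  -- today: ANF forces both closures to be allocated before the primop runs
  let f = \s -> e1
      h = \x s -> e2
  in catch# f h

  -- with expression arguments to selected primops, the code generator could
  -- push the catch frame and enter e1 directly, allocating no closures:
  catch# (\s -> e1) (\x s -> e2)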

I think changes like this would make modest but significant improvements to many programs. If you'd like to dig into it, I'd gladly help.

Simon

comment:3 Changed 3 years ago by rwbarton

I am working on a patch that adds such a runRW# function with special treatment in CorePrep, since it seems easy and most of the work could be reused for the special primop approach anyway. No benchmarks yet, but it does produce the expected Cmm without the floating-out problems of inlinePerformIO in my simple test, and it passes validate (which might not greatly stress this case, but the new runRW# is being used from integer-gmp and, I guess, also from bytestring via unsafeDupablePerformIO).

comment:4 Changed 3 years ago by simonpj

Great! Thank you.

comment:5 Changed 3 years ago by rwbarton

Results are mildly promising so far but I ran into an unexpected snag related to CPR analysis that is undoing some of the gains. See #10694.

Changed 3 years ago by rwbarton

Attachment: runRW-nofib.txt added

nofib results

comment:6 Changed 3 years ago by rwbarton

With #10694 fixed I am pretty happy with the results so far. Allocations are down for Integer and ByteString users as expected. Module size, compile allocations and compile time are also slightly down on average. The variation in program runtime seems to be due to a combination of noise and #8279. I suspect real world modern Haskell programs may gain more on average due to preferring ByteString and Text over String. The only programs in nofib which use ByteString (none use Text) are three shootout programs that have been highly optimized by hand. Maybe I should try fibon?

For a microbenchmark

import qualified Data.ByteString as B

f :: B.ByteString -> B.ByteString
f s = case B.uncons s of
  Just (c, s') -> B.snoc s' c
  Nothing -> B.empty

allocations are down from 136 bytes to 96 bytes and runtime from 16ns to 13ns (when s is a 9-byte string). I got roughly similar results from an integer-gmp benchmark (repeatedly adding 1 to a large Integer).
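
A sketch of that Integer microbenchmark (the loop shape is assumed, not taken verbatim from the ticket):

import Data.List (foldl')

-- Each (+) on a large positive Integer goes through plusBigNat,
-- and hence through runS/runRW#.
main :: IO ()
main = print (foldl' (+) (2 ^ 4096 :: Integer) (replicate 1000000 1))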

There is more room for improvement, though. Both these microbenchmarks allocate a boxed heap value inside the runRW#, only to immediately unbox it outside the runRW#. Some kind of CPR analysis + w/w-type transformation could eliminate these intermediate allocations. I implemented this transformation manually in bytestring and it shaved off another 40 bytes of allocation (indeed the size of a ByteString heap object) and ~10% of the runtime. However it would be much nicer for GHC to do this automatically. Need to think more on the best way to accomplish this.

comment:7 Changed 3 years ago by simonpj

Good! Can you give a small example of your last para? I bet it's something like

case runRW# (\s -> (# s, I# x #)) of (# _, I# y #) -> blah

I bet that you can make something similar happen with catch#. A new ticket, perhaps.

comment:8 Changed 3 years ago by hvr

Cc: bgamari added

Interesting: there's some (albeit not perfect) overlap between the nofib cases improved here and the regressed nofib cases noted in CompilerPerformance#Comparinginteger-gmp0.5and1.0.

E.g.

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
     bernouilli          -0.3%     -8.2%     -2.4%     -2.4%      0.0%
          fasta          -0.3%     -5.1%     -0.1%     -0.1%      0.0%
          kahan          -0.2%    -11.4%     +0.3%     +0.3%      0.0%
      primetest          -0.3%    -18.5%     -3.5%     -3.6%      0.0%
            rsa          -0.3%    -19.8%     0.060     0.060      0.0%
            scs          -0.4%     -3.2%     -2.0%     -2.0%    +14.3%
         symalg          -0.3%     -2.1%     0.024     0.024      0.0%
      wave4main          -0.2%     -4.1%     -2.1%     -2.1%     -5.3%

vs (source)

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
     bernouilli         -68.7%    +15.3%     0.092     0.092      0.0%
          fasta         -76.4%     -0.1%     -0.6%     -0.7%      0.0%
          kahan         -74.0%    +39.9%     0.131     0.131      0.0%
      primetest         -67.5%    +49.9%     0.050     0.050      0.0%
            rsa         -67.9%    +53.4%     0.012     0.012      0.0%
            scs         -60.2%     +6.8%     +1.2%     +1.2%      0.0%
         symalg         -63.1%     +9.5%     0.005     0.005      0.0%
      wave4main         -69.5%     +0.0%     0.109     0.109      0.0%

comment:9 in reply to:  7 Changed 3 years ago by rwbarton

Replying to simonpj:

Good! Can you give a small example of your last para? I bet it's something like

case runRW# (\s -> (# s, I# x #)) of (# _, I# y #) -> blah

Right, so in integer-gmp, where I previously had

runS :: S RealWorld a -> a      -- type S s a = State# s -> (# State# s, a #)
runS m = case runRW# m of (# _, a #) -> a

when a = BigNat (data BigNat = BN# ByteArray#), I can instead use

runS_BigNat :: S RealWorld BigNat -> BigNat
runS_BigNat m = case runRW# (\s -> case m s of (# s', BN# ba #) -> (# s', ba #)) of
  (# _, ba #) -> BN# ba

The idea is

  • m usually simplifies to something like
    \s0 -> case newByteArray# n s0 of
      (# s1, arr #) -> case someGmpFunction arr s1 of
        s2 -> case unsafeFreezeByteArray# arr s2 of
          (# s3, farr #) -> (# s3, BN# farr #)
    
    so in the "inner case" in runS_BigNat, the constructor BN# will cancel
  • Now the caller of runS_BigNat (say, plusBigNat) has the CPR property because the BN# constructor is visible outside the runRW#
  • Its caller in turn (say, plusInteger) looks like
    plusInteger (Jp# x) (Jp# y) = Jp# (plusBigNat x y)
    
    but Jp#'s argument is unboxed for size reasons
    data Integer = ... | Jp# {-# UNPACK #-} !BigNat | ...
    
    so Jp# (plusBigNat x y) would have meant extracting the field from a BN# and repacking it into a Jp#. But by calling the unboxed worker we can avoid ever allocating an intermediate BN# constructor.

Note that the transformation runS -> runS_BigNat is always valid for any m, even though it looks like a nested-CPR situation; the reason is that the context of the runRW# is strict in the second component of the result pair anyway. To determine when the transformation is a good idea for a particular argument m, we can use regular CPR analysis on the function \s -> case m s of (# _, r #) -> r.

comment:10 Changed 3 years ago by simonpj

Interesting. To do this less manually we would need two things:

  • The expression runRW e has the CPR property if e rw has the nested CPR property.
  • That might lead us to w/w a function like plusBigNat which in turn would lead to expressions like
       case (runRW e) of BN# farr -> farr
    
    we would need some kind of special case to push the case inside the runRW to get
      runRW (\s -> case e s of (# s', r #) ->
                   case r of BN# farr -> (# s', farr #))
    
    That does not look too hard.

It'd be good to resurrect nested CPR (there's a ticket for that, #1600).

Simon

comment:11 Changed 3 years ago by simonpj

Anyway let's land runRW.

comment:12 Changed 3 years ago by bgamari

Milestone: 8.0.1

comment:13 Changed 3 years ago by bgamari

Differential Rev(s): Phab:D1103
Status: new → patch

comment:14 Changed 3 years ago by bgamari

Description: modified (diff)

comment:15 Changed 3 years ago by simonmar

Cc: simonmar added

comment:16 Changed 3 years ago by Ben Gamari <ben@…>

In 351de169/ghc:

New magic function for applying realWorld#

Test Plan: validate

Reviewers: goldfire, erikd, rwbarton, simonpj, austin, simonmar, hvr

Reviewed By: simonpj

Subscribers: simonmar, thomie

Differential Revision: https://phabricator.haskell.org/D1103

GHC Trac Issues: #10678

comment:17 Changed 3 years ago by bgamari

Resolution: fixed
Status: patch → closed