Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#8647 closed task (fixed)

Reduce allocations in `integer-gmp`

Reported by: hvr Owned by: ekmett
Priority: normal Milestone: 7.8.1
Component: Core Libraries Version: 7.6.3
Keywords: integer-gmp Cc: core-libraries-committee@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8638 Differential Rev(s):
Wiki Page:

Description

I've added printf(3)s to integer-gmps GMP allocation/reallocation functions, and noticed there to a significant amount of allocations going on.

This is due to the fact that the CMM wrappers call gmpz_init() to initialize the result, and this already allocates a 1-limb StgArray. But the actual GMP operations perform a reallocate-call (which by contract also needs memcpy the untouched limb over to the newly allocated structure) based on the type of operation and the involved operands, causing the optimistically allocated 1-limb structure to become garbage right away w/o even being written to.

I've been able to get rid of most such superfluous allocations by replacing

ccall __gmpz_init(mp_result1 "ptr");

by

MP_INT__mp_alloc(mp_result1) = 0;
MP_INT__mp_size(mp_result1)  = 0;
MP_INT__mp_d(mp_result1)     = 0; // (1)

which is handled properly by all GMP operations we make use of currently. This elegantly lets the very first allocation be performed within the GMP operation (which has better knowledge with respect to how many limbs the result will require)

I've got working proof-of-concept code, which reduces the allocations the big-num heavy nofib programs (I've omitted the benchmarks which had a 0% allocation change):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
...
     bernouilli          -0.1%     -5.2%      0.12      0.12     +0.0%
          kahan          -0.1%    -10.5%      0.17      0.17     +0.0%
      primetest          -0.1%    -12.8%      0.07      0.07     +0.0%
            rsa          -0.1%    -13.8%      0.02      0.02     +0.0%
         symalg          -0.1%     -2.3%      0.01      0.01     +0.0%
...
--------------------------------------------------------------------------------
            Min          -0.1%    -13.8%     -3.1%     -3.1%     -6.5%
            Max          -0.0%     +0.4%     +4.0%     +4.0%     +1.5%
 Geometric Mean          -0.1%     -0.5%     +0.5%     +0.4%     -0.1%

(1) This should actually point to a proper static closure I didn't need this for the proof-of-concept


Another place which helps avoiding temporarily allocating 1-limb StgArrays which live only for the duration of GMP FFI calls are those caused by the toBig-casts, which I'm also experimenting by making use of GMP's big-int/small-int add/sub/mul primitives (the deltas shown above are actually measured on top of this optimization), here's what to expect from such an optimization by itself (i.e. w/o the realloc-optimization from above):

     bernouilli          +0.1%     -4.2%      0.12      0.12     +0.0%
          kahan          +0.1%    -12.6%      0.17      0.17     +0.0%
          power          +0.1%     -2.7%     +2.2%     +2.2%     +0.0%
            rsa          +0.1%     -4.1%      0.02      0.02     +0.0%
            scs          +0.1%     -2.6%     -1.6%     -1.6%     +0.0%

Thus, the kahan benchmark benefits from this optimization by -12.6%, and on top of that another -10.5% by also avoiding the GMP-reallocations.

Attachments (2)

0001-Allocate-initial-1-limb-mpz_t-on-the-Cmm-Stack-re-86.patch (33.6 KB) - added by hvr 5 years ago.
experimental patch for review (see also comment in patch header)
0001-Allocate-initial-1-limb-mpz_t-on-the-Cmm-Stack-re-86.nofib.txt (186.6 KB) - added by hvr 5 years ago.
nofib comparision relative to [8bf9541912c30ffb740d6ab67edcadcfbe4fc80b/integer-gmp]

Download all attachments as: .zip

Change History (24)

comment:1 Changed 5 years ago by hvr

Keywords: integer-gmp added

comment:2 Changed 5 years ago by simonpj

That looks like a very worthwhile improvement.

Please do make sure that you document the new code carefully. By definition, it wasn't obvious to whoever first wrote it!

Simon

comment:3 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In baeeef7af6e37441d1fa28f4197a8cb878a9ef0b/integer-gmp:

Add new `mpz_mul_si`-based primop (re #8647)

This primop helps reducing allocation by being able to pass one `S#`
argument directly to the GMP multiplication primitive without needing to
promote (and thus allocate a `ByteArray#` as well) the `J#` first.

This benefits a few nofib benchmarks wrt to allocations (having most
impact on `kahan` resulting in about 10% less allocations)

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:4 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 8bf9541912c30ffb740d6ab67edcadcfbe4fc80b/integer-gmp:

Add new `mpz_{sub,add}_ui`-based primop (re #8647)

This adds `{plus,minus}IntegerInt#` which help to reduce temporary
allocations in `plusInteger` and `minusInteger`.

This and the previous commit introducing `timesIntegerInt#` (i.e. baeeef7af6e)
result in reduced allocations for the following nofib benchmarks on Linux/amd64:

         Program      Size    Allocs   Runtime   Elapsed  TotalMem
  ------------------------------------------------------------------
      bernouilli     +0.0%     -4.2%      0.12      0.12     +0.0%
           kahan     +0.1%    -12.6%      0.17      0.17     +0.0%
        pidigits     +0.0%     -0.5%     -4.7%     -4.5%     +0.0%
           power     +0.0%     -2.7%     +3.1%     +3.1%     +9.1%
       primetest     +0.0%     -4.2%      0.07      0.07     +0.0%
             rsa     +0.0%     -4.1%      0.02      0.02     +0.0%
             scs     +0.0%     -2.6%     -0.8%     -0.7%     +0.0%
  ------------------------------------------------------------------
             Min     +0.0%    -12.6%     -4.7%     -4.5%     -5.0%
             Max     +0.1%     +0.2%     +3.1%     +3.1%     +9.1%
  Geometric Mean     +0.1%     -0.3%     -0.0%     +0.0%     +0.1%
  ------------------------------------------------------------------

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:5 Changed 5 years ago by hvr

The two commits above implement the 2nd of the two proposed optimizations in the ticket description, the avoid-reallocations-optimization is a bit more difficult to get implemented properly (but I'm still working on it).

Btw, this optimization helps also in cases, when several small-ints are added/multiplied to an accumulator as in e.g.:

sum (toInteger (maxBound :: Int) : [1..10000])
-- or
product [1..100]

Changed 5 years ago by hvr

experimental patch for review (see also comment in patch header)

comment:6 Changed 5 years ago by hvr

Status: newpatch

I attached the patch instead of pushing to Git, as I'm not satisfied yet with it. It's works stable here (passes validate & nofib), but I'm not sure about the unsafeCoerce#.

In my first version of this patch, I used a 3-tuple (# Int#, ByteArray#, Word# #) as return value, but that wasted 1 return register per Integer result, and still had to return a NULL pointer for ByteArray# for the case it wasn't used.

I'm wondering if it would be feasible and sensible to return a properly constructed Integer value from the Cmm wrappers (i.e. to fold tupToInteger into gmp-wrappers.cmm), but I haven't figured out yet how to implement that in Cmm.

Moreover, I'm also still wondering how to properly allocate stack-space in "high-level" Cmm procedures (see posting on ghc-devs@), as the current code seems to work properly only as long the Cmm procedures don't force the code generator to save registers to the stack.

comment:7 in reply to:  6 ; Changed 5 years ago by hvr

Replying to hvr:

I attached the patch instead of pushing to Git, as I'm not satisfied yet with it. It's works stable here (passes validate & nofib), but I'm not sure about the unsafeCoerce#.

In my first version of this patch, I used a 3-tuple (# Int#, ByteArray#, Word# #) as return value, but that wasted 1 return register per Integer result, and still had to return a NULL pointer for ByteArray# for the case it wasn't used.

After a conversation with Duncan, I'm now convinced that the (# Int#, ByteArray# #) + unsafeCoerce# hack is really bad, as the GC could try to follow the ByteArray# pointer even though it was really just a Word# in disguise.

Otoh, passing a 0-pointer for the 3-tuple variant has the same danger.

So now I'm left wondering if I can somehow statically allocate a dummy ByteArray# I can return in Cmm when there's no need to allocate a proper ByteArray# in order to avoid confusing the GC.

comment:8 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In af2ba9c81cf6a3223ccc74c5aa17de18138fd824/integer-gmp:

Wrap `gmpz_tdiv_{q,r,qr}_ui` to optimize `quot`/`rem`

This is useful as `quot`/`rem` are often used with small-int divisors,
like when computing the digits of an `Integer`. This optimization
reduces allocations in the following `nofib` benchmarks:

      Program        Size    Allocs   Runtime   Elapsed  TotalMem
   -----------------------------------------------------------------
        power       +0.3%     -0.8%     -1.2%     -1.2%     +0.0%
    primetest       +0.3%     -3.9%      0.07      0.07     +0.0%
          rsa       +0.3%     -4.0%      0.02      0.02     +0.0%
       symalg       +0.2%     -1.4%      0.01      0.01     +0.0%

This addresses #8647

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:9 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 868b93beb1215a44f5f8251a397bba7fcc0815ad/integer-gmp:

Manually float out `int2Integer# INT_MINBOUND`

This avoids allocating this special value over and over again every
time it's needed, and therefore this addresses #8647.

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:10 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In a3878d172b358b896b3c8302e58199958479d8e5/integer-gmp:

Temporary disable `mpz_gmpz_tdiv_qr_ui` to workaround #8661

I still need to investigated, but for some reason not yet obvious to me,
commit [af2ba9c8/integer-gmp] (re #8647) seems to have triggered #8661
on linux/32

This commit disables the use of the `quotRemIntegerWord#` primop on
32bit (which seems to trigger the issue).

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:11 in reply to:  7 Changed 5 years ago by hvr

Replying to hvr:

Otoh, passing a 0-pointer for the 3-tuple variant has the same danger.

So now I'm left wondering if I can somehow statically allocate a dummy ByteArray# I can return in Cmm when there's no need to allocate a proper ByteArray# in order to avoid confusing the GC.

I've workarounded this by returning a pointer to one of the statically allocated INTLIKE Ints, (which as far as I understand it should correspond to unsafeCoerce#ing an Int into a ByteArray#) which should be somewhat more polite to the garbage collector than coercing a non-pointer 0## :: Word# into a ByteArray#.

The latest patch for review with more details in its commit message can be found at [20d7bfdd29917f5a8b8937fba9b724f7e71cd8dd/integer-gmp].

comment:12 Changed 5 years ago by simonpj

That looks better but would need COPIOUS comments. The ice is thin whenever you use unsafeCoerce.

Simon

comment:13 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 7bdcadda7e884edffb1427f0685493f3a2e5c5fa/integer-gmp:

Allocate initial 1-limb mpz_t on the Stack and introduce MPZ# type

We now allocate a 1-limb mpz_t on the stack instead of doing a more
expensive heap-allocation (especially if the heap-allocated copy becomes
garbage right away); this addresses #8647.

In order to delay heap allocations of 1-limb `ByteArray#`s instead of
the previous `(# Int#, ByteArray# #)` pair, a 3-tuple
`(# Int#, ByteArray#, Word# #)` is returned now. This tuple is given the
type-synonym `MPZ#`.

This 3-tuple representation uses either the 1st and the 2nd element, or
the 1st and the 3rd element to represent the limb(s) (NB: undefined
`ByteArray#` elements must not be accessed as they don't point to a
proper `ByteArray#`, see also `DUMMY_BYTE_ARR`); more specifically, the
following encoding is used (where `⊥` means undefined/unused):

 -  (#  0#, ⊥, 0## #) -> value = 0
 -  (#  1#, ⊥, w   #) -> value = w
 -  (# -1#, ⊥, w   #) -> value = -w
 -  (#  s#, d, 0## #) -> value = J# s d

The `mpzToInteger` helper takes care of converting `MPZ#` into an
`Integer`, and allocating a 1-limb `ByteArray#` in case the
value (`w`/`-w`) doesn't fit the `S# Int#` representation).

The following nofib benchmarks benefit from this optimization:

        Program      Size    Allocs   Runtime   Elapsed  TotalMem
 ------------------------------------------------------------------
     bernouilli     +0.2%     -5.2%      0.12      0.12     +0.0%
         gamteb     +0.2%     -1.7%      0.03      0.03     +0.0%
          kahan     +0.3%    -13.2%      0.17      0.17     +0.0%
         mandel     +0.2%    -24.6%      0.04      0.04     +0.0%
          power     +0.2%     -2.6%     -2.0%     -2.0%     -8.3%
      primetest     +0.1%    -17.3%      0.06      0.06     +0.0%
            rsa     +0.2%    -18.5%      0.02      0.02     +0.0%
            scs     +0.1%     -2.9%     -0.1%     -0.1%     +0.0%
         sphere     +0.3%     -0.8%      0.03      0.03     +0.0%
         symalg     +0.2%     -3.1%      0.01      0.01     +0.0%
 ------------------------------------------------------------------
            Min     +0.1%    -24.6%     -4.6%     -4.6%     -8.3%
            Max     +0.3%     +0.0%     +5.9%     +5.9%     +4.5%
 Geometric Mean     +0.2%     -1.0%     +0.2%     +0.2%     -0.0%

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:14 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In a3616cd624e84d7967257561ad1c80b6793d4b46/ghc:

Lower T4830/allocated_bytes due to [7bdcadda7/integer-gmp] (#8647)

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:15 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In cbde86278d2090e19e62f0dd22682b4381433658/integer-gmp:

Wrap `gmpz_fdiv_{q,r,qr}_ui` to optimize `div`/`mod`

This is similiar to what has been done in [af2ba9c8/integer-gmp] for
`gmpz_tdiv_{q,r,qr}_ui` (re #8647); However, the gain is more modest
here, as performance-conscious code tends to use `quot`/`rem` rather
than `div`/`mod`:

     Program    Size    Allocs   Runtime   Elapsed  TotalMem
 -------------------------------------------------------------
   primetest   +0.3%     -2.4%      0.06      0.06     +0.0%
         rsa   +0.2%     -3.3%      0.02      0.02     +0.0%

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:16 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 8a0f1d2f8336cf6c63d32b76f22df8ee4407f19e/ghc:

Adapt perf values due to [cbde8627/integer-gmp]

These are slight improvements due to optimizations in `integer-gmp` (#8647)

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:17 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 0476327376045838375fbbe4a78c02859a1913cb/integer-gmp:

Dont use big/small-int primops on IL32P64 (i.e. Win/x86_64) for now

This is due to `mpz_*()` functions having @long@ arguments which are
32bit on IL32P64, whereas `Int#` and `Word#` are 64bit wide, causing all
sorts of malfunction due to truncation.

This affects mostly the new big/small-int primops introduced in the
course of #8647, so when `SIZEOF_W != SIZEOF_LONG` we simply fall back
to using the big/big-int primops. big/small primops implemented via the
low-level `mpn_*()` GMP operations are not affected, as those use
`mp_limb_t` arguments.

Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>

comment:18 Changed 5 years ago by thoughtpolice

Status: patchnew

This work is basically completed for 7.8, but Herbert will do more. At the moment this doesn't need to be in patch status.

comment:19 Changed 4 years ago by thoughtpolice

Milestone: 7.8.37.10.1

Moving to 7.10.1

comment:20 Changed 4 years ago by thoughtpolice

Component: libraries (other)Core Libraries
Owner: set to ekmett

Moving over to new owning component 'Core Libraries'.

comment:21 Changed 4 years ago by thomie

Cc: core-libraries-committee@… added
Resolution: fixed
Status: newclosed

Since we're using integer-gmp2 now (#9281), I'm assuming this can be closed. Please re-open if I'm mistaken, of course.

comment:22 Changed 4 years ago by hvr

Milestone: 7.10.17.8.1

Yeah, this has been effectively superseded by #9281 (which ironically took two steps forward and one step back... but that's still being worked on)

Note: See TracTickets for help on using tickets.