Opened 6 years ago

Closed 4 months ago

#3065 closed bug (duplicate)

Reorder tests in quot to improve code

Reported by: simonpj Owned by: ekmett
Priority: lowest Milestone: 7.10.1
Component: Core Libraries Version: 6.10.1
Keywords: Cc: bertram.felgenhauer@…, kr.angelov@…, dterei, core-libraries-committee@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

[SLPJ: capturing an email thread in a ticket]

Krasimir Angelov found that this function:

a `quot` b
    | b == 0                     = divZeroError
    | a == minBound && b == (-1) = overflowError
    | otherwise                  =  a `quotInt` b

is expanded to:

a `quot` b =
   if b == 0
     then divZeroError
     else if a == minBound
              then if b == (-1)
                       then overflowError
                       else a `quotInt` b
              else a `quotInt` b

Then the compiler sees that b is a constant and computes that b == 0
is False and b == (-1) is also False so it could eliminate two If
statements. The result is:

a `quot` b =
   if a == minBound
     then a `quotInt` b
     else a `quotInt` b

and this is exactly what we get. I bet that if the original function was:

a `quot` b
    | b == 0                     = divZeroError
    | b == (-1) && a == minBound = overflowError   -- Note the changed order here
    | otherwise                  =  a `quotInt` b

then we would get what we want. I think that it is much more often to
have division where the divisor is known so we will get the best code
in this case.

Shortly afterwards, Bertram Felgenhauer tried it out. It works, and it has the desired effect. It's not always a win though; the nofib prime sieve benchmark suffers.

For a patch, see

http://int-e.home.tlink.de/haskell/ghc-libraries-base-tune-division.patch

(SLPJ: I've attached it to the ticket too)

Nofib results extract:

------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed
------------------------------------------------------------------------
           fish          -0.7%     -5.3%      0.05      0.04
         primes          -0.0%    +28.5%    +25.6%    +25.5%
   wheel-sieve2          -0.0%     -0.3%    -17.9%    -18.6%
------------------------------------------------------------------------
            Min          -0.9%     -5.3%    -17.9%    -18.6%
            Max          +0.1%    +28.5%    +25.6%    +25.5%
 Geometric Mean          -0.2%     +0.2%     -0.0%     +0.2%

'primes' is an outlier - the other slowdowns are below 3%

What happens in 'primes' is that 'mod' no longer gets inlined;
apparently it now looks bigger to the compiler than before.

Using -funfolding-use-threshold=10 brings the benchmark back to its
original speed, despite the extra comparison before doing the
division.

In 'wheel-sieve2', the first different optimization choice I see is
again a 'mod' that is no longer inlined; this leads to a change in other
inlining choices that result in a speedup for reasons that I have not
investigated.

Attachments (1)

ghc-libraries-base-tune-division.patch.gz (126.1 KB) - added by simonpj 6 years ago.
Bertram's patch

Download all attachments as: .zip

Change History (30)

Changed 6 years ago by simonpj

Bertram's patch

comment:2 Changed 5 years ago by simonmar

  • Type of failure set to Runtime performance bug

comment:3 Changed 5 years ago by rl

Bumping this ticket as using any kind of integer division in an inner loop tends to create unnecessary join points and thus interferes with array fusion. I don't understand why there is an overflow test at all. I certainly wouldn't expect there to be one as Int arithmetic is usually unchecked. The report says in 6.4.2:

The quot, rem, div, and mod class methods satisfy these laws if y is non-zero:

(x `quot` y)*y + (x `rem` y) == x
(x `div`  y)*y + (x `mod` y) == x

This seems to imply that minBound `quot` (-1) should overflow instead of raising an exception. Doesn't the current behaviour violate the language definition?

comment:4 Changed 5 years ago by simonpj

This ticket badly needs someone to "own" it, propose what to do, and measure it. Any volunteers?

Simon

comment:5 Changed 5 years ago by igloo

See #1042 for why we need a test, and why we decided to raise an exception rather than returning a value.

comment:6 follow-up: Changed 5 years ago by rl

I see. I guess the problem with #1042 was that x86 raises an exception on int division overflow. IMO the right thing to do here is to use more bits for division in the generated code. This is what gcc does, for instance.

I really don't like the overflow checks. They are inconsistent with the rest of Int arithmetic and can have a big impact on the performance of tight loops. I propose to get rid of them. This would require a patch to base which is easy to do (I can do that) and a change to the NCG which is probably a bit harder (I have no idea how to do this).

comment:7 in reply to: ↑ 6 Changed 5 years ago by simonmar

Replying to rl:

I see. I guess the problem with #1042 was that x86 raises an exception on int division overflow. IMO the right thing to do here is to use more bits for division in the generated code. This is what gcc does, for instance.

Are you sure gcc does this? Trying the example from #1042, it generates a floating point exception:

$ cat 1042.c
#include <stdio.h>

main () {
    int i = -2147483648;
    int j = -1;
    int k = i / j;
    printf("%d\n", k);
}
$ gcc 1042.c --std=c99
1042.c:3: warning: return type defaults to ‘int’
$ ./a.out
[1]    18136 floating point exception (core dumped)  ./a.out

I'm not sure if its possible to do a division with a 64-bit result on 32-bit x86: the idivl instruction takes a 64-bit divident, but gives you a 32-bit quotient and remainder.

comment:8 Changed 5 years ago by rl

Doh. Your program prints -2147483648 with -O2 (which is what I tried) but that's only because of gcc's constant folding. You are right that it traps if it doesn't see the constants. I thought it handled this differently, sorry for the confusion.

However, that doesn't really change my opinion. If x86 doesn't have a 64-bit div instruction, then I think that having the NCG insert the necessary test when generating code for quotInt# or a calling a 64-bit div subroutine would result in much better code in the vast majority of cases. Exposing the test to GHC really tends to break loop performance. It also imposes a performance penalty on signed types smaller than Int which would work just fine without the test.

comment:9 Changed 5 years ago by simonmar

All things being equal, we should expose the test to GHC because that enables more intelligent decisions to be made in the simplifier and the code generator. If it results in bad loop code, we should fix that, rather than work around it by making the test part of the primitive.

The narrower Int types have their own implementations of quot, which currently include an overflow test. It might be possible in those cases to just omit the test, although that would make the behaviour of Int32 on a 64-bit platform different from on a 32-bit platform in this particular case.

comment:10 Changed 5 years ago by rl

In general, I agree with you but IMO things are far from equal here. The test is only there because of one particular quirk of one particular processor architecture. It isn't needed on the Sparc, for instance. There, we pay a performance penalty for no reason at all. Even on x86, it isn't needed for small integer types and I suspect it isn't needed for Int64 on x86-32. IMO, things like that are best handled by the processor-specific backend. The code generator should still be able to see the test. The simplifier isn't doing anything clever with it anyway and I don't see that changing in the near future (see, for instance #2132).

comment:11 Changed 5 years ago by simonmar

Currently the meaning of overflow in division is to throw a (software) exception at runtime. If we were to change the meaning of overflow in division to be truncation instead, and to apply that consistently, then yes I agree with you that the test in quot is only necessary on some processors, and therefore arguably should be in the NCG.

I'd be happy to make that change, although I did suggest it before (in the comments on #1042) and Ian argued for an exception instead, hence the current behaviour.

comment:12 Changed 5 years ago by igloo

I don't feel strongly about the behaviour.

comment:13 follow-up: Changed 5 years ago by rl

I don't feel strongly about the behaviour, either. But that's precisely the point: since the semantics doesn't really matter in this particular case, we should pick whatever has the best performance.

comment:14 follow-up: Changed 5 years ago by simonmar

Ok, so we should go ahead then. It can be done in two steps:

  • change the behaviour of overflow in quot to mean truncation, not an exception. Remove the tests from the smaller Int types.
  • move the test from quotInt into the NCG, or perhaps CgPrimOp?

volunteers?

comment:15 in reply to: ↑ 13 ; follow-up: Changed 5 years ago by isaacdupree

Replying to rl:

I don't feel strongly about the behaviour, either. But that's precisely the point: since the semantics doesn't really matter in this particular case, we should pick whatever has the best performance.

We need to pick the same semantics for all CPUs. I worry about whether the best-performing semantic differs... at least it might differ on 32-bit vs. 64-bit for us

comment:16 in reply to: ↑ 15 Changed 5 years ago by simonmar

Replying to isaacdupree:

We need to pick the same semantics for all CPUs. I worry about whether the best-performing semantic differs... at least it might differ on 32-bit vs. 64-bit for us

The semantics will be the same for all CPUs. The proposed change can't make performance any worse, but it can improve performance on some CPUs where the native division semantics matches what we expose in Haskell.

comment:17 in reply to: ↑ 14 Changed 5 years ago by rl

Replying to simonmar:

volunteers?

I'd be happy to do it but I don't really know enough about the NCG so I'd need to be pointed in the right direction.

comment:18 Changed 5 years ago by simonmar

I would do it in the translation from STG to Cmm, i.e. in CgPrimOp. It can be target-specific if necessary - the Cmm translation is already target-specific in various ways.

comment:19 Changed 5 years ago by igloo

  • Milestone changed from 6.12 branch to 6.12.3

comment:20 Changed 5 years ago by igloo

  • Milestone changed from 6.12.3 to 6.14.1
  • Priority changed from normal to low

comment:21 Changed 4 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:22 Changed 4 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:23 Changed 4 years ago by dterei

  • Cc dterei added

comment:24 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:25 Changed 3 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1
  • Priority changed from low to lowest

comment:26 Changed 3 years ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2

comment:27 Changed 9 months ago by thoughtpolice

  • Milestone changed from 7.6.2 to 7.10.1

Moving to 7.10.1.

comment:28 Changed 6 months ago by thoughtpolice

  • Component changed from libraries/base to Core Libraries
  • Owner set to ekmett

Moving over to new owning component 'Core Libraries'.

comment:29 Changed 4 months ago by thomie

  • Cc core-libraries-committee@… added
  • Resolution set to duplicate
  • Status changed from new to closed

This is a duplicate of #5161, which was closed by these 2 commits:

  • 08db4daba1fdc14ae9647beec3f5162732a9cf4d
    Author: Denys Rtveliashvili <>
    Date:   Thu Apr 28 08:45:10 2011 +0100
    
        Performance improvement for division: got rid of an unnecessary branching in cases where the second argument is a constant and is not -1.
    
Note: See TracTickets for help on using tickets.