Opened 12 months ago

Closed 9 months ago

#9013 closed bug (fixed)

plusWord2# is buggy

Reported by: pumpkin Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions: Phab:D137

Description

Looks like its behavior differs between optimization levels:

{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Prim
import GHC.Word

big :: Word
big = maxBound

good = let x = case big of W# w -> w in case plusWord2# x x of (# a, b #) -> (W# a, W# b)

main = do
  putStrLn $ case good of
    (0, z) -> "Such optimal, much sad"
    (1, z) -> "No optimization here"

I thought I was going crazy! Try the above code with and without -O2 and you'll get different output.

Change History (12)

comment:1 Changed 12 months ago by pumpkin

  • Summary changed from addWord2# is buggy to plusWord2# is buggy

comment:2 Changed 12 months ago by pumpkin

I'm attempting to write a fast native Haskell multiprecision natural library, so those *2 variants of the primops are quite handy.

comment:3 Changed 12 months ago by pumpkin

Reid Barton tracked down the culprit on IRC: https://github.com/ghc/ghc/blob/master/compiler/nativeGen/X86/Ppr.hs#L553

So it looks like:

  1. it's not specific to plusWord2#
  2. it only happens on compile-time literals (so is admittedly pretty rare)
  3. it's not always sound to switch an add to a dec because the latter doesn't affect the carry flag and the former does

comment:4 Changed 12 months ago by simonpj

The offending code you point to is

pprInstr (ADD size (OpImm (ImmInt (-1))) dst)
  = pprSizeOp (sLit "dec") size dst

Interestingly, this line dates back at least before 2005; see 7d61cb61 for example.

It's surprising to me that

  • it's the pretty-printer that is doing some peephole-style optimisations
  • there is no case for optimising SUB src 1 to dec src, or SUB src (-1).

Moreover, it's clearly too late in the pipeline, because by this point the difference between plusWord# and plusWord2# has disappeared. So fixing the problem by deleting these two lines would lose a perfectly good optimisation for plusWord#.

Surely it'd be better for some earlier phase to optimise (plusWord# x (-1)) to (minusWord# x 1); but of course not to do so for plusWord2# since (as you say) the carry is affected differently. Moreover, the transformation would then be platform-independent, rather than per-platform. Although, now I think about it, the reason for doing the optimisation is to expose the possibility for subsequent, platform-specific inc/dec optimisations. Maybe that would be OK if documented. Or maybe we could simply do it in passage from Cmm to native code (when we know what primop is being used) rather than in the pretty printer?

It looks to me as if plusWord2# is the primop WordAdd2Op, which in turn is translated to the CallishMachOp called MO_Add2, which in turn is translated (on X86) to ADC. So now I'm confused how the pretty-printer's optimisation of ADD affected this program.

Would someone like to dig a little deeper and propose a fix (other than the sticking-plaster of deleting above two lines)?

Simon

PS: great catch BTW!

Last edited 12 months ago by simonpj (previous) (diff)

comment:5 Changed 12 months ago by pumpkin

Thanks! It was a lucky (or unlucky) test case. Good point about the ADD vs. ADC, and I'm not sure. I don't see myself having much time in the near future to dig in but if nobody gets to it first in the next two or three weeks I'll take a look.

comment:6 Changed 12 months ago by rwbarton

Simon: MO_Add2 is translated to an ADD (which is expected to either set or clear the carry flag as appropriate) followed by an ADC. Printing the ADD as (in this case) dec rather than add loses the effect on the carry flag, so the subsequent ADC will just use whatever garbage happens to be in the carry flag.

Like Simon, I'm not totally thrilled with the organization of the code here, but one simple fix would be to add an ADD' instruction which means the same as ADD except it indicates that the side effect of setting the carry flag is important. Then the pretty printer can produce dec for an ADD of -1 but not for an ADD' of -1. GHC doesn't expose the carry flag to Haskell directly so we would only need to use ADD' in the translation of MO_Add2 and other similar operations.

comment:7 Changed 9 months ago by Reid Barton <rwbarton@…>

In caa9c8aa7d17af04e16070e15ba274da0ab93247/ghc:

Add test case for #9013

comment:8 Changed 9 months ago by rwbarton

In fact it's even more involved than I originally thought. MO_Add2 invokes the code generator for MO_Add to do the first addition. MO_Add sees that the second argument is constant so it can produce an LEA instruction (which wouldn't set the carry flag). The pretty-printer notices that the source and destination registers of the LEA instruction are the same, so it rewrites it to an ADD (which would set the carry flag); and then it notices that the addend is -1, so it finally emits dec (which doesn't set the carry flag).

All of what the code generator does for MO_Add is fine, since the semantics of MO_Add don't include any effect (or lack thereof) on the condition register. The error is in the code generation for MO_Add2: it should not use MO_Add, for this reason.

I will put up a patch for review shortly.

comment:9 Changed 9 months ago by rwbarton

  • Differential Revisions set to Phab:D137

comment:10 Changed 9 months ago by simonmar

I've accepted @rwbarton's diff.

Surely it'd be better for some earlier phase to optimise (plusWord# x (-1)) to (minusWord# x 1)

The problem is that this would only catch instructions that arise from those particular primops, not any other instructions. Add instructions arise for lots of reasons, even during native code generation itself.

The reason the peephole is done very late is

  • We don't need explicit inc/dec instructions in the data type
  • We catch all the opportunities. Anything else would run the risk of any subsequent optimisation passes leaving opportunities behind. A smart constructor would work but also runs the risk that you might forget to use it somewhere.

comment:11 Changed 9 months ago by Reid Barton <rwbarton@…>

In 71bd4e310793b9225767b66f3aa758156816632e/ghc:

x86: Always generate add instruction in MO_Add2 (#9013)

Test Plan:
 - ran validate
 - ran T9013 test with all ways
 - ran CarryOverflow test with all ways, for good measure

Reviewers: austin, simonmar

Reviewed By: simonmar

Differential Revision: https://phabricator.haskell.org/D137

comment:12 Changed 9 months ago by rwbarton

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.