Opened 3 years ago

Last modified 6 weeks ago

#9661 new feature request

Branchless ==# is compiled to branchy code

Reported by: dfeuer Owned by: bgamari
Priority: normal Milestone:
Component: Compiler Version: 7.9
Keywords: datacon-tags Cc: jstolarek, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D854
Wiki Page:

Description

This started as a comment on #6135, but Jan Stolarek requested that I file a separate report. The branchless tests <#, >#, <=#, and >=#, and their wordy cousins, seem to work properly, but ==# and eqWord# don't. If I write

foo x = tagToEnum# ((x <# 3#) `orI#` (x ># 100#) `orI#`
                    (x ==# 12#) `orI#` (x ==# 15#))

I get (in 7.8.3 and in 7.9)

IsDigit.foo :: GHC.Prim.Int# -> GHC.Types.Bool
[GblId,
 Arity=1,
 Caf=NoCafRefs,
 Str=DmdType <S,1*U>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
         Tmpl= \ (x3_a2if [Occ=Once!] :: GHC.Prim.Int#) ->
                 case x3_a2if of wild_Xe {
                   __DEFAULT ->
                     GHC.Prim.tagToEnum#
                       @ GHC.Types.Bool
                       (GHC.Prim.orI# (GHC.Prim.<# wild_Xe 3) (GHC.Prim.># wild_Xe 100));
                   12 -> GHC.Types.True;
                   15 -> GHC.Types.True
                 }}]
IsDigit.foo =
  \ (x3_a2if :: GHC.Prim.Int#) ->
    case x3_a2if of wild_Xe {
      __DEFAULT ->
        GHC.Prim.tagToEnum#
          @ GHC.Types.Bool
          (GHC.Prim.orI# (GHC.Prim.<# wild_Xe 3) (GHC.Prim.># wild_Xe 100));
      12 -> GHC.Types.True;
      15 -> GHC.Types.True

and branching assembly to match. Jan Stolarek indicates this is probably a problem with constant folding.

Change History (30)

comment:1 Changed 3 years ago by jstolarek

After some thinking I don't think it's constant folding but more generally the special rules for ==#. As promised here are some hints how to get started on this one:

  1. Try to reproduce the problem with a simpler example, eg. using only two comparison primops and orI#. Try out different combinations of comparison primops. See also whether this happens for andI#. We want to be sure that this only depends on presence of ==# and not other operators (don't forget /=#).
  1. Once you have a smaller test case follow the transformation that GHC is performing to arrive at the result. Is it turning the simpler test case into the final result in one step or are there any intermediate steps?
  1. My guess would be that the culprit is litEq function in PrelRules.lhs module. The comment above it claims to do exactly what you've reported here. However, I'm not sure what the fix should be. Disabling this is probably not a good idea, since in some cases we really want that transformation to happen. We'd need to think more but first let's confirm whether litEq is the cause or not. You can verify that by disabling that rule and seeing whether the code compiles without branches.

EDIT:

  1. After disabling litEq run the testsuite. Some things will almost certainly break and this will show us when this transformation is used (aside from your test case). With this we can start thinking about a better rule for ==#.
Last edited 3 years ago by jstolarek (previous) (diff)

comment:2 in reply to:  1 ; Changed 3 years ago by dfeuer

Replying to jstolarek:

After some thinking I don't think it's constant folding but more generally the special rules for ==#. As promised here are some hints how to get started on this one:

  1. Try to reproduce the problem with a simpler example, eg. using only two comparison primops and orI#. Try out different combinations of comparison primops. See also whether this happens for andI#. We want to be sure that this only depends on presence of ==# and not other operators (don't forget /=#).

This happens with both ==# and /=#. It does not appear to happen with the others. Furthermore, it appears to happen only when one of the arguments is known at compile time.

  1. Once you have a smaller test case follow the transformation that GHC is performing to arrive at the result. Is it turning the simpler test case into the final result in one step or are there any intermediate steps?

There are intermediate steps, but it starts very early. For example,

--Desugar
equals0AndEquals2
equals0AndEquals2 =
  \ x_a1wL y_a1wM -> andI# (==# x_a1wL 0) (==# y_a1wM 2)

== Gentle simplification ==>

equals0AndEquals2
equals0AndEquals2 =
  \ x_a1wL y_a1wM ->
    andI#
      (case x_a1wL of _ {
         __DEFAULT -> 0;
         0 -> 1
       })
      (case y_a1wM of _ {
         __DEFAULT -> 0;
         2 -> 1
       })

-- Nothing changes for a bit ....
== Simplifier phase 2 ==>

equals0AndEquals2
equals0AndEquals2 =
  \ x_a1wL y_a1wM ->
    case x_a1wL of _ {
      __DEFAULT -> 0;
      0 ->
        case y_a1wM of _ {
          __DEFAULT -> 0;
          2 -> 1
        }
    }
  1. My guess would be that the culprit is litEq function in PrelRules.lhs module. The comment above it claims to do exactly what you've reported here. However, I'm not sure what the fix should be. Disabling this is probably not a good idea, since in some cases we really want that transformation to happen. We'd need to think more but first let's confirm whether litEq is the cause or not. You can verify that by disabling that rule and seeing whether the code compiles without branches.

I'm confident you're right, but I won't have time to do that experiment until after Yom Kippur. The comment says

-- This is a Good Thing, because it allows case-of case things
-- to happen, and case-default absorption to happen.  For
-- example:
--
--      if (n ==# 3#) || (n ==# 4#) then e1 else e2
-- will transform to
--      case n of
--        3# -> e1
--        4# -> e1
--        m  -> e2

There are a couple tagToEnum# invocations missing there, but let's look at the whole process. We're presumably starting with

if (n == 3) || (n == 4) then e1 else e2

Expanding if,

case (n == 3) || (n == 4) of
  True -> e1
  False -> e2

Expanding (||),

case (case n == 3 of {True -> True; False -> n == 4}) of
  True -> e1
  False -> e2

Case of case gets us

case n == 3 of
  True -> case True of {True -> e1; False -> e2}
  False -> case n == 4 of {True -> e1; False -> e2}

Case of known constructor or whatever it's called produces

case n == 3 of
  True -> e1
  False -> case n == 4 of
             True -> e1
             False -> e2

So we can get this far without knowing anything about what (==) means. Now let's add part of that knowledge:

case n# ==# 3# of
  1# -> e1
  0# -> case n# ==# 4# of
          1# -> e1
          0# -> e2

In this case, this is probably just fine, but I imagine the example is intended to suggest that there are (perhaps many) more tests, so that a different pattern matching strategy may be in order. I would venture to guess that what we want to do is attack it at this stage—recognizing that we're looking at nested case of ==# with a shared operand.

Last edited 3 years ago by dfeuer (previous) (diff)

comment:3 in reply to:  2 ; Changed 3 years ago by jstolarek

Replying to dfeuer:

Furthermore, it appears to happen only when one of the arguments is known at compile time.

That seems right. In litEq (literal equality) you'll see that we expect one of the arguments to be a literal. Also, litEq works for both equality and inequality tests.

There are a couple tagToEnum# invocations missing there

The comment was written for the old primops and should have been updated when the new ones were introduced. This rule is intended for primops, not (==).

but let's look at the whole process. (...)

I think the order of inlining should be different here: first calls to ==#, then a call to orI# (at least that's what the Core you pasted suggests). This won't change the final result but it may be important for devising a solution.

So we can get this far without knowing anything about what (==) means. Now let's add part of that knowledge:

case n# ==# 3# of
  1# -> e1
  0# -> case n# ==# 4# of
          1# -> e1
          0# -> e2

When you rewrite the transformation to work on the ==# primop this should be the result that you get, but I don't think that you can get such transformation by just inlining definition of (==). The problem is that calls to tagToEnum# are not currently optimized away in Core (#8317). This stage should be followed by case merging to get the final result shown in the comment.

After writing all this I really wonder what will happen when we disable litEq. It might not be as bad as I initially thought. After all, with these new primops we don't want branches.

comment:4 in reply to:  3 Changed 3 years ago by dfeuer

Replying to jstolarek:

When you rewrite the transformation to work on the ==# primop this should be the result that you get, but I don't think that you can get such transformation by just inlining definition of (==). The problem is that calls to tagToEnum# are not currently optimized away in Core (#8317).

I doubt this is going to make much of a difference in most cases, but we'll have to see (I can probably look into it tomorrow night). My overriding philosophy is that band-aids are bad, and band-aids that complicate other things are very bad. If litEq is currently serving primarily as a band-aid for #8317, I think we should rip it off and wait for a proper fix.

comment:5 Changed 3 years ago by jstolarek

No, I think that litEq rule is independent from #8317. Band-aid for tagToEnum# is somewhere else, so this ticket should be fixed independently.

comment:6 Changed 3 years ago by dfeuer

I hope that works. As a side note, we might consider redefining not as not x = tagToEnum# (dataToTag# x /=# 1#), which should work better for branchless stuff and I think should work exactly the same for branchy stuff. That'll be one less thing I need to add to a branchless Bool module. I already need complementBL x = tagToEnum# (dataToTag# xorI# 1#) to play as nicely as possible with &&! and ||!, I think; I'd rather not have to add notBL if that can replace not without any negative consequences.

Last edited 3 years ago by dfeuer (previous) (diff)

comment:7 Changed 3 years ago by rwbarton

Type: bugfeature request

I think GHC is doing the right thing here, or at least not the wrong thing.

It's wrong to think of (==#) as "branchless equality". It's just a function whose value is 1# if its arguments are equal, and 0# if they aren't. This fact needs to be visible to GHC in order to optimize code like

f :: Int -> (Int,Int)
f x = if x == 0        -- defined in terms of ==#
      then (x', 1)     -- here GHC can inline and then constant-fold x' to 257
      else (x-1, x')
  where x' = 1000 * x + 257

and this is the purpose of the litEq rule. As a result, GHC may not output assembly containing the number of branches that you intended. This is the normal function of an optimizing compiler; if GHC was not free to replace your program with an equivalent one, then users would be forced to perform all kinds of optimizations by hand to get efficient code.

I would note that this behavior is not specific to GHC. If you compile this C program with gcc (Debian 4.9.1-14) at optimization level -O1 or higher

void g();
void h();
void f(int x)
{
  if ((x == 0) OR (x == 5))
    g();
  else
    h();
}

you will get identical assembly output (with two branches, not an or instruction) whether you set -DOR=| or -DOR=||. So C does not give you control over branchlessness either.

I understand that depending on the characteristics of the input, the "branchless" code may be faster (e.g. for the C function above, x randomly drawn from the set 0 (49%), 5 (49%), 10 (2%)). However, with certain other distributions of the input value, the "branchy" version is likely to be faster. So there is no right or wrong answer here.

In general it takes a lot of engineering effort to get a compiler to reliably produce two different optimized outputs for two different programs that it can see are semantically identical. In this case maybe there is an easy solution: add a new secretlyEquality# which is the same as (==#), but where that fact is hidden from GHC for long enough for secretlyEquality# to survive the stages that transform (==#) unscathed. This approach is intrinsically somewhat fragile (for example, the LLVM backend may introduce additional branches anyways), but it has the merit of ease of implementation.

In any case, I would like to see some benchmarks showing that there are performance gains to be had in real programs from control over branchlessness before expending any effort on implementation.

comment:8 in reply to:  7 ; Changed 3 years ago by dfeuer

Replying to rwbarton:

I think GHC is doing the right thing here, or at least not the wrong thing.

It's wrong to think of (==#) as "branchless equality". It's just a function whose value is 1# if its arguments are equal, and 0# if they aren't. This fact needs to be visible to GHC in order to optimize code like

...

and this is the purpose of the litEq rule.

I wonder if there's another way to accomplish that, but see below.

As a result, GHC may not output assembly containing the number of branches that you intended. This is the normal function of an optimizing compiler; if GHC was not free to replace your program with an equivalent one, then users would be forced to perform all kinds of optimizations by hand to get efficient code.

With the exception of some bottom-of-the-stack base code with module dependency nightmares, people who use such primops are usually trying to do something specific for performance reasons, and we should be careful not to interfere excessively if we can avoid doing so. Your preferred ==# can be implemented in terms of mine, I believe, like this:

bartonEq x y = case feuerEq x y of
                 1# -> 1#
                 _  -> 0#

Going the other way around, however, is impossible. If you prefer, we can give feuerEq a different name so current code will continue to work the same.

I understand that depending on the characteristics of the input, the "branchless" code may be faster (e.g. for the C function above, x randomly drawn from the set 0 (49%), 5 (49%), 10 (2%)). However, with certain other distributions of the input value, the "branchy" version is likely to be faster. So there is no right or wrong answer here.

Of course.

In any case, I would like to see some benchmarks showing that there are performance gains to be had in real programs from control over branchlessness before expending any effort on implementation.

That ship has long since sailed—the massive breaking primBool change was undertaken specifically to support this sort of thing.

comment:9 in reply to:  8 ; Changed 3 years ago by rwbarton

Replying to dfeuer:

With the exception of some bottom-of-the-stack base code with module dependency nightmares, people who use such primops are usually trying to do something specific for performance reasons, and we should be careful not to interfere excessively if we can avoid doing so.

Everyone who uses == on Int is using ==#, that was the point of my Haskell f example above. Let's not lose track of the big picture here.

Your preferred ==# can be implemented in terms of mine, I believe, like this:

bartonEq x y = case feuerEq x y of
                 1# -> 1#
                 _  -> 0#

Going the other way around, however, is impossible. If you prefer, we can give feuerEq a different name so current code will continue to work the same.

I don't understand what bartonEq or feuerEq is or what the difference between them is supposed to be. They look equal to me.

In any case, I would like to see some benchmarks showing that there are performance gains to be had in real programs from control over branchlessness before expending any effort on implementation.

That ship has long since sailed—the massive breaking primBool change was undertaken specifically to support this sort of thing.

No, it was undertaken to allow GHC to find better code to generate under some circumstances. It was not undertaken to give the user control over whether GHC produces branchless code or branching code.

comment:10 in reply to:  9 Changed 3 years ago by jstolarek

Replying to rwbarton:

It was not undertaken to give the user control over whether GHC produces branchless code or branching code.

Yes, it was undertaken exactly for that reason! Read the original report for #6135 and PrimBool wiki page.

comment:11 in reply to:  9 Changed 3 years ago by dfeuer

Replying to rwbarton:

I don't understand what bartonEq or feuerEq is or what the difference between them is supposed to be. They look equal to me.

You're right. I got confused. Sorry. I still think we should have that secretEq# that acts the same as ==# but without the litEq modification. I'm not the one to come up with a good justification—all I want it for at the moment is isSpace, whicn is not exactly a hotspot in most code.

comment:12 Changed 3 years ago by simonpj

As Jan says, n ==# 3# is transformed into a case expression by litEq. The comment says

-- This stuff turns
--      n ==# 3#
-- into
--      case n of
--        3# -> True
--        m  -> False
--
-- This is a Good Thing, because it allows case-of case things
-- to happen, and case-default absorption to happen.  For
-- example:
--
--      if (n ==# 3#) || (n ==# 4#) then e1 else e2
-- will transform to
--      case n of
--        3# -> e1
--        4# -> e1
--        m  -> e2

So a chain of equality comparisons will turn into a table-driven jump in the end, which is a good thing. On the other hand, generating branch nests is a bad thing.

Arguably, the Big Reason for wanting the tables is for source programs like this:

f 0# = e1
f 1# = e2
f 2# = e3
...
f _ = en

Here we really would like to generate a single case with many alternatives. When compiled, without litEq, it'd look like

f x = case (x ==# 0#) of
        1# -> e1  -- True branch
        _  -> case (x ==# 1#) of
                1# -> e2  -- True branch
                _ -> ..etc..

(I'm omitting the annoying tagToEnum clutter, see #8317.)

So perhaps litEq should apply only when in the context of a case expression that immediately scrutinises its result. So litEq would become

   case (a ==# k) of 1# -> e1; _ -> e2
--->
   case a of { k -> e1; _ -> e2 }

Annoyingly, the CoreSyn.RuleFun API for built-in rules does not give access to the context of an application (the SimplCont), but it would not be hard to make it do so.

So if that is the right rewrite, then it'd be another useful project for someone.

Simon

Last edited 3 years ago by simonpj (previous) (diff)

comment:13 in reply to:  12 Changed 3 years ago by dfeuer

Replying to simonpj:

So a chain of equality comparisons will turn into a table-driven jump in the end, which is a good thing. On the other hand, generating branch nests is a bad thing.

You call this a "table-driven jump", and that may be the case for something like 0,1,2 (I don't know), but it certainly does not seem to work like that for, say, 32, 9, 10, 11, 12, 168, or whatever the numbers are for "is a space".

So perhaps litEq should apply only when in the context of a case expression that immediately scrutinises its result. So litEq would become

   case (a ==# k) of 1# -> e1; _ -> e2
--->
   case a of { k -> e1; _ -> e2 }

Annoyingly, the CoreSyn.RuleFun API for built-in rules does not give access to the context of an application (the SimplCont), but it would not be hard to make it do so.

So if that is the right rewrite, then it'd be another useful project for someone.

I don't have a clear sense of what you're getting at, but in the situations I'm thinking about (avoiding poorly predicted branches), the compiler may not be able to know what is best, and Reid Barton's "secret" equality predicate may be the best approach.

comment:14 Changed 3 years ago by simonpj

Well, when a table-driven jump doesn't work GHC generates a binary tree, reverting a table-driven jump whenever it finds a sufficiently dense patch. (I'm not sure this is fully done, but that's the intention.) Even if there are no dense patches, log(N) is better than N.

For the second point, I'm suggesting that the litEq rule should apply only when the result of the comparison is immediately scrutinised by a case.

I don't really understand Reid's proposal but it looks dangerously fragile to me. Apart from anything else it sounds as if the programmer has to think about two different equality operations; and that which to use depends on context. But the context can change radically after inlining. It smells bad to me, but I'm open to persuasion.

Meanwhile I think changing litEq as I suggest would do a nice job. I could be wrong about that too, but it's my working hypothesis.

Simon

comment:15 in reply to:  14 Changed 3 years ago by dfeuer

Replying to simonpj:

Well, when a table-driven jump doesn't work GHC generates a binary tree, reverting a table-driven jump whenever it finds a sufficiently dense patch. (I'm not sure this is fully done, but that's the intention.) Even if there are no dense patches, log(N) is better than N.

For sufficiently small N, it is likely better to do a linear search to avoid branch mispredictions (possibly even a branchless search using conditional moves or bit hacks). But it almost certainly depends to a certain extent on details of what is and what is not likely, which may be better known to the programmer than to the compiler. For example: suppose I we have

weird x = case x of
  1732 -> 12
  37893 -> 4
  98231 -> 5
  47835 -> 3
  98002 -> 17
  324 -> 56
  _ -> x * 74

If we're applying this to a uniform random distribution of numbers from 1 to 100000, binary search will give us a tremendously high rate of mispredicted branches, if I understand things correctly (never a certain thing). Looping through all the possibilities will work better. Of course, the compiler does not know anything about the distribution! I'm not saying I know "the right answer" (certainly I don't); I just think this could use a bit more thinking, and I like the idea of giving the programmer more tools to express intention in this realm.

comment:16 Changed 3 years ago by bgamari

Cc: bgamari added

comment:17 in reply to:  12 Changed 3 years ago by bgamari

I started looking at this over the weekend and have started putting together a Wiki page describing the of Simon's proposal implementation

comment:18 Changed 3 years ago by bgamari

Owner: set to bgamari

comment:19 Changed 3 years ago by jstolarek

Ben, I see you started working on the implementation. Perhaps it's worth putting it on Phab?

comment:20 Changed 3 years ago by bgamari

Differential Rev(s): D854

Jan, this is a good point. The work is on hold at the moment while I finish up my thesis and move to Germany but regardless I've put up the current state of the patch on Phabricator.

comment:21 Changed 3 years ago by rwbarton

Did anyone ever come up with actual benchmarks that show a potential performance improvement from non-branching code? I must say I'm still rather skeptical of this whole project.

comment:22 in reply to:  21 Changed 3 years ago by bgamari

Replying to rwbarton:

Did anyone ever come up with actual benchmarks that show a potential performance improvement from non-branching code? I must say I'm still rather skeptical of this whole project.

You raise some fair points and perhaps Joachim's findings suggest that it it's not worth bending over backwards to accommodate branch-free code. Looking further into the performance implications of these branches is high on my to-do list.

Regardless, there may be value in allowing rules to examine the application context even beyond this particular optimization (although at the risk of making RULEs a bit less predictable than they are currently).

comment:23 Changed 3 years ago by rwbarton

Sorry, my previous comment was poorly worded at best. I'm sure there are cases where the best non-branching code has better performance than the best branching code. However, I'm not at all convinced that it makes sense to try to give the user control over whether to generate branching or non-branching code. It's at odds with the notion of an optimizing compiler, and as I mentioned earlier even C compilers do not give you that kind of control once optimizations are enabled. GHC should just generate whatever it thinks is best, and if the user really needs to generate particular assembly, they can write that assembly manually.

I would need to see some very dramatic numbers to change this point of view; and currently we have no numbers at all.

I do think that expanding GHC's bag of tricks for post-Stg code generation (as in #10124 for example) is practical and worthwhile, though it will probably need more work in the native code generator as well for best results.

comment:24 in reply to:  21 Changed 3 years ago by jstolarek

Replying to rwbarton:

Did anyone ever come up with actual benchmarks that show a potential performance improvement from non-branching code?

https://ghc.haskell.org/trac/ghc/wiki/PrimBool#Benchmarks

I know these numbers are not much, but they give some hope that branchless operations can improve performance. I recall that Gregory Collins saw a potential use for branchless primops in his hastables library. When I worked on implementing #6135 I did some measurements on hastbales but saw now improvement in performance - now I realize that this might have been because ==# was compiled to branchy code but I didn't realize that.

comment:25 Changed 3 years ago by rwbarton

Again there is a distinction between the compiler being able to generate branchless code and the compiler giving the user control over whether to generate branchless code. For example in that benchmark, it would be even better if the user could simply write

        let inc     = if v >= 0 then 1 else 0

rather than

        let !(I# v) = x
            inc     = I# (v >=$# 0#)

Any optimizing C compiler would avoid a branch in a function like

int g(int a, int b)
{
  if (a >= 0)
    return b + 1;
  else
    return b;
}

So, this isn't an example of the kind that I am talking about.

comment:26 in reply to:  25 Changed 2 years ago by jstolarek

Replying to rwbarton:

Again there is a distinction between the compiler being able to generate branchless code and the compiler giving the user control over whether to generate branchless code. For example in that benchmark, it would be even better if the user could simply write

        let inc     = if v >= 0 then 1 else 0

rather than

        let !(I# v) = x
            inc     = I# (v >=$# 0#)

I wholeheartedly agree - it would be better if the compiler could generate optimized branchless code from higher level abstract code. That said, we expose low-level primops and it would be a real shame if the users couldn't get branchless behaviour from them. Besides, if we want GHC to optimize high-level code to branchless code then I suspect it will be much easier to implement if primops compiled to branchless code.

So, this isn't an example of the kind that I am talking about.

I don't follow. Why not? Surely, this is an ugly piece of code written by hand but at the moment GHC can't produce such code from high-level code - it had to be written by hand.

So, I don't claim that compiling ==# to branchless code is the final word on this topic. It's just fixing a bug that I didn't notice when originally implementing #6135 - >=#, >#, <=# and <# already compile to branchless code as they were intended to.

comment:27 Changed 21 months ago by bgamari

The operations in Data.Char may stand to benefit from less branchy code generation. See #11382 and #9638.

comment:28 Changed 20 months ago by bgamari

Differential Rev(s): D854Phab:D854

comment:29 Changed 6 weeks ago by dfeuer

Keywords: datacon-tags added

comment:30 Changed 6 weeks ago by dfeuer

This doesn't really have much to do with datacon tags, but it seems to relate to several of those tickets, so I tagged it.

Note: See TracTickets for help on using tickets.