Opened 5 years ago

Closed 3 years ago

Last modified 4 weeks ago

#6135 closed feature request (fixed)

Unboxed Booleans

Reported by: benl Owned by:
Priority: normal Milestone: 7.8.1
Component: Compiler Version: 7.4.1
Keywords: datacon-tags Cc: dterei, hackage.haskell.org@…, carter.schonwald@…, pho@…, choener@…, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: primops/should_run/T6135
Blocked By: #8103, #8103 Blocking:
Related Tickets: #605 Differential Rev(s):
Wiki Page: PrimBool

Description (last modified by bgamari)

Right now the only way to compare two integers is with primops that produce boxed bools:

<#  :: Int# -> Int# -> Bool
==# :: Int# -> Int# -> Bool

To consume the resulting Bool we need a case-expression, which introduces branches into the core IR and object code. Case expressions are bad in the presence of heavy inlining because case-of-case performed by the GHC simplifier tends to duplicate code (like in DPH and Repa programs). Mis-predicted branches are bad in object code because they stall the pipeline.

Consider the following expression:

 case  x <# 0# || x >=# aWidth
    || y <# 0# || y >=# aHeight of 
  True  -> ...
  False -> ...

This is very common in array programming. Ideally, it should turn into some straight-line code to compute the flag, and then a single branch instruction once we've worked out what alternative to take. However, as the only way to consume the Bools produced by the comparisons is to introduce more case expressions, we end up with *four* cases in the core IR.

What I want is this:

data Bool#
(.<#)  :: Int#  -> Int#  -> Bool#
(.==#) :: Int#  -> Int#  -> Bool#
(.||#) :: Bool# -> Bool# -> Bool#

case    x .<# 0# .||# x .>=# aWidth
   .||# y .<# 0# .||# y .>=# aHeight of
 True#  -> ...
 False# -> ...

Bool# is the type of one bit integers. I can compute with them algebraically and be sure that I'll get at most one branch instruction in the resulting object code.

Attachments (10)

Add-Int-comparison-primops-thet-return-an-Int-PROTOTYPE.patch (3.9 KB) - added by jstolarek 5 years ago.
Prototype patch implementing new primops
Tests-for-Int-comparison-primops-PROTOTYPE.patch (17.0 KB) - added by jstolarek 5 years ago.
Prototype patch with tests for the new primops
T6135-ghc.patch (62.2 KB) - added by jstolarek 4 years ago.
T6135-testsuite.patch (61.5 KB) - added by jstolarek 4 years ago.
T6135-packages-array.patch (913 bytes) - added by jstolarek 4 years ago.
T6135-packages-base.patch (5.7 KB) - added by jstolarek 4 years ago.
T6135-packages-ghc-prim.patch (10.6 KB) - added by jstolarek 4 years ago.
T6135-packages-integer-gmp.patch (7.8 KB) - added by jstolarek 4 years ago.
T6135-packages-integer-simple.patch (4.8 KB) - added by jstolarek 4 years ago.
T6135-packages-primitive.patch (1.9 KB) - added by jstolarek 4 years ago.

Download all attachments as: .zip

Change History (107)

comment:1 Changed 5 years ago by simonpj

difficulty: Unknown

Good point. I had not fully grokked that reason for having unboxed Bool#. See also #605.

However I don't think it would be a good plan to define

data Bool = B# Bool#

For one thing, the definition of Bool is in the language spec. Well I suppose we could define

fromBool :: Bool -> Bool#
toBool   :: Bool# -> Bool

(Actually these already exist, more or less, in the form of tagToEnum# and dataToTag#.)

I suppose we could then define not, (&&) etc thus:

not :: Bool -> Bool
not b = toBool (not# (fromBool b))

(&&) :: Bool -> Bool -> Bool
a && b = toBool (and# (fromBool a) (fromBool b))

What would happen to optimisations? As thing stand we get

not (not x) 

= { inline not, twice }
  case (case x of { T -> F; F -> T }) of { T -> F; F -> T }

= { case of case transformation }
  case x of
     T -> case F of { T -> F; F -> T }
     F -> case T of { T -> F; F -> T }

= { simplify cases }
  case x of 
     T -> T
     F -> F

= { case of identity }
  x

(And then we simplify to just x, but that's a separate matter.) To get this behaviour with the new stuff we would presumably say:

not (not x)

= { inline not, twice }
  toBool (not# (fromBool (toBool (not# (fromBool x)))))

= { RULE fromBool (toBool v) = v }
  toBool (not# (not# (fromBool x)))

= { RULE not# (not# v) = v }
  toBool (fromBool x)

= { RULE fromBool (toBool v) = v }
  x

If anything this seems simpler. We'd also need constant folding stuff like

   case x of
     True -> ....(fromBool x)....
===>
   case x of
     True -> ....True#....

but that should not be hard.

I was not optimistic when I started writing this, but now I've got this far it does actually look plausible. It is certainly true that conditionals with a lot of conditions can generate an absolute rats-nest of join points -- and that in turn can inhibt inlining. So the change would help that too.

I suppose that Bool# might be better spelt Word1#.

Can anyone think of any other wrinkles to consider?

Simon

comment:2 Changed 5 years ago by rl

Simon, your && is strict in both arguments so it doesn't work.

In general, I don't really think there is a lot to be gained here. You have to make sure that you (a) don't change the semantics of a program and (b) don't do too much unnecessary work. Let's take Ben's two examples:

case  x <# 0# || x >=# aWidth
    || y <# 0# || y >=# aHeight of 
  True  -> ...
  False -> ...
case    x .<# 0# .||# x .>=# aWidth
   .||# y .<# 0# .||# y .>=# aHeight of
 True#  -> ...
 False# -> ...

It is impossible to tell which of the two will be faster. If we expect x <# 0 to hold most of the time, then the second one will be slower since it will be doing three unnecessary comparisons in most iterations for no benefit. On the other hand, if the condition doesn't hold most of the time, then the second one might be slightly faster.

I fully agree that join points etc. can be very annoying, though.

comment:3 Changed 5 years ago by simonpj

Darn. I'd forgotten about the strictness issue!

comment:4 Changed 5 years ago by tibbe

If we end up doing something like this, we should consider if it could be extended to work for any enumeration (i.e. data type with only nullary constructors.)

comment:5 Changed 5 years ago by simonmar

Note that the case expressions are optimised to a simple inline comparison in the code generator, so it's not as bad as it seems.

However, the special case in the code generator is annoying, so I had been meaning to change all these primops to return Int#.

In your example:

case  x <# 0# || x >=# aWidth
    || y <# 0# || y >=# aHeight of 
  True  -> ...
  False -> ...

We would want to end up with

case  x <# 0# `or#` x >=# aWidth
    `or#` y <# 0# `or#` y >=# aHeight of 
  0# -> ...
  _  -> ...

But getting there could be a bit tricky. Still, I think changing the primops is a good first step.

comment:6 Changed 5 years ago by igloo

Milestone: 7.8.1

comment:7 Changed 5 years ago by dterei

Cc: dterei added

comment:8 Changed 5 years ago by dterei

comment:9 Changed 5 years ago by dterei

comment:10 Changed 5 years ago by singpolyma

It would be pretty awesome if any enumeration (all-nullary constructors) could be treated as codegen-equivalent to Int# internally, with Enum, Eq, and Ord just being wrappers around the obvious primops. Though I'm not sure what the moving parts of such a proposal would be.

comment:11 Changed 5 years ago by liyang

Cc: hackage.haskell.org@… added

comment:12 Changed 5 years ago by carter

Cc: carter.schonwald@… added

comment:13 Changed 5 years ago by simonpj

See PrimBool.

Janek, great initiative thanks. But I think it would be helpful to articulate a very clear story about the goal. I believe that the goal is to achieve much better performance, without changing the source language. The programmer does not change the program, but they run faster.

If so, then can we have (on the wiki page, not here):

  • A description of why programs will go faster, with specific examples.
  • Some example programs that will in fact go faster. Maybe you can code the examples up in Haskell, using inconvenient and messy Int# or whatever, to demonstrate that they really do go faster.

The sub-text is that I only want to introduce new complexity if it demonstrably buys us something.

One reason that things may go faster is this. At the moment, if a function is strict in a Bool argument, that argument is still passed boxed. But it could very sensibly be passed unboxed instead, just as we do with pairs, say. So perhaps strictness analysis would work well with unboxed enumerations.

But it's not entirely obvious that it will. Unboxing a pair is great because the pair is never allocated. But enumerations are never allocated either! (True and False are statically allocated.) Moreover, even though passing True passes a pointer to a (statically allocated) heap object, the True/False tag is in the pointer (see Faster laziness using dynamic pointer tagging), so no memory access is needed when discriminating True/False. So the gains from unboxing may be modest.

Pointer tagging only works for data types with a handful of constructors. Big enumerations really do keep the tag in memory. (See the paper.)

None of this is to say it's a bad idea. Just that we need clarity about where the gains are going to come from, and measurements that suggest the gains are real.

Simon

comment:14 Changed 5 years ago by jstolarek

Owner: set to jstolarek

comment:15 Changed 5 years ago by simonmar

I agree with Simon, I think we already got a lot of the benefit to be had here from pointer tagging. Perhaps for enums larger than the threshold (3 or 7 for 32-bit or 64-bit respectively) we should be unboxing them in worker/wrapper. It wouldn't be hard to write some examples by hand and measure the difference between using enums and Int#.

I would like to see the simplification I mentioned above - changing the comparison primops to return Int# rather than Bool - and I believe that would lead to more optimisations later. At the least it would expose the magic tagToEnum# that arises from these primops to the simplifier, where it could be optimised away.

comment:16 Changed 5 years ago by PHO

Cc: pho@… added

comment:17 Changed 5 years ago by jstolarek

I read paper about pointer tagging and things seem to be clearer now. I rewrote wiki page about PrimBool from scratch. I think that now it defines the goal very clearly by giving code examples. There are also workarounds for the problem. If you guys could take a look and tell me if this makes sense it would be great.

I failed to figure out a code example that would actually go faster using my workarounds. The ones I created have about 0,4% increase in performance, which is a negligible difference. I will be working on this.

comment:18 Changed 5 years ago by jstolarek

I linked the wrong page in my comment. Here's the correct one.

comment:19 Changed 5 years ago by tibbe

One possible use of unboxed booleans is branchless search. There are some algorithms where you can replace branches (e.g. case statements) with bit twiddling operators. I believe Gregory Collins used one such trick in the hashtables package.

comment:20 Changed 5 years ago by choenerzs

Cc: choener@… added

comment:21 Changed 5 years ago by jstolarek

I implemented a prototype version of new comparison primops. They compare two Int#s but as a result they return either 0# (denoting False) or 1# (denoting True) instead of returning a Bool. Now the results of comparing can be combined using bitwise orI#, andI#, xorI# and notI# primops, which addresses the original problem that was reported (at least partially). I'm attaching patches for GHC and the testsuite (includes some cleanup of my other code for testing primops). These patches are for review and discussion purposes - they are not yet ready for integration because they don't contain primops for comparing values of type Word#, Float# and Double#. I will implement comparisons for these types once I am certain that I got things right with Int#. Feedback will be much appreciated.

Changed 5 years ago by jstolarek

Prototype patch implementing new primops

Changed 5 years ago by jstolarek

Prototype patch with tests for the new primops

comment:22 Changed 5 years ago by jstolarek

Status: newpatch

comment:23 Changed 5 years ago by gregorycollins

One possible use of unboxed booleans is branchless search. There are some algorithms where you can replace branches (e.g. case statements) with bit twiddling operators. I believe Gregory Collins used one such trick in the hashtables package.

Didn't see this before now. Yes, this is true, part of the reason I wrote the cache-line-sized hash code search is that I couldn't get efficient straight-line code out of GHC for doing it. One very useful bit-twiddling hack that the processor can do but GHC can't:

int mask(int a, int b) { return -(a == b); }

The C code gives me this:

	xorl	%eax, %eax
	cmpl	%esi, %edi
	sete	%al
	negl	%eax

The Haskell code gives me code with branches because of the case on Bool, which I couldn't figure out how to get rid of. I did manage to get straight-line code for a pure Haskell function with the same semantics, but at a cost of about twice as many instructions. Interestingly, getting rid of the branch here, even in the roundabout way the Haskell version did it, was worth O(30%) on hash table lookup (it's in an inner loop). The C version I wrote for cache line-sized hash code lookups was good for another 20-30% besides that, and I chalked up the difference to this comparison issue with Bool and much better register allocation on the C side. (GHC was spilling even when it seemed there were many free 64-bit registers).

So a solid +1 for me for implementing something like this, although maybe not with the names in the patch?

comment:24 Changed 5 years ago by jstolarek

So a solid +1 for me for implementing something like this, although maybe not with the names in the patch?

Names are definitely open for discussion.

comment:25 Changed 5 years ago by simonpj

Why Int#? Wouldn't Word# be a better fit? We already have logical operations on that type.

Can you summarise the design you have implemented, on the Wiki page http://hackage.haskell.org/trac/ghc/wiki/PrimBool? it's a bit hard to reverse-engineer it from the patch. Many thanks.

Simon

comment:26 Changed 5 years ago by benl

I've got an idea for a concrete benchmark that definitely wants the new primops. Will post here when it works.

comment:27 Changed 5 years ago by simonmar

I'd rather see the existing primops changed to return Int# (or Word#) than to add new primops. There's no need to have both. But of course it's a deeper change - we'd have to add wrappers that implement the old primops and make sure that we have enough RULEs to get rid of any extra tagToEnum# calls. But I think it would ultimately be better that way, the Bool-returning primops are a hack.

comment:28 Changed 5 years ago by jstolarek

Can you summarise the design you have implemented, on the Wiki page

I updated the Wiki page - see this new section.

comment:29 Changed 5 years ago by jstolarek

Following Ben's idea I created a proof-of-concept filter function using the new primops:

filterN :: Vector Int -> Vector Int
filterN vec = runST $ do
  let !size = length vec
  fVec <- unsafeNew size
  let put i x = do
        let !(I# v) = x
            inc     = I# (v .>=# 0#)
        unsafeWrite fVec i x
        return $ i + inc
  fSize <- foldM' put 0 vec
  unsafeFreeze $ unsafeSlice 0 fSize fVec

Benchmarking with criterion shows that this function is 60% faster than the filter function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements). Full code available at the wiki page.

comment:30 Changed 4 years ago by jstolarek

I'd like to finish working on this feature request. For that I need to know which approach to choose:

  • Add new primops and leave the old ones unmodified, or
  • Change the existing primops so that they return an Int# + add some wrappers (Int# -> Int# -> Bool and so on).

I think that the second option will not be backwards compatible - not sure if that's acceptable. OTOH I think it would simplify GHC code in some places, i.e this change seems to be more about removing code than adding.

I'm unable to decide which approach is the preferred one - people more experienced than me should decide.

comment:31 Changed 4 years ago by benl

I vote for changing the existing primops to use Word#, and don't bother supporting the existing Bool returning versions.

I think we can get away with changing the existing types because it's trivial to update client code to use the new versions. You just need to add (tagToEnum# (word2Int# ...)).

I vote for making the primops return Word# instead of Int#. I know that this requires an extra word2Int# conversion when using tagToEnum#, but there is really no reason to suggest that the result of a boolean comparison operator might be negative! I suspect the reason tagToEnum# currently takes an Int# is because literals like 5# are Int#.

comment:32 Changed 4 years ago by jstolarek

there is really no reason to suggest that the result of a boolean comparison operator might be negative!

There's a bit of inconsistency here anyway, because dataToTag# and tagToEnum# return an Int#, thus suggesting that we can have negative enumerations.

I changed the primops to return Word#, which - as expected - breaks a lot of code in boot libraries. I'm adding wrappers to avoid putting (tagToEnum# (word2Int# ...)) everywhere. Are there any preferences regarding the names of these wrappers? I decided to use the name of the primop with the word "Bool" before #. This looks nice for gtCharBool# or neFloatBool#, but not so good for >=Bool# and <Bool##. Does anyone have a better idea for wrapper names?

comment:33 Changed 4 years ago by simonpj

If we are going to change primpops, perhaps we should change dataToTag# and tagToEnum# to take/return Word# instead of Int#? Wouldn't that be more consistent? It's a breaking change (for primop users), but changing the type of the comparision primops is a much more substantial breakage.

Please can we ensure that the wiki page is up to date, explaining the actual proposed story and why it's better. Maybe it is already.

Simon

comment:34 Changed 4 years ago by jstolarek

Wiki page is up to date and shows one example benchmark where the code is about 3 times faster thanks to the new primops. I also have an idea for one more benchmark, but I'm not yet sure whether it'll give good results.

Changing the type of dataToTag# and tagToEnum# to maintain consistency sounds like a good idea. I can take care of that, once I'm done with comparison primops.

At the moment I hit a rather serious problem (or so it seems) - changing the return type of comparisons breaks automatic deriving of instances for Ord and Eq. I'm yet to figure out how to fix that.

Right now I would insist to decide what names should the wrappers have. There will be lots of changes needed and I would like to avoid going through this twice. My proposals are:

    gtCharBool#, geCharBool#, eqCharBool#,
    neCharBool#, ltCharBool#, leCharBool#,

    (>$#), (>=$#), (==$#), (/=$#), (<$#), (<=$#),

    gtWordBool#, geWordBool#, eqWordBool#,
    neWordBool#, ltWordBool#, leWordBool#,

    (>$##), (>=$##), (==$##), (/=$##), (<$##), (<=$##),

    gtFloatBool#, geFloatBool#, eqFloatBool#,
    neFloatBool#, ltFloatBool#, leFloatBool#,

    gtAddrBool#, geAddrBool#, eqAddrBool#,
    neAddrBool#, ltAddrBool#, leAddrBool#

comment:35 Changed 4 years ago by jstolarek

After a moment of thinking: it will be much easier to first change the types of dataToTag# and tagToEnum# to use Word# and then work on primops. Should I create a separate ticket so that changes to dataToTag# and tagToEnum# can be merged to HEAD before the work on new primops is finished?

comment:36 Changed 4 years ago by simonpj

I'd do it all on the same branch (eg on github.) and as part of one ticket.

It should be easy to fix the 'deriving' code (in TcGenDeriv).

I don't care much about names, but I'd have thought:

  gtChar#  :: Char# -> Char# -> Word#   -- Primop
  gtCharB# :: Char# -> Char# -> Bool    -- Wrapper

etc

comment:37 in reply to:  31 Changed 4 years ago by simonmar

Replying to benl:

there is really no reason to suggest that the result of a boolean comparison operator might be negative! I suspect the reason tagToEnum# currently takes an Int# is because literals like 5# are Int#.

There's no reason to suggest that the result of a boolean comparison might be 3, yet Word# includes that value. I don't think Int# is any worse than Word# in including lots of impossible values, and furthermore there's a lot less friction to using Int# because everything else uses it. Just saying.

comment:38 Changed 4 years ago by simonpj

True. But if dataToTag# returns Int# then so should grChar#. That is either

dataToTag# :: a -> Int#
grChar#    :: Char# -> Char# -> Int#

or

dataToTag# :: a -> Word#
grChar#    :: Char# -> Char# -> Word#

I don't mind which, but they should clearly be the same!

Simon

comment:39 Changed 4 years ago by simonmar

Definitely. I vote for the Int# versions, just because there's much less to change, and it's less likely to break anything.

comment:40 Changed 4 years ago by jstolarek

I just had this thought: how about we use the names of existing primops for wrappers and give new names to modified primops, e.g.:

  gtCharW# :: Char# -> Char# -> Word#   -- Primop, new name
  gtChar#  :: Char# -> Char# -> Bool    -- Wrapper, old name

This way the change will be backwards compatible.

comment:41 Changed 4 years ago by jstolarek

Almost backward compatible that is - people will need to import GHC.PrimWrappers instead of GHC.Prim, but that's easier than changing all uses of primops.

comment:42 Changed 4 years ago by simonpj

Good idea (giving new names to the new ops).

I've remembered why Word# is probably the right return type, not Int#: because we want to perform logical operations like and and or on the result of comparisons, and those work on Word# not Int#.

So to me that argues for the primops all returning Word#. Simon M what do you say?

Simon

comment:43 Changed 4 years ago by jstolarek

I already took care of that a few weeks ago. We have logical primops that operate on Int# (andI#, orI#, xorI#, notI#) - see #7689.

comment:44 Changed 4 years ago by simonpj

Oh, ok. Then I suppose the path of least resistance is to stick with Int# throughout.

comment:45 Changed 4 years ago by jstolarek

I'm having some problems and I admit I could use some hint on what to do. I'm able to build the stage1 compiler, but every program compiled with it segfaults immediately. Looks like the approach I used to add the new prototype primops does not extend easily to modifying the existing primops. Here's the summary of what I did:

I pushed these changes to github. The diffs are here: GHC, base, ghc-prim, integer-gmp. If anyone would be willing to look at them and suggest me where I went wrong I would be grateful. I know I need to make some more changes in ghc-prim library in GHC.!IntWord64 module and cbits, but I'm leaving that for now.

comment:46 Changed 4 years ago by benl

Work more incrementally. Change just a bit at a time and make sure all the codeGen tests work before doing the next part.

comment:47 Changed 4 years ago by jstolarek

I'm looking at the diff and I don't think that I could possibly split it into smaller changes. The problem is that once you change one of the primops to return Int#, you have to change them all because they all fall into the same Compare category. And then you need to fix deriving of Eq and Ord. These are the smallest changes I was able to make and have the code compile.

comment:48 Changed 4 years ago by benl

Ok, then find the smallest possible program that segfaults, then look at the Cmm and assembly code to see what's changed.

You'll particularly want to ensure that the base libraries are built with the original working compiler, but then build only your test programs with the hacked compiler. There is also a "Debugging GHC" guide on the developers wiki. It might be time to learn how to use GDB.

comment:49 Changed 4 years ago by simonpj

Also, use the stage-1 compiler to generate code for some tiny program like

f x = if x>0 then True else False
g x = x<0

and check that the comparison is going to work right.

comment:50 Changed 4 years ago by igloo

Owner: jstolarek deleted
Status: patchnew

If I understand correctly, there's no patch ready to be applied yet, but jstolarek is working on patches that make new primops returning Int#, and convert the old primop names to be wrappers that call the new primops.

comment:51 Changed 4 years ago by igloo

Owner: set to jstolarek

comment:52 Changed 4 years ago by jstolarek

I have implemented the changes. Now I am left with cleaning up the code, writing more tests and verifying that there are no performance regressions. This will take some more time, but I would like to ask how should I prepare my patches (there will be a total of 7: ghc, testsuite and libraries: base, ghc-prim, integer-gmp, integer-simpl and primitive). Should I squash all the commits in my branches into one big commit or can I leave the commits as they are (or perhaps squash them selectively)? Which user documentation, if any, should I update (aside from the wiki page)? What information should be documented in the source code?

There is one thing I wasn't able to implement the way I think it should be: constant folding for Integer. Each comparison function for Integer is now split into a primop, which returns an Int#, and a wrapper, which calls tagToEnum# and returns a Bool. The constant folding rules for Integer are working on the wrappers, so that whenever somebody writes ((1 :: Integer) == 1) it gets constant folded to True. This works properly and doesn't break any existing code, but is inconsistent with the rules for other comparisons, which are defined for primops, not for their wrappers. So the call to "eqInteger 1 1" (the wrapper) gets constant folded, while the call to "eqInteger# 1 1" (the primop) doesn't. I have run into problems with this (posted on ghc-devs on May 16th) and I don't know how to fix this. What is others' opinion on accepting the patches with this defect and opening a separate ticket for constant folding of Integer primops? I actually don't like that idea, but I'm afraid I might be stuck on this one for another 2 or 3 weeks - it would be a lot easier for me to push all the changes I have been making and then focus solely on Integer constant folding.

comment:53 Changed 4 years ago by simonpj

Great. I suggest

  • One patch per repo, with a commit message summarising changes. No need to record all the intermediate patches on your path to completion!
  • The goal of all this is better performance. So including that evidence in the commit message would be good. Maybe the improved perf only shows on benchmarks that aren't in nofib? Then perhaps add tests to perf/should_run and point to them from the commit message.
  • I don't understand the cause of the eqInteger# constant folding problem. It should be easy to reproduce in HEAD, just by writing a new Integer type in test, and some suitable wrappers and a fake primop (with NOINLINE on that). Then you can open a ticket on that and we can solve it separately. Probably it'll be done before you are ready with all the rest.
  • User docs: I'm not sure that any change is needed. The important thing is that the wiki page is up to date, explains the motivation and the reasoning for the main design choices (eg why Int# rather than Word# as the return type).

Thanks Jan

Simon

comment:54 Changed 4 years ago by jstolarek

Performance: I think this patch shouldn't have any impact on performance of existing code. It will however allow to write branchless algorithms that can be faster than their branching counterparts. I'm not sure about the tests in the testsuite - do they allow to measure execution speed? I see only some memory stats.

Integers: I'm afraid I don't understand how could I reproduce my problem using HEAD. Integer constant folding rules are built into the compiler and trigger based on function name, so not sure if writing fake primops and wrappers can be considered bug reproduction. I'm missing something, right?

comment:55 Changed 4 years ago by jstolarek

Status: newpatch

Patches submitted. Some remarks:

  • I updated the wiki page. It describes motivation for changes, design decisions (why return Int#) and an example benchmark that shows how new primops can be used to write faster algorithms.
  • This is my first serious change to the compiler so I think this patch should be carefully reviewed. Most of all, someone with more knowledge than me should probably take a look at functions isComparisonMachOp and maybeInvertComparison in cmm/CmmMachOp.hs and decide whether they are still necessary or not. I think they are, but I am not certain.
  • I remind that constant folding for Integer works for primop wrappers and not primops themselves. I think it will be much easier to deal with this when it is considered as a separate issue. Ones the patches get accepted I'll fill a ticket for this.
  • I noticed that the testsuite is not adapted to work with integer-simple library. Running validation with INTEGER_LIBRARY=integer-simple results with many tests failing because they have an explicit dependency on integer-gmp package or one of its internal modules. Should I create a ticket for this?

comment:56 Changed 4 years ago by simonmar

I don't see any changes to the code generator in there - how does it work if the code generator is still generating code for Bool-returning primops, or is it just a gigantic coincidence that it still works somehow?

comment:57 Changed 4 years ago by simonpj

Awaiting response from Jan...

comment:58 Changed 4 years ago by jstolarek

Simon M., you're right - there are no changes in the code generator. According to tests and my analysis of generated assembly everything works correct. I wish I could give an explanation of how that works in the codegen but I have to admit that I am not familiar with that part of the compiler. I'll take a look at the code and see if I can give a more professional response :)

Right now I'm trying to see if I can use new primops to get more performance out of Gregory Collins' hashtables package - I'll post here if I get good results.

comment:59 Changed 4 years ago by simonmar

It's possible that the code generator just works with the new primops, but that you can remove the old special cases that were there to make the Bool results work. The main one is in StgCmmExpr.hs, see "Note [case on bool]", there might be others lurking.

comment:60 Changed 4 years ago by jstolarek

Thanks. I'll look into it. I noticed some parts of code that I suspected to be obsolete after modifying primops, but I wasn't quite sure, so I just left them unchanged.

comment:61 Changed 4 years ago by jstolarek

I updated the patches yesterday so they merge cleanly with the recent changes in HEAD. Removing the dead code might take me some time - I want to read more about STG and Cmm first so that I know what I'm doing.

comment:62 Changed 4 years ago by simonpj

Great, thanks. You'll be at MSR in person in a month, so let's do code review and commit then. Before we do:

  • We need to be clear about the codegen question that Simon M asked. It's not enough to work by coincidence.
  • Some performance numbers that show that there is a useful performance improvement would be welcome. After all, that is the whole point!

Simon

comment:63 Changed 4 years ago by jstolarek

Regarding performance, are the results for filter shown here convincing, or do we want a more sophisticated example? These results do show that it is possible to get a significant speedup, even with very simple functions.

comment:64 Changed 4 years ago by simonmar

I'm happy so long as nofib performance and code size are unchanged. The change is a useful cleanup in itself: the PrimOps correspond more closely to the MachOps, and it removes the need for special cases in the code generator.

comment:65 Changed 4 years ago by jstolarek

I didn't notice any significant code size changes with a few programs that I compiled. I will make sure that there are no regressions, both in terms of code size and performance, as soon as I manage to run nofib - right now I'm struggling with some nofib errors (mailed on ghc-devs today). Also, #7958 blocks me partially with benchmarking hastables package.

comment:66 Changed 4 years ago by jstolarek

I managed to run nofib. There is one notable performance regression in kahan benchmark. I will investigate why that happens. Interestingly, there is a consistent - although small - reduction in executable size.

comment:67 Changed 4 years ago by tibbe

Please do check kahan, it's a very simple benchmark for arithmetic in simple loops and GHC should generate good code for it.

comment:68 Changed 4 years ago by jstolarek

I spent some time on figuring out why my patch works. Here's what I come up with (hopefully I got things right).

Previously the PrimOps of type Compare used to return Bool (prelude/PrimOp.lhs):

getPrimOpResultInfo :: PrimOp -> PrimOpResultInfo
getPrimOpResultInfo op
  = case (primOpInfo op) of
      Compare _ _  -> ReturnsAlg boolTyCon 

I changed them to return Int#:

getPrimOpResultInfo :: PrimOp -> PrimOpResultInfo
getPrimOpResultInfo op
  = case (primOpInfo op) of
      Compare _ _   -> ReturnsPrim (tyConPrimRep intPrimTyCon)

This is how primops are converted from Core to STG (CoreToStg.lhs, coreToStgApp function):

        res_ty = exprType (mkApps (Var f) args)
        app = case idDetails f of
                -- Some primitive operator that might be implemented as a library call.
                PrimOpId op      -> ASSERT( saturated )
                                    StgOpApp (StgPrimOp op) args' res_ty

Previously res_ty was set to Bool, now it is Int#. STG representation of a primop is converted to Cmm (StgCmmExpr.hs) by cgExpr function:

cgExpr (StgOpApp op args ty) = cgOpApp op args ty

So previously cgOpApp function matched this guard:

  | ReturnsAlg tycon <- result_info
  , isEnumerationTyCon tycon
        -- c.f. cgExpr (...TagToEnumOp...)
  = do dflags <- getDynFlags
       tag_reg <- newTemp (bWord dflags)
       cgPrimOp [tag_reg] primop args
       emitReturn [tagToClosure dflags tycon
                                (CmmReg (CmmLocal tag_reg))]

and now it matches this one:

  | ReturnsPrim rep <- result_info
  = do dflags <- getDynFlags
       res <- newTemp (primRepCmmType dflags rep)
       cgPrimOp [res] primop args
       emitReturn [CmmReg (CmmLocal res)]

The main difference being that there is no call to tagToClosure, which was the goal of this ticket - to return an unboxed unlifted value instead of a closure that needs to be entered or have its tag examined. This leads me to a conclusion: since tagToClosure is an implementation of tagToEnum# primop, there should be no performance difference between the previous version of code, where call to tagToEnum# was generated implicitly by the code generator, and the new primop wrappers, which make the call to tagToEnum# explicit. In theory at least, because kahan for some reason is slower. I'm investigating this but so far I found nothing that would be obvious.

And I'm also working on identifying code that is no longer necessary. One thing that keeps me puzzled is that there are still primops that return Bool, e.g.

primop  SameMutVarOp "sameMutVar#" GenPrimOp
   MutVar# s a -> MutVar# s a -> Bool

I'm not sure if this allows me to remove special case that Simon M. mentioned.

comment:69 Changed 4 years ago by simonpj

I think we should treat all Bool-returning primops identically. I think it's just the sameX family of operators, and they can perfectly well return Int# in the same way that eqInt# now does. Then you can eliminate those cases from the codegen path, which is great.

One thing I'm puzzled about is this. If ># yields an Int#, how does that Int# get into a register? Do we get code like

  cmp x,y
  bne L1
  ld R1, 1
  goto L2
L1: ld R1, 0
L2: ...now R1 has the Int#...

Getting the info from the status flags into a value, which we then compare with 0, would be silly. How is this avoided? Esp when there are several tests and you do logical operations on the Int#?

I looked on the wiki page http://hackage.haskell.org/trac/ghc/wiki/PrimBool, but could not figure it out. Maybe worth adding a section to explain this? Thanks

Simon

Simon

comment:70 Changed 4 years ago by gregorycollins

It's the setge %cl instruction in the code snippet on that wiki page, which:

        Sets the byte in the operand to 1 if the Sign Flag equals the
        Overflow Flag, otherwise sets the operand to 0.

(see e.g. http://web.itu.edu.tr/kesgin/mul06/intel/instr/setge_setnl.html)

Or one of the other SETcc functions. This instruction doesn't branch, so the generated code pipelines better.

comment:71 Changed 4 years ago by jstolarek

Simon, I updated this section to contain a detailed discussion of the generated assembly. I hope this helps.

I'll try to change other primops and remove unnecessary code before my arrival in Cambridge.

comment:72 Changed 4 years ago by jstolarek

I need advice regarding names of the modified primops. Right now we have sameMutableArray# primop returning a Bool. I was thinking that having a primop sameMutableArray# returning an Int# + having a wrapper sameMutableArray will be more elegant (because # at the end distinguishes built-in primops), but this will break people's code. I can follow approach used in patches created so far, that is change the name of the modified primop to sameMutableArray$# and reuse the name sameMutableArray# for wrapper, which will certainly break less code (people only need to import GHC.PrimWrappers) but seems a bit confusing.

comment:73 Changed 4 years ago by jstolarek

I uploaded the updated patches - they remove redundant code in code generator.

Could someone remove the two prototype patches I uploaded some time ago? They are not important any more.

comment:74 Changed 4 years ago by igloo

Owner: jstolarek deleted
Status: patchnew

I have the impression that these patches need some more polishing before they're ready to apply, so I'm reopening the ticket. Please let me know if that's wrong.

Regarding the name of sameMutableArray#, I don't have a strong opinion. I suspect there are few users of the function, so personally I'd be inclined to use the most consistent names.

comment:75 in reply to:  69 Changed 4 years ago by simonmar

Replying to simonpj:

One thing I'm puzzled about is this. If ># yields an Int#, how does that Int# get into a register?

If we branch directly on the result of the comparison, then the intermediate Int# is never manifested, the code generator just generates a conditional branch instruction. The optimisations in CmmSink try to ensure that the comparison is pushed into the CmmCondBranch so the code generator can spot it.

If the Int# value is really used (as in the example with logical ops on the wiki page), then on x86 there are instructions to generate it (e.g. sete, setge, etc.).

comment:76 in reply to:  74 Changed 4 years ago by jstolarek

Replying to igloo:

I have the impression that these patches need some more polishing before they're ready to apply, so I'm reopening the ticket. Please let me know if that's wrong.

Sure. If there's anything wrong with patches please let me know - I'll do my best to make them better.

comment:77 Changed 4 years ago by jstolarek

I'm running some benchmarks and analysing produced assembly and I just realized that removing case-on-bool in the codegen causes a performance regression. It seems to me that this special case is necessary, even after removing primops that return Bool.

Changed 4 years ago by jstolarek

Attachment: T6135-ghc.patch added

Changed 4 years ago by jstolarek

Attachment: T6135-testsuite.patch added

Changed 4 years ago by jstolarek

Attachment: T6135-packages-array.patch added

Changed 4 years ago by jstolarek

Attachment: T6135-packages-base.patch added

Changed 4 years ago by jstolarek

Changed 4 years ago by jstolarek

Changed 4 years ago by jstolarek

Changed 4 years ago by jstolarek

comment:78 Changed 4 years ago by jstolarek

Status: newpatch

I'm uploading the final patches. I don't yet have push permissions to main GHC repos, so I can't merge them myself.

comment:79 Changed 4 years ago by jstolarek

Status: patchnew

comment:80 Changed 4 years ago by jstolarek

Owner: set to jstolarek

comment:81 Changed 4 years ago by simonmar

Blocked By:

comment:82 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

Resolution: fixed
Status: newclosed

In 6579a6c73082387f82b994305011f011d9d8382b/ghc:

Comparison primops return Int# (Fixes #6135)

This patch modifies all comparison primops for Char#, Int#, Word#, Double#,
Float# and Addr# to return Int# instead of Bool. A value of 1# represents True
and 0# represents False. For a more detailed description of motivation for this
change, discussion of implementation details and benchmarking results please
visit the wiki page: http://hackage.haskell.org/trac/ghc/wiki/PrimBool

There's also some cleanup: whitespace fixes in files that were extensively edited
in this patch and constant folding rules for Integer div and mod operators (which
for some reason have been left out up till now).

comment:83 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In ac52535d35a47e70d63498aa8c85c4fe30d1fd92/testsuite:

Comparison primops return Int# (Fixes #6135)

This patch adds tests for new primops and fixes the existing ones.
For a deatiled discussion of this changes please visit the wiki page:
http://hackage.haskell.org/trac/ghc/wiki/PrimBool

comment:84 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In f6e2398adb63f5c35544333268df9c8837fd2581/base:

Comparison primops return Int# (Fixes #6135)

For a deatiled discussion of this changes please visit the wiki page:
http://hackage.haskell.org/trac/ghc/wiki/PrimBool

comment:85 Changed 4 years ago by simonpj

Test Case: primops/should_run/T6135

comment:86 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 77041157b00da259e2ae45fba80ed5d1b0eaa0fa/testsuite:

Follow changes in comparison primops (see #6135)

comment:87 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 427cbd522c3aeb04dc6a9bd696ffb7c71069facc/base:

Follow changes in comparison primops (see #6135)

comment:88 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 5ab5b3d7ca15fc7620de4c979f4eece7780895f4/ghc-prim:

Follow changes in comparison primops (see #6135)

comment:89 Changed 4 years ago by Lemming

Btw. LLVM usually represents boolean values by type i1. It would certainly be easier to translate Bool# to i1, than translating Int# to i1 in a special case. But as I understand the Bool# type was finally rejected in order to optimize all enumeration types the same way, right?

comment:90 Changed 4 years ago by carter

@lemming, Bool# is now Int# afaik, could you explain what you mean? even if we used a i1 rep, it still needs a full register! So theres no lost for the in register rep to be Int#, afaik

comment:91 in reply to:  90 ; Changed 4 years ago by Lemming

Replying to carter:

@lemming, Bool# is now Int# afaik, could you explain what you mean? even if we used a i1 rep, it still needs a full register! So theres no lost for the in register rep to be Int#, afaik

Int# represents a large range of integers, whereas i1 in LLVM has only values 0 and 1. That is, LLVM can optimize based on the knowledge of the restricted value set of i1, but it certainly fails to do so on i32 or i64.

comment:92 in reply to:  91 Changed 3 years ago by altaic

Replying to Lemming:

Replying to carter:

@lemming, Bool# is now Int# afaik, could you explain what you mean? even if we used a i1 rep, it still needs a full register! So theres no lost for the in register rep to be Int#, afaik

Int# represents a large range of integers, whereas i1 in LLVM has only values 0 and 1. That is, LLVM can optimize based on the knowledge of the restricted value set of i1, but it certainly fails to do so on i32 or i64.

Two of the advantages in representing Bool# as an i1 is that llvm can vectorize them, and arrays of i1 can fit in an i32 or i64. Off the top of my head, I don't know of any software that makes use of large arrays of booleans, but in that case this sort of optimization would likely be very effective.

comment:93 Changed 3 years ago by dfeuer

Owner: jstolarek deleted
Resolution: fixed
Status: closednew

It appears that this works as intended for <#, >#, <=#, and >=#, but not for ==#. If I write something like

foo x = tagToEnum# ((x <# 3#) `orI#` (x ># 100#) `orI#`
                    (x ==# 12#) `orI#` (x ==# 15#))

I get (in 7.8.3) Core doing something like

foo x = case x of
          12# -> True
          15# -> True
          _   -> tagToEnum# ((x <# 3#) `orI#` (x ># 100#))

(with different syntax, or course) and branching assembly to match. If this transformation is important in some other context, it would be great to have something that acts just like ==# but that doesn't do that, for the times when we actually don't want it.

comment:94 Changed 3 years ago by dfeuer

Resolution: fixed
Status: newclosed

Per jstolarek's request, opening a separate ticket for ==#, #9661

comment:95 Changed 3 years ago by bgamari

Cc: bgamari added

comment:96 Changed 16 months ago by jstolarek

Wiki Page: PrimBool

comment:97 Changed 4 weeks ago by bgamari

Description: modified (diff)
Keywords: datacon-tags added
Note: See TracTickets for help on using tickets.