Opened 10 months ago

Last modified 6 months ago

#9246 new bug

GHC generates poor code for repeated uses of min/max

Reported by: arotenberg Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #6135 Differential Revisions:

Description

Consider the following module, which intends to implement a branchless ray-AABB intersection test:

module SimpleGeom where

data Vec3 = Vec3
    {-# UNPACK #-} !Double
    {-# UNPACK #-} !Double
    {-# UNPACK #-} !Double
data Ray = Ray !Vec3 !Vec3 !Vec3
data AABB = AABB !Vec3 !Vec3

testRayAABBIntersection :: Ray -> AABB -> Bool
testRayAABBIntersection (Ray (Vec3 ox oy oz) _ (Vec3 invDx invDy invDz))
        (AABB (Vec3 minX minY minZ) (Vec3 maxX maxY maxZ)) =
    let tx1 = (minX - ox) * invDx
        tx2 = (maxX - ox) * invDx
        ty1 = (minY - oy) * invDy
        ty2 = (maxY - oy) * invDy
        tz1 = (minZ - oz) * invDz
        tz2 = (maxZ - oz) * invDz
        tmin = min tx1 tx2 `max` min ty1 ty2 `max` min tz1 tz2
        tmax = max tx1 tx2 `min` max ty1 ty2 `min` max tz1 tz2
    in tmax >= max 0 tmin

Everything is strict primitive operations, so GHC should generate very simple, fast code, right? But upon compiling with ghc -O -ddump-simpl -ddump-to-file SimpleGeom, I found a mess of nested local lambdas and similar performance-killing expression forms. (See the attached output file.)

There are two possible issues I can see here.

  1. GHC is trying to expand out all of the branches recursively (I would presume via case-of-cases transformation), which is a bad idea in this instance compared to just performing the cases sequentially and storing their results.
  2. GHC is generating branches for floating-point min/max. Instruction sets like SSE2 include non-branching floating-point min/max instructions, which is exactly what this algorithm was designed to exploit, but GHC does not generate code that could take advantage of these instructions.

Attachments (2)

SimpleGeom.dump-simpl (19.6 KB) - added by arotenberg 10 months ago.
Output of compiling with -O -ddump-simpl
UglyBranching.dump-simpl (25.5 KB) - added by arotenberg 10 months ago.
Core generated from UglyBranching example

Download all attachments as: .zip

Change History (14)

Changed 10 months ago by arotenberg

Output of compiling with -O -ddump-simpl

comment:1 Changed 10 months ago by carter

  • Milestone set to 7.10.1
  • Type changed from bug to feature request

So if you take a look at the haskell level primops that GHC provides
http://www.haskell.org/ghc/docs/latest/html/libraries/ghc-prim-0.3.1.0/GHC-Prim.html
you'll notice that theres no primops for min or max on the underlying unlifted types Double# or Float# (or Int#).

So if you need a branchless inner loop today, you'll either need to write a teeny bit of C you'll call out to, encode the logic as some bit fiddling on the floats, OR do something like describe the algorithm programmatically in a Haskell DSL, and at runtime use LLVM-General to generate the code you want http://hackage.haskell.org/package/llvm-general (i'm told some video codecs actually use LLVM for runtime code gen!)

Adding these to ghc is a pretty reasonable feature request, though such wouldn't land till 7.10. If you're interested in doing some of the leg work , i'm happy to try to guide you through the process! (If not, thats fine too, its a really concrete task i'll give to someone who wants to do their first ghc patch, or spend a weekend doing myself)

zooming out,

  1. do you have any benchmarks (say, using criterion) for this code?
  2. how does eg using -fllvm as your backend fair vs -fasm for this code in such a benchmark
Last edited 10 months ago by carter (previous) (diff)

comment:2 Changed 10 months ago by arotenberg

I tried using criterion to profile different versions of this function, but I couldn't get the results to not be noisy and "microbenchmarky".

The way I originally found this issue was by profiling the raytracer program I'm writing, which fingered this function as both the single largest time sink and the largest allocator. I tried replacing the definition with this variant:

testRayAABBIntersection :: Ray -> AABB -> Bool
testRayAABBIntersection
      (Ray (Vec3 (D# ox) (D# oy) (D# oz)) _
        (Vec3 (D# invDx) (D# invDy) (D# invDz)))
      (AABB (Vec3 (D# minX) (D# minY) (D# minZ))
        (Vec3 (D# maxX) (D# maxY) (D# maxZ))) =
    let tx1 = minusTimes## minX ox invDx
        tx2 = minusTimes## maxX ox invDx
        ty1 = minusTimes## minY oy invDy
        ty2 = minusTimes## maxY oy invDy
        tz1 = minusTimes## minZ oz invDz
        tz2 = minusTimes## maxZ oz invDz
        tmin = min## tx1 tx2 `max##` min## ty1 ty2 `max##` min## tz1 tz2
        tmax = max## tx1 tx2 `min##` max## ty1 ty2 `min##` max## tz1 tz2
    in isTrue# (tmax >=## max## 0.0## tmin)

{-# NOINLINE minusTimes## #-}
minusTimes## :: Double# -> Double# -> Double# -> Double#
minusTimes## minA oa invDa = (minA -## oa) *## invDa
{-# NOINLINE min## #-}
{-# NOINLINE max## #-}
min##, max## :: Double# -> Double# -> Double#
max## x y = if isTrue# (x <=## y) then y else x
min## x y = if isTrue# (x <=## y) then x else y

This gets the function's heap allocation down to the expected zero bytes, but at the cost of actually making the program slower (both with and without profiling)!

I haven't tried comparing results with the LLVM backend yet. I might look into it later today.

It would be nice to see non-branching min/max implemented for the primitive numeric types where available. Min/max for integer types can probably be implemented efficiently on many platforms using conditional move instructions. I'm not particularly interested in implementing it myself, but I'm happy with having it as a feature request.

comment:3 Changed 10 months ago by carter

I think your foo## operations can be safely marked INLINE, rather than NOINLINE. Was there a reason for the NOINLINE? the code loos pretty darn pure, so i don't see how NOINLINE would be needed, and that lack of inlining would explain why the branchless stuff was slower

comment:4 Changed 10 months ago by arotenberg

On the contrary, replacing the NOINLINEs with INLINEs results in essentially the original Core, local lambdas and all. I explicitly put the NOINLINEs in because that was the only way I could find to force GHC to not expand the branches and add the unnecessary lambdas and lazy lets.

Perhaps this issue should be split into two tickets - something like "Add branchless min/max primops" and "Min/max-like functions cause unnecessary closure creation"?

comment:5 Changed 10 months ago by carter

  • Milestone 7.10.1 deleted
  • Type changed from feature request to bug

yes, lets do that. I'll switch this ticket to being the bug one and open a new one for the feature request element and we can back and forth about generating the core etc here.

What were the ghc flags and optimization levels etc you were using to check the core? I'd like to make sure i can reproduce your findings before I try out my own hacking to help you out with near term work around over the next week (as my own work permits, though helping folks with perf engineering in haskell DOES overlap with my work a teeny bit)

comment:6 Changed 10 months ago by arotenberg

You should be able to reproduce the issue by using the versions and flags I listed in the original description with a fresh GHC 7.8.2 install.

I'm in no rush to get this fixed since the program I'm working on is just a hobby project. I just ran into the issue and figured I'd report it.

I mentioned "min/max-like functions" in my previous comment because it's easy to contrive other functions that cause similar issues. Try compiling this module with ghc -O -ddump-simpl -ddump-to-file UglyBranching.hs and look at the Core file it generates.

module UglyBranching where

foo :: Int -> Int -> Int -> Int -> Int
foo a b c d =
    (((a `bar` b) `bar` (c `bar` d)) `bar` ((a `bar` c) `bar` (b `bar` d))) `bar`
    (((b `bar` a) `bar` (d `bar` c)) `bar` ((c `bar` a) `bar` (d `bar` b)))
bar :: Int -> Int -> Int
bar m n = if m + n > 5 then m else n

Changed 10 months ago by arotenberg

Core generated from UglyBranching example

comment:7 Changed 10 months ago by simonpj

Indeed! What Core would you like to see generates for examples like UglyBranching?

The underlying difficulty is that the case-of-case transformation is utterly crucial for optimising Haskell programs, and it's hard to predict when it'll be unproductive, or even counter-productive. I'd welcome ideas.

case-of-case is described in "A transformation based optimiser for Haskell", and (in passing) in many other papers.

Simon

comment:8 Changed 10 months ago by arotenberg

Having now read "A transformation based optimiser for Haskell", I learned about the let-no-escape optimization, which I was not previously aware of. Looking at the output of -ddump-stg on UglyBranching, all of the local lambdas are actually bound by let-no-escape, which makes the whole question irrelevant.

Well, now I feel silly. I guess if there's something to be learned here, it's that let-no-escape could be publicized better! Everything I've seen on the internet up until now was all "look at the Core, if you see let, that's a Bad Thing and you might want to Do Something About It!" (Quoth the GHC docs: "If profiling has pointed the finger at particular functions, look at their Core code. lets are bad, cases are good, ... nested lambdas are bad, ...")

comment:9 Changed 10 months ago by simonpj

Yes, let-no-escape should be better documented. And I feel bad that although it's terribly important for performance, it's entirely implicit in Core, and it's not hard for a transformation to inadvertently lose the LNE property. I'd like it to be more explicit, somehow.

Simon

comment:10 Changed 10 months ago by arotenberg

I don't have any great ideas on the LNE issue, other than maybe add a line in the GHC user docs about it somewhere. carter has opened #9251 for the min/max primops, so I guess this issue can be closed. Thanks for your help!

comment:11 Changed 10 months ago by carter

@arontenberg, need not close it per se, I think @simonpj raise a good point (which perhaps can be spun off into its own ticket?)

I notice Jan has a blog post about LNE http://lambda.jstolarek.com/2013/10/let-no-escape/

is it kinda like a higher order phi function?

comment:12 Changed 6 months ago by gintas

  • Operating System changed from Windows to Unknown/Multiple
Note: See TracTickets for help on using tickets.