Opened 11 years ago

Last modified 21 months ago

#849 new feature request

Offer control over branch prediction

Reported by: simonpj
Owned by: simonmar
Priority: normal
Milestone:
Component: Compiler
Version: 6.4.2
Keywords:
Cc: pho@…, johan.tibell@…, claus, daniel.is.fischer@…, michal.terepeta@…, ikke+ghc@…, buecking@…, hackage.haskell.org@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case: N/A
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

So now that GHC is so good at producing really fast low-level code (see the ByteString benchmarks), we start to encounter low-level gubbins where, to get the fastest possible code, we need slightly more influence over branch prediction.

In the code for ByteString's 'writeStrUp/Dn' functions we're doing a tight loop writing bytes to memory.

The first version looks much like this:

loop fp n off s = case next s of
    Done       -> return $! PS fp 0 off
    Skip    s' -> loop fp n off s'
    Yield x s' -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'

When we get to the end of the memory block we need to double its size and carry on, which adds an additional conditional test to the loop:

loop fp n off s = case next s of
    Done       -> trimUp fp n off
    Skip    s' -> loop fp n off s'
    Yield x s'
        | n == off -> realloc fp n off s' x
        | otherwise -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'

There are dramatic differences between equivalent forms of this code. Just by reversing the order of the (n == off) test, one form can process 50MB of data in 0.20 seconds while the other takes 0.34 seconds.

That is:

        | n == off -> realloc fp n off s' x
        | otherwise -> do ...

vs

        | n /= off -> do ...
        | otherwise -> realloc fp n off s' x

I think that this is all down to branch prediction and the arrangement of basic blocks. On a modern CPU correctly predicted branches are nearly free while mis-predicted branches are very costly due to stalled pipelines etc.

In gcc there is a branch prediction framework. It annotates conditionals with prediction information and then uses that during code generation to do things like arranging basic blocks so that unlikely blocks are moved out of line. It gets the prediction information from a few sources. One is a simple static branch predictor: for example, branches back to the beginning of a loop are likely to be taken and branches to error functions are not. gcc even has a profile-feedback system to find the real probabilities of branches from a sample run of the program. It also has a user annotation, __builtin_expect(), which many C projects use in the form of a macro:

if (unlikely(x==0)) { ...

The Linux kernel uses this fairly heavily to move error handling code out of the main 'fast path' code.

Anyway, what I wish is that I could write something like:

loop fp n off s = case next s of
    Done       -> {-# UNLIKELY #-}
                  trimUp fp n off
    Skip    s' -> loop fp n off s'
    Yield x s'
        | n == off -> {-# UNLIKELY #-}
                      realloc fp n off s' x
        | otherwise -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'

The best approximation I can use at the moment is to make the unlikely realloc branch a local function in a where clause, but mark it with {-# NOINLINE #-} so that the code for the realloc case doesn't pollute the tight loop for the normal fast case. The other approximation is fiddling with the logic of the binary test and looking at the benchmarks. The combination of these two techniques makes a dramatic difference to the speed.
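For concreteness, here is a minimal, self-contained sketch of that workaround; fillFrom and growAndContinue are hypothetical names with simplified buffer-doubling logic, not the actual ByteString internals:

import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Storable (pokeByteOff)
import Data.Word (Word8)

-- Write a list of bytes into a buffer of size n starting at off,
-- doubling the buffer when it fills up.
fillFrom :: ForeignPtr Word8 -> Int -> Int -> [Word8] -> IO (ForeignPtr Word8, Int)
fillFrom fp _ off []     = return (fp, off)
fillFrom fp n off (x:xs)
    | off /= n  = do   -- hot path: there is room, write and continue
        withForeignPtr fp $ \p -> pokeByteOff p off x
        fillFrom fp n (off + 1) xs
    | otherwise = growAndContinue fp n off (x:xs)
  where
    -- Cold path: marked NOINLINE so its code stays out of the tight loop.
    {-# NOINLINE growAndContinue #-}
    growAndContinue fp' n' off' ys = do
        let n'' = max 1 (2 * n')
        fp'' <- mallocForeignPtrBytes n''
        withForeignPtr fp' $ \src ->
            withForeignPtr fp'' $ \dst -> copyBytes dst src off'
        fillFrom fp'' n'' off' ys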

However, I need more control to do this reliably: while I was able to make it work for writeStrUp, I could not find a combination that works for writeStrDn (we lose about 50% of the performance when we add the realloc branch). So a slightly more reliable way to hint at the likelihood of conditionals would make this kind of low-level code faster and easier to write.

Some time ago, as a quick hack, I added a branch prediction annotation to the Cmm conditional constructor. I only used it in the C backend, to turn it into uses of gcc's __builtin_expect(). I didn't connect it to the front end, so there was no high-level static branch prediction (which could be done by looking for branches with recursive calls or calls to error etc.). The only places I actually used it were the heap check and stack check, on the assumption that it would be advantageous to predict heap and stack overflow branches as not taken. It was only a quick hack, so I didn't do any benchmarks to see whether it had any effect.
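For illustration only, a hypothetical, self-contained sketch of what such an annotation might look like; the types and emitCond function below are simplified stand-ins, not GHC's actual Cmm representation or C backend:

-- Hypothetical likelihood hint attached to a conditional branch.
data Likelihood = Likely | Unlikely | NoHint
    deriving (Eq, Show)

-- Simplified stand-ins for the real Cmm expression and block types.
newtype CmmExpr = CmmExpr String
newtype BlockId = BlockId Int

data CmmStmt
    = CmmCondBranch CmmExpr BlockId BlockId Likelihood
    -- (other constructors elided)

-- How a C-like backend could lower the hint to gcc's __builtin_expect.
emitCond :: CmmStmt -> String
emitCond (CmmCondBranch (CmmExpr cond) (BlockId t) (BlockId f) hint) =
    "if (" ++ wrap hint cond ++ ") goto L" ++ show t
        ++ "; else goto L" ++ show f ++ ";"
  where
    wrap Likely   c = "__builtin_expect((" ++ c ++ "), 1)"
    wrap Unlikely c = "__builtin_expect((" ++ c ++ "), 0)"
    wrap NoHint   c = c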

Anyway, just a thought, and of course not something we should be thinking about 'til post 6.6.

Duncan

Change History (29)

comment:1 Changed 11 years ago by igloo

Milestone: 6.8
Test Case: N/A

comment:2 Changed 11 years ago by AndyGill

Hpc can generate accurate branch information, which could be passed on to gcc. In C compilers, such 'training' makes a huge performance difference.

AndyG

comment:3 Changed 10 years ago by simonmar

Milestone: 6.8 branch → 6.10 branch

Let's look at this in the context of the new back end.

comment:4 Changed 9 years ago by PHO

Cc: pho@… added

comment:5 Changed 9 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:6 Changed 9 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:7 Changed 9 years ago by tibbe

Cc: johan.tibell@… added

comment:8 Changed 9 years ago by claus

Cc: claus added

GHC actually ignores any expected-frequency information that programmers might try to encode in the ordering of pattern-match/conditional clauses, sorting the alternatives by data type definition order instead. If GHC respected the source ordering of branches, programmers would have a chance to express expected-frequency information without pragmas (assuming that a successful match corresponds to fall-through, with no jump). See this thread:

compilation of pattern-matching?
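A small illustration of the point; the Step type and classify functions are made up for this example, not GHC or ByteString code:

-- Per the comment above, GHC sorts case alternatives by constructor
-- (definition) order, so these two definitions, which differ only in the
-- source order of their alternatives, compile the same way; any frequency
-- hint the programmer encoded in that ordering is lost.
data Step = Done | Skip | Yield

classifyA :: Step -> Int
classifyA s = case s of
    Yield -> 1      -- "hot" case written first ...
    Skip  -> 2
    Done  -> 3

classifyB :: Step -> Int
classifyB s = case s of
    Done  -> 3      -- ... or written last: same generated code either way.
    Skip  -> 2
    Yield -> 1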

comment:9 Changed 9 years ago by igloo

Milestone: 6.10 branch → 6.12 branch

comment:10 Changed 7 years ago by igloo

Milestone: 6.12 branch → 6.12.3

comment:11 Changed 7 years ago by tibbe

Type of failure: None/Unknown

We have the same problem in Data.Binary.Builder and Data.Text.Lazy.Builder, where we need to check whether the output buffer is full before writing a byte/character. The path that creates a new buffer should be out of line.

comment:12 Changed 7 years ago by igloo

Milestone: 6.12.3 → 6.14.1
Priority: normal → low

comment:13 Changed 7 years ago by daniel.is.fischer

Cc: daniel.is.fischer@… added

comment:14 Changed 7 years ago by igloo

Milestone: 7.0.1 → 7.0.2

comment:15 Changed 7 years ago by igloo

Milestone: 7.0.2 → 7.2.1

comment:16 Changed 6 years ago by michalt

Cc: michal.terepeta@… added

comment:17 Changed 6 years ago by igloo

Milestone: 7.2.1 → 7.4.1

comment:18 Changed 6 years ago by igloo

Milestone: 7.4.1 → 7.6.1
Priority: low → normal

comment:19 Changed 5 years ago by nicolast

Cc: ikke+ghc@… added

When using the LLVM backend, this might be feasible using the "branch_weights" [1] metadata annotations. This way there's no need to do any reordering in the intermediate IR: the optimization passes do it based on these annotations (I checked using Clang with __builtin_expect, comparing the generated IR and the final assembly).

This allows more granular weights than just "LIKELY" or "UNLIKELY".

Would there be any interest if I looked into adding such a "BRANCH_WEIGHT (uint)" pragma? (Not sure I can pull it off, though; I'm not familiar with GHC internals (yet).)

[1] http://llvm.org/docs/BranchWeightMetadata.html

comment:20 Changed 5 years ago by igloo

Milestone: 7.6.1 → 7.6.2

comment:21 Changed 5 years ago by buecking

Cc: buecking@… added

comment:22 Changed 5 years ago by liyang

Cc: hackage.haskell.org@… added

comment:23 Changed 4 years ago by jstolarek

Cc: jan.stolarek@… added

comment:24 Changed 3 years ago by thoughtpolice

Milestone: 7.6.2 → 7.10.1

Moving to 7.10.1.

comment:25 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:26 Changed 2 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:27 Changed 2 years ago by jstolarek

Cc: jan.stolarek@… removed

comment:28 Changed 2 years ago by simonmar

Owner: set to simonmar

comment:29 Changed 21 months ago by thomie

Milestone: 8.0.1