Version 15 (modified by 4 years ago) (diff) | ,
---|

# Implementing new primitive comparisons to allow branchless algorithms

This page gathers the notes about implementing new primitive logical operations and thus resolving ticket #6135.

## The problem

Consider following fragment of code:

case (x <# 0#) || (x >=# width) || (y <# 0#) || (y >=# height) of True -> E1 False -> E2

This kind of code is common in image processing (and array programming in general) where one needs to check whether the `(x,y)`

coordinates are within the image. Primitive comparison operators `<#`

and `>=#`

have type `Int# -> Int# -> Bool`

. Logical OR operator `(||)`

is defined as:

(||) :: Bool -> Bool -> Bool True || _ = True False || x = x

in GHC.Classes (ghc-prim library) which is equivalent of:

(||) x y = case x of True -> True False -> y

During the compilation process (assuming the optimizations are turned on) the definition of `(||)`

gets inlined and then case-of-case transform is performed successively. This results in following Core (cleaned up for clarity):

case <# x 0 of _ { False -> case >=# x width of _ { False -> case <# y 0 of _ { False -> case >=# y height of _ { False -> E2 True -> E1 }; True -> E1 }; True -> E1 }; True -> E1 };

and in following assembler code:

.Lc1rf: testq %r14,%r14 jl .Lc1rk cmpq %rdi,%r14 jge .Lc1rp testq %rsi,%rsi jl .Lc1ru cmpq %r8,%rsi jge .Lc1rz movl $Main_g2_closure+1,%ebx jmp *0(%rbp) .Lc1rk: movl $Main_g1_closure+1,%ebx jmp *0(%rbp) .Lc1rp: movl $Main_g1_closure+1,%ebx jmp *0(%rbp) .Lc1ru: movl $Main_g1_closure+1,%ebx jmp *0(%rbp) .Lc1rz: movl $Main_g1_closure+1,%ebx jmp *0(%rbp)

There are five possible branches to take, although four of them have the same result. This is caused by code duplication introduced by case-of-case transform (see this blog post for a step by step derivation). According to Ben Lippmeier, who submitted the original bug report, mis-predicted branches are bad in object code because they stall the pipeline.

## Solution

This problem was solved by modifying comparison primops to return unboxed unlifted `Int#`

instead of `Bool`

(which is lifted and thus is returned as a thunk that needs to be evaluated). Having `Int#`

returned as a result of logical comparison will allow to use branchless bitwise logical operators instead of branching logical operators defined by Haskell.

## Implementation details

Below is a summary of implementation details and decisions:

- the new comparison primops return a value of type
`Int#`

:`1#`

represents`True`

and`0#`

represents`False`

. The`Int#`

type was chosen because on Haskell it is more common to use signed Int type insetad of unsigned Word. By using`Int#`

the users can easily convert unboxed result into a boxed value, without need to use`word2Int#`

and`int2word#`

primops. - as a small side-task, four new logical bitwise primops have been implemented:
`andI#`

,`orI#`

,`xorI#`

and`negI#`

(#7689). These operate on values of type`Int#`

. Earlier we had only bitwise logical primops operating on values of type`Word#`

. - names of the existing comparison primops were changed. Operators will have
`$`

added before`#`

, others will have`I`

added before the`#`

(this is a mnemonic denoting that this primop returns and`Int#`

). Examples:

>=$# :: Int# -> Int# -> Int# /=$## :: Double# -> Double# -> Int# gtCharI# :: Char# -> Char# -> Int# eqWordI# :: Word# -> Word# -> Int# ltFloatI# :: Float# -> Float# -> Int# leAddrI# :: Addr# -> Addr# -> Int#

- a new module
`GHC.PrimWrappers`

was added to ghc-prim library. This module contains wrappers for comparison primops. These wrappers have names identical to removed primops and return a`Bool`

. Examples:

gtChar# :: Char# -> Char# -> Bool gtChar# a b = tagToEnum# (a `gtCharI#` b) (>=#) :: Int# -> Int# -> Bool (>=#) a b = tagToEnum# (a >=$# b) eqWord# :: Word# -> Word# -> Bool eqWord# a b = tagToEnum# (a `eqWordI#` b) (/=##) :: Double# -> Double# -> Bool (/=##) a b = tagToEnum# (a /=$## b) ltFloat# :: Float# -> Float# -> Bool ltFloat# a b = tagToEnum# (a `ltFloatI#` b) leAddr# :: Addr# -> Addr# -> Bool leAddr# a b = tagToEnum# (a `leAddrI#` b)

Thanks to these wrappers the change is almost backwards compatible. The only thing primop users need to change in their existing code to make it work again is adding import of !GHC.PrimWrappers module.

- functions for comparing
`Integer`

type, implemented in integer-gmp and integer-simple libraries, received a similar treatment. Technically they are not primops, because they are implemented in Haskell (in case of integer-gmp also with FFI), but they pretend to be ones. There are six primops for comparing`Integer`

values:eqInteger# :: Integer -> Integer -> Int# neqInteger# :: Integer -> Integer -> Int# leInteger# :: Integer -> Integer -> Int# ltInteger# :: Integer -> Integer -> Int# gtInteger# :: Integer -> Integer -> Int# geInteger# :: Integer -> Integer -> Int#

Each of these functions has a wrapper that calls `tagToEnum#`

and returns a `Bool`

. These wrappers are: `eqInteger`

, `neqInteger`

, `leInteger`

, `ltInteger`

, `gtInteger`

and `geInteger`

.

- Other libraries that were modified to work with the new primops are: base, ghc-prim and primitive. The only required modifications were imports of the !GHC.PrimWrappers module in modules that use the primops.

## Proof of concept

The prototype patch posted on the trac on 13th of March implemented six new prototype comparison primops:

.># :: Int# -> Int# -> Int# .<# :: Int# -> Int# -> Int# .>=# :: Int# -> Int# -> Int# .<=# :: Int# -> Int# -> Int# .==# :: Int# -> Int# -> Int# ./=# :: Int# -> Int# -> Int#

Each of these new primops takes two `Int#`

s that are to be compared. The result is also an `Int#`

: `0#`

if the relation between the operands does not hold and `1#`

when it does hold. For example `5# .># 3#`

returns `1#`

and `3# .># 3#`

returns `0#`

. With the new primops we can rewrite the original expression that motivated the problem:

case (x <# 0#) || (x >=# width) || (y <# 0#) || (y >=# height) of True -> E1 False -> E2

as

case (x .<# 0#) `orI#` (x .>=# width) `orI#` (y .<# 0#) `orI#` (y .>=# height) of True -> E1 False -> E2

(Note: `orI#`

is a bitwise OR operation on operands of type `Int#`

. It was introduced together with `andI#`

, `notI#`

and `xor#`

in #7689). Using the LLVM backend this compiles to:

# BB#0: # %c1nK movq %rsi, %rax orq %r14, %rax shrq $63, %rax cmpq %rdi, %r14 setge %cl movzbl %cl, %ecx orq %rax, %rcx cmpq %r8, %rsi setge %al movzbl %al, %eax orq %rcx, %rax jne .LBB2_1 # BB#3: # %c1ol movq (%rbp), %rax movl $r1m6_closure+1, %ebx jmpq *%rax # TAILCALL .LBB2_1: # %c1nK cmpq $1, %rax jne .LBB2_2 # BB#4: # %c1ov movq (%rbp), %rax movl $r1m7_closure+1, %ebx jmpq *%rax # TAILCALL .LBB2_2: # %c1ob movq r1m5_closure(%rip), %rax movl $r1m5_closure, %ebx jmpq *%rax # TAILCALL

The assembly does not contain comparisons and jumps in the scrutinee of the case expression, but still does jumps for selecting an appropriate branch of the case expression.

### Benchmarks for the proposed patch

Below is a benchmark for the proof-of-concept filter function that demonstrates performance gains possible with the new primops:

{-# LANGUAGE BangPatterns, MagicHash #-} module Main ( main ) where import Control.Monad.ST (runST) import Criterion.Config (Config, cfgPerformGC, defaultConfig, ljust) import Criterion.Main import Data.Vector.Unboxed.Mutable (unsafeNew, unsafeSlice, unsafeWrite) import Data.Vector.Unboxed as U (Vector, filter, foldM', fromList, length, unsafeFreeze) import GHC.Exts (Int (I#), (.>=#)) import System.Random (RandomGen, mkStdGen, randoms) import Prelude hiding (filter, length) filterN :: U.Vector Int -> U.Vector Int filterN vec = runST $ do let !size = length vec fVec <- unsafeNew size let put i x = do let !(I# v) = x inc = I# (v .>=# 0#) unsafeWrite fVec i x return $ i + inc fSize <- foldM' put 0 vec unsafeFreeze $ unsafeSlice 0 fSize fVec main :: IO () main = return (mkStdGen 1232134332) >>= defaultMainWith benchConfig (return ()) . benchmarks benchmarks :: RandomGen g => g -> [Benchmark] benchmarks gen = let dataSize = 10 ^ (7 :: Int) inputList = take dataSize . randoms $ gen :: [Int] inputVec = fromList inputList isPositive = (> 0) in [ bgroup "Filter" [ bench "New" $ whnf (filterN) inputVec , bench "Vector" $ whnf (filter isPositive) inputVec ] ] benchConfig :: Config benchConfig = defaultConfig { cfgPerformGC = ljust True }

Compile and run with:

ghc -O2 -fllvm -optlo-O3 Main.hs ./Main -o report.html

Benchmarking shows that `filterN`

function is 60% faster than the `filter`

function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements).

### Attachments (1)

- prim-bool-criterion.png (107.6 KB) - added by 4 years ago.

Download all attachments as: .zip