Changes between Version 15 and Version 16 of PrimBool


Ignore:
Timestamp:
May 21, 2013 3:21:24 PM (11 months ago)
Author:
jstolarek
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • PrimBool

    v15 v16  
    8484 
    8585== Implementation details == 
    86   
     86 
    8787Below is a summary of implementation details and decisions: 
    8888 
     
    132132gtInteger#  :: Integer -> Integer -> Int# 
    133133geInteger#  :: Integer -> Integer -> Int# 
    134 }}}  
     134}}} 
    135135Each of these functions has a wrapper that calls `tagToEnum#` and returns a `Bool`. These wrappers are: `eqInteger`, `neqInteger`, `leInteger`, `ltInteger`, `gtInteger` and `geInteger`. 
    136136 
    137137  * Other libraries that were modified to work with the new primops are: base, ghc-prim and primitive. The only required modifications were imports of the !GHC.PrimWrappers module in modules that use the primops. 
    138138 
    139 == Proof of concept == 
    140  
    141 The prototype patch posted on the trac on 13th of March implemented six new prototype comparison primops: 
    142  
    143 {{{ 
    144 .>#  :: Int# -> Int# -> Int# 
    145 .<#  :: Int# -> Int# -> Int# 
    146 .>=# :: Int# -> Int# -> Int# 
    147 .<=# :: Int# -> Int# -> Int# 
    148 .==# :: Int# -> Int# -> Int# 
    149 ./=# :: Int# -> Int# -> Int# 
    150 }}} 
    151  
    152 Each of these new primops takes two `Int#`s that are to be compared. The result is also an `Int#`: `0#` if the relation between the operands does not hold and `1#` when it does hold. For example `5# .># 3#` returns `1#` and `3# .># 3#` returns `0#`. With the new primops we can rewrite the original expression that motivated the problem: 
     139== Eliminating branches using new primops == 
     140 
     141With the new primops we can rewrite the original expression that motivated the problem: 
    153142 
    154143{{{ 
     
    161150 
    162151{{{ 
    163 case (x .<# 0#) `orI#` (x .>=# width) `orI#` (y .<# 0#) `orI#` (y .>=# height) of 
     152case (x <$# 0#) `orI#` (x >=$# width) `orI#` (y <$# 0#) `orI#` (y >=$# height) of 
    164153  True  -> E1 
    165154  False -> E2 
    166155}}} 
    167156 
    168 (Note: `orI#` is a bitwise OR operation on operands of type `Int#`. It was introduced together with `andI#`, `notI#` and `xor#` in #7689). Using the LLVM backend this compiles to: 
    169  
    170 {{{ 
    171 # BB#0:                                 # %c1nK 
    172   movq  %rsi, %rax 
    173   orq %r14, %rax 
    174   shrq  $63, %rax 
    175   cmpq  %rdi, %r14 
    176   setge %cl 
    177   movzbl  %cl, %ecx 
    178   orq %rax, %rcx 
    179   cmpq  %r8, %rsi 
    180   setge %al 
    181   movzbl  %al, %eax 
    182   orq %rcx, %rax 
    183   jne .LBB2_1 
    184 # BB#3:                                 # %c1ol 
    185   movq  (%rbp), %rax 
    186   movl  $r1m6_closure+1, %ebx 
    187   jmpq  *%rax  # TAILCALL 
    188 .LBB2_1:                                # %c1nK 
    189   cmpq  $1, %rax 
    190   jne .LBB2_2 
    191 # BB#4:                                 # %c1ov 
    192   movq  (%rbp), %rax 
    193   movl  $r1m7_closure+1, %ebx 
    194   jmpq  *%rax  # TAILCALL 
    195 .LBB2_2:                                # %c1ob 
    196   movq  r1m5_closure(%rip), %rax 
    197   movl  $r1m5_closure, %ebx 
    198   jmpq  *%rax  # TAILCALL 
    199 }}} 
    200  
    201 The assembly does not contain comparisons and jumps in the scrutinee of the case expression, but still does jumps for selecting an appropriate branch of the case expression. 
    202  
    203 === Benchmarks for the proposed patch === 
    204  
    205 Below is a benchmark for the proof-of-concept filter function that demonstrates performance gains possible with the new primops: 
     157Using the LLVM backend this compiles to: 
     158 
     159{{{ 
     160# BB#0:                                 # %c1oe 
     161  movq   %rsi, %rax 
     162  orq    %r14, %rax 
     163  shrq   $63, %rax 
     164  cmpq   %rdi, %r14 
     165  setge  %cl 
     166  movzbl %cl, %ecx 
     167  orq    %rax, %rcx 
     168  cmpq   %r8, %rsi 
     169  setge  %al 
     170  movzbl %al, %eax 
     171  orq    %rcx, %rax 
     172  jne    .LBB2_1 
     173# BB#3:                                 # %c1oP 
     174  movq   (%rbp), %rax 
     175  movl   $r1mu_closure+1, %ebx 
     176  jmpq   *%rax  # TAILCALL 
     177.LBB2_1:                                # %c1oe 
     178  cmpq   $1, %rax 
     179  jne    .LBB2_2 
     180# BB#4:                                 # %c1oZ 
     181  movq   (%rbp), %rax 
     182  movl   $r1mv_closure+1, %ebx 
     183  jmpq   *%rax  # TAILCALL 
     184.LBB2_2:                                # %c1oF 
     185  movq   r1mt_closure(%rip), %rax 
     186  movl   $r1mt_closure, %ebx 
     187  jmpq   *%rax  # TAILCALL 
     188 
     189}}} 
     190 
     191The assembly does not contain comparisons and branches in the scrutinee of the case expression, but still uses jumps to select an appropriate branch of the case expression. 
     192 
     193=== Benchmarks === 
     194 
     195Below is a benchmark for the proof-of-concept branchless filter function that demonstrates performance gains possible with the new primops: 
    206196 
    207197{{{ 
     
    218208import Data.Vector.Unboxed               as U (Vector, filter, foldM', 
    219209                                               fromList, length, unsafeFreeze) 
    220 import GHC.Exts                          (Int (I#), (.>=#)) 
     210import GHC.Exts                          (Int (I#), (>=$#)) 
    221211import System.Random                     (RandomGen, mkStdGen, randoms) 
    222212import Prelude                    hiding (filter, length) 
     
    229219  let put i x = do 
    230220        let !(I# v) = x 
    231             inc     = I# (v .>=# 0#) 
     221            inc     = I# (v >=$# 0#) 
    232222        unsafeWrite fVec i x 
    233223        return $ i + inc 
     
    260250             cfgPerformGC = ljust True 
    261251           } 
    262  
    263252}}} 
    264253 
     
    270259}}} 
    271260 
    272 Benchmarking shows that `filterN` function is 60% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements). 
     261Benchmarking shows that `filterN` function is about 55-65% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements). Below is an example benchmarking report from criterion: 
     262 
     263[[Image(http://ics.p.lodz.pl/~stolarek/ghc/prim-bool-criterion.png)]]