Changes between Version 15 and Version 16 of PrimBool


Ignore:
Timestamp:
May 21, 2013 3:21:24 PM (2 years ago)
Author:
jstolarek
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • PrimBool

    v15 v16  
    8484
    8585== Implementation details ==
    86  
     86
    8787Below is a summary of implementation details and decisions:
    8888
     
    132132gtInteger#  :: Integer -> Integer -> Int#
    133133geInteger#  :: Integer -> Integer -> Int#
    134 }}} 
     134}}}
    135135Each of these functions has a wrapper that calls `tagToEnum#` and returns a `Bool`. These wrappers are: `eqInteger`, `neqInteger`, `leInteger`, `ltInteger`, `gtInteger` and `geInteger`.
    136136
    137137  * Other libraries that were modified to work with the new primops are: base, ghc-prim and primitive. The only required modifications were imports of the !GHC.PrimWrappers module in modules that use the primops.
    138138
    139 == Proof of concept ==
    140 
    141 The prototype patch posted on the trac on 13th of March implemented six new prototype comparison primops:
    142 
    143 {{{
    144 .>#  :: Int# -> Int# -> Int#
    145 .<#  :: Int# -> Int# -> Int#
    146 .>=# :: Int# -> Int# -> Int#
    147 .<=# :: Int# -> Int# -> Int#
    148 .==# :: Int# -> Int# -> Int#
    149 ./=# :: Int# -> Int# -> Int#
    150 }}}
    151 
    152 Each of these new primops takes two `Int#`s that are to be compared. The result is also an `Int#`: `0#` if the relation between the operands does not hold and `1#` when it does hold. For example `5# .># 3#` returns `1#` and `3# .># 3#` returns `0#`. With the new primops we can rewrite the original expression that motivated the problem:
     139== Eliminating branches using new primops ==
     140
     141With the new primops we can rewrite the original expression that motivated the problem:
    153142
    154143{{{
     
    161150
    162151{{{
    163 case (x .<# 0#) `orI#` (x .>=# width) `orI#` (y .<# 0#) `orI#` (y .>=# height) of
     152case (x <$# 0#) `orI#` (x >=$# width) `orI#` (y <$# 0#) `orI#` (y >=$# height) of
    164153  True  -> E1
    165154  False -> E2
    166155}}}
    167156
    168 (Note: `orI#` is a bitwise OR operation on operands of type `Int#`. It was introduced together with `andI#`, `notI#` and `xor#` in #7689). Using the LLVM backend this compiles to:
    169 
    170 {{{
    171 # BB#0:                                 # %c1nK
    172   movq  %rsi, %rax
    173   orq %r14, %rax
    174   shrq  $63, %rax
    175   cmpq  %rdi, %r14
    176   setge %cl
    177   movzbl  %cl, %ecx
    178   orq %rax, %rcx
    179   cmpq  %r8, %rsi
    180   setge %al
    181   movzbl  %al, %eax
    182   orq %rcx, %rax
    183   jne .LBB2_1
    184 # BB#3:                                 # %c1ol
    185   movq  (%rbp), %rax
    186   movl  $r1m6_closure+1, %ebx
    187   jmpq  *%rax  # TAILCALL
    188 .LBB2_1:                                # %c1nK
    189   cmpq  $1, %rax
    190   jne .LBB2_2
    191 # BB#4:                                 # %c1ov
    192   movq  (%rbp), %rax
    193   movl  $r1m7_closure+1, %ebx
    194   jmpq  *%rax  # TAILCALL
    195 .LBB2_2:                                # %c1ob
    196   movq  r1m5_closure(%rip), %rax
    197   movl  $r1m5_closure, %ebx
    198   jmpq  *%rax  # TAILCALL
    199 }}}
    200 
    201 The assembly does not contain comparisons and jumps in the scrutinee of the case expression, but still does jumps for selecting an appropriate branch of the case expression.
    202 
    203 === Benchmarks for the proposed patch ===
    204 
    205 Below is a benchmark for the proof-of-concept filter function that demonstrates performance gains possible with the new primops:
     157Using the LLVM backend this compiles to:
     158
     159{{{
     160# BB#0:                                 # %c1oe
     161  movq   %rsi, %rax
     162  orq    %r14, %rax
     163  shrq   $63, %rax
     164  cmpq   %rdi, %r14
     165  setge  %cl
     166  movzbl %cl, %ecx
     167  orq    %rax, %rcx
     168  cmpq   %r8, %rsi
     169  setge  %al
     170  movzbl %al, %eax
     171  orq    %rcx, %rax
     172  jne    .LBB2_1
     173# BB#3:                                 # %c1oP
     174  movq   (%rbp), %rax
     175  movl   $r1mu_closure+1, %ebx
     176  jmpq   *%rax  # TAILCALL
     177.LBB2_1:                                # %c1oe
     178  cmpq   $1, %rax
     179  jne    .LBB2_2
     180# BB#4:                                 # %c1oZ
     181  movq   (%rbp), %rax
     182  movl   $r1mv_closure+1, %ebx
     183  jmpq   *%rax  # TAILCALL
     184.LBB2_2:                                # %c1oF
     185  movq   r1mt_closure(%rip), %rax
     186  movl   $r1mt_closure, %ebx
     187  jmpq   *%rax  # TAILCALL
     188
     189}}}
     190
     191The assembly does not contain comparisons and branches in the scrutinee of the case expression, but still uses jumps to select an appropriate branch of the case expression.
     192
     193=== Benchmarks ===
     194
     195Below is a benchmark for the proof-of-concept branchless filter function that demonstrates performance gains possible with the new primops:
    206196
    207197{{{
     
    218208import Data.Vector.Unboxed               as U (Vector, filter, foldM',
    219209                                               fromList, length, unsafeFreeze)
    220 import GHC.Exts                          (Int (I#), (.>=#))
     210import GHC.Exts                          (Int (I#), (>=$#))
    221211import System.Random                     (RandomGen, mkStdGen, randoms)
    222212import Prelude                    hiding (filter, length)
     
    229219  let put i x = do
    230220        let !(I# v) = x
    231             inc     = I# (v .>=# 0#)
     221            inc     = I# (v >=$# 0#)
    232222        unsafeWrite fVec i x
    233223        return $ i + inc
     
    260250             cfgPerformGC = ljust True
    261251           }
    262 
    263252}}}
    264253
     
    270259}}}
    271260
    272 Benchmarking shows that `filterN` function is 60% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements).
     261Benchmarking shows that `filterN` function is about 55-65% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements). Below is an example benchmarking report from criterion:
     262
     263[[Image(http://ics.p.lodz.pl/~stolarek/ghc/prim-bool-criterion.png)]]