For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...

Background: I've been intrigued investigating whether GHC can produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and have been using the very tight inner composite culling loops (purely integer operations) of a basic Sieve of Eratosthenes implementation as a test vehicle.

Synopsis: This is a follow-on of the work leading to finding the efficiency problem described in ticket #12798, but involves pushing the speed even further as per the method described for "primesieve" as per http://primesieve.org/ in the "Highly optimized inner loop" section.

Shortest possible test code that clearly shows closures not being optimized, but optimized when unified by a "join point": Please refer directly to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#ticket:12808#comment:128621 and follow-on comments.

A version of test code that triggered this ticket: Essentially, this method involves extreme loop unrolling with very imperative code although coded functionally; in the case of the following code it means that, recognizing that for all odd primes (which they all are other than two), and that all word sizes are of an even number of bits, there will be a repeating pattern of composite number culls that repeats every word size number of bits. Thus for a word size of one eight-bit byte, we can unroll to eight composite culls in the body of one loop, with loops cases for the primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7) meaning there are four times eight is 32 loop cases. When there are no longer a full eight culls left, the culling reverts to conventional single-cull-per-loop as per the test program of ticket #12798.

To do this using GHC we need pointer arithmetic, and the only way to implement pointer arithmetic in GHC is to use the Addr# primitive. GHC/Haskell has one other slight overhead over Cee languages in that we need to move the culling array to a pinned array to avoid having it moved while the culling is going on and then move it back when finished but this takes a negligible amount of time (one percent or so) as compared to the culling. As usual for test programs, the culling operations are repeated in a loop for a number of times to give more accurate timing not influenced by execution not related to the culling. All of this is included in the following code (truncated as to loop coses for inclusion here):

-- EfficiencyBug.hs
-- showing that there is a register loop invariant bug in generation of assembler code...

-- LLVM shows the bug clearer since the code is generally a little faster...
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

numLOOPS = 10000 :: Int

-- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one L1 cache size) to prove it.
twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soep1 :: () -> [Word32]
soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
 bufb = runSTUArray $ do
  let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
  bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
  cullb bf
 cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
  case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
  let loop t mr# s0# = -- cull a number of times to test timing
        if t <= 0 then (# s0#, STUArray l u n mr# #) else
        case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
        case newPinnedByteArray# n# s1#         of { (# s2#, marr'# #) ->
        case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
        case unsafeFreezeByteArray# marr'# s3#  of { (# s4#, arr# #) -> -- must do this
        case byteArrayContents# arr#            of { adr# -> -- to obtain the addr# here
        let cullp i@(I# i#) sp# =
              let !p@(I# p#) = i + i + 3 in
              let !s@(I# s#) = (p * p - 3) `div` 2 in
              if s >= n then
                case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
                  so# -> (# so#, mr# #) else
                let !(UArray _ _ _ tarr#) = twos in
                case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) ->
                case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#))) `eqWord#` (int2Word# 0#) of
                  0# -> cullp (i + 1) sp0# -- not prime
                  _ -> -- is prime
                    -- most program execution time spent in the following tight loops.
                    -- the following code implments extream loop unrolling...
                    let !pi@(I# pi#) = fromIntegral p in
                    let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) = s .&. 7 in
                    let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
                    let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
                    let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
                    let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
                    let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
                    let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
                    let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
                    let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in
                    let !lmt1# = plusAddr# adr# lmt# in
                    let !strt# = plusAddr# adr# sw# in
                    let !(I# n#) = n in
                    let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + sb of {
                      0 ->
                        let cull c# sp# =
                              case c# `ltAddr#` lmt1# of
                                0# -> (# c#, sp# #)
                                _ ->
                                  case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                  case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# ->
                                  case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                  case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# ->
                                  case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                  case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# ->
                                  case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                  case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# ->
                                  case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                  case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# ->
                                  case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                  case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# ->
                                  case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                  case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# ->
                                  case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                  case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# ->
                                  cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in
                        cull strt# sp0#;
                      1 ->
                        let cull c# sp# =
                              case c# `ltAddr#` lmt1# of
                                0# -> (# c#, sp# #)
                                _ ->
                                  case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                  case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# ->
                                  case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                  case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# ->
                                  case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                  case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# ->
                                  case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                  case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# ->
                                  case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                  case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# ->
                                  case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                  case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# ->
                                  case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                  case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# ->
                                  case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                  case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# ->
                                  cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in
                        cull strt# sp0#;
                      -- and so on for 30 more cases...
                      _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} } in
                    let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in
                    -- extreme loop unrolling ends here; remaining primes culled conventionally...
                    let cull j# sc# = -- very tight inner loop
                          case j# <# n# of
                            0# -> cullp (i + 1) sc#
                            _ -> let i# = j# `andI#` 31# in
                                 let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 31)))
                                 let w# = j# `uncheckedIShiftRL#` 5# in
                                 case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) -> 
                                 case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# ->
                                 cull (j# +# pi#) sc1# }} in
                    cull ns# sp0# } in
        case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}} in loop numLOOPS marr# s1# }

main = print $ length $ soep1()

The problem: The problem is in the innermost loop of the cases, for which case "0" the following assembly code (using -fllvm) is produced:

seGU_info$def:
# BB#0:                                 # %cgRL
	cmpq	%r14, 70(%rbx)
	jbe	.LBB35_1
	.align	16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
	movq	14(%rbx), %rcx
	movq	22(%rbx), %rdx
	movq	30(%rbx), %rsi
	movq	38(%rbx), %rdi
	movq	46(%rbx), %r8
	movq	54(%rbx), %r9
	movq	62(%rbx), %r10
	movq	6(%rbx), %rax
	addq	%r14, %rax
	orb	$1, (%r14)
	orb	$2, (%rcx,%r14)
	orb	$4, (%rdx,%r14)
	orb	$8, (%rsi,%r14)
	orb	$16, (%rdi,%r14)
	orb	$32, (%r8,%r14)
	orb	$64, (%r9,%r14)
	orb	$-128, (%r10,%r14)
	cmpq	70(%rbx), %rax
	movq	%rax, %r14
	jb	.LBB35_2
	jmp	.LBB35_3
.LBB35_1:
	movq	%r14, %rax
.LBB35_3:                               # %cgRK
	movq	(%rbp), %rcx
	movq	%rax, %rbx
	rex64 jmpq	*%rcx           # TAILCALL

One can readily see that the compiler is not lifting the Loop Invariant Code Flow as in initializing the registers to outside the inner loop, meaning there are many register loads from memory which are not necessary.

Desired results: The desired assembly code is something like the following, which is similar to what is produced by Cee (C/C++/Rust/etc.):

seGU_info$def:
# BB#0:                                 # %cgRL
	movq	14(%rbx), %rcx
	movq	22(%rbx), %rdx
	movq	30(%rbx), %rsi
	movq	38(%rbx), %rdi
	movq	46(%rbx), %r8
	movq	54(%rbx), %r9
	movq	62(%rbx), %r10
	movq	6(%rbx), %rax
	movq	70(%rbx), %rbx
	cmpq	%r14, %rbx              # rbx clobbered here, but old value
	jbe	.LBB35_1                # never used again until replaced after loop
	.align	16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
	orb	$1, (%r14)
	orb	$2, (%rcx,%r14)
	orb	$4, (%rdx,%r14)
	orb	$8, (%rsi,%r14)
	orb	$16, (%rdi,%r14)
	orb	$32, (%r8,%r14)
	orb	$64, (%r9,%r14)
	orb	$-128, (%r10,%r14)
	addq	%rax, %r14
	cmpq	%rbx, %r14
	jb	.LBB35_2
	jmp	.LBB35_3
.LBB35_1:
	movq	%r14, %rax
.LBB35_3:                               # %cgRK
	movq	(%rbp), %rcx
	movq	%rax, %rbx              # rbx clobbered here anyway
	rex64 jmpq	*%rcx           # TAILCALL

Full testing: The actual unrolled loop code including all the case loops is too long to post here, but to verify the result is correct (23000) and the performance, the full actual file is attached here. Due to the magic of modern CPU instruction fusion and Out Of Order (OOE) execution, the code is not as slow as it would indicate by the number of increased instructions, but while it is about twice as fast as when culled conventionally (Intel Skylake), it is about half again as slow as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this takes about one second, about two seconds culled conventionally, but only about 0.6 seconds for Rust/LLVM (with the assembly code output essentially identical to the "desired" code).

Other back ends and targets: Although the code generated by the native NCG has other problems (not moving the loop test to the end of the loop to avoid one jump, and not combining the read and modify and store instructions into the single available instruction), it exhibits the same problem as to not lifting the Loop Invariant Code Flow register initialization.

Although this code is x86_64, the problem also applies to x86 code even though the x86 architecture doesn't have enough registers to do this in one loop and needs to be split into two loops culling only four composites per loop, but there still is a significant gain in speed. Although not tested, it probably also applies to other targets such as ARM (which has many general purpose registers).

Conclusion: The use of Addr# primitives is probably not a frequent use case, but as shown here that when one needs their use, they should be efficient.

I considered that GHC may intentionally limit the performance of these unsafe primitives to limit their use unless absolutely necessary as in marshalling, something as C# does for the use of unsafe pointers, but surely GHC would not do that as the target programmers are different.

If this and ticket #12798 were fixed, for this use case the GHC code would be within a percent or two of the performance of Cee.

Edited Mar 10, 2019 by GordonBGood

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information