Opened 2 years ago

Last modified 8 weeks ago

#12808 new bug

For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...

Reported by: GordonBGood Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler Version: 8.0.1
Keywords: JoinPoints Cc: maurerl@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by GordonBGood)

Background: I've been intrigued investigating whether GHC can produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and have been using the very tight inner composite culling loops (purely integer operations) of a basic Sieve of Eratosthenes implementation as a test vehicle.

Synopsis: This is a follow-on of the work leading to finding the efficiency problem described in ticket #12798, but involves pushing the speed even further as per the method described for "primesieve" as per http://primesieve.org/ in the "Highly optimized inner loop" section.

Shortest possible test code that clearly shows closures not being optimized, but optimized when unified by a "join point": Please refer directly to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12 and follow-on comments.

A version of test code that triggered this ticket: Essentially, this method involves extreme loop unrolling with very imperative code although coded functionally; in the case of the following code it means that, recognizing that for all odd primes (which they all are other than two), and that all word sizes are of an even number of bits, there will be a repeating pattern of composite number culls that repeats every word size number of bits. Thus for a word size of one eight-bit byte, we can unroll to eight composite culls in the body of one loop, with loops cases for the primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7) meaning there are four times eight is 32 loop cases. When there are no longer a full eight culls left, the culling reverts to conventional single-cull-per-loop as per the test program of ticket #12798.

To do this using GHC we need pointer arithmetic, and the only way to implement pointer arithmetic in GHC is to use the Addr# primitive. GHC/Haskell has one other slight overhead over Cee languages in that we need to move the culling array to a pinned array to avoid having it moved while the culling is going on and then move it back when finished but this takes a negligible amount of time (one percent or so) as compared to the culling. As usual for test programs, the culling operations are repeated in a loop for a number of times to give more accurate timing not influenced by execution not related to the culling. All of this is included in the following code (truncated as to loop coses for inclusion here):

-- EfficiencyBug.hs
-- showing that there is a register loop invariant bug in generation of assembler code...

-- LLVM shows the bug clearer since the code is generally a little faster...
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

numLOOPS = 10000 :: Int

-- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one L1 cache size) to prove it.
twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soep1 :: () -> [Word32]
soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
 bufb = runSTUArray $ do
  let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
  bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
  cullb bf
 cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
  case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
  let loop t mr# s0# = -- cull a number of times to test timing
        if t <= 0 then (# s0#, STUArray l u n mr# #) else
        case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
        case newPinnedByteArray# n# s1#         of { (# s2#, marr'# #) ->
        case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
        case unsafeFreezeByteArray# marr'# s3#  of { (# s4#, arr# #) -> -- must do this
        case byteArrayContents# arr#            of { adr# -> -- to obtain the addr# here
        let cullp i@(I# i#) sp# =
              let !p@(I# p#) = i + i + 3 in
              let !s@(I# s#) = (p * p - 3) `div` 2 in
              if s >= n then
                case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
                  so# -> (# so#, mr# #) else
                let !(UArray _ _ _ tarr#) = twos in
                case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#) sp# of { (# sp0#, v0# #) ->
                case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i# `andI#` 63#))) `eqWord#` (int2Word# 0#) of
                  0# -> cullp (i + 1) sp0# -- not prime
                  _ -> -- is prime
                    -- most program execution time spent in the following tight loops.
                    -- the following code implments extream loop unrolling...
                    let !pi@(I# pi#) = fromIntegral p in
                    let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) = s .&. 7 in
                    let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
                    let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
                    let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
                    let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
                    let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
                    let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
                    let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
                    let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7 in
                    let !lmt1# = plusAddr# adr# lmt# in
                    let !strt# = plusAddr# adr# sw# in
                    let !(I# n#) = n in
                    let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&. 3) `shiftL` 3) + sb of {
                      0 ->
                        let cull c# sp# =
                              case c# `ltAddr#` lmt1# of
                                0# -> (# c#, sp# #)
                                _ ->
                                  case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                  case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 1#)) sp0# of { sp1# ->
                                  case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                  case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 2#)) sp2# of { sp3# ->
                                  case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                  case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 4#)) sp4# of { sp5# ->
                                  case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                  case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 8#)) sp6# of { sp7# ->
                                  case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                  case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 16#)) sp8# of { sp9# ->
                                  case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                  case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 32#)) sp10# of { sp11# ->
                                  case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                  case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 64#)) sp12# of { sp13# ->
                                  case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                  case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 128#)) sp14# of { sp15# ->
                                  cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in
                        cull strt# sp0#;
                      1 ->
                        let cull c# sp# =
                              case c# `ltAddr#` lmt1# of
                                0# -> (# c#, sp# #)
                                _ ->
                                  case readWord8OffAddr# c# 0# sp# of { (# sp0#, v0# #) ->
                                  case writeWord8OffAddr# c# 0# (v0# `or#` (int2Word# 2#)) sp0# of { sp1# ->
                                  case readWord8OffAddr# c# r1# sp1# of { (# sp2#, v1# #) ->
                                  case writeWord8OffAddr# c# r1# (v1# `or#` (int2Word# 4#)) sp2# of { sp3# ->
                                  case readWord8OffAddr# c# r2# sp3# of { (# sp4#, v2# #) ->
                                  case writeWord8OffAddr# c# r2# (v2# `or#` (int2Word# 8#)) sp4# of { sp5# ->
                                  case readWord8OffAddr# c# r3# sp5# of { (# sp6#, v3# #) ->
                                  case writeWord8OffAddr# c# r3# (v3# `or#` (int2Word# 16#)) sp6# of { sp7# ->
                                  case readWord8OffAddr# c# r4# sp7# of { (# sp8#, v4# #) ->
                                  case writeWord8OffAddr# c# r4# (v4# `or#` (int2Word# 32#)) sp8# of { sp9# ->
                                  case readWord8OffAddr# c# r5# sp9# of { (# sp10#, v5# #) ->
                                  case writeWord8OffAddr# c# r5# (v5# `or#` (int2Word# 64#)) sp10# of { sp11# ->
                                  case readWord8OffAddr# c# r6# sp11# of { (# sp12#, v6# #) ->
                                  case writeWord8OffAddr# c# r6# (v6# `or#` (int2Word# 128#)) sp12# of { sp13# ->
                                  case readWord8OffAddr# c# r7# sp13# of { (# sp14#, v7# #) ->
                                  case writeWord8OffAddr# c# r7# (v7# `or#` (int2Word# 1#)) sp14# of { sp15# ->
                                  cull (plusAddr# c# pi#) sp15# }}}}}}}}}}}}}}}} in
                        cull strt# sp0#;
                      -- and so on for 30 more cases...
                      _ -> (# strt#, sp0# #) {- normally never taken case, all cases covered -} } in
                    let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#` 3#) +# sb# in
                    -- extreme loop unrolling ends here; remaining primes culled conventionally...
                    let cull j# sc# = -- very tight inner loop
                          case j# <# n# of
                            0# -> cullp (i + 1) sc#
                            _ -> let i# = j# `andI#` 31# in
                                 let !sh# = indexWord32Array# tarr# i# in -- (1 `shiftL` (j .&. 31)))
                                 let w# = j# `uncheckedIShiftRL#` 5# in
                                 case readWord32Array# marr'# w# sc# of { (# sc0#, ov# #) -> 
                                 case writeWord32Array# marr'# w# (ov# `or#` sh#) sc0# of { sc1# ->
                                 cull (j# +# pi#) sc1# }} in
                    cull ns# sp0# } in
        case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}} in loop numLOOPS marr# s1# }

main = print $ length $ soep1()

The problem: The problem is in the innermost loop of the cases, for which case "0" the following assembly code (using -fllvm) is produced:

seGU_info$def:
# BB#0:                                 # %cgRL
	cmpq	%r14, 70(%rbx)
	jbe	.LBB35_1
	.align	16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
	movq	14(%rbx), %rcx
	movq	22(%rbx), %rdx
	movq	30(%rbx), %rsi
	movq	38(%rbx), %rdi
	movq	46(%rbx), %r8
	movq	54(%rbx), %r9
	movq	62(%rbx), %r10
	movq	6(%rbx), %rax
	addq	%r14, %rax
	orb	$1, (%r14)
	orb	$2, (%rcx,%r14)
	orb	$4, (%rdx,%r14)
	orb	$8, (%rsi,%r14)
	orb	$16, (%rdi,%r14)
	orb	$32, (%r8,%r14)
	orb	$64, (%r9,%r14)
	orb	$-128, (%r10,%r14)
	cmpq	70(%rbx), %rax
	movq	%rax, %r14
	jb	.LBB35_2
	jmp	.LBB35_3
.LBB35_1:
	movq	%r14, %rax
.LBB35_3:                               # %cgRK
	movq	(%rbp), %rcx
	movq	%rax, %rbx
	rex64 jmpq	*%rcx           # TAILCALL

One can readily see that the compiler is not lifting the Loop Invariant Code Flow as in initializing the registers to outside the inner loop, meaning there are many register loads from memory which are not necessary.

Desired results: The desired assembly code is something like the following, which is similar to what is produced by Cee (C/C++/Rust/etc.):

seGU_info$def:
# BB#0:                                 # %cgRL
	movq	14(%rbx), %rcx
	movq	22(%rbx), %rdx
	movq	30(%rbx), %rsi
	movq	38(%rbx), %rdi
	movq	46(%rbx), %r8
	movq	54(%rbx), %r9
	movq	62(%rbx), %r10
	movq	6(%rbx), %rax
	movq	70(%rbx), %rbx
	cmpq	%r14, %rbx              # rbx clobbered here, but old value
	jbe	.LBB35_1                # never used again until replaced after loop
	.align	16, 0x90
.LBB35_2:                               # %cgRJ
                                        # =>This Inner Loop Header: Depth=1
	orb	$1, (%r14)
	orb	$2, (%rcx,%r14)
	orb	$4, (%rdx,%r14)
	orb	$8, (%rsi,%r14)
	orb	$16, (%rdi,%r14)
	orb	$32, (%r8,%r14)
	orb	$64, (%r9,%r14)
	orb	$-128, (%r10,%r14)
	addq	%rax, %r14
	cmpq	%rbx, %r14
	jb	.LBB35_2
	jmp	.LBB35_3
.LBB35_1:
	movq	%r14, %rax
.LBB35_3:                               # %cgRK
	movq	(%rbp), %rcx
	movq	%rax, %rbx              # rbx clobbered here anyway
	rex64 jmpq	*%rcx           # TAILCALL

Full testing: The actual unrolled loop code including all the case loops is too long to post here, but to verify the result is correct (23000) and the performance, the full actual file is attached here. Due to the magic of modern CPU instruction fusion and Out Of Order (OOE) execution, the code is not as slow as it would indicate by the number of increased instructions, but while it is about twice as fast as when culled conventionally (Intel Skylake), it is about half again as slow as Cee. On an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this takes about one second, about two seconds culled conventionally, but only about 0.6 seconds for Rust/LLVM (with the assembly code output essentially identical to the "desired" code).

Other back ends and targets: Although the code generated by the native NCG has other problems (not moving the loop test to the end of the loop to avoid one jump, and not combining the read and modify and store instructions into the single available instruction), it exhibits the same problem as to not lifting the Loop Invariant Code Flow register initialization.

Although this code is x86_64, the problem also applies to x86 code even though the x86 architecture doesn't have enough registers to do this in one loop and needs to be split into two loops culling only four composites per loop, but there still is a significant gain in speed. Although not tested, it probably also applies to other targets such as ARM (which has many general purpose registers).

Conclusion: The use of Addr# primitives is probably not a frequent use case, but as shown here that when one needs their use, they should be efficient.

I considered that GHC may intentionally limit the performance of these unsafe primitives to limit their use unless absolutely necessary as in marshalling, something as C# does for the use of unsafe pointers, but surely GHC would not do that as the target programmers are different.

If this and ticket #12798 were fixed, for this use case the GHC code would be within a percent or two of the performance of Cee.

Attachments (1)

GHC_EfficiencyBug.hs (71.3 KB) - added by GordonBGood 2 years ago.
Test program

Download all attachments as: .zip

Change History (25)

Changed 2 years ago by GordonBGood

Attachment: GHC_EfficiencyBug.hs added

Test program

comment:1 Changed 2 years ago by GordonBGood

Description: modified (diff)

comment:2 Changed 2 years ago by GordonBGood

Description: modified (diff)

comment:3 Changed 2 years ago by carter

So roughly: We don't have loop invariant hoisting and we should?

comment:4 Changed 2 years ago by carter

Or it's not firing as much as we'd like? (In c minus minus)

@gordon: have you looked at the c-- code ghc has after vs before optimization?

Last edited 2 years ago by carter (previous) (diff)

comment:5 Changed 2 years ago by GordonBGood

@carter, yes, we don't have loop invariant hoisting (at least for Addr#) and we should, or it's not firing as we'd like.

The c-- (cmm) code starts like this:

 cull_seCS_entry() //  [R2, R1]
         { info_tbl: [(cgKv,
                       label: cull_seCS_info
                       rep:HeapRep 9 nonptrs { Fun {arity: 2 fun_type: ArgSpec 4} })]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       cgKv:
           _seCT::I64 = R2;
           _seCS::P64 = R1;
           goto cgKo;
       cgKo:
           if ((old + 0) - <highSp> < SpLim) goto cgKw; else goto cgKx;
       cgKw:
           R2 = _seCT::I64;
           R1 = _seCS::P64;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       cgKx:
           goto cgKn;
       cgKn:
           _seBQ::I64 = I64[_seCS::P64 + 6];
           _seCD::I64 = I64[_seCS::P64 + 14];
           _seCF::I64 = I64[_seCS::P64 + 22];
           _seCH::I64 = I64[_seCS::P64 + 30];
           _seCJ::I64 = I64[_seCS::P64 + 38];
           _seCL::I64 = I64[_seCS::P64 + 46];
           _seCN::I64 = I64[_seCS::P64 + 54];
           _seCP::I64 = I64[_seCS::P64 + 62];
           _seCQ::I64 = I64[_seCS::P64 + 70];
           _cgKq::I64 = _seCT::I64 < _seCQ::I64;
           _seCV::I64 = _cgKq::I64;
           switch [-9223372036854775808 .. 9223372036854775807] _seCV::I64 {
               case 0 : goto cgKu;
               default: goto cgKt;
           }
       cgKu:
           goto cgKF;
       cgKF:
           R1 = _seCT::I64;
           call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
       cgKt:
           goto cgKA;
       cgKA:
           _seCY::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64]);
           _seCY::I64 = _seCY::I64;
           _cgKI::I64 = _seCY::I64 | 1;
           _seCZ::I64 = _cgKI::I64;
           I8[_seCT::I64] = %MO_UU_Conv_W64_W8(_seCZ::I64);
           _seD3::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCD::I64 << 0)]);
           _seD3::I64 = _seD3::I64;
           _cgKN::I64 = _seD3::I64 | 2;
           _seD4::I64 = _cgKN::I64;
           I8[_seCT::I64 + (_seCD::I64 << 0)] = %MO_UU_Conv_W64_W8(_seD4::I64);
           _seD8::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCF::I64 << 0)]);
           _seD8::I64 = _seD8::I64;
           _cgKS::I64 = _seD8::I64 | 4;
           _seD9::I64 = _cgKS::I64;
           I8[_seCT::I64 + (_seCF::I64 << 0)] = %MO_UU_Conv_W64_W8(_seD9::I64);
           _seDd::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCH::I64 << 0)]);
           _seDd::I64 = _seDd::I64;
           _cgKX::I64 = _seDd::I64 | 8;
           _seDe::I64 = _cgKX::I64;
           I8[_seCT::I64 + (_seCH::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDe::I64);
           _seDi::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCJ::I64 << 0)]);
           _seDi::I64 = _seDi::I64;
           _cgL2::I64 = _seDi::I64 | 16;
           _seDj::I64 = _cgL2::I64;
           I8[_seCT::I64 + (_seCJ::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDj::I64);
           _seDn::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCL::I64 << 0)]);
           _seDn::I64 = _seDn::I64;
           _cgL7::I64 = _seDn::I64 | 32;
           _seDo::I64 = _cgL7::I64;
           I8[_seCT::I64 + (_seCL::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDo::I64);
           _seDs::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCN::I64 << 0)]);
           _seDs::I64 = _seDs::I64;
           _cgLc::I64 = _seDs::I64 | 64;
           _seDt::I64 = _cgLc::I64;
           I8[_seCT::I64 + (_seCN::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDt::I64);
           _seDx::I64 = %MO_UU_Conv_W8_W64(I8[_seCT::I64 + (_seCP::I64 << 0)]);
           _seDx::I64 = _seDx::I64;
           _cgLh::I64 = _seDx::I64 | 128;
           _seDy::I64 = _cgLh::I64;
           I8[_seCT::I64 + (_seCP::I64 << 0)] = %MO_UU_Conv_W64_W8(_seDy::I64);
           _cgLm::I64 = _seCT::I64 + _seBQ::I64;
           _seDA::I64 = _cgLm::I64;
           _seCT::I64 = _seDA::I64;
           goto cgKn;
     }
 },

with the register initialization already inside the loop, and further optimizations just serving to clean up the code with the final optimization as follows:

cull_seCS_entry() //  [R1, R2]
        { [(cgKv,
            cull_seCS_info:
                const 8589934596;
                const 38654705664;
                const 9;)]
        }
    {offset
      cgKv:
          _seCT::I64 = R2;
          _seCS::P64 = R1;
          goto cgKn;
      cgKn:
          switch [-9223372036854775808 .. 9223372036854775807] (_seCT::I64 < I64[_seCS::P64 + 70]) {
              case 0 : goto cgKu;
              default: goto cgKt;
          }
      cgKu:
          R1 = _seCT::I64;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      cgKt:
          _seBQ::I64 = I64[_seCS::P64 + 6];
          _seCD::I64 = I64[_seCS::P64 + 14];
          _seCF::I64 = I64[_seCS::P64 + 22];
          _seCH::I64 = I64[_seCS::P64 + 30];
          _seCJ::I64 = I64[_seCS::P64 + 38];
          _seCL::I64 = I64[_seCS::P64 + 46];
          _seCN::I64 = I64[_seCS::P64 + 54];
          _seCP::I64 = I64[_seCS::P64 + 62];
          I8[_seCT::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64]) | 1);
          I8[_seCT::I64 + _seCD::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCD::I64]) | 2);
          I8[_seCT::I64 + _seCF::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCF::I64]) | 4);
          I8[_seCT::I64 + _seCH::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCH::I64]) | 8);
          I8[_seCT::I64 + _seCJ::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCJ::I64]) | 16);
          I8[_seCT::I64 + _seCL::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCL::I64]) | 32);
          I8[_seCT::I64 + _seCN::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCN::I64]) | 64);
          I8[_seCT::I64 + _seCP::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCP::I64]) | 128);
          _seCT::I64 = _seCT::I64 + _seBQ::I64;
          goto cgKn;
    }
}

The movement of the register initialization to inside the loops in the first conversion to CMM code (as when it is recognized that these are pointer/addr# operations?) and never gets fixed...

At some stage, the CMM optimizer (or the code that generates the CMM code in the first place) should move the register initializations outside the body of the loop, such as is done in the following desired CMM code by the end of the CMM optimization phase:

cull_seCS_entry() //  [R1, R2]
        { [(cgKv,
            cull_seCS_info:
                const 8589934596;
                const 38654705664;
                const 9;)]
        }
    {offset
      cgKv:
          _seCT::I64 = R2;
          _seCS::P64 = R1;
          _seBQ::I64 = I64[_seCS::P64 + 6]; /* all register initializations moved outside the loop */
          _seCD::I64 = I64[_seCS::P64 + 14];
          _seCF::I64 = I64[_seCS::P64 + 22];
          _seCH::I64 = I64[_seCS::P64 + 30];
          _seCJ::I64 = I64[_seCS::P64 + 38];
          _seCL::I64 = I64[_seCS::P64 + 46];
          _seCN::I64 = I64[_seCS::P64 + 54];
          _seCP::I64 = I64[_seCS::P64 + 62]; /* moved register initializations ends here */
          _seCQ::I64 = I64[_seCS::P64 + 70]; /* register of limit value kept from before optimization passes */
          goto cgKn;
      cgKn:
          switch [-9223372036854775808 .. 9223372036854775807] (_seCT::I64 < _seCQ) { /* register used rather than a memory access */
              case 0 : goto cgKu;
              default: goto cgKt;
          }
      cgKu:
          R1 = _seCT::I64;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      cgKt:
          I8[_seCT::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64]) | 1);
          I8[_seCT::I64 + _seCD::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCD::I64]) | 2);
          I8[_seCT::I64 + _seCF::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCF::I64]) | 4);
          I8[_seCT::I64 + _seCH::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCH::I64]) | 8);
          I8[_seCT::I64 + _seCJ::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCJ::I64]) | 16);
          I8[_seCT::I64 + _seCL::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCL::I64]) | 32);
          I8[_seCT::I64 + _seCN::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCN::I64]) | 64);
          I8[_seCT::I64 + _seCP::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_seCT::I64 + _seCP::I64]) | 128);
          _seCT::I64 = _seCT::I64 + _seBQ::I64;
          goto cgKn;
    }
}

As can been seen, the above code has all efficiency problems fixed in that the registers are initialized outside the loop and as well the register is kept that contains the loop limit value so that no memory load needs to occur inside the loop for loop limit checking; thus, other than the cull or'ing operations, which do not have dependencies so there is no latency due to them, there are almost no latencies in the loop (other than for CPU's such as the AMD Bulldozer series, which have a severe cache bottleneck problem).

Last edited 2 years ago by GordonBGood (previous) (diff)

comment:6 Changed 2 years ago by simonpj

As I understand it, these movements could not sensibly be hoisted at the Core level, or could they? I'm failing to see how the code at the top lines up with the Cmm you are showing. Maybe show STG code too, and say how they match up? If we can do the floating in Core, that would be better!

If the opportunity only gets exposed when we are in Cmm, I wonder if it's worth our doing this in Cmm, or whether it's best left to LLVM?

comment:7 in reply to:  6 Changed 2 years ago by GordonBGood

Replying to simonpj:

I'm failing to see how the code at the top lines up with the Cmm you are showing.

@simonpj, the cmm code shown is the first of the "cull" case loops from the Haskell GHC code. Both the top "non-optimized" version and the bottom "optimized" version have had the register initialization dropped down into inside the loop.

Maybe show STG code too, and say how they match up? If we can do the floating in Core, that would be better!

Here is the STG code for the same cull loop/recursive function, massively back tabbed for display purposes here (output of -ddump-stg from GHC version 8.0.1 on 64-bit Windows, lines 898 through 1132):

let {
  cull_seCS [Occ=LoopBreaker]
    :: GHC.Prim.Addr#
       -> GHC.Prim.State#
            GHC.Prim.RealWorld
       -> (# GHC.Prim.Addr#,
             GHC.Prim.State#
               GHC.Prim.RealWorld #)
  [LclId,
   Arity=2,
   Str=DmdType <S,U><S,U>,
   Unf=OtherCon []] =
      sat-only \r srt:SRT:[] [c#_seCT
                              sp#_seCU]
          case
              ltAddr# [c#_seCT
                       lmt1#_seCQ]
          of
          _ [Occ=Dead]
          { __DEFAULT ->
                case
                    readWord8OffAddr# [c#_seCT
                                       0#
                                       sp#_seCU]
                of
                _ [Occ=Dead]
                { (#,#) ipv8_seCX [Occ=Once]
                        ipv9_seCY [Occ=Once] ->
                      case
                          or# [ipv9_seCY
                               1##]
                      of
                      sat_seCZ
                      { __DEFAULT ->
                            case
                                writeWord8OffAddr# [c#_seCT
                                                    0#
                                                    sat_seCZ
                                                    ipv8_seCX]
                            of
                            sp1#_seD0 [OS=OneShot]
                            { __DEFAULT ->
                                  case
                                      readWord8OffAddr# [c#_seCT
                                                         r1#_seCD
                                                         sp1#_seD0]
                                  of
                                  _ [Occ=Dead]
                                  { (#,#) ipv10_seD2 [Occ=Once]
                                          ipv11_seD3 [Occ=Once] ->
                                        case
                                            or# [ipv11_seD3
                                                 2##]
                                        of
                                        sat_seD4
                                        { __DEFAULT ->
                                              case
                                                  writeWord8OffAddr# [c#_seCT
                                                                      r1#_seCD
                                                                      sat_seD4
                                                                      ipv10_seD2]
                                              of
                                              sp3#_seD5 [OS=OneShot]
                                              { __DEFAULT ->
                                                    case
                                                        readWord8OffAddr# [c#_seCT
                                                                           r2#_seCF
                                                                           sp3#_seD5]
                                                    of
                                                    _ [Occ=Dead]
                                                    { (#,#) ipv12_seD7 [Occ=Once]
                                                            ipv13_seD8 [Occ=Once] ->
                                                          case
                                                              or# [ipv13_seD8
                                                                   4##]
                                                          of
                                                          sat_seD9
                                                          { __DEFAULT ->
                                                                case
                                                                    writeWord8OffAddr# [c#_seCT
                                                                                        r2#_seCF
                                                                                        sat_seD9
                                                                                        ipv12_seD7]
                                                                of
                                                                sp5#_seDa [OS=OneShot]
                                                                { __DEFAULT ->
                                                                      case
                                                                          readWord8OffAddr# [c#_seCT
                                                                                             r3#_seCH
                                                                                             sp5#_seDa]
                                                                      of
                                                                      _ [Occ=Dead]
                                                                      { (#,#) ipv14_seDc [Occ=Once]
                                                                              ipv15_seDd [Occ=Once] ->
                                                                            case
                                                                                or# [ipv15_seDd
                                                                                     8##]
                                                                            of
                                                                            sat_seDe
                                                                            { __DEFAULT ->
                                                                                  case
                                                                                      writeWord8OffAddr# [c#_seCT
                                                                                                          r3#_seCH
                                                                                                          sat_seDe
                                                                                                          ipv14_seDc]
                                                                                  of
                                                                                  sp7#_seDf [OS=OneShot]
                                                                                  { __DEFAULT ->
                                                                                        case
                                                                                            readWord8OffAddr# [c#_seCT
                                                                                                               r4#_seCJ
                                                                                                               sp7#_seDf]
                                                                                        of
                                                                                        _ [Occ=Dead]
                                                                                        { (#,#) ipv16_seDh [Occ=Once]
                                                                                                ipv17_seDi [Occ=Once] ->
                                                                                              case
                                                                                                  or# [ipv17_seDi
                                                                                                       16##]
                                                                                              of
                                                                                              sat_seDj
                                                                                              { __DEFAULT ->
                                                                                                    case
                                                                                                        writeWord8OffAddr# [c#_seCT
                                                                                                                            r4#_seCJ
                                                                                                                            sat_seDj
                                                                                                                            ipv16_seDh]
                                                                                                    of
                                                                                                    sp9#_seDk [OS=OneShot]
                                                                                                    { __DEFAULT ->
                                                                                                          case
                                                                                                              readWord8OffAddr# [c#_seCT
                                                                                                                                 r5#_seCL
                                                                                                                                 sp9#_seDk]
                                                                                                          of
                                                                                                          _ [Occ=Dead]
                                                                                                          { (#,#) ipv18_seDm [Occ=Once]
                                                                                                                  ipv19_seDn [Occ=Once] ->
                                                                                                                case
                                                                                                                    or# [ipv19_seDn
                                                                                                                         32##]
                                                                                                                of
                                                                                                                sat_seDo
                                                                                                                { __DEFAULT ->
                                                                                                                      case
                                                                                                                          writeWord8OffAddr# [c#_seCT
                                                                                                                                              r5#_seCL
                                                                                                                                              sat_seDo
                                                                                                                                              ipv18_seDm]
                                                                                                                      of
                                                                                                                      sp11#_seDp [OS=OneShot]
                                                                                                                      { __DEFAULT ->
                                                                                                                            case
                                                                                                                                readWord8OffAddr# [c#_seCT
                                                                                                                                                   r6#_seCN
                                                                                                                                                   sp11#_seDp]
                                                                                                                            of
                                                                                                                            _ [Occ=Dead]
                                                                                                                            { (#,#) ipv20_seDr [Occ=Once]
                                                                                                                                    ipv21_seDs [Occ=Once] ->
                                                                                                                                  case
                                                                                                                                      or# [ipv21_seDs
                                                                                                                                           64##]
                                                                                                                                  of
                                                                                                                                  sat_seDt
                                                                                                                                  { __DEFAULT ->
                                                                                                                                        case
                                                                                                                                            writeWord8OffAddr# [c#_seCT
                                                                                                                                                                r6#_seCN
                                                                                                                                                                sat_seDt
                                                                                                                                                                ipv20_seDr]
                                                                                                                                        of
                                                                                                                                        sp13#_seDu [OS=OneShot]
                                                                                                                                        { __DEFAULT ->
                                                                                                                                              case
                                                                                                                                                  readWord8OffAddr# [c#_seCT
                                                                                                                                                                     r7#_seCP
                                                                                                                                                                     sp13#_seDu]
                                                                                                                                              of
                                                                                                                                              _ [Occ=Dead]
                                                                                                                                              { (#,#) ipv22_seDw [Occ=Once]
                                                                                                                                                      ipv23_seDx [Occ=Once] ->
                                                                                                                                                    case
                                                                                                                                                        or# [ipv23_seDx
                                                                                                                                                             128##]
                                                                                                                                                    of
                                                                                                                                                    sat_seDy
                                                                                                                                                    { __DEFAULT ->
                                                                                                                                                          case
                                                                                                                                                              writeWord8OffAddr# [c#_seCT
                                                                                                                                                                                  r7#_seCP
                                                                                                                                                                                  sat_seDy
                                                                                                                                                                                  ipv22_seDw]
                                                                                                                                                          of
                                                                                                                                                          sp15#_seDz [OS=OneShot]
                                                                                                                                                          { __DEFAULT ->
                                                                                                                                                                case
                                                                                                                                                                    plusAddr# [c#_seCT
                                                                                                                                                                               p#_seBQ]
                                                                                                                                                                of
                                                                                                                                                                sat_seDA
                                                                                                                                                                { __DEFAULT ->
                                                                                                                                                                      cull_seCS
                                                                                                                                                                          sat_seDA
                                                                                                                                                                          sp15#_seDz;
                                                                                                                                                                };
                                                                                                                                                          };
                                                                                                                                                    };
                                                                                                                                              };
                                                                                                                                        };
                                                                                                                                  };
                                                                                                                            };
                                                                                                                      };
                                                                                                                };
                                                                                                          };
                                                                                                    };
                                                                                              };
                                                                                        };
                                                                                  };
                                                                            };
                                                                      };
                                                                };
                                                          };
                                                    };
                                              };
                                        };
                                  };
                            };
                      };
                };
            0# ->
                (#,#) [c#_seCT
                       sp#_seCU];
          };
} in 

You can see that the STG code just reflects the original Haskell source code and that the faulty register initialization has not yet been dropped down to within the loop(s), so the problem is not here. The problem is in the generation of the first CMM code and is not fixed by further CMM optimization passes, thus it also applies to NCG (although of course NCG also has other problems).

If the opportunity only gets exposed when we are in Cmm, I wonder if it's worth our doing this in Cmm, or whether it's best left to LLVM?

The easiest way to fix this might be to turn on the appropriate LLVM loop invariant code flow optimizations (if they would work) and have the fix only apply to LLVM, which would be acceptable to most users given the general inefficiency of the NCG backend.

Last edited 2 years ago by GordonBGood (previous) (diff)

comment:8 Changed 2 years ago by GordonBGood

Description: modified (diff)

comment:9 Changed 2 years ago by simonpj

You can see that the STG code just reflects the original Haskell source code and that the faulty register initialization has not yet been dropped down to within the loop(s), so the problem is not here. The problem is in the generation of the first CMM

Aha! Could you possibly make the tiniest possible example that illustrates precisely this point. You can motivate its importance by this thread, but in thinking about how to fix it, it's MUCH easier to grok a small example.

comment:10 in reply to:  9 Changed 2 years ago by GordonBGood

Replying to simonpj:

You can see that the STG code just reflects the original Haskell source code and that the faulty register initialization has not yet been dropped down to within the loop(s), so the problem is not here. The problem is in the generation of the first CMM

Aha! Could you possibly make the tiniest possible example that illustrates precisely this point. You can motivate its importance by this thread, but in thinking about how to fix it, it's MUCH easier to grok a small example.

I can't cut the test program down to just a few lines as I believe that the problem is related to pointers and pointer arithmetic (the Addr# primitive) and thus there is some setup involved in their use in a loop that shows the problems.

However, I have boiled the test down to a very simple tail-recursive loop with only one cull operation per loop using an Addr# and an offset that shows the problems; this loop is inside another loop to feed a variable prime "p" to the inner loop so it doesn't get optimized away as constants, and this is inside the setup code to produced the pinned byte array on which the loop works in the following code:

-- SimpleEfficiencyBug

{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm -ddump-opt-cmm -ddump-to-file -dumpdir . #-} -- or -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts

cull :: () -> [Int]
cull() = [i | (i, True) <- assocs arr ] where
  arr = runSTUArray $ do
    let bfBts = 1 `shiftL` 17 -- 16 Kilobytes worth of bits
    bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
    cullb bf
  cullb (STUArray l u n marr#) = ST $ \s0# -> -- following is just setup for the loop...
    case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
    case newPinnedByteArray# n# s1#         of { (# s2#, marr'# #) ->
    case copyMutableByteArray# marr# 0# marr'# 0# n# s2# of { s3# ->
    case unsafeFreezeByteArray# marr'# s3#  of { (# s4#, arr# #) -> -- must do this
    case byteArrayContents# arr#            of { adr# -> -- to obtain the addr# of pinned marr' here
    let cullp !p@(I# p#) sp# = -- for several prime values
          if p > 5 then case copyMutableByteArray# marr'# 0# marr# 0# n# sp# of
                          so# -> (# so#, STUArray l u n marr# #) else
          let !r1@(I# r1#) = ((p .&. 7) + p) `shiftR` 3 in -- register offset value
          let !(I# szlmt#) = n `div` 8 - r1 in
          let lmt# = plusAddr# adr# szlmt# in
          let doit c# s# = -- all the work is done here; herein lies the bugs...
                case c# `ltAddr#` lmt# of
                  0# -> s#
                  _ ->
                    case readWord8OffAddr# c# r1# s# of { (# s0#, v0# #) ->
                    case writeWord8OffAddr# c# r1# (v0# `or#` (int2Word# 1#)) s0# of { s1# ->
                    doit (plusAddr# c# p#) s1# }} in
          case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4# }}}}}

main = print $ length $ cull()

When compiled with the "-fllvm" compiler flag, the above code produces the following STG code for the inner loop (located by searching for the first "doit1_"):

let {
  doit1_s7pL [Occ=LoopBreaker]
    :: GHC.Prim.Addr#
       -> GHC.Prim.State#
            GHC.Prim.RealWorld
       -> GHC.Prim.State#
            GHC.Prim.RealWorld
  [LclId,
   Arity=2,
   Str=DmdType <S,U><S,U>,
   Unf=OtherCon []] =
      sat-only \r srt:SRT:[] [c#_s7pM
                              s#_s7pN]
          case
              ltAddr# [c#_s7pM
                       lmt#1_s7pJ]
          of
          _ [Occ=Dead]
          { __DEFAULT ->
                case
                    readWord8OffAddr# [c#_s7pM
                                       r1#_s7pG
                                       s#_s7pN]
                of
                _ [Occ=Dead]
                { (#,#) ipv6_s7pQ [Occ=Once]
                        ipv7_s7pR [Occ=Once] ->
                      case
                          or# [ipv7_s7pR
                               1##]
                      of
                      sat_s7pS
                      { __DEFAULT ->
                            case
                                writeWord8OffAddr# [c#_s7pM
                                                    r1#_s7pG
                                                    sat_s7pS
                                                    ipv6_s7pQ]
                            of
                            s1#1_s7pT [OS=OneShot]
                            { __DEFAULT ->
                                  case
                                      plusAddr# [c#_s7pM
                                                 ww_s7pC]
                                  of
                                  sat_s7pU
                                  { __DEFAULT ->
                                        doit1_s7pL
                                            sat_s7pU
                                            s1#1_s7pT;
                                  };
                            };
                      };
                };
            0# ->
                s#_s7pN;
          };
} in 

Which first produces the following CMM code (found by search for "doit1_"):

 doit1_s7pL_entry() //  [R2, R1]
         { info_tbl: [(c7Kw,
                       label: doit1_s7pL_info
                       rep:HeapRep 3 nonptrs { Fun {arity: 2 fun_type: ArgSpec 4} })]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       c7Kw:
           _s7pM::I64 = R2;
           _s7pL::P64 = R1;
           goto c7Kp;
       c7Kp:
           if ((old + 0) - <highSp> < SpLim) goto c7Kx; else goto c7Ky;
       c7Kx:
           R2 = _s7pM::I64;
           R1 = _s7pL::P64;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c7Ky:
           goto c7Ko;
       c7Ko:
           _s7pC::I64 = I64[_s7pL::P64 + 6]; // registers initialized inside loop here
           _s7pG::I64 = I64[_s7pL::P64 + 14];
           _s7pJ::I64 = I64[_s7pL::P64 + 22]; // to here
           _c7Kr::I64 = _s7pM::I64 < _s7pJ::I64;
           _s7pO::I64 = _c7Kr::I64;
           switch [-9223372036854775808 .. 9223372036854775807] _s7pO::I64 {
               case 0 : goto c7Kv;
               default: goto c7Ku;
           }
       c7Kv:
           goto c7KG;
       c7KG:
           call (P64[(old + 8)])() args: 8, res: 0, upd: 8;
       c7Ku:
           goto c7KB;
       c7KB:
           _s7pR::I64 = %MO_UU_Conv_W8_W64(I8[_s7pM::I64 + (_s7pG::I64 << 0)]);
           _s7pR::I64 = _s7pR::I64;
           _c7KJ::I64 = _s7pR::I64 | 1;
           _s7pS::I64 = _c7KJ::I64;
           I8[_s7pM::I64 + (_s7pG::I64 << 0)] = %MO_UU_Conv_W64_W8(_s7pS::I64);
           _c7KO::I64 = _s7pM::I64 + _s7pC::I64;
           _s7pU::I64 = _c7KO::I64;
           _s7pM::I64 = _s7pU::I64;
           goto c7Ko;
     }
 },

then after many optimization passes produces the following optimized CMM code:

==================== Optimised Cmm ====================
2016-11-11 13:14:21.2389114 UTC

doit1_s7pL_entry() //  [R1, R2]
        { [(c7Kw,
            doit1_s7pL_info:
                const 8589934596;
                const 12884901888;
                const 9;)]
        }
    {offset
      c7Kw:
          _s7pM::I64 = R2;
          _s7pL::P64 = R1;
          goto c7Ko;
      c7Ko:
          switch [-9223372036854775808 .. 9223372036854775807] (_s7pM::I64 < I64[_s7pL::P64 + 22]) {
              case 0 : goto c7Kv;
              default: goto c7Ku;
          }
      c7Kv:
          call (P64[Sp])() args: 8, res: 0, upd: 8;
      c7Ku:
          _s7pC::I64 = I64[_s7pL::P64 + 6]; // registers initialized inside loop here
          _s7pG::I64 = I64[_s7pL::P64 + 14]; // and here
          I8[_s7pM::I64 + _s7pG::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_s7pM::I64 + _s7pG::I64]) | 1);
          _s7pM::I64 = _s7pM::I64 + _s7pC::I64;
          goto c7Ko;
    }
}

and finally the following assembly code:

s7pL_info$def:
# BB#0:                                 # %c7Kw
	cmpq	%r14, 22(%rbx)
	jbe	.LBB18_2
	.align	16, 0x90
.LBB18_1:                               # %c7Ku
                                        # =>This Inner Loop Header: Depth=1
	movq	14(%rbx), %rax    # registers initialized inside loop here
	movq	6(%rbx), %rcx     # and here
	addq	%r14, %rcx
	orb	$1, (%rax,%r14)
	cmpq	22(%rbx), %rcx    # and an additional unnecessary memory load here by LLVM?
	movq	%rcx, %r14        # extra unnecessary instruction if code reformulated
	jb	.LBB18_1
.LBB18_2:                               # %c7Kv
	movq	(%rbp), %rax
	rex64 jmpq	*%rax           # TAILCALL

I find no problems with the STG code, but the problems persist through all of the other codes including the initial CMM code. I have commented on where the problems are in the above codes. I would like to see the optimized CMM code look like the following:

doit1_s7pL_entry() //  [R1, R2]
        { [(c7Kw,
            doit1_s7pL_info:
                const 8589934596;
                const 12884901888;
                const 9;)]
        }
    {offset
      c7Kw:
          _s7pM::I64 = R2;
          _s7pL::P64 = R1;
          _s7pC::I64 = I64[_s7pL::P64 + 6]; // registers initialized outside loop here
          _s7pG::I64 = I64[_s7pL::P64 + 14]; // and here
          goto c7Ko;
      c7Ko:
          switch [-9223372036854775808 .. 9223372036854775807] (_s7pM::I64 < I64[_s7pL::P64 + 22]) {
              case 0 : goto c7Kv;
              default: goto c7Ku;
          }
      c7Kv:
          call (P64[Sp])() args: 8, res: 0, upd: 8;
      c7Ku:
          I8[_s7pM::I64 + _s7pG::I64] = %MO_UU_Conv_W64_W8(%MO_UU_Conv_W8_W64(I8[_s7pM::I64 + _s7pG::I64]) | 1);
          _s7pM::I64 = _s7pM::I64 + _s7pC::I64;
          goto c7Ko;
    }
}

which should produce the following desired assembly code:

s7pL_info$def:
# BB#0:                                 # %c7Kw
	movq	14(%rbx), %rax  # registers initialized outside loop here
	movq	6(%rbx), %rcx   # and here
	movq	22(%rbx), %rbx  # and here
	cmpq	%r14, %rbx
	jbe	.LBB18_2
	.align	16, 0x90
.LBB18_1:                               # %c7Ku
                                        # =>This Inner Loop Header: Depth=1
	orb	$1, (%rax,%r14)
	addq	%rcx, %r14
	cmpq	%rbx, %r14      # use a register for comparison
	jb	.LBB18_1
.LBB18_2:                               # %c7Kv
	movq	(%rbp), %rax
	rex64 jmpq	*%rax           # TAILCALL

The above assembly code is about as good as it gets in any language, and GHC should be able to produce this, at least with the LLVM backend.

Last edited 2 years ago by GordonBGood (previous) (diff)

comment:11 Changed 2 years ago by GordonBGood

Summary: For primitive (Addr#) operations, Loop Invariant Code Flow not lifted outside the loop...For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...

comment:12 Changed 2 years ago by GordonBGood

It seems that the loop invariant code flow not being lifted out of the loops in not limited to primitive operations (also including Addr#), but is a general case for any code that is not purely strict, thus anything involving boxed thunks does not seem to be optimized properly.

The following code of a simple naive Sieve of Eratosthenes implementation with the composite number culling operations run a number of times in a loop for better timing purposes demonstrates the problem:

{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm -ddump-opt-cmm -ddump-to-file -dumpdir . #-} -- or -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )

twos = listArray (0, 31) [ 1 `shiftL` i | i <- [0 .. 31]] :: UArray Int Word32

eos :: Int -> [Int]
eos top = [fromIntegral i | (i, False) <- assocs cmpsts] where
  cmpsts = runSTUArray $ do
    cmpstsb <- newArray (0, top) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int Word32)) cmpstsb
    unsafeWrite cmpstsw 0 3 -- precull 0 and 1
    let loop i =
          if i <= 0 then return cmpstsb else
          let nxtp p =
                let s = p * p in
                if s > top then loop (i - 1) else do
                v <- unsafeRead cmpstsw (p `shiftR` 5)
                if v .&. unsafeAt twos (p .&. 31) /= 0 then nxtp (p + 1) else
                  let nxtc c =
                        if c > top then return () else do
                        let w = c `shiftR` 5
                        v <- unsafeRead cmpstsw w
                        unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
                        nxtc (c + p) in do { nxtc s; nxtp (p + 1) } in twos `seq` nxtp 2
    loop (10000 :: Int)

main = print $ length $ eos(131071)

When run with the -fllvm (LLVM back end) compiler flag, it produces the following STG code for the inner loop (located by searching for "nxtc", massively indented for display):

let {
  $wnxtc_s7Ru [InlPrag=[0],
               Occ=LoopBreaker]
    :: GHC.Prim.Int#
       -> GHC.Prim.State#
            GHC.Prim.RealWorld
       -> (# GHC.Prim.State#
               GHC.Prim.RealWorld,
             () #)
  [LclId,
   Arity=2,
   Str=DmdType <S,U><S,U>,
   Unf=OtherCon []] =
      sat-only \r srt:SRT:[] [ww1_s7Rv
                              w1_s7Rw]
          case
              ># [ww1_s7Rv
                  131071#]
          of
          sat_s7Rx
          { __DEFAULT ->
                case
                    tagToEnum# [sat_s7Rx]
                of
                _ [Occ=Dead]
                { GHC.Types.False ->
                      case
                          uncheckedIShiftRA# [ww1_s7Rv
                                              5#]
                      of
                      i#_s7Rz [Dmd=<S,U>]
                      { __DEFAULT ->
                            case
                                readWord32Array# [ipv1_s7R1
                                                  i#_s7Rz
                                                  w1_s7Rw]
                            of
                            _ [Occ=Dead]
                            { (#,#) ipv8_s7RB [Occ=Once]
                                    ipv9_s7RC [Occ=Once] ->
                                  case
                                      andI# [ww1_s7Rv
                                             31#]
                                  of
                                  sat_s7RD
                                  { __DEFAULT ->
                                        case
                                            indexWord32Array# [ipv5_s7Rf
                                                               sat_s7RD]
                                        of
                                        wild5_s7RE
                                        { __DEFAULT ->
                                              case
                                                  or# [ipv9_s7RC
                                                       wild5_s7RE]
                                              of
                                              sat_s7RF
                                              { __DEFAULT ->
                                                    case
                                                        writeWord32Array# [ipv1_s7R1
                                                                           i#_s7Rz
                                                                           sat_s7RF
                                                                           ipv8_s7RB]
                                                    of
                                                    s2#1_s7RG [OS=OneShot]
                                                    { __DEFAULT ->
                                                          case
                                                              +# [ww1_s7Rv
                                                                  ww_s7Rh]
                                                          of
                                                          sat_s7RH
                                                          { __DEFAULT ->
                                                                $wnxtc_s7Ru
                                                                    sat_s7RH
                                                                    s2#1_s7RG;
                                                          };
                                                    };
                                              };
                                        };
                                  };
                            };
                      };
                  GHC.Types.True ->
                      (#,#) [w1_s7Rw
                             GHC.Tuple.()];
                };
          };
} in 

This, in turn produces the following CMM code:

       c8oB:
           _s7R1::P64 = P64[_s7Ru::P64 + 6];
           _s7Rf::P64 = P64[_s7Ru::P64 + 14];
           _s7Rh::I64 = I64[_s7Ru::P64 + 22];
           _c8oE::I64 = %MO_S_Gt_W64(_s7Rv::I64, 131071);
           _s7Rx::I64 = _c8oE::I64;
           switch [0 .. 1] _s7Rx::I64 {
               case 0 : goto c8oM;
               case 1 : goto c8oN;
           }
       c8oN:
           R1 = GHC.Tuple.()_closure+1;
           call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
       c8oM:
           _c8oP::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
           _s7Rz::I64 = _c8oP::I64;
           _s7RC::I64 = %MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)]);
           _s7RC::I64 = _s7RC::I64;
           _c8oS::I64 = _s7Rv::I64 & 31;
           _s7RD::I64 = _c8oS::I64;
           _c8oV::I64 = %MO_UU_Conv_W32_W64(I32[(_s7Rf::P64 + 16) + (_s7RD::I64 << 2)]);
           _s7RE::I64 = _c8oV::I64;
           _c8oY::I64 = _s7RC::I64 | _s7RE::I64;
           _s7RF::I64 = _c8oY::I64;
           I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] = %MO_UU_Conv_W64_W32(_s7RF::I64);
           _c8p3::I64 = _s7Rv::I64 + _s7Rh::I64;
           _s7RH::I64 = _c8p3::I64;
           _s7Rv::I64 = _s7RH::I64;
           goto c8oB;

which is reduced to the following CMM code after many optimization passes:

      c8oB:
          switch [0 .. 1] (%MO_S_Gt_W64(_s7Rv::I64, 131071)) {
              case 0 : goto c8oM;
              case 1 : goto c8oN;
          }
      c8oN:
          R1 = GHC.Tuple.()_closure+1;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      c8oM:
          _s7R1::P64 = P64[_s7Ru::P64 + 6];
          _s7Rh::I64 = I64[_s7Ru::P64 + 22];
          _s7Rz::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
          I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[P64[_s7Ru::P64 + 14] + ((_s7Rv::I64 & 31 << 2) + 16)]));
          _s7Rv::I64 = _s7Rv::I64 + _s7Rh::I64;
          goto c8oB;

and finally the following assembly code:

	.align	16, 0x90
.LBB29_1:                               # %c8oM
                                        # =>This Inner Loop Header: Depth=1
	movq	%r14, %rax
	sarq	$5, %rax
	movq	6(%rbx), %rcx
	movq	14(%rbx), %rdx
	movl	%r14d, %esi
	andl	$31, %esi
	movl	16(%rdx,%rsi,4), %edx
	addq	22(%rbx), %r14
	orl	%edx, 16(%rcx,%rax,4)
	cmpq	$131072, %r14           # imm = 0x20000
	jl	.LBB29_1

where one can clearly see the multiple register loads inside the inner loop. This code runs at almost four CPU clock cycles per loop on Intel Skylake.

It is easy to see that this code is partially non-strict by running the +RTS -s command line option on the run to observed that heap use is much higher than it should be, although not so high that it causes a significant amount of GC or cost in execution time. The extra execution time is almost entirely due to the register reloads seen above inside the inner loop.

The Work Around

By merely changing the inner loop as follows, the non-strictness goes away (as seen in the amount of heap used, which drops to a few 10's of Kilobytes from 10's of Megabytes:

                  let nxtc c =
                        if c > top then nxtp (p + 1) else do
                        let w = c `shiftR` 5
                        v <- unsafeRead cmpstsw w
                        unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
                        nxtc (c + p) in nxtc s in twos `seq` nxtp 2

With the modified code producing the following STG (massively indented for display here):

lvl21_s7Rl [Dmd=<S,U>]
{ __DEFAULT ->
      let-no-escape {
        $wnxtc_s7Rm [InlPrag=[0],
                     Occ=LoopBreaker]
          :: GHC.Prim.Int#
             -> GHC.Prim.State#
                  GHC.Prim.RealWorld
             -> (# GHC.Prim.State#
                     GHC.Prim.RealWorld,
                   Data.Array.Base.STUArray
                     GHC.Prim.RealWorld
                     GHC.Types.Int
                     GHC.Types.Bool #)
        [LclId,
         Arity=2,
         Str=DmdType <S,U><S,U>,
         Unf=OtherCon []] =
            sat-only \r srt:SRT:[] [ww3_s7Rn
                                    w3_s7Ro]
                case
                    ># [ww3_s7Rn
                        131071#]
                of
                sat_s7Rp
                { __DEFAULT ->
                      case
                          tagToEnum# [sat_s7Rp]
                      of
                      _ [Occ=Dead]
                      { GHC.Types.False ->
                            case
                                uncheckedIShiftRA# [ww3_s7Rn
                                                    5#]
                            of
                            i#_s7Rr [Dmd=<S,U>]
                            { __DEFAULT ->
                                  case
                                      readWord32Array# [ipv1_s7Qj
                                                        i#_s7Rr
                                                        w3_s7Ro]
                                  of
                                  _ [Occ=Dead]
                                  { (#,#) ipv8_s7Rt [Occ=Once]
                                          ipv9_s7Ru [Occ=Once] ->
                                        case
                                            andI# [ww3_s7Rn
                                                   31#]
                                        of
                                        sat_s7Rv
                                        { __DEFAULT ->
                                              case
                                                  indexWord32Array# [ipv5_s7Qx
                                                                     sat_s7Rv]
                                              of
                                              wild7_s7Rw
                                              { __DEFAULT ->
                                                    case
                                                        or# [ipv9_s7Ru
                                                             wild7_s7Rw]
                                                    of
                                                    sat_s7Rx
                                                    { __DEFAULT ->
                                                          case
                                                              writeWord32Array# [ipv1_s7Qj
                                                                                 i#_s7Rr
                                                                                 sat_s7Rx
                                                                                 ipv8_s7Rt]
                                                          of
                                                          s2#1_s7Ry [OS=OneShot]
                                                          { __DEFAULT ->
                                                                case
                                                                    +# [ww3_s7Rn
                                                                        ww2_s7R8]
                                                                of
                                                                sat_s7Rz
                                                                { __DEFAULT ->
                                                                      $wnxtc_s7Rm
                                                                          sat_s7Rz
                                                                          s2#1_s7Ry;
                                                                };
                                                          };
                                                    };
                                              };
                                        };
                                  };
                            };
                        GHC.Types.True ->
                            $wnxtp1_s7R7
                                lvl21_s7Rl
                                w3_s7Ro;
                      };
                };
      } in 
        $wnxtc_s7Rm
            x1_s7Ra
            ipv6_s7Rf;
};
};

converted to the following initial CMM code:

      c8o0:
          switch [0 .. 1] (%MO_S_Gt_W64(_s7QO::I64, 131071)) {
              case 0 : goto c8o8;
              case 1 : goto c8o9;
          }
      c8o9:
          _s7Qz::I64 = _s7QM::I64;
          goto c8ni;
      c8o8:
          _s7QS::I64 = %MO_S_Shr_W64(_s7QO::I64, 5);
          I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) + (_s7QO::I64 & 31 << 2)]));
          _s7QO::I64 = _s7QO::I64 + _s7Qz::I64;
          goto c8o0;

and the following optimized CMM code:

      c8p9:
          switch [0 .. 1] (%MO_S_Gt_W64(_s7Rn::I64, 131071)) {
              case 0 : goto c8ph;
              case 1 : goto c8pi;
          }
      c8pi:
          _s7R8::I64 = _s7Rl::I64;
          goto c8ou;
      c8ph:
          _s7Rr::I64 = %MO_S_Shr_W64(_s7Rn::I64, 5);
          I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)] = %MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) + (_s7Rn::I64 & 31 << 2)]));
          _s7Rn::I64 = _s7Rn::I64 + _s7R8::I64;
          goto c8p9;

to produce the following almost ideal assembly code (this particular code doesn't seem to manifest the symptoms of ticket #12798):

.LBB29_10:                              # %c8ph
                                        #   Parent Loop BB29_7 Depth=1
                                        #     Parent Loop BB29_8 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
	movq	%rsi, %rdx
	sarq	$5, %rdx
	movl	%esi, %edi
	andl	$31, %edi
	movl	16(%rcx,%rdi,4), %edi
	orl	%edi, 16(%r10,%rdx,4)
	addq	%rax, %rsi
	cmpq	$131071, %rsi           # imm = 0x1FFFF
	jle	.LBB29_10

which one can see has no register loads and is almost ideal as to speed for the purpose - it runs at about 3.09 CPU clock cycles per loop whereas I have seen some code slightly re-ordered as produced by Clang/Rust/LLVM that runs at about 3.00 clock cycles.

In order to fix the previous code using primitive Addr# operations for which the ticket was opened, one just has to convince the compiler that it is to be evaluated strictly; although this is not so easy or one runs into the mixed lifted and un-lifted types error message.

However, there is likely a whole wide range of programs where executing entirely strictly is either not possible or not desired. I don't see why non-strict boxed code (for Haskell, likely the majority of code) can not be just as effectively optimized.

In conclusion: this is a very serious performance bug that can cause up to about a half again cost in execution time (50% increase), occurs in many use cases with a typical performance cost of about 30% (for instance for highly recursive code using list basted tail calls), and I believe has a great deal to do with the general perception that (GHC) Haskell is very much slower than Cee languages (C/C++/Rust, etc.).

comment:13 Changed 2 years ago by GordonBGood

Description: modified (diff)

comment:14 Changed 2 years ago by simonpj

Cc: maurerl@… added
Keywords: JoinPoints added

Interesting. Looking at comment:12, note the difference between the SLOW version:

let nxtc c = if c > top
             then return ()
             else do { ...; nxtc (c+p) }
in do { nxtc ss; nxtp (p + 1) }

and the FAST version

let nxtc c = if c > top
             then nxtp (p + 1) else do
             else do { ...; nxtc (c+p) }
in nxtc ss

In SLOW, nxtc is represented by a heap-allocated closure, whereas in FAST nxtc is a join point, and hence not allocated at all (you can see that from the let-no-escape in the STG). See our paper Compiling without continuations, and SequentCore.

Notice that in SLOW, the call to nxtc s is followed by a call to nxtp (p+1). But in FAST we move that call right into nxtc itself, in the return branch of the if. That's what makes nxtc into a join point.

(None of this has anything to do with non-strictness, incidentally.)

This is a rather non-trivial transformation. You clearly think it's a pretty obvious optimisation, but it doesn't look obvious to me.

Happily, though, our new Core-with-join-point (described in the paper) should catch this nicely. If we start with SLOW, after a bit we'll get this

let nxtc c s = if c > top
               then (# s,c #)
               else case ... of (# s',p #) -> nxtc (c+p) s' }
in case nxtc ss s of
     (# s', r #) -> nxtp (p + 1) s' }

Now if we do float-in, to move the let nxtc in to the scruintee of the case, it becomes a join point, and join-point analysis should find it. After that, the transformations in the paper will turn it into FAST.

The example in comment:10 looks similar:

  case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4# }}}}}

Here again, doit will become a join point if we float it in. The example in the Descrption is too big for me to analyse.

I've cc'd Luke Maurer who is implementing Core with join points; this looks like another good example. (cf #12781)

comment:15 in reply to:  14 Changed 2 years ago by GordonBGood

Replying to simonpj:

...

In SLOW, nxtc is represented by a heap-allocated closure, whereas in FAST nxtc is a join point, and hence not allocated at all (you can see that from the let-no-escape in the STG). See our paper Compiling without continuations, and SequentCore.

Yes, I saw that. By "non-strict", I meant that in SLOW these nxtc closures are accumulated on the heap for each invocation (one for each prime to be culled multiplied by the loop iterations for numLOOPS times) until forced to be executed by runSTUArray whereas in FAST they are not.

Notice that in SLOW, the call to nxtc s is followed by a call to nxtp (p+1). But in FAST we move that call right into nxtc itself, in the return branch of the if. That's what makes nxtc into a join point.

I had wondered if the work on join points might help here...

(None of this has anything to do with non-strictness, incidentally.)

I just wondered if strictness might be a clue. Now I see that the extra heap space is just as required for these closures when the compiler believes them to be necessary, with this whole loop execution deferred until the array cmpsts is actually used in either case.

This is a rather non-trivial transformation. You clearly think it's a pretty obvious optimisation, but it doesn't look obvious to me.

No, I didn't think it was trivial, just that the optimisation gets triggered in the one case; I couldn't see why it couldn't be extended to the other...

Happily, though, our new Core-with-join-point (described in the paper) should catch this nicely. If we start with SLOW, after a bit we'll get this

let nxtc c s = if c > top
               then (# s,c #)
               else case ... of (# s',p #) -> nxtc (c+p) s' }
in case nxtc ss s of
     (# s', r #) -> nxtp (p + 1) s' }

Now if we do float-in, to move the let nxtc in to the scruintee of the case, it becomes a join point, and join-point analysis should find it. After that, the transformations in the paper will turn it into FAST.

The example in comment:10 looks similar:

  case doit adr# sp# of sd# -> cullp (p + 2) sd# in cullp 1 s4# }}}}}

Here again, doit will become a join point if we float it in. The example in the Descrption is too big for me to analyse.

Don't worry about the big example in the description as if the small example in comment 10 and this latest example in comment 12 get fixed, I'm pretty sure it will take core of that larger case too. The gist of it is that there 32 cases of the inner cull closure with the one run per invocation of the outer cullp chosen by a case/"switch", then the return value of (the chosen) cull is processed to become ns#, which in turn is passed to a different version of cull (at the bottom), which on its output does a direct call to cullp to form the loop. The difference is that there are several levels of "float in" required: with the 32 cases of the inner cull's floated out to the case/"switch", which would be floated in to the ns# calculation, which in turn would be floated into the other cull invocation, and this cull should already be a join point as it calls directly as in the FAST case.

I've cc'd Luke Maurer who is implementing Core with join points; this looks like another good example. (cf #12781)

That is encouraging news the the Join Point Analysis should catch this. Thanks for looking at this, as I think it important in many cases beyond this.

New information:

However, it would appear that the base cause of the problem is that closures are not optimised as well as non-closures as to Loop Invariant Code Flow (and perhaps other optimizations) when applied to captured free variables.

My reasoning is as follows:

1) loop could be considered to be a closure as, although it doesn't directly refer to any captured values, it contains code that does refer to captured values; however, in actual implementation it isn't a closure as it is transformed to contain the captured value (the array reference) as an additional parameter.

2) nxtp would be considered a closure as it contains a reference to the captured value, but again is transformed just as for loop with the additional parameter.

3) In the SLOW case, nxtc is a closure, and doesn't get optimised as to Loop Invariant Code Flow.

4) In the FAST case, nxtc is recognized as a join point and lifted inline to its enclosing nxtp function (not a closure), thus optimised properly with LICF lifting.

As I see it, recognizing the join point and lifting it is just curing the symptom by lifting the closure into a non-closure (although it also has its advantages in reducing heap use due to closure allocations, and execution time of calling the closure). However, the root cause of closure code not being optimised, at least for LICF) still remains for any cases where, say, the enclosing scope is itself a closure.

As proof, I offer the following version of nxtc which is still a "closure" although not actually capturing any free variables by making the otherwise captured values parameters to the "closure":

                  let nxtc cmpstsw twos p top c =
                        if c > top then return () else do
                        let w = c `shiftR` 5
                        v <- unsafeRead cmpstsw w
                        unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
                        nxtc cmpstsw twos p top (c + p) in do { nxtc cmpstsw twos p top s; nxtp (p + 1) } in twos `seq` nxtp 2

This has almost exactly the desired effect of eliminating the LICF (with the help of back end optimisations - LLVM), although it triggers another slight performance bug as described in ticket #12798 (there thought to be a bug in LLVM 3.7 fixed in LLVM HEAD). The extra parameters are optimized away almost like a worker/wrapper pattern, and the resulting code runs at the speed of #12798 (almost ideal except for that bug). Thus the problem is captured values inside closures (so that the closure needs to be stored on the heap).

As I have stated, I believe there are many use cases, even to the common worker/wrapper pattern that seeks to reduce the amount of parameter passing by enclosing a recursive worker closure inside a wrapper; any gains from this pattern could be cancelled and more if the wrapped enclosure is less efficient.

Last edited 2 years ago by GordonBGood (previous) (diff)

comment:16 Changed 2 years ago by GordonBGood

Description: modified (diff)
Summary: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...For closures, Loop Invariant Code Flow not lifted outside the loop...

comment:17 Changed 2 years ago by GordonBGood

Summary: For closures, Loop Invariant Code Flow not lifted outside the loop...For closures, Loop Invariant Code Flow related to captured free values not lifted outside the loop...

comment:18 Changed 2 years ago by lukemaurer

Confirmed—with join points, heap allocation drops from 23M to 70K. Looks like the simplifier is doing exactly what we want, and all the loops wind up as join points.

comment:19 Changed 2 years ago by simonpj

As I have stated, I believe there are many use cases, even to the common worker/wrapper pattern that seeks to reduce the amount of parameter passing by enclosing a recursive worker closure inside a wrapper; any gains from this pattern could be cancelled and more if the wrapped enclosure is less efficient.

Let me say what I think you are saying. Consider

f x p q = let g y = ...g y'...x...
          in (g p, g q)

Left as-is we allocate a closure for g every time we call f. But instead we could lambda-lift g:

g2 x y = ...g2 x y'...x...

f x p q = (g2 x p, g2 x q)

Now we don't allocate a closure for g. That is good. Is this what you mean?

We don't want to do this in general, early in optimisation, because we get huge benefits from being able to "see" the binding site of g's free variable x. But these benefits are over when it comes to code generation.

So we have experimented with so-called "late lambda lifting" (LLF). There's a whole wiki page about it: LateLamLift. It can be a real win.

One obstacle to LLF is, ironically, that it can destroy join points (see the wiki page). A second benefit of Luke's new join-point work is that it becomes much easier to ensure that LLF doesn't destroy join points, and thus renders it much more practical. I think Luke will turn his attention to it once join points are solidly in.

comment:20 in reply to:  19 Changed 2 years ago by GordonBGood

Replying to simonpj:

As I have stated, I believe there are many use cases, even to the common worker/wrapper pattern that seeks to reduce the amount of parameter passing by enclosing a recursive worker closure inside a wrapper; any gains from this pattern could be cancelled and more if the wrapped enclosure is less efficient.

Let me say what I think you are saying. Consider

f x p q = let g y = ...g y'...x...
          in (g p, g q)

Left as-is we allocate a closure for g every time we call f. But instead we could lambda-lift g:

g2 x y = ...g2 x y'...x...

f x p q = (g2 x p, g2 x q)

Now we don't allocate a closure for g. That is good. Is this what you mean?

We don't want to do this in general, early in optimisation, because we get huge benefits from being able to "see" the binding site of g's free variable x. But these benefits are over when it comes to code generation.

So we have experimented with so-called "late lambda lifting" (LLF). There's a whole wiki page about it: LateLamLift. It can be a real win.

One obstacle to LLF is, ironically, that it can destroy join points (see the wiki page). A second benefit of Luke's new join-point work is that it becomes much easier to ensure that LLF doesn't destroy join points, and thus renders it much more practical. I think Luke will turn his attention to it once join points are solidly in.

Yes, Simon, I manually lifted the closure by turning all the otherwise free variables into arguments to the function; although I didn't actually move the code to the top level it could have been, as it is no longer a closure capturing free variables but rather just a free function.

However, my point in doing so was not to show that even this early manual lambda lifting is effective but that the code generated inside the function became so much more efficient due to seeing the Loop Invariant Code Flow (LICF) and lifting the register loading outside of the loop; in my mind the compiler should have done this whether the code was a closure or not.

As the article on lambda lifting says, this lambda lifting at an early stage can prevent some optimizations in some cases, and definitely there will be some cost in this case in run time overhead in passing all of those extra arguments to the function each and every time the function is called; however, as the recursive calls are eliminated by the tail call optimization of making a loop internal to the function, the function only gets called very few times so as to have a negligible overall impact here. There may be other negative effects of the very early lambda lifting, but again they are negligible compared to the gains made in efficiency of the internal loop due to properly using LICF lifting.

So the question I pose is "Why isn't LICF lifting used for the code internal to closures when it is used for non-closures?".

Your Lambda Lifting and Join Point Analysis can serve to reduce this problem by eliminating closures, but the problem is still there for cases where the closures can't be eliminated and/or shouldn't be lifted.

I sometimes wonder whether this is the problem that makes LL and JPA appear to be so effective: this problem makes the code that doesn't use JPA/LL much slower than it would otherwise be so that if this problem were not there, the gains made from LL/JPA in eliminating some/many closures would not likely be so great.

I brought up the wrapper/worker pattern because its whole point is to reduce the number of passed parameters that need to be tail call optimized away, but the "worker" then must needs be a closure; with the problem of this issue, in many cases the resulting wrapper/worker will be slower than if we didn't factor in the closure "worker". In the sieve code of this thread, the "nxtc" closure as originally written is essentially a "worker" to the "nxtp" wrapper function and manually lifting it as I did means it is no longer a "worker". I should have gotten slightly worse performance in doing this but instead got much better performance because the compiler now did LICF optimization on the non-closure, where it currently doesn't seem to be able to do it on a non-closure.

comment:21 Changed 23 months ago by bgamari

Milestone: 8.2.18.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4

comment:22 Changed 13 months ago by bgamari

Milestone: 8.4.18.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:23 Changed 8 months ago by bgamari

Milestone: 8.6.18.8.1

These won't be fixed for 8.6, bumping to 8.8.

comment:24 Changed 8 weeks ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.