Opened 12 months ago

Last modified 3 months ago

#8971 new bug

Native Code Generator for 7.8 is not as optimized as 7.6.3...

Reported by: GordonBGood
Owned by:
Priority: normal
Milestone: 7.12.1
Component: Compiler (NCG)
Version: 7.8.1-rc2
Keywords:
Cc: simonmar, gideon@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

The output assembly code from the Windows 32-bit 7.8.1 RC2 compiler is not as well optimized as that from the Windows 7.6.3 compiler (also 32-bit) when the option switches are exactly the same, and the problem may not be limited to the Windows platform. The negative impact on execution time for tight loops is about a factor of two.

The following code will reproduce the problem:

-- GHC_NCG_OptimizationBug.hs
-- it seems the Haskell GHC 7.8.1 Native Code Generator (NCG) doesn't
-- optimize as well for (at least) the x86 target as version 7.6.3

{-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file -dumpdir . #-} -- or O2

import Data.Bits
import Control.Monad.ST (runST,ST(..))
import Data.Array.Base

-- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it.
accNumPrimes :: Int -> Int
accNumPrimes acc = acc `seq` runST $ do
  let bfSz = (256 * 1024 - 3) `div` 2
      bfLmtWrds = (bfSz + 1) `div` 32
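  -- bit index i of the buffer represents the odd number 2*i + 3 (odds-only sieve to 2^18)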
  bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int)
  -- to clear the last "uneven" bit(s)
  unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31)))
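  -- view the same words as an array of Bool bits; castSTUArray reinterprets in place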
  bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int Bool)) bufw
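  -- for each index i the candidate prime is p = 2*i + 3; culling starts at the index of p*p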
  let cullp i =
        let p = i + i + 3 in
        let s = (p * p - 3) `div` 2 in
        if s > bfSz then
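          -- culling finished: count the remaining set bits (odd primes) word by word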
          let count i sm = do
                sm `seq` if i > bfLmtWrds then return (acc + sm) else do
                  wd <- unsafeRead bufw i
                  count (i + 1) (sm + (popCount wd)) in
          count 0 1 -- use '1' for the '2' prime not in the array
        else do
          v <- unsafeRead bufb i
          if v then
            let cull j = do -- very tight inner loop
                  if j > bfSz then cullp (i + 1) else do
                    unsafeWrite bufb j False
                    cull (j + p) in
            cull s
          else cullp (i + 1)
  cullp 0

main =
  -- run the program a number of times to get a reasonable time...
  let numloops = 2000 in
  let loop n acc =
        acc `seq` if n <= 0 then acc else
        loop (n - 1) (accNumPrimes acc) in
  print $ loop numloops 0

The above code takes almost twice as long to run when compiled under 7.8.1 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both 32-bit compilers).

The -ddump-simpl Core dump is almost identical between the two. This is also evidenced by the fact that using the -fllvm LLVM back-end switch with each compiler produces code that runs at about the same speed for both (the LLVM back end would use the same Core output as the NCG, right?).

Under Windows, the compilation and run for 7.8.1 RC2 goes like this:

*Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs"
compile: input file WindowsVsLinuxNCG.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 260, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 213, types: 136, coercions: 52}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 215, types: 148, coercions: 67}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 209, types: 135, coercions: 51}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
*** Float inwards:
Result size of Float inwards
  = {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 502, types: 393, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 428, types: 326, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 475, types: 383, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
  = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
  = {terms: 579, types: 457, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 510, types: 415, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 322, coercions: 9}
*** Core Linted result of Simplifier:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True})
  = {terms: 426, types: 326, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
*** Common sub-expression:
Result size of Common sub-expression
  = {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
  = {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
  = {terms: 1,824, types: 1,259, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 608, types: 422, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 708, types: 505, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 702, types: 499, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Tidy Core:
*** CorePrep:
Result size of CorePrep = {terms: 825, types: 489, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** New CodeGen:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** Assembler:
"E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno-stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp" "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o" "GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
      32,965,096 bytes allocated in the heap
           7,032 bytes copied during GC
          41,756 bytes maximum residency (2 sample(s))
          19,684 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.73s  (  1.73s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.73s  (  1.73s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    19,006,902 bytes per MUT second

  Productivity 100.0% of total user, 100.2% of total elapsed

whereas under version 7.6.3 goes like this:

*Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs"
compile: input file GHC_NCG_OptimizationBug.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 247, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 198, types: 132, coercions: 35}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 200, types: 144, coercions: 43}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 194, types: 131, coercions: 57}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
*** Float inwards:
Result size of Float inwards
  = {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 514, types: 403, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 420, types: 317, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 455, types: 364, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
  = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
  = {terms: 536, types: 427, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 480, types: 391, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 400, types: 306, coercions: 9}
*** Core Linted result of Simplifier:

*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True})
  = {terms: 408, types: 311, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
*** Common sub-expression:
Result size of Common sub-expression
  = {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
  = {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
  = {terms: 1,186, types: 824, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 585, types: 411, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 746, types: 566, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 739, types: 560, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 762, types: 546, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Tidy Core:
writeBinIface: 10 Names
writeBinIface: 34 dict entries
*** CorePrep:
Result size of CorePrep = {terms: 779, types: 483, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** CodeGen:
*** Assembler:
"E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o" "GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
      32,989,396 bytes allocated in the heap
           4,976 bytes copied during GC
          41,860 bytes maximum residency (2 sample(s))
          19,580 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.64s  (  0.64s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.66s  (  0.64s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    51,495,642 bytes per MUT second

  Productivity 100.0% of total user, 102.3% of total elapsed

Looking at the ASM dump for the innermost tight culling loop reveals the problem, with 7.8.1 RC2 outputting as follows:

_n3nx:
	movl 76(%esp),%ecx
_c3gf:
	cmpl %ecx,%eax
	jg _c3jB
_c3jC:
	movl %eax,%edx
	sarl $5,%edx
	movl %ecx,76(%esp)
	movl $1,%ecx
	movl %ecx,280(%esp)
	movl %eax,%ecx
	andl $31,%ecx
	movl %eax,292(%esp)
	movl 280(%esp),%eax
	shll %cl,%eax
	xorl $-1,%eax
	movl 64(%esp),%ecx
	addl $8,%ecx
	movl (%ecx,%edx,4),%ecx
	andl %eax,%ecx
	movl 64(%esp),%eax
	addl $8,%eax
	movl %ecx,(%eax,%edx,4)
	movl 292(%esp),%eax
	addl $3,%eax
	jmp _n3nx

and 7.6.3 outputting as follows:

.text
	.align 4,0x90
	.long	1894
	.long	32
s1GZ_info:
_c1YB:
	cmpl 16(%ebp),%esi
	jg _c1YE
	movl %esi,%edx
	sarl $5,%edx
	movl $1,%eax
	movl %esi,%ecx
	andl $31,%ecx
	shll %cl,%eax
	xorl $-1,%eax
	movl 12(%ebp),%ecx
	movl 8(%ecx,%edx,4),%ecx
	andl %eax,%ecx
	movl 12(%ebp),%eax
	movl %ecx,8(%eax,%edx,4)
	addl 4(%ebp),%esi
	jmp s1GZ_info
_c1YE:
	movl 8(%ebp),%esi
	addl $8,%ebp
	jmp s1GB_info

The second listing is clearly much more efficient: the only memory accesses are reading/writing the sieve buffer array plus one register reload of the prime value to add to the current position index, whereas the first (7.8.1 RC2) listing has three register spills and five register reloads, almost as if debugging were still turned on.

This bug was tested under Windows, but it likely applies to other platforms as well, at least in their 32-bit versions and possibly beyond.

Change History (23)

comment:1 Changed 12 months ago by simonmar

  • Cc simonmar added
  • Component changed from Compiler to Compiler (NCG)

comment:2 Changed 12 months ago by gidyn

  • Cc gideon@… added

comment:3 Changed 12 months ago by ezyang

Doesn't seem to be a problem on 64-bit:

.Lc4cz:
        cmpq %r8,%rbx
        jg .Lc4cJ
.Lc4cK:
        movq %rbx,%rdx
        sarq $6,%rdx
        movl $1,%edi
        movq %rcx,%r9
        movq %rbx,%rcx
        andq $63,%rcx
        shlq %cl,%rdi
        xorq $-1,%rdi
        leaq 16(%rax),%rcx
        movq (%rcx,%rdx,8),%rcx
        andq %rdi,%rcx
        leaq 16(%rax),%rdi
        movq %rcx,(%rdi,%rdx,8)
        addq %r9,%rbx
.Ln4g8:
        movq %r9,%rcx
        jmp .Lc4cz

What happens if you use -fregs-graph? If you could post C-- dumps, that would also be helpful.

comment:4 Changed 12 months ago by awson

I didn't go into details, but it looks like -fregs-graph doesn't help. The 7.8.1-compiled executable is still much slower than the 7.6.3-compiled one.

comment:5 Changed 12 months ago by awson

I've checked things on 32-bit Linux. It suffers too.

comment:6 Changed 11 months ago by awson

Also, while on 32-bit the difference between NCG and LLVM is *huge*, the difference on 64-bit (Windows) is also pretty large: about 55% on Yorkfield and 40% on Sandy Bridge.

Last edited 11 months ago by awson

comment:7 Changed 11 months ago by awson

The picture is the same for 64-bit Linux with ghc-7.8.1 and LLVM 3.4. The NCG loses badly.

comment:8 follow-ups: Changed 11 months ago by GordonBGood

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

This bug is about why the GHC 7.8.1 NCG has taken a step backward in performance from the previous stable version 7.6.3, due to more register spills and reloads.

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 11 months ago by tibbe

Replying to GordonBGood:

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

That's a good indication that the NCG is to blame, but it's also possible that the Cmm codegen has regressed and some LLVM optimizations make up for the regression.

comment:10 in reply to: ↑ 9 Changed 11 months ago by GordonBGood

Replying to tibbe:

Replying to GordonBGood:

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

That's a good indication that the NCG is to blame, but it's also possible that the Cmm codegen has regressed and some LLVM optimizations make up for the regression.

Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and -ddump-opt-cmm switches to the compilation, with the following results:

For version 7.6.3, the Cmm code for the loops looks like this:

 s1Hq_ret()
         { label: s1Hq_info
           rep:StackRep [True, False, True, True, True]
         }
     c1XX:
         _c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]);
         ;
         if (_c1Tr::I32 >= 1) goto c1XZ;
         _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
         _s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)];
         _s1KJ::I32 = R1;
         _s1Hh::I32 = _s1KJ::I32 & 31;
         _s1Hj::I32 = _s1Hh::I32;
         _s1KI::I32 = 1 << _s1Hj::I32;
         _s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615;
         _s1KH::I32 = _s1He::I32 & _s1Hm::I32;
         I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32;
         _s1KG::I32 = R1 + 3;
         R1 = _s1KG::I32;
         jump s1Hq_info; // [R1]
     c1XZ:
         R1 = 1;
         jump s1HY_info; // [R1]
 },

and the opt-cmm code for about the same area looks like this:

s1Hq_ret()
        { Just s1Hq_info:
                   const 933;
                   const 32;
        }
    c1XX:
        ;
        if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] = I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] & (1 << R1 & 31) ^ 18446744073709551615;
        R1 = R1 + 3;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
}

For version 7.8.1 RC2, the Cmm code is about eight times larger (about eight times the number of lines) and looks like this:

  c3jO:
      _c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32);
      _s33e::I32 = _c3jQ::I32;
      if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ;
  c3jY:
      _s32S::I32 = _s335::I32;
      goto c3gp;
  c3jZ:
      _c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
      _s33g::I32 = _c3kn::I32;
      _s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)];
      _s33j::I32 = _s33j::I32;
      _c3kq::I32 = _s33c::I32;
      _s33k::I32 = _c3kq::I32;
      _c3kt::I32 = _s33k::I32 & 31;
      _s33l::I32 = _c3kt::I32;
      _c3kw::I32 = _s33l::I32;
      _s33m::I32 = _c3kw::I32;
      _c3kz::I32 = 1 << _s33m::I32;
      _s33n::I32 = _c3kz::I32;
      _c3kC::I32 = _s33n::I32 ^ 4294967295;
      _s33o::I32 = _c3kC::I32;
      _c3kF::I32 = _s33j::I32 & _s33o::I32;
      _s33p::I32 = _c3kF::I32;
      I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32;
      _c3kK::I32 = _s33c::I32 + _s32U::I32;
      _s33r::I32 = _c3kK::I32;
      _s33c::I32 = _s33r::I32;
      goto c3jO;

and the optimized opt-cmm code looks like this:

      c3jO:
          if (%MO_S_Gt_W32(_s33c::I32,
                           _s32t::I32)) goto c3jY; else goto c3jZ;
      c3jY:
          Sp = Sp + 8;
          _s32S::I32 = _s335::I32;
          goto c3gp;
      c3jZ:
          _s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
          I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] & (1 << _s33c::I32 & 31) ^ 4294967295;
          _s33c::I32 = _s33c::I32 + _s32U::I32;
          goto c3jO;

It appears to me that the opt-cmm code is about the same, but the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

It may be that the NCG is using the non-optimized Cmm as its source while the LLVM back end is using the optimized Cmm, which would explain why the LLVM-generated code is still efficient where the NCG-generated code is not.

Anticipating your next request to look at the STG output, I used the -ddump-stg compiler option to examine that code. It is too wordy (with very long lines) to post here, but a quick examination shows it to be about the same for both versions, except that internal constants are recorded as 32-bit numbers in 7.8.1 whereas they were recorded as 64-bit numbers in 7.6.3 even when referring to 32-bit registers; this corresponds to the way they are recorded in both the optimized and non-optimized Cmm files, as listed above.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

EDIT_ADD: While looking into what the contents of the Cmm dump file mean, I found https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeGen which shows what, when, and where optimizations are done for the different sections. Poking around further, I find that there are optimized sections within the large file which are about as optimized as what I now understand to be the final optimized Cmm file. Therefore, as long as the NCG is using the right section (or right stage), the Cmm output is just as optimized for version 7.8.1 as for version 7.6.3, and the problem must be in the NCG.

Last edited 11 months ago by GordonBGood

comment:11 follow-up: Changed 11 months ago by jstolarek

I know nothing about 7.6 Cmm generation, but I worked on the 7.8 Cmm pipeline this summer, so I can offer some guidance. First I'd like to clarify a few things:

both the NCG and LLVM share the same C-- output

No, they will not. The LLVM back end requires "proc-point splitting". This means we need to turn every Cmm block that is a successor of more than one block into a separate procedure (at least, that is my understanding). See here and here for the sources of the differences between the Cmm generated for the two back ends. Here you'll find more on proc-points.

the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

I'm not sure what the philosophy behind 7.6 Cmm generation was, but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes straight out of the STG->Cmm pass.

It may be that the NCG is using the non-optimized Cmm as its source

It is using the optimized version. You can see for yourself in the tryNewCodeGen function and its call site.

Anticipating your next request to look at the STG output

Wrong anticipation :-) Look at the generated assembly. Only that can tell you what the real difference between the generated code is. I wouldn't be surprised to see something along the lines of #8048.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant.

Finally, if you want to learn more about the Cmm pipeline and Cmm debugging, see Commentary/Compiler/CodeGen.

Hope that helps.

Last edited 11 months ago by jstolarek

comment:12 in reply to: ↑ 8 ; follow-up: Changed 11 months ago by awson

Replying to GordonBGood:

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

Oh, I know this. But over the last few days I fought with a pile of bugs in 64-bit Windows GHC, and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above.

Off the top of my head, when I worked on #8974, I saw this:

	call suspendThread
	addq $40,%rsp
	subq $8,%rsp
	subq $32,%rsp
	movq %rax,%rbx
	movl $0,%eax
	call performMajorGC
	addq $40,%rsp
	movq %rbx,%rcx
	subq $8,%rsp
	subq $32,%rsp
	movl $0,%eax
	call resumeThread

produced by NCG, versus

	callq	suspendThread
	movq	%rax, %rsi
	callq	performMajorGC
	movq	%rsi, %rcx
	callq	resumeThread

produced by LLVM.

Last edited 11 months ago by awson

comment:13 in reply to: ↑ 12 Changed 11 months ago by GordonBGood

Replying to awson:

Replying to GordonBGood:

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

Oh, I know this. But over the last few days I fought with a pile of bugs in 64-bit Windows GHC, and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above.

Off the top of my head, when I worked on #8974, I saw this:

	call suspendThread
	addq $40,%rsp
	subq $8,%rsp
	subq $32,%rsp
	movq %rax,%rbx
	movl $0,%eax
	call performMajorGC
	addq $40,%rsp
	movq %rbx,%rcx
	subq $8,%rsp
	subq $32,%rsp
	movl $0,%eax
	call resumeThread

produced by NCG, versus

	callq	suspendThread
	movq	%rax, %rsi
	callq	performMajorGC
	movq	%rsi, %rcx
	callq	resumeThread

produced by LLVM.

Thanks for your reply and explanation. I've seen the same types of things when comparing tight loops under 7.6.3, and more generally just much less efficient use of registers by the NCG as compared to LLVM, with many more register spills and reloads; however, the 32-bit version of 7.8.1 RC2's new NCG is even worse, as per my bug report.

comment:14 in reply to: ↑ 11 Changed 11 months ago by GordonBGood

Replying to jstolarek:

I know nothing about 7.6 Cmm generation, but I worked on the 7.8 Cmm pipeline this summer, so I can offer some guidance. First I'd like to clarify a few things:

Thanks for your explanations and clarifications, some of which I have dug out for myself in the last few minutes; I had already added an edit to the post to which you are replying:

both the NCG and LLVM share the same C-- output

No, they will not. The LLVM back end requires "proc-point splitting". This means we need to turn every Cmm block that is a successor of more than one block into a separate procedure (at least, that is my understanding). See here and here for the sources of the differences between the Cmm generated for the two back ends. Here you'll find more on proc-points.

Learned something new here. OK, the end result is that using LLVM as proof that the problem is in the NCG (on the grounds that they use the same Cmm code) is invalid, so it was worth checking the Cmm code.

the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

I'm not sure what the philosophy behind 7.6 Cmm generation was, but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes straight out of the STG->Cmm pass.

OK, I dug that out for myself and now see that in the final pass the Cmm code is almost identical for the code that triggers the symptom (other than the constants being recorded at 32-bit width for 32-bit registers, where they used to be recorded as 64-bit even though only 32 bits were used).

It may be that the NCG is using the non-optimized Cmm as its source

It is using the optimized version. You can see for yourself in the tryNewCodeGen function and its call site.

OK, so given (almost) identical final Cmm code, the problem is (likely) not in the Cmm code but in the NCG.

Anticipating your next request to look at the STG output

Wrong anticipation :-) Look at the generated assembly. Only that can tell you what the real difference between the generated code is. I wouldn't be surprised to see something along the lines of #8048.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

I had already posted the inner loop from the assembly as part of the original bug report; someone requested that I post the Cmm output, which shows it is not (likely) the problem, and we are back to the NCG, as you confirm here.

Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant.

Yup, you have made that clear: I never want to look at Cmm code ever again.

Finally, if you want to learn more about the Cmm pipeline and Cmm debugging, see Commentary/Compiler/CodeGen.

Hope that helps.

Yes, I found that link on my own, which is how I came to a fuller understanding of the contents of the new Cmm file and its relationship to the final opt-cmm file. But you have been most helpful in explaining that LLVM and the NCG do not (cannot) use the same Cmm source.

comment:15 Changed 11 months ago by GordonBGood

The bug is still there with 7.8.2

comment:16 Changed 11 months ago by GordonBGood

  • Summary changed from Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... to Native Code Generator for 7.8 is not as optimized as 7.6.3...

comment:17 follow-up: Changed 11 months ago by simonmar

  • Milestone set to 7.10.1

I looked at this. The short version is that this isn't going to be easy to fix, and I recommend using LLVM.

The new code generator generates better code for let-no-escapes than the old code generator. Whereas previously a let-no-escape would be a whole function with a fixed calling convention, with the new code generator it is just a label, and to call it we assign local variables for the arguments. All things being equal, this is a better way to do things, because we're not fixing the arguments to particular registers, and we're giving the later phases more flexibility to optimise.
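
As a hedged illustration (a made-up example with hypothetical names, not code from this ticket): in the sketch below, 'go' is a let-no-escape, since every use of it is a saturated tail call, so the new code generator can compile it as a plain label whose "calls" are just assignments to its argument variables followed by a jump.

-- hypothetical sketch of a let-no-escape, analogous to the 'cull'
-- loop in the program above
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    -- 'go' never escapes: it is only ever tail-called, so with the new
    -- code generator it becomes a basic-block label, and 'i' and 'acc'
    -- become local variables that the register allocator must keep in
    -- registers across the loop
    go i acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + i)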

However, the register allocator in our NCG isn't very good, particularly with loops, and it ends up doing a worse job with the new code than the old, even though previously most of the arguments were actually passed on the Haskell stack. I don't think there are any easy tweaks to improve things, unfortunately.

On 64-bit we get away with it because we have more registers.

comment:18 follow-up: Changed 11 months ago by carter

@simon, so the right path forward is someone putting some engineering time into the new ncg register allocator?

comment:19 in reply to: ↑ 17 ; follow-up: Changed 11 months ago by GordonBGood

Replying to simonmar:

The short version is that this isn't going to be easy to fix, and I recommend using LLVM.

In fact, I prefer LLVM because it is more consistent in its timings, but of course that brings us to that other bug for 64-bit Windows, where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5), at least for mingw as per the distributed binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows.

Or perhaps there is something wrong with my build of 64-bit LLVM, even though I followed exactly the same procedure as for generating the 32-bit version from trunk. Don Stewart's suggestion that a working LLVM should really be distributed with GHC is a very good one.

comment:20 in reply to: ↑ 19 Changed 11 months ago by awson

Replying to GordonBGood:

but of course that brings us to that other bug for 64-bit Windows, where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5), at least for mingw as per the distributed binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows.

Or perhaps there is something wrong with my build of 64-bit LLVM, even though I followed exactly the same procedure as for generating the 32-bit version from trunk. Don Stewart's suggestion that a working LLVM should really be distributed with GHC is a very good one.

This is #8974. My patch there makes your program work on 64-bit Windows (although I don't like it because it uses the mangler). It is compatible with the existing 64-bit LLVM 3.4 distributed by the MSYS2 (mingw-w64) project. We could pack this LLVM distro along with the 64-bit GHC binary and configure GHC to use it (then #7143 would also be solved, and I have the patch for that ready), just as we do with gcc/binutils/perl in all Windows distros.

comment:21 in reply to: ↑ 18 Changed 11 months ago by simonmar

Replying to carter:

@simon, so the right path forward is someone putting some engineering time into the new ncg register allocator?

If you mean -fregs-graph, then yes, working on that would be a good way to fix this bug. Right now I believe it has regressed since the switch to the new code generator: #7679

Another angle would be to change the calling convention so that we use %ebx for R1 on 32-bit platforms. Right now we use %esi, which doesn't have an 8-bit version, so we can't let the register allocator use it on 32-bit x86 (because neither register allocator can handle the constraint that this register must not be used for 8-bit calculations). Implementing that constraint handling is more work than changing the calling convention, but changing the calling convention means doing the same thing in LLVM (and we get a dependency on a new version of LLVM, ugh).

Yet another angle would be to make the linear scan allocator sensitive to loops. This might not be too hard, actually: the general plan would be to (a) identify the loops, (b) identify the important loop-carried variables, and (c) assign registers on entry to the loop so that the important variables go into regs.
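
As a rough, hedged sketch of step (a) only (hypothetical 'Label' and 'CFG' types, not GHC's actual Cmm data structures): loop headers can be identified as the targets of back edges found during a depth-first walk of the control-flow graph.

import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Label = Int
type CFG   = M.Map Label [Label]   -- block -> successor blocks

-- Step (a): loop headers are targets of DFS back edges, i.e. successors
-- that are still on the current DFS path when the edge is examined.
loopHeaders :: CFG -> Label -> S.Set Label
loopHeaders cfg entry = snd (go S.empty S.empty entry)
  where
    go path visited b
      | b `S.member` visited = (visited, S.empty)
      | otherwise =
          let path' = S.insert b path
              step (vis, hs) s
                | s `S.member` path' = (vis, S.insert s hs)   -- back edge
                | otherwise          = let (vis', hs') = go path' vis s
                                       in (vis', hs `S.union` hs')
          in foldl step (S.insert b visited, S.empty)
                   (M.findWithDefault [] b cfg)

Steps (b) and (c) would additionally need liveness information to pick the loop-carried variables and pre-assign them registers at loop entry; the sketch stops at loop detection.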

Still, I find it hard to get too excited about any of this when it doesn't affect 64-bit platforms and in any case we can use LLVM.

comment:22 Changed 11 months ago by carter

Which of these plans of attack is most likely (well, until benchmarks say otherwise) to also have a positive impact on 64-bit systems? (I'm similarly neutral about 32-bit performance, but if there's some work that could have positive fallout for both, that'd be a good starting point.)

comment:23 Changed 3 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
