Opened 12 months ago

Last modified 3 months ago

#8971 new bug

Native Code Generator for 7.8 is not as optimized as 7.6.3...

Reported by: GordonBGood
Owned by:
Priority: normal
Milestone: 7.12.1
Component: Compiler (NCG)
Version: 7.8.1-rc2
Keywords:
Cc: simonmar, gideon@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

The output assembly code from the Windows 32-bit 7.8.1 RC2 compiler is not as well optimized as that from the Windows 7.6.3 compiler (also 32-bit) when the option switches are exactly the same, and the problem may not be limited to the Windows platform. The negative impact on execution time for tight loops is about a factor of two.

The following code will reproduce the problem:

-- GHC_NCG_OptimizationBug.hs
-- it seems the Haskell GHC 7.8.1 Native Code Generator (NCG) doesn't
-- optimize as well for (at least) the x86 target as version 7.6.3

{-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file -dumpdir . #-} -- or O2

import Data.Bits
import Control.Monad.ST (runST,ST(..))
import Data.Array.Base

-- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it.
accNumPrimes :: Int -> Int
accNumPrimes acc = acc `seq` runST $ do
  let bfSz = (256 * 1024 - 3) `div` 2
      bfLmtWrds = (bfSz + 1) `div` 32
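  -- bit index i of the buffer represents the odd number 2*i + 3 (odds-only sieve to 2^18)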
  bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int)
  -- to clear the last "uneven" bit(s)
  unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31)))
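  -- view the same words as an array of Bool bits; castSTUArray reinterprets in place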
  bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int Bool)) bufw
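  -- for each index i the candidate prime is p = 2*i + 3; culling starts at the index of p*p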
  let cullp i =
        let p = i + i + 3 in
        let s = (p * p - 3) `div` 2 in
        if s > bfSz then
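          -- culling finished: count the remaining set bits (odd primes) word by word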
          let count i sm = do
                sm `seq` if i > bfLmtWrds then return (acc + sm) else do
                  wd <- unsafeRead bufw i
                  count (i + 1) (sm + (popCount wd)) in
          count 0 1 -- use '1' for the '2' prime not in the array
        else do
          v <- unsafeRead bufb i
          if v then
            let cull j = do -- very tight inner loop
                  if j > bfSz then cullp (i + 1) else do
                    unsafeWrite bufb j False
                    cull (j + p) in
            cull s
          else cullp (i + 1)
  cullp 0

main =
  -- run the program a number of times to get a reasonable time...
  let numloops = 2000 in
  let loop n acc =
        acc `seq` if n <= 0 then acc else
        loop (n - 1) (accNumPrimes acc) in
  print $ loop numloops 0

The above code takes almost twice as long to run when compiled under 7.8.1 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both 32-bit compilers).

The -ddump-simpl Core dump is almost identical between the two. This is also evidenced by the fact that using the -fllvm LLVM back-end switch with each compiler produces code that runs at about the same speed for both (the LLVM back end would use the same Core output as the NCG, right?).

Under Windows, the compilation and run for 7.8.1 RC2 goes like this:

*Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs"
compile: input file WindowsVsLinuxNCG.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 260, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 213, types: 136, coercions: 52}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 215, types: 148, coercions: 67}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 209, types: 135, coercions: 51}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
*** Float inwards:
Result size of Float inwards
  = {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 502, types: 393, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 428, types: 326, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 475, types: 383, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
  = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
  = {terms: 579, types: 457, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 510, types: 415, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 322, coercions: 9}
*** Core Linted result of Simplifier:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True})
  = {terms: 426, types: 326, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
*** Common sub-expression:
Result size of Common sub-expression
  = {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
  = {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
  = {terms: 1,824, types: 1,259, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 608, types: 422, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 708, types: 505, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 702, types: 499, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Tidy Core:
*** CorePrep:
Result size of CorePrep = {terms: 825, types: 489, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** New CodeGen:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** Assembler:
"E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno-stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp" "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o" "GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
      32,965,096 bytes allocated in the heap
           7,032 bytes copied during GC
          41,756 bytes maximum residency (2 sample(s))
          19,684 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.73s  (  1.73s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.73s  (  1.73s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    19,006,902 bytes per MUT second

  Productivity 100.0% of total user, 100.2% of total elapsed

whereas under version 7.6.3 goes like this:

*Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs"
compile: input file GHC_NCG_OptimizationBug.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 247, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 198, types: 132, coercions: 35}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 200, types: 144, coercions: 43}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 194, types: 131, coercions: 57}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
*** Float inwards:
Result size of Float inwards
  = {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 514, types: 403, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 420, types: 317, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 455, types: 364, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
  = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
  = {terms: 536, types: 427, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 480, types: 391, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 400, types: 306, coercions: 9}
*** Core Linted result of Simplifier:

*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True})
  = {terms: 408, types: 311, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
*** Common sub-expression:
Result size of Common sub-expression
  = {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
  = {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
  = {terms: 1,186, types: 824, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 585, types: 411, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
  = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 746, types: 566, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 739, types: 560, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
  = {terms: 762, types: 546, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Tidy Core:
writeBinIface: 10 Names
writeBinIface: 34 dict entries
*** CorePrep:
Result size of CorePrep = {terms: 779, types: 483, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** CodeGen:
*** Assembler:
"E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o" "GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
      32,989,396 bytes allocated in the heap
           4,976 bytes copied during GC
          41,860 bytes maximum residency (2 sample(s))
          19,580 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.64s  (  0.64s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.66s  (  0.64s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    51,495,642 bytes per MUT second

  Productivity 100.0% of total user, 102.3% of total elapsed

Looking at the ASM dump for the innermost tight culling loop reveals the problem, with 7.8.1 RC2 outputting as follows:

_n3nx:
	movl 76(%esp),%ecx
_c3gf:
	cmpl %ecx,%eax
	jg _c3jB
_c3jC:
	movl %eax,%edx
	sarl $5,%edx
	movl %ecx,76(%esp)
	movl $1,%ecx
	movl %ecx,280(%esp)
	movl %eax,%ecx
	andl $31,%ecx
	movl %eax,292(%esp)
	movl 280(%esp),%eax
	shll %cl,%eax
	xorl $-1,%eax
	movl 64(%esp),%ecx
	addl $8,%ecx
	movl (%ecx,%edx,4),%ecx
	andl %eax,%ecx
	movl 64(%esp),%eax
	addl $8,%eax
	movl %ecx,(%eax,%edx,4)
	movl 292(%esp),%eax
	addl $3,%eax
	jmp _n3nx

and 7.6.3 outputting as follows:

.text
	.align 4,0x90
	.long	1894
	.long	32
s1GZ_info:
_c1YB:
	cmpl 16(%ebp),%esi
	jg _c1YE
	movl %esi,%edx
	sarl $5,%edx
	movl $1,%eax
	movl %esi,%ecx
	andl $31,%ecx
	shll %cl,%eax
	xorl $-1,%eax
	movl 12(%ebp),%ecx
	movl 8(%ecx,%edx,4),%ecx
	andl %eax,%ecx
	movl 12(%ebp),%eax
	movl %ecx,8(%eax,%edx,4)
	addl 4(%ebp),%esi
	jmp s1GZ_info
_c1YE:
	movl 8(%ebp),%esi
	addl $8,%ebp
	jmp s1GB_info

The second listing is clearly much more efficient: the only memory accesses are reading/writing the sieve buffer array plus one register reload of the prime value to add to the current position index, whereas the first (7.8.1 RC2) listing has three register spills and five register reloads, almost as if debugging were still turned on.

This bug was tested under Windows, but it likely applies to other platforms as well, at least in their 32-bit versions and possibly beyond.

Change History (23)

comment:1 Changed 12 months ago by simonmar

  • Cc simonmar added
  • Component changed from Compiler to Compiler (NCG)

comment:2 Changed 12 months ago by gidyn

  • Cc gideon@… added

comment:3 Changed 12 months ago by ezyang

Doesn't seem to be a problem on 64-bit:

.Lc4cz:
        cmpq %r8,%rbx
        jg .Lc4cJ
.Lc4cK:
        movq %rbx,%rdx
        sarq $6,%rdx
        movl $1,%edi
        movq %rcx,%r9
        movq %rbx,%rcx
        andq $63,%rcx
        shlq %cl,%rdi
        xorq $-1,%rdi
        leaq 16(%rax),%rcx
        movq (%rcx,%rdx,8),%rcx
        andq %rdi,%rcx
        leaq 16(%rax),%rdi
        movq %rcx,(%rdi,%rdx,8)
        addq %r9,%rbx
.Ln4g8:
        movq %r9,%rcx
        jmp .Lc4cz

What happens if you use -fregs-graph? If you could post C-- dumps, that would also be helpful.

comment:4 Changed 12 months ago by awson

I didn't go into details, but it looks like -fregs-graph doesn't help. The 7.8.1-compiled executable is still much slower than the 7.6.3-compiled one.

comment:5 Changed 12 months ago by awson

I've checked things on 32-bit Linux. It suffers too.

comment:6 Changed 11 months ago by awson

Also, while on 32-bit the difference between NCG and LLVM is *huge*, the difference on 64-bit (Windows) is also pretty large: about 55% on Yorkfield and 40% on Sandy Bridge.

Last edited 11 months ago by awson

comment:7 Changed 11 months ago by awson

The picture is the same for 64-bit Linux with ghc-7.8.1 and LLVM 3.4. The NCG loses badly.

comment:8 follow-ups: Changed 11 months ago by GordonBGood

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

This bug is about why the GHC 7.8.1 NCG has taken a step backward in performance from the previous stable version 7.6.3, due to more register spills and reloads.

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 11 months ago by tibbe

Replying to GordonBGood:

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

That's a good indication that the NCG is to blame, but it's also possible that the Cmm codegen has regressed and some LLVM optimizations make up for the regression.

comment:10 in reply to: ↑ 9 Changed 11 months ago by GordonBGood

Replying to tibbe:

Replying to GordonBGood:

I only referred to LLVM as proof that the problem seems to be limited to the NCG, as both the NCG and LLVM share the same C-- output (or at least I think so???), yet the NCG shows this step backwards whereas LLVM does not.

That's a good indication that the NCG is to blame, but it's also possible that the Cmm codegen has regressed and some LLVM optimizations make up for the regression.

Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and -ddump-opt-cmm switches to the compilation, with the following results:

For version 7.6.3, the Cmm code for the loops looks like this:

 s1Hq_ret()
         { label: s1Hq_info
           rep:StackRep [True, False, True, True, True]
         }
     c1XX:
         _c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]);
         ;
         if (_c1Tr::I32 >= 1) goto c1XZ;
         _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
         _s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)];
         _s1KJ::I32 = R1;
         _s1Hh::I32 = _s1KJ::I32 & 31;
         _s1Hj::I32 = _s1Hh::I32;
         _s1KI::I32 = 1 << _s1Hj::I32;
         _s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615;
         _s1KH::I32 = _s1He::I32 & _s1Hm::I32;
         I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32;
         _s1KG::I32 = R1 + 3;
         R1 = _s1KG::I32;
         jump s1Hq_info; // [R1]
     c1XZ:
         R1 = 1;
         jump s1HY_info; // [R1]
 },

and the opt-cmm code for about the same area looks like this:

s1Hq_ret()
        { Just s1Hq_info:
                   const 933;
                   const 32;
        }
    c1XX:
        ;
        if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] = I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] & (1 << R1 & 31) ^ 18446744073709551615;
        R1 = R1 + 3;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
}

For version 7.8.1 RC2, the Cmm code is about eight times larger (about eight times the number of lines) and looks like this:

  c3jO:
      _c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32);
      _s33e::I32 = _c3jQ::I32;
      if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ;
  c3jY:
      _s32S::I32 = _s335::I32;
      goto c3gp;
  c3jZ:
      _c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
      _s33g::I32 = _c3kn::I32;
      _s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)];
      _s33j::I32 = _s33j::I32;
      _c3kq::I32 = _s33c::I32;
      _s33k::I32 = _c3kq::I32;
      _c3kt::I32 = _s33k::I32 & 31;
      _s33l::I32 = _c3kt::I32;
      _c3kw::I32 = _s33l::I32;
      _s33m::I32 = _c3kw::I32;
      _c3kz::I32 = 1 << _s33m::I32;
      _s33n::I32 = _c3kz::I32;
      _c3kC::I32 = _s33n::I32 ^ 4294967295;
      _s33o::I32 = _c3kC::I32;
      _c3kF::I32 = _s33j::I32 & _s33o::I32;
      _s33p::I32 = _c3kF::I32;
      I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32;
      _c3kK::I32 = _s33c::I32 + _s32U::I32;
      _s33r::I32 = _c3kK::I32;
      _s33c::I32 = _s33r::I32;
      goto c3jO;

and the optimized opt-cmm code looks like this:

      c3jO:
          if (%MO_S_Gt_W32(_s33c::I32,
                           _s32t::I32)) goto c3jY; else goto c3jZ;
      c3jY:
          Sp = Sp + 8;
          _s32S::I32 = _s335::I32;
          goto c3gp;
      c3jZ:
          _s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
          I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] & (1 << _s33c::I32 & 31) ^ 4294967295;
          _s33c::I32 = _s33c::I32 + _s32U::I32;
          goto c3jO;

It appears to me that the opt-cmm code is about the same, but the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

It may be that the NCG is using the non-optimized Cmm as its source while the LLVM back end is using the optimized Cmm, which would explain why the LLVM-generated code is still efficient where the NCG-generated code is not.

Anticipating your next request to look at the STG output, I used the -ddump-stg compiler option to examine that code. It is too wordy (with very long lines) to post here, but a quick examination shows it to be about the same for both versions, except that internal constants are recorded as 32-bit numbers in 7.8.1 whereas they were recorded as 64-bit numbers in 7.6.3 even when referring to 32-bit registers; this corresponds to the way they are recorded in both the optimized and non-optimized Cmm files, as listed above.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

EDIT_ADD: While looking into what the contents of the Cmm dump file mean, I found https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeGen which shows what, when, and where optimizations are done for the different sections. Poking around further, I find that there are optimized sections within the large file which are about as optimized as what I now understand to be the final optimized Cmm file. Therefore, as long as the NCG is using the right section (or right stage), the Cmm output is just as optimized for version 7.8.1 as for version 7.6.3, and the problem must be in the NCG.

Last edited 11 months ago by GordonBGood

comment:11 follow-up: Changed 11 months ago by jstolarek

I know nothing about 7.6 Cmm generation, but I worked on the 7.8 Cmm pipeline this summer, so I can offer some guidance. First I'd like to clarify a few things:

both the NCG and LLVM share the same C-- output

No, they will not. The LLVM back end requires "proc-point splitting". This means we need to turn every Cmm block that is a successor of more than one block into a separate procedure (at least, that is my understanding). See here and here for the sources of the differences between the Cmm generated for the two back ends. Here you'll find more on proc-points.

the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

I'm not sure what the philosophy behind 7.6 Cmm generation was, but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes straight out of the STG->Cmm pass.

It may be that the NCG is using the non-optimized Cmm as its source

It is using the optimized version. You can see for yourself in the tryNewCodeGen function and its call site.

Anticipating your next request to look at the STG output

Wrong anticipation :-) Look at the generated assembly. Only that can tell you what the real difference between the generated code is. I wouldn't be surprised to see something along the lines of #8048.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant.

Finally, if you want to learn more about the Cmm pipeline and Cmm debugging, see Commentary/Compiler/CodeGen.

Hope that helps.

Last edited 11 months ago by jstolarek

comment:12 in reply to: ↑ 8 ; follow-up: Changed 11 months ago by awson

Replying to GordonBGood:

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

Oh, I know this. But over the last few days I fought with a pile of bugs in 64-bit Windows GHC, and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above.

Off the top of my head, when I worked on #8974, I saw this:

	call suspendThread
	addq $40,%rsp
	subq $8,%rsp
	subq $32,%rsp
	movq %rax,%rbx
	movl $0,%eax
	call performMajorGC
	addq $40,%rsp
	movq %rbx,%rcx
	subq $8,%rsp
	subq $32,%rsp
	movl $0,%eax
	call resumeThread

produced by NCG, versus

	callq	suspendThread
	movq	%rax, %rsi
	callq	performMajorGC
	movq	%rsi, %rcx
	callq	resumeThread

produced by LLVM.

Last edited 11 months ago by awson

comment:13 in reply to: ↑ 12 Changed 11 months ago by GordonBGood

Replying to awson:

Replying to GordonBGood:

This bug has nothing to do with the difference between the LLVM and NCG compiler back ends; it is well known that the NCG isn't that efficient for tight loops. For this bug's test code, the NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction, whereas LLVM does, which explains the difference between them.

Oh, I know this. But over the last few days I fought with a pile of bugs in 64-bit Windows GHC, and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above.

Off the top of my head, when I worked on #8974, I saw this:

	call suspendThread
	addq $40,%rsp
	subq $8,%rsp
	subq $32,%rsp
	movq %rax,%rbx
	movl $0,%eax
	call performMajorGC
	addq $40,%rsp
	movq %rbx,%rcx
	subq $8,%rsp
	subq $32,%rsp
	movl $0,%eax
	call resumeThread

produced by NCG, versus

	callq	suspendThread
	movq	%rax, %rsi
	callq	performMajorGC
	movq	%rsi, %rcx
	callq	resumeThread

produced by LLVM.

Thanks for your reply and explanation. I've seen the same types of things when comparing tight loops under 7.6.3, and more generally just much less efficient use of registers by the NCG as compared to LLVM, with many more register spills and reloads; however, the 32-bit version of 7.8.1 RC2's new NCG is even worse, as per my bug report.

comment:14 in reply to: ↑ 11 Changed 11 months ago by GordonBGood

Replying to jstolarek:

I know nothing about 7.6 Cmm generation, but I worked on the 7.8 Cmm pipeline this summer, so I can offer some guidance. First I'd like to clarify a few things:

Thanks for your explanations and clarifications, some of which I have dug out for myself in the last few minutes; I had already added an edit to the post to which you are replying:

both the NCG and LLVM share the same C-- output

No, they will not. The LLVM back end requires "proc-point splitting". This means we need to turn every Cmm block that is a successor of more than one block into a separate procedure (at least, that is my understanding). See here and here for the sources of the differences between the Cmm generated for the two back ends. Here you'll find more on proc-points.

Learned something new here. OK, the end result is that using LLVM as proof that the problem is in the NCG (on the grounds that they use the same Cmm code) is invalid, so it was worth checking the Cmm code.

the straight Cmm dump has regressed, losing even the basic optimizations that were there with the older version.

I'm not sure what the philosophy behind 7.6 Cmm generation was, but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes straight out of the STG->Cmm pass.

OK, I dug that out for myself and now see that in the final pass the Cmm code is almost identical for the code that triggers the symptom (other than the constants being recorded at 32-bit width for 32-bit registers, where they used to be recorded as 64-bit even though only 32 bits were used).

It may be that the NCG is using the non-optimized Cmm as its source

It is using the optimized version. You can see for yourself in the tryNewCodeGen function and its call site.

OK, so given (almost) identical final Cmm code, the problem is (likely) not in the Cmm code but in the NCG.

Anticipating your next request to look at the STG output

Wrong anticipation :-) Look at the generated assembly. Only that can tell you what the real difference between the generated code is. I wouldn't be surprised to see something along the lines of #8048.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm code as input), to the Cmm code generator itself, in that it is producing much less efficient Cmm code.

I had already posted the inner loop from the assembly as part of the original bug report; someone requested that I post the Cmm output, which shows it is not (likely) the problem, and we are back to the NCG, as you confirm here.

Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant.

Yup, you have made that clear: I never want to look at Cmm code ever again.

Finally, if you want to learn more about the Cmm pipeline and Cmm debugging, see Commentary/Compiler/CodeGen.

Hope that helps.

Yes, I found that link on my own, which is how I came to a fuller understanding of the contents of the new Cmm file and its relationship to the final opt-cmm file. But you have been most helpful in explaining that LLVM and the NCG do not (cannot) use the same Cmm source.

comment:15 Changed 11 months ago by GordonBGood

The bug is still there with 7.8.2

comment:16 Changed 11 months ago by GordonBGood

  • Summary changed from Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... to Native Code Generator for 7.8 is not as optimized as 7.6.3...

comment:17 follow-up: Changed 11 months ago by simonmar

  • Milestone set to 7.10.1

I looked at this. The short version is that this isn't going to be easy to fix, and I recommend using LLVM.

The new code generator generates better code for let-no-escapes than the old code generator. Whereas previously a let-no-escape would be a whole function with a fixed calling convention, with the new code generator it is just a label, and to call it we assign local variables for the arguments. All things being equal, this is a better way to do things, because we're not fixing the arguments to particular registers, and we're giving the later phases more flexibility to optimise.
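
As a hedged illustration (a made-up example with hypothetical names, not code from this ticket): in the sketch below, 'go' is a let-no-escape, since every use of it is a saturated tail call, so the new code generator can compile it as a plain label whose "calls" are just assignments to its argument variables followed by a jump.

-- hypothetical sketch of a let-no-escape, analogous to the 'cull'
-- loop in the program above
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    -- 'go' never escapes: it is only ever tail-called, so with the new
    -- code generator it becomes a basic-block label, and 'i' and 'acc'
    -- become local variables that the register allocator must keep in
    -- registers across the loop
    go i acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + i)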

However, the register allocator in our NCG isn't very good, particularly with loops, and it ends up doing a worse job with the new code than the old, even though previously most of the arguments were actually passed on the Haskell stack. I don't think there are any easy tweaks to improve things, unfortunately.

On 64-bit we get away with it because we have more registers.

comment:18 follow-up: Changed 11 months ago by carter

@simon, so the right path forward is someone putting some engineering time into the new ncg register allocator?

comment:19 in reply to: ↑ 17 ; follow-up: Changed 11 months ago by GordonBGood

Replying to simonmar:

The short version is that this isn't going to be easy to fix, and I recommend using LLVM.

In fact, I prefer LLVM because it is more consistent in its timings, but of course that brings us to that other bug for 64-bit Windows, where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5), at least for mingw as per the distributed binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows.

Or perhaps there is something wrong with my build of 64-bit LLVM, even though I followed exactly the same procedure as for generating the 32-bit version from trunk. Don Stewart's suggestion that a working LLVM should really be distributed with GHC is a very good one.

comment:20 in reply to: ↑ 19 Changed 11 months ago by awson

Replying to GordonBGood:

but of course that brings us to that other bug for 64-bit Windows, where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5), at least for mingw as per the distributed binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows.

Or perhaps there is something wrong with my build of 64-bit LLVM, even though I followed exactly the same procedure as for generating the 32-bit version from trunk. Don Stewart's suggestion that a working LLVM should really be distributed with GHC is a very good one.

This is #8974. My patch there makes your program work on 64-bit Windows (although I don't like it because it uses the mangler). It is compatible with the existing 64-bit LLVM 3.4 distributed by the MSYS2 (mingw-w64) project. We could pack this LLVM distro along with the 64-bit GHC binary and configure GHC to use it (then #7143 would also be solved, and I have the patch for that ready), just as we do with gcc/binutils/perl in all Windows distros.

comment:21 in reply to: ↑ 18 Changed 11 months ago by simonmar

Replying to carter:

@simon, so the right path forward is someone putting some engineering time into the new ncg register allocator?

If you mean -fregs-graph, then yes, working on that would be a good way to fix this bug. Right now I believe it has regressed since the switch to the new code generator: #7679

Another angle would be to change the calling convention so that we use %ebx for R1 on 32-bit platforms. Right now we use %esi, which doesn't have an 8-bit version, so we can't let the register allocator use it on 32-bit x86 (because neither register allocator can handle the constraint that this register must not be used for 8-bit calculations). Implementing that constraint handling is more work than changing the calling convention, but changing the calling convention means doing the same thing in LLVM (and we get a dependency on a new version of LLVM, ugh).

Yet another angle would be to make the linear scan allocator sensitive to loops. This might not be too hard, actually: the general plan would be to (a) identify the loops, (b) identify the important loop-carried variables, and (c) assign registers on entry to the loop so that the important variables go into regs.
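
As a rough, hedged sketch of step (a) only (hypothetical 'Label' and 'CFG' types, not GHC's actual Cmm data structures): loop headers can be identified as the targets of back edges found during a depth-first walk of the control-flow graph.

import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Label = Int
type CFG   = M.Map Label [Label]   -- block -> successor blocks

-- Step (a): loop headers are targets of DFS back edges, i.e. successors
-- that are still on the current DFS path when the edge is examined.
loopHeaders :: CFG -> Label -> S.Set Label
loopHeaders cfg entry = snd (go S.empty S.empty entry)
  where
    go path visited b
      | b `S.member` visited = (visited, S.empty)
      | otherwise =
          let path' = S.insert b path
              step (vis, hs) s
                | s `S.member` path' = (vis, S.insert s hs)   -- back edge
                | otherwise          = let (vis', hs') = go path' vis s
                                       in (vis', hs `S.union` hs')
          in foldl step (S.insert b visited, S.empty)
                   (M.findWithDefault [] b cfg)

Steps (b) and (c) would additionally need liveness information to pick the loop-carried variables and pre-assign them registers at loop entry; the sketch stops at loop detection.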

Still, I find it hard to get too excited about any of this when it doesn't affect 64-bit platforms and in any case we can use LLVM.

comment:22 Changed 11 months ago by carter

Which of these plans of attack is most likely (well, until benchmarks say otherwise) to also have a positive impact on 64-bit systems? (I'm similarly neutral about 32-bit performance, but if there's some work that could have positive fallout for both, that'd be a good starting point.)

comment:23 Changed 3 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
