Opened 4 years ago

Closed 2 years ago

#4223 closed bug (fixed)

LLVM slower than NCG, C example

Reported by: dterei
Owned by: dterei
Priority: normal
Milestone: 7.4.1
Component: Compiler (LLVM)
Version: 6.12.3
Keywords:
Cc:
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Difficulty:
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description (last modified by dterei)

The following program is slower when compiled via the LLVM backend than when compiled via the C backend or the NCG.

{-# LANGUAGE BangPatterns #-}

{-
    ghc 6.12.1 -O2
    1.752
-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

------------------------------------------------------------------------

-- Ensure we can store it in an array
instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p

  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
                            pokeElemOff q 0 a
                            pokeElemOff q 1 b
                            pokeElemOff q 2 c
                            pokeElemOff q 3 d
    where
      q = castPtr p

------------------------------------------------------------------------

a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000
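
With the INLINE pragmas above, vector's stream fusion should reduce the V.map in multList to a single tight loop over the Storable buffer. Per element, that loop body is equivalent to the scalar function below (an illustrative sketch with the constants from m and a substituted in; it is not part of the test program):

-- Illustration only: one fused loop iteration, i.e. add (mult v m) a
-- with m = Vec4 0.99 0.7 0.8 0.6 and a = Vec4 0.2 0.1 0.6 1.0.
step :: Vec4 -> Vec4
step (Vec4 x y z w) =
    Vec4 (x * 0.99 + 0.2) (y * 0.7 + 0.1) (z * 0.8 + 0.6) (w * 0.6 + 1.0)

So the benchmark is essentially repCount (10000) passes of this multiply-then-add over arraySize (20000) unboxed Vec4s, and the backend differences below come down to the quality of the generated loop code.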

Timings under Linux/64bit:

  * fasm: 1.502s
  * viac: 1.525s
  * llvm: 1.853s

This isn't universal across all targets, though; for example, we get the following timings on other targets:

Windows/32bit:

  * fasm: 2.178s
  * viac: 3.997s
  * llvm: 1.932s

Linux/32bit:

  * fasm: 5.233s
  * viac: 10.615s
  * llvm: 5.298s
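
For reference, the labels presumably correspond to GHC's three code generators, all at -O2 (an assumption; the exact command lines aren't recorded in the ticket):

  * fasm: the native code generator (-fasm)
  * viac: compilation via C (-fvia-C)
  * llvm: the LLVM backend (-fllvm)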

Change History (9)

comment:1 Changed 4 years ago by dterei

  • Description modified (diff)

comment:2 Changed 4 years ago by dterei

OK, found the problem.

The LLVM opt tool (the LLVM optimiser) actually makes the code worse. Here are the timings again (on a different 64-bit machine than before [msrc64]):

  * fasm: 1.993s
  * viac: 1.965s
  * llvm (O0): 1.935s
  * llvm (O1): 2.358s
  * llvm (O2): 2.366s
  * llvm (O3): 2.366s

Will have to narrow down further which optimisation pass is causing the problem. This also raises again the question of what flags are a good default to pass to the opt tool for Haskell code. Here we see it significantly degrades the performance of the code, while for the nofib benchmark suite it doesn't seem to have any effect. Yet Don found some cases where it made a significant difference... what to do? Perhaps a good default for now is to not use opt at all, since here it hinders the code and for nofib there is no difference. Users can then choose optimisation passes manually for their specific code, which is what good C programmers generally have to do anyway...
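(As a side note, GHC's -optlo flag already passes extra options straight through to opt, so individual opt levels or passes can be tried per build without changing the default.)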

comment:3 Changed 4 years ago by dterei

Here are the timings for the different LLVM opt flags on Linux/32bit:

  * fasm: 5.208s
  * viac: 10.564s
  * llvm (O0): 5.173s
  * llvm (O1): 5.214s
  * llvm (O2): 5.264s
  * llvm (O3): 5.229s

Not as much difference as on 64-bit, but O0 still gives the best performance!

comment:4 Changed 4 years ago by dterei

Here are the different LLVM opt flag timings on Windows/32bit:

  * llvm (O0): 2.033s
  * llvm (O3): 2.041s

Basically no difference (within run-to-run variation). So no worse, but we are still wasting compile time for no gain!

comment:5 Changed 4 years ago by igloo

  • Milestone set to 6.14.1

comment:6 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:7 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:8 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:9 Changed 2 years ago by dterei

  • Resolution set to fixed
  • Status changed from new to closed

Great! LLVM 3.0 fixes this bug!

  * llvm 2.8 (O1): 2.108s
  * llvm 2.8 (O0): 1.787s
  * llvm 2.9 (O1): 2.128s
  * llvm 2.9 (O0): 1.801s
  * llvm 3.0 (O1): 1.760s
  * llvm 3.0 (O0): 1.750s