Opened 4 months ago

Last modified 3 months ago

#14013 new bug

Bad monads performance

Reported by: danilo2 Owned by: simonpj
Priority: high Milestone: 8.4.1
Component: Compiler Version: 8.2.1-rc3
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by danilo2)

Hi! We've been struggling with a very strange GHC behavior on IRC today. Let's consider the following code (needs mtl and criterion to be compiled):

module Main where

import Prelude
import Criterion.Main
import qualified Control.Monad.State.Strict as Strict
import qualified Control.Monad.State.Class  as State
import Control.DeepSeq (NFData, rnf, force)
import GHC.IO          (evaluate)
import Data.Monoid


-----------------------------
-- === Criterion utils === --
-----------------------------

eval :: NFData a => a -> IO a
eval = evaluate . force ; {-# INLINE eval #-}

liftExp :: (Int -> a) -> (Int -> a)
liftExp f = f . (10^) ; {-# INLINE liftExp #-}

expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a)
expCodeGen f i = do
    putStrLn $ "generating input code (10e" <> show i <> " chars)"
    out <- eval $ liftExp f i
    putStrLn "code generated sucessfully"
    return out
{-# INLINE expCodeGen #-}

expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int -> Benchmark
expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) . nf p ; {-# INLINE expCodeGenBench #-}


-------------------------------
-- === (a*) list parsing === --
-------------------------------

genList_a :: Int -> [Char]
genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}

pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
    'a':s -> pureListParser_a s
    []    -> True
    _     -> False
{-# INLINE pureListParser_a #-}

mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
    'a':s -> State.put s >> mtlStateListParser_a
    []    -> return True
    _     -> return False
{-# INLINE mtlStateListParser_a #-}

mtlStateListParser_a_typed :: Strict.State [Char] Bool
mtlStateListParser_a_typed = State.get >>= \case
    'a':s -> State.put s >> mtlStateListParser_a_typed
    []    -> return True
    _     -> return False
{-# INLINE mtlStateListParser_a_typed #-}

mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
    go = Strict.get >>= \case
        'a':s -> Strict.put s >> go
        []    -> return True
        _     -> return False
{-# INLINE mtlStateListParser_a_let #-}


{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}
{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}


main = do
    defaultMain
        [ bgroup "a*" $
            [ bgroup "pure"                    $ expCodeGenBench genList_a pureListParser_a                              <$> [6..6]
            , bgroup "mtl.State.Strict"        $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a)       <$> [6..6]
            , bgroup "mtl.State.Strict typed"  $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6]
            , bgroup "mtl.State.Strict let"    $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_let)   <$> [6..6]
            ]
        ]

The code was compiled with following options (and many other variations): -threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-factor=1000 -flate-dmd-anal

Everything in this code has INLINE pragma. The important part we should focus on are these two functions:

pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
    'a':s -> pureListParser_a s
    []    -> True
    _     -> False
{-# INLINE pureListParser_a #-}

mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
    'a':s -> State.put s >> mtlStateListParser_a
    []    -> return True
    _     -> return False
{-# INLINE mtlStateListParser_a #-}

Which are just "parsers" accepting strings containing only 'a' characters. The former is pure one, while the later uses State to keep the remaining input. The following list contains performance related observations:

  1. For the rest of the points, let's call the performance of pureListParser_a a "good" one and everything worse a "bad" one.
  1. The performance of mtlStateListParser_a is bad, it runs 10 times slower than pureListParser_a. Inspecting CORE we can observe that GHC jumps between (# a,b #) and (a,b) representations all the time.
  1. If we add a specialize pragma {-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}, the performance of mtlStateListParser_a is good (exactly the same as pureListParser_a).
  1. If we do NOT use specialize pragma, but we use explicite, non-polymorphic type signature mtlStateListParser_a_typed :: Strict.State [Char] Bool, the performance is bad (!), identical to the polymorphic version without specialization.
  1. If we use SPECIALIZE pragma together with explicite, non-polymorphic type, so we use BOTH mtlStateListParser_a_typed :: Strict.State [Char] Bool AND {-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-} we get the good performance.
  1. If we transform pureListParser_a to
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
    go = Strict.get >>= \case
        'a':s -> Strict.put s >> go
        []    -> return True
        _     -> return False
{-# INLINE mtlStateListParser_a_let #-}

we again get the good performance without the need to use SPECIALIZE pragmas.

  1. The performance of all the functions that are not optimized as good as pureListParser_a is a lot worse in GHC 8.2.1-rc3 than in 8.0.2.
  1. The not-yet documented flag -fspecialise-aggressively does NOT affect the results (https://ghc.haskell.org/trac/ghc/ticket/12463).
  1. If you do NOT use INLINE pragma on functions mtlStateListParser_a and mtlStateListParser_a_typed their performance is good (so INLINE pragma makes it bad until we provide explicit specialization). Moreover, if we use INLINABLE pragma instead of INLINE on these functions (which logically makes more sense, because they are recursive), performance of the polymorphic one mtlStateListParser_a is good, while performance of the explicitly typed mtlStateListParser_a_typed is bad until we provide explicite specialization.

The above points raise the following questions:

  1. Why GHC does not optimize mtlStateListParser_a the same way as pureListParser_a and where the jumping between (# a,b #) and (a,b) comes from?
  1. Is there any way to tell GHC to automatically insert SPECIALIZE pragmas, especially in performance critical code?
  1. Why providing very-explicite type signature mtlStateListParser_a_typed :: Strict.State [Char] Bool does not solve the problem unless we use SPECIALIZE pragma that tells the same as the signature? (GHC even warns: SPECIALISE pragma for non-overloaded function ‘mtlStateListParser_a_typed’ but it affects performance.)
  1. Why the trick to alias the body of recursive function to a local variable go affects the performance in any way, especially when it does NOT bring any variable to the local let scope?

We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several people reported exactly the same observations.

Attachments (2)

simpl-INLINE-patch (7.9 KB) - added by simonpj 3 months ago.
WIP on floating from stable unfoldings
occ-anal-rules-patch (13.4 KB) - added by simonpj 3 months ago.
WIP on occurrence analysis and rules

Download all attachments as: .zip

Change History (15)

comment:1 Changed 4 months ago by danilo2

Description: modified (diff)

comment:2 Changed 4 months ago by danilo2

Description: modified (diff)

comment:3 Changed 4 months ago by danilo2

Description: modified (diff)

comment:4 Changed 4 months ago by danilo2

Description: modified (diff)

comment:5 Changed 4 months ago by danilo2

Description: modified (diff)

comment:6 Changed 4 months ago by bgamari

Description: modified (diff)

Very interesting; this will be an interesting thing to mull over tomorrow morning.

comment:7 Changed 4 months ago by danilo2

Description: modified (diff)

comment:8 Changed 4 months ago by danilo2

I added new observation - point 8.

comment:9 Changed 4 months ago by simonpj

I have not worked all of this out, but I have learned something.

You have this INLINE pragma:

{-# INLINE mtlStateListParser_a #-}

It does nothing, because the function is recursive. But alas, it harms things a lot. Simply removing that INLINE pragma makes mtlStateListParser_a behave well in all settings, I think. Can you confirm that?

I believe that the reason things go bad is this. GHC does this transformation (always):

      f = e |> co
===>
      f' = e
      f  = f' |> co

Reason: f can now be inlined at all use sites, and co may cancel. But if the original f has an INLINE pragma we get

      f = e |> co
        { INLINE = <inline rhs> }
===>
      f' = e
      f  = f' |> co
        { INLINE = <inline rhs> }

where the { INLINE = <inline rhs> } is the (stable, user-written) inlining for f. Now the point of the transformation is lost, becuase f won't be replaced at its use sites by f' |> co; the INLINE pragma is what gets inlined.

Moreover, if <inline rhs> and e both mention f, then f becomes a loop breaker and we get mutual recusion between f and f'. This what ultimately leads to the alternation between (,) and (##) you observed. Solution (I think) don't do this transformation if f has an INLINE pragma. I'll try that.

comment:10 Changed 4 months ago by danilo2

@simon, you are right. Removing INLINE makes these two functions run with fine performance. It is already described in point 8 in the ticket description. Please note, that point 8 tells about probably related problem with INLINABLE pragma.

comment:11 Changed 4 months ago by simonpj

Sigh. This has turned out to be much nastier than I expected. I worked solely on

import qualified Control.Monad.State.Strict as Strict
import qualified Control.Monad.State.Class  as State

mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
    'a':s -> State.put s >> mtlStateListParser_a
    []    -> return True
    _     -> return False
{-# INLINE mtlStateListParser_a #-}

foo :: [Char] -> Bool
foo = Strict.evalState mtlStateListParser_a

I'll refer to mtlStateListParser_a as msp.

  • Yes, comment:9 is right; the right path is to make doFloatFromRhs return False for bindings with a stable unfolding.
  • Even when that is done, the occurrence analyser does a bad job. We get
    Rec { msp = ... lvl ...
          {-# INLINE = ..msp.. #-}   -- The stable unfolding
    
          lvl = ...msp...
        }
    
    The occurrence analyser treats the occurrence of lvl as a "weak" reference, and so sorts into SCCs thus: Rec{ msp }, NonRec { lvl }. So then it stupidly marks msp as a loop breaker, and lvl as a weak loop breaker. In this case they'd be better in one SCC, in which case we'd pick msp (but not lvl) as a loop breaker. The relevant change is in OccurAnal, around line 1280.
        -- Find the "nd_inl" free vars; for the loop-breaker phase
        inl_fvs = udFreeVars bndr_set rhs_usage1 `unionVarSet`
                  case mb_unf_uds of
                    Nothing -> emptyVarSet -- udFreeVars bndr_set rhs_usage1 -- No INLINE, use RHS
                    Just unf_uds -> udFreeVars bndr_set unf_uds
    
    But I'm not fully confident of this change.
  • Even if we fix that, then the strictness analyser fails. We end up with
      msp = (\ (s::[Char]). case s of
                               p1 -> (False, x)
                               p2 -> (msp |> sym co) s'
            ) |> co
    
    Those casts are enough to kill demand analysis. It was relying on the coercion-floating that we nuked in comment:9! The function looks to the demand analyser as if it has arity zero, and so we get no useful strictness.

Yes, we could teach the demand analyser more tricks, but the tail is beginning to wag the dog.

  • This is all stupid. An INILNE pragma on a recursive function is doing no good at all. Maybe we should just discard it. And indeed that makes things work.
  • Until you use an INLINABLE pragma! We don't want to discard the INLINEABLE pragama on a recursive function -- it is super-useful. But if we don't the same ills happen as with INLINE.

Actually, the specialiser propagates an INLINE pragma to the specialised function, but does not propagate an INLINEABLE pragam. Result: if you give an overloaded signature for msp, the specialiser will create a pragma-free specialised version, which will optimise nicely. But if you give a non-overloaded signature msp :: Strict.State [Char] Bool, the function fails to optimise for the reasons above. Mind you, in the latter case the INLINEABLE pragma is just as useless as the INLINE pragma was.

This is ridiculously terrible. The pragmas (which are there to optimise the program) are getting in the way of optimising the function itself. What to do?

Here's a simple idea;

  • Discard INLINE pragmas for recursive, or mutually recursive, functions. (You can do this manually too!)
  • Peel off a top-level function for INLINEABLE pragmas, thus:
      Rec { f = e[f] {-# INLINEABLE = e[f] #-} }
    ===>
      Rec { f' = e[f'] }
      Rec { f = f' {-# INLINEABLE = e[f] #-} }
    
    The first Rec is a pragma-free group. The second has all its pragmas (for later clients), but just indirects to the first group if you actually call it.

Alas, you can't do this manually right now.

But somehow none of this really feels right. I'm not sure what to do, so I'm just brain-dumping this. Maybe someone else will have better ideas

Last edited 4 months ago by simonpj (previous) (diff)

Changed 3 months ago by simonpj

Attachment: simpl-INLINE-patch added

WIP on floating from stable unfoldings

Changed 3 months ago by simonpj

Attachment: occ-anal-rules-patch added

WIP on occurrence analysis and rules

comment:12 Changed 3 months ago by simonpj

The attached patches are not finished; they were just WIP related to the comments above.

comment:13 Changed 3 months ago by bgamari

Milestone: 8.4.1
Owner: set to simonpj
Note: See TracTickets for help on using tickets.