Opened 14 months ago

Last modified 2 months ago

#14610 new bug

newtype wrapping of a monadic stack kills performance

Reported by: mrkkrp Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler Version: 8.2.2
Keywords: JoinPoints Cc: dfeuer, nomeata
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #14620 Differential Rev(s):
Wiki Page:

Description (last modified by mrkkrp)

I have a project where I in one module (A) I decided to build something like a minimal framework for the other bigger module (B) to use. One of the main points of the framework is that the monadic stacks are hidden behind newtypes and in those monads you can only use the functions that the module A provides.

The module A can be found here:

https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser/Internal.hs

There are two monadic stack wrapped with newtypes: BParser and IParser.

The module B is this:

https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser.hs

But it's not really relevant.

Now if I change newtypes to type synonyms like so:

type BParser a = ParsecT MMarkErr Text (State BlockState) a
type IParser a = StateT InlineState (Parsec MMarkErr Text) a

and do corresponding minor corrections, I get 2x less allocations and almost 2x faster code (before):

Case                                           Allocated  GCs     Max
with file: data/bench-yaml-block.md              119,080    0  11,088
with file: data/bench-thematic-break.md           74,368    0  10,224
with file: data/bench-heading.md                 901,928    0   9,432
with file: data/bench-fenced-code-block.md       145,744    0   9,368
with file: data/bench-indented-code-block.md     124,312    0   9,368
with file: data/bench-unordered-list.md        2,010,496    1  10,784
with file: data/bench-ordered-list.md          2,025,016    1  10,728
with file: data/bench-blockquote.md            1,961,672    1  42,648
with file: data/bench-paragraph.md             2,084,104    1  42,648
with file: data/comprehensive.md              25,899,496   24  44,200

benchmarking with file: data/comprehensive.md
time                 3.590 ms   (3.578 ms .. 3.601 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.555 ms   (3.546 ms .. 3.565 ms)
std dev              31.07 μs   (24.63 μs .. 39.90 μs)

After:

Case                                           Allocated  GCs     Max
with file: data/bench-yaml-block.md              116,864    0  11,088
with file: data/bench-thematic-break.md           64,776    0  10,392
with file: data/bench-heading.md                 615,672    0   9,432
with file: data/bench-fenced-code-block.md       144,736    0   9,368
with file: data/bench-indented-code-block.md     123,352    0   9,368
with file: data/bench-unordered-list.md          795,072    0  41,568
with file: data/bench-ordered-list.md            808,808    0  41,512
with file: data/bench-blockquote.md              826,392    0  41,568
with file: data/bench-paragraph.md               881,432    0  41,568
with file: data/comprehensive.md              10,945,104   10  44,440

benchmarking with file: data/comprehensive.md
time                 2.451 ms   (2.448 ms .. 2.456 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.432 ms   (2.427 ms .. 2.437 ms)
std dev              15.86 μs   (13.16 μs .. 19.01 μs)

I'm not great at reading non-trivial core, but I gave it a shot and dumped some core. One thing I noticed that core for the module A is the same in both cases. Then the problem is an inter-module problem, probably like when you don't dump definitions into interface files with INLINEABLE and end up without specialization, but here it perhaps has to do with the fact that B has no information about internals of the monadic stacks from A, but I'm not sure.

Here is the beginnings of core dumps (before):

==================== Tidy Core ====================
2017-12-23 04:55:17.967944505 UTC

Result size of Tidy Core
  = {terms: 210,169,
     types: 209,675,
     coercions: 82,492,
     joins: 973/3,753}

After:

==================== Tidy Core ====================
2017-12-23 04:58:46.386265108 UTC

Result size of Tidy Core
  = {terms: 301,256,
     types: 263,104,
     coercions: 28,560,
     joins: 1,726/5,393}

So it looks like newtype wrapping reduces GHC's ability to create join points SPJ talked about recently.

I do understand that you expect a minimal reproducing example but I could not produce it even though I spent several hours in futile attempts. I started by creating a small project, defining a similar stack with a newtype wrapper and using it in a simple parser in another module. Then I benchmarked the parser. There is no difference between newtyped and code and the code that uses just type synonyms. I tried different variations, no difference.

The core output is too big for me to analyze, it's like 25 Mb and 33 Mb, and I have no idea what's going on there.

You're welcome to experiment with the repo, there are benchmarks for memory usage and criterion benchmarks for speed of execution:

https://github.com/mrkkrp/mmark

I have created two branches I won't touch, one with newtypes and another one with type synonyms:

Just checkout one of these and run stack bench.

This is the commit that changes newtypes to type synonyms:

https://github.com/mrkkrp/mmark/commit/759d8d4aa52dd57a393299c63e8c9b70d0d43290

I'm submitting this because my friend convinced me that it's better to let you know (even though I could not create a minimal reproducing example on my own) than to let it go completely unnoticed.

Change History (25)

comment:1 Changed 14 months ago by mrkkrp

Description: modified (diff)

comment:2 Changed 14 months ago by nomeata

So it looks like newtype wrapping reduces GHC's ability to create join points SPJ talked about recently.

Without looking at the code: This is a good guess. foo x = if … then foo (x+1) else … is a tail call (and hence allows foo to be a join point) while foo x = if … then foo (x+1) cast coercion … else … is not. But (from an operational point of view) it could well be!

But someone would have to produce a minimal example to be sure this is really the issue.

comment:3 Changed 14 months ago by dfeuer

This comment has been removed in favor of opening #14620 as nomeata suggests below.

Last edited 14 months ago by dfeuer (previous) (diff)

comment:4 Changed 14 months ago by dfeuer

Cc: dfeuer added
Milestone: 8.6.1
Type of failure: None/UnknownRuntime performance bug

comment:5 Changed 14 months ago by dfeuer

Here's an example that doesn't quite go far enough to demonstrate the problem, but seems closer in spirit to the original.

newtype D a = D {getD :: a}
d :: a -> D a
d a = D a

baz :: D Int -> Int -> D Int
baz y x0 = foo x0
  where
    foo :: Int -> D Int
    foo 0 = y
    foo x = D (bar (x - 3))

    bar :: Int -> Int
    bar 0 = getD y
    bar x = getD (foo x)

Compiling with -dverbose-core2core, we see that after the first simplifier run (gentle, before floating), we get

baz :: D Int -> Int -> D Int
[LclIdX,
 Arity=2,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0 0] 121 0}]
baz
  = \ (y_aSE :: D Int) (x0_aSF :: Int) ->
      letrec {
        foo_aSG [Occ=LoopBreaker] :: Int -> D Int
        [LclId,
         Arity=1,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [20] 81 0}]
        foo_aSG
          = \ (ds_d27R :: Int) ->
              case ds_d27R of { GHC.Types.I# ds_d27T ->
              case ds_d27T of ds_X289 {
                __DEFAULT ->
                  (case ds_X289 of ds_X284 {
                     __DEFAULT ->
                       (foo_aSG (GHC.Types.I# (GHC.Prim.-# ds_X284 3#)))
                       `cast` (Foo.N:D[0] <Int>_R :: (D Int :: *) ~R# (Int :: *));
                     3# ->
                       y_aSE `cast` (Foo.N:D[0] <Int>_R :: (D Int :: *) ~R# (Int :: *))
                   })
                  `cast` (Sym (Foo.N:D[0] <Int>_R) :: (Int :: *) ~R# (D Int :: *));
                0# -> y_aSE
              }
              }; } in
      foo_aSG x0_aSF

Note that foo_aSG is bound by letrec.

If we switch to a type synonym version,

type D a = a
getD :: D a -> a
getD a = a

d :: a -> D a
d a = a

baz :: Int -> Int -> Int
baz y x0 = foo x0
  where
    foo :: Int -> Int
    foo 0 = y
    foo x = d (bar (x - 3))
    
    bar :: Int -> Int
    bar 0 = getD y
    bar x = getD (foo x)

then at the same point in core2core we instead see

baz :: Int -> Int -> Int
[LclIdX,
 Arity=2,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0 0] 59 0}]
baz
  = \ (y_aSQ :: Int) (x0_aSR :: Int) ->
      joinrec {
        foo_aSS [Occ=LoopBreaker] :: Int -> Int
        [LclId[JoinId(1)],
         Arity=1,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [20] 55 0}]
        foo_aSS (ds_d27X :: Int)
          = case ds_d27X of { GHC.Types.I# ds_d27Z ->
            case ds_d27Z of ds_X28a {
              __DEFAULT -> jump foo_aSS (GHC.Types.I# (GHC.Prim.-# ds_X28a 3#));
              0# -> y_aSQ;
              3# -> y_aSQ
            }
            }; } in
      jump foo_aSS x0_aSR

The reason this example doesn't quite go far enough is that later transformations work out the kinks and recognize the join point. But based on the bug report, that isn't always the case.

comment:6 Changed 14 months ago by nomeata

Cc: nomeata added

The polymorphic go in comment:3 is a separate problem from the one that I described further up (but it might well be the OP’s problem). It could be fixed by static argument transformation. On my TODO list, after #14068, is SAT for join points, but it would not help here, because only after SAT it is a join point. Exitification unfortunately also does not help to turn this into a joint point. Maybe worth opening a separate ticket.

Your example in comment:4 is precisely what I described, thanks for working it out. The fix here probably requires some significant thought about the join point typing. Maybe it is enough to relax the rule “The return type must not depend on any arguments” to “The return type must be representatoinally equal for all arguments” and that might allow us to e `cast` co to be a tail-call position.

comment:7 Changed 14 months ago by dfeuer

nomeata, I opened #14620 for the issue in comment:3.

comment:8 Changed 14 months ago by bgamari

Blocked By: 14620 added

comment:9 Changed 14 months ago by dfeuer

Blocked By: 14620 removed

bgamari, I don't think this is actually blocking on #14620. Why do you think it is?

comment:10 Changed 14 months ago by bgamari

Ahh, indeed you are right. It's an orthogonal concern.

comment:11 Changed 14 months ago by simonpj

Maybe it is enough to relax the rule “The return type must not depend on any arguments” to “The return type must be representatoinally equal for all arguments” and that might allow us to e cast co to be a tail-call position

Actually I think we may just be able to say that if (f x) is a tail call, then (f x |> co) is a tail call. So in this code:

occAnal env (Cast expr co)
  = case occAnal env expr of { (usage, expr') ->
    let usage1 = zapDetailsIf (isRhsEnv env) usage
          -- usage1: if we see let x = y `cast` co
          -- then mark y as 'Many' so that we don't
          -- immediately inline y again.
        usage2 = addManyOccsSet usage1 (coVarsOfCo co)
          -- usage2: see Note [Gather occurrences of coercion variables]
    in (markAllNonTailCalled usage2, Cast expr' co)
    }

just remove the markAllNonTailCalled. That call was in Luke's original join-point patch, but it seems over-conservative to me.

That said, I'm intrigued about how this happens in practice, if it really does. In the example given in comment:5, a single run of the simplifer removes the redundant casts, and for recursive join points the tail calls really must return the same type as the function itself, so the cast seems unlikely. While the change above (removing markAllNonTailCalled) is ok (I think), I'm surprised if it has any effect. An example would be great.

comment:12 Changed 14 months ago by simonpj

Keywords: JoinPoints added

comment:13 Changed 14 months ago by nomeata

Here is an example. This code (updated with simpler, less contrived example):

{-# LANGUAGE KindSignatures, DataKinds, GADTs, TypeFamilies, UndecidableInstances #-}
module RecCast (addSing) where

data Nat = Z | S Nat

data Sing (n :: Nat) where
    FZ :: Sing Z
    FS :: Sing n -> Sing (S n)

type family Plus n m :: Nat where
    Plus Z m = m
    Plus (S n) m = Plus n (S m)

addSing :: (forall a. a -> a) -> Sing n -> Sing m -> Sing (Plus n m)
addSing f = go -- The id is to ensure that go is a local function
  where
    go :: Sing n -> Sing m -> Sing (Plus n m)
    go FZ m = f m
    go (FS n) m = go n (FS m)

produces this function that would be a join point when we consider casted expressions as tail-recursive:

      letrec {
        go_sXc [Occ=LoopBreaker]
          :: forall (n1 :: Nat) (m1 :: Nat).
             Sing n1 -> Sing m1 -> Sing (Plus n1 m1)
        [LclId, Arity=2, Str=<S,1*U><L,U>, Unf=OtherCon []]
        go_sXc
          = \ (@ (n1_arO :: Nat))
              (@ (m1_arP :: Nat))
              (ds_dWz :: Sing n1_arO)
              (m2_apO :: Sing m1_arP) ->
              case ds_dWz of {
                FZ cobox_arR ->
                  (f_apK @ (Sing m1_arP) m2_apO)
                  `cast` ((Sing
                             (Trans (Sym (D:R:Plus[0] <m1>_N)) (Plus (Sym cobox) <m1>_N)_N))_R
                          :: (Sing m1_arP :: *) ~R# (Sing (Plus n1_arO m1_arP) :: *));
                FS @ n2_arV cobox_arW n3_apP ->
                  (go_sXc @ n2_arV @ ('S m1_arP) n3_apP ($WFS @ m1_arP m2_apO))
                  `cast` ((Sing
                             (Trans
                                  (Sym (D:R:Plus[1] <n2>_N <m1>_N)) (Plus (Sym cobox) <m1>_N)_N))_R
                          :: (Sing (Plus n2_arV ('S m1_arP)) :: *)
                             ~R#
                             (Sing (Plus n1_arO m1_arP) :: *))
              }; } in
Last edited 14 months ago by nomeata (previous) (diff)

comment:14 Changed 14 months ago by simonpj

Good example. But it fails the isValidJoinPointType test (c.f. #14620), so even if we roped the markAllNonTailCalled in the Cast case of occurrence-analysis, we would not get go as a join point. And indeed that's not unreasonable. Consider

case  ( letrec go n m ds m2 = case ds of                   )
      (                          FX co -> m2 |> (...co..)  )
      (                          FS ... -> (go ...) |> co2 )
      ( in go t1 t2 a b                                    ) of
  BLAH

Operationally we have a join point, but the transformation to move that case inwards would give this

letrec go n m ds m2 = case ds of
                         FX co -> case m2 |> (...co..) of BLAH
                         FS ... -> (go ...) |> co2
in go t1 t2 a b

but now the outer case is scrutinising something involving n, m etc, which makes no sense.

Operationally the transformation makes sense, but it's not well typed. I have no idea how to fix this.

We still have no example of a program that has a cast in the return (and and hence might benefit from dropping the makrAllNonTailCalled) but which passes the isValidJoinPointType test.

comment:15 Changed 14 months ago by nomeata

That example is one where the result type depends on the parameters… but don’t see why we need that restriction – this code shows that it would be beneficial and not crash to allow it. The note [The polymorphism rule of join points] basically says “We need this restriction because the CPS translation would not be easily typable, but that is not very convincing.

comment:16 Changed 14 months ago by nomeata

Operationally the transformation makes sense, but it's not well typed. I have no idea how to fix this.

So am I right to say “we forbid polymorphic return types because we cannot prove it to be sound, even though we know it would not actually crash if we did allow it”?

comment:17 Changed 14 months ago by simonpj

So am I right to say “we forbid polymorphic return types because we cannot prove it to be sound, even though we know it would not actually crash if we did allow it”?

Yes, that's right. But be specific about what you mean by "it" in "would not crash if we did allow it". In comment:14 I show a particular transformation. I believe it is (slightly) beneficial, and will not crash, but the result is not type-correct.

If you can figure out how to express the proof that it won't crash, as a coercion perhaps, maybe you could somehow express that proof in Core.

You don't need casts to expose the problem: #14620 is enough.

comment:18 Changed 14 months ago by nomeata

Here is an idea. Consider a recursive function foo with type

foo :: forall t, a -> r t
foo @t a = E[foo @t2 x `cast` co)] -- E[_] is tail call context, and the cast prevents join-point-hood

But we can transform this into a form that is allowed by now, just by introducing casts:

foo :: forall t, a -> r t
foo @t a = go @t a refl
  where
    go :: forall t', a -> (r t ~ r t') -> r t
    go @t' a co' = E[go @t'2 (co `trans` co')]

Note that now the return type does no longer depend on the arguments of go, so it is a join point.

This transformation smells like a worker-wrapper transformation, but I am not sure.

Anyways, I am not arguing that we should do this transformation in GHC, but do it in the proofs so that we can simply happily remove the restriction in isValidJoinPointType.

Last edited 14 months ago by nomeata (previous) (diff)

comment:19 Changed 14 months ago by nomeata

Ah, of course you are right: If we drop isValidJoinPointType then we get more join points, and still no crashes. But it would indeed prevent the transformation in comment:14. That leaves two options:

  1. Un-join-point go if we want to do this transformation, and get what we get right now.
  2. Actually do this transformation I propose in comment:18 in GHC, which means we get a join point as before and can do that transformation.

All not carefully thought out, of course, and too much distraction here right now, so I am mostly conveying my gut feeling, hoping to not make a fool out of myself here.

Last edited 14 months ago by simonpj (previous) (diff)

comment:20 Changed 14 months ago by simonpj

I utterly hate (1). But maybe something along the lines of (2) could work. Adding a coercion as an accumulating parameter precisely builds up the proof I referred to in comment:17. Though I have no idea how to do this in general!

comment:21 Changed 14 months ago by nomeata

Ok, thanks for the clarification. Finally I get why that restriction is there. I will try to improve the Note about it (and pass it by you for confirmation).

I think the transformation is possible, but it is somewhat non-local, as you need to move all casts inwards towards the recursive call, and then into the newly added parameter. Ff we find that there are performance gains to be won, then it might be worth it!

comment:22 Changed 14 months ago by nomeata

Simon, can you briefly check that https://phabricator.haskell.org/D4281 is an improvement to the Note?

comment:23 Changed 14 months ago by Joachim Breitner <mail@…>

In 862c59e/ghc:

Rewrite Note [The polymorphism rule of join points]

I found the reference to CPS unhelpful, but Simon gave me a good
explanation in #14610 that I believe is going to be more enlightening
for future readers.

Differential Revision: https://phabricator.haskell.org/D4281

comment:24 Changed 8 months ago by bgamari

Milestone: 8.6.18.8.1

These won't be addressed in GHC 8.6.

comment:25 Changed 2 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.