Opened 6 years ago

Last modified 9 days ago

#8095 new bug

TypeFamilies painfully slow

Reported by: MikeIzbicki Owned by: goldfire
Priority: high Milestone: 8.8.1
Component: Compiler (Type checker) Version: 7.6.3
Keywords: TypeFamilies Cc: dimitris@…, jstolarek, RyanGlScott, magesh.b, michalt, mbieleck, alanz, adamgundry, kcsongor, redv, danilo2
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #5321, #11598, #12506, #13386 Differential Rev(s): Phab:D3752, Phab:D4766
Wiki Page:

Description

I'm using the TypeFamilies extension to generate types that are quite large. GHC can handle these large types fine when they are created manually, but when type families get involved, GHC's performance dies.

Unlike in ticket #5321, using tail recursion does not eliminate the problem, and the order of arguments greatly affects compile time.

I've attached a file Types.hs that demonstrates the problems. This file generates another Haskell file which has the problems. It takes 3 flags. The first is the size of the type to generate, the second is which type family function to use, and the third is whether to call the type family or just use a manually generated type.

Here are my performance results:

Using non-tail recursion, I get these results. I have to increase the stack size based on the size of the type I want to generate.

$ ./Types 200 a a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=250
real    0m2.973s
$ ./Types 300 a a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=350
real    0m6.018s
$ ./Types 400 a a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=450
real    0m9.995s
$ ./Types 500 a a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=550
real    0m15.645s

Tail recursion generates much slower compile times for some reason, and I still need to adjust the stack size:

$ ./Types 200 b a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=250
real    0m16.120s

Changing the order of arguments to the recursive type family greatly changes the run times:

$ ./Types 200 c a > test.hs && time ghc test.hs > /dev/null -fcontext-stack=250
real    0m6.095s

Without the type family, I get MUCH better performance:

$ ./Types 10000 a d > test.hs && time ghc test.hs > /dev/null
real    0m2.271s

Attachments (4)

Types.hs (2.7 KB) - added by MikeIzbicki 6 years ago.
timing.sh (662 bytes) - added by mbieleck 19 months ago.
Timing script for Types.hs. Requires fomit-type-family-coercions.patch.
fomit-type-family-coercions.patch (8.5 KB) - added by mbieleck 19 months ago.
timing2.txt (1.2 KB) - added by mbieleck 19 months ago.
Impact of forcing the coercion on performance

Download all attachments as: .zip

Change History (65)

Changed 6 years ago by MikeIzbicki

Attachment: Types.hs added

comment:1 Changed 6 years ago by goldfire

I've been playing around with type families recently, so I just took a look at this one. I can reproduce Mike's observed behavior exactly, both with 7.6.3 and with HEAD (although I had to strike the "import GHC.TypeLits" to get the produced file to compile with HEAD). It seems that the problem lies outside the area I've been editing, so I'm going to leave this for now.

For what it's worth, I tried this example with closed type families, and the resultant behavior is the same. This leads me to believe the problem is indeed in the constraint solver.

Thanks for reporting and for creating a great test environment!

comment:2 Changed 5 years ago by simonpj

Cc: dimitris@… added

comment:3 Changed 4 years ago by thomie

Milestone: 7.10.1
Priority: normalhigh

Running the first example from the description with ghc HEAD:

$ ./Types 200 a a > test.hs && ghc-7.9.20141119 test.hs > /dev/null -fcontext-stack=250 +RTS -t
   
test.hs:12:10:
    Type function application stack overflow; size = 201
    Use -ftype-function-depth=N to increase stack size to N
      Replicate1 'Zero () ~ '[]
    In the instance declaration for ‘Class (Data xs)’
<<ghc: 3459540512 bytes, 388 GCs,
70724196/419708144 avg/max bytes residency (12 samples), 815M in use,
0.001 INIT (0.001 elapsed), 36.325 MUT (36.475 elapsed), 5.933 GC (5.971 elapsed)
:ghc>>

That's a lot of bytes. Adding -ftype-function-depth=250 doesn't help: ever increasing memory usage until I have to kill the process. I did not try any of the other examples.

This looks like a regression, setting priority to high.

comment:4 Changed 4 years ago by jstolarek

Cc: jstolarek added

comment:5 Changed 4 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to the 7.12.1 milestone, as these tickets won't be fixed in time for the 7.10.1 release (unless you, the reader, help write a patch :)

comment:6 Changed 4 years ago by archblob

These are times measured on an i3-2100 CPU @ 3.10GHz running Ubuntu 15.04 and GHC 7.10.1 :

input real
200 a a 0m0.609s
300 a a 0m0.689s
400 a a 0m0.884s
500 a a 0m0.942s
200 b a 0m2.028s
200 c a 0m1.092s
10000 a d 0m1.978s

They are grately improved and needed no stack adjustment. Tail-recursive variant is still a lot slower.

comment:7 Changed 3 years ago by bgamari

Owner: set to bgamari

I'll try to have a look at this. Looks like a nice area for improvement.

comment:8 Changed 3 years ago by goldfire

I got caught in this morass some time ago, but my comment:1 is out-of-date at this point: I am well acquainted with how all this works in the solver these days. So I'm happy to help out, but I don't have the bandwidth now to take the lead here. Thanks, Ben!

comment:9 Changed 3 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:10 Changed 3 years ago by bgamari

Finally have a chance to look at this. In particular I'll be looking at the b a case identified in comment:6 as still being slow. For the record runghc Types.hs 5 b a produces output of the form,

{-# LANGUAGE TypeOperators,DataKinds,KindSignatures,TypeFamilies,PolyKinds,UndecidableInstances #-}
import GHC.TypeLits

data Nat1 = Zero | Succ Nat1

type family Replicate1 (n :: Nat1) (x::a) :: [a]
type instance Replicate1 n x = Replicate1' '[] n x

type family Replicate1' (acc::[a]) (n :: Nat1) (x::a)  :: [a]
type instance Replicate1' acc Zero x = acc
type instance Replicate1' acc (Succ n) x = Replicate1' (x ': acc) n x

class Class a where
    f :: a -> a

data Data (xs::a) = X | Y
    deriving (Read,Show)

main = print test1

-- The numeric parameter adjusts this
-- This definition is inlined in the actual output
type N = Succ (Succ (Succ (Succ (Succ Zero)))

instance (xs ~ Replicate1 N ()) => Class (Data xs) where
    f X = Y
    f Y = X

test1 = f (X :: Data (Replicate1 N () ))

In particular the numeric parameter to Test is adjusting N.

In the case of N=500 compile-time is quite bad indeed,

$ ./Types 500 b a >| test.hs && time ghc test.hs  -freduction-depth=0 -dshow-passes
Glasgow Haskell Compiler, Version 7.11.20151006, stage 2 booted by GHC version 7.10.2
...
*** Chasing dependencies:
Chasing modules from: *test.hs
Stable obj: []
Stable BCO: []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = 2015-10-06 11:44:53.905638216 UTC
         ms_mod = Main,
         ms_textual_imps = [import (implicit) Prelude, import GHC.TypeLits]
         ms_srcimps = []
      }]
*** Deleting temp files:
compile: input file test.hs
Created temporary directory: /tmp/ghc27682_0
*** Checking old interface for Main:
[1 of 1] Compiling Main             ( test.hs, test.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 103, types: 11,779, coercions: 509,036}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 107, types: 7,294, coercions: 516}
Result size of Simplifier
  = {terms: 107, types: 7,294, coercions: 516}
*** Tidy Core:
Result size of Tidy Core
  = {terms: 111, types: 7,302, coercions: 518}
*** CorePrep:
Result size of CorePrep
  = {terms: 152, types: 7,880, coercions: 518}
*** Stg2Stg:
*** CodeGen:
*** Assembler:
Upsweep completely successful.
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc27682_0/ghc_3.c
Warning: deleting non-existent /tmp/ghc27682_0/ghc_1.s
Linking test ...
*** C Compiler:
*** C Compiler:
*** Linker:
*** Deleting temp files:
*** Deleting temp dirs:

real    0m22.835s
user    0m22.514s
sys     0m0.315s

It appears that the desugaring pass ends up producing a number of coercions quadratic in the number of depth of the equality in the context result (with almost all being eliminated during simplification),

input terms types coercions
400 b a 103 9479 327236
200 b a 103 4879 83636
100 b a 103 2579 21836
50 b a 103 1429 5936
2 b a 103 325 80

Looking at the output from desugaring, it seems that the only definition which is growing non-linearly is main itself, where indeed we find a chain of quadratically growing coercions. For instance, in the case of N=3 we have,

-- RHS size: {terms: 6, types: 85, coercions: 102}
main :: IO ()
[LclIdX, Str=DmdType]
main =
  print
    @ (Data '[(), (), ()])
    (Main.$fShowData @ [*] @ '[(), (), ()])
    (f @ (Data '[(), (), ()])
       (Main.$fClassData
          @ '[(), (), ()]
          (Eq#
             @ [*]
             @ '[(), (), ()]
             @ (Replicate1 ('Succ ('Succ ('Succ 'Zero))) ())
             @~ (Sym
                   (Main.TFCo:R:Replicate1'kaccZerox[0]
                      <*>_N <'[(), (), ()]>_N <()>_N)
                 ; Sym
                     (Main.TFCo:R:Replicate1'kaccSuccx[0]
                        <*>_N <'[(), ()]>_N <'Zero>_N <()>_N)
                 ; Sym
                     (Main.TFCo:R:Replicate1'kaccSuccx[0]
                        <*>_N <'[()]>_N <'Succ 'Zero>_N <()>_N)
                 ; Sym
                     (Main.TFCo:R:Replicate1'kaccSuccx[0]
                        <*>_N <'[]>_N <'Succ ('Succ 'Zero)>_N <()>_N)
                 ; Sym
                     (Main.TFCo:R:Replicate1knx[0]
                        <*>_N <'Succ ('Succ ('Succ 'Zero))>_N <()>_N)
                 :: '[(), (), ()] ~# Replicate1 ('Succ ('Succ ('Succ 'Zero))) ())))
       (((Main.X @ [*] @ '[(), (), ()])
         `cast` ((Data
                    (UnivCo opt_phantom phantom
                       '[(), (), ()] (Replicate1 ('Succ ('Succ ('Succ 'Zero))) ())))_R
                 :: Data '[(), (), ()]
                    ~R# Data (Replicate1 ('Succ ('Succ ('Succ 'Zero))) ())))
        `cast` ((Data
                   (UnivCo opt_phantom phantom
                      (Replicate1 ('Succ ('Succ ('Succ 'Zero))) ()) '[(), (), ()]))_R
                :: Data (Replicate1 ('Succ ('Succ ('Succ 'Zero))) ())
                   ~R# Data '[(), (), ()])))

More analysis to follow.

comment:11 Changed 3 years ago by goldfire

The blowup in coercions is fixed in the simplifier, yes? That means that coercion optimization is working.

When you vary N, what changes in the output? The long sequence of coercions in the middle of that dump looks correct to me. I'm less sure about the two casts at the end. The unoptimized Core term for this program may truly grow quadratically. Maybe the solution is to be more eager with the coercion optimizer? I don't see any harm in doing so, if we can observe a speedup.

Let me know if I can be of further assistance.

comment:12 Changed 3 years ago by simonpj

Maybe the solution is to be more eager with the coercion optimizer?

I don't think so. If the type checker generates a term of quadratic size, it'll take at least quadratic time to generate, and quadratic time to optimise.

We should look into why it gets so big in the first place.

comment:13 Changed 3 years ago by goldfire

It has to get big, as the coercion witnesses the sequence of reductions that a type family makes to get to a normal form. There are roughly N steps. And each step must explicitly mention the decreasing value of N. And that's quadratic.

Unless I'm missing something, the only way to remove the quadratic behavior here is to redesign Core to be less explicit. That might just be possible.

Right now, we can take a coercion and ask what types it relates. We will always get an answer. But perhaps we can redesign things so that we ask a coercion what types it relates, and we always provide one of the types. Then some function tells us what the other type is. (This operation can't just be, say, left-to-right because we need sym to work.) As a really easy example, this new design means that reflexive coercions don't need to store their types, because one type is always the same as the other. In the case of axioms, getting from one type to another is much harder. Going left-to-right seems possible, but it will require a matching algorithm. Going right-to-left on non-injective type family axioms is not possible without some extra information, so that's annoying. But maybe there's a design of this feature available.

What would be the consequences? Coercions would get smaller. Compilation would get faster. Core Lint would get slower. Core would get harder to debug, perhaps. Might be worth it.

Wait! A new idea!

This is actually an old idea of mine, but one I've never articulated. What if we just don't bother with coercions at all when -dcore-lint is off? I conjecture that they're entirely pointless without -dcore-lint. Sometimes we'll need to ask for the type of (expr |> co), and so we'll have to store the type of the result of a cast (since we're omitting the coercion). Implementing this idea may be a challenge given our current code infrastructure, but I do think it would work. And then, poof, this problem goes away, and compilation probably gets a lot faster in a lot of cases. We absolutely want to keep coercions around for -dcore-lint, as that feature has saved us from unknown quantities of pain and embarrassment. But there's no good reason ordinary, trusting users need to pay the price (in compilation times/memory) for this feature.

comment:14 Changed 3 years ago by bgamari

For the record, #7428 is characterized by a similar blow-up in coercions. I'm not quite sure whether it's the same cause, but it is eerily reminiscent.

Disabling production of coercions is an interesting idea; I wonder how much this would actually improve compilation speed in the non-pathological cases.

comment:15 Changed 3 years ago by simonpj

Good idea. And here's an easier way to implement it: replace every (non-trivial) coercion in the program (specifically in cases) with UnivCo!

That deals with the "sometimes we'll need to type of the result in the cast" question.

But we need to take care. If we start with

data T a where
  T1 :: Bool -> T Bool
  T2 :: T a

f :: T a -> a -> Bool
f = /\a (x:T a) (y:a).
    case x of
          T1 (c : a~Bool) (z : Bool) -> not (y |> c)
          T2 -> True

If we discard the cast, or turn it into UnivCo we might then wrongly float the not-expression, thus

f = /\a (x:T a) (y:a).
    let w :: Bool = not (y |> UnivCo a Bool)
    case x of
          T1 (c : a~Bool) (z : Bool) -> w
          T2 -> True

This woudl not have happened before, because the 'c' would prevent the not-expression being floated. But if we dump the 'c' it could (utterly bogusly, and risking seg-faults) be floated.

So if we are going to radically abbreviate to UnivCo or something like it, we should include the free variables of the coercion we have discarded, something like UnivCo t1 tc [v1,...,vn]. So we'd get:

f :: T a -> a -> Bool
f = /\a (x:T a) (y:a).
    case x of
          T1 (c : a~Bool) (z : Bool) -> not (y |> UnivCo a Bool [c])
          T2 -> True

But now what if c was itself bound to a big coercion? Then the UnivCo would keep the big coercion alive. But it's ok: we should just substitute for c to get

UnivCo a Bool [big-coercion]

and now have a magic UnivCo optimisation to take the free vars of big-coercion:

UnivCo a Bool [v1,..,vn]

I think that would do it.

We could do this in the desugarer; or even earlier in setEvBind in the type checker. In the latter case we'd essentially kill off those big coercions at birth.

Simon

comment:16 Changed 3 years ago by goldfire

Good observation about tracking free variables.

But I think we'll have to be careful to avoid quadratic behavior. Suppose a call to flatten_exact_fam_app_fully creates a sequence of coercions with a quadratic size. This is exactly the case in comment:10. By the time we get to setEvBind, the quadratic-sized coercion is built. I thought for a moment laziness might save us, but the transformation from big coercion to UnivCo has to traverse the big coercion looking for coercion variables, forcing the thunks. And I think this problem might occur in places other than flatten_exact_fam_app_fully.

One solution is to make the mkTcXXX functions in TcEvidence monadic. They could then consult the DynFlags to see how to proceed. At first blush, that looks terrible, but I think it's actually OK. I just searched for mkTcTransCo, and it is near a monad at every use site. Maybe other mkTcXXX functions are less well-placed, but I tend to doubt it.

comment:17 Changed 3 years ago by bgamari

I'm not yet certain but #5645 may be another instance of this.

comment:18 Changed 3 years ago by simonpj

What makes you think they are connected? #5645 is to do with sharing a big data structure at runtime. This ticket is about big coercion structures at compile time.

comment:19 Changed 3 years ago by bgamari

Oh dear, I cited the wrong ticket; I meant #5642 which also appears to be generating large strings coercions (in particular when deriving Generic on a large product type).

comment:20 Changed 3 years ago by bgamari

Milestone: 8.0.18.2.1

This pretty obviously won't make 8.0.

comment:21 Changed 3 years ago by thomie

Keywords: TypeFamilies added

comment:22 Changed 3 years ago by rwbarton

What's the status of discarding coercions during normal compilations? Is it worth separating that idea out into its own ticket? Seems like it would be good for compilation time more generally, not just when computing with type families.

comment:23 Changed 3 years ago by goldfire

It somehow didn't make it onto the ticket, but there was a concern about this idea: once you omit coercions from one module, then all downstream modules are also missing (some) coercions. This means that -dcore-lint doesn't do nearly as much as it normally does. That makes me a bit uncomfortable, but perhaps it's OK.

comment:24 Changed 3 years ago by simonpj

I think the design in comment:15 gets us a decent way forward. But someone will have to implement it, and I'm totally swamped. So the status is "stalled" I think.

comment:25 Changed 3 years ago by goldfire

Made easier by #11598.

comment:26 Changed 3 years ago by goldfire

See comment:2:ticket:11598 for some thoughts on how to do this.

comment:27 Changed 3 years ago by RyanGlScott

Cc: RyanGlScott added

comment:28 Changed 2 years ago by bgamari

comment:29 Changed 2 years ago by bgamari

Owner: bgamari deleted
Version 0, edited 2 years ago by bgamari (next)

comment:30 Changed 2 years ago by bgamari

Owner: set to goldfire

Assigning to Richard. Feel free to unassign if you don't think you will get to this, Richard.

comment:31 Changed 2 years ago by magesh.b

Cc: magesh.b added

comment:32 Changed 2 years ago by bgamari

Milestone: 8.2.18.4.1

comment:33 Changed 2 years ago by michalt

Cc: michalt added

comment:34 in reply to:  15 Changed 19 months ago by mbieleck

Cc: mbieleck added

comment:35 Changed 19 months ago by mbieleck

Here's another idea, based on comment:15: Replace every coercion resulting from flattening type family applications with UnivCo. I think that could fix the problem Simon mentioned, because the resulting UnivCo should have the same free variables as the original chain of coercions. But I know approximately nothing about GHC internals, so this may be terribly wrong.

I am experimenting with an implementation of this idea. The code is

--- a/compiler/typecheck/TcFlatten.hs
+++ b/compiler/typecheck/TcFlatten.hs
@@ -1139,16 +1139,18 @@ flatten_fam_app tc tys  -- Can be over-saturated
                  -- in which case the remaining arguments should
                  -- be dealt with by AppTys
       do { let (tys1, tys_rest) = splitAt (tyConArity tc) tys
-         ; (xi1, co1) <- flatten_exact_fam_app tc tys1
+         ; (xi1, _co1) <- flatten_exact_fam_app tc tys1
                -- co1 :: xi1 ~ F tys1
 
                -- all Nominal roles b/c the tycon is oversaturated
-         ; (xis_rest, cos_rest) <- flatten_many (repeat Nominal) tys_rest
+         ; (xis_rest, _cos_rest) <- flatten_many (repeat Nominal) tys_rest
                -- cos_res :: xis_rest ~ tys_rest
+         ; zonkedTy <- liftTcS $ zonkTcType $ mkTyConApp tc tys
+         ; role <- getRole
 
          ; return ( mkAppTys xi1 xis_rest   -- NB mkAppTys: rhs_xi might not be a type variable
                                             --    cf Trac #5655
-                  , mkAppCos co1 cos_rest
+                  , mkUnsafeCo role (mkAppTys xi1 xis_rest) zonkedTy
                             -- (rhs_xi :: F xis) ; (F cos :: F xis ~ F tys)
                   ) }

Here I'm trying to generate an UnsafeCo of the same type as the original.

This passes all tests but T13822, which fails with a Core Lint error, and I have no idea why. Can someone help me debug this?

comment:36 Changed 19 months ago by goldfire

While I do think your proposed solution may fix the problem reported here -- slow compilation -- in my opinion it throws the baby out with the bathwater. Using mkUnsafeCo should happen only from unsafeCoerce#. There's nothing unsafe going on here, yet the Core code will be littered with what appear to be unsafe coercions.

Instead of abusing the current unsafe coercions, the plan above is to introduce a new OmittedCo (controlled by -f flags) that is quite like UnsafeCo (but stores used coercion variables, for instance). Then, when reading the Core, we can tell the difference between the product of unsafeCoerce# and something omitted.

By the way, I say "may fix" above because this approach might still create the coercion, only to throw it away. If we prevent the creation of the coercion a little earlier, we might have more luck.

By the way, I'm sorry to rain on your parade a bit here. I do very much appreciate your just going in to fix this... but I don't think the simple solution here is quite the right approach that will continue to work in the long term.

Changed 19 months ago by mbieleck

Attachment: timing.sh added

Timing script for Types.hs. Requires fomit-type-family-coercions.patch.

Changed 19 months ago by mbieleck

comment:37 Changed 19 months ago by mbieleck

Thanks for the response.

Replying to goldfire:

Using mkUnsafeCo should happen only from unsafeCoerce#. There's nothing unsafe going on here, yet the Core code will be littered with what appear to be unsafe coercions.

I realize that. I just picked any variant of UnivCo to quickly test if this has any chance of working (and actually improving compile times). Of course the right thing to do is to introduce a new UnivCoProvenance specifically for this case.

By the way, I say "may fix" above because this approach might still create the coercion, only to throw it away. If we prevent the creation of the coercion a little earlier, we might have more luck.

Yes, it might still create the coercion, but at least it will prevent the simplifier from processing it repeatedly later.

I don't think the simple solution here is quite the right approach that will continue to work in the long term.

Who said we can't make incremental improvements? If this would be put behind a flag, it should not harm anyone.

For the record, here are some timing results (using Types.hs from the ticket):

$ ./timing.sh 
GHC=inplace/bin/ghc-stage2
    N        old       omit   old_tail  omit_tail
  200   0m0.465s   0m0.404s   0m1.674s   0m0.445s
  400   0m0.695s   0m0.504s  0m10.745s   0m0.645s
  600   0m1.077s   0m0.614s  0m28.975s   0m0.985s
  800   0m1.565s   0m0.755s    timeout   0m1.446s
 1000   0m2.225s   0m1.075s    timeout   0m2.435s
 1200   0m3.167s   0m1.426s    timeout   0m3.575s
 1400   0m4.236s   0m1.705s    timeout   0m5.045s
 1600   0m5.786s   0m1.654s    timeout   0m4.775s
 1800   0m5.756s   0m1.955s    timeout   0m6.156s
 2000   0m7.215s   0m3.106s    timeout   0m9.865s

old is the default behavior, omit is my patch. _tail variants use the tail recursive type family.

The reason I'm so desperate is that it greatly affects my code. I'm using extensible (Vinyl) records computed using some type families. For bigger records, GHC just runs out of memory. This patch helps.

comment:38 Changed 19 months ago by mpickering

@mbieleck Can you please put your patch on phabricator so that we can continue to iterate on it?

I am going to try a build now and see how it improves the compile times in my type family heavy code.

comment:39 Changed 19 months ago by mbieleck

Differential Rev(s): Phab:D3752

@mpickering Sure.

comment:40 Changed 19 months ago by goldfire

comment:37 is very convincing, with numbers to back up your claims and your appeal to incremental improvement. Thanks.

comment:41 Changed 19 months ago by simonpj

As Richard says this is a direction we've been wanting to move for some time.

But do read comment:15 above carefully. It's crucial to keep track of the free coercion variables of a coercion, otherwise GHC may (after discarding coercions) "optimise" a correct program into one that seg-faults. We don't want that.

Last edited 19 months ago by simonpj (previous) (diff)

comment:42 in reply to:  41 Changed 19 months ago by mbieleck

Replying to simonpj:

But do read comment:15 above carefully. It's crucial to keep track of the free coercion variables of a coercion, otherwise GHC may (after discarding coercions) "optimise" a correct program into one that seg-faults. We don't want that.

My hypothesis was is that the coercion returned specifically by flatten_fam_app should not have more free variables than the original type (F arg1 arg2). I failed to see how example from comment:15 would interact with type families. But I was wrong - flatten_fam_app could not only use axioms, but also coercion variables that happen to be in scope.

Here's an example that demonstrates bad behavior of -fomit-type-family-coercions:

{-# LANGUAGE GADTs, TypeFamilies, UndecidableInstances #-}
module Bad where
type family Id a
data T a where
  T1 :: Id a ~ Bool => T a
f :: T a -> Id a -> Bool
f x y =
    case x of
      T1 -> not y

Core (simplified) without -fomit-type-family-coercions:

f = \ (@ a) (x :: T a) (y :: Id a) ->
      case x of
        T1 (co :: Id a ~ Bool) ->
          not (y |> co)

Core (simplified) with -fomit-type-family-coercions:

f = \ (@ a) (x :: T a) (y :: Id a) ->
      case x of
        T1 (co :: Id a ~ Bool) ->
          not (y |> UnivCo (Id a) Bool)

not (y |> UnivCo (Id a) Bool) can be floated out, which should not happen.

The next obvious thing is to generate the coercion, traverse it to find free variables and discard it, putting free variables in UnivCo. I've measured how much just traversing the coercion would impact the performance (using seqCo). This results in about 50% slowdown compared to just discarding the coercion (for timings.sh).

An alternative is to track used coercion variables in FlattenEnv, but I don't know how reliable would that be. The in-scope coercion gets pulled in in lookupFlatCache, is that right? Are there other places in the flattener which can use coercion variables?

I apologise for posting so much comments and code, I want to confirm whether my reasoning is correct.

Changed 19 months ago by mbieleck

Attachment: timing2.txt added

Impact of forcing the coercion on performance

comment:43 Changed 19 months ago by simonpj

I'd be very reluctant to have a complicated solution that was hard to be sure was correct here. (e.g. tracking covasrs in FlattenEnv.)

I'm surprised that finding the free variables of a coercion is so expensive.

Zonking traerses coercions anyway. Suppose the result of zonking a coercion was always UnivCo t1 t2 free-vars. Then to zonk Trans c1 c2 you can zonk c1 and c2, and unition their free vars.

comment:44 Changed 14 months ago by alanz

I wonder if this is contributing to the slow compilation times for the TTG patches, related to deriving Data instances for type family indexed types.

comment:45 Changed 14 months ago by alanz

Cc: alanz added

comment:46 Changed 13 months ago by bgamari

Milestone: 8.4.1

This is a rather open-ended and non-trivial issue; removing milestone.

comment:47 Changed 10 months ago by adamgundry

Cc: adamgundry added

comment:48 Changed 10 months ago by kcsongor

Cc: kcsongor added

comment:49 Changed 9 months ago by bgamari

According to my understanding, this should be reasonably straightforward: Introduce a placeholder Coercion variety which tracks the types, role, and free variables of the coercion.

comment:50 Changed 9 months ago by simonpj

Differential Rev(s): Phab:D3752Phab:D3752, Phab:D4766

comment:51 Changed 7 months ago by _recursion

Is there any chance that this will be able to make it into 8.6? We (Luna) have a codebase that is very heavy on type families, and we're seeing obscene compile times and memory usage when building with optimisation. When it takes longer to compile than ghc in the perf build flavour, we have something wrong!

I'd have tried the patch (D4766) myself, but it's currently not able to apply without conflict resolution to either the master or ghc-8.6 branches, and I don't want to risk mucking something up trying to resolve conflicts!

comment:52 Changed 7 months ago by redv

Cc: redv added

I would also be keen on seeing some progress here. More libraries are using type families. I would like to use superrecord for example, but compile time performance is not good. At a guess, probably due to slow type families.

comment:53 Changed 7 months ago by danilo2

I'm just writing to tell that me and our team is waiting for this patch as well. We heavily use type families for our extensible records implementation (and other things) and our current build takes over an hour and consumes almost 30Gb of RAM. We have investigated it heavily and we are sure its related exactly to type families resolution.

comment:54 Changed 7 months ago by redv

Cc: danilo2 added

comment:55 Changed 6 months ago by bgamari

Milestone: 8.8.1

The current state of play is that there are some performance issues which need to be sorted out in the patch currently but I expect to be able to get to these by 8.8.

comment:56 Changed 6 months ago by _recursion

Thanks for the update Ben. We're very much looking forward to it.

comment:57 Changed 2 months ago by _recursion

Sorry to be an annoyance and ask about this again, but is it looking like this will land for 8.8?

comment:58 Changed 4 weeks ago by hussein.aitlahcen

Hello everyone,

My team experienced the same issue when using Squeal extensively (https://github.com/morphismtech/squeal). I found two things:

Because of that, GHC was taking 12gb+ to compile some files. I found that omitting the interfaces pragmas with -fomit-interface-pragmas was helping a lot. In fact, no more unfolded type in the interface file, making the dumped .txt file going from 3.5gb to 700kb with a maximum allocated memory of 2.5gb.

I already patched Stack to stop asking for a dump and directly read the binary interface file (the startup time is greatly improved) and will probably push a PR.

Finally, It looks like unfolding options for GHC does not apply to types (or I missed something?). I'm asking because of https://github.com/ghc/ghc/blob/886ddb27bfbbb52c41690cd29e2ab3ed80bf5450/compiler/coreSyn/CoreUnfold.hs#L583. Since we are now able to compute stuff at type level, it would be handy to be able to control the unfolding depending on something like typeLevelExprSize.

Thanks for reading

Last edited 4 weeks ago by hussein.aitlahcen (previous) (diff)

comment:59 in reply to:  58 Changed 9 days ago by pingu

Replying to hussein.aitlahcen:

Hello everyone,

My team experienced the same issue when using Squeal extensively (https://github.com/morphismtech/squeal).

My team appears to be experiencing the same issue using Beam migrations. Beam is a similar library to Squeal which makes heavy use of type families (https://tathougies.github.io/beam/)...

I found this ticket whilst trying to work out why GHC was spending so much time and memory simplifying.

... Because of that, GHC was taking 12gb+ to compile some files. I found that omitting the interfaces pragmas with -fomit-interface-pragmas was helping a lot. In fact, no more unfolded type in the interface file, making the dumped .txt file going from 3.5gb to 700kb with a maximum allocated memory of 2.5gb.

This flag reduced our maximum residency from ~= 14GB down to ~= 10GB, and compile times roughly halved. Thank you very much for this tip, it's stopped us swapping during compilation for now.

comment:60 Changed 9 days ago by simonpj

Ben: you have a patch in flight for this don't you? It'd be great to resurrect it.

comment:61 in reply to:  58 Changed 9 days ago by _recursion

Because of that, GHC was taking 12gb+ to compile some files. I found that omitting the interfaces pragmas with -fomit-interface-pragmas was helping a lot. In fact, no more unfolded type in the interface file, making the dumped .txt file going from 3.5gb to 700kb with a maximum allocated memory of 2.5gb.

Doing this has been an immense help for Luna. Compile times on my machine, which is no slouch, have gone from just over one hour to a hair over six minutes.

Note: See TracTickets for help on using tickets.