Opened 5 years ago
Last modified 4 weeks ago
#5642 merge bug
Deriving Generic of a big type takes a long time and lots of space
Reported by: | basvandijk | Owned by: | bgamari |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Compiler | Version: | 7.3 |
Keywords: | deriving-perf, Generics | Cc: | v.dijk.bas@…, dimitris@…, hackage.haskell.org@…, carter.schonwald@…, gregmainland@…, vhaisman@…, gidyn, RyanGlScott, akst, bgamari |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Compile-time performance bug | Test Case: | T5642 |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | Phab:D2304 | |
Wiki Page: |
Description
If I load the following module into ghci my system will run out of memory after about 15 minutes:
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics
data BigSum =
    C0 | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9
  | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
  | C20 | C21 | C22 | C23 | C24 | C25 | C26 | C27 | C28 | C29
  | C30 | C31 | C32 | C33 | C34 | C35 | C36 | C37 | C38 | C39
  | C40 | C41 | C42 | C43 | C44 | C45 | C46 | C47 | C48 | C49
  | C50 | C51 | C52 | C53 | C54 | C55 | C56 | C57 | C58 | C59
  | C60 | C61 | C62 | C63 | C64 | C65 | C66 | C67 | C68 | C69
  | C70 | C71 | C72 | C73 | C74 | C75 | C76 | C77 | C78 | C79
  | C80 | C81 | C82 | C83 | C84 | C85 | C86 | C87 | C88 | C89
  | C90 | C91 | C92 | C93 | C94 | C95 | C96 | C97 | C98 | C99
  | C100 | C101 | C102 | C103 | C104 | C105 | C106 | C107 | C108 | C109
  | C110 | C111 | C112 | C113 | C114 | C115 | C116 | C117 | C118 | C119
  | C120 | C121 | C122 | C123 | C124 | C125 | C126 | C127 | C128 | C129
  | C130 | C131 | C132 | C133 | C134 | C135 | C136 | C137 | C138 | C139
  | C140 | C141 | C142 | C143 | C144 | C145 | C146 | C147 | C148 | C149
  | C150 | C151 | C152 | C153 | C154 | C155 | C156 | C157 | C158 | C159
  | C160 | C161 | C162 | C163 | C164 | C165 | C166 | C167 | C168 | C169
  | C170 | C171 | C172 | C173 | C174 | C175 | C176 | C177 | C178 | C179
  | C180 | C181 | C182 | C183 | C184 | C185 | C186 | C187 | C188 | C189
  | C190 | C191 | C192 | C193 | C194 | C195 | C196 | C197 | C198 | C199
  | C200 | C201 | C202 | C203 | C204 | C205 | C206 | C207 | C208 | C209
  | C210 | C211 | C212 | C213 | C214 | C215 | C216 | C217 | C218 | C219
  | C220 | C221 | C222 | C223 | C224 | C225 | C226 | C227 | C228 | C229
  | C230 | C231 | C232 | C233 | C234 | C235 | C236 | C237 | C238 | C239
  | C240 | C241 | C242 | C243 | C244 | C245 | C246 | C247 | C248 | C249
  | C250 | C251 | C252 | C253 | C254 | C255 | C256 | C257 | C258 | C259
  | C260 | C261 | C262 | C263 | C264 | C265 | C266 | C267 | C268 | C269
  | C270 | C271 | C272 | C273 | C274 | C275 | C276 | C277 | C278 | C279
  | C280 | C281 | C282 | C283 | C284 | C285 | C286 | C287 | C288 | C289
  | C290 | C291 | C292 | C293 | C294 | C295 | C296 | C297 | C298 | C299
  deriving Generic
Big products have the same problem:
data BigProductderiving Generic
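To experiment with other sizes, test modules of this kind can be generated instead of typed out. The following is a minimal sketch: the function names, the module names BigSum and BigProduct, and the use of () as the field type of the product are assumptions made for illustration and are not taken from this ticket.

```haskell
import Data.List (intercalate)

-- | Source of a module declaring a sum type with n nullary constructors
-- C0 .. C(n-1) that derives Generic. Expects n >= 1.
bigSumModule :: Int -> String
bigSumModule n = unlines
  [ "{-# LANGUAGE DeriveGeneric #-}"
  , "module BigSum where"
  , "import GHC.Generics"
  , "data BigSum = " ++ intercalate " | " ["C" ++ show i | i <- [0 .. n - 1]]
  , "  deriving Generic"
  ]

-- | Source of a module declaring a single-constructor product with n fields.
-- The field type () is a placeholder chosen for this sketch (hypothetical).
bigProductModule :: Int -> String
bigProductModule n = unlines
  [ "{-# LANGUAGE DeriveGeneric #-}"
  , "module BigProduct where"
  , "import GHC.Generics"
  , "data BigProduct = BigProduct " ++ unwords (replicate n "()")
  , "  deriving Generic"
  ]
```

For example, writeFile "BigSum.hs" (bigSumModule 300) would write a module of the same shape as the 300-constructor type above.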
Change History (46)
comment:1 Changed 5 years ago by basvandijk
- Cc v.dijk.bas@… added
comment:2 follow-ups: ↓ 3 ↓ 5 Changed 5 years ago by clintm
comment:3 in reply to: ↑ 2 Changed 5 years ago by clintm
Replying to clintm:
Came to report this sort of thing too. I can't actually get this code to completely run ...
Forgot to mention, this is ghc 7.2.2
comment:4 Changed 5 years ago by dreixel
- Cc dreixel removed
- Owner set to dreixel
Thanks for reporting, I will look into it.
comment:5 in reply to: ↑ 2 Changed 5 years ago by basvandijk
Replying to clintm:
Came to report this sort of thing too. I can't actually get this code to completely run as my machine runs out of memory (12g) almost immediately. ...
This doesn't have anything to do with generics. You just forgot to give a definition for either show or showsPrec. Since these methods have default implementations which are defined in terms of each other you get an infinite loop which apparently overflows the stack.
Also note that the Show type class does not use DefaultSignatures and so does not have default generic implementations.
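The two points above can be sketched as follows. Show needs a hand-written show or showsPrec, whereas a class declared with DefaultSignatures can be instantiated with an empty instance. The class ConName, its helper GConName, and the type Color are hypothetical names invented for this sketch; only Show and the GHC.Generics pieces are real library names.

```haskell
{-# LANGUAGE DefaultSignatures #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE TypeOperators #-}
import GHC.Generics

data Color = Red | Green deriving Generic

-- Show has no generic default. Defining 'show' (or 'showsPrec') breaks the
-- cycle between the two default methods; an empty instance would loop.
instance Show Color where
  show Red   = "Red"
  show Green = "Green"

-- Hypothetical helper class that walks the generic representation.
class GConName f where
  gconName :: f p -> String

instance GConName f => GConName (D1 c f) where
  gconName (M1 x) = gconName x

instance (GConName f, GConName g) => GConName (f :+: g) where
  gconName (L1 x) = gconName x
  gconName (R1 x) = gconName x

instance Constructor c => GConName (C1 c f) where
  gconName m = conName m

-- Hypothetical class with a generic default, enabled by DefaultSignatures.
class ConName a where
  name :: a -> String
  default name :: (Generic a, GConName (Rep a)) => a -> String
  name = gconName . from

-- The empty instance is fine here because the default is generic,
-- not defined in terms of another method of the same class.
instance ConName Color
```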
comment:6 Changed 5 years ago by dreixel
- Owner changed from dreixel to dimitris
- Version changed from 7.2.1 to 7.3
I thought the problem was the generation of Generic instances, but it isn't. If you write the instance by hand, it still takes long. So the bottleneck is the performance of the constraint solver TcSimplify. Things have improved with the new coercion solver; see the following figures for compiling derive Generic on a datatype with 100/200/300 constructors:
Constructors | ghc-7.2.1 | HEAD
---|---|---
100 | 0m30.079s | 0m8.302s
200 | 3m44.419s | 1m1.631s
300 | 12m52.306s | 3m32.507s
But they could probably be much better, since the coercions introduced are not that complicated. I've notified Dimitrios, he'll look further into it.
comment:7 Changed 5 years ago by igloo
- Milestone set to 7.4.1
- Priority changed from normal to high
comment:8 follow-up: ↓ 10 Changed 5 years ago by simonpj
- difficulty set to Unknown
I think this is going to be a difficult one to solve. The underlying problem is that the types grow non-linearly with the program size. Why? Look at Section 2.3 of Scrap your type applications.
This is a fundamental problem with System F, so it's not easy for GHC to get around it. It shows up especially with deeply-nested sums and products, which is exactly what is generated by the generic stuff. I'm not sure what to do here.
Simon
comment:9 Changed 5 years ago by dreixel
I had the impression Dimitrios had already made some progress on this, and we were just waiting to turn it into a perf test for the testsuite before closing this ticket. I think the resulting performance is not great, but it's good enough for most cases, including ASTs with 100 or so constructors.
comment:10 in reply to: ↑ 8 ; follow-up: ↓ 11 Changed 5 years ago by basvandijk
Replying to simonpj:
I think this is going to be a difficult one to solve. The underlying problem is that the types grow non-linearly with the program size. Why? Look at Section 2.3 of Scrap your type applications.
Got it: N-ary data constructors like (C e1 e2 e3 e4 e5) are translated to nested pairs (Pair : ∀a,b. a → b → (a, b)). This causes a quadratic blow-up in size:
Pair σ1 (σ2,(σ3,(σ4,σ5))) e1 (Pair σ2 (σ3,(σ4,σ5)) e2 (Pair σ3 (σ4,σ5) e3 (Pair σ4 σ5 e4 e5)))
This is a fundamental problem with System F, so it's not easy for GHC to get around it. It shows up especially with deeply-nested sums and products, which is exactly what is generated by the generic stuff. I'm not sure what to do here.
So if I understand the paper correctly System IF would solve this by removing the redundant type applications using:
Pair ψ τ s t
→ Pair ψ s τ t by (ξ2)
→ Pair s τ t by (ξ1)
→ Pair s t by (ξ1)
Of course the question is: is it worth implementing System IF for only solving this problem? (I understand System IF did not significantly improve compile times for the base library).
comment:11 in reply to: ↑ 10 Changed 5 years ago by dreixel
Replying to basvandijk:
Got it: N-ary data constructors like (C e1 e2 e3 e4 e5) are translated to nested pairs (Pair : ∀a,b. a → b → (a, b)). This causes a quadratic blow-up in size:
Pair σ1 (σ2,(σ3,(σ4,σ5))) e1 (Pair σ2 (σ3,(σ4,σ5)) e2 (Pair σ3 (σ4,σ5) e3 (Pair σ4 σ5 e4 e5)))
No; we balance the sums and the products, so it grows with logarithmic complexity. See the second column of the 4th page of the original paper.
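To make the difference between right-nested and balanced encodings concrete, here is a toy cost model. It is only a sketch and does not claim to reproduce GHC's actual Core: it assumes that every application of a pair or injection constructor carries the full types of both components as type arguments, as in the Pair example quoted above. All names in it (Shape, typeSize, productCost, sumCost and so on) are hypothetical.

```haskell
-- | Shape of a nested binary product or sum over leaves.
data Shape = Leaf | Node Shape Shape

-- | Number of nodes in the type of a shape (each leaf counts as 1).
typeSize :: Shape -> Int
typeSize Leaf       = 1
typeSize (Node l r) = 1 + typeSize l + typeSize r

-- | Right-nested shape with n leaves: (a, (b, (c, ...))).
rightNested :: Int -> Shape
rightNested n
  | n <= 1    = Leaf
  | otherwise = Node Leaf (rightNested (n - 1))

-- | Balanced shape with n leaves.
balanced :: Int -> Shape
balanced n
  | n <= 1    = Leaf
  | otherwise = Node (balanced h) (balanced (n - h))
  where h = n `div` 2

-- | Type-argument size needed to build one value of a product: every Node
-- is applied once and carries the types of both children.
productCost :: Shape -> Int
productCost Leaf       = 0
productCost (Node l r) =
  typeSize l + typeSize r + productCost l + productCost r

-- | Type-argument size of each injection into a sum: one path from the
-- root to a leaf, where every step carries the types of both children.
pathCosts :: Shape -> [Int]
pathCosts Leaf       = [0]
pathCosts (Node l r) = map (+ here) (pathCosts l ++ pathCosts r)
  where here = typeSize l + typeSize r

-- | Total over all injections, i.e. over all constructors of the sum.
sumCost :: Shape -> Int
sumCost = sum . pathCosts
```

Under these assumptions productCost (rightNested n) is n*(n-1), while balancing shortens the paths. Comparing sumCost (balanced n) for a few values of n gives a feel for how the total over all constructors grows in this model.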
comment:12 Changed 4 years ago by igloo
- Test Case set to T5642
comment:13 Changed 4 years ago by igloo
- Milestone changed from 7.4.1 to 7.4.2
comment:14 Changed 4 years ago by simonpj
- Cc dimitris@… added
- Milestone changed from 7.4.2 to 7.6.1
There is still something odd here. Here's the output for HEAD, doing ghc T5642.hs -dshow-passes +RTS -s, for various sizes of input data type. If you do it yourself you'll see that 99% of the time is spent in the typechecker. The rest of compilation is fast.
So not only is it non-linear but the constant factor is terrible. It just can't take that long to typecheck this stuff!
I'm punting for 7.4, but we need to look again for 7.6.
Simon
100 constructors
simonpj@cam-05-unx:~/tmp$ ~/5builds/HEAD/inplace/bin/ghc-stage2 -c T5642.hs -dshow-passes +RTS -s
Glasgow Haskell Compiler, Version 7.5, stage 2 booted by GHC version 7.4.1
*** Checking old interface for main:T5642:
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization) = {terms: 3,498, types: 160,413, coercions: 1,305}
*** Simplifier:
Result size of Simplifier iteration=1 = {terms: 3,298, types: 110,113, coercions: 51,906}
Result size of Simplifier = {terms: 3,298, types: 110,113, coercions: 51,805}
*** Tidy Core:
Result size of Tidy Core = {terms: 3,298, types: 110,113, coercions: 51,805}
*** CorePrep:
Result size of CorePrep = {terms: 4,642, types: 159,413, coercions: 51,805}
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc46808_0/ghc46808_0.c
*** Deleting temp dirs:
3,189,810,752 bytes allocated in the heap
1,176,054,736 bytes copied during GC
70,837,720 bytes maximum residency (19 sample(s))
1,225,552 bytes maximum slop
158 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6039 colls, 0 par 1.99s 1.99s 0.0003s 0.0038s
Gen 1 19 colls, 0 par 1.45s 1.45s 0.0761s 0.2469s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.86s ( 5.95s elapsed)
GC time 3.44s ( 3.43s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 9.29s ( 9.38s elapsed)
Alloc rate 544,723,338 bytes per MUT second
Productivity 63.0% of total user, 62.4% of total elapsed
150 constructors
simonpj@cam-05-unx:~/tmp$ ~/5builds/HEAD/inplace/bin/ghc-stage2 -c T5642.hs -dshow-passes +RTS -s
Glasgow Haskell Compiler, Version 7.5, stage 2 booted by GHC version 7.4.1
*** Checking old interface for main:T5642:
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization) = {terms: 5,690, types: 401,601, coercions: 2,85}
*** Simplifier:
Result size of Simplifier iteration=1 = {terms: 5,370, types: 273,121, coercions: 131,46}
Result size of Simplifier = {terms: 5,370, types: 273,121, coercions: 130,885}
*** Tidy Core:
Result size of Tidy Core = {terms: 5,370, types: 273,121, coercions: 130,885}
*** CorePrep:
Result size of CorePrep = {terms: 7,738, types: 400,1, coercions: 130,885}
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc47478_0/ghc47478_0.c
*** Deleting temp dirs:
8,336,602,152 bytes allocated in the heap
4,066,619,088 bytes copied during GC
218,323,536 bytes maximum residency (33 sample(s))
3,689,368 bytes maximum slop
479 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15893 colls, 0 par 6.45s 6.45s 0.0004s 0.0033s
Gen 1 33 colls, 0 par 4.78s 4.78s 0.1448s 0.7436s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.45s ( 24.60s elapsed)
GC time 11.23s ( 11.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 35.68s ( 35.82s elapsed)
Alloc rate 340,990,013 bytes per MUT second
Productivity 68.5% of total user, 68.3% of total elapsed
200 constructors
*** Desugar:
Result size of Desugar (after optimization) = {terms: 7,170, types: 622,561, coercions: 2,605}
*** Simplifier:
Result size of Simplifier iteration=1 = {terms: 6,770, types: 421,961, coercions: 203,806}
Result size of Simplifier = {terms: 6,770, types: 421,961, coercions: 203,605}
*** Tidy Core:
Result size of Tidy Core = {terms: 6,770, types: 421,961, coercions: 203,605}
*** CorePrep:
Result size of CorePrep = {terms: 9,858, types: 620,561, coercions: 203,605}
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc47501_0/ghc47501_0.c
*** Deleting temp dirs:
13,262,925,720 bytes allocated in the heap
7,286,910,712 bytes copied during GC
298,657,992 bytes maximum residency (45 sample(s))
5,548,736 bytes maximum slop
640 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 25322 colls, 0 par 11.98s 11.97s 0.0005s 0.0034s
Gen 1 45 colls, 0 par 8.58s 8.59s 0.1908s 0.9460s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 50.07s ( 50.22s elapsed)
GC time 20.56s ( 20.55s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 70.63s ( 70.78s elapsed)
Alloc rate 264,903,443 bytes per MUT second
Productivity 70.9% of total user, 70.7% of total elapsed
comment:15 Changed 4 years ago by igloo
- Milestone changed from 7.6.1 to 7.6.2
comment:16 Changed 3 years ago by liyang
- Cc hackage.haskell.org@… added
comment:17 Changed 3 years ago by carter
- Cc carter.schonwald@… added
comment:18 Changed 2 years ago by thoughtpolice
- Priority changed from high to normal
Lowering priority (these tickets are assigned to older versions, so they're getting bumped as they've been around for a while).
comment:19 Changed 2 years ago by thoughtpolice
- Milestone changed from 7.6.2 to 7.10.1
Moving to 7.10.1.
comment:20 Changed 19 months ago by thoughtpolice
- Milestone changed from 7.10.1 to 7.12.1
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:21 Changed 11 months ago by thoughtpolice
- Milestone changed from 7.12.1 to 8.0.1
Milestone renamed
comment:22 Changed 10 months ago by basvandijk
Just a heads-up that a user of aeson just ran into this.
comment:23 Changed 10 months ago by bgamari
- Owner changed from dimitris to bgamari
I'll be looking into this.
comment:24 Changed 10 months ago by bgamari
For the record this sounds quite similar to #8095 in that the compiler seems to be creating long strings of coercions (although here the simplifier doesn't even seem capable of eliminating them).
comment:25 Changed 9 months ago by basvandijk
Another user of aeson also ran into this (or something which looks like this). However, instead of a big sum type he has a big product type.
The other interesting thing is that the issue only seems to appear with aeson-0.10.0.0.
comment:26 Changed 9 months ago by ghorn
- Cc gregmainland@… added
comment:27 Changed 7 months ago by ezyang
- Keywords deriving-perf added
comment:28 Changed 6 months ago by wilx
- Cc vhaisman@… added
comment:29 Changed 6 months ago by gidyn
- Cc gidyn added
comment:30 Changed 6 months ago by thomie
- Milestone 8.0.1 deleted
comment:31 Changed 6 months ago by thomie
- Keywords Generics added
comment:32 Changed 6 months ago by Ben Gamari <ben@…>
In 16cf460/ghc:
comment:33 Changed 3 months ago by nh2
I just filed a duplicate of this (#11991) with another easy-to-run repro that demonstrates that it's precisely O(n²) for sum types: https://github.com/nh2/ghc-generics-deriving-is-slow/
comment:34 Changed 3 months ago by bgamari
The profile from compiling Data400.hs (admittedly with a very old profiled compiler that I had lying around) from the repo given in comment:33 looks like this:
Wed Apr 27 23:26 2016 Time and Allocation Profiling Report (Final)
ghc +RTS -hc -p -RTS -B/opt/exp/ghc/roots/profiled/lib/ghc-8.1.20160222 Data400.hs -fforce-recomp
total time = 5.12 secs (5118 ticks @ 1000 us, 1 processor)
total alloc = 7,708,824,224 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
canEvVar TcCanonical 41.0 45.8
zonkTopDecls TcRnDriver 17.4 25.3
CorePrep HscMain 6.4 7.5
SimplTopBinds SimplCore 5.9 2.6
CoreTidy HscMain 5.5 5.6
Simplify SimplCore 4.2 0.0
deSugar HscMain 3.6 1.7
tc_rn_src_decls TcRnDriver 3.2 2.9
solve_loop TcInteract 2.5 0.7
OccAnal SimplCore 1.6 0.2
pprNativeCode AsmCodeGen 1.1 1.3
comment:35 Changed 3 months ago by simonpj
Do read comment:8. I think there is something fundamentally difficult here.
comment:36 Changed 3 months ago by RyanGlScott
- Cc RyanGlScott added
comment:37 Changed 7 weeks ago by akst
- Cc akst added
comment:38 Changed 7 weeks ago by RyanGlScott
- Differential Rev(s) set to Phab:D2304
I've found a little trick to help reduce the constant factors in deriving Generic. (See Phab:D2304 for more details.) One can factor out the topmost M1 from the derived definitions of to and from to reduce the number of times GHC has to typecheck an application of M1. This in turn leads to a reduction in the number of coercions, which lowers allocation.
I tested this trick against basvandijk's 300-constructor datatype, and it saves almost 3 seconds of compilation time and uses ~6.4 times fewer coercions.
comment:39 Changed 6 weeks ago by RyanGlScott
Per SPJ's request, I've reposted some sleuthing I did in the comments of Phab:D2304.
I decided to test my changes on the 300-constructor sum type mentioned at the top of this ticket. I manually implemented a Generic instance for this datatype three times, and put each one in its own file:
- Gen_v1.hs, which contains a Generic instance as GHC derives it currently (without the changes in Phab:D2304)
- Gen_v2.hs, which is like Gen_v1.hs except that it factors out the topmost M1 in from/to (i.e., with the changes in Phab:D2304)
- Gen_v3.hs, which is like Gen_v1.hs except that it both (1) factors out the topmost M1 in from/to and (2) factors out common occurrences of L1/R1 in to
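The Gen_v*.hs files themselves are not reproduced in this ticket, so the following is only a sketch of the three shapes on a four-constructor type. To keep it short, it reuses the Rep produced by deriving Generic and writes the from/to variants as standalone functions. The type T, its constructor names, and the function names fromV1, toV1, fromV2, toV2 and toV3 are all assumptions made for illustration.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

data T = Zero | One | Two | Three
  deriving (Eq, Show, Generic)

-- v1 shape: the topmost M1 is repeated in every equation.
fromV1 :: T -> Rep T x
fromV1 Zero  = M1 (L1 (L1 (M1 U1)))
fromV1 One   = M1 (L1 (R1 (M1 U1)))
fromV1 Two   = M1 (R1 (L1 (M1 U1)))
fromV1 Three = M1 (R1 (R1 (M1 U1)))

toV1 :: Rep T x -> T
toV1 (M1 (L1 (L1 (M1 U1)))) = Zero
toV1 (M1 (L1 (R1 (M1 U1)))) = One
toV1 (M1 (R1 (L1 (M1 U1)))) = Two
toV1 (M1 (R1 (R1 (M1 U1)))) = Three

-- v2 shape: the topmost M1 is factored out, so it is applied
-- (and matched) only once.
fromV2 :: T -> Rep T x
fromV2 x = M1 (case x of
  Zero  -> L1 (L1 (M1 U1))
  One   -> L1 (R1 (M1 U1))
  Two   -> R1 (L1 (M1 U1))
  Three -> R1 (R1 (M1 U1)))

toV2 :: Rep T x -> T
toV2 (M1 r) = case r of
  L1 (L1 (M1 U1)) -> Zero
  L1 (R1 (M1 U1)) -> One
  R1 (L1 (M1 U1)) -> Two
  R1 (R1 (M1 U1)) -> Three

-- v3 shape: like v2, but 'to' also shares the common L1/R1 prefixes
-- by nesting the case expressions.
toV3 :: Rep T x -> T
toV3 (M1 r) = case r of
  L1 l -> case l of
    L1 (M1 U1) -> Zero
    R1 (M1 U1) -> One
  R1 l -> case l of
    L1 (M1 U1) -> Two
    R1 (M1 U1) -> Three
```

All variants compute the same functions as the derived from and to; they differ only in how often M1, L1 and R1 appear in the source that GHC has to typecheck.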
I compiled each file with ghc -O2 -v3 +RTS -s and dumped the results to logs. Here are the highlights:
- Gen_v1.txt
  - {terms: 16,282, types: 2,563,921, coercions: 639,012}
  - 7,781,716,432 bytes allocated in the heap
  - Total time 8.708s ( 8.719s elapsed)
- Gen_v2.txt
  - {terms: 16,288, types: 2,924,492, coercions: 9,950}
  - 4,479,400,544 bytes allocated in the heap
  - Total time 5.580s ( 5.590s elapsed)
- Gen_v3.txt
  - {terms: 16,288, types: 2,924,492, coercions: 9,950}
  - 4,016,934,848 bytes allocated in the heap
  - Total time 4.990s ( 5.006s elapsed)
There is a huge difference between v1 and v2, as suspected. There is a difference between v2 and v3 in that it allocated fewer bytes on the heap, but interestingly, v3 has the exact same number of types and coercions, so I'm not sure where the improvement is coming from.
comment:40 Changed 6 weeks ago by bgamari
- Cc bgamari added
comment:41 Changed 5 weeks ago by Ben Gamari <ben@…>
In 9649fc0a/ghc:
comment:42 Changed 5 weeks ago by bgamari
- Status changed from new to merge
We might consider merging this for 8.0.2.
comment:43 follow-up: ↓ 44 Changed 5 weeks ago by gidyn
Gen_v3 not going in?
comment:44 in reply to: ↑ 43 Changed 5 weeks ago by RyanGlScott
Replying to gidyn:
Gen_v3 not going in?
At least, not for now. In Phab:D2304, SPJ also asked to repeat the same experiment in comment:39, but with a four-constructor datatype:
data Bigsum = C0 | C1 | C2 | C3 -- Perhaps not the best name in hindsight...
As above, here are Gen_v1.hs, Gen_v2.hs, and Gen_v3.hs, and here are their logs:
- Gen_v1.txt
  - {terms: 143, types: 562, coercions: 239}
  - 94,739,056 bytes allocated in the heap
  - Total time 0.201s ( 0.191s elapsed)
- Gen_v2.txt
  - {terms: 148, types: 638, coercions: 174}
  - 93,979,880 bytes allocated in the heap
  - Total time 0.196s ( 0.187s elapsed)
- Gen_v3.txt
  - {terms: 148, types: 638, coercions: 174}
  - 94,281,768 bytes allocated in the heap
  - Total time 0.203s ( 0.192s elapsed)
Curiously, in this example, v3 fares worse than v2. Again, I'm not sure what's going on here, and with a sample size of two, it's difficult to conclude if the strategy in v3 is always better than v2. I suspect it's asymptotically better, since we go from an O(n) number of L1s/R1s to an O(log n) number, but I'd need more evidence to support that claim.
comment:45 Changed 4 weeks ago by rrnewton
Small addendum. If you split this up and compile the "from" and "to" methods separately, it is clear that most of the work is in the "from". On the small four-constructor examples:
Version 1:
- 239 coercions with both to/from.
- 200 coercions with just "from"
- 39 coercions with just "to"
Version 2:
- 174 coercions with both to/from.
- 132 with just from
- 42 with just "to"
And on the full-sized, 300-way sums, here are the realtime measurements on my personal machine:
Version 1:
- both: 639K coercions, 7.0s
- from-only: 636.9K coercions, 5.7s
- to-only: 2K coercions, 1.8s
Version 2:
- both: 9.9K coercions, 4.4s
- from-only: 7.5K coercions, 3.6s
- to-only: 2.4K coercions, 1.0s
At least in the bigger case, it looks like there is a drop in realtime for both of the from/to methods, with the former being constructor-heavy, and the latter being deep-pattern-match-heavy.
Slicing it another way, if we cut down the cases in the 300-way match for "from-only"/v1 -- leaving incomplete pattern matches -- the coercions/time drops as follows:
- 300 cases -- 636.9K coercions, 5.7s
- 150 cases -- 316K coercions, 4.0s
- 100 cases -- 212K coercions, 2.5s
- 50 cases -- 106K coercions, 1.3s
- 25 cases -- 53K coercions, 0.8s
And "from-only"/v2:
- 300 cases -- 7.5K coercions, 3.7s
- 150 cases -- 5.8K coercions, 2.8s
- 50 cases -- 4.8K coercions, 1.1s
- 25 cases -- 4.5K coercions, 0.7s
Ok, so nothing that unusual in this dimension. As should be expected, the "factored" v2 gets a larger benefit over v1 the more cases the factored bit is amortized over.
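The scaling in the two lists above can be checked with a little arithmetic. The sketch below copies the "from-only" figures (coercions in thousands) and computes coercions per case and the average slope between the smallest and largest run. The function and value names are hypothetical.

```haskell
-- (number of cases, coercions in thousands), copied from the lists above.
fromOnlyV1, fromOnlyV2 :: [(Double, Double)]
fromOnlyV1 = [(300, 636.9), (150, 316), (100, 212), (50, 106), (25, 53)]
fromOnlyV2 = [(300, 7.5), (150, 5.8), (50, 4.8), (25, 4.5)]

-- | Coercions (in thousands) divided by the number of cases, per row.
perCase :: [(Double, Double)] -> [Double]
perCase rows = [k / n | (n, k) <- rows]

-- | Average extra coercions (in thousands) per added case, measured
-- between the first and the last row; 0 for an empty list.
marginal :: [(Double, Double)] -> Double
marginal rows = case (rows, reverse rows) of
  ((n1, k1) : _, (n0, k0) : _) | n1 /= n0 -> (k1 - k0) / (n1 - n0)
  _                                       -> 0
```

For v1 every row comes out at about 2.1K coercions per case, so the count is close to proportional to the number of cases. For v2 the slope is roughly 0.011K, that is on the order of ten coercions per added case on top of a fixed cost of a bit over 4K.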
comment:46 Changed 4 weeks ago by rrnewton
One more small experiment. Before diving into the compiler internals, I thought it might be possible to do standard test case minimization. I.e. can we create some stub definitions for the relevant bits of GHC.Generics, while preserving the bad behavior?
The file in this gist attempts to do so, using the following stub definitions:
class Generic a where
  from :: a -> M1 a
data M1 a where
  M1 :: b -> M1 a
data U1 p = U1
data (:+:) (f :: * -> *) (g :: * -> *) p = L1 (f p) | R1 (g p)
This is a lower bound -- it fails to produce the really slow behavior. It is simplified enough that the time for compiling just the "from" case drops from 5.7s all the way down to 0.85s. Getting back to the bad behavior will require putting more of the complexity of "M1" back into play...