Opened 3 years ago
Closed 2 years ago
#7436 closed bug (fixed)
Derived Foldable and Traversable instances become extremely inefficient due to eta-expansion
Reported by: | shachaf | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.6.3 |
Component: | Compiler | Version: | 7.6.1 |
Keywords: | Cc: | patrick@…, twanvl, dreixel, hackage.haskell.org@…, jwlato@… | |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Runtime performance bug | Test Case: | perf/should_runt/T7436 |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): |
Description (last modified by simonpj)
The following program:
{-# LANGUAGE DeriveFunctor, DeriveFoldable #-} import Prelude hiding (foldr) import Data.Foldable data List a = Nil | Cons a (List a) deriving (Functor, Foldable) mkList :: Int -> List Int mkList 0 = Nil mkList n = Cons n (mkList (n-1)) main :: IO () main = print $ foldr (\x y -> y) "end" (mkList n) where n = 100000
Takes n^2 time to run with GHC 7.6.1 -O2.
The generated Foldable code looks something like this:
instance Foldable List where foldr f z Nil = z foldr f z (Cons x xs) = f x (foldr (\a b -> f a b) z xs)
Eta-reducing the function, i.e.
instance Foldable List where foldr f z Nil = z foldr f z (Cons x xs) = f x (foldr f z xs)
Makes the program linear in n (in this case, runtime goes from 8.160s to 0.004s).
The Traversable instance also has the same issue.
There seem to be three different issues:
- Derived Foldable and Traversable instances are nearly unusable for large structures.
- An eta-expanded definition like foldr becomes asymptotically worse for some reason. Maybe this is expected behavior for this function, since f gets eta-expanded at each iteration?
- Foldable instances are generated with foldr instead of foldMap.
This isn't directly related, since the code would have the same problem either way, but since I'm already writing about it... foldMap can allow asymptotically better operations on a structure than foldr (for example, finding the rightmost leaf of a binary tree using Data.Monoid.Last), so it should probably be generated instead. A foldMap definition should look like a simpler version of traverse, which is already derivable. Maybe this should be a separate ticket.
Attachments (2)
Change History (26)
comment:1 Changed 3 years ago by simonpj
- Description modified (diff)
- difficulty set to Unknown
comment:2 Changed 3 years ago by simonpj
comment:3 Changed 3 years ago by shachaf
The two instances I gave work as as they are without deriving (no Core necessary -- the code is a simplified version of what -ddump-deriv generated). Here's a simpler version without type classes (or lists):
data Nat = Z | S Nat mkNat :: Int -> Nat mkNat 0 = Z mkNat n = S (mkNat (n-1)) unNat :: Nat -> () unNat Z = () unNat (S n) = unNat n fast :: (b -> b) -> b -> Nat -> b fast s z Z = z fast s z (S n) = s (fast s z n) slow :: (b -> b) -> b -> Nat -> b slow s z Z = z slow s z (S n) = s (slow (\e -> s e) z n) main :: IO () --main = print $ unNat . fast S Z . mkNat $ n main = print $ unNat . slow S Z . mkNat $ n where n = 100000
And the +RTS -s output for both:
- Fast:
shachaf@carbon:~/9$ ghc -rtsopts -O2 C.hs && time ./C +RTS -s [1 of 1] Compiling Main ( C.hs, C.o ) Linking C ... () 9,651,768 bytes allocated in the heap 10,904 bytes copied during GC 44,416 bytes maximum residency (2 sample(s)) 21,120 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 17 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.00s ( 0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.00s ( 0.00s elapsed) %GC time 0.0% (9.8% elapsed) Alloc rate 2,412,942,000 bytes per MUT second Productivity 100.0% of total user, 116.6% of total elapsed real 0m0.005s user 0m0.004s sys 0m0.000s
- Slow:
shachaf@carbon:~/9$ ghc -rtsopts -O2 C.hs && time ./C +RTS -s [1 of 1] Compiling Main ( C.hs, C.o ) Linking C ... () 11,251,768 bytes allocated in the heap 4,122,872 bytes copied during GC 1,223,248 bytes maximum residency (3 sample(s)) 528,816 bytes maximum slop 4 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 19 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s Gen 1 3 colls, 0 par 0.00s 0.00s 0.0006s 0.0016s INIT time 0.00s ( 0.00s elapsed) MUT time 8.23s ( 8.25s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.24s ( 8.25s elapsed) %GC time 0.0% (0.1% elapsed) Alloc rate 1,366,747 bytes per MUT second Productivity 100.0% of total user, 99.8% of total elapsed real 0m8.253s user 0m8.237s sys 0m0.000s
comment:4 Changed 3 years ago by simonpj
Thanks for the simple version. Truly bizarre. The slow code allocate ONLY 11 MBYTE, and yet takes EIGHT SECONDS to run. No time to look today, but if anyone cares to investigate that'd be great. (One thing to try would be profiling, though that might make the bug disappear.)
Simon
comment:5 Changed 3 years ago by shachaf
I'd think what I said above about the eta-expansion happening at each iteration is sufficient to explain it.
The reduction should happen something like this (not all at once, of course, but this is enough to demonstrate what I mean):
slow S Z $ S (S (S (S Z))) S $ slow (\a -> S a) Z $ S (S (S Z)) S $ (\a -> S a) $ slow (\b -> (\a -> S a) b) Z $ S (S Z) S $ (\a -> S a) $ (\b -> (\a -> S a) b) $ slow (\c -> (\b -> (\a -> S a) b)) Z $ (S Z) S $ (\a -> S a) $ (\b -> (\a -> S a) b) $ (\c -> (\b -> (\a -> S a) b)) $ slow (\d -> (\c -> (\b -> (\a -> S a) b))) Z $ Z S $ (\a -> S a) $ (\b -> (\a -> S a) b) $ (\c -> (\b -> (\a -> S a) b)) $ Z
So we'd expect a tower of 1+2+3+...+n-1 reductions of the lambda.
The profiling results seem to agree (sum [1..100000-1] = 4999950000):
shachaf@carbon:~/9$ ghc -prof -fprof-auto -rtsopts -O2 C.hs && time ./C +RTS -p () real 1m3.477s user 1m3.372s sys 0m0.004s shachaf@carbon:~/9$ cat C.prof Wed Nov 21 06:21 2012 Time and Allocation Profiling Report (Final) C +RTS -p -RTS total time = 49.32 secs (49319 ticks @ 1000 us, 1 processor) total alloc = 11,249,528 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc slow.\ Main 100.0 14.2 slow Main 0.0 49.8 mkNat Main 0.0 35.6 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 45 0 0.0 0.0 100.0 100.0 main Main 91 0 0.0 0.1 0.0 0.1 CAF Main 89 0 0.0 0.0 100.0 99.6 main Main 90 1 0.0 0.0 100.0 99.6 mkNat Main 95 100001 0.0 35.6 0.0 35.6 slow Main 94 100001 0.0 49.8 100.0 64.0 slow.\ Main 96 4999950000 100.0 14.2 100.0 14.2 unNat Main 93 100001 0.0 0.0 0.0 0.0 main.n Main 92 1 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 88 0 0.0 0.3 0.0 0.3 CAF GHC.Show 87 0 0.0 0.0 0.0 0.0 CAF GHC.Conc.Signal 86 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding 76 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding.Iconv 75 0 0.0 0.0 0.0 0.0
comment:6 Changed 3 years ago by shachaf
I just noticed that DeriveFunctor has the same issue, in a slightly less severe form (you have to look at all the elements to do quadratic work). "slow" is the instance that would be derived.
data List a = Nil | Cons a (List a) mkList :: Int -> List Int mkList 0 = Nil mkList n = Cons n (mkList (n-1)) sumList :: List Int -> Int sumList = go 0 where go a Nil = a go a (Cons n ns) = a `seq` go (a+n) ns slow :: (a -> b) -> List a -> List b slow f Nil = Nil slow f (Cons x xs) = Cons (f x) (slow (\e -> f e) xs) fast :: (a -> b) -> List a -> List b fast f Nil = Nil fast f (Cons x xs) = Cons (f x) (fast f xs) main :: IO () main = print $ sumList . slow id $ mkList n where n = 100000
comment:7 Changed 3 years ago by twanvl
The current deriving code uses 'holes'. For example, the fmap derivation has the type:
ft_fmap :: FFoldType (LHsExpr RdrName -> State [RdrName] (LHsExpr RdrName))
The deriviation takes an expresion (LHsExpr RdrName) to plug in the hole, and returns the total expression. For example, fmap f for the type "[a]" is \xs -> [|map f $xs|]. Note that the lambda is at the meta level. This also means that we can't just use f itself, we have to use \x -> [|f $x|].
When this is passed as an argument to a higher order function such as fmap or foldr, this lambda must be brought to the ast level. This is done by wrapping it in a concrete lambda. Since there is no way to tell wether the thing we are making concrete is in eta reducable form. Of course we could have a data type to capture eta-reducable expressions.
It would be much simpler to only have concrete lambdas. I.e. to derive [|\xs -> map f xs|] or without eta expanding, [|map f|]. The only downside is that the generated code becomes a bit weirder. For example, for the type:
data Foo a = Foo Int (Bar a) (a,a,a)
we currently generate
fmap f (Foo x y z) = Foo x (fmap (\a -> f a) y) (case z of (a,b,c) -> (f a, f b, f c))
but this would become instead:
fmap f (Foo x y z) = Foo ((\x' -> x') x) (fmap f y) ((\z' -> case z' of (a,b,c) -> (f a, f b, f c)) z)
Similarly for foldr, where we now generate
foldr f a0 (Foo x y z) = f x (foldr (\u v -> f u v) (case z of (a,b,c) -> f a (f b (f c a0))) y)
this would become
foldr f a0 (Foo x y z) = f x $ (\y' a0' -> foldr f a0' y') y $ (\z' a0' -> case z' of (a,b,c) -> f a (f b (f c a0)) ) z $ a0
This shouldn't be a very difficult change, it might even simplify the code a bit. If I have some free time this weekend, and if I manage to get a working ghc build, I'll give it a try.
comment:8 Changed 3 years ago by parcs
- Cc patrick@… added
Changed 3 years ago by twanvl
[PATCH] Changed deriving of Functor, Foldable, Traversable to fix #7436. Added foldMap to derived Foldable instance.
comment:9 Changed 3 years ago by twanvl
- Status changed from new to patch
I have done what I described in my previous comment, and attached a patch.
I have also added foldMap to the generated Foldable instance, which I agree is a more natural function.
comment:10 Changed 3 years ago by simonpj
- Cc twanvl dreixel added
OK, let's wrap this one up.
- Thank you for the patch. Are you confident that it's good to go? Pedro I believe that you were responsible for at least some of the Traversable deriving code; can you give an opinion?
- I wonder if you could add some comments to explain the construction? As you say, it may generate slightly strange code. Something like Note [Avoid eta-expanded code] with a compact explanation, an example, and a pointer to this ticket, would be useful.
- Are there are there any library changes?
- I looked at the eta-expansion thing. Yes, GHC is doing the right thing here, as you show. You might wonder why GHC doesn't eta-reduce (\e -> s e) to s. The reason is that doing so is unsound if s is bottom; then eta-reduction might turn a terminating program into a non-terminating one.
Then I'll commit it.
Simon
comment:11 Changed 3 years ago by dreixel
I don't think I ever wrote any of the deriving code apart from Generic and Typeable.
comment:12 Changed 3 years ago by twanvl
I wrote the original deriving code for Functor/Foldable/Traversable. Someone else later cleaned it up a bit.
The strange code that is generated are lambdas applied to arguments, which is something that GHC should be able to optimize away easily. To avoid the eta-expansion, the generated code is always a function. So you end up with code like
data Cons a = Cons (a,a) fmap f (Cons x) = (\x -> case x of (x1,x2) -> (f x1 , f x2)) x
I could add this example to a note, if you insist.
There are no library changes in the patch, and the derived code is functionally equivalent to what was there before.
comment:13 Changed 3 years ago by simonpj
Yes, I think it'd be very helpful to add a Note [Avoid eta-expanded code], explaining the problem, giving an example, and pointing to this ticket.
Thank you.
Simon
comment:14 Changed 3 years ago by twanvl
I added an explanation of the behaviour and a link to this ticket to the description at the top of the Functor deriving code. I did not make it a separate note, since there is no reference to it from inside the code. The patch that adds this note is attached.
Changed 3 years ago by twanvl
comment:15 Changed 3 years ago by simonpj
Thanks. I'll apply.
comment:16 Changed 3 years ago by twanvl@…
commit 49ca2a37bef18aa57235ff1dbbf1cc0434979b1e
Author: Twan van Laarhoven <[email protected]> Date: Fri Nov 23 15:03:45 2012 +0100 Changed deriving of Functor, Foldable, Traversable to fix #7436. Added foldMap to derived Foldable instance. The derived instances will no longer eta-expand the function. I.e. instead of fmap f (Foo a) = Foo (fmap (\x -> f x) a) we now derive fmap f (Foo a) = Foo (fmap f a) Some superflous lambdas are generated as a result. For example data X a = X (a,a) fmap f (X x) = (\y -> case y of (a,b) -> (f a, f b)) x The optimizer should be able to simplify this code, as it is just beta reduction. The derived Foldable instance now includes foldMap in addition to foldr. compiler/prelude/PrelNames.lhs | 9 ++- compiler/typecheck/TcGenDeriv.lhs | 178 ++++++++++++++++++++++--------------- 2 files changed, 114 insertions(+), 73 deletions(-)
comment:17 Changed 3 years ago by simonpj
- Resolution set to fixed
- Status changed from patch to closed
- Test Case set to perf/should_runt/T7436
Also:
commit 3d51f271e6819c52508956f2426c4c19dec0b2fb Author: Twan van Laarhoven <[email protected]> Date: Thu Jan 3 16:24:42 2013 +0100 Added note explaining the lambdas generated by functor deriving code, and how it compares to the old deriving code which used eta expansion.
Thanks for the patches!
I added a test in perf/should_run. Interestingly the massive difference is run-time, not in allocation or residency.
Simon
comment:18 Changed 3 years ago by liyang
- Cc hackage.haskell.org@… added
Since this issue affects a large number of users and doesn't seem too contentious, could this fix be included in 7.6.2?
comment:19 Changed 3 years ago by simonpj
- Status changed from closed to merge
I'd be happy to see this in 7.6.2 if there is still time to merge it, Ian.
comment:20 Changed 3 years ago by jwlato
- Cc jwlato@… added
comment:21 Changed 2 years ago by igloo
- Milestone set to 7.6.3
comment:22 Changed 2 years ago by simonpj
This one is fixed, just awaiting merging into 7.6.3
comment:23 Changed 2 years ago by carter
was this merged into 7.6? else wise, should it be merged into 7.8/head?
comment:24 Changed 2 years ago by monoidal
- Status changed from merge to closed
It has been fixed in HEAD long time ago, but was not merged into 7.6. Since 7.6.4 is not officially planned, I'm closing the ticket.
I'm very surprised that the eta-expanded version has asymptotically worse behaviour.
Could you generate an example with hand-written code (not generated by deriving) that elicits this behaviour? I'm gueesing that'd be simple do to, and it'd save wading through mountains of irrelevant Core. The smaller and more standalone the test case, the better.
I'm vaguely wondering whether it might be a space issue; ie takes lots of space for chains of partial applications etc. Maybe you can use +RTS -s for each version and give us the output. Thanks!
Simon