Opened 17 months ago

Closed 7 months ago

#7436 closed bug (fixed)

Derived Foldable and Traversable instances become extremely inefficient due to eta-expansion

Reported by: shachaf Owned by:
Priority: normal Milestone: 7.6.3
Component: Compiler Version: 7.6.1
Keywords: Cc: patrick@…, twanvl, dreixel, hackage.haskell.org@…, jwlato@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: perf/should_runt/T7436 Blocked By:
Blocking: Related Tickets:

Description (last modified by simonpj)

The following program:

{-# LANGUAGE DeriveFunctor, DeriveFoldable #-}
import Prelude hiding (foldr)
import Data.Foldable

data List a = Nil | Cons a (List a)
    deriving (Functor, Foldable)

mkList :: Int -> List Int
mkList 0 = Nil
mkList n = Cons n (mkList (n-1))

main :: IO ()
main = print $ foldr (\x y -> y) "end" (mkList n)
  where n = 100000

Takes n^2 time to run with GHC 7.6.1 -O2.

The generated Foldable code looks something like this:

instance Foldable List where
    foldr f z Nil = z
    foldr f z (Cons x xs) = f x (foldr (\a b -> f a b) z xs)

Eta-reducing the function, i.e.

instance Foldable List where
    foldr f z Nil = z
    foldr f z (Cons x xs) = f x (foldr f z xs)

Makes the program linear in n (in this case, runtime goes from 8.160s to 0.004s).

The Traversable instance also has the same issue.

There seem to be three different issues:

  • Derived Foldable and Traversable instances are nearly unusable for large structures.
  • An eta-expanded definition like foldr becomes asymptotically worse for some reason. Maybe this is expected behavior for this function, since f gets eta-expanded at each iteration?
  • Foldable instances are generated with foldr instead of foldMap.

This isn't directly related, since the code would have the same problem either
way, but since I'm already writing about it... foldMap can allow
asymptotically better operations on a structure than foldr (for example,
finding the rightmost leaf of a binary tree using Data.Monoid.Last), so it
should probably be generated instead. A foldMap definition should look like a
simpler version of traverse, which is already derivable. Maybe this should be
a separate ticket.

Attachments (2)

0001-Changed-deriving-of-Functor-Foldable-Traversable-to-.patch (19.2 KB) - added by twanvl 17 months ago.
[PATCH] Changed deriving of Functor, Foldable, Traversable to fix #7436. Added foldMap to derived Foldable instance.
0002-Added-note-explaining-the-lambdas-generated-by-funct.patch (1.9 KB) - added by twanvl 16 months ago.

Download all attachments as: .zip

Change History (26)

comment:1 Changed 17 months ago by simonpj

  • Description modified (diff)
  • Difficulty set to Unknown

comment:2 Changed 17 months ago by simonpj

I'm very surprised that the eta-expanded version has asymptotically worse behaviour.

Could you generate an example with hand-written code (not generated by deriving) that elicits this behaviour? I'm gueesing that'd be simple do to, and it'd save wading through mountains of irrelevant Core. The smaller and more standalone the test case, the better.

I'm vaguely wondering whether it might be a space issue; ie takes lots of space for chains of partial applications etc. Maybe you can use +RTS -s for each version and give us the output. Thanks!

Simon

comment:3 Changed 17 months ago by shachaf

The two instances I gave work as as they are without deriving (no Core
necessary -- the code is a simplified version of what -ddump-deriv
generated). Here's a simpler version without type classes (or lists):

data Nat = Z | S Nat

mkNat :: Int -> Nat
mkNat 0 = Z
mkNat n = S (mkNat (n-1))

unNat :: Nat -> ()
unNat Z = ()
unNat (S n) = unNat n

fast :: (b -> b) -> b -> Nat -> b
fast s z Z = z
fast s z (S n) = s (fast s z n)

slow :: (b -> b) -> b -> Nat -> b
slow s z Z = z
slow s z (S n) = s (slow (\e -> s e) z n)

main :: IO ()
--main = print $ unNat . fast S Z . mkNat $ n
main = print $ unNat . slow S Z . mkNat $ n
  where n = 100000

And the +RTS -s output for both:

  • Fast:
shachaf@carbon:~/9$ ghc -rtsopts -O2 C.hs && time ./C +RTS -s
[1 of 1] Compiling Main             ( C.hs, C.o )
Linking C ...
()
       9,651,768 bytes allocated in the heap
          10,904 bytes copied during GC
          44,416 bytes maximum residency (2 sample(s))
          21,120 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        17 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.00s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.00s  (  0.00s elapsed)

  %GC     time       0.0%  (9.8% elapsed)

  Alloc rate    2,412,942,000 bytes per MUT second

  Productivity 100.0% of total user, 116.6% of total elapsed


real	0m0.005s
user	0m0.004s
sys	0m0.000s
  • Slow:
shachaf@carbon:~/9$ ghc -rtsopts -O2 C.hs && time ./C +RTS -s
[1 of 1] Compiling Main             ( C.hs, C.o )
Linking C ...
()
      11,251,768 bytes allocated in the heap
       4,122,872 bytes copied during GC
       1,223,248 bytes maximum residency (3 sample(s))
         528,816 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        19 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s
  Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0016s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    8.23s  (  8.25s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.24s  (  8.25s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    1,366,747 bytes per MUT second

  Productivity 100.0% of total user, 99.8% of total elapsed


real	0m8.253s
user	0m8.237s
sys	0m0.000s

comment:4 Changed 17 months ago by simonpj

Thanks for the simple version. Truly bizarre. The slow code allocate ONLY 11 MBYTE, and yet takes EIGHT SECONDS to run. No time to look today, but if anyone cares to investigate that'd be great. (One thing to try would be profiling, though that might make the bug disappear.)

Simon

comment:5 Changed 17 months ago by shachaf

I'd think what I said above about the eta-expansion happening at each iteration
is sufficient to explain it.

The reduction should happen something like this (not all at once, of course,
but this is enough to demonstrate what I mean):

slow S Z
  $ S (S (S (S Z)))

S $ slow (\a -> S a) Z
  $ S (S (S Z))

S $ (\a -> S a)
  $ slow (\b -> (\a -> S a) b) Z
  $ S (S Z)

S $ (\a -> S a)
  $ (\b -> (\a -> S a) b)
  $ slow (\c -> (\b -> (\a -> S a) b)) Z
  $ (S Z)

S $ (\a -> S a)
  $ (\b -> (\a -> S a) b)
  $ (\c -> (\b -> (\a -> S a) b))
  $ slow (\d -> (\c -> (\b -> (\a -> S a) b))) Z
  $ Z

S $ (\a -> S a)
  $ (\b -> (\a -> S a) b)
  $ (\c -> (\b -> (\a -> S a) b))
  $ Z

So we'd expect a tower of 1+2+3+...+n-1 reductions of the lambda.

The profiling results seem to agree (sum [1..100000-1] = 4999950000):

shachaf@carbon:~/9$ ghc -prof -fprof-auto -rtsopts -O2 C.hs && time ./C +RTS -p
()

real	1m3.477s
user	1m3.372s
sys	0m0.004s
shachaf@carbon:~/9$ cat C.prof
	Wed Nov 21 06:21 2012 Time and Allocation Profiling Report  (Final)

	   C +RTS -p -RTS

	total time  =       49.32 secs   (49319 ticks @ 1000 us, 1 processor)
	total alloc =  11,249,528 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

slow.\      Main      100.0   14.2
slow        Main        0.0   49.8
mkNat       Main        0.0   35.6


                                                       individual     inherited
COST CENTRE  MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                     45           0    0.0    0.0   100.0  100.0
 main        Main                     91           0    0.0    0.1     0.0    0.1
 CAF         Main                     89           0    0.0    0.0   100.0   99.6
  main       Main                     90           1    0.0    0.0   100.0   99.6
   mkNat     Main                     95      100001    0.0   35.6     0.0   35.6
   slow      Main                     94      100001    0.0   49.8   100.0   64.0
    slow.\   Main                     96  4999950000  100.0   14.2   100.0   14.2
   unNat     Main                     93      100001    0.0    0.0     0.0    0.0
   main.n    Main                     92           1    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Handle.FD         88           0    0.0    0.3     0.0    0.3
 CAF         GHC.Show                 87           0    0.0    0.0     0.0    0.0
 CAF         GHC.Conc.Signal          86           0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding          76           0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding.Iconv    75           0    0.0    0.0     0.0    0.0

comment:6 Changed 17 months ago by shachaf

I just noticed that DeriveFunctor has the same issue, in a slightly less severe form (you have to look at all the elements to do quadratic work). "slow" is the instance that would be derived.

data List a = Nil | Cons a (List a)

mkList :: Int -> List Int
mkList 0 = Nil
mkList n = Cons n (mkList (n-1))

sumList :: List Int -> Int
sumList = go 0
  where
    go a Nil = a
    go a (Cons n ns) = a `seq` go (a+n) ns

slow :: (a -> b) -> List a -> List b
slow f Nil = Nil
slow f (Cons x xs) = Cons (f x) (slow (\e -> f e) xs)

fast :: (a -> b) -> List a -> List b
fast f Nil = Nil
fast f (Cons x xs) = Cons (f x) (fast f xs)

main :: IO ()
main = print $ sumList . slow id $ mkList n
  where n = 100000

comment:7 Changed 17 months ago by twanvl

The current deriving code uses 'holes'. For example, the fmap derivation has the type:

ft_fmap :: FFoldType (LHsExpr RdrName -> State [RdrName] (LHsExpr RdrName))

The deriviation takes an expresion (LHsExpr RdrName) to plug in the hole, and returns the total expression.
For example, fmap f for the type "[a]" is \xs -> [|map f $xs|]. Note that the lambda is at the meta level. This also means that we can't just use f itself, we have to use \x -> [|f $x|].

When this is passed as an argument to a higher order function such as fmap or foldr, this lambda must be brought to the ast level. This is done by wrapping it in a concrete lambda. Since there is no way to tell wether the thing we are making concrete is in eta reducable form. Of course we could have a data type to capture eta-reducable expressions.

It would be much simpler to only have concrete lambdas. I.e. to derive [|\xs -> map f xs|] or without eta expanding, [|map f|].
The only downside is that the generated code becomes a bit weirder. For example, for the type:

data Foo a = Foo Int (Bar a) (a,a,a)

we currently generate

fmap f (Foo x y z) = Foo x (fmap (\a -> f a) y) (case z of (a,b,c) -> (f a, f b, f c))

but this would become instead:

fmap f (Foo x y z) = Foo ((\x' -> x') x) (fmap f y) ((\z' -> case z' of (a,b,c) -> (f a, f b, f c)) z)

Similarly for foldr, where we now generate

foldr f a0 (Foo x y z) = f x (foldr (\u v -> f u v) (case z of (a,b,c) -> f a (f b (f c a0))) y)

this would become

foldr f a0 (Foo x y z)
       = f x
       $ (\y' a0' -> foldr f a0' y') y
       $ (\z' a0' -> case z' of (a,b,c) -> f a (f b (f c a0)) ) z
       $ a0

This shouldn't be a very difficult change, it might even simplify the code a bit. If I have some free time this weekend, and if I manage to get a working ghc build, I'll give it a try.

comment:8 Changed 17 months ago by parcs

  • Cc patrick@… added

Changed 17 months ago by twanvl

[PATCH] Changed deriving of Functor, Foldable, Traversable to fix #7436. Added foldMap to derived Foldable instance.

comment:9 Changed 17 months ago by twanvl

  • Status changed from new to patch

I have done what I described in my previous comment, and attached a patch.

I have also added foldMap to the generated Foldable instance, which I agree is a more natural function.

comment:10 Changed 16 months ago by simonpj

  • Cc twanvl dreixel added

OK, let's wrap this one up.

  • Thank you for the patch. Are you confident that it's good to go? Pedro I believe that you were responsible for at least some of the Traversable deriving code; can you give an opinion?
  • I wonder if you could add some comments to explain the construction? As you say, it may generate slightly strange code. Something like Note [Avoid eta-expanded code] with a compact explanation, an example, and a pointer to this ticket, would be useful.
  • Are there are there any library changes?
  • I looked at the eta-expansion thing. Yes, GHC is doing the right thing here, as you show. You might wonder why GHC doesn't eta-reduce (\e -> s e) to s. The reason is that doing so is unsound if s is bottom; then eta-reduction might turn a terminating program into a non-terminating one.

Then I'll commit it.

Simon

comment:11 Changed 16 months ago by dreixel

I don't think I ever wrote any of the deriving code apart from Generic and Typeable.

comment:12 Changed 16 months ago by twanvl

I wrote the original deriving code for Functor/Foldable/Traversable?. Someone else later cleaned it up a bit.

The strange code that is generated are lambdas applied to arguments, which is something that GHC should be able to optimize away easily. To avoid the eta-expansion, the generated code is always a function. So you end up with code like

data Cons a = Cons (a,a)
fmap f (Cons x) = (\x -> case x of (x1,x2) -> (f x1 , f x2)) x

I could add this example to a note, if you insist.

There are no library changes in the patch, and the derived code is functionally equivalent to what was there before.

comment:13 Changed 16 months ago by simonpj

Yes, I think it'd be very helpful to add a Note [Avoid eta-expanded code], explaining the problem, giving an example, and pointing to this ticket.

Thank you.

Simon

comment:14 Changed 16 months ago by twanvl

I added an explanation of the behaviour and a link to this ticket to the description at the top of the Functor deriving code. I did not make it a separate note, since there is no reference to it from inside the code. The patch that adds this note is attached.

comment:15 Changed 16 months ago by simonpj

Thanks. I'll apply.

comment:16 Changed 16 months ago by twanvl@…

commit 49ca2a37bef18aa57235ff1dbbf1cc0434979b1e

Author: Twan van Laarhoven <twanvl@gmail.com>
Date:   Fri Nov 23 15:03:45 2012 +0100

    Changed deriving of Functor, Foldable, Traversable to fix #7436. Added foldMap to derived Foldable instance.
    
    The derived instances will no longer eta-expand the function. I.e. instead of
        fmap f (Foo a) = Foo (fmap (\x -> f x) a)
    we now derive
        fmap f (Foo a) = Foo (fmap f a)
    
    Some superflous lambdas are generated as a result. For example
        data X a = X (a,a)
        fmap f (X x) = (\y -> case y of (a,b) -> (f a, f b)) x
    The optimizer should be able to simplify this code, as it is just beta reduction.
    
    The derived Foldable instance now includes foldMap in addition to foldr.

 compiler/prelude/PrelNames.lhs    |    9 ++-
 compiler/typecheck/TcGenDeriv.lhs |  178 ++++++++++++++++++++++---------------
 2 files changed, 114 insertions(+), 73 deletions(-)

comment:17 Changed 16 months ago by simonpj

  • Resolution set to fixed
  • Status changed from patch to closed
  • Test Case set to perf/should_runt/T7436

Also:

commit 3d51f271e6819c52508956f2426c4c19dec0b2fb
Author: Twan van Laarhoven <twanvl@gmail.com>
Date:   Thu Jan 3 16:24:42 2013 +0100

    Added note explaining the lambdas generated by functor deriving code, and how it compares to the old deriving code which used eta expansion.

Thanks for the patches!

I added a test in perf/should_run. Interestingly the massive difference is run-time, not in allocation or residency.

Simon

comment:18 Changed 16 months ago by liyang

  • Cc hackage.haskell.org@… added

Since this issue affects a large number of users and doesn't seem too contentious, could this fix be included in 7.6.2?

comment:19 Changed 16 months ago by simonpj

  • Status changed from closed to merge

I'd be happy to see this in 7.6.2 if there is still time to merge it, Ian.

comment:20 Changed 16 months ago by jwlato

  • Cc jwlato@… added

comment:21 Changed 13 months ago by igloo

  • Milestone set to 7.6.3

comment:22 Changed 12 months ago by simonpj

This one is fixed, just awaiting merging into 7.6.3

comment:23 Changed 7 months ago by carter

was this merged into 7.6?
else wise, should it be merged into 7.8/head?

comment:24 Changed 7 months ago by monoidal

  • Status changed from merge to closed

It has been fixed in HEAD long time ago, but was not merged into 7.6. Since 7.6.4 is not officially planned, I'm closing the ticket.

Note: See TracTickets for help on using tickets.