Opened 5 years ago

Closed 5 years ago

#9339 closed bug (fixed)

last is not a good consumer

Reported by: dfeuer Owned by:
Priority: normal Milestone:
Component: libraries/base Version: 7.8.3
Keywords: Cc: hvr, ekmett
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D86
Wiki Page:

Description

The profiler indicates that print $ last [(1::Int)..10^7] (compiled with -O2) allocates around 8*108 bytes. Using the Henning Thienemann-inspired

myLast = fromJust . foldr (\x -> Just . maybe x id) Nothing

(based on his code for viewR/unsnoc) reduces allocation by about half at the cost of some extra work. What we really want, I believe, is for last to fuse with the producer in a fashion that allows the Ints to be unboxed, eliminating all the allocation. I have no idea if this will fall out of the general fusion work planned for 7.9.

Change History (11)

comment:1 Changed 5 years ago by nomeata

What general fusion work are you referring to? I only know about the improvements for fusing foldl, which doesn’t seem to apply here.

Although maybe it could. How about this definition:

myLast2 :: [a] -> a
myLast2 = foldl (\_ x -> x) undefined

While last yields 800051648 bytes, and myLast yields 560084432 bytes, this runs in 51648 bytes (i.e. it fuses completely).

Admitted, the resulting code looks almost stupidly efficient (at least if I write 10^7 as 10000 – I’m unjustifiably surprised that that is not constant-folded):

Rec {
$wgo
$wgo =
  \ w_s3Aj ->
    case w_s3Aj of wild_Xf {
      __DEFAULT -> $wgo (+# wild_Xf 1);
      10000000 -> 10000000
    }
end Rec }

comment:2 in reply to:  1 Changed 5 years ago by dfeuer

Replying to nomeata:

What general fusion work are you referring to? I only know about the improvements for fusing foldl, which doesn’t seem to apply here.

I was under the impression (possibly bogus) that the foldl fix involved some new transformation that could have other effects.

Although maybe it could. How about this definition:

myLast2 :: [a] -> a
myLast2 = foldl (\_ x -> x) undefined

While last yields 800051648 bytes, and myLast yields 560084432 bytes, this runs in 51648 bytes (i.e. it fuses completely).

That looks wonderful. Is it certain to be changed to a foldl' in cases where it doesn't fuse?

Admitted, the resulting code looks almost stupidly efficient (at least if I write 10^7 as 10000 – I’m unjustifiably surprised that that is not constant-folded):

Imagine the confusion if it were! Yes, map f [m..n] !! p could be optimized to O(1) whenever m has a known-sane type, but then all the newbies and half the oldies would get mixed up when little changes had radical performance impacts.

comment:3 Changed 5 years ago by nomeata

That looks wonderful. Is it certain to be changed to a foldl' in cases where it doesn't fuse?

Hopefully not. foldl' would force the accumulator, which we do *not* want here (otherwise the undefined would be forced, or last [undefined, 1] would not work).

I didn’t do further testing with that idea, it just crossed my mind. It maybe the that this implementation is only good when fusing works – would you mind trying to find out?

In that case one would have to do the replace, try to fuse, replace back trick (which might be tricky with foldl itself getting inlined).

Or maybe it is possible to use foldl and wrap the argument in a Id, so that forcing does the right thing.

comment:4 in reply to:  3 Changed 5 years ago by dfeuer

Replying to nomeata:

That looks wonderful. Is it certain to be changed to a foldl' in cases where it doesn't fuse?

Hopefully not. foldl' would force the accumulator, which we do *not* want here (otherwise the undefined would be forced, or last [undefined, 1] would not work).

Yes, you're right. I got mixed up a bit.

I didn’t do further testing with that idea, it just crossed my mind. It maybe the that this implementation is only good when fusing works – would you mind trying to find out?

I don't have anything beyond 7.8.3, and on 7.8.3 your code doesn't fuse. Could you maybe try it?

comment:5 Changed 5 years ago by nomeata

Still not a full evaluation, but some more factoids:

With

myLast2 = foldl (\_ x -> x) undefined

I get

myLast2 :: forall a. [a] -> a
myLast2 = \ (@ a) -> foldl (myLast1) (undefined)

while when used (in the same model, non-fusing), this gets turned into the nice

Rec {
main_go :: [Int] -> Int -> Int
main_go =
  \ (ds :: [Int]) (eta :: Int) ->
    case ds of _ {
      [] -> eta;
      : y ys -> main_go ys y
    }
end Rec }

Writing

myLast2 = inline foldl (\_ x -> x) undefined

also gives

Rec {
myLast1 :: forall a. [a] -> a -> a
myLast1 =
  \ (@ a) (ds :: [a]) (eta :: a) ->
    case ds of _ {
      [] -> eta;
      : y ys -> myLast1 ys y
    }
end Rec }

myLast2 :: forall a. [a] -> a
myLast2 = \ (@ a) (xs :: [a]) -> myLast1 xs (undefined)

So it looks good even when not fused. (Measurements are yet pending.) for the exported module, but when used, it produce

comment:6 Changed 5 years ago by nomeata

Ok, let’s do this systematically.

My benchmarks: One possibly fusing invocation of last:

main = print $ Last.last $ filter odd $ [1::Int ..100000000]

and one non-fusing

f = id
{-# NOINLINE f #-}
main = print $ Last.last $ f $ filter odd $ [1::Int ..100000000]

I am comparing the existing implementation of last, which is

last []                 =  errorEmptyList "last"
last (x:xs)             =  last' x xs
  where last' y []     = y
        last' _ (y:ys) = last' y ys

with the simpler

last = foldl (\_ x -> x) (errorEmptyList "last")

Just for fun (and because GHC HEAD still compiles), here the numbers with GHC-7.6:

LastTestFusing.hs
<<ghc: 3200051208 bytes, 6104 GCs, 36460/44416 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.29 MUT (1.29 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
LastTestNonFusing.hs
<<ghc: 3200051224 bytes, 6104 GCs, 36460/44416 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.31 MUT (1.31 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
NewLastTestFusing.hs
<<ghc: 3200051224 bytes, 6104 GCs, 36460/44416 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.34 MUT (1.34 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
NewLastTestNonFusing.hs
<<ghc: 3200051240 bytes, 6104 GCs, 36460/44416 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.31 MUT (1.31 elapsed), 0.01 GC (0.01 elapsed) :ghc>>

No significant difference in either of these.

Now the interesting numbers are those with ghc HEAD, and for that I have to wait for the compilation to finish...

So here they are:

LastTestFusing.hs
<<ghc: 3200050800 bytes, 6104 GCs, 36372/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.27 MUT (1.27 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
LastTestNonFusing.hs
<<ghc: 3200050816 bytes, 6104 GCs, 36372/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.26 MUT (1.26 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
NewLastTestFusing.hs
<<ghc: 800050800 bytes, 1526 GCs, 36356/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.00 MUT (1.00 elapsed), 0.00 GC (0.00 elapsed) :ghc>>
NewLastTestNonFusing.hs
<<ghc: 3200050832 bytes, 6104 GCs, 36372/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.26 MUT (1.26 elapsed), 0.01 GC (0.01 elapsed) :ghc>>

We see that the new implementation performs the same in the non-fusing case, but much better in the fusing case.

Why does it even allocate in the fusing case? Because it needs to box them for the recursive call:

Rec {
main_go :: Int# -> Int -> Int
main_go =
  \ (x :: Int#) (eta :: Int) ->
    case remInt# x 2 of _ {
      __DEFAULT ->
        case x of wild {
          __DEFAULT -> main_go (+# wild 1) (I# wild);
          100000000 -> lvl
        };
      0 ->
        case x of wild {
          __DEFAULT -> main_go (+# wild 1) eta;
          100000000 -> eta
        }
    }
end Rec }

Guess that’s unavoidable here.

Also, to make sure that some integer unboxing doesn’t skew the comparison, here the numbers with Integer:

<<ghc: 5600050800 bytes, 10703 GCs, 36380/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.77 MUT (2.77 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
LastTestNonFusing.hs
<<ghc: 5600050816 bytes, 10703 GCs, 36380/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.77 MUT (2.77 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
NewLastTestFusing.hs
<<ghc: 3200050816 bytes, 6104 GCs, 36380/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.45 MUT (2.45 elapsed), 0.01 GC (0.01 elapsed) :ghc>>
NewLastTestNonFusing.hs
<<ghc: 5600050848 bytes, 10703 GCs, 36388/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.93 MUT (2.93 elapsed), 0.02 GC (0.02 elapsed) :ghc>>

This only confirms the previous findings.

TL;DR: Implementing last with foldl has no negative consequences, but works nicely when fused. I don’t expect that to happen often, but it’s still nice to have, and actually simplifies the existing library implementation. If noone complains, I’ll push this soon.

comment:7 Changed 5 years ago by nomeata

Status: newpatch

I wanted to push this as a phabricator review, but arc is currently broken for me. Hence I’ll parked it at wip/T9339. Marking as “patch” so that this is not forgotten.

comment:8 Changed 5 years ago by nomeata

Differential Rev(s): D86

comment:9 Changed 5 years ago by nomeata

Differential Rev(s): D86Phab:D86

comment:10 Changed 5 years ago by Joachim Breitner <mail@…>

In b709f0a047dc036de15dc43d3b0ab88d3e32c5e6/ghc:

Make last a good consumer

Summary:
Make last a good consumer simply by implementing it as foldl. This fixes Trac: #9339.
Thanks to David Feuer for bringing it up.

Test Plan: perf/should_run/T9339 + general validation

Reviewers: austin

Reviewed By: austin

Subscribers: phaskell, simonmar, relrod, carter

Differential Revision: https://phabricator.haskell.org/D86

Trac Issues: #9339

comment:11 Changed 5 years ago by nomeata

Resolution: fixed
Status: patchclosed
Note: See TracTickets for help on using tickets.