```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

testParser :: Text -> Either String Int
testParser f = fmap length . A.parseOnly (many (A.char 'b' <|> A.anyChar)) $ f

main :: IO ()
main = print . testParser $ T.replicate 50000 "a"
```
On GHC 7.6.3 with -O2:
```
real    0m0.062s
user    0m0.022s
sys     0m0.007s
```
On GHC 7.8 tip with -O2:
```
real    0m12.700s
user    0m12.504s
sys     0m0.165s
```
On GHC 7.6.3 with -O0:
```
real    0m0.077s
user    0m0.025s
sys     0m0.007s
```
On GHC 7.8 tip with -O0:
```
real    0m0.055s
user    0m0.026s
sys     0m0.007s
```
This seems to be related to the use of <|>; if I change the program so that the second branch (A.anyChar) is never taken, 7.8 behavior is roughly the same as 7.6 under any optimization level.
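For illustration, one such variant (my own sketch, not necessarily the exact change the reporter made) is to make the first alternative always succeed on the test input, so that the A.anyChar branch is never tried:

```haskell
-- Hypothetical variant of testParser: on an input consisting only of 'a's,
-- A.char 'a' always succeeds, so the A.anyChar branch is never taken.
testParser' :: Text -> Either String Int
testParser' f = fmap length . A.parseOnly (many (A.char 'a' <|> A.anyChar)) $ f
```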
Trac metadata

| Trac field | Value |
|---|---|
| Version | 7.8.1-rc1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | MacOS X |
| Architecture | |
Can you identify on which module -fno-full-laziness makes the difference? Is it just the module above, or do you need to set that flag for attoparsec itself?
I have not had any time to devote to this. I tried
```
ghc -O T8814.hs -ddump-simpl -o T8814
```
with and without -fno-full-laziness. Indeed I see the perf difference.
The Core from -ddump-simpl looks very different in the two cases. Inside Main.$wa you'll see a call to runSTRep, and the function to which runSTRep is applied differs markedly:

- Without full laziness, it consists of a call to newArray# followed by a couple of memcpy calls.
- With full laziness, it has a rather complicated local recursive function that allocates a LOT of memory.
I have no idea why. I think it must be to do with optimisations being done by RULES in the text library. If I add -ddump-rule-firings and grep for TEXT in the rule names, I get
```
-- With full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused

-- Without full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT append -> unfused
```
So there is clearly a difference. Should that difference have such a massive performance impact? Ask the author of the text library! Why does full laziness have this effect? Well, if you have (\x. map (f x) (map g ys)), say, full laziness may float out the map g ys, and then the map/map fusion won't happen.
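As a minimal sketch of that shape (illustrative names only, not code from this ticket or from the text library):

```haskell
-- Illustrative only: how let-floating (full laziness) can defeat a
-- hypothetical map/map fusion rule that matches 'map f (map g xs)'.

-- Here the two maps are adjacent, so such a rule could match:
fusible :: [Int] -> Int -> [Int]
fusible ys x = map (+ x) (map (* 2) ys)

-- Full laziness may float the inner map out of the \x abstraction,
-- because it does not mention x.  The work is now shared across calls,
-- but the two maps are no longer adjacent and the rule cannot fire:
floated :: [Int] -> Int -> [Int]
floated ys = let shared = map (* 2) ys
             in \x -> map (+ x) shared
```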
At this point I hope that someone else will take over debugging to find out more.
We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES. For example, if a rule optimises (f (g x)), and g gets inlined, the rule won't fire. Or, if
```
h x = f x + 1
```
then the expression h (g x) won't fire the rule unless h is first inlined.
The "phase" annotations on INLINE pragmas and RULES let you control this stuff. So the bug might not be in GHC; it might just be a missing phase annotation. Or there might be a bug in GHC. We won't know until someone digs further.
Apart from lack of time, the difficulty is that I have no clue how attoparsec's RULES are supposed to work. So I think we are stalled unless/until someone feels able to do some digging to isolate what is going on.
I still think this bug is in GHC. It shouldn't take 15GB of RAM to compile a program with a large chain of <|>.
Edit: just realized I never added this to the bug. Compiling the thyme package makes GHC consume all available RAM (15.96GB for me) and stall. I have to kill -9 it. Offending section is here: https://github.com/liyang/thyme/blob/master/src/Data/Thyme/Format.hs#L591-L693. I cloned the package, replaced the function body with undefined, and it compiled normally.
But this ticket is about the run-time of the particular program given in the description of this ticket.
There may well be another bug, to do with the compile-time of an entirely different program. If so, could you open a ticket for that, including a way to reproduce? (Having checked that there isn't one already.)
It's confusing to mix up two bugs into one ticket!
> We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES.
I assume you're referring to text's RULES, as attoparsec doesn't contain any (it does contain a lot of inlining, though).
The goal behind the RULES in text is to expose opportunities to perform stream fusion and, if the rewrite phases do not reveal any, to drop from the fusion style of programming back to thwacking directly on an array (which is typically a lot faster).
For example, here are the two main RULES that you spotted in the output above:
If we see an unadorned use of append in an early phase, we rewrite it so that a fusion rule can have a chance to transform it. Once we get closer to the end of our phases, if we find that a fusion-style append hasn't yet been gobbled up, we transform it back to the direct style.
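For reference, that pair looks roughly like this in Data.Text (paraphrased, not verbatim; see the text source for the exact definitions, where stream/unstream convert between Text and its fusion Stream representation and S.append is the stream-level append):

```haskell
-- Approximate form of the two append rules in Data.Text.
-- Early phases: rewrite a plain append into stream form so fusion gets
-- a chance to consume it.  Late phases: if nothing fused with it, put
-- it back into the direct, array-based form.
{-# RULES
"TEXT append -> fused"   [~1] forall t1 t2.
    append t1 t2 = unstream (S.append (stream t1) (stream t2))
"TEXT append -> unfused" [1]  forall t1 t2.
    unstream (S.append (stream t1) (stream t2)) = append t1 t2
  #-}
```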
There's a tangential wrinkle here: for the longest time, the definition of append had an INLINE annotation, which I just removed. I don't believe this is implicated in the slowdown.
Please let me know what else it would be helpful to explain. As you know, forests of rewrite rules are rather fragile affairs, so it's entirely possible that there's been a longstanding bug in those rules (and perhaps not in the compiler) that is merely now being exposed in 7.8.
What would be really helpful would be if you, or someone else, could diagnose exactly why the slow-down is happening. If you compile with -ticky you can very quickly zero in on the code that is taking more time, and compare one with the other.
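For example, something along these lines (a sketch; the exact flags may vary by GHC version, see the user's guide section on ticky-ticky profiling):

```
# Sketch: compile and link with -ticky, then ask the RTS for a ticky report
ghc -O2 -ticky -rtsopts T8814.hs -o T8814
./T8814 +RTS -rT8814.ticky
# compare the entry/allocation counters in T8814.ticky between the
# 7.6 and 7.8 builds to see which closures are doing the extra work
```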
It would be remarkable if fusion was really responsible for such an enormous change in time, but perhaps it is. Maybe it would be worth commenting out the RULES one at a time, to see which (if any) is responsible, and to reduce accidental differences between the two. Maybe the test case can be boiled down quite a lot further.
Some knobs worth trying:

- You can switch off ALL rule rewrites with -fno-enable-rewrite-rules.
- Switching off full laziness might also be a good thing to try: -fno-full-laziness.
- Reducing the inlining threshold (which doesn't affect INLINE pragmas) would mean less looking at inlined code: -funfolding-use-threshold=N (the default is 60).
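Concretely, one might rebuild the test case with each knob in turn and compare run times (a sketch, reusing the T8814.hs file name from the earlier comment):

```
# Sketch: try each knob separately, then time the resulting binaries
ghc -O2 -fforce-recomp -fno-enable-rewrite-rules   T8814.hs -o T8814-norules
ghc -O2 -fforce-recomp -fno-full-laziness          T8814.hs -o T8814-nofulllaz
ghc -O2 -fforce-recomp -funfolding-use-threshold=0 T8814.hs -o T8814-nounfold
```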
It's a bit fiddly and time-consuming, which is why I was appealing for help. I'll gladly explain anything you (or anyone else) want to know about the GHC end of things.