```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

testParser :: Text -> Either String Int
testParser f = fmap length . A.parseOnly (many (A.char 'b' <|> A.anyChar)) $ f

main :: IO ()
main = print . testParser $ T.replicate 50000 "a"
```
On GHC 7.6.3 with -O2:
```
real    0m0.062s
user    0m0.022s
sys     0m0.007s
```
On GHC 7.8 tip with -O2:
```
real    0m12.700s
user    0m12.504s
sys     0m0.165s
```
On GHC 7.6.3 with -O0:
```
real    0m0.077s
user    0m0.025s
sys     0m0.007s
```
On GHC 7.8 tip with -O0:
```
real    0m0.055s
user    0m0.026s
sys     0m0.007s
```
This seems to be related to the use of <|>; if I change the program so that the second branch (A.anyChar) is never taken, 7.8 behavior is roughly the same as 7.6 under any optimization level.
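For illustration, one such variant (my own sketch, not necessarily the exact change the reporter made) is to make the first alternative always succeed on the test input, so that the A.anyChar branch is never tried:

```haskell
-- Hypothetical variant of testParser: on an input consisting only of 'a's,
-- A.char 'a' always succeeds, so the A.anyChar branch is never taken.
testParser' :: Text -> Either String Int
testParser' f = fmap length . A.parseOnly (many (A.char 'a' <|> A.anyChar)) $ f
```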
Trac metadata

| Trac field | Value |
|---|---|
| Version | 7.8.1-rc1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | MacOS X |
| Architecture | |
Can you identify on which module -fno-full-laziness makes the difference? Is it just the module above, or do you need to set that flag for attoparsec itself?
I have not had any time to devote to this. I tried
```
ghc -O T8814.hs -ddump-simpl -o T8814
```
with and without -fno-full-laziness. Indeed I see the perf difference.
The Core from -ddump-simpl looks very different in the two cases. Inside Main.$wa you'll see a call to runSTRep, and the function to which runSTRep is applied differs markedly:

- Without full laziness, it consists of a call to newArray# followed by a couple of memcpy calls.
- With full laziness, it has a rather complicated local recursive function that allocates a LOT of memory.
I have no idea why. I think it must be to do with optimisations being done by RULES in the text library. If I add -ddump-rule-firings and grep for TEXT in the rule names, I get
```
-- With full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused

-- Without full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT append -> unfused
```
So there is clearly a difference. Should that difference have such a massive performance impact? Ask the author of the text library! Why does full laziness have this effect? Well, if you have (\x. map (f x) (map g ys)), say, full laziness may float out the map g ys, and then the map/map fusion won't happen.
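As a minimal sketch of that shape (illustrative names only, not code from this ticket or from the text library):

```haskell
-- Illustrative only: how let-floating (full laziness) can defeat a
-- hypothetical map/map fusion rule that matches 'map f (map g xs)'.

-- Here the two maps are adjacent, so such a rule could match:
fusible :: [Int] -> Int -> [Int]
fusible ys x = map (+ x) (map (* 2) ys)

-- Full laziness may float the inner map out of the \x abstraction,
-- because it does not mention x.  The work is now shared across calls,
-- but the two maps are no longer adjacent and the rule cannot fire:
floated :: [Int] -> Int -> [Int]
floated ys = let shared = map (* 2) ys
             in \x -> map (+ x) shared
```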
At this point I hope that someone else will take over debugging to find out more.
We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES. For example, if a rule optimises (f (g x)), and g gets inlined, the rule won't fire. Or, if
```
h x = f x + 1
```
then the expression h (g x) won't fire the rule unless h is first inlined.
The "phase" annotations on INLINE pragmas and RULES let you control this stuff. So the bug might not be in GHC; it might just be a missing phase annotation. Or there might be a bug in GHC. We won't know until someone digs further.
Apart from lack of time, the difficulty is that I have no clue how attoparsec's RULES are supposed to work. So I think we are stalled unless/until someone feels able to do some digging to isolate what is going on.
I still think this bug is in GHC. It shouldn't take 15GB of RAM to compile a program with a large chain of <|>.
Edit: just realized I never added this to the bug. Compiling the thyme package makes GHC consume all available RAM (15.96GB for me) and stall. I have to kill -9 it. Offending section is here: https://github.com/liyang/thyme/blob/master/src/Data/Thyme/Format.hs#L591-L693. I cloned the package, replaced the function body with undefined, and it compiled normally.
But this ticket is about the run-time of the particular program given in the description of this ticket.
There may well be another bug, to do with the compile-time of an entirely different program. If so, could you open a ticket for that, including a way to reproduce? (Having checked that there isn't one already.)
It's confusing to mix up two bugs into one ticket!
> We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES.
I assume you're referring to text's RULES, as attoparsec doesn't contain any (it does contain a lot of inlining, though).
The goal behind the RULES in text is to expose opportunities to perform stream fusion and, if the rewrite phases do not reveal any, to drop from the fusion style of programming back to thwacking directly on an array (which is typically a lot faster).
For example, here are the two main RULES that you spotted in the output above:
If we see an unadorned use of append in an early phase, we rewrite it so that a fusion rule can have a chance to transform it. Once we get closer to the end of our phases, if we find that a fusion-style append hasn't yet been gobbled up, we transform it back to the direct style.
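For reference, that pair looks roughly like this in Data.Text (paraphrased, not verbatim; see the text source for the exact definitions, where stream/unstream convert between Text and its fusion Stream representation and S.append is the stream-level append):

```haskell
-- Approximate form of the two append rules in Data.Text.
-- Early phases: rewrite a plain append into stream form so fusion gets
-- a chance to consume it.  Late phases: if nothing fused with it, put
-- it back into the direct, array-based form.
{-# RULES
"TEXT append -> fused"   [~1] forall t1 t2.
    append t1 t2 = unstream (S.append (stream t1) (stream t2))
"TEXT append -> unfused" [1]  forall t1 t2.
    unstream (S.append (stream t1) (stream t2)) = append t1 t2
  #-}
```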
There's a tangential wrinkle here: for the longest time, the definition of append had an INLINE annotation, which I just removed. I don't believe this is implicated in the slowdown.
Please let me know what else it would be helpful to explain. As you know, forests of rewrite rules are rather fragile affairs, so it's entirely possible that there's been a longstanding bug in those rules (and perhaps not in the compiler) that is merely now being exposed in 7.8.
What would be really helpful would be if you, or someone else, could diagnose exactly why the slow-down is happening. If you compile with -ticky you can very quickly zero in on the code that is taking more time, and compare one with the other.
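For example, something along these lines (a sketch; the exact flags may vary by GHC version, see the user's guide section on ticky-ticky profiling):

```
# Sketch: compile and link with -ticky, then ask the RTS for a ticky report
ghc -O2 -ticky -rtsopts T8814.hs -o T8814
./T8814 +RTS -rT8814.ticky
# compare the entry/allocation counters in T8814.ticky between the
# 7.6 and 7.8 builds to see which closures are doing the extra work
```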
It would be remarkable if fusion was really responsible for such an enormous change in time, but perhaps it is. Maybe it would be worth commenting out the RULES one at a time, to see which (if any) is responsible, and to reduce accidental differences between the two. Maybe the test case can be boiled down quite a lot further.
Some knobs worth trying:

- You can switch off ALL rule rewrites with -fno-enable-rewrite-rules.
- Switching off full laziness might also be a good thing to try: -fno-full-laziness.
- Reducing the inlining threshold (which doesn't affect INLINE pragmas) would mean less looking at inlined code: -funfolding-use-threshold=N (the default is 60).
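Concretely, one might rebuild the test case with each knob in turn and compare run times (a sketch, reusing the T8814.hs file name from the earlier comment):

```
# Sketch: try each knob separately, then time the resulting binaries
ghc -O2 -fforce-recomp -fno-enable-rewrite-rules   T8814.hs -o T8814-norules
ghc -O2 -fforce-recomp -fno-full-laziness          T8814.hs -o T8814-nofulllaz
ghc -O2 -fforce-recomp -funfolding-use-threshold=0 T8814.hs -o T8814-nounfold
```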
It's a bit fiddly and time-consuming, which is why I was appealing for help. I'll gladly explain anything you (or anyone else) want to know about the GHC end of things.