Opened 8 years ago
Last modified 7 months ago
#1544 new bug
Derived Read instances for recursive datatypes with infix constructors are too inefficient
Reported by: | jcpetruzza@… | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.12.1 |
Component: | Compiler | Version: | 6.6.1 |
Keywords: | Cc: | koen@…, doaitse@…, malcolm@…, emlempsi@… | |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Other | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Revisions: |
Description (last modified by simonpj)
Consider this definition:
data Exp = C | Exp :+: Exp | Exp :-: Exp deriving ( Read, Show )
Now, try something like:
> read "((((((((((C))))))))))" :: Exp
Even this simple expression may take several seconds to parse. It gets worse if you keep adding parenthesis. And even worse if you add more infix constructors....
Attachments (3)
Change History (34)
comment:1 Changed 8 years ago by jcpetruzza@…
comment:2 Changed 8 years ago by igloo
- Milestone set to 6.8
Changed 8 years ago by simonpj
comment:3 Changed 8 years ago by simonpj
- Description modified (diff)
Here is the story as far as I can work it out. I don't have a solution yet, but this records my diagnosis. I've asked Koen Claessen (who originated the clever parsing technology) for help. Here is his paper which describes it: http://www.cs.chalmers.se/~koen/pubs/entry-jfp04-parser.html
For a data type like
data Exp = C | Exp :+: Exp
we get a parser like this:
readPrec = parens ((do Ident "C" <- lexP return C) +++ (prec 9 $ (do a <- step readPrec Symbol ":+:" <- lexP b <- step readPrec return (a :+: b )) ))
The trouble comes because when we see a paren we don't know whether it encloses the whole expression or just part
(C :+: C :+: C)
or
(C :+: C) :+: C
So the a<-readPrec line expands to a new unfolding of readPrec. We get two parsers working in parallel, one expecting the closing paren at the end,
(t1)
where t1 is a term, and one expecting a term of form
(t1) :+: t2
(There are two more parsers, each expecting a plain C, but I'll ignore those. Also in case you are wondering, the precedence mechanism cuts off the left-recursion problem.)
So far so good. But when we consume a paren, each of these two parsers generate two new parsers in exactly the same way. So after consuming one paren we have FOUR parsers expecting terms of form
((t1)) ((t1) :+: t2) ((t1)) :+: t2 ((t1) :+: t2) :+: t3
This is clearly unacceptable, because after n parens we get 2^n parsers. At least that is my theory, and it certainly would explain what is going on.
I have now forgotten why I originally adopted Koen's parallel parser idea! I'm sure it's a good one. But this does seem like a serious problem. Remember that we must generate a parser for data type T without looking at the parsers for any other data types.
Furthermore, although I said I'd ignore it, we do get two (or more) parsers each independently parsing the same lexeme.
(do { Ident "C" <- lexP; ...} +++ do { Ident "C" <- lexP; ...})
They get expanded out, and then dynamically recombined by the parser machinery, but it seems like a lot of wasted work is done. But I don’t think that's what's giving the problem here.
I attach a complete program ParseInfix.hs that demonstrates the problem, with a hand-written read instance so that you don't have to guess what deriving is doing. Just compile and run.
Simon
comment:4 Changed 8 years ago by simonpj
- Description modified (diff)
comment:5 Changed 8 years ago by jcpetruzza@…
Just in case it helps, I'm attaching the Read instance I defined to workaround this problem. The idea is to transform this grammar
Exp :== C | Exp :+: Exp
into
Exp :== C (:+: Exp)*
The resulting code is a bit longer, but seems to work fine and doesn't depend on any other types
comment:6 Changed 8 years ago by simonpj
From Koen:
It was as I feared: Even the backtracking parser from the Haskell98 report has the exponential behavior you describe!
I took the code from Figure 8 here: http://haskell.org/onlinereport/derived.html
And copied the relevant bits into a file:
infixr 5 :^: data Tree a = Leaf a | Tree a :^: Tree a deriving ( Eq, Ord, Show ) instance (Read a) => Read (Tree a) where readsPrec d r = readParen (d > up_prec) (\r -> [(u:^:v,w) | (u,s) <- readsPrec (up_prec+1) r, (":^:",t) <- lex s, (v,w) <- readsPrec (up_prec+1) t]) r ++ readParen (d > app_prec) (\r -> [(Leaf m,t) | ("Leaf",s) <- lex r, (m,t) <- readsPrec (app_prec+1) s]) r up_prec = 5 :: Int -- Precedence of :^: app_prec = 10 :: Int -- Application has precedence one more than -- the most tightly-binding operator
And then added the following:
main = print (read s :: Tree Int) where s = "(((((((((((((((((((((((((((((((((Leaf 3" ++ ")))))))))))))))))))))))))))))))))"
Neither Hugs nor GHC can evaluate this expression in reasonable time. And (probably) neither would the old GHC without my ReadP/ReadPrec stuff.
Conclusion: We need to be smarter when generating parsers for datatypes with infix operators, independent of the underlying parsing technology. In order to make an efficient parser for a grammar with binary operators, one must massage the grammar to remove left-recursion from the grammar.
comment:7 Changed 8 years ago by simonpj
- Cc koen@… added
comment:8 Changed 8 years ago by simonpj
Koen writes: Take the following example datatype.
data T = T :+: T | A
Generating a naive recursive parser for this leads to the bad behavior.
However, for such simple recursive datatypes, we kind of know what parsers we should generate. For example, forthe above datatype we simply generate a parser that first parses a simple T, and then checks if there is an occurrence of :+:, in which case it tries to parse another T. Nice and linear.
However, we can never do this modularly. Here is why. Imagine a module with the following datatype:
data T a b = a :+: b deriving ( Read )
The only thing we can do here is to generate the naive parser.
Imagine now another module that imports the above module, with a datatype declaration that looks like:
data A = (T A A) :*: A | C deriving ( Read )
Again, the only thing we can do here is to generate a naive parser, that uses the parser we generated for T. (We do not even know that the above datatype is recursive without looking at the definition of T.)
Now, the resulting parser will again have the bad behavior: Any expression starting with many parentheses will lead to exponential behavior, because, for each parenthesis, we do not know if it belongs to something of type A or something of type T A A until we have parsed the whole expression.
We can do something about recursive datatypes where the left argument of the operators has exactly the same type as the whole datatype. This works for:
data SnocList a = Nil | SnocList a :- a
for example, but not for:
data T a = T [a] :+: T [a] | Leaf a
I guess it is worth implementing this special case anyway, although it will not apply to all cases.
Here is the idea.
First a simple case. For the following Haskell code:
infix 5 + infix 6 * data A = A + A | A * A | C B
We generate the following grammar:
A(n) ::= A(6) "+" A(6) (n<=5) | A(7) "*" A(7) (n<=6) | "C" B(0) | "(" A(0) ")"
Right now, you simply turn the above into a parser directly. However, we are going to "massage" the grammar a bit. First, we explicitly split up the grammar into parts, depending on precedence. For this, we need to sort and groups the operators according to precedences:
A(0..5) ::= A(6) "+" A(6) | A(6) A(6) ::= A(7) "*" A(7) | A(7) A(7..10) ::= "C" B(0) | "(" A(0) ")"
Then, we see that we have overlap in the first two grammar parts. We get rid of it by using optional [ ]'s:
A(0..5) ::= A(6) ["+" A(6)] A(6) ::= A(7) ["*" A(7)] A(7..10) ::= "C" B(0) | "(" A(0) ")"
This can be turned into efficient Haskell code directly.
Now a more complicated case. For the following Haskell code:
infix 5 + infix 6 * data A = A + A | B * A -- this operator is not left-recursive | C B
We generate the following grammar:
A(n) ::= A(6) "+" A(6) (n<=5) | B(7) "*" A(7) (n<=6) | "C" B(0) | "(" A(0) ")"
Right now, you simply turn the above into a parser directly. Again, we are going to "massage" the grammar. First, we explicitly split up the grammar into parts, depending on precedence.
A(0..5) ::= A(6) "+" A(6) | A(6) A(6) ::= B(7) "*" A(7) | A(7) A(7..10) ::= "C" B(0) | "(" A(0) ")"
Unfortunately, there is no (explicit) overlap in the cases for A(6). We know that there probably exists overlap (both grammars will accept any number of parentheses), but this is not clear, since we do not have B's grammar.
We can get rid of some overlap by using optional [ ]'s:
A(0..5) ::= A(6) ["+" A(6)] A(6) ::= B(7) "*" A(7) | A(7) A(7..10) ::= "C" B(0) | "(" A(0) ")"
This can be turned into Haskell code directly:
readP n | n <= 5 = do x <- readP 6 return x +++ do "+" <- lex y <- readP 6 return (x+y) | n <= 6 = do x <- readP 7 -- :: B "*" <- lex y <- readP 7 return (x*y) +++ do readP 7 -- :: A | otherwise = do "C" <- lex x <- readP 0 return (C x) +++ do "(" <- lex x <- readP 0 ")" <- lex return x
However, this code will be inefficient when parsing A(6)'s that start with lots of parentheses.
comment:9 Changed 8 years ago by simonpj
- Cc doaitse@… added
- Milestone changed from 6.8 branch to 6.10 branch
It's clear we aren't going to get this fixed in 6.8. There's actually an interesting research question here, about how to build parsers that are both modular and efficient.
Please would someone like to help with this?
(I've added Doaitse to the cc list; hope that's ok.)
Simon
comment:10 Changed 8 years ago by simonpj
Doaitse writes "It is still in my mind, but I am not yet convinced that things are easily solvable. There might even be problems with a precise definition of the problem. I will try to cook up some examples with questions what one should like to see happen."
comment:11 Changed 8 years ago by malcolm.wallace@…
- Cc malcolm@… added
comment:12 Changed 7 years ago by eelco
- Cc emlempsi@… added
comment:13 Changed 7 years ago by simonpj
Great news! Doaitse writes "I have found a student with whom I will embark on the endeavour to construct a solution to this (I have some hopes); either we will have to show a solution, or we will have to prove that this is "impossible". I hope to be able to use the kind of transformation techniques we have developed in the attached paper".
Simon
comment:14 Changed 7 years ago by simonpj
See The internals for a better Deriving Read and Write which describes progress by Doaitse and his colleagues.
Simon
comment:15 Changed 7 years ago by simonmar
- Architecture changed from Unknown to Unknown/Multiple
comment:16 Changed 7 years ago by simonmar
- Operating System changed from Unknown to Unknown/Multiple
comment:17 Changed 6 years ago by igloo
- Milestone changed from 6.10 branch to 6.12.1
Does anyone know what the status of this is?
comment:18 Changed 6 years ago by simonpj
Doaitse, Marcos, and Eelco tackle precisely this issue:
@inproceedings{1411296, author = {Marcos Viera and S. Doaitse Swierstra and Eelco Lempsink}, title = {Haskell, do you read me?: constructing and composing efficient top-down parsers at runtime}, booktitle = {Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell}, year = {2008}, isbn = {978-1-60558-064-7}, pages = {63--74}, location = {Victoria, BC, Canada}, doi = {http://doi.acm.org/10.1145/1411286.1411296}, publisher = {ACM}, address = {New York, NY, USA}, }
I'm not sure whether they regard their solution as suitable to directly incorporate in (say) GHC, but it's certainly a substantial contribution to this ticket!
Simon
comment:19 Changed 6 years ago by Doaitse
We consider this problem basically solved by the ChristmasTree library we uploaded to Hackage, with its associated packages.
See:
http://hackage.haskell.org/packages/archive/ChristmasTree/0.1/doc/html/Text-GRead.html
I am currently doubtful whether this should be incorporated in the GHC as it stands now. We think it is better to find some other uses of the TTTAS library before we decide how to proceed. Thus far we got no questions about these packages, which implies either that they solve the problem or that they are not used at all. In neither case this provides information to us on how to proceed. At least the severity could be lowered to "minor"?
Doaitse
comment:20 Changed 6 years ago by igloo
- Milestone changed from 6.12.1 to 6.12 branch
- Type of failure set to None/Unknown
comment:21 Changed 5 years ago by igloo
- Milestone changed from 6.12 branch to 6.12.3
comment:22 Changed 5 years ago by igloo
- Milestone changed from 6.12.3 to 6.14.1
- Priority changed from normal to low
comment:23 Changed 5 years ago by igloo
- Milestone changed from 7.0.1 to 7.0.2
comment:24 Changed 4 years ago by igloo
- Milestone changed from 7.0.2 to 7.2.1
comment:25 Changed 4 years ago by igloo
- Milestone changed from 7.2.1 to 7.4.1
comment:26 Changed 3 years ago by igloo
- Milestone changed from 7.4.1 to 7.6.1
- Priority changed from low to normal
comment:27 Changed 3 years ago by igloo
- Milestone changed from 7.6.1 to 7.6.2
comment:28 Changed 3 years ago by morabbin
Any progress on adapting Doaitse's work for inclusion?
comment:29 Changed 3 years ago by Doaitse
- Type of failure changed from None/Unknown to Other
When checking I noticed that the ChristmasTree package had problems compiling with the newer versions of GHC. I updated the package by adding some type annotations, and upgrading the cabal file.
The fact that we did not get complaints thus far probably indicates that the problem is not perceived as a severe one by most users. I have no idea how many actually rely on this package or this technology.
It would be nice if hackage would inform maintains about packages becoming obsolete due to upgrades of GHC or of packages it depends on, but that is another issue.
comment:30 Changed 13 months ago by thoughtpolice
- Milestone changed from 7.6.2 to 7.10.1
Moving to 7.10.1.
comment:31 Changed 7 months ago by thoughtpolice
- Milestone changed from 7.10.1 to 7.12.1
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
I meant to write