Changes between Initial Version and Version 3 of Ticket #1544


Timestamp:
Aug 20, 2007 9:30:21 PM (7 years ago)
Author:
simonpj
Comment:

Here is the story as far as I can work it out. I don't have a solution yet, but this records my diagnosis. I've asked Koen Claessen (who originated the clever parsing technology) for help. Here is his paper which describes it: http://www.cs.chalmers.se/~koen/pubs/entry-jfp04-parser.html

For a data type like

data Exp = C | Exp :+: Exp

we get a parser like this:

   readPrec = parens
     (    (do Ident "C" <- lexP
              return C)
      +++ (prec 9 $
           do a <- step readPrec
              Symbol ":+:" <- lexP
              b <- step readPrec
              return (a :+: b))
     )
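
For concreteness, here is a minimal self-contained module with a hand-written instance of roughly this shape (a sketch: it may differ in detail from what deriving actually emits, but it is enough to load into GHCi and play with):

   import Text.Read (Read (readPrec), Lexeme (Ident, Symbol), lexP, parens)
   import Text.ParserCombinators.ReadPrec (prec, step, (+++))

   data Exp = C | Exp :+: Exp
     deriving Show

   instance Read Exp where
     readPrec = parens
       (    (do Ident "C" <- lexP
                return C)
        +++ (prec 9 $
             do a <- step readPrec
                Symbol ":+:" <- lexP
                b <- step readPrec
                return (a :+: b))
       )

With this in scope, read "((((((((((C))))))))))" :: Exp should show the slowdown on an affected compiler.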

The trouble comes because, when we see a paren, we don't know whether it encloses the whole expression or just part of it:

	(C :+: C :+: C)

or

	(C :+: C) :+: C

So the a <- step readPrec line expands to a new unfolding of readPrec. We get two parsers working in parallel: one expecting the closing paren at the end,

	(t1)

where t1 is a term, and one expecting a term of the form

	(t1) :+: t2

(There are two more parsers, each expecting a plain C, but I'll ignore those. Also, in case you are wondering, the precedence mechanism cuts off the left-recursion problem.)
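
The forking behaviour comes from the way parens itself is built on the parallel-choice combinator +++. Roughly, paraphrasing the definition in GHC.Read (renamed parensSketch here to avoid clashing with the real parens; details may differ):

   import Text.ParserCombinators.ReadPrec (ReadPrec, (+++), reset)
   import Text.Read (Lexeme (Punc), lexP)

   -- Either parse p directly, or consume a '(', recurse, and then demand the
   -- matching ')'.  Both alternatives are kept alive in parallel by +++, so
   -- every open paren offers each live parser two ways to proceed.
   parensSketch :: ReadPrec a -> ReadPrec a
   parensSketch p = optionalP
     where
       optionalP  = p +++ mandatoryP
       mandatoryP = do Punc "(" <- lexP
                       x <- reset optionalP
                       Punc ")" <- lexP
                       return x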

So far so good. But when we consume a paren, each of these two parsers generates two new parsers in exactly the same way. So after consuming one paren we have FOUR parsers, expecting terms of the form

	((t1))
	((t1) :+: t2)
	((t1)) :+: t2
	((t1) :+: t2) :+: t3

This is clearly unacceptable, because after n parens we get 2^n parsers. At least that is my theory, and it certainly would explain what is going on.
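
A small harness along these lines makes the growth easy to observe (a sketch, separate from the attached ParseInfix.hs, assuming the two-constructor Exp and its derived instance from above; names like timeOne are made up):

   import Control.Exception (evaluate)
   import System.CPUTime (getCPUTime)

   data Exp = C | Exp :+: Exp
     deriving (Read, Show)

   -- Parse "(((...C...)))" at increasing depths and report the CPU time for
   -- each parse; if the 2^n theory is right, each extra pair of parens should
   -- roughly double the time on an affected compiler.
   main :: IO ()
   main = mapM_ timeOne [1 .. 12]
     where
       timeOne n = do
         let s = replicate n '(' ++ "C" ++ replicate n ')'
         t0 <- getCPUTime                -- picoseconds
         _  <- evaluate (read s :: Exp)  -- force the parse
         t1 <- getCPUTime
         putStrLn (show n ++ " parens: "
                   ++ show (fromIntegral (t1 - t0) / 1e12 :: Double) ++ " s")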

I have now forgotten why I originally adopted Koen's parallel parser idea! I'm sure it's a good one. But this does seem like a serious problem. Remember that we must generate a parser for data type T without looking at the parsers for any other data types.

Furthermore, although I said I'd ignore it, we do get two (or more) parsers each independently parsing the same lexeme.

	(do { Ident "C" <- lexP; ...} +++ do { Ident "C" <- lexP; ...})

They get expanded out, and then dynamically recombined by the parser machinery, but it seems like a lot of wasted work is done. But I don't think that's what's causing the problem here.
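
The duplicated lexing can be seen in isolation with a toy parser (demo is just a made-up name) run through readPrec_to_S:

   import Text.ParserCombinators.ReadPrec (ReadPrec, (+++), readPrec_to_S)
   import Text.Read (Lexeme (Ident), lexP)

   -- Two branches that each independently lex the same "C" token.
   demo :: ReadPrec String
   demo = (do { Ident "C" <- lexP; return "left branch" })
      +++ (do { Ident "C" <- lexP; return "right branch" })

   -- readPrec_to_S demo 0 "C" yields a result from each branch, e.g.
   -- [("left branch",""),("right branch","")]: the same lexeme is consumed
   -- twice, once per branch, before the results are merged.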

I attach a complete program, ParseInfix.hs, that demonstrates the problem, with a hand-written Read instance so that you don't have to guess what deriving is doing. Just compile and run.

Simon

  • Ticket #1544

    • Property Milestone set to 6.8
  • Ticket #1544 – Description

    Between the initial version and v3 the description changed only in markup: the blank lines around the code were replaced with {{{ }}} code-block markers. The v3 text reads:

    Consider this definition:
    {{{
    data Exp = C | Exp :+: Exp | Exp :-: Exp deriving ( Read, Show )
    }}}
    Now, try something like:
    {{{
    > read "((((((((((C))))))))))" :: T
    }}}
    Even this simple expression may take several seconds to parse. It gets worse if you keep adding parentheses. And even worse if you add more infix constructors....