wiki:Commentary/Compiler/Parser

The Parser

[Very incomplete. Please extend as you learn more.]

The parser is written using

Principles

Making a parser parse precisely the right language is hard. So GHC's parser follows the following principle:

  • We often parse "over-generously", and filter out the bad cases later.

Here are some examples:

  • Patterns are parsed as expressions, and transformed from HsExpr.HsExp into HsPat.HsPat in RdrHsSyn.checkPattern. An expression like [x | x<-xs] that doesn't look like a pattern is rejected by checkPattern.
  • The context of a type is parsed as a type, and then converted into a context by RdrHsSyn.checkContext. For example, when parsing
    f :: (Read a, Num a) => a -> a
    
    the parser can only discover that (Read a, Num a) is a context, rather than a type, when it meets the =>. That requires infinite lookahead. So instead we parse (Read a, Num a) as a tuple type, and then convert it to a context when we see the =>.

Sometimes the over-generous parsing is only dealt with by the renamer. For example:

  • Infix operators are parsed as if they were all left-associative. The renamer uses the fixity declarations to re-associate the syntax tree.

There are plenty more examples. A good feature of this approach is that the error messages later in compilation tend to produce much more helpful error messages. Errors generated by the parser itself tend to say "Parse error on line X" and not much more.

The main point is this. If you are changing the parser, feel free to make it accept more programs than it does at the moment, provided you also add a later test that rejects the bad programs. Typically you need this flexibility if some new thing you want to add makes the pars ambiguous, and you need more context to disambiguate. Delicate hacking of the LR grammar is to be discouraged. It's very hard to maintain and debug.

Avoiding right-recursion

Be sure to read this section of the Happy manual for tips on avoiding right recursion. In GHC, the preferred method is using a left-recursive OrdList, as below:

foolist :: { OrdList Foo }
  : foolist ',' foo { $1 `appOL` unitOL $3 }
  | foolist ','     { $1 } -- optional, but recommended
  | foo             { unitOL $1 }

OrdList operationally works the same way as building a list in reverse (as in the Happy manual), but it makes it less likely you'll forget to call reverse when you need to get the final list out.

One interesting, non-obvious fact, is that if you *do* use a right-recursive parser, the "extra semi-colons" production should NOT be pluralized:

foolist :: { [Foo] }
  : foo ',' foolist { $1 : $3 }
  | foo ','         { [$1] } -- NOT foolist
  | foo             { [$1] }

Indentation

Probably the most complicated interaction between the lexer and parser is with regards to whitespace-sensitive layout. The most important thing to know is that the lexer understands layout, and will output virtual open/close curlies (productions vocurly and vccurly) as well as semicolons, which can then be used as part of productions in Parser.y. So for example, if you are writing a rule that will make use of indentation, you should accept both virtual and literal curlies:

body    :: { ([AddAnn]                                                             
             ,([LImportDecl RdrName], [LHsDecl RdrName])) }                      
        :  '{'            top '}'      { (moc $1:mcc $3:(fst $2)                   
                                         , snd $2) }                             
        |      vocurly    top close    { (fst $2, snd $2) 

Notice the use of close rather than vccurly: close is a production that accepts both vccurly and a Happy error; that is, if we encounter an error in parsing, we try exiting an indentation context and trying again. This ensures, for example, that the top-level context can be closed even if no virtual curly was output.

The top-level of a Haskell file does not automatically have a layout context; when there is no module keyword, a context is implicitly pushed using missing_module_keyword.

When writing grammars that accept semicolon-separated sequences, be sure to include a rule allowing for trailing semicolons (see the previous section), otherwise, you will reject layout.

Syntax extensions

Many syntactic features must be enabled with a LANGUAGE flag, since they could cause existing Haskell programs to stop compiling, as turn some identifiers into keywords. We primarily affect this change of behavior in the lexer, by turning on/off certain tokens. This is done using predicates, which let Alex turn token rules on and off depending on what extensions are enabled:

<0> {                                                                              
  "[|"        / { ifExtension thEnabled } { token ITopenExpQuote } 
}

To add a new syntax extension, add a constructor to ExtBits and set the bit appropriately in mkPState.

Last modified 5 months ago Last modified on Mar 5, 2015 2:12:08 AM