wiki:Commentary/Compiler/DataTypes

Version 4 (modified by adamgundry, 10 months ago) (diff)

--

The GHC Commentary: Data types and data constructors

This chapter was thoroughly changed Feb 2003. If you are interested in how a particular data type is implemented take a look at this case study.

Data types

Consider the following data type declaration:

data T a = MkT !(a,a) !(T a) | Nil

f x = case x of
        MkT p q -> MkT p (q+1)
        Nil     -> Nil

The user's source program mentions only the constructors MkT and Nil. However, these constructors actually do something in addition to building a data value. For a start, MkT evaluates its arguments. Secondly, with the flag -funbox-strict-fields GHC will flatten (or unbox) the strict fields. So we may imagine that there's the source constructor MkT and the representation constructor MkT, and things start to get pretty confusing.

GHC now generates three unique Names for each data constructor:

                             ---- OccName ------
                               String  Name space     Used for
---------------------------------------------------------------------------
The "source data con"            MkT    DataName      The DataCon itself
The "worker data con"            MkT    VarName       Its worker Id
  aka "representation data con"
The "wrapper data con"           $WMkT  VarName       Its wrapper Id (optional)

Recall that each occurrence name (OccName?) is a pair of a string and a name space (see RdrNames, Modules, and OccNames), and two OccNames? are considered the same only if both components match. That is what distinguishes the name of the name of the DataCon? from the name of its worker Id. To keep things unambiguous, in what follows we'll write "MkT{d}" for the source data con, and "MkT{v}" for the worker Id. (Indeed, when you dump stuff with "-ddumpXXX", if you also add "-dppr-debug" you'll get stuff like "Foo {- d rMv -}". The "d" part is the name space; the "rMv" is the unique key.)

Each of these three names gets a distinct unique key in GHC's name cache.

The life cycle of a data type

Suppose the Haskell source looks like this:

data T a = MkT !(a,a) !Int | Nil

f x = case x of
        Nil     -> Nil
        MkT p q -> MkT p (q+1)

When the parser reads it in, it decides which name space each lexeme comes from, thus:

data T a = MkT{d} !(a,a) !Int | Nil{d}

f x = case x of
        Nil{d}     -> Nil{d}
        MkT{d} p q -> MkT{d} p (q+1)

Notice that in the Haskell source all data contructors are named via the "source data con" MkT{d}, whether in pattern matching or in expressions.

In the translated source produced by the type checker (-ddump-tc), the program looks like this:

f x = case x of
        Nil{d}     -> Nil{v}
        MkT{d} p q -> $WMkT p (q+1)

Notice that the type checker replaces the occurrence of MkT by the wrapper, but the occurrence of Nil by the worker. Reason: Nil doesn't have a wrapper because there is nothing to do in the wrapper (this is the vastly common case).

Though they are not printed out by "-ddump-tc", behind the scenes, there are also the following: the data type declaration and the wrapper function for MkT.

data T a = MkT{d} a a Int# | Nil{d}
 
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of 
            (a,b) -> seq t (MkT{v} a b t)

Here, the wrapper $WMkT evaluates and takes apart the argument p, evaluates the argument t, and builds a three-field data value with the worker constructor MkT{v}. (There are more notes below about the unboxing of strict fields.) The worker $WMkT is called an implicit binding, because it's introduced implicitly by the data type declaration (record selectors are also implicit bindings, for example). Implicit bindings are injected into the code just before emitting code or External Core.

After desugaring into Core (-ddump-ds), the definition of f looks like this:

f x = case x of
        Nil{d}       -> Nil{v}
        MkT{d} a b r -> let { p = (a,b); q = I# r } in 
                        $WMkT p (q+1)

Notice the way that pattern matching has been desugared to take account of the fact that the "real" data constructor MkT has three fields.

By the time the simplifier has had a go at it, f will be transformed to:

f x = case x of
      Nil{d}       -> Nil{v}
      MkT{d} a b r -> MkT{v} a b (r +# 1#)

Which is highly cool.

The constructor wrapper functions

The wrapper functions are automatically generated by GHC, and are really emitted into the result code (albeit only after CorePre?; see CorePrep.mkImplicitBinds). The wrapper functions are inlined very vigorously, so you will not see many occurrences of the wrapper functions in an optimised program, but you may see some. For example, if your Haskell source has

map MkT xs

then $WMkT will not be inlined (because it is not applied to anything). That is why we generate real top-level bindings for the wrapper functions, and generate code for them.

The constructor worker functions

Saturated applications of the constructor worker function MkT{v} are treated specially by the code generator; they really do allocation. However, we do want a single, shared, top-level definition for top-level nullary constructors (like True and False). Furthermore, what if the code generator encounters a non-saturated application of a worker? E.g. (map Just xs). We could declare that to be an error (CorePrep? should saturate them). But instead we currently generate a top-level defintion for each constructor worker, whether nullary or not. It takes the form:

MkT{v} = \ p q r -> MkT{v} p q r

This is a real hack. The occurrence on the RHS is saturated, so the code generator (both the one that generates abstract C and the byte-code generator) treats it as a special case and allocates a MkT; it does not make a recursive call! So now there's a top-level curried version of the worker which is available to anyone who wants it.

This strange definition is not emitted into External Core. Indeed, you might argue that we should instead pass the list of TyCons to the code generator and have it generate magic bindings directly. As it stands, it's a real hack: see the code in CorePrep?.mkImplicitBinds.

External Core

When emitting External Core, we should see this for our running example:

data T a = MkT a a Int# | Nil{d}
 
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of 
              (a,b) -> seq t (MkT a b t)

f x = case x of
        Nil       -> Nil
        MkT a b r -> MkT a b (r +# 1#)

Notice that it makes perfect sense as a program all by itself. Constructors look like constructors (albeit not identical to the original Haskell ones).

When reading in External Core, the parser is careful to read it back in just as it was before it was spat out, namely:

data T a = MkT{d} a a Int# | Nil{d}
 
$WMkT :: (a,a) -> T a -> T a
$WMkT p t = case p of 
              (a,b) -> seq t (MkT{v} a b t)

f x = case x of
        Nil{d}       -> Nil{v}
        MkT{d} a b r -> MkT{v} a b (r +# 1#)

Unboxing strict fields

If GHC unboxes strict fields (as in the first argument of MkT above), it also transforms source-language case expressions. Suppose you write this in your Haskell source:

case e of 
  MkT p t -> ..p..t..

GHC will desugar this to the following Core code:

case e of
  MkT a b t -> let p = (a,b) in ..p..t..

The local let-binding reboxes the pair because it may be mentioned in the case alternative. This may well be a bad idea, which is why -funbox-strict-fields is an experimental feature.

It's essential that when importing a type T defined in some external module M, GHC knows what representation was used for that type, and that in turn depends on whether module M was compiled with -funbox-strict-fields. So when writing an interface file, GHC therefore records with each data type whether its strict fields (if any) should be unboxed.

Labels and info tables

Quick rough notes: SLPJ March 2003.

Every data constructor C has two info tables:

  • The static info table (label C_static_info), used for statically-allocated constructors.
  • The dynamic info table (label C_con_info), used for dynamically-allocated constructors.

Statically-allocated constructors are not moved by the garbage collector, and therefore have a different closure type from dynamically-allocated constructors; hence they need a distinct info table. Both info tables share the same entry code, but since the entry code is physically juxtaposed with the info table, it must be duplicated (C_static_entry and C_con_entry respectively).