Version 67 (modified by adamgundry, 8 months ago)

latest data representation

Overloaded record fields: implementation notes

Here be dragons. This page describes implementation details of the overloaded record fields plan and progress on implementing it. Development of the extension is taking place on forks of the ghc, packages-base, haddock and testsuite repositories (on branch 'overloaded-record-fields'). A prototype implementation is also available.

The basic idea

The Has and Upd classes, and GetResult and SetResult type families, are defined in the module GHC.Records in the base package.

Typechecking a record datatype still generates record selectors, but their names have a $sel prefix and end with the name of their type. Moreover, instances for the classes and type families are generated. For example,

data T = MkT { x :: Int }


$sel:x:T :: T -> Int -- record selector (used to be called `x`)
$sel:x:T (MkT x) = x

$dfHasTx :: forall a . a ~ Int => Has T "x" a -- corresponds to the Has instance decl
$dfHasTx = Has { getField _ = $sel:x:T }

$dfUpdTx :: forall a . a ~ Int => Upd T "x" a -- corresponds to the Upd instance decl
$dfUpdTx = Upd { setField _ s e = s { x = e } }

axiom TFCo:R:GetResultTx : GetResult T "x" = Int   -- corresponds to the GetResult type family instance
axiom TFCo:R:SetResultTx : SetResult T "x" Int = T -- corresponds to the SetResult type family instance
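To make the generated instances concrete, here is a minimal user-level sketch of what they amount to for T. The class and family declarations are assumed shapes written out for illustration (using Proxy for the label argument); the real definitions in GHC.Records on the development branch may differ.

```haskell
{-# LANGUAGE DataKinds, FlexibleInstances, KindSignatures,
             MultiParamTypeClasses, TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)

-- Assumed shapes of the families and classes (illustration only).
type family GetResult (r :: Type) (f :: Symbol) :: Type
type family SetResult (r :: Type) (f :: Symbol) (a :: Type) :: Type

class Has (r :: Type) (f :: Symbol) (t :: Type) where
  getField :: Proxy f -> r -> t

class Upd (r :: Type) (f :: Symbol) (a :: Type) where
  setField :: Proxy f -> r -> a -> SetResult r f a

data T = MkT { x :: Int } deriving (Eq, Show)

-- Hand-written counterparts of the generated axioms.
type instance GetResult T "x" = Int
type instance SetResult T "x" Int = T

-- Hand-written counterparts of $dfHasTx and $dfUpdTx.
instance a ~ Int => Has T "x" a where
  getField _ (MkT v) = v

instance a ~ Int => Upd T "x" a where
  setField _ s e = s { x = e }
```

The `a ~ Int` context means the instance head matches any field type, so the instance is selected from the record type and label alone and the field type is then forced to Int.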

The naming of cats


A field is represented by the following datatype, parameterised by the representation of names:

data FieldLbl a = FieldLabel {
      flLabel     :: FieldLabelString, -- ^ Label of the field
      flSelector  :: a,                -- ^ Record selector function
      flInstances :: FldInsts a        -- ^ Instances for overloading
    }

type FieldLabelString = FastString
type FieldLabel = FieldLbl Name

data FldInsts a = FldInsts { fldInstsHas :: a
                           , fldInstsUpd :: a
                           , fldInstsGetResult :: a
                           , fldInstsSetResult :: a }

Every field has a label (FastString), selector, and names for the dfuns and axioms (stored together in the FldInsts record). The dcFields field of DataCon stores a list of FieldLabel, whereas the ifConFields field of IfaceConDecl stores a list of FieldLbl OccName.
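The benefit of parameterising over the name representation can be sketched as follows. This is a toy model: Name and OccName are stand-in types, and toIfaceField is a hypothetical helper, not a function from GHC.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Stand-ins for GHC's real types (assumptions, for illustration only).
type FieldLabelString = String
newtype OccName = OccName String deriving (Eq, Show)
data Name = Name { nameModule :: String, nameOcc :: OccName }
  deriving (Eq, Show)

data FldInsts a = FldInsts { fldInstsHas       :: a
                           , fldInstsUpd       :: a
                           , fldInstsGetResult :: a
                           , fldInstsSetResult :: a }
  deriving (Eq, Show, Functor)

data FieldLbl a = FieldLabel {
      flLabel     :: FieldLabelString, -- label of the field
      flSelector  :: a,                -- record selector function
      flInstances :: FldInsts a        -- instances for overloading
    }
  deriving (Eq, Show, Functor)

type FieldLabel = FieldLbl Name

-- Hypothetical helper: move from the DataCon-style representation
-- (FieldLbl Name) to the interface-style one (FieldLbl OccName).
toIfaceField :: FieldLabel -> FieldLbl OccName
toIfaceField = fmap nameOcc

-- The field x of T from the earlier example, in a made-up module "M".
exampleField :: FieldLabel
exampleField =
  FieldLabel "x" (n "$sel:x:T")
    (FldInsts (n "$dfHasTx") (n "$dfUpdTx")
              (n "TFCo:R:GetResultTx") (n "TFCo:R:SetResultTx"))
  where
    n = Name "M" . OccName
```

Because the selector, dfun and axiom names all share the type parameter, a single fmap converts every name in the field at once.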

AvailInfo and IE

The new definition of AvailInfo is:

data AvailInfo      = Avail Name | AvailTC Name [Name] AvailFields
data AvailFlds name = NonOverloaded [name] | Overloaded [(FieldLabelString, name)]
type AvailFields    = AvailFlds Name

The AvailTC constructor represents a type and its pieces that are in scope. Record fields are now stored separately in the third argument. If the fields are not overloaded, we store only the selector names, whereas if they are overloaded, we store the labels as well. The IEThingWith name [name] (AvailFlds name) constructor of IE represents a thing that can be imported or exported, and also has a separate argument for fields.

Note that a FieldLabelString together with a parent is not enough to uniquely identify a selector, because of data families: if we have

module M ( F (..) ) where
  data family F a
  data instance F Int = MkFInt { foo :: Int }

module N ( F (..) ) where
  import M ( F(..) )
  data instance F Char = MkFChar { foo :: Char }

then N exports two different selectors with the FieldLabelString "foo".
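As a rough model of why the overloaded case stores (label, selector) pairs, the sketch below represents the exports of N with names as plain strings. The helper functions, the constructor names and the selector names shown are hypothetical and chosen only for illustration.

```haskell
-- Stand-ins for GHC's real types (assumptions, for illustration only).
type FieldLabelString = String
type Name = String

data AvailFlds name = NonOverloaded [name]
                    | Overloaded [(FieldLabelString, name)]
  deriving (Eq, Show)

type AvailFields = AvailFlds Name

data AvailInfo = Avail Name | AvailTC Name [Name] AvailFields
  deriving (Eq, Show)

-- Hypothetical helper: all selector names carried by an AvailInfo.
availSelectors :: AvailInfo -> [Name]
availSelectors (Avail _) = []
availSelectors (AvailTC _ _ (NonOverloaded sels)) = sels
availSelectors (AvailTC _ _ (Overloaded flds))    = map snd flds

-- Hypothetical helper: the selectors that share a given label.
selectorsWithLabel :: FieldLabelString -> AvailInfo -> [Name]
selectorsWithLabel lbl (AvailTC _ _ (Overloaded flds)) =
  [sel | (l, sel) <- flds, l == lbl]
selectorsWithLabel _ _ = []

-- What N might export for F(..); constructor and selector names are made up.
exportsOfN :: AvailInfo
exportsOfN =
  AvailTC "F" ["F", "MkFInt", "MkFChar"]
    (Overloaded [ ("foo", "$sel:foo:R:FInt")
                , ("foo", "$sel:foo:R:FChar") ])
```

Looking up the label "foo" under the single parent F yields two selectors, which is why the label and parent alone cannot serve as a key.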

Parent and GlobalRdrElt

The Parent type has an extra constructor FldParent Name FastString that stores the parent Name and the field label FastString. The GlobalRdrElt (GRE) for a field stores the selector name directly, and uses the FldParent constructor to store the field. Thus a field x of type T gives rise to this entry in the GlobalRdrEnv:

x |->  GRE $sel:x:T (FldParent T x) LocalDef

Note that the OccName used when adding a GRE to the environment (greOccName) now depends on the parent field: for FldParent it is the field label rather than the selector name.
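The effect on environment keys can be sketched with a toy model in which names are strings, provenance is omitted, and the environment is an association list. The Parent constructors other than FldParent and the helper functions are simplifications, not GHC's definitions.

```haskell
-- Stand-ins for GHC's real types (assumptions, for illustration only).
type Name = String
type OccName = String
type FieldLabelString = String

data Parent
  = NoParent
  | ParentIs Name
  | FldParent Name FieldLabelString   -- parent type and field label
  deriving (Eq, Show)

-- Provenance (e.g. LocalDef) is left out of this model.
data GRE = GRE { greName :: Name, grePar :: Parent }
  deriving (Eq, Show)

-- For a field the key is the label; otherwise it is the name itself.
greOccName :: GRE -> OccName
greOccName (GRE _ (FldParent _ lbl)) = lbl
greOccName (GRE n _)                 = n

type GlobalRdrEnv = [(OccName, [GRE])]

-- Simplified insertion: add the GRE under the key chosen by greOccName.
extendGlobalRdrEnv :: GRE -> GlobalRdrEnv -> GlobalRdrEnv
extendGlobalRdrEnv gre env =
  case lookup occ env of
    Nothing   -> (occ, [gre]) : env
    Just gres -> (occ, gre : gres) : filter ((/= occ) . fst) env
  where
    occ = greOccName gre

lookupGRE :: OccName -> GlobalRdrEnv -> [GRE]
lookupGRE occ env = maybe [] id (lookup occ env)
```

In this model the entry for field x of T is found under "x" and not under "$sel:x:T", matching the mapping shown above.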

Source expressions

The HsExpr type has extra constructors HsOverloadedRecFld FieldLabelString and HsSingleRecFld RdrName id. When -XOverloadedRecordFields is enabled, and rnExpr encounters HsVar "x" where x refers to multiple GREs that are all record fields, it replaces it with HsOverloadedRecFld "x". When the typechecker sees HsOverloadedRecFld x it emits a wanted constraint Has alpha x beta and returns type alpha -> beta where alpha and beta are fresh unification variables.

When the flag is not enabled, rnExpr turns an unambiguous record field foo into HsSingleRecFld foo $sel:foo:T. This constructor lets us pretty-print the field name (as the user typed it, hence a RdrName) while storing the selector name for typechecking.
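The renamer's choice between these constructors can be sketched as a small decision function. This is a toy model, not the code of rnExpr: names are strings, and the treatment of cases the text above does not mention (for example a single field when the extension is on) is an assumption.

```haskell
-- Stand-ins for GHC's real types (assumptions, for illustration only).
type Name = String
type RdrName = String
type FieldLabelString = String

data Parent = NoParent | FldParent Name FieldLabelString
  deriving (Eq, Show)

data GRE = GRE Name Parent
  deriving (Eq, Show)

data Expr
  = HsVar Name
  | HsSingleRecFld RdrName Name          -- user-written name, selector
  | HsOverloadedRecFld FieldLabelString
  deriving (Eq, Show)

isFieldGRE :: GRE -> Bool
isFieldGRE (GRE _ (FldParent _ _)) = True
isFieldGRE _                       = False

-- First argument: is -XOverloadedRecordFields enabled?
-- Third argument: the GREs that the occurrence refers to.
rnVar :: Bool -> RdrName -> [GRE] -> Either String Expr
rnVar overloaded rdr gres = case gres of
  [] -> Left ("not in scope: " ++ rdr)
  [GRE sel (FldParent _ _)]
    | not overloaded -> Right (HsSingleRecFld rdr sel)
  -- Assumption: any other single match stays an ordinary variable.
  [GRE n _] -> Right (HsVar n)
  _ | overloaded && all isFieldGRE gres -> Right (HsOverloadedRecFld rdr)
    | otherwise -> Left ("ambiguous occurrence: " ++ rdr)
```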

Where an AST representation type (e.g. HsRecField or ConDeclField) contained an argument of type Located id for a field, it now stores a Located RdrName for the label, and some representation of the selector. The parser uses an error thunk for the selector; it is filled in by the renamer (by rnHsRecFields1 in RnPat, and rnField in RnTypes). The new definition of ConDeclField (used in types) is:

data ConDeclField name
  = ConDeclField { cd_fld_lbl  :: Located RdrName,
                   cd_fld_sel  :: name,
                   cd_fld_type :: LBangType name, 
                   cd_fld_doc  :: Maybe LHsDocString }

The new definition of HsRecField is:

data HsRecField id arg = HsRecField {
        hsRecFieldLbl :: Located RdrName,
        hsRecFieldSel :: Either id [(id, id)],
        hsRecFieldArg :: arg,
        hsRecPun      :: Bool }

The renamer (rnHsRecFields1) supplies Left sel_name for the selector if it is unambiguous, or Right xs if it is ambiguous (because it is for a record update, and there are multiple fields with the correct label in scope). In the latter case, the possibilities xs are represented as a list of (parent name, selector name) pairs. The typechecker (tcExpr) tries three ways to disambiguate the update:

  1. Perhaps only one type has all the fields that are being updated.
  2. Use the type being pushed in, if it is already a TyConApp.
  3. Use the type signature of the record expression, if it exists and is a TyConApp.
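The first of these strategies can be sketched as an intersection over the candidate parents of each updated field. This is an illustrative model with names as strings and hypothetical function names, not the code in tcExpr.

```haskell
import Data.List (intersect, nub)

-- Stand-in for GHC's Name (assumption, for illustration only).
type Name = String

-- For one field in the update: its (parent name, selector name) pairs,
-- mirroring the Right xs case of hsRecFieldSel.
type Candidates = [(Name, Name)]

-- Parents that have every field mentioned in the update.
parentsWithAllFields :: [Candidates] -> [Name]
parentsWithAllFields []     = []
parentsWithAllFields fields = foldr1 intersect (map (nub . map fst) fields)

-- Strategy 1: succeed only if exactly one parent remains.
disambiguateByFields :: [Candidates] -> Maybe Name
disambiguateByFields fields =
  case parentsWithAllFields fields of
    [parent] -> Just parent
    _        -> Nothing
```

For an update `r { x = 1, y = 2 }` where x belongs to both S and T but y only to T, the intersection leaves just T. An update mentioning only x leaves two candidates, so the typechecker has to fall back to the other two strategies.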

Automatic instance generation

Typeclass and family instances are generated and typechecked by makeOverloadedRecFldInsts in TcInstDcls, regardless of whether or not the extension is enabled. This is called by tcTopSrcDecls to generate instances for fields from datatypes in the current group (just after derived instances, from deriving clauses, are generated). Overloaded record field instances are not exported to other modules (via tcg_insts and tcg_fam_insts), though underlying dfun ids and axioms are exported from the module as usual (via tcg_binds and a new field tcg_axioms). The new field is needed because there is otherwise no way to export an axiom without exporting the corresponding family instance.

Since the instances are not in scope in the usual way, matchClassInst and tcLookupFamInst look for the relevant constraints or type families and find the instances directly, rather than consulting tcg_inst_env or tcg_fam_inst_env. They first perform a lookup to check that the field name is in scope.

Unused imports

Unused imports and generation of the minimal import list (RnNames.warnUnusedImportDecls) use a map from selector names to labels, in order to print fields correctly. Moreover, consider the following:

module A where
  data T = MkT { x,y::Int }

module B where
  data S = MkS { x,y::Bool }

module C where
  import A( T(x) )
  import B( S(x) )

  foo :: T -> Int
  foo r = x r + 2

Now, do we expect to report the import B( S(x) ) as unused? Only the typechecker will eventually know that. To record this, I've added a new field tcg_used_selectors :: TcRef NameSet to the TcGblEnv, which records the selector names for fields that are encountered during typechecking (when looking up a Has instance etc.). This set is used to calculate the import usage and unused top-level bindings. Thus a field will be counted as used if it is needed by the typechecker, regardless of whether any definitions it appears in are themselves used.

Unused local bindings are trickier, as the following example illustrates:

module M (f) where
  data S = MkS { foo :: Int }
  data T = MkT { foo :: Int }

  f = foo (MkS 3)
  g x = foo x

The renamer calculates the free variables of each definition, to produce a list of DefUses. In both f and g we get potential uses of S(foo) and T(foo), but the typechecker will discover that f uses only S(foo) while g uses neither. (But g requires foo to be in scope somehow!) The simplest thing is to make an occurrence of an overloaded field in an expression return as free variables all the selectors it might refer to. This will sometimes fail to report unused local bindings: in the example, it will not spot that T(foo) is unused.
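The over-approximation can be sketched as follows. This is a toy model in which the in-scope fields are a list of (label, selector) pairs and the function names are hypothetical; it is not the renamer's DefUses machinery.

```haskell
import Data.List (nub, (\\))

-- Stand-ins for GHC's real types (assumptions, for illustration only).
type Name = String
type FieldLabelString = String

-- Fields in scope, as (label, selector name) pairs.
type FieldEnv = [(FieldLabelString, Name)]

-- An occurrence of an overloaded field counts as a use of every
-- selector it might refer to.
fieldOccFVs :: FieldEnv -> FieldLabelString -> [Name]
fieldOccFVs env lbl = [sel | (l, sel) <- env, l == lbl]

-- Selectors that no occurrence could possibly refer to.
unusedSelectors :: FieldEnv -> [FieldLabelString] -> [Name]
unusedSelectors env occs =
  map snd env \\ nub (concatMap (fieldOccFVs env) occs)
```

With both S(foo) and T(foo) in scope, a single occurrence of foo marks both selectors as used, so this model reports nothing as unused even when the typechecker would resolve the occurrence to S(foo) alone.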

Deprecated field names

Deprecations and fixity declarations look for a top-level name, so they cannot be applied to overloaded record fields. Perhaps this should change. Deprecations actually work by OccName, so we could make

{-# DEPRECATED foo "Don't use foo" #-}

apply to all the foo fields in a module, but deciding when a deprecated field has been used raises difficulties similar to those for unused imports.

GADT record updates

Consider the example

data W a where
    MkW :: a ~ b => { x :: a, y :: b } -> W (a, b)

It would be nice to generate

-- setField :: proxy "x" -> W (a, b) -> a -> W (a, b)
setField _ s e = s { x = e }

but this record update is rejected by the typechecker, even though it is perfectly sensible, because of #2595. The currently implemented workaround is instead to generate the explicit update

setField _ (MkW _ y) x = MkW x y

which is fine, but rather long-winded if there are many constructors or fields. Essentially this is doing the job of the desugarer for record updates.

Note that W does not admit type-changing single update for either field, because of the a ~ b constraint. Without it, though, type-changing update should be allowed.

Data families

Consider the following:

data family F (a :: *) :: *
data instance F Int  = MkF1 { foo :: Int }
data instance F Bool = MkF2 { foo :: Bool }

This is perfectly sensible, and gives rise to two *different* record selectors foo, and corresponding Has instances:

instance t ~ Int => Has (F Int) "foo" t
instance t ~ Bool => Has (F Bool) "foo" t

Thus we use the name of the representation tycon, rather than the family tycon, when naming the record selectors: we get $sel:foo:R:FInt and $sel:foo:R:FBool. This requires a bit of care, because lexically (in the GlobalRdrEnv) the selectors still have the family tycon as their parent.
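For illustration, the two instances can be written by hand in a released GHC. This sketch assumes a simplified Has class with a Proxy argument, and it uses the DuplicateRecordFields extension (which postdates this page) so that both data instances may declare foo in one module; neither choice reflects the implementation described here.

```haskell
{-# LANGUAGE DataKinds, DuplicateRecordFields, FlexibleInstances,
             KindSignatures, MultiParamTypeClasses, TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)

-- Assumed, simplified shape of the Has class (illustration only).
class Has (r :: Type) (f :: Symbol) (t :: Type) where
  getField :: Proxy f -> r -> t

data family F (a :: Type) :: Type
data instance F Int  = MkF1 { foo :: Int }
data instance F Bool = MkF2 { foo :: Bool }

-- One instance per data instance: the family application F Int or
-- F Bool in the instance head is what tells the two fields apart.
instance t ~ Int => Has (F Int) "foo" t where
  getField _ (MkF1 v) = v

instance t ~ Bool => Has (F Bool) "foo" t where
  getField _ (MkF2 v) = v
```

Both instances mention the same label "foo" and the same family F, so instance selection depends entirely on the argument of F.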

In order to have access to the representation tycon name in the renamer, it is generated by getLocalNonValBinders and stored in a new field dfid_rep_tycon of DataFamInstDecl. It would be nice if we could do the same for all the derived names, in order to localise the set of names that have been used (currently stored in the tcg_dfun_n mutable field). However, this is tricky:

  • Default associated type declarations result in axioms being generated during typechecking.
  • DFun names for instances of Typeable and the Generics classes are generated during typechecking.

We could work around this but it may not be worth the bother.

Mangling selector names

We could mangle selector names (using $sel:foo:T instead of foo) even when the extension is disabled, but we decided not to because the selectors really should be in scope with their original names, and doing otherwise leads to:

  • Trouble with import/export
  • Trouble with deriving instances in GHC.Generics (makes up un-renamed syntax using field RdrNames)
  • Boot files that export record selectors not working

GHC API changes

  • The minf_exports field of ModuleInfo is now of type [AvailInfo] rather than NameSet, as this provides accurate export information. An extra function modInfoExportsWithSelectors gives a list of the exported names including overloaded record selectors (whereas modInfoExports includes only non-mangled selectors).

To do

  • Add HsVarOut RdrName id instead of HsSingleRecFld (or perhaps rename HsVar to HsVarIn)?
    • This would also be useful to recall how the user referred to something.
  • Is it worth generating all the derived names early, to get rid of tcg_dfun_n?
  • Is TcInstDcls.tcFldInsts correct in its use of simplifyTop and assuming there will be no ev_binds?
  • Consider syntactic sugar for Upd constraints.
  • Improve unsolved Accessor p f error message where p is something silly?
  • Consider defaulting Accessor p to p = (->), and defaulting Has r "f" t constraints where there is only one datatype with a field f in scope.
  • Document the extension.
  • Tidy up code, comment, remove unused imports.