wiki:UnliftedDataTypes

Unlifted data types

This page describes the unlifted data types, i.e. algebraic data types which live in kind # rather than kind *. In fact, this is a collection of standalone proposals:

  1. Allow data types to be declared unlifted. (Should be easy; but has a harder subproposal to separate a new kind Unlifted from #)
  2. Allow newtypes over unlifted types (giving them kind #). (Should be easy.)
  3. Provide a built-in Force a which is an unlifted version of a, with no indirection cost. (Harder.)

See also UnpackedSumTypes.

Motivation

Bob Harper has written:

Haskell suffers from a paucity of types. It is not possible in Haskell to define the type of natural numbers, nor the type of lists of natural numbers (or lists of anything else), nor any other inductive type!

The reason, of course, is that whenever you write data Nat = Z | S !Nat, you define a type of strict natural numbers, AS WELL AS bottom. Ensuring that an x :: Nat is never bottom requires all use-sites of this type to do strict pattern matching / force the value appropriately. It would be nice if there was some type-directed mechanism which specified that a value x was always evaluated. This would give benefits, e.g. for code generation, where we can assume that a pointer never points to a thunk or indirection.

It would be hard to justify drastically changing Haskell to allow defining a type which doesn't include bottom, but as it turns out, GHC already supports such non-bottom types, in the form of unlifted types of kind #. In particular, we already have special strict evaluation rules for unlifted types like Int# and Array#.

The fact that a pointer never points to a thunk is especially helpful in the implementation of mutable variables: without this guarantee, code which writes to this mutable variable has to first check if the thunk is fully evaluated before actually performing the memory write. For this reason, the MutVar# primitive (which is a GC'd object) lives in kind #.

Proposal 1: Allow data types to be declared as unlifted

A data type can be declared as unlifted by writing data unlifted, e.g.:

data unlifted UBool = UTrue | UFalse

Such data types are always boxed, but the type does not include bottom and is operationally represented as a pointer to the value. Intuitively, if you have x :: UBool in scope, you are guaranteed to have UTrue or UFalse, and not bottom.

Iceland_jack: Yakshaving, to avoid new syntax we can use GADTSyntax and write

data UBool :: TYPE UnliftedRep where
  UFalse :: UBool
  UTrue  :: UBool

The evaluation rules for unlifted data types are identical to the existing rules we have for types kinded #: lets are strict, cannot be recursive, and function arguments are evaluated before calls. For example:

unot :: UBool -> UBool
unot UTrue = UFalse
unot UFalse = UTrue

main :: IO ()
main = let y = unot (error "foo")
       in return ()

In this example, we get the error "foo", rather than returning (), because the binding of y must be evaluated strictly.

Just like other unlifted types, you cannot bind values of an unlifted data type at top level, or in a recursive group. So this is illegal

module M where
  data unlifted UBool = UTrue | UFalse

  utrue :: UBool
  uTrue = UTrue   -- Illegal!

Non-polymorphic unlifted types can directly be unpacked. The following declaration is valid:

data unlifted StrictInt = StrictInt Int#
data MyInt = MyInt {-# UNPACK #-} StrictInt

and is representationally equivalent to MyInt' here:

data Int = Int Int#
data MyInt' = MyInt' {-# UNPACK #-} !Int

Of course, the constructors for MyInt and MyInt' have different types.

Proposal 2: Polymorphism over a new Unlifted kind

Currently, we have two different kinds (ahem) of unlifted types which live in #: unboxed, unlifted types such as Int#, and boxed, unlifted types such as Array# and the unlifted data types we can now define.

This subproposal is to distinguish between these two kinds, calling boxed, unlifted kinds Unlifted. If users start defining and using unlifted types in normal code, it is likely that they will request polymorphism over unlifted types. There are two particular cases of polymorphism we might like to support:

Polymorphism over unlifted types in types and functions. In data types and functions, we may want to be polymorphic over a type variable in kind Unlifted:

data unlifted UList (a :: Unlifted)
  = UNil | UCons a UList
umap :: forall (a :: Unlifted) (b :: Unlifted).
        UList a -> (a -> b) -> UList b

We cannot be polymorphic in # in general, because this includes unboxed types like Int# which don't have uniform representation. However, we can be polymorphic over unlifted types, which do have uniform representation.

Boxed levity polymorphism in types (and functions with extra code generation). In data types, we may want to have a type parameter which is polymorphic over all boxed types:

data BList (a :: Boxed)
  = BNil | BCons a BList

BList is representationally the same whether or not it is instantiated with a boxed lifted type, or a boxed unlifted type.

However, for levity polymorphism over functions we must generate code twice. Consider::

map :: forall a (b :: Boxed). (a -> b) -> BList a -> BList b
map f (BCons x xs) = BCons (f x) (map f xs)

We do not know if f x should be evaluated strictly or lazily; it depends on whether or not b is unlifted or lifted. This case can be handled by specializing map for the lifted and unlifted cases. The fact that the semantics of the function change depending on the type is a bit questionable (though not entirely unprecedented, c.f. type classes); additionally, there isn't any reason why we couldn't also generate copies of the code for all unboxed types, dealing with Int#, Float# and Double#: it's unclear when to stop generating copies. (For reference, .NET does this on the fly.)

Proposal 3: Allow newtypes over unlifted types

To allow cost-free abstraction over unlifted types, we should allow newtypes to be written over types of kind #, with the resulting newtype being in kind #. For example:

newtype MyInt# = MkInt# Int#

with MyInt# :: #. GHC already supports coercions over kind #, so this should be very simple to implement.

Iceland_jack: Use this to wrap Int# when used in boolean context in GHC.Exts, GHC.Prim

newtype Bool# = Bool# Int#

pattern False# :: Bool#
pattern False# = Bool 0#

pattern True# :: Bool#
pattern True# = Bool 1#

(==#)  :: Int#    -> Int#    -> Bool#
(==##) :: Double# -> Double# -> Bool#

and other similar cases.

Proposal 4: Allow unlifting existing data types with no overhead

Proposal 1 requires a user to define a new data type for every unlifted type they want to define. However, for every lifted data type a user can define, there is an obvious unlifted type one might be interested in: the one without bottom. Fortunately, we can define a data type to unlift an arbitrary lifted type:

data Force :: * -> Unlifted where
  Force :: !a -> Force a

suspend :: Force a -> a
suspend a = a

Force a is the "head strict" version of a: if you have x :: Force Int in scope, it is guaranteed to have already been evaluated to an Int. We can also suspend these strict computations: suspend (error "foo" :: Int#) does not error until forced. Like Box, unlifted computations may not be lifted out of suspend without changing the semantics.

Force and suspend can be written purely as library code, however there is a cost of an indirection of Force. We might notice, however, that the value of type Force a only admits the value Force a: undefined is excluded by the Unlifted kind, and Force undefined is excluded by the strict field. Thus, it would be great if we could represent Force on the heap simply as an unlifted pointer to a, which is never undefined.

Note from David Feuer: the indirection problem could be resolved via https://ghc.haskell.org/trac/ghc/wiki/NewtypeOptimizationForGADTS, which would be a good thing for other reasons too.

Ideally, we would like to define the coercion Coercible (Force a) a, to witness the fact that Force a is representationally the same as a. However, there are two problems:

  1. This coercion is ill-kinded (Force a has kind Unlifted but a has kind *), so we would need John Major equality style coercions.
  1. The coercion is only valid in one direction: I can coerce from Force a to a, but not vice-versa: in the other direction, evaluation may be necessary.

My suggested implementation strategy is to bake in Force as a special data type, which is represented explicitly in Core, but then optimized away in STG.

Optional extensions

(OPTIONAL) Give some syntax for Force. Instead of writing f :: Force Int -> Force Int, we might like to write f :: Int! -> Int!. We define post-fix application of bang to be a wrapping of Force.

(OPTIONAL) Introduce a pattern synonym Thunk. suspend can be generalized into the bidirectional pattern synonym Thunk:

pattern Thunk a <- x | let a = Force x
  where Thunk (Force a) = a

Iceland_jack: This syntax is invalid, is this what was intended?

pattern :: Force a -> a
pattern Thunk a        <- (Force -> a)
  where Thunk (Force a) = a

For example:

let x = Thunk (error "foo" :: Force Int) :: Int
in True

does not error. Pattern matching over Thunk forces the argument (similar to bang patterns) and returns the unlifted value (unlike bang patterns):

let Thunk x = 3 + 5 :: Int
in x :: Force Int

Dynamic semantics of unlifted types

In this section, we review the dynamic semantics of unlifted types. These are not being added by our proposal (since they are already implemented by #), but since they are fairly unfamiliar to most Haskell users, I think this section will be useful.

Case binding. Given case e of x1 -> e1, where e is Unlifted, e is evaluated to whnf, and then the result is case-matched upon. (i.e. it is always as if it is a strict pattern match.)

Let bindings. Given let x = e in e', where x is Unlifted), this desugars to let !x = e in e' which desugars to case e of !x -> e'. Mutually recursive let bindings of unlifted variables are not allowed. Let bindings are evaluated bottom up (but see #10824).

Conditionals. Given if e then e1 else e2 where e1 and e2 are Unlifted, this desugars into the obvious case. (NB: this means e1 and e2 are not eagerly evaluated, as they would be for an ifthenelse function.)

Constructors and function application. Given K e where e is Unlifted, this desugars into case e of !x -> K x. NB: if K is in kind star, then this expression admits bottom!

Intuitively, if you have an unlifted type, anywhere you let bind it or pass it to a function, you evaluate it. Strict let bindings cannot be arbitrarily floated; you must preserve the ordering of bindings and they cannot be floated beyond an expression kinded *.

FAQ

Why do you get to use error in the unlifted examples, isn't error a bottom? error is specially treated to be both lifted and unlifted. It's interpretation in an unlifted setting is that it immediately causes an exception when it is evaluated (it never goes into the heap).

Why not !Int rather than Int! as the syntax proposal? This syntax conflicts with strict data fields. data S a = S !a has a constructor of type S :: Int -> S, taking a lifted type and evaluating it before placing it in the constructor; data S a = S a! has a constructor of type S :: Force Int -> S, which requires the user to have forced the integer. Representationally, the data types are the same, but the outward behavior for clients differs dramatically.

Is Force (Maybe (Force Int)) allowed? No, because Force Int has kind Unlifted but Maybe has kind * -> Unlifted. A data type declaration must be explicitly written to accept an unlifted type (data StrictMaybe (a :: Unlifted) = SMaybe a), or simply be strict in its field (data StrictMaybe2 a = SMaybe2 !a).

How does this affect inlining? In any CBV language, inlining doesn't always preserve semantics; unlifted types are no different. For example, this errors:

  let x = error "foo" :: Force Int
      y = suspend x :: Int
  in True

but this example does not:

  let y = suspend (error "foo" :: Force Int) :: Int
  in True

What's the difference between Force Int and Int#? Force Int is an unlifted, boxed integer; Int# is an unlifted, unboxed integer.

Why aren't strict patterns enough? A user can forget to write a strict pattern at a use-site. Putting a type in kind unlifted forces all use-sites to act as if they had strict patterns.

Is Force [a] the type of strict lists? No. It is the type of a lazy list whose head is always evaluated and non-bottom.

Does foo :: Force [a] -> Force [a] force just the first constructor or the whole spine? You can't tell; the head not withstanding, [a] is still a lazy list, so you would need to look at the function body to see if any extra forcing goes on. Force does not induce seqing: it is an obligation for the call-site.

Fully alternate proposal

This section describes a totally alternate proposal to the one(s) above. The motivation is the same, but the proposal otherwise starts from scratch.

Proposal B1. Allow newtypes over unlifted types

Identical to Proposal 3 above.

Proposal B2. More levity polymorphism

NoSubKinds proposes the following arrangement:

data Levity = Lifted | Unlifted
TYPE :: Levity -> *
type * = TYPE 'Lifted
type # = TYPE 'Unlifted

Instead, Proposal B2 suggests the following generalization:

data Boxity = Boxed | Unboxed
data Levity = Lifted | Unlifted
TYPE :: Boxity -> Levity -> *
type * = TYPE 'Boxed 'Lifted
-- TYPE 'Boxed 'Unlifted is the kind of boxed, unlifted types
-- TYPE 'Unboxed 'Unlifted is the kind of unboxed, unlifted types
-- TYPE 'Unboxed 'Lifted is utterly wrong, and prohibited from ever appearing in a desugared program

type B_L   = TYPE 'Boxed   'Lifted    -- just for this wiki page
type B_UL  = TYPE 'Boxed   'Unlifted  -- just for this wiki page
type UB_UL = TYPE 'Unboxed 'Unlifted  -- just for this wiki page

Type variables would be allowed to have kinds B_L or B_UL but never UB_UL. This would allow polymorphism over boxed-but-unlifted things, by separating out the kinds. Functions could still not be properly levity-polymorphic -- GHC would default a Boxity metavariable to 'Boxed and a levity metavariable to 'Lifted during zonking. (Unless the Levity metavariable is paired with a boxity metavariable that is already known to be 'Unboxed.)

This treatment is a simple extension of levity polymorphism, and it serves the same goals: to have a few cases where we really do want levity-polymorphic code (error, undefined) while providing a nicely uniform structure for kinds-of-types-with-values. I (Richard) am a little bothered by the existence of an unused spot in the structure (TYPE 'Unboxed 'Lifted) but not enough to back down.

Definition: A valued kind is a kind of types that have values. In released GHC, the valued kinds are *, #, and Constraint. Proposal B2 suggests adding TYPE 'Boxed 'Unlifted and TYPE 'Unboxed 'Unlifted to this set.

(Extension: Edward Kmett would recommend

data Constraintiness = Constraint | NotConstraint
TYPE :: Boxity -> Levity -> Constraintiness -> *

to allow better inference around tuple types. While RAE thinks this would actually work just fine, it feels like one bridge too far.)

Proposal B3. Levity polymorphic data types

Proposal 1, above, allows users to make unlifted data types, separately from lifted ones. RAE is worried that this approach would lead to tons of duplication of data types (we might need four variants of Maybe, for example) and wants a more uniform way forward. Hence this Proposal B3.

The proposal will explain by way of example. (The use of GADT syntax here is incidental, but it makes the text in this proposal work out better. The normal traditional-syntax declaration would have identical behavior.)

data Maybe a where
  Just    :: a -> Maybe a
  Nothing :: Maybe a

All we know about the type variable a is that it is used as an argument to ->. Functions can naturally have any valued kind as an argument. But it would be silly to allow a :: TYPE 'Unboxed 'Unlifted, as we'll have trouble generating code for that. Furthermore, we wish to treat all datatypes as levity polymorphic in their return kind (the heart of Proposal B3). So, we infer this kind for Maybe:

Maybe :: forall (v1 :: Levity) (v2 :: Levity).
         TYPE 'Boxed v1 -> TYPE 'Boxed v2

Of course, this is all a bit of a hack. There really must be 4 datatypes under the hood. But each of the 4 make sense to have, and would, no doubt, be wanted ere long by programmers. When zonking term-level program text, we would need to squeeze out any remaining levity polymorphism so that the back-end could figure out which of the four Maybe types we really want here. (The four types really make a data family, don't they? Using data families in the implementation might simplify things a bit.)

To reiterate: the levity polymorphism here is just to simplify the life of the programmer by using one name -- Maybe -- for all four sensible variants of an option type. Just like using levity polymorphism for undefined make the life easier for the programmer.

A recursive example

Let's look at a recursive datatype to see how this works out:

data List a where
  Nil :: List a
  Cons :: a -> List a -> List a

After inference and with everything explicit, we would get

data List :: forall (v1 :: Levity) (v2 :: Levity).
             TYPE 'Boxed v1 -> TYPE 'Boxed v2 where
  Nil :: forall (v1 :: Levity) (v2 :: Levity) (a :: TYPE 'Boxed v1).
         List v1 v2 a
  Cons :: forall (v1 :: Levity) (v2 :: Levity) (a :: TYPE 'Boxed v1).
          a -> List v1 v2 a -> List v1 v2 a

According to this definition, List a :: B_UL is really a proper strict list. That is, we avoid some of the problems in the original proposal (not the "B" proposals) by avoiding issues around affecting only the head of lists. This definition affects the whole list.

Of course, users are free to use levity polymorphic recursion to achieve other combinations of strictness, but the default is to have either all lazy or all strict.

Note: This Proposal B3 does overlap some concerns: it combines the idea of unlifted datatypes with a means for using one name for multiple entities. But I think this makes sense given the convenience of this approach.

Design questions

  1. Should levity polymorphism apply to all datatypes? If we think "yes", do we turn it off when a user explicitly writes -> * in a kind signature for a datatype?
  1. It's tempting to say that the ! prefix operator is now just an identity operation with kind B_UL -> B_UL. That is, it just forces the existing kind inference machinery to infer an unlifted kind. But this would seem to break existing code that expects ! not to affect kinds.
  1. Could the Force and suspend ideas from the non-B proposals help here? Perhaps. RAE doesn't fully understand Force.

Proposal B4: Levity polymorphic functions

Carrying forward from Proposal B3, we could extend levity polymorphism to work on functions. For example, map could be inferred to have this type

map :: forall (v1 :: Levity) (v2 :: Levity) (v3 :: Levity) (v4 :: Levity)
              (a :: TYPE 'Boxed v1) (b :: TYPE 'Boxed v2).
       (a -> b) -> [] @v1 @v3 a -> [] @v2 @v4 b

Here, v1 and v2 are the levities of a and b, respectively, and v3 and v4 are the levities of the argument and result lists, respectively. As previously articulated, we can't have proper levity polymorphic functions, instead requiring specialization. This would mean sixteen maps. That's quite a high number. But I (Richard) think all of the variants make sense. I'd love for there to be some optimization to prevent the need to export 16 definitions here, but I don't see one. In any case, I do think this would work swimmingly for users. Is it worth the cost? That's up for debate.

Conjecture: If we take this idea to its logical extreme -- of making all functions levity polymorphic in all of the datatypes mentioned -- then RAE believes that using ! to control kinds in datatype definitions would not break any code.

Design questions

  1. Is there a way to reduce duplication in object code? If we can't find a "yes" answer, this problem may kill the idea.

Parametricity

Some of the examples given in the alternate proposals (B3 and B4) are not parametric polymorphism, but rather ad-hoc polymorphism of levities. For instance, the map example does not have one uniform implementation, but many based on the instantiations, and each case has quite different behavior.

However, there are many examples of functions that can be properly levity polymorphic, because they are just (at a low level) moving pointers around, and the levity only tracks the overall calling convention for those pointers, rather than the operation of the body of the function. The most immediate example is constructors of data types. If we have:

data T :: forall (l :: Levity). (a :: TYPE 'Boxed l) -> TYPE 'Boxed 'Lifted where
  C :: forall (l :: Levity) (a :: Type 'Boxed l). a -> T a

Then C is genuinely parametric in l. It accepts a pointer to a value that is either lifted or not, stores it, and matching on it later yields a pointer to a value with the same status. It is unclear to the present author whether C would also be genuinely parametric if T were parameterized on its levity status as well.

There are examples beyond constructors, though. All mutable cells for boxed types have the same behavior of simply storing a pointer, so put and get operations should be genuinely parametric. They simply preserve the calling conventions of things they are fed and return. Further, this area is quite rich, and enabling these parametric uses fixes problems with duplication, type safety and efficiency in primops. Right now one must choose between:

ArrayArray# -- not type safe
Array# (Array a) -- has extra indirection

whereas with levity polymorphic Array#, one could have Array# (Array# a) which is the best of both worlds (eliminating the need for duplication).

As a final thought, let's consider the map example:

map :: forall (v1 v2 v3 v4 :: Levity)
              (a :: TYPE 'Boxed v1)
              (b :: TYPE 'Boxed v2)
       (a -> b) -> [] @v1 @v3 a -> [] @v2 @v4 b
map f [] = []
map f (x:xs) =
  let y = f x
      ys = map f xs
   in y:ys

The author believes that map is parametric in v1, as the list stored a values with that calling convention, and f expects them with the same. However, v2 determines whether a closure needs to be built for y, and unless this is somehow attributed to f, different code must be generated for map. The same seems to go for v4, except the fact that it's a recursive call suggests that it doesn't matter who we attribute the closure-building responsibility to, and map is not parametric in it. The remaining question is v3, but it would seem that map is not parametric in it, either, since it determines whether case analysis is required to ensure evaluation. So map is only parametric polymorphic in one of the four variables given.

It is this author's opinion that it would be unfortunate to give ad-hoc levity polymorphism the same syntax as parametric polymorphism (and also that making everything ad-hoc overloaded in levity would probably be confusing), given that there are substantial examples of the latter.

Minimal Semantics

With regard to the quote in the Motivation section, it should be noted that it is mostly untrue. The data type data Nat = Z | S !Nat is a flat domain of natural numbers; but this is also true in strict languages like ML. The only difference between it and data unlifted UNat = UZ | US UNat is the (default) 'calling convention' for operations on the type. For Nat the calling convention is non-strict by default (although one can force strictness), and for UNat (and the type in ML) it is (necessarily) strict. The strict calling convention allows one to ignore the domain aspect for arguments to a function (as opposed to expressions of a type, which coudl denote bottom).

Thus, it seems as though the semantic gap could be filled if !Nat and the like were simply proper types. We could have !Nat :: Unlifted. It could be represented identically to Nat, but bindings of that type would need to be evaluated eagerly (and probably wouldn't be possible at the top level), and 'values' of the type could be relied upon to never be bottom. So for instance, a function of type !Nat -> ... would be able to rely upon its first argument being a non-bottom, evaluated natural number. Semantically, the values of !Nat are exactly the set of natural numbers, and expressions denote the flat domain, but the eager callling conventions ensure that we don't need to think about bottom as a value.

In fact, this minimal addition bears a strong resemblance to the original version of this proposal. However, it was noted that there is a problem with the current meanting of !a in constructors. Given the declaration:

data T = C !Nat !Nat

The constructor C has type Nat -> Nat -> T which forces its arguments, rather than !Nat -> !Nat -> T. And changing it to the latter would be a significant, backward-incompatible change.

However, there may be a solution to this problem in the upcoming work on impredicativity. That work proposes to add a new constraint t <~ u, meaning that t can-be-instantiated/is-a-subtype-of u. The core-level witness of t <~ u is simply a function t -> u.

It is conceivable that with this work, we could add the axiom !a <~ a. The witness of this is simply an identity function that has an effect of forgetting that its argument must already be evaluated, because the values of type !a are also valid values of type a. Then, if the match:

case t of
  C x y -> ...

resulted in x, y :: !Nat, the judgments x :: Nat and y :: Nat (as the Haskell report would specify) are valid. And also, if we have x, y :: !Nat, the application C x y is well-typed with C :: Nat -> Nat -> T as a constructor that evaluates its arguments. The only remaining piece of the puzzle would be to have an optimization pass that eliminates the redundant evaluation that C would do if called with arguments of type !Nat (via weakening to Nat).

Assuming this all works as outlined above, it appears to introduce no redundancy to the language, contrary to having an entirely separate data declaration form.

Last modified 7 months ago Last modified on Apr 21, 2017 10:49:45 AM