wiki:UnpackedSumTypes

Unboxed sum types

This page explains the motivation and implementation of unpacking for sum types.

-XUnboxedSums have been available since GHC 8.2.1.

See also UnliftedDataTypes

Current status

Use Keyword = UnboxedSums to ensure that a ticket ends up on these lists.

Open Tickets:

#12514
Can't write unboxed sum type constructors in prefix form
#13276
Unboxed sums are not Typeable
#14259
Worker/Wrapper for sum return

Closed Tickets:

#12417
API annotations for unboxed sums needs reworking
#12478
Template Haskell support for unboxed sums
#12711
GHC Internal error, unboxed sums
#14051
Unboxed sums-related panic: getUnboxedSumName 513
#14228
PatternSynonyms Non-exhaustive with UnboxedSums

Motivation

GHC does a good job of unpacking product types. Given a declaration like

data T1 a b = C1 a b
data T2 a b = C2 {-# UNPACK #-} !(T1 a b)

C2 will have a representation where all the overhead of the C1 constructor, both the pointer to it in the C2 constructor and the info table pointer in the C1 constructor, has been removed. This saves two words and one indirection compared to a packed representation, which uses five words.

Unfortunately, a similar example using sum types cannot be unpacked today:

data T1 a = Some a | None
data T2 a = C !(T1 a)  -- Cannot UNPACK here

Here the representation of the C constructor will contain a pointer to e.g. the Some constructor. The Some constructor will be a separate heap object and thus needs one word to store its info table pointer.

In this example there is an alternative, unpacked representation that is more memory efficient and has fewer indirections. We could store a constructor tag together with the union of the fields of T1 inside C. Conceptually the memory layout would look like this (in practice we group pointer and non-pointer fields together):

T2 info table pointer T1 constructor tag Fields of Some Fields of None

(In this case None has no fields.)

This representation saves one word and one indirection compared to the packed representation, which uses four words.

Source language design

We add new built-in types for anonymous unboxed sums. These are directly analogous to the existing anonymous unboxed tuples. Specifically:

  • A new language extension UnboxedSums.
  • We add a family of new built-in type constructors for unboxed sums:
    (#|#), (#||#), (#|||#), (#||||#), etc
    
    A sum of n "|"s is a n+1 ary sum. (Just like tuples (,), (,,), etc.)
  • Each n-ary-sum type constructor comes with n data constructors, with systematically-derived names, thus:
    data (#||#) a b c = (# _ | | #) a
                      | (# | _ | #) b
                      | (# | | _ #) c
    
    The _ indicates which disjunct of the sum we mean.
  • You use the type constructor in a distfix way, like so:
    (# Int | Bool #)        means   (#|#) Int Bool
    (# Int | Bool | Int #)  means   (#||#) Int Bool Int
    (# Int | Bool #)        means   (#|#) Int Bool
    
    And similarly the data constructors:
    (# | True #)     means   (# | _ #) True
    (# | 'c' | #)    means   (# | _ | #) 'c'
    
  • You can use the data constructors both in terms (to construct) and in patterns (to decompose).
    case x of
        (# x | | | #) -> ...
        (# | y | | #) -> ...
        ...two more disjuncts needed to be exhaustive
    
  • Unboxed sums are first class values. They can be passed as an argument to a function, returned as its result, be the type of a data constructor field, and so on. Of course, unboxed sums are unlifted (cannot be bottom), and should be represented efficiently (more on that below).
  • Just as for unboxed tuples: The components of an unboxed sum type may be of kind * or #. So (# Int# | Bool #) is fine. And you can nest unboxed sums and tuple arbitrarily, e.g.
    • (# (# Int,Bool #) | Char# #)
    • (# (# Int# | Char # #) | Int #)

All of these rules follow the same pattern as the rules for boxed/unboxed tuples.

Design questions

For large-arity anonymous sums, the data constructor syntax requires counting vertical bars. This is annoying. Might we consider switching to a new syntax where (# 0 of 3 | x #) means (# x | | #) and (# 2 of 6 | y #) means `(# | | y | | | #)`? I (Richard) saw this syntax in an email and thought it might be an improvement.

Implementation

Wired-in types

Unboxed sums get implemented very like boxed and unboxed tuples; see compiler/prelude/TysWiredIn.hs.

The Core language

No changes in Core.

Core to STG

When going to STG we need to eliminate the unboxed sums. This can be done in compiler/simplStg/UnariseStg.hs, just like for tuples.

Given the Core function

f :: (# t_1 | ... | t_n #) -> ...

we convert it to a call to STG which includes the tag and the maximum number of pointer and non-pointer arguments we might need. Example:

Core STG
f :: (# Int# | Bool #) -> ... f :: Word -> Word -> Pointer -> ...
f :: (# Int# | Word# #) -> ... f :: Word -> Word -> ...

See notes in compiler/simplStg/UnariseStg.hs for more details.

Code generation

New StgArg constructor StgRubbishArg and new CmmArg are added for efficient compilation of sums. See StgRubbishArg in compiler/stgSyn/StgSyn.hs.

Unpacking

(NOTE (osa): This part is not yet implemented, the patch is trivial and I'm going to submit it soon)

Given

data T1 a = Some a | None
data T2 a = C {-# UNPACK #-} !(T1 a)

we generate a "worker" constructor

C (# a | (# #) #)

((# #) playing the role of void.)

We then translate the construction of C as follows:

C x

===> (translates to)

case x of
    Some y -> C (# y | #)
    None   -> C (# | (# #) #)

We then translate the elimination of C as follows:

case e of
    C x -> ... x ...

===> (translates to)

case e of
    C x' ->
        let x = case x' of
            (# y | #) -> Some y
            (# | _ #) -> None
        in ... x ...

This above reboxing will go away, using case-of-case and case-of-known-constructor, if we scrutinize x again.


Exploiting nullary constructors

Joachim writes: The current proposed layout for a

    data D a = D a {-# UNPACK #-} !(Maybe a) would be
    [D’s pointer] [a] [tag (0 or 1)] [Just’s a]

So the representation of

         D foo (Just bar)     is     [D_info] [&foo] [1] [&bar]
and of   D foo Nothing        is     [D_info] [&foo] [0] [&dummy]

where dummy is something that makes the GC happy.

But assuming this dummy object is something that is never a valid heap objects of its own, then this should be sufficient to distinguish the two cases, and we could actually have that the representation of

         D foo (Just bar)     is     [D_info] [&foo] [&bar]
and of   D foo Nothing        is     [D_info] [&foo] [&dummy]

and an case analysis on D would compare the pointer in the third word with the well-known address of dummy to determine if we have Nothing or Just. This saves one word.

If we generate a number of such static dummy objects, we can generalize this tag-field avoiding trick to other data types than Maybe. It seems that it is worth doing that if

  • the number of constructors is no more than the number of static dummy objects, and
  • there is one constructor which has more pointer fields than all other constructors.

Also, this trick cannot be applied repeatedly: If we have

  data D = D {-# UNPACK #-} !(Maybe a) | D'Nothing
  data E = E {-# UNPACK #-} !(D a)

then it cannot be applied when unpacking D into E. (Or maybe it can, but care has to be taken that D’s Nothing is represented by a different dummy object than Maybe’s Nothing.)

Anyways, this is an optimization that can be implemented once unboxed sum type are finished and working reliably.

Last modified 3 months ago Last modified on Aug 1, 2017 4:09:57 PM