The goal of the SIMD project is to allow GHC and Haskell libraries to take advantage of SIMD vector instructions. Please see the proposed [wiki:SIMD/Design design] and the current [wiki:SIMD/Implementation/Status implementation status] for further details.

= Using SIMD instructions in GHC =

'''Goal''': improve program running times by taking advantage of the CPU's SIMD vector instructions.

'''How''': by extending GHC to generate code using SIMD vector instructions and by modifying libraries as necessary.

This page describes the issues involved and a design for implementing SIMD vector support in GHC.

Related pages:
 * Notes on the [wiki:SIMDPlan current implementation plan]

== Introduction ==

We are interested in the SIMD vector instructions on current and future generations of CPUs. This includes SSE and AVX on x86/x86-64 and NEON on ARM chips (targets like GPUs or FPGAs are out of scope for this project). These SIMD vector instruction sets are broadly similar in that they have relatively short vector registers and operations for various sizes of integer and/or floating point operands. In the details, however, they have different capabilities and different vector register sizes.

We therefore want a design for SIMD support in GHC that lets us efficiently exploit current vector instructions but that is not tied too tightly to one CPU architecture or generation. In particular, it should be possible to write portable Haskell programs that use SIMD vectors.

On the other hand, we want to be able to write programs for maximum efficiency that exploit the native vector sizes, preferably while remaining portable. For example, algorithms on large variable-length vectors are in principle agnostic about the size of the primitive vector operations.

Finally, we want a design that is not too difficult or time consuming to implement.

=== Use cases ===

We are mainly interested in scientific / numerical use cases with large arrays / vectors. These are the kinds of use cases that DPH already targets.

In the interests of limiting implementation difficulty, we are initially prepared to sacrifice performance in use cases with small vectors. Examples with lots of small vectors include 3D work, where there are lots of 4-element vectors and 4x4 matrices. These tradeoffs show up in our choices about calling conventions and vector memory alignment, which are discussed below.

Note: we will need to be clear with users that initially this SIMD work is not suitable for small vectors, just big arrays.

=== Existing SIMD instruction sets ===

Intel and AMD CPUs use the [http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions SSE family] of extensions and, more recently (since Q1 2011), the [http://en.wikipedia.org/wiki/Advanced_Vector_Extensions AVX] extensions. ARM CPUs (Cortex A series) use the [http://www.arm.com/products/processors/technologies/neon.php NEON] extensions. PowerPC and SPARC have similar vector extensions. Variations between different families of SIMD extensions, and between different members of one family of extensions, include the following:

 '''Register width'''::
  SSE registers are 128 bits, whereas AVX registers are 256 bits, but they can also still be used as 128-bit registers with old SSE instructions. NEON registers can be used as 64-bit or 128-bit registers.
 '''Register number'''::
  SSE sports 8 SIMD registers in the 32-bit i386 instruction set and 16 SIMD registers in the 64-bit x86_64 instruction set. (AVX still has 16 SIMD registers.) NEON's SIMD registers can be used as 32 64-bit registers or 16 128-bit registers.
 '''Register types'''::
  In the original SSE extension, SIMD registers could only hold 32-bit single-precision floats, whereas SSE2 extended that to include 64-bit double-precision floats as well as 8 to 64 bit integral types. The extension from 128 bits to 256 bits in register size only applies to floating-point types in AVX. This is expected to be extended to integer types in AVX2, but in AVX, SIMD operations on integral types can only use the lower 128 bits of the SIMD registers. NEON registers can hold 8 to 64 bit integral types and 32-bit single-precision floats.
 '''Alignment requirements'''::
  SSE requires alignment on 16 byte boundaries. With AVX, it seems that operations on 128-bit SIMD vectors may be unaligned, but operations on 256-bit SIMD vectors need to be aligned to 32 byte boundaries. NEON suggests aligning SIMD vectors with ''n''-bit elements to ''n''-bit boundaries.

=== SIMD/vector support in other compilers ===

Both GCC and LLVM provide some low-level yet portable support for SIMD vector types and operations.

GCC provides [http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html vector extensions] to C where the programmer may define vector types of a fixed size. The standard C `+`, `-`, `*` etc operators then work on these vector types. GCC implements these operations using whatever hardware support is available. Depending on the requested vector size, GCC uses native vector registers and instructions, or synthesises large requested vectors using smaller hardware vectors. For example, it can generate code for operating on vectors of 4 doubles by using SSE2 registers and operations that only handle vectors of 2 doubles.

The LLVM compiler tools targeted by GHC's [wiki:Commentary/Compiler/Backends/LLVM LLVM backend] support a generic [http://llvm.org/docs/LangRef.html#t_vector vector type] of arbitrary but fixed length whose elements may be any LLVM scalar type. In addition to three [http://llvm.org/docs/LangRef.html#vectorops vector operations], LLVM's operations on scalars are overloaded to work on vector types as well. LLVM compiles operations on vector types to target-specific SIMD instructions, such as those of the SSE, AVX, and NEON instruction set extensions. As the capabilities of the various versions of SSE, AVX, and NEON vary widely, LLVM's code generator maps operations on LLVM's generic vector type to the more limited capabilities of the various hardware targets.

== General plan ==

We need to implement support for vectors in several layers of GHC + libraries, from bottom to top:
 * code generators (NCG, LLVM)
 * Cmm
 * Haskell/Core primops
 * some strategy for making use of vector primops, e.g. DPH or the vector library

=== Vector types ===

We intend to provide vectors of the following basic types:

 || Int8  || Int16  || Int32  || Int64  ||
 || Word8 || Word16 || Word32 || Word64 ||
 ||       ||        || Float  || Double ||

=== Fixed and variable sized vectors ===

The hardware supports only small fixed-size vectors. High-level libraries would like to be able to use arbitrary-sized vectors. Similar to the design in GCC and LLVM, we will provide primitive Haskell types and operations for fixed-size vectors. The task of implementing variable-sized vectors in terms of fixed-size vector types and primops is left to the next layer up (DPH, the vector library).

That is, in the core primop layer and down, vector support is only for fixed-size vectors. The fixed sizes will be only powers of 2 and only up to some maximum size. The choice of maximum size should reflect the largest vector size supported by the current range of CPUs (256 bit with AVX):

 || types ||        ||        || vector sizes    ||
 || Int8  || Word8  ||        || 2, 4, 8, 16, 32 ||
 || Int16 || Word16 ||        || 2, 4, 8, 16     ||
 || Int32 || Word32 || Float  || 2, 4, 8         ||
 || Int64 || Word64 || Double || 2, 4            ||
 || Int   || Word   ||        || 2, 4            ||

In addition, we will support vector types with fixed but architecture-dependent sizes (see below).

We could choose to support larger fixed sizes, or the same maximum size for all types, but there is no strict need to do so.

=== Portability and fallbacks ===

To enable portable Haskell code we will provide the same set of vector types and operations on all architectures. Again this follows the approach taken by GCC and LLVM.

We will rely on fallbacks for the cases where certain types or operations are not supported directly in hardware. In particular, we can implement large vectors on machines with only small vector registers. Where there is no vector hardware support at all for a type (e.g. an architecture with no vector unit, or 64-bit doubles on ARM's NEON) we can implement it using scalar code.

The obvious approach is a transformation that synthesises larger vector types and operations using smaller vector operations or scalar operations. This synthesis could plausibly be done at the core, Cmm, or code generator layers; however, the most natural choice would be as a Cmm -> Cmm transformation. This approach would reduce or eliminate the burden on code generators by allowing them to support only their architecture's native vector sizes and types, or none at all. A sketch of the scalar fallback is given below.
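For illustration, here is a minimal sketch of the all-scalar fallback for a 4-wide double addition, written at the Haskell level using the pack/unpack and arithmetic primops proposed later on this page. None of the vector primops used here exist yet; only `+##` is GHC's existing `Double#` addition.
{{{
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Hypothetical: DoubleVec4#, unpackDoubleVec4# and packDoubleVec4#
-- follow the naming convention proposed below.
plusDoubleVec4Fallback# :: DoubleVec4# -> DoubleVec4# -> DoubleVec4#
plusDoubleVec4Fallback# u v =
  case unpackDoubleVec4# u of { (# u0, u1, u2, u3 #) ->
  case unpackDoubleVec4# v of { (# v0, v1, v2, v3 #) ->
  packDoubleVec4# (# u0 +## v0, u1 +## v1, u2 +## v2, u3 +## v3 #) }}
}}}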

Using fallbacks does pose some challenges for a stable/portable ABI, in particular how vector registers should be used in the GHC calling convention. This is discussed in a later section.

=== GHC command line flags ===

We will add machine flags such as `-msse2` and `-mavx`. These tell GHC that it is allowed to make use of the corresponding instruction sets.

For compatibility, the default will remain targeting the base instruction set of the architecture. This is the behaviour of most other compilers. We may also want to add a `-mnative` / `-mdetect` flag that is equivalent to the `-m` flag corresponding to the host machine.

== Code generators ==

We will not extend the portable C backend to emit vector instructions. It will rely on the higher layers transforming vector operations into scalar operations. The portable C backend is not ABI compatible with the other code generators, so there is no concern about vector registers in the calling convention.

LLVM supports vector types and instructions directly. The GHC LLVM backend could be extended to translate vector ops at the Cmm level into LLVM vector ops.

The NCG (native code generator) may need at least minimal support for vector types if vector registers are to be used in the calling convention (see below). If we choose a common calling convention where vectors are passed in registers rather than on the stack, then minimal support in the NCG is necessary to preserve ABI compatibility with the LLVM backend. Whether the NCG also uses vector instructions to improve performance is optional.

== Cmm layer ==

The Cmm layer will be extended to represent vector types and operations.

The `CmmType` describes the machine-level type of data. It consists of the "category" of data, along with the `Width` in bits.

{{{
data CmmType = CmmType CmmCat Width
data Width = ...
data CmmCat     -- "Category" (not exported)
   = GcPtrCat   -- GC pointer
   | BitsCat    -- Non-pointer
   | FloatCat   -- Float
}}}

The current code distinguishes floats, pointer and non-pointer data. These are distinguished primarily because either they need to be tracked separately (GC pointers) or because they live in special registers on many architectures (floats).

For vectors we add two new categories:
{{{
   | VBitsCat  Multiplicity   -- Non-pointer
   | VFloatCat Multiplicity   -- Float

type Multiplicity = Int
}}}
We keep vector types separate from scalars, rather than representing scalars as having multiplicity 1. This is to limit disruption to existing code paths and also because it is expected that vectors will often need to be treated differently from scalars. Again we distinguish float from integral types as these may use different classes of registers. There is no need to support vectors of GC pointers.
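For example, a `CmmType` for a 4-vector of 32-bit floats might be built as follows. This is a sketch only: the helper names are ours, not existing GHC code, and it assumes the `Width` field records the width of each element rather than of the whole vector.
{{{
-- Sketch: smart constructors in the style of GHC's existing
-- cmmBits/cmmFloat (usable inside CmmType.hs, where the CmmType
-- constructor is in scope).
vecBitsType :: Multiplicity -> Width -> CmmType
vecBitsType m w = CmmType (VBitsCat m) w

vecFloatType :: Multiplicity -> Width -> CmmType
vecFloatType m w = CmmType (VFloatCat m) w

-- A length-4 vector of 32-bit floats:
floatVec4 :: CmmType
floatVec4 = vecFloatType 4 W32
}}}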

Vector operations on these machine vector types will be added to the Cmm `MachOp` type, e.g.
{{{
data MachOp =
  ...
  | MO_VF_Add Width Multiplicity
}}}
For example, `MO_VF_Add W64 4` represents vector addition on a length-4 vector of 64-bit floats.
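In the Cmm AST, such an operation would be applied to its operands via the existing `CmmMachOp` expression constructor; a sketch:
{{{
-- Sketch: building the Cmm expression for a 4-wide, 64-bit float
-- vector addition. CmmExpr, CmmMachOp and W64 are GHC's existing
-- Cmm AST types; MO_VF_Add is the proposed constructor above.
vAdd :: CmmExpr -> CmmExpr -> CmmExpr
vAdd x y = CmmMachOp (MO_VF_Add W64 4) [x, y]
}}}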

== Core layer ==

We need Haskell data types and Haskell primitive operations for fixed-size vectors. In some ways this is a harder problem than representing the vector types and operations at the Cmm level. In particular, at the Haskell type level we cannot easily parametrise on the vector length.

Our design is to provide a family of fixed-size vector types and primitive operations, but not to provide any facility to parametrise this family on the vector length.

for width {w} in 8, 16, 32, 64, and "" (empty for native Int#/Word# width)[[BR]]
for multiplicity {m} in 2, 4, 8, 16, 32

`type Int`''{w}''`Vec`''{m}''`#`[[BR]]
`type Word`''{w}''`Vec`''{m}''`#`[[BR]]
`type FloatVec`''{m}''`#`[[BR]]
`type DoubleVec`''{m}''`#`[[BR]]

Syntax note: here {w} and {m} are meta-syntax, not concrete syntax.

Hence we have individual type names with the following naming convention:

 ||              || length 2     || length 4     || length 8     || etc ||
 || native `Int` || `IntVec2#`   || `IntVec4#`   || `IntVec8#`   || ... ||
 || `Int8`       || `Int8Vec2#`  || `Int8Vec4#`  || `Int8Vec8#`  || ... ||
 || `Int16`      || `Int16Vec2#` || `Int16Vec4#` || `Int16Vec8#` || ... ||
 || etc          || ...          || ...          || ...          || ... ||

Similarly there will be families of primops:
{{{
extractInt{w}Vec{m}#  :: Int{w}Vec{m}# -> Int# -> Int{w}#
addInt{w}Vec{m}#      :: Int{w}Vec{m}# -> Int{w}Vec{m}# -> Int{w}Vec{m}#
}}}
From the point of view of the Haskell namespace for values and types, each member of each of these families is distinct. It is just a naming convention that suggests the relationship.
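To make the convention concrete, here is what user-level code written directly against two members of these families might look like. This is a sketch: the vector primops are the proposals above and do not exist yet, while `+##` and `Double#` are existing GHC primitives.
{{{
{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Dot product of two length-2 double vectors, using the proposed
-- timesDoubleVec2# and extractDoubleVec2# primops.
dot2 :: DoubleVec2# -> DoubleVec2# -> Double#
dot2 u v =
  case timesDoubleVec2# u v of
    w -> extractDoubleVec2# w 0# +## extractDoubleVec2# w 1#
}}}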

=== Optional extension: extra syntax ===

We could add a new concrete syntax using `<...>` to suggest a parameter, but have it really still be part of the name:

 ||              || length 2       || length 4       || length 8       || etc ||
 || native `Int` || `IntVec<2>#`   || `IntVec<4>#`   || `IntVec<8>#`   || ... ||
 || `Int8`       || `Int8Vec<2>#`  || `Int8Vec<4>#`  || `Int8Vec<8>#`  || ... ||
 || `Int16`      || `Int16Vec<2>#` || `Int16Vec<4>#` || `Int16Vec<8>#` || ... ||
 || etc          || ...            || ...            || ...            || ... ||

=== Primop generation and representation ===

Internally in GHC we can take advantage of the obvious parametrisation within the families of primitive types and operations. In particular, we extend GHC's `primops.txt.pp` machinery to enable us to describe a family as a whole and to generate its members.

For example, here is some plausible concrete syntax for `primops.txt.pp`:
{{{
parameter <w, m> Width Multiplicity
  with <w, m> in <8, 2>,<8, 4>,<8, 8>,<8, 16>,<8, 32>,
                 <16,2>,<16,4>,<16,8>,<16,16>,
                 <32,2>,<32,4>,<32,8>,
                 <64,2>,<64,4>
}}}
Note that we allow non-rectangular combinations of values for the parameters. We declare the range of values along with the parameter so that we do not have to repeat it for every primtype and primop.
{{{
primtype <w,m> Int<w>Vec<m>#

primop VIntAddOp <w,m> "addInt<w>Vec<m>#" Dyadic
  Int<w>Vec<m># -> Int<w>Vec<m># -> Int<w>Vec<m>#
  {Vector addition}
}}}

This would generate a family of primops, and an internal representation using the type names declared for the parameters:
{{{
data PrimOp = ...
   | IntAddOp
   | VIntAddOp Width Multiplicity
}}}
It is not yet clear what syntax to use to achieve the names of the native-sized types `Int` and `Word`. Perhaps we should use "", e.g.
{{{
parameter <w, m> Width Multiplicity
  with <w, m> in <8, 2>,<8, 4>,<8, 8>,<8, 16>,<8, 32>,
                 <16,2>,<16,4>,<16,8>,<16,16>,
                 <32,2>,<32,4>,<32,8>,
                 <64,2>,<64,4>,
                 <"",2>,<"",4>
}}}

=== Optional extension: primitive int sizes ===

The above mechanism could be used to handle parametrisation between Int8#, Int16#, etc. Currently these do not exist as primitive types; the types Int8, Int16, etc are implemented as a boxed native-sized Int# plus narrowing.

Note that while this change is possible and would make things more uniform, it is not essential for vector support.

That is, we might have:
{{{
parameter <w> Width
  with <w> in <8>, <16>, <32>, <64>, <"">

primtype Int<w>#

primop   IntAddOp <w>    "addInt<w>#"    Dyadic
   Int<w># -> Int<w># -> Int<w>#
   with commutable = True
}}}
generating
{{{
data PrimOp = ...
   | IntAddOp Width
}}}
We might want some other solution so that we can use `+#` as well as `addInt#`, since `+8#` as an infix operator doesn't really work.

== Native vector sizes ==

In addition to various portable fixed-size vector types, we will have a portable vector type that is tuned for the hardware vector register size. This is analogous to the existing integer types that GHC supports. We have Int8, Int16, Int32 etc, and in addition we have Int, the size of which is machine dependent (either 32 or 64 bit).

As with Int, the rationale is efficiency. For algorithms that could work with a variety of primitive vector sizes, it will almost always be fastest to use the vector size that matches the hardware vector register size. Clearly it is suboptimal to use a vector size that is smaller than the native size. Using a larger vector is not nearly as bad as using a smaller one, though it does contribute to register pressure.

Without a native-sized vector, libraries would be forced to use CPP to pick a good vector size based on the architecture, or to pick a fixed size that is always at least as big as the native size on all platforms that are likely to be used. The former is annoying and the latter makes less sense as vector sizes on some architectures increase.

Note that the actual native vector size will be fixed per architecture and will not vary based on "sub-architecture" features like SSE vs AVX. We will pick the size to be the maximum of all the sub-architectures. That is, we would pick the AVX size for x86-64. The rationale for this is ABI compatibility, which is discussed below. In this respect, !IntVec# is like Int#: the size of both is crucial for the ABI and is determined by the target platform/architecture.

So we extend our family of vector types with:

 ||              || native length || length 2     || length 4     || length 8     || etc ||
 || native `Int` || `IntVec#`     || `IntVec2#`   || `IntVec4#`   || `IntVec8#`   || ... ||
 || `Int8`       || `Int8Vec#`    || `Int8Vec2#`  || `Int8Vec4#`  || `Int8Vec8#`  || ... ||
 || `Int16`      || `Int16Vec#`   || `Int16Vec2#` || `Int16Vec4#` || `Int16Vec8#` || ... ||
 || etc          || ...           || ...          || ...          || ...          || ... ||

and there are some top-level constants describing the vector sizes so as to enable their portable use:
{{{
intVecSize, int8VecSize, int16VecSize, int32VecSize, int64VecSize :: Int
wordVecSize, word8VecSize, word16VecSize, word32VecSize, word64VecSize :: Int
floatVecSize, doubleVecSize :: Int
}}}
Note that these constants are of type Int, since top-level values of type Int# are not currently supported. This should not be a problem, as they should always get inlined and unboxed where it matters.

The native-sized vector types are distinct types from the explicit-sized vector types, not type aliases for the corresponding explicit-sized vector. This is to support and encourage portable code.
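For example, portable code can use these constants to compute how many whole native-size vectors fit in an array, without committing to any particular width. A sketch, using the proposed `floatVecSize`:
{{{
-- Split n floats into whole native-width chunks plus a scalar
-- epilogue (floatVecSize is the proposed constant above).
chunkBounds :: Int -> (Int, Int)
chunkBounds n = (n `quot` floatVecSize, n `rem` floatVecSize)
}}}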

== Vector operations ==

The following operations on vectors will be supported. They will need to be implemented at the Haskell/core primop layer and the Cmm `MachOp` layer, with optional support in the code generators.

In the following, `<t>` ranges over `Int<w>`, `Word<w>`, `Float`, `Double`.

Loading and storing vectors in arrays, `ByteArray#` and raw `Addr#`:
{{{
indexInt<w>Vec<m>Array#  :: ByteArray# -> Int# -> Int<w>Vec<m>#
indexWord<w>Vec<m>Array# :: ByteArray# -> Int# -> Word<w>Vec<m>#
indexFloatVec<m>Array#   :: ByteArray# -> Int# -> FloatVec<m>#
indexDoubleVec<m>Array#  :: ByteArray# -> Int# -> DoubleVec<m>#

readInt<w>Vec<m>Array#  :: MutableByteArray# d -> Int# -> State# d -> (# State# d, Int<w>Vec<m>#  #)
readWord<w>Vec<m>Array# :: MutableByteArray# d -> Int# -> State# d -> (# State# d, Word<w>Vec<m># #)
readFloatVec<m>Array#   :: MutableByteArray# d -> Int# -> State# d -> (# State# d, FloatVec<m>#   #)
readDoubleVec<m>Array#  :: MutableByteArray# d -> Int# -> State# d -> (# State# d, DoubleVec<m>#  #)

writeInt<w>Vec<m>Array#  :: MutableByteArray# d -> Int# -> Int<w>Vec<m>#  -> State# d -> State# d
writeWord<w>Vec<m>Array# :: MutableByteArray# d -> Int# -> Word<w>Vec<m># -> State# d -> State# d
writeFloatVec<m>Array#   :: MutableByteArray# d -> Int# -> FloatVec<m>#   -> State# d -> State# d
writeDoubleVec<m>Array#  :: MutableByteArray# d -> Int# -> DoubleVec<m>#  -> State# d -> State# d

readInt<w>Vec<m>OffAddr#  :: Addr# -> Int# -> State# d -> (# State# d, Int<w>Vec<m># #)
readWord<w>Vec<m>OffAddr# :: Addr# -> Int# -> State# d -> (# State# d, Word<w>Vec<m># #)
readFloatVec<m>OffAddr#   :: Addr# -> Int# -> State# d -> (# State# d, FloatVec<m># #)
readDoubleVec<m>OffAddr#  :: Addr# -> Int# -> State# d -> (# State# d, DoubleVec<m># #)

writeInt<w>Vec<m>OffAddr#  :: Addr# -> Int# -> Int<w>Vec<m>#  -> State# d -> State# d
writeWord<w>Vec<m>OffAddr# :: Addr# -> Int# -> Word<w>Vec<m># -> State# d -> State# d
writeFloatVec<m>OffAddr#   :: Addr# -> Int# -> FloatVec<m>#   -> State# d -> State# d
writeDoubleVec<m>OffAddr#  :: Addr# -> Int# -> DoubleVec<m>#  -> State# d -> State# d
}}}
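As an illustration of how the mutable-array operations thread the state token, here is a sketch that adds the i-th 2-wide double vector of one array into another. The primop names are instantiated from the families above and do not exist yet.
{{{
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Read the i-th DoubleVec2# from each array, add them, and write
-- the result back into the first array.
addSlice# :: MutableByteArray# s -> MutableByteArray# s -> Int#
          -> State# s -> State# s
addSlice# xs ys i s0 =
  case readDoubleVec2Array# xs i s0 of { (# s1, u #) ->
  case readDoubleVec2Array# ys i s1 of { (# s2, v #) ->
  writeDoubleVec2Array# xs i (plusDoubleVec2# u v) s2 }}
}}}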

Extracting and inserting vector elements:
{{{
extractInt<w>Vec<m>#   :: Int<w>Vec<m>#  -> Int# -> Int#
extractWord<w>Vec<m>#  :: Word<w>Vec<m># -> Int# -> Word#
extractFloatVec<m>#    :: FloatVec<m>#   -> Int# -> Float#
extractDoubleVec<m>#   :: DoubleVec<m>#  -> Int# -> Double#
}}}
{{{
insertInt<w>Vec<m>#   :: Int<w>Vec<m>#  -> Int# -> Int#    -> Int<w>Vec<m>#
insertWord<w>Vec<m>#  :: Word<w>Vec<m># -> Int# -> Word#   -> Word<w>Vec<m>#
insertFloatVec<m>#    :: FloatVec<m>#   -> Int# -> Float#  -> FloatVec<m>#
insertDoubleVec<m>#   :: DoubleVec<m>#  -> Int# -> Double# -> DoubleVec<m>#
}}}
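For instance, a horizontal sum can be expressed with the extract family. A sketch, instantiating the family at `Float` and m = 4; `plusFloat#` is GHC's existing `Float#` addition, the vector primop is the proposal above:
{{{
{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Horizontal sum of a length-4 float vector via lane extraction.
sumFloatVec4# :: FloatVec4# -> Float#
sumFloatVec4# v =
  (extractFloatVec4# v 0# `plusFloat#` extractFloatVec4# v 1#)
  `plusFloat#`
  (extractFloatVec4# v 2# `plusFloat#` extractFloatVec4# v 3#)
}}}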

Duplicating a scalar to a vector:
{{{
replicateToInt<w>Vec<m>#  :: Int#    -> Int<w>Vec<m>#
replicateToWord<w>Vec<m># :: Word#   -> Word<w>Vec<m>#
replicateToFloatVec<m>#   :: Float#  -> FloatVec<m>#
replicateToDoubleVec<m>#  :: Double# -> DoubleVec<m>#
}}}
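A typical use is scaling every lane by a scalar. A sketch, instantiating the families above at `Double` and m = 2:
{{{
-- Multiply each lane of a DoubleVec2# by a scalar k, by first
-- replicating k into a vector (primop names as proposed above).
scaleDoubleVec2# :: Double# -> DoubleVec2# -> DoubleVec2#
scaleDoubleVec2# k v = timesDoubleVec2# (replicateToDoubleVec2# k) v
}}}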

Vector shuffle:
{{{
shuffle<t>Vec<m>ToVec<m'># :: <t>Vec<m># -> Int32Vec<m'># -> <t>Vec<m'>#
}}}
For the fixed-size vectors (not native size) we may also want to add pack/unpack functions like:
{{{
unpackInt<w>Vec4# :: Int<w>Vec4# -> (# Int#, Int#, Int#, Int# #)
packInt<w>Vec4#   :: (# Int#, Int#, Int#, Int# #) -> Int<w>Vec4#
}}}

Arithmetic operations:
{{{
plus<t>Vec<m>#, minus<t>Vec<m>#,
times<t>Vec<m>#, quot<t>Vec<m>#, rem<t>Vec<m># :: <t>Vec<m># -> <t>Vec<m># -> <t>Vec<m>#

negate<t>Vec<m># :: <t>Vec<m># -> <t>Vec<m>#
}}}
Logic operations:
{{{
andInt<w>Vec<m>#, orInt<w>Vec<m>#, xorInt<w>Vec<m>#    :: Int<w>Vec<m>#  -> Int<w>Vec<m>#  -> Int<w>Vec<m>#
andWord<w>Vec<m>#, orWord<w>Vec<m>#, xorWord<w>Vec<m># :: Word<w>Vec<m># -> Word<w>Vec<m># -> Word<w>Vec<m>#

notInt<w>Vec<m>#  :: Int<w>Vec<m>#  -> Int<w>Vec<m>#
notWord<w>Vec<m># :: Word<w>Vec<m># -> Word<w>Vec<m>#

shiftLInt<w>Vec<m>#,  shiftRAInt<w>Vec<m>#  :: Int<w>Vec<m>#  -> Word# -> Int<w>Vec<m>#
shiftLWord<w>Vec<m>#, shiftRLWord<w>Vec<m># :: Word<w>Vec<m># -> Word# -> Word<w>Vec<m>#
}}}
Comparison:
{{{
cmp<eq,ne,gt,ge,lt,le>Int<w>Vec<m>#  :: Int<w>Vec<m>#  -> Int<w>Vec<m>#  -> Word<w>Vec<m>#
cmp<eq,ne,gt,ge,lt,le>Word<w>Vec<m># :: Word<w>Vec<m># -> Word<w>Vec<m># -> Word<w>Vec<m>#
}}}
Note that LLVM does not yet support the comparison operations (see the comment at the end of the documentation for the [http://llvm.org/docs/LangRef.html#i_icmp icmp] instruction, for example).

Integer width narrow/widen operations:
{{{
narrowInt<w>To<w'>Vec<m>#  :: Int<w>Vec<m>#  -> Int<w'>Vec<m>#    -- for w' < w
narrowWord<w>To<w'>Vec<m># :: Word<w>Vec<m># -> Word<w'>Vec<m>#   -- for w' < w

widenInt<w>To<w'>Vec<m>#   :: Int<w>Vec<m>#  -> Int<w'>Vec<m>#    -- for w' > w
widenWord<w>To<w'>Vec<m>#  :: Word<w>Vec<m># -> Word<w'>Vec<m>#   -- for w' > w
}}}
Note: LLVM calls these truncate and extend (sign extend or zero extend).

Floating point conversion:
{{{
narrowDoubleToFloatVec<m># :: DoubleVec<m># -> FloatVec<m>#
widenFloatToDoubleVec<m>#  :: FloatVec<m>#  -> DoubleVec<m>#

roundFloatToInt32Vec<m>#     :: FloatVec<m>#  -> Int32Vec<m>#
roundFloatToInt64Vec<m>#     :: FloatVec<m>#  -> Int64Vec<m>#
roundDoubleToInt32Vec<m>#    :: DoubleVec<m># -> Int32Vec<m>#
roundDoubleToInt64Vec<m>#    :: DoubleVec<m># -> Int64Vec<m>#

truncateFloatToInt32Vec<m>#  :: FloatVec<m>#  -> Int32Vec<m>#
truncateFloatToInt64Vec<m>#  :: FloatVec<m>#  -> Int64Vec<m>#
truncateDoubleToInt32Vec<m># :: DoubleVec<m># -> Int32Vec<m>#
truncateDoubleToInt64Vec<m># :: DoubleVec<m># -> Int64Vec<m>#

promoteInt32ToFloatVec<m>#   :: Int32Vec<m># -> FloatVec<m>#
promoteInt64ToFloatVec<m>#   :: Int64Vec<m># -> FloatVec<m>#
promoteInt32ToDoubleVec<m>#  :: Int32Vec<m># -> DoubleVec<m>#
promoteInt64ToDoubleVec<m>#  :: Int64Vec<m># -> DoubleVec<m>#
}}}

TODO: should consider:
 * vector constants, at least at the Cmm level
 * replicating a scalar to a vector
 * FMA: fused multiply-add; this is supported by NEON and AVX, however a software fallback may not be possible with the same precision. Tricky.
 * SSE/AVX also supports a bunch of interesting things:
   * add/sub/mul/div of a vector by a scalar
   * reciprocal, square root, reciprocal of square root
   * permute, shuffle, "blend", masked moves
   * abs
   * min, max within a vector
   * average
   * horizontal add/sub
   * shift whole vector left/right by n bytes
   * "and-not" logical op
   * gather (but not scatter) of 32/64-bit ints and floats from memory (base + vector of offsets)

=== Int/Word size wrinkle ===

Note that there is a wrinkle with the 32- and 64-bit int and word types. For example, the types for the extract functions should be:
{{{
extractInt32Vec<m>#  :: Int32Vec<m>#  -> Int# -> INT32
extractInt64Vec<m>#  :: Int64Vec<m>#  -> Int# -> INT64
extractWord32Vec<m># :: Word32Vec<m># -> Int# -> WORD32
extractWord64Vec<m># :: Word64Vec<m># -> Int# -> WORD64
}}}
where `INT32`, `INT64`, `WORD32`, `WORD64` are CPP macros that expand in an arch-dependent way to the types Int#/Int64# and Word#/Word64#.

To describe this in the primop definition we might want something like:
{{{
primop   VWordExtractOp <w,m,t>    "extractWord<w>Vec<m>#"    GenPrimOp
  Word<w>Vec<m># -> Int# -> <t>
  with <w, m, t> in <8, 2,Word#>,<8, 4,Word#>,<8, 8,Word#>,<8, 16,Word#>,<8, 32,Word#>,
                    <16,2,Word#>,<16,4,Word#>,<16,8,Word#>,<16,16,Word#>,
                    <32,2,WORD32>,<32,4,WORD32>,<32,8,WORD32>,
                    <64,2,WORD64>,<64,4,WORD64>,
                    <"",2,WORD>,<"",4,WORD>
}}}

To iron out this wrinkle we would need the whole family of primitive types Int8#, Int16#, Int32#, etc, whereas currently only the native register-sized Int# type is provided, plus a primitive Int64# type on 32-bit systems.

== Data Parallel Haskell layer ==

In [http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell DPH], we will use the new SIMD instructions by suitably modifying the definition of the lifted versions of arithmetic and other operations that we would like to accelerate. These lifted operations are defined in the `dph-common` package and made accessible to the vectoriser via [wiki:DataParallel/VectPragma VECTORISE pragmas]. Many of them currently use `VECTORISE SCALAR` pragmas, such as
{{{
(+) :: Int -> Int -> Int
(+) = (P.+)
{-# VECTORISE SCALAR (+) #-}
}}}
We could define them more verbosely using a plain `VECTORISE` pragma, but might instead like to extend `VECTORISE SCALAR` or introduce a variant.

'''NB:''' The use of SIMD instructions interferes with vectorisation avoidance for scalar subcomputations. Code that avoids vectorisation also avoids the use of SIMD instructions. We would like to use SIMD instructions, but still avoid full-scale vectorisation. This should be possible, but it is not immediately clear how to realise it (elegantly).

== ABIs and calling conventions ==

For each CPU architecture, GHC has a calling convention that it uses for all Haskell function calls. The calling convention specifies where function arguments and results are placed in registers and on the stack. Adhering to the calling convention is necessary for correctness; code compiled using different calling conventions should not be linked together. Note that currently the LLVM and NCG code generators adhere to the same ABI.

The calling convention needs to be extended to take into account the primitive vector types. We have to decide if vectors should be passed in registers or on the stack, and how to handle vectors that do not match the native vector register size.

For efficiency it is highly desirable to make use of vector registers in the calling convention. This can be significantly quicker than copying vectors to and from the stack.

Within the same overall CPU architecture, there are several sub-architectures with different vector capabilities and in particular different vector sizes. The x86-64 architecture supports SSE2 vectors as a baseline, which includes pairs of doubles, but the AVX extension doubles the size of the vector registers. Ideally when compiling for AVX we would make use of the larger AVX vectors, including passing the larger vectors in registers.

This poses a major challenge: we want to make use of large vectors when possible, but we would also like to maintain some degree of ABI compatibility.

=== Alternative design: separate ABIs ===

It is worth briefly exploring the option of abandoning ABI compatibility. We could declare that we have two ABIs on x86-64, the baseline SSE ABI and the AVX ABI. We would further declare that to generate AVX code you must build all of your libraries using AVX. Essentially this would mean having two complete sets of libraries, or perhaps simply two instances of GHC, each with their own libraries. While this would work, and may be satisfactory when speed is all that matters, it would not encourage use of vectors more generally. In practice, haskell.org and Linux distributions would have to distribute the more compatible SSE build, so that in many cases even users with AVX hardware would be using GHC installations that make no use of AVX code. On x86 the situation could be even worse, since the baseline x86 sub-architecture used by many Linux distributions does not include even SSE2. In addition, it is wasteful to have two instances of libraries when most libraries do not use vectors at all.

=== Selected design: mixed ABIs using worker/wrapper ===

It is worth exploring options for making use of AVX without having to force all code to be recompiled. Ideally the base package would not need to be recompiled at all, and perhaps only packages like vector would be recompiled to take advantage of AVX.

Consider the situation where we have two modules `Lib.hs` and `App.hs` where `App` imports `Lib`. The `Lib` module exports:
{{{
f :: DoubleVec4# -> Int
g :: (DoubleVec4# -> a) -> a
}}}
which are used by App. We compile:
{{{
ghc -msse2 Lib.hs
ghc -mavx  App.hs
}}}
There are two cases to consider:
 * if the function being called has an unfolding exported from `Lib`, then that unfolding can be compiled in the context of `App` and can make use of AVX instructions;
 * alternatively, we are dealing with object code for the function, which follows a certain ABI.

Notice that not only do we need to be careful to call `f` and `g` using the right calling convention, but in the case of `g`, the function that we pass as its argument must also follow the calling convention that `g` will call it with.

Our solution is to take a worker/wrapper approach. We will split each function into a wrapper that uses a lowest common denominator calling convention and a worker that uses the best calling convention for the target sub-architecture. The simplest lowest common denominator calling convention is to pass all vectors on the stack, while the fast calling convention will use SSE2 or AVX registers.

For `App` calling `Lib.f`, we start with a call to the wrapper; this can be inlined to a call to the worker, at which point we discover that the calling convention will use SSE2 registers. For `App` calling `Lib.g` with a locally defined `h`, we would pass the wrapper for `h` to `g`, and since we assume we have no unfolding for `g`, this is how it remains: at runtime `g` will call `h` through the wrapper for `h` and so will use the lowest common denominator calling convention.

We might be concerned with the reverse situation where we have A and B, with A importing B:
{{{
ghc -mavx  B.hs
ghc -msse2 A.hs
}}}
That is, a module compiled with SSE2 that imports a module that was compiled with AVX. How can we call functions using AVX registers if we are only targeting SSE2? There are two design options:
 * One option is to note that since we will be using AVX instructions at runtime when we call the functions in B, it is legitimate to use AVX instructions in A as well, at least for the calling convention.
 * The other is to avoid generating AVX instructions at all, even for the calling convention, in which case it is essential to avoid inlining the wrapper function, since this exposes the worker that uses the AVX calling convention.

While the first option is in some ways simpler, it also implies that all ABI-compatible code generators can produce at least some vector instructions; in particular, it requires data-movement instructions to be supported. If we wish to completely avoid implementing any vector support in the NCG backend, then we must take the second approach.

For the second approach we would need to add an architecture check to inlining annotations. There are already several conditions that are checked prior to inlining (e.g. phase checks); this would add an additional check.

=== Optional extension: compiling code for multiple sub-architectures ===

If we have support for arch-conditional inlining, in future we may want to extend the idea to allow inlining to one of a number of arch-specific implementations.

Consider a hypothetical function in a core library that uses vectors but that is too large to be a candidate for inlining. We have to ship core libraries compiled for the base architecture, hence the function from the core lib will not be compiled to use AVX. Another possibility is to generate several copies of the function worker, compiled for different sub-architectures. Then when the function is called in another module compiled with `-mavx`, we would like to call the AVX worker. This could be achieved by arch-conditional inlining or rules.

This option should only be considered if we expect to have functions in core libs that are above the inlining threshold. This would probably not be the case for ghc-prim and base. It may however make sense for the vector library, should that become part of the standard platform and hence typically be shipped to users as a pre-compiled binary.

=== Types for calling conventions ===

One of GHC's clever design tricks is that the type of a function in core determines its calling convention. A function in core that accepts an `Int` is different from one that accepts an `Int#`. The use of two different types, `Int` and `Int#`, lets us talk about the transformation and lets us write code in core for the wrapper function that converts from `Int` to `Int#`.

If we are to take a worker/wrapper approach with calling conventions for vectors, then we would do well to use types to distinguish the common and special calling conventions. For example, we could define sub-architecture-specific types:
{{{
FloatSseVec4#
DoubleSseVec2#

FloatAvxVec8#
DoubleAvxVec4#
}}}
We would also need some suitable primitive conversion operations:
{{{
toSseVec4#   :: FloatVec4# -> FloatSseVec4#
fromSseVec4# :: FloatSseVec4# -> FloatVec4#
etc
}}}
Then we can describe the types for the worker and wrapper for our function
{{{
f :: DoubleVec4# -> Int
}}}
This remains the type of the wrapper, which is also still called `f`. If we compile the module with `-msse2` or `-mavx`, then we get workers with the respective types:
{{{
f_worker :: (# DoubleSseVec2#, DoubleSseVec2# #) -> Int
}}}
or
{{{
f_worker :: DoubleAvxVec4# -> Int
}}}
Note that in the SSE2 case we have to synthesize a vector of length 4 using native vectors of length 2.
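To make the split concrete, here is a sketch of the wrapper and worker that might be generated under `-mavx`. The conversion `toAvxVec4#` is hypothetical, named by analogy with `toSseVec4#` above:
{{{
-- Wrapper: stays at the portable type; kept small so it can inline.
f :: DoubleVec4# -> Int
f v = f_worker (toAvxVec4# v)
{-# INLINE f #-}

-- Worker: takes its argument in an AVX register.
f_worker :: DoubleAvxVec4# -> Int
f_worker w = ...  -- the original body of f
}}}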

Now it is clearer what the calling conventions of the workers are. What is the calling convention of the wrapper?
{{{
f :: DoubleVec4# -> Int
}}}
We have said that this uses the lowest common denominator calling convention. The simplest choice is to pass vectors on the stack. This has the advantage of not requiring vector support in the NCG.

=== Calling convention and performance ===

The mixed ABI approach using worker/wrapper trades off some performance for convenience and compatibility. Why do we think the tradeoff is reasonable?

In the case of ordinary unboxed types, Int/Int# etc, this approach is very effective. It is only when calling unknown functions, e.g. higher-order functions that are not inlined, that we would be calling the wrapper and using the slower calling convention. This is unlikely for high-performance numeric code.

=== Optional extension: faster common calling convention ===

On x86-64 we know we always have SSE2 available, so we might want to use it in our lowest common denominator calling convention. This would of course require support for vector data-movement instructions in the NCG.

=== Alternative design: only machine-specific ABI types ===

With the above extension to use vector registers in the common calling convention, it would make sense to say that in fact the wrapper `f` has type:
{{{
f :: (# DoubleSseVec2#, DoubleSseVec2# #) -> Int
}}}
This is a plausible design, but it is not necessary to go this way. We can simply declare types like `DoubleVec4#` to have a particular calling convention without forcing them to be rewritten in terms of machine-specific types in core.

But it would be plausible to say that types like `DoubleVec4#` are ephemeral, having no ABI, and must be rewritten by a core -> core pass to use machine-specific types with an associated ABI.

=== Memory alignment for vectors ===

Many CPUs that support vectors have strict alignment requirements, e.g. that 16-byte vectors must be aligned on 16-byte boundaries. On some architectures the requirements are not strict, but there may be a performance penalty, or alternative instructions may be required to load unaligned vectors. For example, AVX has special instructions for unaligned loads and stores, but Intel estimates a [http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors/ 20% performance loss].

LLVM has primitives that can load and store vectors from unaligned memory locations, which (presumably) compile to either aligned vector instructions if the architecture has them, or non-vector instructions if not.  So alignment of vectors in memory is optional, and we can make an independent choice about whether we align stored vectors:

 * on the stack
 * in a heap closure
 * in an array

'''Alignment in arrays.''' Arrays of vectors are clearly the most important case, so we must support allocation of aligned unboxed arrays.  Indeed GHC already supports ''pinned'' arrays of unboxed data, and any array larger than about 3k is implicitly pinned. Supporting unpinned arrays would be somewhat more difficult, requiring some GC support to keep the objects aligned when copying them and requiring that the "slop" be filled in some cases, but it could be done.

'''Alignment on the stack.''' Aligning the stack could be done either by ensuring that all stack allocation is a multiple of the alignment (thus possibly wasting a lot of stack space), or by adding extra frames to fill the slop when allocating a frame that needs alignment.  Neither option is particularly attractive.  We propose to use unaligned access to vectors stored on the stack for the time being, and to revisit this decision if it is found to be a performance bottleneck later.

'''Alignment in the heap.'''  Again, while we could arrange the correct alignment for vectors stored in heap objects, it would be painful to do so, requiring code generator support and GC support.  We propose not to do this, at least for the time being, and to use unaligned loads and stores for vectors in heap objects.

=== ABI summary ===

The sizes of the native-sized vectors `IntVec#`, `DoubleVec#` etc correspond to the maximum size for any sub-architecture, e.g. AVX on x86-64.

The ordinary `IntVec#`, `Int32Vec4#` etc types correspond to the slow compatible calling convention, which passes all vectors on the stack. These vectors must all have their obvious strict alignment. For example, `Int32Vec4#` is 16 bytes in size and has 16-byte alignment.

Extra machine-specific types `DoubleSseVec2#`, `FloatAvxVec8#`, `FloatNeonVec4#` etc correspond to the fast calling convention, which uses the corresponding vector registers. These have the alignment requirements imposed by the hardware. The machine-specific types need not be exposed, but it is also plausible to do so.

We will use worker/wrapper to convert between the common types and the machine-specific types.

Initially, to avoid implementing vector data-movement instructions in the NCG, we will add arch-conditional inlining of the wrapper functions.

If later on we add vector data-movement instructions to the NCG, then the arch-conditional inlining of the wrapper functions can be discarded, and the compatible calling convention can be changed to make use of any vector registers in the base architecture (e.g. SSE2 on x86-64).

== See also ==

 * [http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html Blog article about Larrabee and Nvidia, MIMD vs. SIMD]
 * [wiki:SimdLlvm SIMD LLVM] A previous (LLVM-specific) iteration of this SIMD proposal.
 * [wiki:VectorComputing VectorComputing] A previous proposal to make use of x86 SSE in GHC.

= Current Implementation Status =

The prototype implementation of the above specification is available as the `simd` branch of GHC.

== General plan ==

=== Vector types ===

Vectors of the following types are implemented: `Int32`, `Int64`, `Float`, and `Double`.

=== Fixed and variable sized vectors ===

For each type, currently only one vector width is implemented, namely the width that is appropriate for SSE2. This means that vectors are currently all 16 bytes in size.

== Code generators ==

Only the LLVM code generator is supported.

== Cmm layer ==

Our `CmmType` representation for vectors differs slightly from the proposal. See [source:/compiler/cmm/CmmType.hs?rev=e42746d07239888c74e937046fadf93655b44b65#L42 cmm/CmmType.hs].

See [source:/compiler/cmm/CmmMachOp.hs?rev=e42746d07239888c74e937046fadf93655b44b65#L106 cmm/CmmMachOp.hs] for the new vector MachOps.

== Core layer ==

The implementation differs from the proposal in its naming scheme. We wanted to avoid overloading the term "vector," so, e.g., a 4-wide SIMD vector of `Float#`s is a `FloatX4#`.

See [source:/compiler/prelude/primops.txt.pp?rev=e42746d07239888c74e937046fadf93655b44b65#L1935 compiler/prelude/primops.txt.pp] for the new primops. Not everything in the proposal is implemented, but we do have a useful subset.

== Native vector sizes ==

This is unimplemented. Instead we define a higher-level `Multi` data family whose instances are platform-dependent. For example, a `Multi Int` is represented by an `Int32X4#` on a 32-bit platform and by an `Int64X2#` on a 64-bit platform.
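A sketch of what such a family might look like; the constructor names here are illustrative, not the actual definitions from the `simd` work (`WORD_SIZE_IN_BITS` comes from GHC's `MachDeps.h`):
{{{
{-# LANGUAGE CPP, MagicHash, TypeFamilies #-}
#include "MachDeps.h"

-- Hypothetical sketch of a platform-dependent Multi data family.
data family Multi a

#if WORD_SIZE_IN_BITS == 64
data instance Multi Int = MultiInt Int64X2#
#else
data instance Multi Int = MultiInt Int32X4#
#endif
}}}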

== ABIs and calling conventions ==

Integrating variable-sized vectors with GHC's calling convention is a challenge. How many new registers do we add? Do we add registers for each vector type? The correct approach is unclear, so the current implementation passes all SIMD vectors on the stack.

=== Memory alignment for vectors ===

The implementation does not attempt to align memory containing SIMD vectors. SIMD vector loads and stores do not assume alignment.

Obsolete sub-topics:
 * An [wiki:SIMD/Implementation/Plan implementation plan].
 * Manuel's notes on [wiki:SIMD/Implementation/Llvm implementing SIMD support in GHC using LLVM].
 * An [wiki:SIMD/Implementation/Old old implementation] of the very beginnings of SIMD support in GHC.
 * An example of vector operations in [wiki:SIMD/LlvmExample LLVM IL].