Changes between Version 14 and Version 15 of SIMD


Ignore:
Timestamp:
Nov 11, 2011 1:11:29 PM (4 years ago)
Author:
duncan
Comment:

updates for ABI section

Legend:

Unmodified
Added
Removed
Modified
  • SIMD

    v14 v15  
    99This page describes the issues involved and a design for implementing SIMD vector support in GHC.
    1010
    11 Related pages
    12  * [wiki:SIMDPlan Notes on the current implementation plan]
     11Related pages:
     12 * Notes on the [wiki:SIMDPlan current implementation plan]
    1313
    1414== Introduction ==
     
    8383Using fallbacks does pose some challenges for a stable/portable ABI, in particular how vector registers should be used in the GHC calling convention. This is discussed in a later section.
    8484
     85=== GHC command line flags ===
     86
     87We will add machine flags such as `-msse2` and `-mavx`. These tell GHC that it is allowed to make use of the corresponding instruction sets.
     88
     89For compatibility, the default will remain targeting the base instruction set of the architecture. This is the behaviour of most other compilers. We may also want to add a `-mnative` / `-mdetect` flag that is equivalent to the `-m` flag corresponding to the host machine.
     90
    8591== Code generators ==
    8692
     
    264270The native-sized vector types are distinct types from the explicit-sized vector types, not type aliases for the corresponding explicit-sized vector. This is to support and encourage portable code.
    265271
    266 
    267272== ABIs and calling conventions ==
    268273
     
    296301}}}
    297302There are two cases to consider:
    298  * if the function being called has an unfolding exported from Lib then that unfolding can be compiled in the context of App and can make use of AVX instructions
     303 * if the function being called has an unfolding exported from `Lib` then that unfolding can be compiled in the context of App and can make use of AVX instructions
    299304 * alternatively we are dealing with object code for the function which follows a certain ABI
    300305
    301306Notice that not only do we need to be careful to call `f` and `g` using the right calling convention, but in the case of `g`, the function that we pass as its argument must also follow the calling convention that `g` will call it with.
    302307
    303 Our solution is to take a worker/wrapper approach. We will split each function into a wrapper that uses a lowest common denominator calling convention and a worker that uses the best calling convention for the target sub-architecture. The simplest lowest common denominator calling convention is to pass all vectors on the stack, while the worker convention will use SSE2 or AVX registers.
     308Our solution is to take a worker/wrapper approach. We will split each function into a wrapper that uses a lowest common denominator calling convention and a worker that uses the best calling convention for the target sub-architecture. The simplest lowest common denominator calling convention is to pass all vectors on the stack, while the fast calling convention will use SSE2 or AVX registers.
    304309
    305310For `App` calling `Lib.f` we start with a call to the wrapper, this can be inlined to a call to the worker at which point we discover that the calling convention will use SSE2 registers. For `App` calling `Lib.g` with a locally defined `h`, we would pass the wrapper for `h` to `g` and since we assume we have no unfolding for `g` then this is how it remains: at runtime `g` will call `h` through the wrapper for `h` and so will use the lowest common denominator calling convention.
     
    310315ghc -msse2 A.hs
    311316}}}
    312 That is, a module compiled with SSE2 that imports a module that was compiled with AVX. How can we call functions using AVX registers if we are only targeting SSE2? One option is to note that since we will be using AVX instructions at runtime when we call the functions in B, and hence it is legitimate to use AVX instructions in A also, at least for the calling convention. There may however be some technical restriction to using AVX instructions in A, for example if we decided that we would implement AVX support only in the LLVM backend and not the NCG backend and we chose to compile B using LLVM and A using the NCG. In that case we would have to avoid inlining the wrapper an exposing the worker that uses the AVX calling convention. There are already several conditions that are checked prior to inlining (e.g. phase checks), this would add an additional architecture check.
    313 
    314 It may well be simpler however to just implement minimal SSE2 and AVX support in the NCG, even if it is not used for vector operations and simply for the calling convention.
     317That is, a module compiled with SSE2 that imports a module that was compiled with AVX. How can we call functions using AVX registers if we are only targeting SSE2? There are two design options:
     318* One option is to note that since we will be using AVX instructions at runtime when we call the functions in B, and hence it is legitimate to use AVX instructions in A also, at least for the calling convention.
     319* The other is to avoid generating AVX instructions at all, even for the calling convention, in which case it is essential to avoid inlining the wrapper function since this exposes the worker that uses the AVX calling convention.
     320
     321While the first option is in some ways simpler, it also implies that all ABI-compatible code generators can produce at least some vector instructions. In particular it requires data-movement instructions to be supported. If however we wish to completely avoid implementing any vector support in the NCG backend then we must take the second approach.
     322
     323For the second approach we would need to add an extra architecture flag and check to inlining annotations. There are already several conditions that are checked prior to inlining (e.g. phase checks), this would add an additional check.
     324
     325=== Optional extension: compiling code for multiple sub-architectures ===
     326
     327If we have support for arch-conditional inlining, we in future may want to extend the idea to allow inlining to one of a number of arch-specific implementations.
     328
     329Consider a hypothetical function in a core library that uses vectors but that is too large to be a candidate for inlining. We have to ship core libraries compiled for the base architecture. Hence the function from the core lib will not be compiled to use AVX. Another possibility is to generate several copies of the function worker, compiled for different sub-archtectires. Then when the function is called in another module compiled with -mavx we would like to call the AVX worker. This could be achieved by arch-conditional inlining or rules.
     330
     331This option should only be considered if we expect to have functions in core libs that are above the inlining threshold. This would probably not be the case for ghc-prim and base. It may however make sense for the vector library should that become part of the standard platform and hence typically shipped to users as a pre-compiled binary.
    315332
    316333=== Types for calling conventions ===
     
    318335One of GHC's clever design tricks is that the type of a function in core determines its calling convention. A function in core that accepts an Int is different to one that accepts an Int#. The use of two different types, Int and Int# let us talk about the transformation and lets us write code in core for the wrapper function that converts from Int to Int#.
    319336
    320 If we are to take a worker wrapper approach with calling conventions for vectors then we would do well to use types to distinguish the common and special calling conventions. For example, we could define types:
     337If we are to take a worker wrapper approach with calling conventions for vectors then we would do well to use types to distinguish the common and special calling conventions. For example, we could define sub-architecture specific types:
    321338{{{
    322339FloatSseVec4#
     
    326343DoubleAvxVec4#
    327344}}}
     345We would also need some suitable primitive conversion operations
     346{{{
     347toSseVec4#   :: FloatVec4# -> FloatSseVec4#
     348fromSseVec4# :: FloatSseVec4# -> FloatVec4#
     349etc
     350}}}
    328351Then we can describe the types for the worker and wrapper for our function
    329352{{{
     
    344367f :: DoubleVec4# -> Int
    345368}}}
    346 We have said that this is the lowest common denominator calling convention. This might be passing vectors on the stack. But on x86-64 we know we always have SSE2 available, so we might want to use that in our lowest common denominator calling convention. In that case, would it make sense to say that in fact the wrapper 'f' has type:
     369We have said that this is the lowest common denominator calling convention. The simplest is passing vectors on the stack. This has the advantage of not requiring vector support in the NCG.
     370
     371=== Calling convention and performance ===
     372
     373The mixed ABI approach using worker/wrapper trades off some performance for convenience and compatibility. Why do we think the tradeoff is reasonable?
     374
     375In the case of ordinary unboxed types, Int/Int# etc, this approach is very effective. It is only when calling unknown functions, e.g. higher order functions that are not inlined that we would be calling the wrapper and using the slower calling convention. This is unlikely for high performance numeric code.
     376
     377=== Optional extension: faster common calling convention ===
     378
     379On x86-64 we know we always have SSE2 available, so we might want to use that in our lowest common denominator calling convention. It would of course require support for vector data movement instructions in the NCG.
     380
     381=== Alternative design: only machine-specific ABI types ===
     382
     383With the above extension to use vectors registers in the common calling convention, it would make sense to say that in fact the wrapper `f` has type:
    347384{{{
    348385f :: (# DoubleSseVec2#, DoubleSseVec2# #) -> Int
    349386}}}
    350 This is a plausible design, but it is not necessary to go this way. We can simply declare types like DoubleVec4# to have a particular calling convention without forcing it to be rewritten in terms of machine-specific types in core.
    351 
    352 But it would be plausible to say that types like DoubleVec4# are ephemeral, having no ABI and must be rewritten by a core -> core pass to use machine-specific types with an associated ABI.
     387This is a plausible design, but it is not necessary to go this way. We can simply declare types like `DoubleVec4#` to have a particular calling convention without forcing it to be rewritten in terms of machine-specific types in core.
     388
     389But it would be plausible to say that types like `DoubleVec4#` are ephemeral, having no ABI and must be rewritten by a core -> core pass to use machine-specific types with an associated ABI.
    353390
    354391=== Memory alignment for vectors ===
     
    356393Many CPUs that support vectors have strict alignment requirements, e.g. that 16 byte vectors must be aligned on 16byte boundaries. On some architectures the requirements are not strict but there may be a performance penalty, or alternative instruction may be required to load unaligned vectors.
    357394
    358 Note that the alignment of vectors like DoubleVec4# has to be picked to fit the maximum required alignment of any sub-architecture. For example while DoubleVec4# might be synthesized using operations on DoubleSseVec2# when targeting SSE, the alignment must be picked such that we can use `DoubleAvxVec4#` operations.
    359 
    360 It is relatively straightforward to ensure alignment for vectors in large packed arrays. It is also not too great a burden to ensure stack alignment.
     395Note that the alignment of vectors like `DoubleVec4#` has to be picked to fit the maximum required alignment of any sub-architecture. For example while `DoubleVec4#` might be synthesized using operations on `DoubleSseVec2#` when targeting SSE, the alignment must be picked such that we can use `DoubleAvxVec4#` operations.
     396
     397It is relatively straightforward to ensure alignment for vectors in large packed arrays. It is also not too great a burden to ensure stack alignment. Alignment of arrays passed in from foreign code is the programmer's responsibility.
    361398
    362399A somewhat more tricky problem is alignment of vectors within heap objects, both data constructors and closures. While it is plausible to ban putting vectors in data constructors, it does not seem possible to avoid function closures with vectors saved in the closure's environment record.
     
    366403For functions allocating heap objects with stricter alignment, upon entry along with the usual heap overflow check, it would be necessary to test the heap pointer and if necessary to bump it up to achieve the required alignment. Subsequent allocations within that code block would be able to proceed without further checks as any additional padding between allocations would be known statically. The heap overflow check would also have to take into account the possibility of the extra bump.
    367404
    368 === GHC command line flags ===
    369 
    370 We would add machine flags such as `-msse2` and `-mavx`. These tell GHC that it is allowed to make use of the corresponding instructions.
    371 
    372 For compatibility, the default will remain targeting the base instruction set of the architecture. This is the behaviour of most other compilers. We may also want to add a `-mnative` / `-mdetect` flag that is equivalent to the `-m` flag corresponding to the host machine.
     405=== ABI summary ===
     406
     407The size of the native-sized vectors `IntVec#`, `DoubleVec#` etc correspond to the maximum size for any sub-architecture, e.g. AVX on x86-64.
     408
     409The ordinary `IntVec#`, `Int32Vec4#` etc types correspond to the slow compatible calling convention which passes all vectors on the stack. These vectors must all have their obvious strict alignment. For example `Int32Vec4#` is 16 bytes large and has 16 byte alignment.
     410
     411Extra machine-specific types `DoubleSseVec2#`, `FloatAvxVec8#`, `FloatNeonVec4#` etc correspond to the fast calling convention which use the corresponding vector registers. These have the alignment requirements imposed by the hardware. The machine-specific types need not be exposed but it is also plausible to do so.
     412
     413We will use worker/wrapper to convert between the common types and the machine-specific types.
     414
     415Initially, to avoid implementing vector data-movement instructions in the NCG, we will add arch-conditional inlining of the wrapper functions.
     416
     417If later on we add vector data-movement instructions to the NCG, then the arch-conditional inlining of the wrapper functions can be discarded and the compatible calling convention could be changed to make use of any vector registers in the base architecture (e.g. SSE2 on x86-64).
    373418
    374419== See also ==