Version 24 (modified by pmonday, 4 years ago)



This page documents the implementation stages and components for adding SIMD support to GHC's LLVM back-end. The overriding requirements and architecture for the project are described on the SIMD LLVM page.

Based on that design, the high-level tasks that must be accomplished include the following:

  1. Modify autoconf to determine SSE availability and vector size
  2. Add new PrimOps to allow Haskell to make use of Vectors
  3. Add new MachOps to Cmm to communicate use of Vectors
  4. Modify the LLVM Code Generator to translate Cmm to LLVM vector instructions
  5. Demonstrate use of PrimOps from Haskell program
  6. Modify Vector Library
  7. Modify DPH Libraries
  8. Arrange for the other code generators to continue to generate non-SIMD code

Introduction of SIMD support to GHC will occur in stages to demonstrate the entire “vertical” stack is functional:

  1. Introduce “Float” PrimOps (as necessary to run an example showing SIMD usage in the LLVM)
  2. Add appropriate Cmm support for the Float primtype / primop subset
  3. Modify the LLVM Code Generator to support vectorization for this subset
  4. Demonstrate the PrimOps and do limited performance testing to ensure SIMD is functional
  5. Modify Vector Libraries to make use of new PrimOps
  6. Modify the DPH Libraries
  7. Higher level examples using the above libraries
  8. Build out the remaining PrimOps
  9. Demonstrate full stack
  10. Test remaining code generators

Current Open Questions

These clearly won't be all of the questions I have; a substantial amount of work spans the entire GHC compiler stack before reaching the LLVM instructions.

Modify autoconf

Determining whether a particular hardware architecture has SIMD instructions, which version of the instructions is available (SSE, SSE2, SSE3, SSE4, or an iteration of one of those), and consequently the size of the vectors that are supported happens during the configure step, on the architecture that the build will run on. This is also when the sizes of Ints, alignment constants, and other pieces critical to GHC are calculated.

Working back from the results to the place where the changes are introduced: the current alignment and primitive sizes appear in ./includes/ghcautoconf.h. Here is a sample:

/* The size of `char', as computed by sizeof. */
#define SIZEOF_CHAR 1

/* The size of `double', as computed by sizeof. */
#define SIZEOF_DOUBLE 8
These are constructed from mk/config.h*, which are generated by autoconf and autoheader. That file (or a related one) should be able to be modified to determine whether SSE is available and, consequently, the vector size that can be operated on and (later) how many pieces of data can be operated on in parallel (which depends on the operation). The original MMX registers were 64 bits wide, while SSE and all later versions (SSE2 and above) provide 128-bit registers. This implies that four 32-bit values can be operated on in a single instruction.
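The register-width arithmetic above can be checked with a trivial helper. The sketch below is purely illustrative (the `lanesFor` name is made up and is not part of GHC); it just divides the register width by the element width:

```haskell
-- Illustrative only (not part of GHC): how many elements of a given
-- width fit into one SIMD register.
lanesFor :: Int  -- register width in bits (128 for an SSE2 XMM register)
         -> Int  -- element width in bits (32 for Float, 64 for Double)
         -> Int
lanesFor regBits elemBits = regBits `div` elemBits

main :: IO ()
main = do
  print (lanesFor 128 32)  -- 4 floats per 128-bit register
  print (lanesFor 128 64)  -- 2 doubles per 128-bit register
```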

There is an example of modifications to detect SSE availability on the web; the primary body of the check is as follows (xmmintrin.h contains the SSE intrinsics):

AC_MSG_CHECKING(for SSE in current arch/CFLAGS)
AC_LINK_IFELSE([AC_LANG_PROGRAM([[
  #include <xmmintrin.h>
  __m128 testfunc(float *a, float *b) {
    return _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
  }
]], [[]])],
  [has_sse=yes], [has_sse=no])
AC_MSG_RESULT($has_sse)

if test "$has_sse" = yes; then
  AC_DEFINE([_USE_SSE], 1, [Enable SSE support])
  AC_DEFINE([_VECTOR_SIZE], 128, [Default Vector Size for Optimization])
fi

Once mk/config.h is modified with the above, includes/ghcautoconf.h is regenerated during the first stage of the GHC build process, and the _USE_SSE constant becomes available to the Cmm definitions (next section).

There are also more detailed explanations on the web of how to use cpuid to determine the supported SSE instruction set. cpuid may be more appropriate, but it is also much more complex.

Note that, since the overall goal is to let LLVM handle the actual assembly code that does the vectorization, it is only possible to support vectorization up to the version that LLVM supports.

Add new MachOps to Cmm code

It may make more sense to add the MachOps to Cmm before implementing the PrimOps (or at least before adding the code to CgPrimOp.hs). There is a useful Cmm Wiki Page available to aid in defining the new Cmm operations.

Modify compiler/cmm/CmmType.hs to add the required new vector types. Here is a basic outline of what needs to be done:

type Multiplicity = Int

data CmmType    -- The important one!
  = CmmType CmmCat Width

data CmmCat     -- "Category" (not exported)
   = GcPtrCat                 -- GC pointer
   | BitsCat                  -- Non-pointer
   | FloatCat                 -- Float
   | VBitsCat  Multiplicity   -- Non-pointer vector
   | VFloatCat Multiplicity   -- Float vector
   deriving( Eq )
        -- See Note [Signed vs unsigned] at the end
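To see how the Multiplicity interacts with Width, the sketch below uses simplified stand-ins for the types above (it is illustrative only, not the real compiler/cmm code) and confirms that a VFloatCat 4 at W32 fills exactly one 128-bit SSE register:

```haskell
-- Simplified stand-ins for the CmmType pieces; illustrative only.
data Width = W32 | W64 deriving (Eq, Show)

widthInBits :: Width -> Int
widthInBits W32 = 32
widthInBits W64 = 64

type Multiplicity = Int

data CmmCat
   = FloatCat                 -- scalar float
   | VFloatCat Multiplicity   -- float vector
   deriving (Eq)

-- Total width in bits of a category at a given element width.
catWidthInBits :: CmmCat -> Width -> Int
catWidthInBits FloatCat      w = widthInBits w
catWidthInBits (VFloatCat n) w = n * widthInBits w

main :: IO ()
main = print (catWidthInBits (VFloatCat 4) W32)  -- 128: one XMM register
```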

Modify compiler/cmm/CmmMachOp.hs to add the MachOps needed by the PrimOp modifications to support SIMD. Here is an example that adds a SIMD counterpart of the MO_F_Add MachOp:

  -- Integer SIMD arithmetic
  | MO_V_Add  Width Int
  | MO_V_Sub  Width Int
  | MO_V_Neg  Width Int         -- unary -
  | MO_V_Mul  Width Int
  | MO_V_Quot Width Int

  -- Floating point arithmetic
  | MO_VF_Add Width Int   -- MO_VF_Add W64 4   Add 4-vector of 64-bit floats

Some existing Cmm instructions may be reusable, but additional instructions will have to be added to account for the vectorization primitives. This will help keep the SIMD and non-SIMD code generation separate until the SIMD path is working.

Add new PrimOps

Adding the new PrimOps is relatively straightforward, but it adds a substantial number of lines of code. Most of this code is cut-and-paste-like, with minor type modifications.

Background: The following articles can aid in getting the work done:

The steps to be undertaken are:

  1. Modify ./compiler/prelude/primops.txt.pp (the following instructions may be changed a bit based on the direction)
    1. Add the following vector length constants as Int# types
      • intVecLen, intVec8Len, intVec16Len, intVec32Len, intVec64Len, wordVecLen, wordVec8Len, wordVec16Len, wordVec32Len, wordVec64Len, floatVecLen, and doubleVecLen
    2. Then add the following primtypes:
      • Int : IntVec#, Int8Vec#, Int16Vec#, Int32Vec#, Int64Vec#
      • Word : WordVec#, Word8Vec#, Word16Vec#, Word32Vec#, Word64Vec#
      • Float : FloatVec#
      • Double : DoubleVec#
    3. Add the following primops associated with the above primtypes. The ops come in groups associated with the types above, for example for IntVec#’s we get the following family for the “plus” operation alone:
      • plusInt8Vec# :: Int8Vec# -> Int8Vec# -> Int8Vec#
      • plusInt16Vec# :: Int16Vec# -> Int16Vec# -> Int16Vec#
      • plusInt32Vec# :: Int32Vec# -> Int32Vec# -> Int32Vec#
      • plusInt64Vec# :: Int64Vec# -> Int64Vec# -> Int64Vec#
    4. Repeat this for the following set of operations on IntVec#’s of various lengths; the signatures are summarized informally in parentheses after each operation:
      • plusIntVec# (signature :: IntVec# -> IntVec# -> IntVec#)
      • minusIntVec#,
      • timesIntVec#,
      • quotIntVec#,
      • remIntVec#
      • negateIntVec# (signature :: IntVec# -> IntVec#)
      • uncheckedIntVecShiftL# (signature :: IntVec# -> Int# -> IntVec#)
      • uncheckedIntVecShiftRA#,
      • uncheckedIntVecShiftRL#
    5. For the Word vectors we similarly introduce:
      • plusWordVec#, minusWordVec#, timesWordVec#, quotWordVec#, remWordVec#, negateWordVec#, andWordVec#, orWordVec#, xorWordVec#, notWordVec#, uncheckedWordVecShiftL#, uncheckedWordVecShiftRL#
    6. Float
      • plusFloatVec#, minusFloatVec#, timesFloatVec#, quotFloatVec#, remFloatVec#, negateFloatVec#, expFloatVec#, logFloatVec#, sqrtFloatVec# sinFloatVec#, cosFloatVec#, tanFloatVec#, asinFloatVec#, acosFloatVec#, atanFloatVec#, sinhFloatVec#, coshFloatVec#, tanhFloatVec#
    7. Double
      • plusDoubleVec#, minusDoubleVec#, timesDoubleVec#, quotDoubleVec#, remDoubleVec#, negateDoubleVec#, expDoubleVec#, logDoubleVec#, sqrtDoubleVec#, sinDoubleVec#, cosDoubleVec#, tanDoubleVec#, asinDoubleVec#, acosDoubleVec#, atanDoubleVec#, sinhDoubleVec#, coshDoubleVec#, tanhDoubleVec#
  2. Do NOT modify ./compiler/prelude/PrimOp.lhs (actually ./compiler/primop-data-decl.hs-incl) to add the new PrimOps (VIntQuotOp, etc.); this file is generated from the primops.txt.pp modifications
  3. Modify ./compiler/codeGen/CgPrimOp.hs, code for each primop (above) must be added to complete the primop addition.
    1. The code, basically, links the primops to the Cmm MachOps (that, in turn, are read by the code generators)
    2. It looks like some Cmm extensions will have to be added to ensure alignment and to pass vectorization information on to the back ends; the necessary MachOps will be determined after the first vertical stack is completed (using Double as the model). There may be some reuse of the existing MachOps. There is some discussion of these extensions (or similar ones) in the original Patch 3557 Documentation

Example of modification to ./compiler/prelude/primops.txt.pp to add one of the additional Float operations:

section "SIMDFloat"
	{Float operations that can take advantage of vectorization.}

primop   FloatVectorAddOp   "plusFloatVec#"      Dyadic
   FloatVec# -> FloatVec# -> FloatVec#

Here is an example of the update to ./compiler/codeGen/CgPrimOp.hs

-- SIMD Float Ops
translateOp FloatVectorAddOp	= Just (MO_VF_Add W32 4)

The above, after compilation, adds the following to the ./compiler/prelude/PrimOp.lhs file:

   | FloatVectorAddOp

Modify LLVM Code Generator

Take the MachOps in the Cmm definition and translate them correctly to the corresponding LLVM instructions. LLVM code generation is in the /compiler/llvmGen directory. The following will have to be modified (at a minimum):

  • /compiler/llvmGen/Llvm/Types.hs - add the LLVM vector types and operations that the Cmm MachOps will bridge to
  • /compiler/llvmGen/LlvmCodeGen/CodeGen.hs - the heart of the translation from MachOps to LLVM code; potentially significant changes will be needed here
  • Remaining /compiler/llvmGen/* - Supporting changes

Once the LLVM Code Generator is modified to support Double instructions, tests can be run to ensure the “bottom half” of the stack works.
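As a rough sketch of what the translation produces: in LLVM IR a 4-vector of doubles has the textual type <4 x double>, and a vector float add is just an ordinary fadd at that type. The code below is a toy stand-in (the data type and function names are invented, not GHC's) that renders the expected IR text:

```haskell
-- Simplified stand-ins; the real types live in compiler/llvmGen/Llvm/Types.hs
-- and compiler/cmm/CmmMachOp.hs.
data Width = W32 | W64 deriving (Eq, Show)

-- Render the LLVM textual type for an n-wide float vector, e.g. "<4 x float>".
llvmVecTy :: Width -> Int -> String
llvmVecTy w n = "<" ++ show n ++ " x " ++ scalar w ++ ">"
  where
    scalar W32 = "float"
    scalar W64 = "double"

-- A vector float add lowers to LLVM's ordinary 'fadd' on a vector type.
lowerVFAdd :: Width -> Int -> String -> String -> String
lowerVFAdd w n x y = "fadd " ++ llvmVecTy w n ++ " " ++ x ++ ", " ++ y

main :: IO ()
main = putStrLn (lowerVFAdd W64 4 "%x" "%y")
```

So an MO_VF_Add W64 4 applied to registers %x and %y should come out as `fadd <4 x double> %x, %y`.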

Modify Remaining Code Generators

Until a compiler pass is available that removes the new MachOps for the other code generators, or a switch is available that completely turns off the other code generators, the native (and probably C) code generators will have to be modified to accept the new MachOps and convert them to equivalent supported MachOps. Without these modifications, compilation will not complete successfully.

For x86 Native Code Generation, locate the ./compiler/nativeGen/X86/CodeGen.hs file and modify it appropriately. For the example above, simply adding a conversion from MO_VF_Add to the equivalent non-vector add is sufficient.

      MO_VF_Add w i | sse2      -> trivialFCode_sse2 w ADD  x y
                    | otherwise -> trivialFCode_x87    GADD x y   

Changes for the remaining new MachOps may be much larger.

Example: Demonstrate SIMD Operation

Once the Code Generator, PrimOps and Cmm are modified, we should be able to demonstrate performance scenarios. The simplest example to use for demonstrating performance is to time vector additions and multiplications using the new vectorized instruction set against a similar addition or multiplication using another PrimOp.

The following two simple programs should demonstrate the difference in performance. The program using the PrimOps should show a speedup approaching 2x (Doubles are 64 bits, and a 128-bit SSE register holds two of them).

Simple usage of the new instructions to add to vectors of doubles:

Question: How does one create a value of one of the new PrimOp types in order to test the vector add operations? This needs more thought: the code should basically create a vector and then call insertDoubleVec# repeatedly to populate it. Until the subsequent steps are done, this will have to be done "by hand", without additional operations defined. Here is the response from Manuel to expand on this: I am not quite sure what the best approach is. The intention in LLVM is clearly to populate vectors using the 'insertIntVec#' etc. functions. However, in LLVM you can just use an uninitialised register and insert elements into a vector successively. We could provide a vector "0" value in Haskell and insert into that. Any other ideas?

	let x = ????
	let y = ???
	plusDoubleVec# x y

Using simple lists to achieve the same operation:

	let x = [1,2,3,4]
	let y = [2,3,4,5]
	zipWith (+) x y

The above can be repeated with any of the common operations (multiplication, division, subtraction). This should be sufficient with large sized vectors / lists to illustrate speedup.
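The scalar baseline above can be run today as an ordinary Haskell program; this is the path the SIMD primops aim to beat by processing several elements per instruction:

```haskell
-- Scalar baseline: element-wise operations via zipWith.
main :: IO ()
main = do
  let x = [1, 2, 3, 4] :: [Double]
      y = [2, 3, 4, 5]
  print (zipWith (+) x y)  -- element-wise add
  print (zipWith (*) x y)  -- element-wise multiply
```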

(Note that over time and several generations of the integration, one would hope that the latter path would be “optimized” into SIMD instructions)

Modify Vector Libraries and Vector Compiler Optimization (Pragmas and such)

Once we've shown and quantified the speed-up for the lower portions of the compiler, the upper half of the stack should be optimized to take advantage of the vectorization code that was added to the PrimOps and Cmm. This is handled in two primary locations. The first is the compiler/vectorise code, which vectorizes modules after the desugar pass; it handles the VECTORISE pragmas as well as implicit vectorization of code. The other location that requires modification is the Vector library itself.

  1. Modify the Vector library /libraries/vector/Data to make use of PrimOps where possible and adjust VECTORISE pragmas if necessary
    • Modify the existing Vector code
    • We will likely also need vector versions of array read/write/indexing to process Haskell arrays with vector operations (this may need to go into compiler/vectorise)
    • Use the /libraries/vector/benchmarks to test updated code, look for
      • slowdowns - vector operations that cannot benefit from SIMD should not show slowdown
      • speedup - all performance tests that make use of maps for the common operators (+, -, *, etc..) should benefit from the SIMD speedup
  2. Modify the compiler/vectorise code to adjust pragmas and vectorization after the desugar pass. These modifications may not be needed on the first pass through the code; more evaluation is necessary.
    • /compiler/vectorise/Vectorise.hs
    • /compiler/vectorise/Vectorise/Env.hs
    • /compiler/vectorise/Vectorise/Type/Env.hs

Once the benchmarks show measurable, reproducible behavior, move onto the DPH libraries. Note that a closer inspection of the benchmarks in the /libraries/vector/benchmarks directory is necessary to ensure they reflect code that will be optimized with the use of SIMD instructions. If they are not appropriate, add code that demonstrates SIMD speed-up appropriately.
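One way to picture the Vector-library change is that an inner loop stops consuming one element at a time and instead consumes one register-width chunk at a time. The sketch below is a scalar simulation of that shape (the function is invented for illustration, not taken from the vector library): the bulk of the list is processed in chunks of four, each of which would become a single SIMD operation, with a scalar loop for the leftover tail.

```haskell
-- Scalar simulation of a 4-wide vectorised map: chunks of four elements
-- stand in for one hypothetical SIMD operation each; the tail is scalar.
mapChunked4 :: (a -> a) -> [a] -> [a]
mapChunked4 f = go
  where
    go (a:b:c:d:rest) = f a : f b : f c : f d : go rest  -- one "SIMD" step
    go tailElems      = map f tailElems                  -- scalar remainder

main :: IO ()
main = print (mapChunked4 (* 2) [1 .. 6 :: Int])
```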

Modify DPH Libraries

The DPH libraries have heavy dependencies on the previous vectorization modification step (modifying the Vector libraries and the compiler vector options and post-desugar vectorization steps). The DPH steps should not be undertaken without significant performance improvements illustrated in the previous steps.

  1. The primary changes for DPH are in /libraries/dph/dph-common/Data/Array/Parallel/Lifted/*
  2. VECTOR SCALAR is also heavily used in /libraries/dph/dph-common/Data/Array/Parallel/Prelude, these should be inspected for update as well (Double.hs, Float.hs, Int.hs, Word8.hs)
  3. Modify pragmas as necessary based on changes made above

Note to Self: Determine if the VECTORISE pragmas need adjustment or enhancement (based on previous steps)

Ensure Remaining Code Generators Function Properly

There are really two options for the remaining code generators:

  • Modify each code generator to understand the new Cmm instructions and restore them to non-vectorized instructions
  • Add a compiler pre-pass that replaces all "length = 1" vectors, and the operations on them, with the corresponding scalar type and operations

The latter makes sense in that it works for all code generators, including the LLVM code generator. Vectors of length 1 should not be put through SIMD instructions in the first place, as they incur substantial overhead for no return.

To make this work, a GHC compiler flag must be added that forces all vector lengths to 1 (this will be required in conjunction with any non-LLVM code generator). A user could also use this option to turn off SIMD optimization for LLVM.

  • Add the ghc compiler option: --vector-length=1
  • Modify compiler/vectorise to recognize the new option or add this compiler pass as appropriate
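The demotion pre-pass described above can be sketched as a simple rewrite over MachOps. The types below are simplified stand-ins (the real MachOp lives in compiler/cmm/CmmMachOp.hs, and the `demote` name is invented): a length-1 vector operation is just the scalar operation, so rewrite it and leave real vectors alone.

```haskell
-- Simplified stand-ins for the Cmm types; illustrative only.
data Width = W32 | W64 deriving (Eq, Show)

data MachOp
  = MO_F_Add  Width          -- scalar float add
  | MO_VF_Add Width Int      -- vector float add; the Int is the vector length
  deriving (Eq, Show)

-- The proposed pre-pass: length-1 vector ops become scalar ops, so the
-- native code generators never see a vector MachOp.
demote :: MachOp -> MachOp
demote (MO_VF_Add w 1) = MO_F_Add w
demote op              = op

main :: IO ()
main = mapM_ print [demote (MO_VF_Add W64 1), demote (MO_VF_Add W32 4)]
```

A full version would cover every vector MachOp constructor, but the pattern is the same for each.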

Reference Documentation

Reference Discussion Threads

From: Manuel Chakravarty
Q: Should the existing pure Vector libraries (/libraries/vector/Data/*) be modified to use the vectorized code as a first priority, wait until DPH (/libraries/dph/) is modified, or leave the Vector library as is?

A: The DPH libraries ('dph-*') are based on the 'vector' library — i.e., for DPH to use SIMD instruction, we must modify 'vector' first.

Q: How does one create one of the new Vector Types in a Haskell program (direct PrimOp?, for testing ... let x = ????)

A: I am not quite sure what the best approach is. The intention in LLVM is clearly to populate vectors using the 'insertIntVec#' etc functions. However, in LLVM you can just use an uninitialised register and insert elements into a vector successively. We could provide a vector "0" value in Haskell and insert into that. Any other ideas?

A: I just realised that we need vector versions of the array read/write/indexing operations as well, to process Haskell arrays with vector operations.

Q: One discussion point was that the "Vector Lengths" should be "Set to 1" for non-LLVM code generation; where does this happen? On my first survey of the code, it seems that the code generators are partitioned from the main body of code, implying that each of the code generators will have to be modified to account for the new Cmm MachOps and properly translate them to non-vectorized instructions.

A: Instead of doing the translation for every native code generator separately, we could have a pre-pass that replaces all length = 1 vectors and operations on them by the corresponding scalar type and operation.  Then, the actual native code generators wouldn't need to be changed.

A: The setting of the vector length to 1 needs to happen in dependence on the command line options passed to GHC — i.e., if a non-LLVM backend is selected.

Q: Can we re-use any of the existing MachOps when adding to Cmm?

A: I am not sure.