Changes between Version 34 and Version 35 of SIMDPlan


Timestamp:
Apr 16, 2012 12:50:09 PM
Author:
gmainland
Comment:

Note redirection.


    1  
    2 == Introduction == 
    3 This page documents the implementation stages and components for adding SIMD support to GHC.  The overriding requirements and architecture for the project are described on the [wiki:SIMD] page. 
    4  
    5 Based on that design, the high-level tasks that must be accomplished include the following: 
    6  1. Modify autoconf to determine SSE availability and vector size 
    7  1. Add new PrimOps to allow Haskell to make use of Vectors 
    8  1. Add new MachOps to Cmm to communicate use of Vectors 
    9  1. Modify the LLVM Code Generator to translate Cmm to LLVM vector instructions 
    10  1. Demonstrate use of PrimOps from Haskell program 
    11  1. Modify Vector Library 
    12  1. Modify DPH Libraries 
    13  1. Arrange that the other code generators continue to generate non-SIMD code 
    14  
    15 Introduction of SIMD support to GHC will occur in stages to demonstrate that the entire “vertical” stack is functional: 
    16  1. Introduce “Float” PrimOps (as necessary to run an example showing SIMD usage in LLVM) 
    17  1. Add appropriate Cmm support for the Float primtype / primop subset 
    18  1. Modify the LLVM Code Generator to support the Double vectorization 
    19  1. Demonstrate the PrimOps and do limited performance testing to ensure SIMD is functional 
    20  1. Modify Vector Libraries to make use of new PrimOps 
    21  1. Modify the DPH Libraries 
    22  1. Higher level examples using the above libraries 
    23  1. Build out the remaining PrimOps 
    24  1. Demonstrate full stack 
    25  1. Test remaining code generators 
    26  
    27 == Current Open Questions == 
    28 These clearly won't be all of the questions I have; there is a substantial amount of work that goes through the entire GHC compiler stack before reaching the LLVM instructions. 
    29  
    30 == Location of SIMD branch == 
    31  
    32 The SIMD branch of GHC is named, appropriately, `simd`. 
    33  
    34 == Adding a primtype / primop Outline == 
    35  
    36 When going through this outline, it is helpful to have the [http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/HscMain Compiler Pipeline] explanation at hand, as well as the explanation of the source code tree (including nested explanations of important directories and files) [http://hackage.haskell.org/trac/ghc/wiki/Commentary/SourceTree here]. 
    37  
    38 Addition of Types for use by Haskell 
    39  - ./compiler/types/TyCon.lhs 
    40   - TyCons represent type constructors, arising from @data declarations, @type synonyms, @newtypes and @class declarations.  We will need to modify this to add a proper type constructor for the new vector types. 
    41   - Prelude uses this type constructor in ./compiler/prelude/TysPrim.lhs 
    42  
    43 Modifications to the compiler to add primtype / primop to Prelude 
    44  - ./compiler/prelude/PrelNames.lhs 
    45  - ./compiler/prelude/TysPrim.lhs 
    46  - ./compiler/prelude/primops.txt.pp 
    47    - Addition of a type here (primtype) to operate on 
    48    - Defines the primops that are associated with the types that are defined 
    49    - For primops defined here that are inline, modify compiler/codeGen/CgPrimOp.hs 
    50  
    51 Modifications to add compiler support for the new types.  Each file below notes the portion of the compiler that is being modified.  The binaries generated out of this branch are called from the ghc/ binaries. 
    52  - ./compiler/codeGen/CgCallConv.hs 
    53  - ./compiler/codeGen/CgPrimOp.hs 
    54  - ./compiler/codeGen/CgUtils.hs 
    55  - ./compiler/codeGen/SMRep.lhs 
    56  - ./compiler/codeGen/StgCmmLayout.hs 
    57  - ./compiler/codeGen/StgCmmPrim.hs 
    58  
    59 More modifications to the Cmm portion of the compiler chain. 
    60  - ./compiler/cmm contains the code that inputs STG and outputs Cmm 
    61  - ./compiler/cmm/CmmExpr.hs 
    62  - ./compiler/cmm/CmmType.hs 
    63  - ./compiler/cmm/CmmUtils.hs 
    64  - ./compiler/cmm/OldCmm.hs 
    65  - ./compiler/cmm/StackColor.hs 
    66  
    67 Modifications to the LLVM code generator 
    68  - Generating the human-readable LLVM code (.ll) occurs in compiler/llvmGen.  It receives Cmm and translates it, via human-readable LLVM code, into LLVM bitcode.  A "simple" use of LLVM vector instructions using floats is shown on the [wiki:SIMDVectorExampleInLLVM SIMD Vector Example In LLVM] page. 
    69   - compiler/llvmGen/Llvm/AbsSyn.hs - no changes, describes the abstract structure of an LLVM program 
    70   - compiler/llvmGen/Llvm/PpLlvm.hs - no changes; this is the module that pretty prints the LLVM I(ntermediate) R(epresentation).  It uses the generic constructs described in Llvm.AbsSyn (where no changes are necessary) and Llvm.Types (where changes are needed). 
    71   - compiler/llvmGen/Llvm/Types.hs - Describes the LLVM basic types and variables.  Changes are necessary to add vectors: an LMVector Int LlvmType constructor will be added (eventually constraints on the width may also be needed; see the sketch after this list).  "fadd" is already present, as are "add" and the other operations.  Most likely, additional operations will have to be added, since the code generated for, say, a float vector is slightly different: a series of instructions has to be executed for each "add", even though the basic "fadd" structure remains the same. 
    72   - compiler/llvmGen/LlvmCodeGen/Base.hs, no changes appear necessary here.  These seem to be primarily various Label and Environment handling items ... primarily structural. 
    73   - compiler/llvmGen/LlvmCodeGen/CodeGen.hs - changes will most likely be necessary here for various operators.  This is the primary location where Cmm is converted to LLVM operators; for example, MO_F_Add with a parameter is converted to LM_MO_FAdd here: (MO_F_Add _ -> genBinMach LM_MO_FAdd).  In the event that additional operators are added (MO_VF_Add, for example), this will definitely have to be modified. 
    74   - compiler/llvmGen/LlvmCodeGen/Data.hs - converts static data from Cmm (CmmData) to LLVM structures; no changes are likely necessary here 
    75   - compiler/llvmGen/LlvmCodeGen/Ppr.hs - the pretty print helpers for the LLVM code generator; dependent on other files 
    76   - compiler/llvmGen/LlvmCodeGen/Regs.hs - no additional registers appear necessary, so no changes 
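
To make the LMVector addition concrete, here is a small standalone sketch (not the real Llvm.Types module) showing the textual form an LMVector-style constructor should produce; LLVM spells a 4-vector of floats as <4 x float>: 
{{{ 
-- Standalone illustration only: a cut-down LlvmType with the proposed
-- LMVector constructor, showing the "<n x type>" syntax LLVM expects.
data LlvmType
  = LMFloat
  | LMVector Int LlvmType

instance Show LlvmType where
  show LMFloat         = "float"
  show (LMVector n tp) = "<" ++ show n ++ " x " ++ show tp ++ ">"

main :: IO ()
main = putStrLn (show (LMVector 4 LMFloat))   -- prints: <4 x float>
}}} 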
    77  
    78 Modifications to the STG Code Generation 
    79  - ./includes/stg/MachRegs.h 
    80  - ./includes/stg/Regs.h 
    81  - ./includes/stg/Types.h 
    82  
    83  
    84  
    85 Other constants and code-generation utilities that will also need attention: 
    86  - ./includes/HaskellConstants.hs 
    87  - ./includes/mkDerivedConstants.c 
    88  - ./includes/rts/storage/FunTypes.h 
    89  - ./utils/genapply/GenApply.hs 
    90  - ./utils/genprimopcode/Main.hs 
    93  
    94 == Modify autoconf == 
    95 Determining whether a particular hardware architecture has SIMD instructions, which version of the instructions is available (SSE, SSE2, SSE3, SSE4, or an iteration of one of those), and consequently the size of the vectors that are supported, happens during the configure step on the build architecture.  This is the same time that the sizes of Ints, alignment constants, and other pieces that are critical to GHC are calculated. 
    96  
    97 Working backwards from the results to the location where the changes are introduced: the current alignment and primitive sizes are available in ./includes/ghcautoconf.h; here is a sample: 
    98 {{{ 
    99 ... 
    100 /* The size of `char', as computed by sizeof. */ 
    101 #define SIZEOF_CHAR 1 
    102  
    103 /* The size of `double', as computed by sizeof. */ 
    104 #define SIZEOF_DOUBLE 8 
    105 ... 
    106 }}} 
    107  
    108 These are constructed from mk/config.h*, which are generated by configure.ac and autoheader.  The configure.ac (or a related file) should be able to be modified to determine whether SSE is available and, consequently, the vector size that can be operated on and (later) how many pieces of data can be operated on in parallel (determined by the operation).  MMX had 64-bit registers, while SSE and all later SSE versions (2 and above) have 128-bit XMM registers.  This implies that four 32-bit values can be operated on in a single instruction. 
    109  
    110 There is an example of configure.ac modifications to detect SSE availability on the web; the primary body of the check is as follows (xmmintrin.h declares the SSE intrinsics): 
    111 {{{ 
    112 AC_MSG_CHECKING(for SSE in current arch/CFLAGS) 
    113 AC_LINK_IFELSE([ 
    114 AC_LANG_PROGRAM([[ 
    115 #include <xmmintrin.h> 
    116 __m128 testfunc(float *a, float *b) { 
    117   return _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); 
    118 } 
    119 ]])], 
    120 [ 
    121 has_sse=yes 
    122 ], 
    123 [ 
    124 has_sse=no 
    125 ] 
    126 ) 
    127 AC_MSG_RESULT($has_sse) 
    128  
    129 if test "$has_sse" = yes; then 
    130   AC_DEFINE([_USE_SSE], 1, [Enable SSE support]) 
    131   AC_DEFINE([_VECTOR_SIZE], 128, [Default Vector Size for Optimization]) 
    132 fi 
    133 }}} 
    134  
    135 Once mk/config.h is modified with the above, includes/ghcautoconf.h is regenerated during the first stage of the GHC build process, and the _USE_SSE constant becomes available to the Cmm definitions (next section). 
    136  
    137 There are also more detailed explanations on the web of how to use cpuid to determine the supported SSE instruction set.  cpuid may be more appropriate, but it is also much more complex.  Details for using cpuid are available at [http://software.intel.com/en-us/articles/using-cpuid-to-detect-the-presence-of-sse-41-and-sse-42-instruction-sets/]. 
    138  
    139 It should be noted that, since the overall goal is to let LLVM handle the actual assembly code that does vectorization, it is only possible to support vectorization up to the version that LLVM supports. 
    140  
    141 == Add new MachOps to Cmm code == 
    142 It may make more sense to add the MachOps to Cmm prior to implementing the PrimOps (or at least before adding the code to the CgPrimOp.hs file).  There is a useful [http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType#AdditionsinCmm Cmm Wiki Page] available to aid in the definition of the new Cmm operations. 
    143  
    144 Modify compiler/cmm/CmmType.hs to add new required vector types and such, here is a basic outline of what needs to be done: 
    145 {{{ 
    146 data CmmType    -- The important one! 
    147   = CmmType CmmCat Width 
    148   
    149 type Multiplicity = Int 
    150   
    151 data CmmCat     -- "Category" (not exported) 
    152    = GcPtrCat   -- GC pointer 
    153    | BitsCat   -- Non-pointer 
    154    | FloatCat   -- Float 
    155    | VBitsCat  Multiplicity   -- Vector of non-pointers 
    156    | VFloatCat Multiplicity   -- Vector of floats 
    157    deriving( Eq ) 
    158         -- See Note [Signed vs unsigned] at the end 
    159 }}} 
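
The exact constructors are still open, but a self-contained sketch (standalone, not the real compiler/cmm/CmmType.hs) may make the intent clearer; it adds smart constructors for vector types in the style of the existing cmmBits / cmmFloat helpers: 
{{{ 
-- Standalone sketch only; the names mirror the proposed CmmType additions
-- rather than the actual compiler sources.
data Width = W8 | W16 | W32 | W64
  deriving (Eq, Show)

type Multiplicity = Int

data CmmCat
  = GcPtrCat                 -- GC pointer
  | BitsCat                  -- Non-pointer
  | FloatCat                 -- Float
  | VBitsCat  Multiplicity   -- Vector of non-pointers
  | VFloatCat Multiplicity   -- Vector of floats
  deriving Eq

data CmmType = CmmType CmmCat Width

-- Smart constructors in the style of the existing cmmBits / cmmFloat
cmmVecBits :: Multiplicity -> Width -> CmmType
cmmVecBits n w = CmmType (VBitsCat n) w

cmmVecFloat :: Multiplicity -> Width -> CmmType
cmmVecFloat n w = CmmType (VFloatCat n) w

-- For example, a 4-vector of 32-bit floats:
vec4f32 :: CmmType
vec4f32 = cmmVecFloat 4 W32
}}} 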
    160  
    161 Modify compiler/cmm/CmmMachOp.hs; this adds the necessary MachOps for use from the PrimOps modifications to support SIMD.  Here is an example of adding a SIMD version of the MO_F_Add MachOp: 
    162 {{{ 
    163   -- Integer SIMD arithmetic 
    164   | MO_V_Add  Width Int 
    165   | MO_V_Sub  Width Int 
    166   | MO_V_Neg  Width Int         -- unary - 
    167   | MO_V_Mul  Width Int 
    168   | MO_V_Quot Width Int 
    169  
    170   -- Floating point arithmetic 
    171   | MO_VF_Add Width Int   -- MO_VF_Add W64 4   Add 4-vector of 64-bit floats 
    172   ... 
    173 }}} 
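
Continuing the standalone sketch from the CmmType block above (again, not the actual compiler sources), the result type of these vector MachOps would be computed roughly as follows; in the real code this logic would sit alongside machOpResultType: 
{{{ 
-- Hypothetical helper, extending the standalone CmmType sketch above;
-- e.g. MO_VF_Add W64 4 yields a 4-vector of 64-bit floats.
data MachOp
  = MO_VF_Add Width Multiplicity
  | MO_VF_Sub Width Multiplicity
  | MO_VF_Mul Width Multiplicity
  -- ... remaining scalar and vector MachOps elided

vecMachOpResultType :: MachOp -> CmmType
vecMachOpResultType mop = case mop of
  MO_VF_Add w n -> cmmVecFloat n w
  MO_VF_Sub w n -> cmmVecFloat n w
  MO_VF_Mul w n -> cmmVecFloat n w
}}} 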
    174  
    175 Some existing Cmm instructions may be reusable, but additional instructions will have to be added to account for vectorization primitives.  This will help keep the SIMD and non-SIMD code generation separate for the time being, until everything is working. 
    176  
    177 == Add new PrimOps == 
    178 Adding the new PrimOps is relatively straightforward, but a substantial number of lines of code will be added to achieve it.  Most of this code is essentially cut-and-paste with minor type modifications. 
    179  
    180 Background: The following articles can aid in getting the work done: 
    181  * [http://hackage.haskell.org/trac/ghc/wiki/Commentary/PrimOps Primitive Operations (PrimOps)] 
    182  * [http://hackage.haskell.org/trac/ghc/wiki/AddingNewPrimitiveOperations Adding new primitive operations to GHC Haskell] 
    183  * Some guidelines for [wiki:primtype addition], at least until I find something on the Wiki 
    184  
    185 The steps to be undertaken are: 
    186  1. Modify ./compiler/prelude/primops.txt.pp (the following instructions may be changed a bit based on the direction) 
    187   a. Add the following vector length constants as Int# types 
    188    * intVecLen, intVec8Len, intVec16Len, intVec32Len, intVec64Len, wordVecLen, wordVec8Len, wordVec16Len, wordVec32Len, wordVec64Len, floatVecLen, and doubleVecLen,  
    189   a. Then add the following primtypes: 
    190    * Int : IntVec#, Int8Vec#, Int16Vec#, Int32Vec#, Int64Vec# 
    191    * Word : WordVec#, Word8Vec#, Word16Vec#, Word32Vec#, Word64Vec# 
    192    * Float : FloatVec# 
    193    * Double : DoubleVec# 
    194   a. Add the following primops associated with the above primtypes.   The ops come in groups associated with the types above, for example for IntVec#’s we get the following family for the “plus” operation alone: 
    195    * plusInt8Vec# :: Int8Vec# -> Int8Vec# -> Int8Vec# 
    196    * plusInt16Vec# :: Int16Vec# -> Int16Vec# -> Int16Vec# 
    197    * plusInt32Vec# :: Int32Vec# -> Int32Vec# -> Int32Vec# 
    198    * plusInt64Vec# :: Int64Vec# -> Int64Vec# -> Int64Vec# 
    199   a. Repeat this for the following set of operations on IntVec#’s of various lengths; note that the signatures are summarized informally in parentheses behind the operation: 
    200    * plusIntVec# (signature :: IntVec# -> IntVec# -> IntVec#) 
    201    * minusIntVec#,  
    202    * timesIntVec#,  
    203    * quotIntVec#,  
    204    * remIntVec# 
    205    * negateIntVec# (signature :: IntVec# -> IntVec#) 
    206    * uncheckedIntVecShiftL# (signature :: IntVec# -> Int# -> IntVec#) 
    207    * uncheckedIntVecShiftRA#,  
    208    * uncheckedIntVecShiftRL#  
    209   a. For the Word vectors we similarly introduce: 
    210    * plusWordVec#, minusWordVec#, timesWordVec#, quotWordVec#, remWordVec#, negateWordVec#, andWordVec#, orWordVec#, xorWordVec#, notWordVec#, uncheckedWordVecShiftL#, uncheckedWordVecShiftRL# 
    211   a. Float 
    212    * plusFloatVec#, minusFloatVec#, timesFloatVec#, quotFloatVec#, remFloatVec#, negateFloatVec#, expFloatVec#, logFloatVec#, sqrtFloatVec#, sinFloatVec#, cosFloatVec#, tanFloatVec#, asinFloatVec#, acosFloatVec#, atanFloatVec#, sinhFloatVec#, coshFloatVec#, tanhFloatVec# 
    213   a. Double 
    214    * plusDoubleVec#, minusDoubleVec#, timesDoubleVec#, quotDoubleVec#, remDoubleVec#, negateDoubleVec#, expDoubleVec#, logDoubleVec#, sqrtDoubleVec#, sinDoubleVec#, cosDoubleVec#, tanDoubleVec#, asinDoubleVec#, acosDoubleVec#, atanDoubleVec#, sinhDoubleVec#, coshDoubleVec#, tanhDoubleVec# 
    215  1. Do NOT modify ./compiler/prelude/PrimOp.lhs (actually, ./compiler/primop-data-decl.hs-incl) to add the new PrimOps (VIntQuotOp, etc...); this will be generated from the primops.txt.pp modifications 
    216  1. Modify ./compiler/codeGen/CgPrimOp.hs, code for each primop (above) must be added to complete the primop addition. 
    217   a. The code, basically, links the primops to the Cmm MachOps (that, in turn, are read by the code generators) 
    218   a. It looks like some Cmm extensions will have to be added to ensure alignment and to pass vectorization information on to the back ends; the necessary MachOps will be determined after the first vertical stack is completed (using "Double" as a model).  There may be some reuse of existing MachOps.  There is some discussion of these extensions (or similar ones) in the original [http://hackage.haskell.org/trac/ghc/ticket/3557 Patch 3557 Documentation]. 
    219  
    220 Example of modification to ./compiler/prelude/primops.txt.pp to add one of the additional Float operations: 
    221 {{{ 
    222 ------------------------------------------------------------------------ 
    223 section "SIMDFloat" 
    224         {Float operations that can take advantage of vectorization.} 
    225 ------------------------------------------------------------------------ 
    226 primtype FloatVec# a 
    227  
    228 primop   FloatVectorAddOp   "plusFloatVec#"      Dyadic             
    229    FloatVec# -> FloatVec# -> FloatVec# 
    230    with can_fail = True 
    231  
    232 primop   FloatVectorMultOp  "timesFloatVec#"     Dyadic             
    233    FloatVec# -> FloatVec# -> FloatVec# 
    234    with can_fail = True 
    235 }}} 
    236  
    237 Here is an example of the update to ./compiler/codeGen/CgPrimOp.hs that generates Machine Ops based on the new PrimOps. 
    238 {{{ 
    239 -- SIMD Float Ops 
    240 translateOp FloatVectorAddOp    = Just (MO_VF_Add W32 4) 
    241 translateOp FloatVectorMultOp   = Just (MO_VF_Mult W32 4) 
    242 }}} 
    243  
    244 The new primtype also needs to be woven through the code generation path, but the process is slightly different than for primops.  To complete the primtype definition, the following files need to be modified. 
    245  
    246 ./utils/genprimopcode/Main.hs needs to have an association added between the FloatVec# type added above and a Type that is used for representation elsewhere: 
    247 {{{ 
    248 ppType (TyApp "FloatVec#"   []) = "floatVecPrimTy" 
    249 }}} 
    250  
    251 Adding floatVecPrimTy means several additional relationships and constructs need to be created as well. 
    252  
    253 ./compiler/prelude/TysPrim.lhs wires the new type into the Prelude: 
    254 {{{ 
    255 module TysPrim( 
    256 .... 
    257 -- Added 
    258         floatVecPrimTyCon,      floatVecPrimTy, 
    259 ... 
    260  
    261 primTyCons  
    262   = [ addrPrimTyCon 
    263 ... 
    264 -- Added 
    265     , floatVecPrimTyCon 
    266  
    267 ... 
    268 floatVecPrimTyConName             = mkPrimTc (fsLit "FloatVec#") floatVecPrimTyConKey floatVecPrimTyCon 
    269  
    270 ... 
    271 -- Add a new subsection for primitive types (others will be added here as well) 
    272 %************************************************************************ 
    273 %*                                                                      * 
    274 \subsection[TysPrim-SIMDvectors]{The primitive SIMD vector types} 
    275 %*                                                                      * 
    276 %************************************************************************ 
    277  
    278 \begin{code} 
    279 floatVecPrimTyCon :: TyCon 
    280 floatVecPrimTyCon                 = pcPrimTyCon  floatVecPrimTyConName         1 PtrRep 
    281  
    282 floatVecPrimTy :: Type 
    283 floatVecPrimTy              = mkTyConTy floatVecPrimTyCon 
    284 \end{code} 
    285 }}} 
    286  
    287 ./compiler/prelude/PrelNames.lhs gives keys for each of the primtypes 
    288 {{{ 
    289 \subsection[Uniques-prelude-TyCons] ... 
    290 ... 
    291 floatVecPrimTyConKey    = mkPreludeTyConUnique 38 
    292 }}} 
    293  
    294 ./compiler/ghci/ByteCodeGen.lhs (specific changes still to be determined) 
    295 {{{ 
    296 }}} 
    297  
    298 ./compiler/ghci/RtClosureInspect.hs 
    299 {{{ 
    300 repPrim t = rep where  
    301 ... 
    302 -- Added 
    303     | t == floatVecPrimTyCon  = "<floatvec>" 
    304 }}} 
    305  
    306 After the primop code is regenerated during the build, the above adds the following to ./compiler/primop-data-decl.hs-incl (which is included by ./compiler/prelude/PrimOp.lhs): 
    307 {{{ 
    308    | FloatVectorAddOp 
    309    | FloatVectorMultOp 
    310 }}} 
    311  
    312 == Modify LLVM Code Generator == 
    313 Take the MachOps in the Cmm definition and translate correctly to the corresponding LLVM instructions.  LLVM code generation is in the /compiler/llvmGen directory.  The following will have to be modified (at a minimum): 
    314  * /compiler/llvmGen/Llvm/Types.hs - add the MachOps from Cmm and how they bridge to the LLVM vector operations 
    315  * /compiler/llvmGen/LlvmCodeGen/CodeGen.hs - This is the heart of the translation from MachOps to LLVM code.  Possibly significant changes will be needed here. 
    316  * Remaining /compiler/llvmGen/* - Supporting changes 
    317  
    318 At this point, CodeGen is not modified, though it will likely have to be eventually.  Types.hs has a new LMVector type added to support vectors.  As the operations on vectors use the same instruction names as the scalar LLVM types (float vectors use fadd, etc.), I have not made changes to the operators yet (though I'm guessing I will have to eventually; see the sketch after the diff below).  Here is the diff of changes to Types.hs: 
    319 {{{ 
    320 [paul.monday@pg155-n19 Llvm]$ git diff Types.hs 
    321 diff --git a/compiler/llvmGen/Llvm/Types.hs b/compiler/llvmGen/Llvm/Types.hs 
    322 index 1013426..1133d37 100644 
    323 --- a/compiler/llvmGen/Llvm/Types.hs 
    324 +++ b/compiler/llvmGen/Llvm/Types.hs 
    325 @@ -38,6 +38,7 @@ data LlvmType 
    326    | LMFloat128           -- ^ 128 bit floating point 
    327    | LMPointer LlvmType   -- ^ A pointer to a 'LlvmType' 
    328    | LMArray Int LlvmType -- ^ An array of 'LlvmType' 
    329 +  | LMVector Int LlvmType -- ^ A vector of 'LlvmType' 
    330    | LMLabel              -- ^ A 'LlvmVar' can represent a label (address) 
    331    | LMVoid               -- ^ Void type 
    332    | LMStruct [LlvmType]  -- ^ Structure type 
    333 @@ -55,6 +56,7 @@ instance Show LlvmType where 
    334    show (LMFloat128    ) = "fp128" 
    335    show (LMPointer x   ) = show x ++ "*" 
    336    show (LMArray nr tp ) = "[" ++ show nr ++ " x " ++ show tp ++ "]" 
    337 +  show (LMVector nr tp ) = "<" ++ show nr ++ " x " ++ show tp ++ ">"   
    338    show (LMLabel       ) = "label" 
    339    show (LMVoid        ) = "void" 
    340    show (LMStruct tys  ) = "<{" ++ (commaCat tys) ++ "}>" 
    341 @@ -295,6 +297,7 @@ llvmWidthInBits (LMFloat128)    = 128 
    342  -- it points to. We will go with the former for now. 
    343  llvmWidthInBits (LMPointer _)   = llvmWidthInBits llvmWord 
    344  llvmWidthInBits (LMArray _ _)   = llvmWidthInBits llvmWord 
    345 +llvmWidthInBits (LMVector _ _)   = llvmWidthInBits llvmWord 
    346  llvmWidthInBits LMLabel         = 0 
    347  llvmWidthInBits LMVoid          = 0 
    348  llvmWidthInBits (LMStruct tys)  = sum $ map llvmWidthInBits tys 
    349 }}} 
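
When CodeGen.hs is eventually touched, the change for the simple binary cases is expected to be small.  Here is a sketch (untested, and assuming the vector MachOps proposed above plus the existing LM_MO_FAdd / LM_MO_FMul constructors) of the extra cases in the MachOp dispatch: 
{{{ 
      -- Sketch of additional cases in the genMachOp dispatch of
      -- LlvmCodeGen/CodeGen.hs; LLVM's fadd/fmul apply element-wise to
      -- vector operands, so the scalar translation scheme carries over.
      MO_VF_Add  _ _ -> genBinMach LM_MO_FAdd
      MO_VF_Mult _ _ -> genBinMach LM_MO_FMul
}}} 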
    350  
    351 == Modify Native Code Generator == 
    352  
    353 Unfortunately, the native code generator will also have to be modified.  Building GHC depends on a 6.x version of GHC, from before native LLVM code generation was built into HEAD (so simply modifying the mk/build.mk file to use -fllvm does not work). 
    354  
    355 For x86 Native Code Generation, locate the ./compiler/nativeGen/X86/CodeGen.hs file and modify it appropriately.  For the example above, simply adding a conversion from MO_VF_Add to the equivalent non-vector add is sufficient. 
    356  
    357 {{{ 
    358           -- SIMD Vector Instruction in Native Revert to Simple Instructions 
    359       MO_VF_Add w i | sse2      -> trivialFCode_sse2 w ADD  x y 
    360                     | otherwise -> trivialFCode_x87    GADD x y 
    361       MO_VF_Mul w i | sse2      -> trivialFCode_sse2 w MUL  x y 
    362                     | otherwise -> trivialFCode_x87    GMUL x y 
    363 }}} 
    364  
    365 Changes for the remaining new MachOps may be much larger. 
    366  
    367 == Example: Demonstrate SIMD Operation == 
    368 Once the Code Generator, PrimOps and Cmm are modified, we should be able to demonstrate performance scenarios.  The simplest example to use for demonstrating performance is to time vector additions and multiplications using the new vectorized instruction set against a similar addition or multiplication using another PrimOp. 
    369  
    370 The following two simple programs should demonstrate the difference in performance.  The program using the PrimOps ''should'' improve performance by up to 2x (Doubles are 64-bit, and a 128-bit SSE register holds two of them). 
    371  
    372 Simple usage of the new instructions to multiply two vectors of floats: 
    373  
    374 '''Question:'''  How does one create one of the new PrimOp types to test, prior to testing the vector add operations?  This is going to have to be looked at a little: the code should basically create a vector and then call insertDoubleVec# repeatedly to populate it.  Without the subsequent steps done, this will have to be done by hand, without additional operations defined.  Here is the response from Manuel to expand on this:  I am not quite sure what the best approach is. The intention in LLVM is clearly to populate vectors using the 'insertIntVec#' etc functions. However, in LLVM you can just use an uninitialised register and insert elements into a vector successively. We could provide a vector "0" value in Haskell and insert into that. Any other ideas? 
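
One possibility, following Manuel's suggestion, is to provide a "zero" vector value and insert into it element by element.  The sketch below uses entirely hypothetical primop names (zeroFloatVec#, insertFloatVec#) that do not exist yet: 
{{{ 
-- Hypothetical construction helper: build a FloatVec# by successive
-- insertion into a "zero" vector (all primop names are placeholders).
mkFloatVec4 :: Float# -> Float# -> Float# -> Float# -> FloatVec#
mkFloatVec4 a b c d =
  insertFloatVec#
    (insertFloatVec#
      (insertFloatVec#
        (insertFloatVec# zeroFloatVec# a 0#) b 1#) c 2#) d 3#
}}} 
The full example program follows. 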
    375 {{{ 
    376 {-# LANGUAGE MagicHash #-} 
    377 import GHC.Prim 
    378 import GHC.Exts 
    379  
    380 getPrimFloat :: Float -> Float# 
    381 getPrimFloat f = case f of { F# f -> f } 
    382  
    383 main = do 
    384     numberString <- getLine 
    385     let num = read numberString 
    386     let value = getPrimFloat num 
    387     numberString2 <- getLine 
    388     let num2 = read numberString2 
    389     let value2 = getPrimFloat num2 
    390      
    391     let packedVector1 = pack4FloatOp# value value value value 
    392     let packedVector2 = pack4FloatOp# value2 value2 value2 value2 
    393  
    394     let resultVector = multFloatVec4# packedVector1 packedVector2 
    395      
    396     let result = extractFloatVec# resultVector 1# 
    397  
    398     let resultFloat = F# result 
    399     print resultFloat 
    400 }}} 
    401          
    402 Using simple lists to achieve the same operation (note that the below is Integer only; I have to modify it to read Floats off the command line, otherwise a parse error occurs after the reads; an adjusted version follows the listing). 
    403  
    404 {{{ 
    405 main = do 
    406     numberString <- getLine 
    407     let value = read numberString 
    408     numberString2 <- getLine 
    409     let value2 = read numberString2 
    410  
    411     let list1 = [value,value,value,value] 
    412     let list2 = [value2,value2,value2,value2] 
    413  
    414     let result = zipWith (*) list1 list2  
    415  
    416     print (result !! 1) 
    417 }}} 
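
For reference, here is a minimally adjusted version that forces the reads to Float, so that the list program computes on the same element type as the primop program (a sketch, not yet benchmarked): 
{{{ 
main :: IO ()
main = do
    numberString <- getLine
    let value = read numberString :: Float    -- force Float (avoids Integer defaulting)
    numberString2 <- getLine
    let value2 = read numberString2 :: Float

    let list1 = [value, value, value, value]
    let list2 = [value2, value2, value2, value2]

    let result = zipWith (*) list1 list2

    print (result !! 1)
}}} 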
    418  
    419 The above can be repeated with any of the common operations (multiplication, division, subtraction).  This should be sufficient with large sized vectors / lists to illustrate speedup. 
    420  
    421 (Note that over time and several generations of the integration, one would hope that the latter path would be “optimized” into SIMD instructions) 
    422  
    423 == Modify Vector Libraries and Vector Compiler Optimization (Pragmas and such) == 
    424 Once we've shown there is speed-up in the lower portions of the compiler and have quantified it, the upper half of the stack should be optimized to take advantage of the vectorization code that was added to the PrimOps and Cmm.  There are two primary locations where this is handled: the compiler/vectorise code, which vectorizes modules after the desugaring pass (this location handles the VECTORISE pragmas as well as implicit vectorization of code), and the Vector library itself. 
    425  
    426  1. Modify the Vector library /libraries/vector/Data to make use of PrimOps where possible and adjust VECTORISE pragmas if necessary 
    427   * Modify the existing Vector code 
    428   * We will likely also need vector versions of array read/write/indexing to process Haskell arrays with vector operations (this may need to go into compiler/vectorise) 
    429   * Use the /libraries/vector/benchmarks to test updated code, looking for: 
    430     * slowdowns - vector operations that cannot benefit from SIMD should not show slowdown 
    431     * speedup - all performance tests that make use of maps for the common operators (+, -, *, etc..) should benefit from the SIMD speedup 
    432  1. Modify the compiler/vectorise code to adjust pragmas and the post-desugar vectorization pass.  These modifications may not need to be made on the first pass through the code; more evaluation is necessary. 
    433   * /compiler/vectorise/Vectorise.hs 
    434   * /compiler/vectorise/Vectorise/Env.hs 
    435   * /compiler/vectorise/Vectorise/Type/Env.hs 
    436  
    437 Once the benchmarks show measurable, reproducible behavior, move on to the DPH libraries.  Note that a closer inspection of the benchmarks in the /libraries/vector/benchmarks directory is necessary to ensure they reflect code that will be optimized by SIMD instructions.  If they are not suitable, add benchmarks that demonstrate SIMD speed-up (a rough sketch follows). 
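
If a suitable benchmark is missing, something along the following lines could serve as a starting point.  This is only a rough sketch (it assumes the vector and time packages and is not one of the existing /libraries/vector/benchmarks): an elementwise multiply over a large unboxed vector, the kind of loop that should benefit once the SIMD primops are wired into the library: 
{{{ 
import qualified Data.Vector.Unboxed as U
import Data.Time.Clock (getCurrentTime, diffUTCTime)

main :: IO ()
main = do
  let n  = 10 * 1000 * 1000 :: Int
      xs = U.generate n fromIntegral :: U.Vector Float
      ys = U.replicate n 1.5         :: U.Vector Float
  t0 <- getCurrentTime
  let zs = U.zipWith (*) xs ys
  print (U.sum zs)                  -- force the result before stopping the clock
  t1 <- getCurrentTime
  print (diffUTCTime t1 t0)
}}} 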
    438  
    439 == Modify DPH Libraries == 
    440 The DPH libraries have heavy dependencies on the previous vectorization modification step (modifying the Vector libraries and the compiler vector options and post-desugar vectorization steps).  The DPH steps should not be undertaken without significant performance improvements illustrated in the previous steps. 
    441  
    442  1. The primary changes for DPH are in /libraries/dph/dph-common/Data/Array/Parallel/Lifted/* 
    443  1. VECTOR SCALAR is also heavily used in /libraries/dph/dph-common/Data/Array/Parallel/Prelude; these files should be inspected for update as well (Double.hs, Float.hs, Int.hs, Word8.hs) 
    444   a. Modify pragmas as necessary based on changes made above 
    445  
    446 '''Note to Self:''' Determine if the VECTORISE pragmas need adjustment or enhancement (based on previous steps) 
    447  
    448 == Ensure Remaining Code Generators Function Properly == 
    449 There are really two options on the remaining code generators: 
    450  * Modify each code generator to understand the new Cmm instructions and restore them to non-vectorized instructions 
    451  * Add a compiler step that does a pre-pass and replaces all "length = 1" vectors and operations on them by the corresponding scalar type and operations 
    452  
    453 The latter makes sense in that it is effective on all code generators, including the LLVM code generator.  Vectors of length = 1 should not be put through SIMD instructions to begin with (as they will incur substantial overhead for no return). 
    454  
    455 To make this work, a ghc compiler flag must be added that forces all vector lengths to 1 (this will be required in conjunction with any non-LLVM code generator).  A user can also use this option to turn off SIMD optimization for LLVM. 
    456  
    457  * Add the ghc compiler option: --vector-length=1 
    458  * Modify compiler/vectorise to recognize the new option or add this compiler pass as appropriate 
    459  
    460 == Reference Documentation == 
    461  * [https://wiki.aalto.fi/display/t1065450/LLVM+vectorization LLVM Vectorization] 
    462  * [http://llvm.org/docs/LangRef.html#t_vector LLVM Vector Type] 
    463  
    464 == Reference Discussion Threads == 
    465 {{{ 
    466 From: Manuel Chakravarty 
    467 Q: Should the existing pure Vector libraries (/libraries/vector/Data/*) be modified to use the vectorized code as a first priority, wait until DPH (/libraries/dph/) is modified, or leave the Vector library as is? 
    468  
    469 A: The DPH libraries ('dph-*') are based on the 'vector' library — i.e., for DPH to use SIMD instruction, we must modify 'vector' first. 
    470  
    471 Q: How does one create one of the new Vector Types in a Haskell program (direct PrimOp?, for testing ... let x = ????) 
    472  
    473 A: I am not quite sure what the best approach is. The intention in LLVM is clearly to populate vectors using the 'insertIntVec#' etc functions. However, in LLVM you can just use an uninitialised register and insert elements into a vector successively. We could provide a vector "0" value in Haskell and insert into that. Any other ideas? 
    474  
    475 A: I just realised that we need vector version of the array read/write/indexing operations as well to process Haskell arrays with vector operations. 
    476  
    477 Q: One discussion point was that the "Vector Lengths" should be "Set to 1" for non LLVM code generation, where does this happen? On my first survey of the code, it seems that the code generators are partitioned from the main body of code, implying that each of the code generators will have to be modified to account for the new Cmm MachOps? and properly translate them to non-vectorized instructions. 
    478  
    479 A: Instead of doing the translation for every native code generator separately, we could have a pre-pass that replaces all length = 1 vectors and operations on them by the corresponding scalar type and operation.  Then, the actual native code generators wouldn't need to be changed. 
    480  
    481 A: The setting of the vector length to 1 needs to happen in dependence on the command line options passed to GHC — i.e., if a non-LLVM backend is selected. 
    482  
    483 Q: Can we re-use any of the existing MachOps? when adding to Cmm? 
    484  
    485 A: I am not sure. 
    486 }}} 
    487  
     1 This page has been moved to [wiki:SIMD/Implementation/Plan]. The main page describing the effort to add SIMD support to GHC is [wiki:SIMD].