wiki:SIMDVectorExampleInLLVM

Version 3 (modified by pmonday, 2 years ago)


It is useful to see the vector instructions "in action" in LLVM's human-readable form (a ".ll" file) before implementing the Cmm -> LLVM backend (in the ./compiler/llvmGen section of the code). Writing LLVM code falls somewhere between writing Java byte code and writing assembly language directly. Here is the process:

  • Generate or write a human-readable file (a ".ll" file), for example "add_floats.ll"
  • Assemble this file to bitcode (LLVM's name for its byte code) using the LLVM assembler: llvm-as add_floats.ll. This generates a ".bc" file, in this case add_floats.bc. Bitcode is a binary format and is not human readable.
  • Once the bitcode is available, there are a few options:
    • Generate native code: llc add_floats.bc writes native assembly to a ".s" file (add_floats.s)
    • Run the bitcode with the JIT compiler: lli add_floats.bc executes the program directly

To demonstrate the vector instructions, we can start with a basic C program. This is just for illustration: LLVM code is imperative rather than functional, so starting from an imperative language makes a lot of sense:

int main()
{
   float x[4], y[4], z[4];
   x[0] = 1.0;
   x[1] = 2.0;
   x[2] = 3.0;
   x[3] = 4.0;
   y[0] = 10.0;
   y[1] = 20.0;
   y[2] = 30.0;
   y[3] = 40.0;

   z[0] = x[0] + y[0]; 
   z[1] = x[1] + y[1]; 
   z[2] = x[2] + y[2]; 
   z[3] = x[3] + y[3]; 
} 

Compiling and running this in C is easy and left to the user.

This converts easily to LLVM's human-readable format (use the online generator if you'd like, or run clang -S -emit-llvm add_floats.c locally):

; ModuleID = '/tmp/webcompile/_20751_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define i32 @main() nounwind {
  %1 = alloca i32, align 4
  %x = alloca [4 x float], align 16
  %y = alloca [4 x float], align 16
  %z = alloca [4 x float], align 16
  store i32 0, i32* %1
  %2 = getelementptr inbounds [4 x float]* %x, i32 0, i64 0
  store float 1.000000e+00, float* %2
  %3 = getelementptr inbounds [4 x float]* %x, i32 0, i64 1
  store float 2.000000e+00, float* %3
  %4 = getelementptr inbounds [4 x float]* %x, i32 0, i64 2
  store float 3.000000e+00, float* %4
  %5 = getelementptr inbounds [4 x float]* %x, i32 0, i64 3
  store float 4.000000e+00, float* %5
  %6 = getelementptr inbounds [4 x float]* %y, i32 0, i64 0
  store float 1.000000e+01, float* %6
  %7 = getelementptr inbounds [4 x float]* %y, i32 0, i64 1
  store float 2.000000e+01, float* %7
  %8 = getelementptr inbounds [4 x float]* %y, i32 0, i64 2
  store float 3.000000e+01, float* %8
  %9 = getelementptr inbounds [4 x float]* %y, i32 0, i64 3
  store float 4.000000e+01, float* %9
  %10 = getelementptr inbounds [4 x float]* %x, i32 0, i64 0
  %11 = load float* %10
  %12 = getelementptr inbounds [4 x float]* %y, i32 0, i64 0
  %13 = load float* %12
  %14 = fadd float %11, %13
  %15 = getelementptr inbounds [4 x float]* %z, i32 0, i64 0
  store float %14, float* %15
  %16 = getelementptr inbounds [4 x float]* %x, i32 0, i64 1
  %17 = load float* %16
  %18 = getelementptr inbounds [4 x float]* %y, i32 0, i64 1
  %19 = load float* %18
  %20 = fadd float %17, %19
  %21 = getelementptr inbounds [4 x float]* %z, i32 0, i64 1
  store float %20, float* %21
  %22 = getelementptr inbounds [4 x float]* %x, i32 0, i64 2
  %23 = load float* %22
  %24 = getelementptr inbounds [4 x float]* %y, i32 0, i64 2
  %25 = load float* %24
  %26 = fadd float %23, %25
  %27 = getelementptr inbounds [4 x float]* %z, i32 0, i64 2
  store float %26, float* %27
  %28 = getelementptr inbounds [4 x float]* %x, i32 0, i64 3
  %29 = load float* %28
  %30 = getelementptr inbounds [4 x float]* %y, i32 0, i64 3
  %31 = load float* %30
  %32 = fadd float %29, %31
  %33 = getelementptr inbounds [4 x float]* %z, i32 0, i64 3
  store float %32, float* %33
  %34 = load i32* %1
  ret i32 %34
}

This is easy enough to run using the JIT compiler: lli add_floats.ll. The program prints nothing; it simply computes the sums and returns 0.

The core of these instructions can be replaced with vector operations. (Obviously, an optimizer would reduce this program to very little code, so vectorization is not really necessary, but this is an exercise.)

Here is the .ll code rewritten with vectorization: