Opened 5 years ago

Last modified 2 years ago

#3557 new feature request

CPU Vector instructions in GHC.Prim

Reported by: guest Owned by:
Priority: normal Milestone:
Component: Compiler (NCG) Version: 6.11
Keywords: Cc: ghc@…, axman6@…, haskell.vivian.mcphail@…, pumpkingod@…, dterei, william.knop.nospam@…, Jake.McArthur@…, as@…, hackage.haskell.org@…, jystic@…, nightski@…, mle+hs@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

It would be nice to have support for vector unit (MMX, SSE, AltiVec, and so on) operations in GHC. Currently Data Parallel Haskell cannot utilize vector units due to GHC's lack of support.
Those vector operations could be nicely used to get e.g. stereo signal processing for the price of mono signal processing.
Maybe those operations could be added to GHC.Prim, or because there are so many, to a new module, GHC.Prim.Vector.

Attachments (1)

partial-simd-patch-3557.patch (166.0 KB) - added by vivian 3 years ago.
Partial patch against HEAD for SIMD instructions


Change History (48)

comment:1 Changed 5 years ago by Axman6

  • Cc axman6@… added

comment:2 Changed 5 years ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to _|_

Yes, this is a good idea, though it would be a fair amount of work. We need new primitive types for the vector data types, and new primitives for each operation. Cmm would need new vector data types, and the native code generator(s) need to know how to allocate registers. We'd have to decide whether/how to use vector registers for argument passing. This might be easier with the new backend.

A good student project, or SoC project perhaps? Leaving the milestone at _|_ for now.

comment:3 follow-up: Changed 4 years ago by guest

I have tried some SSE instructions using harpy in GHC-6.10, but the test program terminates with:

ghci: pthread_mutex_lock.c:82: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

Of course, this can be due to a programming error. However, is the Haskell runtime prepared for SSE instructions at all? If two threads use XMM registers, does their usage of these registers interfere?

comment:4 in reply to: ↑ 3 Changed 4 years ago by guest

Replying to guest:

ghci: pthread_mutex_lock.c:82: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

Yes, this was a stupid stack error. My program now works with SSE instructions in a single thread.

comment:5 follow-ups: Changed 3 years ago by vivian

  • Cc haskell.vivian.mcphail@… added
  • Owner set to vivian

Okay, this might take a while. First set of questions:

I'm thinking of making a module GHC.Prim.SSE that contains the new ops.

1) SSE instructions are CPU specific, so we need a way to check whether the CPU supports the various SSE extensions (SSE, SSE2, SSE3, SSE4, ...) via the CPUID assembler instruction.

a) If an extension is not supported, does the primop simply not get defined, or do we hand-code a definition for it? How do we propagate this information to user code? We could have functions

sse :: Bool
sse2 :: Bool
sse3 :: Bool
...

b) This also affects cross-compilation, where checking the CPU of the build machine doesn't tell us about the capabilities of the target machine.

c) Do we include memory-management primOps? Specifically, there are opcodes for bypassing the cache, which is helpful for live, streaming data.

2) One of the instructions is a dot-product instruction that takes (A0,A1,A2,A3) and (B0,B1,B2,B3) as packed 32-bit floats and returns (A0*B0)+(A1*B1)+(A2*B2)+(A3*B3). This would work really well with a streaming data type: the first pass (for a vector of 32-bit floats) computes 4-element chunks of the dot product, and the next pass computes the sum of those results.

a) I seem to recall that Data.Vector.Unboxed is faster than Data.Vector.Storable. My initial thought about packing/unpacking four 32-bit values into a 128-bit unit would be to peek/poke. Do unboxed tuples have any guarantees about alignment and so on in memory? It would be great to have a function like

packFloat :: Float# -> Float# -> Float# -> Float# -> Xmm128#
packFloat a b c d = (# a, b, c, d #)

The real win is when we have contiguous sequences of well-aligned floats in memory, so we can fetch 128-bit chunks at a time, bypassing the cache if necessary.

b) Also, what is the relationship between boxed and unboxed numbers? Is Float# -> Float a no-op? The 'unboxed' vectors in Data.Vector.Unboxed appear not to be types like Int#, Int8#, but rather Int, Int8.
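The chunked strategy from point 2 can be sketched in plain Haskell. This is a scalar simulation only: `chunkDot` stands in for what a single dot-product instruction would compute over one 128-bit register, and all names are illustrative, not proposed primops.

```haskell
import Data.List (unfoldr)

-- Split a list into 4-element chunks, mirroring a 128-bit register of
-- 32-bit floats (the last chunk may be shorter).
chunksOf4 :: [Float] -> [[Float]]
chunksOf4 = unfoldr (\xs -> if null xs then Nothing else Just (splitAt 4 xs))

-- First pass: one partial dot product per chunk (what the packed
-- dot-product instruction would do in a single step).
-- Second pass: sum the partial results.
dotChunked :: [Float] -> [Float] -> Float
dotChunked as bs = sum (zipWith chunkDot (chunksOf4 as) (chunksOf4 bs))
  where chunkDot a b = sum (zipWith (*) a b)

main :: IO ()
main = print (dotChunked [1..8] [1..8])  -- prints 204.0
```

With stream fusion, a consumer written this way would need only the per-chunk primop plus a scalar reduction at the end.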

So, my plan is to start with adding primOps to GHC.Prim and the compiler and then follow the code through the compiler to Cmm and the code generators, making changes as necessary.

comment:6 Changed 3 years ago by guest

  • Summary changed from SIMD operations in GHC.Prim to CPU Vector instructions in GHC.Prim

In the meantime I have learned that there is a difference between vector and SIMD computing. The main difference is a 'shuffle' operation, which allows permuting vector elements. Thus SSE and AltiVec are vector computing, while the GPUs of Nvidia and ATI are SIMD. This proposal is about Vector Computing.
See "The Perils of Parallel: Larrabee vs. Nvidia, MIMD vs. SIMD"
http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html
I have changed the ticket name accordingly.

comment:7 in reply to: ↑ 5 Changed 3 years ago by guest

Replying to vivian:

Okay, this might take a while. First set of questions:

I'm thinking of making a module GHC.Prim.SSE that contains the new ops.

1) SSE instructions are CPU specific, so we need a way to check whether the CPU supports the various SSE extensions (SSE, SSE2, SSE3, SSE4, ...) via the CPUID assembler instruction.

I like the way LLVM solves this problem. If we adopt it, implementing vector instructions is also easy for the LLVM backend.

LLVM is divided into a high level interface and a low level interface.
The high level interface provides virtual vectors of every possible length. (Early versions only provided power-of-two vector sizes.) The implementation of the vector instructions makes as much use of vector units as possible by dividing virtual vectors into physical CPU vectors. If there is no CPU vector unit at all, the chunk size is 1.

The low level interface provides access to the actual CPU instructions via intrinsics. You may use these functions to optimize your program; however, for portable code you must always provide an alternative implementation using the high level interface. I have written such automatic switches for LLVM:
http://hackage.haskell.org/packages/archive/llvm-extra/0.1/doc/html/LLVM-Extra-Extension.html
http://hackage.haskell.org/packages/archive/llvm-extra/0.1/doc/html/LLVM-Extra-Extension-X86.html

I would like to see the same for GHC: define a primitive arbitrary-length vector type in, say, Data.CPUVector. The vector size might be encoded with a type, just as in the llvm package. Additionally, write GHC.Prim.X86.SSE modules for the CPU-specific instructions.

comment:8 in reply to: ↑ 5 Changed 3 years ago by vivian

Replying to vivian:

So, my plan is to start with adding primOps to GHC.Prim and the compiler and then follow the code through the compiler to Cmm and the code generators, making changes as necessary.

Having thought about this more, looked at Cmm and LLVM, and read guest's comment, it seems a different plan would be better.

Start by adding a vector type to Cmm. Then add platform-specific 'intrinsics'. Then add vector primOps to GHC.Prim.

In Cmm, there are 'Machine-Oppish commands' which seem to relate to FPU-type operations. The SSE operations use their own 128-bit registers, not the main CPU registers. Should the vector operations be defined as 'Machine-Oppish' commands, since they don't use normal registers, or should they be given their own register type? It seems that a lot of the vector operations are almost instances of, e.g., MO_F_Add Width: if we had a CPUVector Width type, then we could apply MO_F_Add Width to each pair of elements of the CPUVector.

There is some tension here, because it would be nice to be able to use MO_F_Add on a pair of vectors, but the Width field does not store enough information to apply the machine op to two vectors.

What about adding vector machine ops, e.g.

  data MachOp
  -- Integer operations (insensitive to signed/unsigned)
  = MO_Add Width
...
  | MO_VI_Add Width Length   -- machine op vector int add
...
  | MO_VF_Dot Width Length   -- machine op vector float dot product

comment:9 Changed 3 years ago by vivian

Maybe it would be better to add an array type. LLVM already has one, as does the LlvmGen code.

data CmmType 	-- The important one!
  = CmmType  CmmCat Width 
  | CmmArray CmmCat Width Length  -- new type

Then we can use existing MachOps on the new array types.

At the Haskell level, the new primOps can operate on Array#s. On a machine that is, e.g., SSE4-capable, with a CPU float vector width of 4 (multiplying four floats at once), an Array# Float# can be cast to Array# Xmm128Float# and the multiply operation iterated over the Array# Xmm128Float#. A conversion back to Array# Float# gives the result.

comment:10 Changed 3 years ago by rl

The ability to apply one SSE operation to an entire (big) array isn't very useful. Typically, you will have complex computations, i.e., you will want to apply multiple SSE operations to one bit of the array before moving on.

comment:11 Changed 3 years ago by vivian

Okay, thanks for that point, rl. In that case, it seems that what needs doing is adding packed types to the Width type:

data Width   = W8 | W16 | W32 | W64 
	     | W80	-- Extended double-precision float, 
			-- used in x86 native codegen only.
			-- (we use Ord, so it'd better be in this order)
	     | W128
             | W8_16 | W16_8 | W32_4 | W64_2 -- NEW Widths for packed types
                                             -- which fit into XMM 128 registers
	     deriving (Eq, Ord, Show)

Which means we can use the current MachOp, for example,

MO_Add W32_4

will use SSE registers to add 4 packed i32's.

With respect to the XMM registers (8 or 16 depending upon CPU), upon which the SSE instructions operate, should a new type be added to GlobalReg

data GlobalReg
  -- Argument and return registers
  = VanillaReg			-- pointers, unboxed ints and chars
	{-# UNPACK #-} !Int	-- its number
 	VGcPtr

  | FloatReg		-- single-precision floating-point registers
	{-# UNPACK #-} !Int	-- its number

  | DoubleReg		-- double-precision floating-point registers
	{-# UNPACK #-} !Int	-- its number

  | LongReg	        -- long int registers (64-bit, really)
	{-# UNPACK #-} !Int	-- its number

  | XmmReg              -- 128 bit xmm registers
        {-# UNPACK #-} !Int     -- its number

or should this be left to the backend, since it is target-specific whether an SSE intrinsic can be used?

comment:12 Changed 3 years ago by vivian

I don't know what people think about not having certain primOps available on specific CPUs. But this is one way to resolve this ticket:

  • Add #define directives based on CPUID or some other CPU probe (cabal interaction too)
  • Add the SSE/Altivec-specific primitives for Word8Packed16, Int32Packed4, ..., FloatPacked4, DoublePacked2 conditionally included in primOp generation.
  • Add Constructors to Width in Cmm.
  • Add CallishMachOp functions for new mathematical operations (reciprocal, sqrt, ...) and mathematical operations on new packed types
  • Elaborate new primOp generation alternatives in `StgCmmPrim.hs`
  • Extend asm, C, llvm backends
  • Think about packed-length agnostic stream generalisations (fold, zip) for automatic optimisation of vector/stream operations (onArrays#) [This includes using operations for memory accesses]

Notes:

+ I think that Complex should be a class, not a data constructor. We could operate on a DoublePacked2 (W64_2) as a complex number.

+ Automatic untupling of packed data-types (rebindable syntax?), e.g.

instance Complex DoublePacked2 where
   real = fst
   imag = snd
   magnitude (r,i) = sqrt ((r**2)+(i**2))
   phase (r,i) = atan2 i r
   polar x = (magnitude x,phase x)
   toReal (r,i) = (r,i)
   toComplex :: Double -> Double -> DoublePacked2
   toComplex r i = (r,i)
   (:+) = toComplex

and we can have

type ComplexFloatPacked2 = FloatPacked4
instance Complex ComplexFloatPacked2 where -- now we have SSE instructions for complex numbers
...

and also

instance (Complex a) => Complex (Vector a) where
...

Through (i) stream fusion and (ii) backend vector-length optimisation, these length-agnostic vectors can take advantage of SSE operations.

+ A consistent linear-algebra class hierarchy would help bridge the gap between computing and mathematics (code optimisation and efficient expression)

comment:13 follow-ups: Changed 3 years ago by guest

It seems that there are a lot of design decisions to make, and for me the comments on this ticket are not the appropriate platform for that. What do you think?

What operations on Complex would benefit from vector operations? In my experience it is more useful to do the vectorization at another level: If you have four Complex Float values, then put the real parts into one SSE vector and the imaginary parts into another SSE vector. Multiplication is then still

cr = ar*br - ai*bi
ci = ar*bi + ai*br

but ar, ai, br, bi, cr, ci are Vector D4 Float (written as llvm type).
In Numeric-Prelude I can use the type Complex (Vector D4 Float).
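The structure-of-arrays layout described above can be sketched in plain Haskell. This is a scalar simulation: each list of lanes stands in for one SSE register (the role of Vector D4 Float), and the names are illustrative only.

```haskell
-- Four complex numbers stored as (real parts, imaginary parts);
-- each list plays the role of one SSE register of lanes.
type Lanes = [Float]

-- Lane-wise complex multiplication: the scalar formulas
--   cr = ar*br - ai*bi
--   ci = ar*bi + ai*br
-- become plain lane-wise vector mul/add/sub, with no shuffling at all.
complexMulSoA :: (Lanes, Lanes) -> (Lanes, Lanes) -> (Lanes, Lanes)
complexMulSoA (ar, ai) (br, bi) =
  ( zipWith (-) (zipWith (*) ar br) (zipWith (*) ai bi)  -- cr lanes
  , zipWith (+) (zipWith (*) ar bi) (zipWith (*) ai br)  -- ci lanes
  )

main :: IO ()
main = print (complexMulSoA ([1,1,1,1],[2,2,2,2]) ([3,3,3,3],[4,4,4,4]))
-- (1+2i)*(3+4i) = -5+10i in every lane
```

The point of the layout is visible in the code: four complex products cost six lane-wise multiplies and two lane-wise add/subs, with no cross-lane data movement.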

There should be two possible ways to write programs:

  1. Programs where the number of vector elements is given by the problem, e.g. storing a stereo or quadraphonic sample in a vector. For this application there should be functions that simulate vectors of any length.
  2. Programs that can adapt to the CPU vector size, e.g. programs that chop a signal or bitmap into chunks of CPU vector size. For these programs a common interface with a CPU-dependent vector size is better.

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

comment:14 Changed 3 years ago by simonmar

I haven't followed all the discussion so far because I'm not all that familiar with SSE etc., but I just wanted to make the point that we can't have conditional compilation in primops.txt.pp based on the CPU type, because the CPU on which you build GHC is not necessarily the same as the one it runs on.

In fact, GHC is able to produce native code for any CPU, because we compile in all the native backends (well, that's not strictly true, because you can only have either the x86 or the x86_64 backend, not both, but we should really fix that).

So no conditions: all the primops are visible on all platforms. If you try to use an x86-specific primop with the Sparc backend, you'll get a compile-time error. If you try to use an SSE3-specific primop on a non-SSE3 CPU, the program will fail with an illegal instruction error, just as if you use -msse2 on x86 right now.

comment:15 in reply to: ↑ 13 ; follow-up: Changed 3 years ago by vivian

Replying to guest:

What operations on Complex would benefit from vector operations? In my experience it is more useful to do the vectorization at another level: If you have four Complex Float values, then put the real parts into one SSE vector and the imaginary parts into another SSE vector. Multiplication is then still

cr = ar*br - ai*bi
ci = ar*bi + ai*br

SSE3 has shuffling and add/sub operations: ADDSUB {A0, A1} {B0, B1} = {A0-B0, A1+B1} ; cr, ci
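For reference, the classic SSE3 sequence for one interleaved complex multiply can be simulated in scalars as below. ADDSUBPD subtracts in the low lane and adds in the high lane; the helper names are illustrative, not real primops.

```haskell
-- Simulated ADDSUB: subtract in the low lane, add in the high lane.
addsub :: (Double, Double) -> (Double, Double) -> (Double, Double)
addsub (a0, a1) (b0, b1) = (a0 - b0, a1 + b1)

-- One complex multiply (ar + ai*i) * (br + bi*i), following the usual
-- MULPD / SHUFPD / ADDSUBPD instruction sequence:
complexMulAddsub :: (Double, Double) -> (Double, Double) -> (Double, Double)
complexMulAddsub (ar, ai) (br, bi) =
  let t1 = (ar * br, ai * br)  -- MULPD with br broadcast into both lanes
      t2 = (ai * bi, ar * bi)  -- SHUFPD to swap a, then MULPD with bi broadcast
  in  addsub t1 t2             -- (ar*br - ai*bi, ai*br + ar*bi) = (cr, ci)

main :: IO ()
main = print (complexMulAddsub (1, 2) (3, 4))  -- (1+2i)*(3+4i) = (-5.0,10.0)
```

So one interleaved product needs a shuffle per multiply, which is the cost the structure-of-arrays layout avoids.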

comment:16 in reply to: ↑ 13 ; follow-up: Changed 3 years ago by vivian

Replying to guest:

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

GHC/SIMD

comment:17 in reply to: ↑ 16 ; follow-up: Changed 3 years ago by simonmar

Replying to vivian:

Replying to guest:

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

GHC/SIMD

I think the GHC Trac wiki would be more appropriate for this.

comment:18 Changed 3 years ago by vivian

I have begun by adding

Int32Packed4#
Int64Packed2#
FloatPacked4#
DoublePacked2#

as primitive datatypes. This includes adding the primitive mathematical and logical operations that Cmm uses (which are the C primitives).

I am currently not providing non-hash versions of these datatypes (e.g. Int32Packed4).

Also added are primitive Array#s for these packed types.

comment:19 in reply to: ↑ 15 Changed 3 years ago by guest

Replying to vivian:

Replying to guest:

What operations on Complex would benefit from vector operations? In my experience it is more useful to do the vectorization at another level: If you have four Complex Float values, then put the real parts into one SSE vector and the imaginary parts into another SSE vector. Multiplication is then still

cr = ar*br - ai*bi
ci = ar*bi + ai*br

SSE3 has shuffling and add/sub operations: ADDSUB {A0, A1} {B0, B1} = {A0-B0, A1+B1} ; cr, ci

Sure, but what you save by doing two multiplications at a time, you will lose through the constant shuffling.

comment:20 Changed 3 years ago by vivian

Both GNU C, with __attribute__((vector_size(n))), and LLVM, with <n x type> vectors, support low-level packed types.

I have primOps like:

unPack4FloatOp# :: FloatPacked4# -> (# Float#, Float#, Float#, Float# #)
pack4FloatOp#   :: Float# -> Float# -> Float# -> Float# -> FloatPacked4#

and I have added the corresponding Widths: W32 | ... | W32_4 | W64_2

At the Cmm level, I want to combine 4 floats into one 128-bit packed register. (C and LLVM have operations for this, as does SSE).

I'm not clear whether this should be new CmmStmt data constructors (CmmPack | CmmUnPack) or implemented as CallishMachOps. Since we have the new Widths it seems that the right solution is to extend CmmStmt.

It is also possible to achieve packing/unpacking with memory reads/writes and explicit casting.
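The intended pack/unpack semantics can be simulated at the boxed level. This is a sketch only: `FloatPacked4` below is an ordinary strict datatype standing in for the 128-bit primitive type, and the function names merely echo the proposed primops.

```haskell
-- Stand-in for FloatPacked4#: four 32-bit floats in one value.
data FloatPacked4 = FP4 !Float !Float !Float !Float
  deriving (Eq, Show)

-- Boxed analogues of pack4FloatOp# / unPack4FloatOp#.
pack4Float :: Float -> Float -> Float -> Float -> FloatPacked4
pack4Float = FP4

unPack4Float :: FloatPacked4 -> (Float, Float, Float, Float)
unPack4Float (FP4 a b c d) = (a, b, c, d)

-- A lane-wise operation on the packed type, i.e. what an add machine op
-- at the new W32_4 width would mean.
addPacked :: FloatPacked4 -> FloatPacked4 -> FloatPacked4
addPacked (FP4 a b c d) (FP4 e f g h) = FP4 (a+e) (b+f) (c+g) (d+h)

main :: IO ()
main = print (unPack4Float
               (addPacked (pack4Float 1 2 3 4) (pack4Float 10 20 30 40)))
```

At the Cmm level, whether this round trip becomes dedicated pack/unpack statements or memory reads/writes with casts is exactly the open question above.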

comment:21 in reply to: ↑ 17 ; follow-up: Changed 3 years ago by guest

Replying to simonmar:

Replying to vivian:

Replying to guest:

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

GHC/SIMD

I think the GHC Trac wiki would be more appropriate for this.

Where is the Wiki page now?

comment:22 in reply to: ↑ 21 ; follow-up: Changed 3 years ago by vivian

Replying to guest:

Replying to simonmar:

Replying to vivian:

Replying to guest:

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

GHC/SIMD

I think the GHC Trac wiki would be more appropriate for this.

Where is the Wiki page now?

http://hackage.haskell.org/trac/ghc/wiki/SIMD

comment:23 in reply to: ↑ 22 ; follow-up: Changed 3 years ago by vivian

Replying to vivian:

Replying to guest:

Replying to simonmar:

Replying to vivian:

Replying to guest:

There are more details to say, but maybe we can move to a different platform. Maybe a HaskellWiki page?

GHC/SIMD

I think the GHC Trac wiki would be more appropriate for this.

Where is the Wiki page now?

http://hackage.haskell.org/trac/ghc/wiki/SIMD

And now it's been moved to http://hackage.haskell.org/trac/ghc/wiki/VectorComputing (why?)

comment:25 follow-up: Changed 3 years ago by rl

FWIW, I don't think those reasons are entirely correct. I would suggest moving the page back. Vector instructions are SIMD instructions. Vector computing is, to me, something fairly different. The relevant Wikipedia pages have a lot of information which might help clear up the confusion.

Also, why aren't people required to log in to edit the wiki?

comment:26 in reply to: ↑ 25 ; follow-ups: Changed 3 years ago by guest

Replying to rl:

FWIW, I don't think those reasons are entirely correct. I would suggest moving the page back. Vector instructions are SIMD instructions. Vector computing is, to me, something fairly different. The relevant Wikipedia pages have a lot of information which might help clear up the confusion.

I do not know which Wikipedia pages are relevant for you, but I think
http://en.wikipedia.org/wiki/Vector_processor
supports the view that SSE and AltiVec are about vector computing (or vector processing), not about parallel/SIMD computing. The same applies to the blog article that compares Larrabee with NVidia GPUs.
I do not see a reason to conflate two different approaches. Do we disagree on the naming, or on the statement that there are two different approaches to tackle repetitive computations?

comment:27 in reply to: ↑ 26 Changed 3 years ago by vivian

Replying to guest:

Replying to rl:

FWIW, I don't think those reasons are entirely correct. I would suggest moving the page back. Vector instructions are SIMD instructions. Vector computing is, to me, something fairly different. The relevant Wikipedia pages have a lot of information which might help clear up the confusion.

I do not know which Wikipedia pages are relevant for you, but I think
http://en.wikipedia.org/wiki/Vector_processor
supports the view that SSE and AltiVec are about vector computing (or vector processing), not about parallel/SIMD computing. The same applies to the blog article that compares Larrabee with NVidia GPUs.
I do not see a reason to conflate two different approaches. Do we disagree on the naming, or on the statement that there are two different approaches to tackle repetitive computations?

There are three terms here: (i) parallel, (ii) SIMD, (iii) vector processing.

You seem to be equating SIMD with parallel as opposed to vector processing. This, I think, is an error. Consider a Beowulf cluster, which is a parallel machine that uses MPI. This cluster can execute parallel algorithms where each node has its own instructions and data: Multiple Instruction, Multiple Data (MIMD).

A different sort of parallelism occurs when the same instruction is performed on different data units at the same time: Single Instruction, Multiple Data (SIMD). In this sense SIMD is vector processing, and SIMD is the standard name for it.

By the way, SSE stands for 'Streaming SIMD Extensions'.

Whether the computations are performed on the CPU or a GPU, if the same instruction (or kernel) is applied to multiple data units in parallel, this is SIMD.

Also, Manuel Chakravarty has written the Accelerate package for using the NVidia GPU from Haskell.

Anyway, this feature request is about adding CPU Vector instructions, i.e., SSE, or Streaming SIMD Extensions.

comment:28 in reply to: ↑ 26 Changed 3 years ago by vivian

Replying to guest:

I do not know which Wikipedia pages are relevant for you, but I think
http://en.wikipedia.org/wiki/Vector_processor
supports the view that SSE and AltiVec are about vector computing (or vector processing), not about parallel/SIMD computing.

Quoting from the Wikipedia article linked above:

Today, most commodity CPUs implement architectures that feature instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Common examples include MMX, SSE, and AltiVec.

comment:29 Changed 3 years ago by pumpkin

  • Cc pumpkingod@… added

comment:30 Changed 3 years ago by dterei

  • Cc davidterei@… added

comment:31 Changed 3 years ago by altaic

  • Cc william.knop.nospam@… added

comment:32 Changed 3 years ago by jmcarthur

  • Cc Jake.McArthur@… added

comment:33 Changed 3 years ago by vivian

  • Owner vivian deleted

Hi, I'm happy to keep working on this ticket, but for the time being I'm busy. I can provide a patch of what I've done, or discuss it if someone wants to take over.

comment:34 Changed 3 years ago by igloo

I'd recommend attaching a patch with what you have so far, in case someone else wants to take a look, and so that the latest version is recorded somewhere. Don't worry if it isn't pretty yet.

Changed 3 years ago by vivian

Partial patch against HEAD for SIMD instructions

comment:35 Changed 3 years ago by vivian

I've attached a patch for what I've done so far; it is not pretty.

Two recommendations after having done some coding:

1) Don't try to have dummy functions for CPUs that don't support specific instructions; insist that the programmer check that they are using the right primOps. This is for efficiency and sanity.

2) There needs to be some Cmm programming, for example to pack/unpack scalar to vector, which requires combinations of machine ops.

comment:36 Changed 3 years ago by igloo

Thanks!

comment:37 Changed 3 years ago by dterei

  • Cc dterei added; davidterei@… removed

comment:38 follow-up: Changed 3 years ago by thoughtpolice

  • Cc as@… added

This patch is massive because it seems to contain a merge conflict inside the .patch (perhaps this patch was simply diffed from an old tree, patched against HEAD at the time, and then a diff was generated with the conflict?). Note the massive changes to CmmExpr and the conflict markers. All of this was caused by a bunch of stuff being shuffled around under ./compiler/cmm (probably because of the Hoopl changes).

I've taken the liberty of significantly cleaning this up and bringing it a lot closer to GHC HEAD, and a lot of the fluff goes away from the patch file. Should I push this to github or something and put it on here for easier review and work? It's close to compiling, but I haven't done any additional work - for example, implementing any sort of Cmm operations to pack/unpack vectors.

comment:39 Changed 3 years ago by igloo

  • Status changed from new to patch

comment:40 Changed 3 years ago by liyang

  • Cc hackage.haskell.org@… added

comment:41 Changed 3 years ago by jystic

  • Cc jystic@… added

comment:42 Changed 3 years ago by nightski

  • Cc nightski@… added

comment:43 in reply to: ↑ 38 Changed 3 years ago by simonpj

Replying to thoughtpolice:

This patch is massive because it seems to contain a merge conflict inside the .patch (perhaps this patch was simply diffed from an old tree, patched against HEAD at the time, and then a diff was generated with the conflict?). Note the massive changes to CmmExpr and the conflict markers. All of this was caused by a bunch of stuff being shuffled around under ./compiler/cmm (probably because of the Hoopl changes).

I've taken the liberty of significantly cleaning this up and bringing it a lot closer to GHC HEAD, and a lot of the fluff goes away from the patch file. Should I push this to github or something and put it on here for easier review and work? It's close to compiling, but I haven't done any additional work - for example, implementing any sort of Cmm operations to pack/unpack vectors.

Dear thoughtpolice: do please attach your revised patch; that'd be useful. Activity on this patch may revive shortly.

comment:44 Changed 3 years ago by erikd

  • Cc mle+hs@… added

comment:45 Changed 3 years ago by erikd

I've just done a bit of experimentation on this, and with one minor modification of the LLVM AST implementation (I will make a separate bug for this), the LLVM backend can support a number of standard operations on arbitrarily sized vectors of the standard data types. The only caveat is that the size of these vectors must be known at compile time.

This suggests that, for vectors of a statically known length, operations like addition could just be passed through to the LLVM backend as a vector addition primop like:

vectorAdd :: Vector vsize vtype -> Vector vsize vtype -> Vector vsize vtype

For platforms where vector operations aren't supported, the LLVM backend for that platform would be responsible for unrolling the vector operation.

For cases where the vector width is known only at run time, vector operations can still be generated by generating a loop that uses the vector instructions.
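The run-time-length case is the usual strip-mined loop: process as many full chunks as fit through the vector instruction, then finish the remainder with scalar code. A plain-Haskell simulation (the names and the chunk size of 4 are illustrative; `addChunk` stands in for one vector add):

```haskell
-- Strip-mined addition over vectors whose length is known only at run time:
-- full 4-element chunks go through the (simulated) vector instruction,
-- and a tail of 0..3 elements is handled scalar-wise.
vectorAddAnyLen :: [Float] -> [Float] -> [Float]
vectorAddAnyLen = go
  where
    go xs ys
      | length xs >= 4 =
          let (x4, xrest) = splitAt 4 xs
              (y4, yrest) = splitAt 4 ys
          in addChunk x4 y4 ++ go xrest yrest
      | otherwise = zipWith (+) xs ys  -- scalar tail
    addChunk = zipWith (+)             -- one simulated vector add

main :: IO ()
main = print (vectorAddAnyLen [1..6] (replicate 6 10))
```

A backend would emit the same shape: a vector loop with a trip count computed at run time, plus a scalar epilogue.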

comment:46 Changed 3 years ago by erikd

Submitted a patch for #5506 which adds a way to represent vector types in the LLVM backend.

comment:47 Changed 2 years ago by igloo

  • Status changed from patch to new

If I followed correctly, we don't currently have a patch that is suitable to apply, so I'm moving this ticket out of state "patch".
