| 4 | |

| 5 | In contrast to pure SIMD (single instruction multiple data) parallel computing as found in GPUs, |

| 6 | vector processing includes permutation of vector elements which |

| 7 | allows for certain efficient algorithms |

| 8 | (see Guy Blelloch: [http://www.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/papers/CMU-CS-90-190.html Prefix Sums and Their Applications]) |

| 9 | that reduce computation from n scalar operations to log n vector operations using vectors of size n. |

| 10 | Example: Compute the cumulative sum of a 4-element vector using a shift operation |

| 11 | that shifts elements and clears unused elements: |

| 12 | |

| 13 | {{{ |

| 14 | v0 = [x0, x1, x2, x3] |

| 15 | v1 = v0 + shiftr 1 v0 -- v1 = [x0, x0+x1, x1+x2, x2+x3] |

| 16 | v2 = v1 + shiftr 2 v1 -- v2 = [x0, x0+x1, x0+x1+x2, x0+x1+x2+x3] |

| 17 | }}} |