# fast vector operations using SSE2

By , on

Yesterday I learned how to speed up the vector operation y := y + αx by using SSE2 instructions.

The function I was trying to speed up was as follows:

```static void
daxpy0(int n, double a, const double *x, double *y)
{
int i;
for (i=0; i<n; ++i) {
y[i] += a * x[i];
}
}
```

Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:

```#include <emmintrin.h>

static void
daxpy1(int n, double a, const double *x, double *y)
{
__m128d aa, xx, yy;
int i;

aa = _mm_set1_pd(a);       /* store a in both registers */
for (i=0; i+1<n; i+=2) {
xx = _mm_mul_pd(aa, xx); /* ... and multiply both by a */
_mm_store_pd(y+i, yy)    /* write back both elements of y */
}
if (i < n) {
y[i] += a * x[i];        /* handle last element if n is odd */
}
}
```

This code required that the vectors `x` and `y` are 16-byte aligned (otherwise a segmentation fault will occur). This assumption holds, for example, for memory blocks allocated by `malloc` on 64bit Linux and MacOS X systems. Also, obviously, this only works on CPUs which support the SSE2 instruction set. A description of the SSE2 instructions used here can be found in the Intrinsics Reference of the Intel C++ Compiler Documentation (also seems to apply to the GNU C compiler; direct link here).

For comparison, I also tried to use the `daxpy` function provided by BLAS:

```static void
daxpy2(int n, double a, const double *x, double *y)
{
extern void daxpy_(int *Np, double *DAp,
const double *X, int *INCXp,
double *Y, int *INCYp);
int inc = 1;
daxpy_(&n, &a, x, &inc, y, &inc);
}
```

Results. I tried the three functions using two vectors of length 2000. The following table gives the execution time for a million calls to `daxpy` (on a newish quad-core Linux machine):

method function time [s]
direct `daxpy0` 1.52
SSE2 `daxpy1` 0.91
BLAS `daxpy2` 2.47

As can be seen, the SSE2 code in `daxpy1` is fastest, compared to the naive implementation `daxpy0` it takes 40% off the execution time! For some reason, the BLAS version seems to be very slow; and the results on my MacOS machine are similar. Currently I have no explanation for this effect.

This is an excerpt from Jochen's blog.