By Jochen Voss, on

Yesterday I learned how to speed up the vector operation
*y* := *y* + α*x* by using
SSE2 instructions.

The function I was trying to speed up was as follows:

static void daxpy0(int n, double a, const double *x, double *y) { int i; for (i=0; i<n; ++i) { y[i] += a * x[i]; } }

Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:

#include <emmintrin.h> static void daxpy1(int n, double a, const double *x, double *y) { __m128d aa, xx, yy; int i; aa = _mm_set1_pd(a); /* store a in both registers */ for (i=0; i+1<n; i+=2) { xx = _mm_load_pd(x+i); /* load two elements from x ... */ xx = _mm_mul_pd(aa, xx); /* ... and multiply both by a */ yy = _mm_load_pd(y+i); /* load two elements from y ... */ yy = _mm_add_pd(yy, xx); /* ... and add */ _mm_store_pd(y+i, yy) /* write back both elements of y */ } if (i < n) { y[i] += a * x[i]; /* handle last element if n is odd */ } }

This code required that the vectors `x`

and `y`

are 16-byte aligned (otherwise a segmentation fault will occur). This
assumption holds, for example, for memory blocks allocated by
`malloc`

on 64bit Linux and MacOS X systems. Also,
obviously, this only works on CPUs which support the SSE2 instruction set.
A description of the SSE2 instructions used here can be found in the
Intrinsics Reference

of the
Intel C++ Compiler Documentation
(also seems to apply to the GNU C compiler; direct link
here).

For comparison, I also tried to use the `daxpy`

function
provided by BLAS:

static void daxpy2(int n, double a, const double *x, double *y) { extern void daxpy_(int *Np, double *DAp, const double *X, int *INCXp, double *Y, int *INCYp); int inc = 1; daxpy_(&n, &a, x, &inc, y, &inc); }

Details about this approach are described on my page about linear algebra packages on Linux.

Results. I tried the three functions using two
vectors of length 2000. The following table gives the execution time for a
million calls to `daxpy`

(on a newish quad-core Linux machine):

method | function | time [s] |
---|---|---|

direct | `daxpy0`
| 1.52 |

SSE2 | `daxpy1`
| 0.91 |

BLAS | `daxpy2`
| 2.47 |

As can be seen, the SSE2 code in `daxpy1`

is fastest,
compared to the naive implementation `daxpy0`

it takes 40% off
the execution time! For some reason, the BLAS version seems to be very
slow; and the results on my MacOS machine are similar. Currently I have no
explanation for this effect.

This is an excerpt from Jochen's blog.

Newer entry: scrabble

Older entry: tunnelling http over ssh

Copyright © 2010, Jochen Voss. All content on this website (including text, pictures, and any other original works), unless otherwise noted, is licensed under a Creative Commons Attribution-Share Alike 3.0 License.