COMPGROUPS.NET | Search | Post Question | Groups | Stream | About | Register

### MMX instruction to speed up computation

• Email
• Follow

```Hi all,
I know this is a newbie question, but I cannot get any decent tutorial

I have the following C code:

int calculate_distance(unsigned int* v1, unsigned int* v2) {
int dist = 0, i;
for(i = 0; i < N; i++)
{
dist += abs(v1[i] - v2[i]);
}

return dist;
}

N >> 1000

Now, I would like to speed up the computation by using
__builtin_ia32_psadbw (I'm compiling with GCC), but I cannot get any
example how to apply it to a vector taht has more than 8 elements...
(integers in v1 and v2 are [0..255], so it could be even a unsigned
char).
Any pointer?

thanks

```
 0

See related articles to this posting

```On Nov 3, 7:38�pm, damiano.bolzoni <spamt...@crayne.org> wrote:
> Hi all,
> I know this is a newbie question, but I cannot get any decent tutorial
>
> I have the following C code:
>
> int calculate_distance(unsigned int* v1, unsigned int* v2) {
> int dist = 0, i;
> for(i = 0; i < N; i++)
> {
> � dist += abs(v1[i] - v2[i]);
>
> }
>
> return dist;
>
> }
>
> N >> 1000
>
> Now, I would like to speed up the computation by using
> __builtin_ia32_psadbw (I'm compiling with GCC), but I cannot get any
> example how to apply it to a vector taht has more than 8 elements...
> (integers in v1 and v2 are [0..255], so it could be even a unsigned
> char).
> Any pointer?
>
> thanks

The first step is to choose the right instructions. You can obviously
do more work in parallel with smaller integer types, and you may even
choose to convert your 32bit integers to 16bit (or even 8bit) on-the-
fly, if the magnitude of your expected result will allow.

With instruction selection aside, the trick is to keep your results in
packed form (in registers) until the very end of the algorithm, and
then finally compute the horizontal sum of the packed scalar results -
keeping scalar operations to a minimum. Just divide the number of
iterations by four, and advance your pointers four elements at a time
- use the aligned load instructions if possible, and handle any
arrays with zeros, which will not affect your computation).

There are 8 SSE registers under IA32, best to use as many as you can
afford (I like to save a few for the compiler) by cycling through
partial accumulators - as described in the Intel optimization
reference manual to avoid dependency stalls.

If your data is not in CPU cache I doubt you will be able to cover the
memory access latency with so few operations, so try to prefetch the
data before you begin this routine. During the loop hardware prefetch
should happen automatically for contiguous memory traversal.

Integer absolute can be computed 4x in parallel, and without
branching: x^int(x>>31) - int(x>>31)
See Stanford bithacks: http://www-graphics.stanford.edu/~seander/bithacks.html

There are plenty of code examples for routines like this in the Intel
manual:
http://www.intel.com/design/processor/manuals/248966.pdf

- Richard Maudsley

```
 0

```damiano.bolzoni wrote:
> Hi all,
> I know this is a newbie question, but I cannot get any decent tutorial
>
> I have the following C code:
>
> int calculate_distance(unsigned int* v1, unsigned int* v2) {
> int dist = 0, i;
> for(i = 0; i < N; i++)
> {
>   dist += abs(v1[i] - v2[i]);
> }
>
> return dist;
> }
>
> N >> 1000

MMX cannot really do much for you, unless the inputs have a very limited
range, i.e. max 32767 or less.

Using SSE2 or later is much better, but you still cannot handle the full
range of unsigned inputs, and neither can your original C algorithm,
since it would need 33bit signed integers for the differences.

Actually, it is possible to handle this with SSE code, but only by
explicitely calculating the carries and wrapping them back in alongside
the parallel abs() calculations.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
 0

2 Replies
164 Views

Similar Articles

12/9/2013 3:26:40 PM
[PageSpeed]