Re: Using SSE 128 bit movs From One Memory Location To Another

On Wednesday, March 2, 2011 3:53:13 PM UTC-6, KA wrote:
> If I try using memcpy() in Ubuntu, it runs 10 times faster than
> either my inlined SSE code, or even an inlined rep movsl does.
> Why is the library function so much faster than anything I
> can write?

It is most likely because you're doing it in a loop.  The algorithm used
by the library most likely computes the startup portion necessary to reach
a DWORD-aligned (32-bit aligned) offset, then computes how many DWORDs to
copy (or possibly QWORDs, 64-bit chunks), and then handles the trailing
bytes.

In this way there are no tests being used at any point except for the
startup code and the ending code, and everything in the middle can be
executed using something like this:

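mov esi,src          ; source address
mov edi,dst          ; destination address
mov ecx,dword_count  ; number of 32-bit chunks to copy
rep movsd            ; copy ECX DWORDs from [ESI] to [EDI] (direction flag clear)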

Or:

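mov rsi,src          ; source address
mov rdi,dst          ; destination address
mov rcx,qword_count  ; number of 64-bit chunks to copy
rep movsq            ; copy RCX QWORDs from [RSI] to [RDI] (direction flag clear)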
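
For anyone who would rather see the three phases described above (startup bytes, aligned middle, trailing bytes) spelled out in C, here is a rough sketch. The name copy_aligned is made up for illustration, and this is only my guess at the general shape; it is not the actual library source. It aligns on the destination pointer only and copies in 32-bit chunks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper illustrating the head/body/tail split.
   Not the real memcpy; regions must not overlap. */
void *copy_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Startup: copy single bytes until dst is 32-bit aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Middle: copy whole DWORDs with no per-byte tests.
       The fixed-size memcpy calls are a portable way to express a
       possibly unaligned 32-bit load and store; compilers typically
       turn them into plain moves. */
    size_t dwords = n / 4;
    for (size_t i = 0; i < dwords; i++) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += 4;
        d += 4;
    }

    /* Ending: copy the 0 to 3 bytes that are left over. */
    for (n &= 3u; n > 0; n--) {
        *d++ = *s++;
    }

    return dst;
}
```

The middle loop is the part that "rep movsd" replaces in the assembly version; the startup and ending loops are the only places where byte-at-a-time tests remain.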

Hope this helps!  BTW, if you're using the GNU C library (glibc), you can
download the source and/or debugging information for it, which will allow
you to step into that code with GDB or your favorite GDB front end.

- Rick C. Hodgin
Reply foxmuldrster (53) 3/3/2011 1:18:36 AM

