On Wednesday, March 2, 2011 3:53:13 PM UTC-6, KA wrote:
> If I try using memcpy() in Ubuntu, it runs 10 times faster than
> either my inlined SSE code, or even an inlined rep movsl does.
> Why is the library function so much faster than anything I
> can write?
It is most likely because you're doing the copy in a loop that pays for its tests on every iteration. The algorithm used by the library most likely computes the startup portion necessary to reach a DWORD-aligned (32-bit aligned) offset, then computes how many DWORDs to copy (or possibly QWORDs, 64-bit chunks), and then handles the trailing bytes.
In this way no tests are needed at any point except in the startup code and the ending code, and everything in the middle can be executed using something like this:
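mov esi,src          ; source address
mov edi,dst          ; destination address
mov ecx,dword_count  ; number of 32-bit DWORDs to copy
rep movsd            ; copy ECX DWORDs from [ESI] to [EDI]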
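Or, in 64-bit code:
mov rsi,src          ; source address
mov rdi,dst          ; destination address
mov rcx,qword_count  ; number of 64-bit QWORDs to copy
rep movsq            ; copy RCX QWORDs from [RSI] to [RDI]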
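For illustration, here is a rough C sketch of the same startup / middle / trailing split. This is only my guess at the general shape, not the actual library code; the name sketch_copy is made up, and I'm assuming 8-byte chunks and non-overlapping buffers. The fixed-size memcpy() calls in the middle loop are just the portable way to write an 8-byte load and store, and a compiler will normally turn each into a single move.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT the library's implementation.
   Assumes dst and src do not overlap. */
void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Startup: copy single bytes until dst is 8-byte aligned
       (or until we run out of bytes). */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Middle: whole 8-byte chunks; the only test is the loop counter. */
    size_t qword_count = n / sizeof(uint64_t);
    for (size_t i = 0; i < qword_count; i++) {
        uint64_t w;
        memcpy(&w, s, sizeof w);  /* 8-byte load (src may be unaligned) */
        memcpy(d, &w, sizeof w);  /* 8-byte store to aligned dst */
        s += sizeof w;
        d += sizeof w;
    }

    /* Trailing: the 0..7 bytes that did not fill a whole chunk. */
    n %= sizeof(uint64_t);
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Calling sketch_copy(dst, src, len) behaves like memcpy() for non-overlapping buffers. The middle loop is the part that a rep movsq would replace.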
Hope this helps! BTW, if you're using the GNU C library (glibc, which is where memcpy() lives on Ubuntu), you can download the source and/or debugging information for it, which will allow you to step into that code with your favorite GDB front end.
- Rick C. Hodgin
foxmuldrster (53)
3/3/2011 1:18:36 AM