Hi all,
I'm a complete beginner.
I'm looking for a very fast way to copy 128 bytes on the Intel Xeon 5400 series.
For now I use memcpy (used intensively: __builtin_memcpy, GCC 4.2.0).
Do you have better code?
Thanks
Bruno, 3/2/2009 11:31:48 AM
(Bruno Causse) wrote:
> hi all,
>
> i am very newbie
>
> i search a very fast method for copy 128 bytes on intel xeon 5400 serie
>
> for now i use memcpy (intensive use: builtin_memcpy gcc 4.2.0)
>
> have you better code?
>
> thx
>
Do you really think this is the right newsgroup? memcpy() is a
C function, so I'd ask in a C group.
BTW, I think MOVS should be fast for copying bytes in assembler.
Uwe
======================================= MODERATOR'S COMMENT:
Maybe the gentleman wants something specific to the Xeon 5400 series of processors?
-- jIm
Uwe, 3/2/2009 2:05:23 PM
Uwe Plonus <spamtrap@crayne.org> wrote:
> (Bruno Causse) wrote:
> Do you really think, that this is the right newsgroup? I think that
> memcpy() is a C-function so ask there.
>
Yes, I'm looking for an asm routine (specifically for 128 bytes).
> ======================================= MODERATOR'S COMMENT:
> Maybe the gentleman wants something specific to the
> Xeon 5400 series of processors?
Right.
Bruno, 3/2/2009 3:18:47 PM
"(Bruno Causse)" <spamtrap@crayne.org> wrote in part:
> hi all,
> i am very newbie
> i search a very fast method for copy 128 bytes on intel xeon 5400 serie
> for now i use memcpy (intensive use: builtin_memcpy gcc 4.2.0)
> have you better code?
How will you know? What measurement techniques are you using?
Unless memcpy() is somehow in-lined during compilation,
it will incur significant overhead.
If the blocks are not aligned, you will probably get best
performance from a simple REP MOVSD. For aligned blocks,
unrolled XMM moves are likely faster.
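One way to answer the measurement question is a small timing loop. The following is only a sketch under assumptions: `time_copy_ns` and `copy_memcpy` are hypothetical names, it relies on POSIX `clock_gettime`, and it measures a hot-cache loop, which may not resemble the real workload.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <string.h>
#include <time.h>

enum { BLOCK = 128, ITERS = 1000000 };

/* Hypothetical helper: average nanoseconds per 128-byte copy (hot cache). */
double time_copy_ns(void (*copy)(void *, const void *))
{
    static unsigned char src[BLOCK], dst[BLOCK];
    struct timespec t0, t1;
    volatile unsigned char sink = 0;

    assert(copy != NULL);
    memset(src, 0xA5, sizeof src);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        src[i % BLOCK] = (unsigned char)i;  /* change the input every pass */
        copy(dst, src);
        sink ^= dst[i % BLOCK];             /* make the result observable */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
}

/* Baseline candidate: the library/builtin memcpy with a constant size. */
void copy_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, BLOCK);
}
```

Calling `time_copy_ns(copy_memcpy)` gives a baseline; any replacement routine with the same signature can be passed in and compared. The call through a function pointer adds its own overhead, so the numbers are only useful relative to each other.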
-- Robert
Robert, 3/2/2009 4:05:40 PM
Robert Redelmeier wrote:
> "(Bruno Causse)" <spamtrap@crayne.org> wrote in part:
>> hi all,
>> i am very newbie
>> i search a very fast method for copy 128 bytes on intel xeon 5400 serie
>> for now i use memcpy (intensive use: builtin_memcpy gcc 4.2.0)
>> have you better code?
>
> How will you know? What measurement techniques are you using?
> Unless memcpy() is somehow in-lined during compilation,
> it will incur significant overhead.
If the compiler knows statically (i.e. at compile time) that
the area is exactly 128 bytes, then inline code can be quite good.
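A minimal sketch of what "statically known size" can look like in C. The type `block128` and both function names are hypothetical, chosen only for illustration; whether the compiler actually expands the copy inline depends on the compiler version and optimization flags.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 128-byte record; name and layout are illustrative only. */
typedef struct { unsigned char bytes[128]; } block128;

static_assert(sizeof(block128) == 128, "block128 must be exactly 128 bytes");

/* The size is a compile-time constant, so the compiler is free to expand
   the copy inline instead of emitting a call to memcpy. */
void copy_block(block128 *dst, const block128 *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Plain struct assignment carries the same static size information. */
void assign_block(block128 *dst, const block128 *src)
{
    *dst = *src;
}
```

Inspecting the generated assembly (for example with `gcc -O2 -S`) shows whether a call to memcpy remains or the copy was expanded in place.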
>
> If the blocks are not aligned, you will probably get best
> performance from a simple MOVSD . For aligned blocks,
> unrolled XMM moves are likely faster.
This is the crucial part: With aligned src/dst blocks, 8 xmm registers
worth of unrolled MOVDQA operations will beat anything else.
If you only know that _most_ blocks will be properly aligned, but you
cannot guarantee it, then I'd either use an upfront test (OR together
the two address registers, then test the low 4 bits), or I'd use the
MOVDQU instructions and hope that they will run nearly as fast as MOVDQA
if the targets are aligned.
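The two options above can be sketched with SSE2 intrinsics in C rather than raw assembler. This is an illustration under assumptions, not a tuned routine: the function names are hypothetical, an x86 target with SSE2 is assumed, and `_mm_load_si128`/`_mm_store_si128` normally compile to MOVDQA while `_mm_loadu_si128`/`_mm_storeu_si128` compile to MOVDQU.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Both pointers must be 16-byte aligned: eight aligned loads, eight stores. */
void copy128_aligned(void *dst, const void *src)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    assert((((uintptr_t)dst | (uintptr_t)src) & 15) == 0);

    __m128i r0 = _mm_load_si128(s + 0), r1 = _mm_load_si128(s + 1);
    __m128i r2 = _mm_load_si128(s + 2), r3 = _mm_load_si128(s + 3);
    __m128i r4 = _mm_load_si128(s + 4), r5 = _mm_load_si128(s + 5);
    __m128i r6 = _mm_load_si128(s + 6), r7 = _mm_load_si128(s + 7);

    _mm_store_si128(d + 0, r0); _mm_store_si128(d + 1, r1);
    _mm_store_si128(d + 2, r2); _mm_store_si128(d + 3, r3);
    _mm_store_si128(d + 4, r4); _mm_store_si128(d + 5, r5);
    _mm_store_si128(d + 6, r6); _mm_store_si128(d + 7, r7);
}

/* No alignment requirement: the same shape with unaligned loads and stores. */
void copy128_unaligned(void *dst, const void *src)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    __m128i r0 = _mm_loadu_si128(s + 0), r1 = _mm_loadu_si128(s + 1);
    __m128i r2 = _mm_loadu_si128(s + 2), r3 = _mm_loadu_si128(s + 3);
    __m128i r4 = _mm_loadu_si128(s + 4), r5 = _mm_loadu_si128(s + 5);
    __m128i r6 = _mm_loadu_si128(s + 6), r7 = _mm_loadu_si128(s + 7);

    _mm_storeu_si128(d + 0, r0); _mm_storeu_si128(d + 1, r1);
    _mm_storeu_si128(d + 2, r2); _mm_storeu_si128(d + 3, r3);
    _mm_storeu_si128(d + 4, r4); _mm_storeu_si128(d + 5, r5);
    _mm_storeu_si128(d + 6, r6); _mm_storeu_si128(d + 7, r7);
}

/* Upfront test: OR the two addresses together, then test the low 4 bits. */
void copy128(void *dst, const void *src)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0)
        copy128_aligned(dst, src);
    else
        copy128_unaligned(dst, src);
}
```

If nearly all blocks are aligned, the branch in `copy128` should be well predicted; calling `copy128_unaligned` unconditionally is the other option and only a measurement on the target CPU can say which is faster.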
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje, 3/2/2009 4:52:52 PM
Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> If the blocks are not aligned, you will probably get best
>> performance from a simple MOVSD . For aligned blocks,
>> unrolled XMM moves are likely faster.
>
> This is the crucial part: With aligned src/dst blocks, 8 xmm registers
> worth of unrolled MOVDQA operations will beat anything else.
Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
> If you only know that _most_ blocks will be properly aligned, but you
> cannot guarantee it, then I'd either use an upfront test (OR together
> the two address registers, then test the low 4 bits), or I'd use the
> MOVDQU instructions and hope that they will run nearly as fast as MOVDQA
> if the targets are aligned.
A bit of a PITA for in-line, but you can make the jmp's predictable.
-- Robert
Robert, 3/2/2009 7:18:30 PM
Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> If the blocks are not aligned, you will probably get best
>>> performance from a simple MOVSD . For aligned blocks,
>>> unrolled XMM moves are likely faster.
>> This is the crucial part: With aligned src/dst blocks, 8 xmm registers
>> worth of unrolled MOVDQA operations will beat anything else.
>
> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
No, no!
If you're moving 128 bytes around, and they aren't all in cached memory,
then the type of instructions you use almost doesn't matter, since
you'll just be waiting for those cache misses anyway.
It is possible to think of situations where NT-style writes make
sense: when you have a big target buffer which isn't going to be
touched at all before network or disk DMA uses it, i.e. no CPU
involvement. But if this is time-critical code and the data is going to
be used some time in the near future, then it had better be in a reused
and cached memory area.
>> If you only know that _most_ blocks will be properly aligned, but you
>> cannot guarantee it, then I'd either use an upfront test (OR together
>> the two address registers, then test the low 4 bits), or I'd use the
>> MOVDQU instructions and hope that they will run nearly as fast as MOVDQA
>> if the targets are aligned.
>
> A bit of a PITA for in-line, but you can make the jmp's predictable.
Right.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje, 3/2/2009 8:14:53 PM
Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
>
> No, no!
>
> If you're moving 128 bytes around, and they aren't all in cached
> memory, then the type of instructions you use almost doesn't matter,
> since you'll just be waiting for those cache misses anyway.
Sure! But since I expect to be writing full cachelines,
why wait for the cache read-in necessary for small mods?
AFAIK, MOVNT* write instructions save the read-before-write.
So you can do READ src - WRITE dst rather than
READ src - READ dst - WRITE dst.
This is why they are fast for block moves.
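A minimal sketch of a 128-byte copy with non-temporal stores, using the SSE2 intrinsic `_mm_stream_si128` (normally compiled to MOVNTDQ). The function name is hypothetical and an x86 target with SSE2 is assumed. The destination must be 16-byte aligned; to cover whole 64-byte cache lines it would have to be 64-byte aligned, which this sketch does not check.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128 */
#include <stdint.h>

/* Copy 128 bytes using non-temporal (streaming) stores to dst. */
void copy128_stream(void *dst, const void *src)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* MOVNTDQ faults on a destination that is not 16-byte aligned. */
    assert(((uintptr_t)dst & 15) == 0);

    for (int i = 0; i < 8; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));

    /* Streaming stores are weakly ordered; fence before the data is
       published to other threads or devices. */
    _mm_sfence();
}
```

Whether this beats ordinary stores depends on whether the destination is needed in cache again soon, so it should be timed against a regular copy on the actual workload.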
-- Robert
Robert, 3/2/2009 9:32:27 PM
Robert Redelmeier wrote:
> Sure! But since I expect to be writing full cachelines,
> why wait for the cache read-in necessary for small mods?
>
> AFAIK, MOVNT* write instructions save the read-before-write.
> So you can do READ src - WRITE dst rather than
> READ src - READ dst - WRITE dst .
>
> This is why they are fast for block moves.
Non-temporal moves to memory won't be cached (IIRC).
So they won't pollute your cache, but they limit
your throughput to the (available) RAM bandwidth,
which is far less than the L1/L2 bandwidth.
So MOVNT can help if your working set fits in the cache
only when the written data is left out. If you have a larger
cache (or a smaller working set), so that everything fits,
MOVNT hinders performance.
I timed this on AMD K8 or Core2 for a multi-step
streaming image processing filter. MOVNT for the intermediate
results helped for large images (large working set)
but slowed down processing of small images.
Hendrik vdH
Hendrik, 3/3/2009 7:23:21 AM
Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
>> No, no!
>>
>> If you're moving 128 bytes around, and they aren't all in cached
>> memory, then the type of instructions you use almost doesn't matter,
>> since you'll just be waiting for those cache misses anyway.
>
> Sure! But since I expect to be writing full cachelines,
> why wait for the cache read-in necessary for small mods?
OK, that's valid.
>
> AFAIK, MOVNT* write instructions save the read-before-write.
> So you can do READ src - WRITE dst rather than
> READ src - READ dst - WRITE dst .
But you still want the target lines to end up in the cache, i.e. you
want the CPU to realize that full cache lines are being written, so it
can just claim ownership without doing that initial read-for-ownership
operation.
AFAIR this depends, among other things, on how the memory is set up in
the MTRR tables.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje, 3/3/2009 8:48:20 AM
Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> AFAIK, MOVNT* write instructions save the read-before-write.
>> So you can do READ src - WRITE dst rather than
>> READ src - READ dst - WRITE dst .
>
> But you still want the target lines to end up in the cache,
> i.e. you want the CPU to realize that full cache lines are
> being written, so it can just claim ownership without doing
> that initial read-for-ownership operation.
You might want those 128 bytes in cache if you are going
to further process them soon. I assumed they already had
been crunched, and were being written out to RAM for storage
("an exercise in caching"). But only the OP knows.
> Afair this depends, among other things, on how the memory
> is setup in the MTRR tables.
Yes, but not something under user control in many OSes.
-- Robert
Robert, 3/3/2009 12:59:21 PM