memcpy 128 bytes


Hi all,

I am a complete newbie.

I am looking for a very fast way to copy 128 bytes on the Intel Xeon 5400 series.

For now I use memcpy (used intensively: __builtin_memcpy, GCC 4.2.0).

Do you have better code?

Thanks

Reply Bruno 3/2/2009 11:31:48 AM

(Bruno Causse) wrote:
> Hi all,
> 
> I am a complete newbie.
> 
> I am looking for a very fast way to copy 128 bytes on the Intel Xeon 5400 series.
> 
> For now I use memcpy (used intensively: __builtin_memcpy, GCC 4.2.0).
> 
> Do you have better code?
> 
> Thanks
> 

Do you really think that this is the right newsgroup? I think that 
memcpy() is a C function, so ask there.

BTW, I think that MOVS should be fast for copying bytes in assembler.

Uwe


======================================= MODERATOR'S COMMENT: 
 Maybe the gentleman wants something specific to the Xeon 5400 series of processors?

-- jIm

Reply Uwe 3/2/2009 2:05:23 PM


Uwe Plonus <spamtrap@crayne.org> wrote:

> (Bruno Causse) wrote:
> Do you really think that this is the right newsgroup? I think that 
> memcpy() is a C function, so ask there.
> 

Yes, I am looking for an asm function (specifically for 128 bytes).

> ======================================= MODERATOR'S COMMENT: 
> Maybe the gentleman wants something specific to the
> Xeon 5400 series of processors?

Right.

Reply Bruno 3/2/2009 3:18:47 PM

"(Bruno Causse)" <spamtrap@crayne.org> wrote in part:
> Hi all, I am a complete newbie.
> I am looking for a very fast way to copy 128 bytes on the Intel Xeon 5400 series.
> For now I use memcpy (used intensively: __builtin_memcpy, GCC 4.2.0).
> Do you have better code?

How will you know?  What measurement techniques are you using?
Unless memcpy() is somehow in-lined during compilation,
it will incur significant overhead.

If the blocks are not aligned, you will probably get best
performance from a simple  MOVSD  .  For aligned blocks,
unrolled XMM moves are likely faster.  


-- Robert

Reply Robert 3/2/2009 4:05:40 PM

Robert Redelmeier wrote:
> "(Bruno Causse)" <spamtrap@crayne.org> wrote in part:
>> Hi all, I am a complete newbie.
>> I am looking for a very fast way to copy 128 bytes on the Intel Xeon 5400 series.
>> For now I use memcpy (used intensively: __builtin_memcpy, GCC 4.2.0).
>> Do you have better code?
> 
> How will you know?  What measurement techniques are you using?
> Unless memcpy() is somehow in-lined during compilation,
> it will incur significant overhead.

If the compiler knows statically (i.e. at compile time) that 
the area is exactly 128 bytes, then inline code can be quite good.
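As a rough illustration of the compile-time-size point, here is a minimal C sketch. The type name block128 and the function name copy_block128 are made up for this example and are not from the original post. Whether GCC actually expands the call inline depends on the compiler version and optimization flags, so it is worth checking the generated assembly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 128-byte record type; the name is an assumption
   made for this sketch only. */
typedef struct {
    unsigned char bytes[128];
} block128;

/* The size passed to memcpy is a compile-time constant (sizeof *dst),
   which gives the compiler the chance to expand the copy inline
   instead of calling the library routine. */
static inline void copy_block128(block128 *dst, const block128 *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

Plain struct assignment (*dst = *src) gives the compiler the same information and is another option to try.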
> 
> If the blocks are not aligned, you will probably get best
> performance from a simple  MOVSD  .  For aligned blocks,
> unrolled XMM moves are likely faster.  

This is the crucial part: With aligned src/dst blocks, 8 xmm registers 
worth of unrolled MOVDQA operations will beat anything else.

If you only know that _most_ blocks will be properly aligned, but you 
cannot  guarantee it, then I'd either use an upfront test (OR together 
the two address registers, then test the low 4 bits), or I'd use the 
MOVDQU instructions and hope that they will run nearly as fast as MOVDQA 
if the targets are aligned.
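A sketch of the upfront alignment test, written with SSE2 intrinsics in C rather than hand-written assembler. The function name copy128_sse2 is hypothetical, and the code assumes an x86 compiler that provides <emmintrin.h>. The aligned intrinsics map to MOVDQA and the unaligned ones to MOVDQU; it is untested for speed on any particular CPU.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copy exactly 128 bytes using eight 16-byte values.
   Hypothetical helper, not taken from any library. */
static inline void copy128_sse2(void *dst, const void *src)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    __m128i r0, r1, r2, r3, r4, r5, r6, r7;

    assert(dst != NULL && src != NULL);

    /* OR the two addresses together, then test the low 4 bits:
       zero means both pointers are 16-byte aligned. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) == 0) {
        /* Aligned path: 8 loads, then 8 stores (MOVDQA). */
        r0 = _mm_load_si128(s + 0);  r1 = _mm_load_si128(s + 1);
        r2 = _mm_load_si128(s + 2);  r3 = _mm_load_si128(s + 3);
        r4 = _mm_load_si128(s + 4);  r5 = _mm_load_si128(s + 5);
        r6 = _mm_load_si128(s + 6);  r7 = _mm_load_si128(s + 7);
        _mm_store_si128(d + 0, r0);  _mm_store_si128(d + 1, r1);
        _mm_store_si128(d + 2, r2);  _mm_store_si128(d + 3, r3);
        _mm_store_si128(d + 4, r4);  _mm_store_si128(d + 5, r5);
        _mm_store_si128(d + 6, r6);  _mm_store_si128(d + 7, r7);
    } else {
        /* Fallback path: same shape with unaligned moves (MOVDQU). */
        r0 = _mm_loadu_si128(s + 0); r1 = _mm_loadu_si128(s + 1);
        r2 = _mm_loadu_si128(s + 2); r3 = _mm_loadu_si128(s + 3);
        r4 = _mm_loadu_si128(s + 4); r5 = _mm_loadu_si128(s + 5);
        r6 = _mm_loadu_si128(s + 6); r7 = _mm_loadu_si128(s + 7);
        _mm_storeu_si128(d + 0, r0); _mm_storeu_si128(d + 1, r1);
        _mm_storeu_si128(d + 2, r2); _mm_storeu_si128(d + 3, r3);
        _mm_storeu_si128(d + 4, r4); _mm_storeu_si128(d + 5, r5);
        _mm_storeu_si128(d + 6, r6); _mm_storeu_si128(d + 7, r7);
    }
}
```

Dropping the test and always taking the unaligned path is the second option mentioned above; which one wins would have to be measured.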

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Reply Terje 3/2/2009 4:52:52 PM

Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> If the blocks are not aligned, you will probably get best
>> performance from a simple  MOVSD  .  For aligned blocks,
>> unrolled XMM moves are likely faster.  
> 
> This is the crucial part: With aligned src/dst blocks, 8 xmm registers 
> worth of unrolled MOVDQA operations will beat anything else.

Especially with MOVNTDQ for stores (avoid the cacheline pre-read).

> If you only know that _most_ blocks will be properly aligned, but you 
> cannot  guarantee it, then I'd either use an upfront test (OR together 
> the two address registers, then test the low 4 bits), or I'd use the 
> MOVDQU instructions and hope that they will run nearly as fast as MOVDQA 
> if the targets are aligned.

A bit of a PITA for in-line, but you can make the jmp's predictable.

-- Robert

Reply Robert 3/2/2009 7:18:30 PM

Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> If the blocks are not aligned, you will probably get best
>>> performance from a simple  MOVSD  .  For aligned blocks,
>>> unrolled XMM moves are likely faster.  
>> This is the crucial part: With aligned src/dst blocks, 8 xmm registers 
>> worth of unrolled MOVDQA operations will beat anything else.
> 
> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).

No, no!

If you're moving 128 bytes around, and they aren't all in cached memory, 
then the type of instructions you use almost doesn't matter, since 
you'll just be waiting for those cache misses anyway.

It is possible to think of situations where NT-style writes make 
sense: when you have a big target buffer which isn't going to be 
touched at all before network or disk DMA uses it, i.e. no CPU 
involvement. But if this is time-critical code and the data is going to 
be used some time in the near future, then it had better be in a reused 
and cached memory area.

>> If you only know that _most_ blocks will be properly aligned, but you 
>> cannot  guarantee it, then I'd either use an upfront test (OR together 
>> the two address registers, then test the low 4 bits), or I'd use the 
>> MOVDQU instructions and hope that they will run nearly as fast as MOVDQA 
>> if the targets are aligned.
> 
> A bit of a PITA for in-line, but you can make the jmp's predictable.

Right.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Reply Terje 3/2/2009 8:14:53 PM

Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
> 
> No, no!
> 
> If you're moving 128 bytes around, and they aren't all in cached
> memory, then the type of instructions you use almost doesn't matter,
> since you'll just be waiting for those cache misses anyway.

Sure!  But since I expect to be writing full cachelines,
why wait for the cache read-in necessary for small mods?

AFAIK, MOVNT* write instructions save the read-before-write.
So you can do READ src - WRITE dst  rather than
READ src - READ dst - WRITE dst .

This is why they are fast for block moves.
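For illustration only, a C sketch of a 128-byte copy with non-temporal stores, using the SSE2 intrinsic _mm_stream_si128 (which maps to MOVNTDQ). The function name copy128_stream is hypothetical. The sketch assumes the destination is 16-byte aligned, which MOVNTDQ requires, and makes no claim about whether it is faster in any given situation.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_sfence */
#include <stdint.h>

/* Copy exactly 128 bytes with non-temporal (streaming) stores.
   Hypothetical helper; dst must be 16-byte aligned, src may be
   unaligned because it is read with MOVDQU-style loads. */
static inline void copy128_stream(void *dst, const void *src)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    int i;

    assert(((uintptr_t)dst & 15u) == 0);

    /* Eight 16-byte stores cover two full 64-byte cache lines. */
    for (i = 0; i < 8; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));

    /* Streaming stores are weakly ordered; the fence makes them
       globally visible before any later stores. */
    _mm_sfence();
}
```

The fence has a cost of its own, so for a copy this small it should be included in any timing comparison.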

-- Robert

Reply Robert 3/2/2009 9:32:27 PM

Robert Redelmeier wrote:
> Sure!  But since I expect to be writing full cachelines,
> why wait for the cache read-in necessary for small mods?
> 
> AFAIK, MOVNT* write instructions save the read-before-write.
> So you can do READ src - WRITE dst  rather than
> READ src - READ dst - WRITE dst .
> 
> This is why they are fast for block moves.

Non-temporal moves to memory won't be cached (IIRC).
They therefore won't pollute your cache, but they limit
your throughput to the (available) RAM bandwidth,
which is far lower than the L1/L2 bandwidth.

So MOVNT might be nice if your working set fits in the
cache only when the written data is left out. If you have a larger
cache (or a smaller working set), so that the written data
fits as well, MOVNT hinders performance.

I timed this on AMD K8 or Core2 for some multi-step
streaming image processing filter. MOVNT for the intermediate
results helped for large images (large working set)
but slowed down processing of small images.


Hendrik vdH

Reply Hendrik 3/3/2009 7:23:21 AM

Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> Especially with MOVNTDQ for stores (avoid the cacheline pre-read).
>> No, no!
>>
>> If you're moving 128 bytes around, and they aren't all in cached
>> memory, then the type of instructions you use almost doesn't matter,
>> since you'll just be waiting for those cache misses anyway.
> 
> Sure!  But since I expect to be writing full cachelines,
> why wait for the cache read-in necessary for small mods?

OK, that's valid.
> 
> AFAIK, MOVNT* write instructions save the read-before-write.
> So you can do READ src - WRITE dst  rather than
> READ src - READ dst - WRITE dst .

But you still want the target lines to end up in the cache, i.e. you 
want the CPU to realize that full cache lines are being written, so it 
can just claim ownership without doing that initial read-for-ownership 
operation.

AFAIR this depends, among other things, on how the memory is set up in 
the MTRR tables.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Reply Terje 3/3/2009 8:48:20 AM

Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> AFAIK, MOVNT* write instructions save the read-before-write.
>> So you can do READ src - WRITE dst  rather than
>> READ src - READ dst - WRITE dst .
> 
> But you still want the target lines to end up in the cache,
> i.e. you want the CPU to realize that full cache lines are
> being written, so it can just claim ownership without doing
> that initial read-for-ownership operation.

You might want those 128 bytes in cache if you are going
to further process them soon.  I assumed they already had
been crunched, and were being written out to RAM for storage
("an exercise in caching").  But only the OP knows.

> AFAIR this depends, among other things, on how the memory
> is set up in the MTRR tables.

Yes, but not something under user control in many OSes.


-- Robert

Reply Robert 3/3/2009 12:59:21 PM
