AMD vs Intel timing on this code...

On my AMD XP, the following

  mov eax, [esi]
  add esi, 4
  jmp [eax]

appears to execute around the same speed as

  lodsd
  jmp [eax]

Which is faster on Intel (586 and above, 486 and below are of no
interest)?

The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
the mov(mem)+add(imm). But I'm seeing the same results (without using
RDTSC, just a very large number of iterations, 1 billion+). What's
causing the slowdown? I would have thought lodsd was slower than the
equivalent directpath instructions. Is it a dependency on ESI?

Any faster way of writing this?

-- 
Regards
Alex McDonald

Reply alex_mcd 1/12/2004 1:59:25 PM

"Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
news:b57b10b6.0401120553.187b5ba9@posting.google.com...
> On my AMD XP, the following
>
>   mov eax, [esi]
>   add esi, 4
>   jmp [eax]
>
> appears to execute around the same speed as
>
>   lodsd
>   jmp [eax]

They likely produce identical or extremely similar microcode. The lodsd
instruction does have to do a little more because of DF, but that should
overlap with the load.

> Which is faster on Intel (586 and above, 486 and below are of no
> interest)?

In general it will be the mov/add. On a Pentium, the two would be equally
fast unless you pair something with the mov/add instructions. The P6 core
seems mostly neutral: they're both 2 u-ops. However, the mov/add
may decode faster. Intel doesn't bother listing lods latency for P4. I took
a crude measurement, and it seems to be about 6 cycles. (I'm not sure how
accurate that is.) This would make a mov/add pair 3 times faster.

> The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
> the mov(mem)+add(imm). But I'm seeing the same results (without using
> RDTSC, just a very large number of iterations, 1 billion+). What's
> causing the slowdown? I would have thought lodsd was slower than the
> equivalent directpath instructions. Is it a dependency on ESI?

What? You mean 3+1 for mov + add, right? They should overlap as well which
gives 1 clock less latency than lodsd (4 clk). However, it is quite possible
that eax is available before the full 4 cycles in which case there would be
no difference. VectorPath isn't necessarily slow, either; it does inhibit
decoding of further instructions, but since you jmp right after that really
doesn't matter.

> Any faster way of writing this?

No. It looks like you're getting about 7-8 cycles latency, and for 2 cache
accesses you need at least 6 cycles anyway. From the looks of it, this is a
LUT for an emulator or something of that sort, so your jmp would be
unpredictable. That stall is going to be a lot worse than the lookup itself.

-Matt


Reply Matt 1/13/2004 12:13:15 AM


"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<_AGMb.2670$873.56076@twister.tampabay.rr.com>...
> "Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
> news:b57b10b6.0401120553.187b5ba9@posting.google.com...

===snipped

> 
> > The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
> > the mov(mem)+add(imm). But I'm seeing the same results (without using
> > RDTSC, just a very large number of iterations, 1 billion+). What's
> > causing the slowdown? I would have thought lodsd was slower than the
> > equivalent directpath instructions. Is it a dependency on ESI?
> 
> What? You mean 3+1 for mov + add, right? 

Typo; should have been 4 (3+1)

===snipped 

> From the looks of it, this is a
> LUT for an emulator or something of that sort, 

Good guess; Forth indirect threaded code; part of the inner
interpreter.
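
For readers who don't know indirect threading, here is a minimal C sketch of what the three-instruction fragment does. All names (`word_t`, `run`, `do_inc`, `do_halt`, `run_demo`) are made up for illustration and are not taken from any real Forth system. It also uses a call inside a loop where the assembly uses a true jump, so it shows the data flow rather than the exact control flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: the first cell of every word is its code field,
   a pointer to the machine code that executes the word. */
typedef void (*handler_t)(void);

typedef struct word {
    handler_t code;              /* what "jmp [eax]" jumps through */
} word_t;

static const word_t *const *ip;  /* plays the role of ESI */
static int running;
static int counter;

static void do_inc(void)  { counter++; }
static void do_halt(void) { running = 0; }

static const word_t w_inc  = { do_inc };
static const word_t w_halt = { do_halt };

/* NEXT: fetch the word address, advance ip, go through the code field. */
static void run(const word_t *const *thread)
{
    assert(thread != NULL);
    ip = thread;
    running = 1;
    while (running) {
        const word_t *w = *ip;   /* mov eax, [esi] */
        ip++;                    /* add esi, 4     */
        w->code();               /* jmp [eax]      */
    }
}

/* Runs a three-increment thread and returns the final counter value. */
int run_demo(void)
{
    static const word_t *const thread[] = { &w_inc, &w_inc, &w_inc, &w_halt };
    counter = 0;
    run(thread);
    return counter;
}
```

The two dependent loads (`*ip`, then `w->code`) are the two cache accesses Matt counts, and the indirect transfer at the end is the branch he expects to be unpredictable.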

> so your jmp would be
> unpredictable. That stall is going to be a lot worse than the lookup itself.

Thanks for the analysis. I'll stick with the longer mov+add sequence
if the impact on Intel processors is that great. lodsd does have the
advantage of brevity however; the code sequence is only 3 bytes
instead of 7, and this fragment appears very regularly in the code.

> 
> -Matt

Reply alex_mcd 1/14/2004 12:52:50 PM

"Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
news:b57b10b6.0401140442.612af022@posting.google.com...
> "Matt Taylor" <para@tampabay.rr.com> wrote in message
> news:<_AGMb.2670$873.56076@twister.tampabay.rr.com>...
<snip>
> > so your jmp would be
> > unpredictable. That stall is going to be a lot worse than the lookup
> > itself.
>
> Thanks for the analysis. I'll stick with the longer mov+add sequence
> if the impact on Intel processors is that great. lodsd does have the
> advantage of brevity however; the code sequence is only 3 bytes
> instead of 7, and this fragment appears very regularly in the code.

Possibly. A couple points:

* Pentium will favor lodsd (size) *if* you can't pair anything with the
mov+add. You need context to make that decision.
* P6-core probably won't be sensitive to the decode issues since you branch
anyway.
* The lodsd instruction may be faster than 6 clocks on a Pentium-4, and the
effective latency for you may be less still since the micro-code can be
rescheduled. I've had difficulty getting accurate P4 timings before, so I
intended 6 to be a *crude* estimate of the latency rather than an absolute
number.

If it's used a lot, the code size may play a role in the performance as
well. I've optimized routines before only to find that the increased code
size outweighed the performance benefit. The best answer is to profile both
versions.
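
One rough way to set up such a comparison, sketched in C. The names `time_variant` and `variant_a` are placeholders invented for this example; `variant_a` is only a stand-in workload, and in practice each variant would be a build of the interpreter using one of the two NEXT sequences. Taking the best of several runs is just one reasonable way to reduce noise, and `clock()` is coarse, so the iteration count needs to be large.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A variant is any routine that performs `iterations` units of work. */
typedef unsigned long (*variant_fn)(unsigned long iterations);

/* Placeholder workload: sums 0..iterations-1. The volatile accumulator
   keeps the compiler from removing the loop. */
unsigned long variant_a(unsigned long iterations)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < iterations; i++)
        acc += i;
    return acc;
}

/* Runs fn `repeats` times and returns the best (lowest) CPU time in
   seconds as measured by clock(). */
double time_variant(variant_fn fn, unsigned long iterations, int repeats)
{
    assert(fn != NULL && repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        (void)fn(iterations);
        clock_t stop = clock();
        double secs = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

Calling `time_variant` once per variant with the same iteration count and comparing the two results gives a whole-program figure, which also captures any code-size effect that a cycle count for the fragment alone would miss.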

-Matt


Reply Matt 1/14/2004 9:50:51 PM

Alex,

It does tend to be processor specific and, in particular with Intel
hardware, the old string instructions have usually been well off the
pace since the early Pentiums. I think from PII onwards there is special
case circuitry in Intel hardware when you use REP in conjunction with
either MOVS or STOS, but it only starts to cut in well over 100 bytes.
Particularly on a PIV, copying a short string repeatedly (which is not
an uncommon operation) is a lot faster with registers and indexing than
using the old string instructions.

I have an old AMD K6-2 550 here which has an unusually fast LOOP
instruction, so much so that it crashes a default installation of
Win95B and needed a patch to fix it.

Generally I would stay away from the old string instructions unless
you are using REP with either MOVSD or STOSD.

Regards,

hutch at movsd dot com

Reply hutch 1/15/2004 6:51:16 PM

"hutch--" <hutch@movsd.com> wrote in message
news:af910ce4.0401150432.1c7e3a8d@posting.google.com...
> Alex,
>
> It does tend to be processor specific and in particular with Intel
> hardware, the old string instructions have usually been well off the
> pace since early pentiums. I think from PII onwards there is special
> case circuitry in Intel hardware when you use REP in conjunction with
> either MOVS or STOS but it only starts to cut in well over 100 bytes.
> Particularly on a PIV copying a short string repeatedly which is not
> an uncommon operation is a lot faster with registers and indexing than
> using the old string instructions.
>
> I have an old AMD K6-2 550 here which has an unusually fast LOOP
> instruction, so much so that it crashes a default installation of
> Win95B and needed a patch to fix it.
>
> Generally I would stay away from the old string instructions unless
> you are using REP with either MOVSD or STOSD.
>
> Regards,
>
> hutch at movsd dot com
>

I've tested on a Compaq Presario laptop, PIII 700 MHz; the timings on
lodsd/jmp [eax] are truly awful. My Forth system slows down by over 40% on
simple loops through the code with lodsd. On the XP, it's exactly the same
as mov+add; I'm now tempted to modify the build to detect the processor and
select code that way. Shame really; the short lodsd code reduces the runtime
by quite a slice.
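
A sketch in C of how that selection might look. `next_variant` and `choose_next` are hypothetical names, and the rule itself is an assumption drawn only from the measurements in this thread (lodsd no slower on the Athlon XP, much slower on the PIII). The vendor string would come from CPUID leaf 0; how it is read is left out here. Vendor alone is a coarse test, since older AMD parts report the same string, so a real build would probably also check the family.

```c
#include <assert.h>
#include <string.h>

/* The two NEXT sequences discussed in this thread. */
typedef enum {
    NEXT_MOV_ADD,   /* mov eax,[esi] / add esi,4 / jmp [eax] */
    NEXT_LODSD      /* lodsd / jmp [eax]                     */
} next_variant;

/* Picks a NEXT sequence from the 12-character CPUID vendor string.
   Assumption: only CPUs reporting "AuthenticAMD" get the short lodsd
   form; everything else falls back to mov+add, the safer choice on the
   Intel parts measured above. */
next_variant choose_next(const char *vendor)
{
    assert(vendor != NULL);
    if (strcmp(vendor, "AuthenticAMD") == 0)
        return NEXT_LODSD;
    return NEXT_MOV_ADD;
}
```

Defaulting to mov+add for unknown vendors costs 4 bytes per occurrence but avoids the 40% slowdown seen on the PIII.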

Thanks to you & Matt for the help on this.

-- 
Regards
Alex McDonald



Reply Alex 1/15/2004 10:59:56 PM
