On my AMD XP, the following
mov eax, [esi]
add esi, 4
jmp [eax]
appears to execute around the same speed as
lodsd
jmp [eax]
Which is faster on Intel (586 and above, 486 and below are of no
interest)?
The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
the mov(mem)+add(imm). But I'm seeing the same results (without using
RDTSC, just a very large number of iterations, 1 billion+). What's
causing the slowdown? I would have thought lodsd was slower than the
equivalent directpath instructions. Is it a dependency on ESI?
Any faster way of writing this?
--
Regards
Alex McDonald
alex_mcd | 1/12/2004 1:59:25 PM
"Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
news:b57b10b6.0401120553.187b5ba9@posting.google.com...
> On my AMD XP, the following
>
> mov eax, [esi]
> add esi, 4
> jmp [eax]
>
> appears to execute around the same speed as
>
> lodsd
> jmp [eax]
They likely produce identical or extremely similar microcode. The lodsd
instruction does have to do a little more because of DF, but that should
overlap with the load.
> Which is faster on Intel (586 and above, 486 and below are of no
> interest)?
In general it will be the mov/add. On a Pentium, these would also be
identically fast unless you pair something with the mov/add instructions.
P6-core seems mostly neutral: they're both 2 u-ops. However, the mov/add
may decode faster. Intel doesn't bother listing lods latency for P4. I took
a crude measurement, and it seems to be about 6 cycles. (I'm not sure how
accurate that is.) This would make a mov/add pair 3 times faster.
> The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
> the mov(mem)+add(imm). But I'm seeing the same results (without using
> RDTSC, just a very large number of iterations, 1 billion+). What's
> causing the slowdown? I would have thought lodsd was slower than the
> equivalent directpath instructions. Is it a dependency on ESI?
What? You mean 3+1 for mov + add, right? They should overlap as well which
gives 1 clock less latency than lodsd (4 clk). However, it is quite possible
that eax is available before the full 4 cycles in which case there would be
no difference. VectorPath isn't necessarily slow, either; it does inhibit
decoding of further instructions, but since you jmp right after that really
doesn't matter.
> Any faster way of writing this?
No. It looks like you're getting about 7-8 cycles latency, and for 2 cache
accesses you need at least 6 cycles anyway. From the looks of it, this is a
LUT for an emulator or something of that sort, so your jmp would be
unpredictable. That stall is going to be a lot worse than the lookup itself.
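As a rough worked version of the cycle counts above (a sketch only: the latencies are the figures quoted in this thread for the Athlon XP, not measurements, and the names LOAD_LATENCY, mov_add_latency, lodsd_latency and dependent_loads_floor are made up for illustration):

```c
/* Illustrative latencies in cycles, taken from the figures quoted in this
   thread. They are assumptions for the arithmetic, not measured values. */
enum { LOAD_LATENCY = 3, ADD_LATENCY = 1, LODSD_LATENCY = 4 };

/* mov eax,[esi] and add esi,4 can execute side by side, so the pair
   finishes after the longer of the two latencies, not after their sum. */
int mov_add_latency(void) {
    return LOAD_LATENCY > ADD_LATENCY ? LOAD_LATENCY : ADD_LATENCY;
}

/* Documented latency of the single lodsd instruction. */
int lodsd_latency(void) {
    return LODSD_LATENCY;
}

/* jmp [eax] performs a second load whose address comes from the first
   load, so the two cache accesses cannot overlap. */
int dependent_loads_floor(void) {
    return 2 * LOAD_LATENCY;
}
```

With these inputs the mov+add pair comes out one cycle ahead of lodsd (3 against 4), and the two dependent loads alone account for 6 of the observed 7-8 cycles.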
-Matt
Matt | 1/13/2004 12:13:15 AM
"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<_AGMb.2670$873.56076@twister.tampabay.rr.com>...
> "Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
> news:b57b10b6.0401120553.187b5ba9@posting.google.com...
===snipped
>
> > The AMD docs have vectorpath 4 cycles for lodsd, and 4+4 directpath for
> > the mov(mem)+add(imm). But I'm seeing the same results (without using
> > RDTSC, just a very large number of iterations, 1 billion+). What's
> > causing the slowdown? I would have thought lodsd was slower than the
> > equivalent directpath instructions. Is it a dependency on ESI?
>
> What? You mean 3+1 for mov + add, right?
Typo; should have been 4 (3+1)
===snipped
> From the looks of it, this is a
> LUT for an emulator or something of that sort,
Good guess; Forth indirect threaded code; part of the inner
interpreter.
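For readers who don't know the technique, here is a minimal sketch of an indirect threaded inner interpreter in portable C. It is not the actual Forth system discussed here: the names word_t, run_thread, run_demo and the three toy words are invented for illustration, and C uses a call/return loop where the assembly jumps directly.

```c
typedef struct word word_t;
typedef void (*code_fn)(void);

/* The first cell of every word is its code field: a pointer to the
   routine that executes the word. This is the cell `jmp [eax]` reads. */
struct word {
    code_fn code;
};

static int acc;      /* toy state so the words have a visible effect */
static int running;  /* cleared by the halt word to leave the loop */

static void do_inc(void)  { acc += 1; }
static void do_dbl(void)  { acc *= 2; }
static void do_halt(void) { running = 0; }

static const word_t w_inc  = { do_inc };
static const word_t w_dbl  = { do_dbl };
static const word_t w_halt = { do_halt };

/* The inner interpreter. `ip` plays the role of ESI. */
static void run_thread(const word_t *const *ip) {
    running = 1;
    while (running) {
        const word_t *w = *ip++;  /* mov eax,[esi] + add esi,4 (or lodsd) */
        w->code();                /* jmp [eax]: a second, dependent load */
    }
}

/* Runs the thread inc, dbl, inc, halt starting from acc = 0. */
int run_demo(void) {
    static const word_t *const thread[] = { &w_inc, &w_dbl, &w_inc, &w_halt };
    acc = 0;
    run_thread(thread);
    return acc;
}
```

Each step fetches a word address from the thread and then fetches the code pointer stored in that word, which is why two cache accesses sit on the critical path of every dispatch.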
> so your jmp would be
> unpredictable. That stall is going to be a lot worse than the lookup itself.
Thanks for the analysis. I'll stick with the longer mov+add sequence
if the impact on Intel processors is that great. lodsd does have the
advantage of brevity however; the code sequence is only 3 bytes
instead of 7, and this fragment appears very regularly in the code.
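The 3-versus-7 byte count can be checked against the 32-bit encodings an assembler would typically emit for these instructions. The following is only a sketch: the array and function names are made up, and an assembler is free to pick a longer encoding for the add.

```c
#include <stddef.h>

/* lodsd / jmp [eax] in 32-bit mode */
static const unsigned char next_lodsd[] = {
    0xAD,              /* lodsd */
    0xFF, 0x20         /* jmp [eax] */
};

/* mov eax,[esi] / add esi,4 / jmp [eax] in 32-bit mode */
static const unsigned char next_mov_add[] = {
    0x8B, 0x06,        /* mov eax, [esi] */
    0x83, 0xC6, 0x04,  /* add esi, 4 (sign-extended imm8 form) */
    0xFF, 0x20         /* jmp [eax] */
};

size_t lodsd_bytes(void)   { return sizeof next_lodsd; }
size_t mov_add_bytes(void) { return sizeof next_mov_add; }

/* Extra code bytes paid if the sequence is inlined `copies` times. */
size_t extra_bytes(size_t copies) {
    return copies * (sizeof next_mov_add - sizeof next_lodsd);
}
```

For example, extra_bytes(1000) gives 4000, so a system that inlines the sequence a thousand times grows by roughly 4 KB with the mov+add form.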
>
> -Matt
alex_mcd | 1/14/2004 12:52:50 PM
"Alex McDonald" <alex_mcd@btopenworld.com> wrote in message
news:b57b10b6.0401140442.612af022@posting.google.com...
> "Matt Taylor" <para@tampabay.rr.com> wrote in message
> news:<_AGMb.2670$873.56076@twister.tampabay.rr.com>...
<snip>
> > so your jmp would be
> > unpredictable. That stall is going to be a lot worse than the lookup
> > itself.
>
> Thanks for the analysis. I'll stick with the longer mov+add sequence
> if the impact on Intel processors is that great. lodsd does have the
> advantage of brevity however; the code sequence is only 3 bytes
> instead of 7, and this fragment appears very regularly in the code.
Possibly. A couple of points:
* Pentium will favor lodsd (size) *if* you can't pair anything with the
mov+add. You need context to make that decision.
* P6-core probably won't be sensitive to the decode issues since you branch
anyway.
* The lodsd instruction may be faster than 6 clocks on a Pentium 4, and the
effective latency for you may be less still since the microcode can be
rescheduled. I've had difficulty getting accurate P4 timings before, so I
intended 6 to be a *crude* estimate of the latency rather than an absolute
number.
If it's used a lot, the code size may play a role in the performance as
well. I've optimized routines before only to find that the increased code
size outweighed the performance benefit. The best answer is to profile both
versions.
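One simple way to profile two versions is a best-of-N wall-clock harness. The sketch below is portable C and only illustrates the shape of such a harness: variant_a and variant_b are stand-in workloads rather than the two dispatch sequences (those would need inline assembly), and all names are hypothetical.

```c
#include <time.h>

typedef unsigned long (*variant_fn)(unsigned long iterations);

/* Stand-in workloads that compute the same sum in two different ways. */
unsigned long variant_a(unsigned long n) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i;
    return sum;
}

unsigned long variant_b(unsigned long n) {
    unsigned long sum = 0;
    for (unsigned long i = n; i-- > 0; )
        sum += i;
    return sum;
}

/* Keeps the result observable so the call is not optimized away. */
static volatile unsigned long sink;

/* Times `fn` over `runs` repetitions and returns the fastest run in
   seconds. The minimum is used because interrupts and other noise only
   ever add time. Returns -1.0 if `runs` is not positive. */
double best_seconds(variant_fn fn, unsigned long iterations, int runs) {
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        sink = fn(iterations);
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Comparing best_seconds(variant_a, n, 5) with best_seconds(variant_b, n, 5) for a large n gives a first impression; running the whole program with each version is still the more trustworthy test, since it includes the code-size effects.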
-Matt
Matt | 1/14/2004 9:50:51 PM
Alex,
It does tend to be processor specific, and with Intel hardware in
particular the old string instructions have usually been well off the
pace since the early Pentiums. I think from the PII onwards there is
special-case circuitry in Intel hardware when you use REP in conjunction
with either MOVS or STOS, but it only starts to cut in well over 100 bytes.
On a PIV in particular, copying a short string repeatedly, which is not
an uncommon operation, is a lot faster with registers and indexing than
with the old string instructions.
I have an old AMD K6-2 550 here which has an unusually fast LOOP
instruction, so much so that it crashes a default installation of
Win95b and needed a patch to fix it.
Generally I would stay away from the old string instructions unless
you are using REP with either MOVSD or STOSD.
Regards,
hutch at movsd dot com
hutch | 1/15/2004 6:51:16 PM
"hutch--" <hutch@movsd.com> wrote in message
news:af910ce4.0401150432.1c7e3a8d@posting.google.com...
> Alex,
>
> It does tend to be processor specific and in particular with Intel
> hardware, the old string instructions have usually been well off the
> pace since early pentiums. I think from PII onwards there is special
> case circuitry in Intel hardware when you use REP in conjunction with
> either MOVS or STOS but it only starts to cut in well over 100 bytes.
> Particularly on a PIV copying a short string repeatedly which is not
> an uncommon operation is a lot faster with registers and indexing than
> using the old string instructions.
>
> I have an old AMD k6-2 550 here which has an unusually fast LOOP
> instruction, so much so that it crashes a default installation of
> win95b and needed a patch to fix it.
>
> Generally I would stay away from the old string instructions unless
> you are using REP with either MOVSD or STOSD.
>
> Regards,
>
> hutch at movsd dot com
>
I've tested on a Compaq Presario laptop, PIII 700 MHz; the timings on
lodsd/jmp [eax] are truly awful. My Forth system slows down by over 40% on
simple loops through the code with lodsd. On the XP, it's exactly the same
as mov+add; I'm now tempted to modify the build to detect the processor and
select code that way. Shame really; the short lodsd code reduces the runtime
by quite a slice.
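A rough sketch of what that selection could look like, in C for illustration. It assumes the build has already read the 12-character vendor string that CPUID leaf 0 returns ("AuthenticAMD" or "GenuineIntel"); next_variant and choose_next are invented names, and the policy simply mirrors the two results reported in this thread.

```c
#include <string.h>

typedef enum { NEXT_MOV_ADD, NEXT_LODSD } next_variant;

/* Picks the dispatch sequence from the CPUID vendor string.
   Policy (an assumption based only on the timings in this thread):
   use the shorter lodsd form on AMD, where it ran as fast as mov+add,
   and fall back to mov+add everywhere else, including unknown CPUs. */
next_variant choose_next(const char *vendor) {
    if (vendor != NULL && strcmp(vendor, "AuthenticAMD") == 0)
        return NEXT_LODSD;
    return NEXT_MOV_ADD;
}
```

A vendor check alone is coarse: older AMD parts report the same string as the Athlon XP, so a real build would probably also look at the family and model fields, or just time both sequences at startup.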
Thanks to you & Matt for the help on this.
--
Regards
Alex McDonald
Alex | 1/15/2004 10:59:56 PM