push optimization


Hi.

I have a question, since I was unable to find an exact answer to it anywhere.
I need to save/restore CPU context in an OS I'm writing.
The question is which of the 2 following methods is faster:
    1)
     pusha
     ... do some stuff
     popa

    2)
     push eax, ebx... (except esp)
     ... do some stuff
     pop eax, ebx... (except esp)

I saw that on 486 the first method is slightly faster, but that's old news.

Thank you in advance
Yan

Reply Yan_952 (5) 10/30/2003 8:46:07 PM

"Yan" <Yan_952@hotmail.com> wrote in message
news:377c1b74.0310301246.b6feffd@posting.google.com...
> Hi.
>
> I have a question, since I was unable to find an exact answer to it
> anywhere.
> I need to save/restore CPU context in an OS I'm writing.
> The question is which of the 2 following methods is faster:
>     1)
>      pusha
>      ... do some stuff
>      popa
>
>     2)
>      push eax, ebx... (except esp)
>      ... do some stuff
>      pop eax, ebx... (except esp)
>
> I saw that on 486 the first method is slightly faster, but that's old
> news.
>
> Thank you in advance
> Yan

If your code follows the x86 ABI, it will save esp, ebp, esi, edi, and ebx.
Compiled code in GCC should follow this standard. This means you will only
need to save eax, ecx, and edx. In this case, it is much faster to do this:

mov [esp-4], edx
sub esp, 12
mov [esp], eax
mov [esp+4], ecx

The code is slightly out of order to make it run a little faster. On an
Athlon, it is just over 2 clocks with the possibility of executing other
code in parallel. The pusha and popa instructions are very slow. I saw
something like 6-10% improvement in an application that switched from
pusha/popa to the above code. It has to save and restore the registers often
for similar reasons.

-Matt


Reply Matt 10/30/2003 10:53:58 PM


Matt Taylor wrote:
> "Yan" <Yan_952@hotmail.com> wrote in message
> news:377c1b74.0310301246.b6feffd@posting.google.com...
>> Hi.
>>
>> I have a question, since I was unable to find an exact answer to it
>> anywhere. I need to save/restore CPU context in an OS I'm writing.
>> The question is which of the 2 following methods is faster:
>>     1)
>>      pusha
>>      ... do some stuff
>>      popa
>>
>>     2)
>>      push eax, ebx... (except esp)
>>      ... do some stuff
>>      pop eax, ebx... (except esp)
>>
>> I saw that on 486 the first method is slightly faster, but that's
>> old news.
>>
>> Thank you in advance
>> Yan
>
> If your code follows the x86 ABI, it will save esp, ebp, esi, edi,
> and ebx. Compiled code in GCC should follow this standard. This means
> you will only need to save eax, ecx, and edx. In this case, it is
> much faster to do this:
>
> mov [esp-4], edx
> sub esp, 12
> mov [esp], eax
> mov [esp+4], ecx
>
> The code is slightly out of order to make it run a little faster.

This would depend entirely on the code following this entry code. It'd be
faster in many cases to put the "sub esp,12" after all the mov's, though if
you have stack-intensive code after this, I would have thought it'd be more
efficient to do
  mov [esp-4], edx
  mov [esp-8], ecx
  sub esp, 12
  mov [esp], eax
Since the Athlon can dispatch all three of the first instructions in one
cycle.

[...]

--
Michael Brown
www.emboss.co.nz : OOS/RSI software and more :)
Add michael@ to emboss.co.nz - My inbox is always open


Reply Michael 10/31/2003 1:54:21 AM

"Michael Brown" <see@signature.below> wrote in message
news:jjjob.3191$Mn.126342@news.xtra.co.nz...
> Matt Taylor wrote:
> > "Yan" <Yan_952@hotmail.com> wrote in message
> > news:377c1b74.0310301246.b6feffd@posting.google.com...
> >> Hi.
> >>
> >> I have a question, since I was unable to find an exact answer to it
> >> anywhere. I need to save/restore CPU context in an OS I'm writing.
> >> The question is which of the 2 following methods is faster:
> >>     1)
> >>      pusha
> >>      ... do some stuff
> >>      popa
> >>
> >>     2)
> >>      push eax, ebx... (except esp)
> >>      ... do some stuff
> >>      pop eax, ebx... (except esp)
> >>
> >> I saw that on 486 the first method is slightly faster, but that's
> >> old news.
> >>
> >> Thank you in advance
> >> Yan
> >
> > If your code follows the x86 ABI, it will save esp, ebp, esi, edi,
> > and ebx. Compiled code in GCC should follow this standard. This means
> > you will only need to save eax, ecx, and edx. In this case, it is
> > much faster to do this:
> >
> > mov [esp-4], edx
> > sub esp, 12
> > mov [esp], eax
> > mov [esp+4], ecx
> >
> > The code is slightly out of order to make it run a little faster.
>
> This would depend entirely on the code following this entry code. It'd be
> faster in many cases to put the "sub esp,12" after all the mov's, though if
> you have stack-intensive code after this, I would have thought it'd be more
> efficient to do
>   mov [esp-4], edx
>   mov [esp-8], ecx
>   sub esp, 12
>   mov [esp], eax
> Since the Athlon can dispatch all three of the first instructions in one
> cycle.

Er, hmm. I was thinking that it would stall waiting for the AGU, but I
forgot that the AGU issues a cycle early except when an AGI occurs. The
effective latency of the mov instructions is still 3 despite the extra
cycle. Presuming they get the AGU the cycle beforehand, the movs and sub
would execute in parallel.

It is best to do the sub first for protection (from interrupts), but it is
faster to do it this way...

-Matt


Reply Matt 10/31/2003 6:05:33 AM

Are you saying that mov is faster than push, or simply that multiple
pushes create pipeline stalls, so by adjusting esp manually
you get better performance?

Yan

Reply Yan_952 10/31/2003 12:02:23 PM

"Michael Brown" <see@signature.below> writes:
> efficient to do
>   mov [esp-4], edx
>   mov [esp-8], ecx
>   sub esp, 12
>   mov [esp], eax
> Since the Athlon can dispatch all three of the first instructions in one
> cycle.

But is it interrupt safe? Can you always, unconditionally, be 
sure that esp-4 and esp-8 won't have been blatted by some random 
interrupt?

Phil

-- 
Unpatched IE vulnerability: WMP local file bounce
Description: Switching security zone, arbitrary command execution, 
             automatic email-borne command execution
Reference: http://www.ntbugtraq.com/default.asp?pid=36&sid=1&A2=ind0307&L=ntbugtraq&F=P&S=&P=6783
Exploit: http://www.malware.com/once.again!.html

Reply Phil 10/31/2003 3:10:31 PM

"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87k76l8mu0.fsf@nonospaz.fatphil.org...
> "Michael Brown" <see@signature.below> writes:
> > efficient to do
> >   mov [esp-4], edx
> >   mov [esp-8], ecx
> >   sub esp, 12
> >   mov [esp], eax
> > Since the Athlon can dispatch all three of the first instructions in one
> > cycle.
>
> But is it interrupt safe? Can you always, unconditionally, be
> sure that esp-4 and esp-8 won't have been blatted by some random
> interrupt?

I sure hope so. I ran into a call to _alloca(0) in Visual Studio itself.
This puts the return address temporarily below the stack pointer.

-Matt


Reply Matt 10/31/2003 7:56:05 PM

"Yan" <Yan_952@hotmail.com> wrote in message
news:377c1b74.0310310402.69291392@posting.google.com...
> Are you saying that mov is faster than push, or simply that multiple
> pushes create pipeline stalls, so by adjusting esp manually
> you get better performance?

Which processor? Going through generations:

(P5) Pentium can execute 2 push instructions per cycle.
(P6) Pentium Pro, Pentium 2, and Pentium 3 can only execute 1 push
instruction per cycle (decode starvation).
(P7) Pentium 4 can execute 1 push instruction per cycle (low throughput).
(K6) K6 is limited to 1 push instruction per cycle (store unit cannot accept
more).
(K7) Athlon can only execute 1 push instruction per cycle (AGI).

If you adjust esp and then mov the arguments into place:

(P5): Still 2 instructions/cycle (+1 clock AGI at start)
(P6): Either 1 or 2 instructions/cycle (can it do more than 1 store per
cycle?)
(P7): Changed to 2 instructions/cycle (*fairly* sure it does 2 stores/cycle)
(K6): Still 1 instruction/cycle (+1 clock AGI at start)
(K7): Changed to 2 instructions/cycle

So in general this is a win.

-Matt


Reply Matt 10/31/2003 8:01:49 PM

"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87k76l8mu0.fsf@nonospaz.fatphil.org...
> "Michael Brown" <see@signature.below> writes:
> > efficient to do
> >   mov [esp-4], edx
> >   mov [esp-8], ecx
> >   sub esp, 12
> >   mov [esp], eax
> > Since the Athlon can dispatch all three of the first instructions in one
> > cycle.
>
> But is it interrupt safe? Can you always, unconditionally, be
> sure that esp-4 and esp-8 won't have been blatted by some random
> interrupt?
>
> Phil
>
> -- 
> Unpatched IE vulnerability: WMP local file bounce
> Description: Switching security zone, arbitrary command execution,
>              automatic email-borne command execution
> Reference: http://www.ntbugtraq.com/default.asp?pid=36&sid=1&A2=ind0307&L=ntbugtraq&F=P&S=&P=6783
> Exploit: http://www.malware.com/once.again!.html
>

Interrupts that don't change privilege levels (i.e. interrupt at ring 3
handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
can gather), while user code is at ring 3; in this case, the processor (not
the handler) ensures that the ring 3 stack is not disturbed on interrupt.
For user-privilege ring 3 programming, the code above is OK. It would be
very dangerous in, for instance, a 95/98 VXD that ran at ring 0. See chapter
6 vol 1 of the Intel IA-32 Software Developer's manual.

-- 
Regards
Alex McDonald



Reply Alex 10/31/2003 8:53:20 PM

"Alex McDonald" <alex_mcd@btopenworld.com> wrote:
>
>Interrupts that don't change privilege levels (i.e. interrupt at ring 3
>handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
>Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
>can gather), 

Windows doesn't do anything at ring 1.  All code is either ring 3 or ring
0.

The only operating system I ever encountered that actually used more than
two rings was Control Data's NOS/VE for the Cyber 180 mainframes, which was
derived in a complicated way from MULTICS.  It had a rather complicated
segmented/ring-based addressing scheme, where there were a number of layers
between user and kernel that offered operating system services needing
varying amounts of protection.
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.

Reply Tim 11/1/2003 10:12:44 PM

Tim Roberts <timr@probo.com> writes:

> "Alex McDonald" <alex_mcd@btopenworld.com> wrote:
> >
> >Interrupts that don't change privilege levels (i.e. interrupt at ring 3
> >handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
> >Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
> >can gather), 
> 
> Windows doesn't do anything at ring 1.  All code is either ring 3 or ring
> 0.
> 
> The only operating system I ever encountered that actually used more than
> two rings was Control Data's NOS/VE for the Cyber 180 mainframes, which was
> derived in a complicated way from MULTICS.  It had a rather complicated
> segmented/ring-based addressing scheme, where there were a number of layers
> between user and kernel that offered operating system services needing
> varying amounts of protection.

OS/2 uses ring 2 for IO port management, doesn't it?

(Dunno where video drivers were in NT3.x, but they weren't ring 0. 
Ring 3 would seem a bit cranky, but believable.)

Phil
-- 
Unpatched IE vulnerability: mhtml wecerr CAB flip
Description: Delivery and installation of an executable
Reference: http://msgs.securepoint.com/cgi-bin/get/bugtraq0305/48.html

Reply Phil 11/3/2003 12:04:25 AM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
>
>OS/2 uses ring 2 for IO port management, doesn't it?

Does it?  I've forgotten, but it does seem to tickle some memory cells.

>(Dunno where video drivers were in NT3.x, but they weren't ring 0. 
>Ring 3 would seem a bit cranky, but believable.)

Display drivers in NT 3.x were just normal, user-mode DLLs, living in the
ring 3 CSRSS process.  The miniport drivers, then just like now, are kernel
drivers, but they do very little work once the display is initialized.

Many people don't realize that display drivers in Windows 9X are just
user-mode DLLs, and more than that, they are 16-bit DLLs.  (Most have
32-bit helpers by now, but the entry points are 16-bit.)
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.

Reply Tim 11/4/2003 6:11:25 AM
