Hi.
I have a question that I have been unable to find an exact answer to anywhere.
I need to save/restore CPU context in an OS I'm writing.
The question is which of the following 2 methods is faster:
1)
    pusha
    ... do some stuff
    popa
2)
    push eax
    push ebx
    ... (every register except esp)
    ... do some stuff
    ... (pop in reverse order)
    pop ebx
    pop eax
I saw that on the 486 the first method is slightly faster, but that's old news.
Thank you in advance
Yan
Posted by Yan_952 on 10/30/2003 8:46:07 PM
"Yan" <Yan_952@hotmail.com> wrote in message
news:377c1b74.0310301246.b6feffd@posting.google.com...
> [...]
If your code follows the x86 ABI, it already preserves esp, ebp, esi, edi, and
ebx across calls. Code compiled with GCC should follow this convention, so you
only need to save eax, ecx, and edx. In that case it is much faster to do this:
    mov [esp-4], edx    ; edx goes into the slot that becomes [esp+8]
    sub esp, 12         ; reserve three dwords
    mov [esp], eax
    mov [esp+4], ecx
The instructions are slightly out of order to make them run a little faster. On
an Athlon the sequence takes just over 2 clocks, with the possibility of
executing other code in parallel. The pusha and popa instructions are very
slow. I saw something like a 6-10% improvement in an application that switched
from pusha/popa to the above code. That application has to save and restore the
registers often, for similar reasons.
-Matt
Posted by Matt on 10/30/2003 10:53:58 PM
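The post above shows only the save half. A minimal sketch of a matching restore, assuming the same 12-byte layout (eax at [esp], ecx at [esp+4], edx at [esp+8]) and NASM-style 32-bit syntax, could look like the following. It is an illustration, not code from the thread, and it has not been timed on any of the processors discussed.

```asm
; Restore sketch for the three-register save (assumed layout, see above).
mov eax, [esp]        ; reload eax from the lowest slot
mov ecx, [esp+4]      ; reload ecx
mov edx, [esp+8]      ; reload edx from the slot written as [esp-4] before the sub
add esp, 12           ; release the three dwords only after all loads are done
```

Doing the add last keeps the saved values at or above esp until they have been read.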
Matt Taylor wrote:
> If your code follows the x86 ABI, it will save esp, ebp, esi, edi,
> and ebx. Compiled code in GCC should follow this standard. This means
> you will only need to save eax, ecx, and edx. In this case, it is
> much faster to do this:
>
> mov [esp-4], edx
> sub esp, 12
> mov [esp], eax
> mov [esp+4], ecx
>
> The code is slightly out of order to make it run a little faster.
This depends entirely on the code that follows the entry sequence. In many
cases it would be faster to put the "sub esp, 12" after all the movs. If
stack-intensive code follows, though, I would have thought it'd be more
efficient to do
    mov [esp-4], edx
    mov [esp-8], ecx
    sub esp, 12
    mov [esp], eax
since the Athlon can dispatch the first three instructions in one cycle.
[...]
--
Michael Brown
www.emboss.co.nz : OOS/RSI software and more :)
Add michael@ to emboss.co.nz - My inbox is always open
Posted by Michael on 10/31/2003 1:54:21 AM
"Michael Brown" <see@signature.below> wrote in message
news:jjjob.3191$Mn.126342@news.xtra.co.nz...
> This would depend entirely on the code following this entry code. It'd be
> faster in many cases to put the "sub esp,12" after all the mov's, though if
> you have stack-intensive code after this, I would have thought it'd be more
> efficient to do
> mov [esp-4], edx
> mov [esp-8], ecx
> sub esp, 12
> mov [esp], eax
> Since the Athlon can dispatch all three of the first instructions in one
> cycle.
Er, hmm. I was thinking that it would stall waiting for the AGU (address
generation unit), but I forgot that the AGU issues a cycle early except when an
AGI (address generation interlock) occurs. The effective latency of the mov
instructions is still 3 cycles despite the extra cycle. Presuming they get the
AGU the cycle beforehand, the movs and the sub would execute in parallel.
It is best to do the sub first for protection from interrupts, but it is
faster to do it this way...
-Matt
Posted by Matt on 10/31/2003 6:05:33 AM
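For code that can be interrupted on its own stack, the interrupt-safe ordering mentioned above (sub first) can be sketched as follows. The slot layout is assumed to be the same as in the earlier snippets; whether the reordering costs anything measurable on a given processor is not something this sketch establishes.

```asm
; Interrupt-safe variant: reserve the space before storing anything.
sub esp, 12           ; esp now sits below all three slots
mov [esp+8], edx      ; every store lands at or above esp
mov [esp+4], ecx
mov [esp], eax
```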
Are you saying that mov is faster than push, or simply that multiple
pushes create pipeline stalls, so that adjusting esp manually
gives better performance?
Yan
Posted by Yan_952 on 10/31/2003 12:02:23 PM
"Michael Brown" <see@signature.below> writes:
> efficient to do
> mov [esp-4], edx
> mov [esp-8], ecx
> sub esp, 12
> mov [esp], eax
> Since the Athlon can dispatch all three of the first instructions in one
> cycle.
But is it interrupt safe? Can you always, unconditionally, be
sure that esp-4 and esp-8 won't have been blatted by some random
interrupt?
Phil
--
Unpatched IE vulnerability: WMP local file bounce
Description: Switching security zone, arbitrary command execution,
automatic email-borne command execution
Reference: http://www.ntbugtraq.com/default.asp?pid=36&sid=1&A2=ind0307&L=ntbugtraq&F=P&S=&P=6783
Exploit: http://www.malware.com/once.again!.html
Posted by Phil on 10/31/2003 3:10:31 PM
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87k76l8mu0.fsf@nonospaz.fatphil.org...
> "Michael Brown" <see@signature.below> writes:
> > efficient to do
> > mov [esp-4], edx
> > mov [esp-8], ecx
> > sub esp, 12
> > mov [esp], eax
> > Since the Athlon can dispatch all three of the first instructions in one
> > cycle.
>
> But is it interrupt safe? Can you always, unconditionally, be
> sure that esp-4 and esp-8 won't have been blatted by some random
> interrupt?
I sure hope so. I ran into a call to _alloca(0) in Visual Studio itself.
This puts the return address temporarily below the stack pointer.
-Matt
Posted by Matt on 10/31/2003 7:56:05 PM
"Yan" <Yan_952@hotmail.com> wrote in message
news:377c1b74.0310310402.69291392@posting.google.com...
> Are you saying that mov is faster than push, or simply multiple
> pushes create pipeline stalls, so by adjusting esp manually
> you get better performance?
Which processor? Going through generations:
(P5) Pentium can execute 2 push instructions per cycle.
(P6) Pentium Pro, Pentium 2, and Pentium 3 can only execute 1 push
instruction per cycle (decode starvation).
(P7) Pentium 4 can execute 1 push instruction per cycle (low throughput).
(K6) K6 is limited to 1 push instruction per cycle (store unit cannot accept
more).
(K7) Athlon can only execute 1 push instruction per cycle (AGI).
If you adjust esp and then mov the arguments into place:
(P5): Still 2 instructions/cycle (+1 clock AGI at start)
(P6): Either 1 or 2 instructions/cycle (can it do more than 1 store per
cycle?)
(P7): Changed to 2 instructions/cycle (*fairly* sure it does 2 stores/cycle)
(K6): Still 1 instruction/cycle (+1 clock AGI at start)
(K7): Changed to 2 instructions/cycle
So in general this is a win.
-Matt
Posted by Matt on 10/31/2003 8:01:49 PM
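Applied to the original question (every general-purpose register except esp), the "adjust esp, then mov" approach can be sketched as below, in NASM-style 32-bit syntax. The slot order is an assumption, chosen to match what seven pushes in the order eax, ecx, edx, ebx, ebp, esi, edi would leave on the stack. The per-processor throughput figures above are Matt's and are not verified here.

```asm
; Save seven registers with one esp adjustment (illustrative layout).
sub esp, 28           ; 7 registers * 4 bytes
mov [esp+24], eax     ; same slot a first "push eax" would use
mov [esp+20], ecx
mov [esp+16], edx
mov [esp+12], ebx
mov [esp+8],  ebp
mov [esp+4],  esi
mov [esp],    edi     ; same slot a last "push edi" would use
; ... do some stuff ...
mov edi, [esp]
mov esi, [esp+4]
mov ebp, [esp+8]
mov ebx, [esp+12]
mov edx, [esp+16]
mov ecx, [esp+20]
mov eax, [esp+24]
add esp, 28           ; release the save area after all loads
```

Because the sub comes first, every store lands at or above esp, so this version does not depend on the area below the stack pointer staying intact.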
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87k76l8mu0.fsf@nonospaz.fatphil.org...
> "Michael Brown" <see@signature.below> writes:
> > efficient to do
> > mov [esp-4], edx
> > mov [esp-8], ecx
> > sub esp, 12
> > mov [esp], eax
> > Since the Athlon can dispatch all three of the first instructions in one
> > cycle.
>
> But is it interrupt safe? Can you always, unconditionally, be
> sure that esp-4 and esp-8 won't have been blatted by some random
> interrupt?
Interrupts that don't change privilege levels (i.e. interrupt at ring 3
handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
can gather), while user code is at ring 3; in this case, the processor (not
the handler) ensures that the ring 3 stack is not disturbed on interrupt.
For user-privilege ring 3 programming, the code above is OK. It would be
very dangerous in, for instance, a Windows 95/98 VxD running at ring 0. See
chapter 6 of volume 1 of the Intel IA-32 Software Developer's Manual.
--
Regards
Alex McDonald
Posted by Alex on 10/31/2003 8:53:20 PM
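To make the hazard concrete: when an interrupt is delivered without a privilege change (and without an error code), the processor pushes EFLAGS, CS, and EIP onto the current stack, which are exactly the dwords just below esp. The annotated sequence below marks where that would overwrite the values stored by the sub-last ordering. It is an illustration of the failure mode, not code to use at ring 0.

```asm
; Unsafe when an interrupt can use the current stack (e.g. ring 0 code).
mov [esp-4], edx      ; this slot is where the CPU would put EFLAGS
mov [esp-8], ecx      ; this slot is where the CPU would put CS
                      ; <-- interrupt here: CPU pushes EFLAGS, CS, EIP at
                      ;     [esp-4], [esp-8], [esp-12]; saved edx/ecx are lost
sub esp, 12           ; only from here on are the slots protected
mov [esp], eax
```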
"Alex McDonald" <alex_mcd@btopenworld.com> wrote:
>
>Interrupts that don't change privilege levels (i.e. interrupt at ring 3
>handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
>Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
>can gather),
Windows doesn't do anything at ring 1. All code is either ring 3 or ring
0.
The only operating system I ever encountered that actually used more than
two rings was Control Data's NOS/VE for the Cyber 180 mainframes, which was
derived in a complicated way from MULTICS. It had a rather complicated
segmented/ring-based addressing scheme, where there were a number of layers
between user and kernel that offered operating system services needing
varying amounts of protection.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
Posted by Tim on 11/1/2003 10:12:44 PM
Tim Roberts <timr@probo.com> writes:
> "Alex McDonald" <alex_mcd@btopenworld.com> wrote:
> >
> >Interrupts that don't change privilege levels (i.e. interrupt at ring 3
> >handled by a ring 3 interrupt handler) would walk all over your stack. IIRC
> >Windows interrupt handlers are all at ring <3 (ring 0 or ring 1 from what I
> >can gather),
>
> Windows doesn't do anything at ring 1. All code is either ring 3 or ring
> 0.
>
> The only operating system I ever encountered that actually used more than
> two rings was Control Data's NOS/VE for the Cyber 180 mainframes, which was
> derived in a complicated way from MULTICS. It had a rather complicated
> segmented/ring-based addressing scheme, where there were a number of layers
> between user and kernel that offered operating system services needing
> varying amounts of protection.
OS/2 uses ring 2 for IO port management, doesn't it?
(Dunno where video drivers were in NT3.x, but they weren't ring 0.
Ring 3 would seem a bit cranky, but believable.)
Phil
--
Unpatched IE vulnerability: mhtml wecerr CAB flip
Description: Delivery and installation of an executable
Reference: http://msgs.securepoint.com/cgi-bin/get/bugtraq0305/48.html
Posted by Phil on 11/3/2003 12:04:25 AM
Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
>
>OS/2 uses ring 2 for IO port management, doesn't it?
Does it? I've forgotten, but it does seem to tickle some memory cells.
>(Dunno where video drivers were in NT3.x, but they weren't ring 0.
>Ring 3 would seem a bit cranky, but believable.)
Display drivers in NT 3.x were just normal, user-mode DLLs, living in the
ring 3 CSRSS process. The miniport drivers, then just like now, are kernel
drivers, but they do very little work once the display is initialized.
Many people don't realize that display drivers in Windows 9X are just
user-mode DLLs, and more than that, they are 16-bit DLLs. (Most have
32-bit helpers by now, but the entry points are 16-bit.)
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
Posted by Tim on 11/4/2003 6:11:25 AM