Using C and Assembly code: 64Bit Calling convention #2


Hello, I didn't want to cross-post, but please have a look at
Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .

What I'm trying to do, is to merge assembly code and C-code in one executable
file.

Now, I don't know exactly about calling conventions of the AMD 64Bit model.
Please link me to a site that describes prologue and epilogue of a called
function in assembly, and how I get values returned from said function.
(RAX contains return value, which is OK for integers, but how do I get a float
value returned?)

Please help.

Regards,
--polemon
-- 
echo "cbyrzba@cbyrzba.bet" | tr '[a-z]' '[n-za-m]'

Reply cbyrzba690 (2) 4/7/2010 3:21:48 AM

On 04/06/2010 08:21 PM, polemon wrote:
> Hello, I didn't want to cross-post, but please have a look at
> Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .
> 
> What I'm trying to do, is to merge assembly code and C-code in one executable
> file.
> 
> Now, I don't know exactly about calling conventions of the AMD 64Bit model.
> Please link me to a site that describes prologue and epilogue of a called
> function in assembly, and how I get values returned from said function.
> (RAX contains return value, which is OK for integers, but how do I get a float
> value returned?)
> 
> Please help.
> 


http://www.nasm.us/doc/nasmdo11.html

	-hpa
Reply H 4/7/2010 5:30:24 AM


"H. Peter Anvin" <hpa@nospicedham.zytor.com> wrote in message
news:4BBC1870.4070706@zytor.com...
> On 04/06/2010 08:21 PM, polemon wrote:
> > Hello, I didn't want to cross-post, but please have a look at
> > Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .
> >
> > What I'm trying to do, is to merge assembly code and C-code in one
> > executable file.
> >
> > Now, I don't know exactly about calling conventions of the AMD 64Bit
> > model. Please link me to a site that describes prologue and epilogue of a
> > called function in assembly, and how I get values returned from said
> > function. (RAX contains return value, which is OK for integers, but how do
> > I get a float value returned?)
> >
> http://www.nasm.us/doc/nasmdo11.html
>

Follow link above... Get to this:

"The Win64 ABI is described at:"
http://msdn2.microsoft.com/en-gb/library/ms794533.aspx

Click on that link:

"THIS CONTENT HAS BEEN REMOVED - MSDN."

So, reporting a dead link in NASM doc.


RP


Reply Rod 4/7/2010 6:44:57 AM

"polemon" <cbyrzba@nospicedham.cbyrzba.bet> wrote in message
news:20100407052148.6bbca98f@cbyrzba.bet...
> Hello, I didn't want to cross-post, but please have a look at
> Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .
>

Google Groups doesn't have it indexed by msg-id yet.

Link to Polemon's c.l.c post:
http://groups.google.com/group/comp.lang.c/msg/2b0a8fb31a55e53c

> What I'm trying to do, is to merge assembly code and C-code in one
> executable file.
>

If you were trying to merge, you'd use inline assembly.  You're trying to
link assembly objects and C objects and have them work correctly for that
calling convention.  Someone should be able to help you.


Rod Pemberton


Reply Rod 4/7/2010 7:13:31 AM

On Wed, 7 Apr 2010 05:21:48 +0200
polemon <cbyrzba@nospicedham.cbyrzba.bet> wrote:

> Hello, I didn't want to cross-post, but please have a look at
> Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .
> 
> What I'm trying to do, is to merge assembly code and C-code in one
> executable file.
> 
> Now, I don't know exactly about calling conventions of the AMD 64Bit
> model. Please link me to a site that describes prologue and epilogue
> of a called function in assembly, and how I get values returned from
> said function. (RAX contains return value, which is OK for integers,
> but how do I get a float value returned?)
> 
> Please help.
> 
> Regards,
> --polemon

http://www.x86-64.org/

-- 
http://maxa.homedns.org/

Sometimes online sometimes not



Reply Branimir 4/7/2010 8:41:13 AM

On Apr 7, 1:44 am, "Rod Pemberton" <do_not_h...@havenone.cmm> wrote:
> "H. Peter Anvin" <h...@nospicedham.zytor.com> wrote in message
> news:4BBC1870.4070706@zytor.com...
>
> > On 04/06/2010 08:21 PM, polemon wrote:
> > > Hello, I didn't want to cross-post, but please have a look at
> > > Message-ID: <20100407050850.69fa1...@cbyrzba.bet> in comp.lang.c .
>
> > > What I'm trying to do, is to merge assembly code and C-code in one
> > > executable file.
>
> > > Now, I don't know exactly about calling conventions of the AMD 64Bit
> > > model. Please link me to a site that describes prologue and epilogue of
> > > a called function in assembly, and how I get values returned from said
> > > function. (RAX contains return value, which is OK for integers, but how
> > > do I get a float value returned?)
>
> >http://www.nasm.us/doc/nasmdo11.html
>
> Follow link above... Get to this:
>
> "The Win64 ABI is described at:"
> http://msdn2.microsoft.com/en-gb/library/ms794533.aspx
>
> Click on that link:
>
> "THIS CONTENT HAS BEEN REMOVED - MSDN."
>
> So, reporting a dead link in NASM doc.


Current link:

http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx
Reply robertwessel2 4/7/2010 9:58:34 AM

On Tue, 06 Apr 2010 22:30:24 -0700
"H. Peter Anvin" <hpa@nospicedham.zytor.com> wrote:

> http://www.nasm.us/doc/nasmdo11.html

Thanks a lot!

This, and using gcc -S inttest.c and having a look at how registers are used
when calling a function, did the trick.
http://pix.cs.olemiss.edu/csci223/x86-64CpuRegs.gif is helpful as well.

I will now try managing floating point values and then passing arguments on
the stack (I still don't know exactly how I should do that).

As for the prologue, I don't get why I need the first and last two lines:

	mov	[rsp + 8], rcx
	push	r15
	push	r14
	push	r13
	sub	rsp, 16
	lea	r13, [rsp + 128]

I would appreciate it, if someone could explain that to me in detail...

Also, is there a book on assembly for x86_64 that you would recommend?

Regards,
--polemon

-- 
echo "cbyrzba@cbyrzba.bet" | tr '[a-z]' '[n-za-m]'

Reply polemon 4/8/2010 1:02:41 AM

"polemon" <cbyrzba@nospicedham.cbyrzba.bet> wrote in message 
news:20100407052148.6bbca98f@cbyrzba.bet...
> Hello, I didn't want to cross-post, but please have a look at
> Message-ID: <20100407050850.69fa110c@cbyrzba.bet> in comp.lang.c .
>
> What I'm trying to do, is to merge assembly code and C-code in one 
> executable
> file.
>
> Now, I don't know exactly about calling conventions of the AMD 64Bit 
> model.
> Please link me to a site that describes prologue and epilogue of a called
> function in assembly, and how I get values returned from said function.
> (RAX contains return value, which is OK for integers, but how do I get a 
> float
> value returned?)
>

keep in mind that Win64 and Linux x86-64 (and Mac OSX) use different calling 
conventions...

in this thread, people have provided links to both, but which is intended 
here is not clearly indicated, leaving people free to answer for their 
particular OS of choice...


Reply BGB 4/8/2010 4:02:49 AM

polemon <cbyrzba@cbyrzba.bet> wrote:
>
>As for the prologue, I don't get why I need the first and last two lines:
>
>	mov	[rsp + 8], rcx
>	push	r15
>	push	r14
>	push	r13
>	sub	rsp, 16
>	lea	r13, [rsp + 128]
>
>I would appreciate it, if someone could explain that to me in detail...

The first line is saving in memory the first argument you got from the
function that called you, which arrives in rcx.  It's probably doing that
because it needs it later.  If you call another function, the first
parameter YOU pass has to go in rcx, so you'll lose your incoming
parameter.

The "sub rsp, 16" is used to make room for the first four (register)
parameters.  Even though the first four parameters are passed in registers
(rcx, rdx, r8, r9), you are required to allocate stack space for them AS IF
they had been passed on the stack.  This gives the function you are calling
a place to put the register parameters in memory, just as you are doing in
the first line above.  That line is storing rcx in the stack area allocated
by the function that called you.

You must ALWAYS allocate room for four parameters, even if you don't pass
any parameters at all.

The r13 manipulation must be something specific to this function.
-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
Reply Tim 4/8/2010 4:55:55 AM

Tim Roberts wrote:

>> As for the prologue, I don't get why I need the first and last two lines:
>>
>> 	mov	[rsp + 8], rcx
>> 	push	r15
>> 	push	r14
>> 	push	r13
>> 	sub	rsp, 16
>> 	lea	r13, [rsp + 128]
>>
>> I would appreciate it, if someone could explain that to me in detail...
>
> The first line is saving in memory the first argument you got from the
> function that called you, which arrives in rcx.  It's probably doing that
> because it needs it later.  If you call another function, the first
> parameter YOU pass has to go in rcx, so you'll lose your incoming
> parameter.
>
> The "sub rsp, 16" is used to make room for the first four (register)
> parameters.

Actually, just 2 registers (128 bit / 64 bit = 2)...

> Even though the first four parameters are passed in registers
> (rcx, rdx, r8, r9), you are required to allocate stack space for them AS IF
> they had been passed on the stack.  This gives the function you are calling
> a place to put the register parameters in memory, just as you are doing in
> the first line above.  That line is storing rcx in the stack area allocated
> by the function that called you.
>
> You must ALWAYS allocate room for four parameters, even if you don't pass
> any parameters at all.

This is Windows' calling convention.

Linux: RDI, RSI, RDX, RCX, R8, R9, XMM0 ... XMM7 are
used to pass up to 14 parameters. Further parameters
are passed on the stack at 08[RSP] upwards (with RSP
aligned to a multiple of 8).

Looking at the code snippet: GCC overwrites the 15th
parameter on the stack with the content of RCX. Any-
one out there who knows why? Seems quite unnecessary
and somehow illegal.

> The r13 manipulation must be something specific to this function.

R13 is used as basepointer. The so called 'red zone'
at [RSP - 1]...[RSP - 128] is 'allocated' as storage
for local parameters and data.

Return values are passed in RAX or XMM0.

Information taken from:

http://www.agner.org/optimize/optimizing_assembly.pdf


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/8/2010 10:04:28 AM

polemon wrote:

> This, and using gcc -S inttest.c and having a look at how registers are used
> when calling a function, did the trick.

Surely the best way to learn how things work (I did
the same thing some years ago).

> http://pix.cs.olemiss.edu/csci223/x86-64CpuRegs.gif is helpful as well.

Some other resources:

http://developer.amd.com/documentation/guides/Pages/default.aspx
http://www.intel.com/products/processor/manuals/
http://www.agner.org/optimize/

> I will now try managing floating point values and then passing arguments on
> the stack (I still don't know exactly how I should do that).

In 64 bit Linux, the first 14 parameters are passed
in RDI, RSI, RDX, RCX, R8, R9, XMM0...XMM7. I doubt
too many functions pass more than 14 parameters. ;)

64 bit FP values fit into 64 bit integer registers.
To learn how this is done for real, code a function
fptest.c and compile it.

> As for the prologue, I don't get why I need the first and last two lines:
>
> 	mov	[rsp + 8], rcx

This looks suspicious - it copies parameter four to
the location where parameter 15 had to be passed if
such a parameter existed.

Is this really the first line of that function?

> 	push	r15
> 	push	r14
> 	push	r13

Saves those registers on the stack.

> 	sub	rsp, 16

Reserves another 16 byte on the stack. We moved RSP
40 byte down if we reach this point.

> 	lea	r13, [rsp + 128]

Doesn't make sense in this context, except there is
some code preceding the first line you posted. The
128 byte below the current RSP are called red zone,
but the red zone is below RSP, not above. When this
line was executed, R13 points to a location 80 byte
above the return address of your function - an area
'belonging' to the calling function(s).

Are you sure you posted the entire code?


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/8/2010 12:05:02 PM

"polemon" <cbyrzba@cbyrzba.bet> wrote in message
news:20100408030241.72bc777d@cbyrzba.bet...
>
> This, and using gcc -S inttest.c and having a look at how registers are
> used when calling a function, did the trick.
>

Newer GCC versions support a number of debugging options which show what the
compiler is doing.  You might look into options -dA or -dP or similar
(near -fdump-rtl-"pass" section).  IIRC, they should show what assembly code
corresponds to what C code.  They might be useful for understanding assembly
generated for the function prolog and epilog.

http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html


Rod Pemberton


Reply Rod 4/8/2010 12:20:40 PM

"Tim Roberts" <timr@nospicedham.probo.com> wrote in message 
news:d7oqr5p3e60iu2b6398guclhqk19akc8eq@4ax.com...
> polemon <cbyrzba@cbyrzba.bet> wrote:
>>
>>As for the prologue, I don't get why I need the first and last two lines:
>>
>> mov [rsp + 8], rcx
>> push r15
>> push r14
>> push r13
>> sub rsp, 16
>> lea r13, [rsp + 128]
>>
>>I would appreciate it, if someone could explain that to me in detail...
>
> The first line is saving in memory the first argument you got from the
> function that called you, which arrives in rcx.  It's probably doing that
> because it needs it later.  If you call another function, the first
> parameter YOU pass has to go in rcx, so you'll lose your incoming
> parameter.
>
> The "sub rsp, 16" is used to make room for the first four (register)
> parameters.  Even though the first four parameters are passed in registers
> (rcx, rdx, r8, r9), you are required to allocate stack space for them AS 
> IF
> they had been passed on the stack.  This gives the function you are 
> calling
> a place to put the register parameters in memory, just as you are doing in
> the first line above.  That line is storing rcx in the stack area 
> allocated
> by the function that called you.
>
> You must ALWAYS allocate room for four parameters, even if you don't pass
> any parameters at all.
>
> The r13 manipulation must be something specific to this function.

actually, this "lea" is important to the functioning of the calling 
convention, as it sets up a function with a base pointer (although, setting 
up and using a base pointer is not always required, but there are different 
rules in each case).

r13 serves a similar role to "ebp" in x86 cdecl, but there is more free 
choice in which register to use, although r13 is common...


> -- 
> Tim Roberts, timr@probo.com
> Providenza & Boekelheide, Inc. 


Reply BGB 4/8/2010 12:26:50 PM

On Wed, 07 Apr 2010 21:55:55 -0700
Tim Roberts <timr@nospicedham.probo.com> wrote:

> polemon <cbyrzba@cbyrzba.bet> wrote:
> >
> >As for the prologue, I don't get why I need the first and last two lines:
> >
> >	mov	[rsp + 8], rcx
> >	push	r15
> >	push	r14
> >	push	r13
> >	sub	rsp, 16
> >	lea	r13, [rsp + 128]
> >
> >I would appreciate it, if someone could explain that to me in detail...
> 
> The first line is saving in memory the first argument you got from the
> function that called you, which arrives in rcx.  It's probably doing that
> because it needs it later.  If you call another function, the first
> parameter YOU pass has to go in rcx, so you'll lose your incoming
> parameter.

OK, but according to http://www.nasm.us/doc/nasmdo11.html , first argument is
passed in RDI, and in fact, I use RDI to retrieve my argument:

	imul	rdi, rdi
	mov	rax, rdi
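
A minimal sketch of those two instructions linked into a C program, assuming GCC or Clang on x86-64 Linux (System V AMD64 ABI). The name square_int and the plain-C fallback for other platforms are made up for illustration, and the assembly is written in AT&T syntax as gcc -S prints it:

```c
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
/* Assembly version (AT&T syntax: source operand first).
   System V AMD64: first integer argument in RDI, integer result in RAX. */
__asm__(
    ".text\n"
    ".globl square_int\n"
    ".type square_int, @function\n"
    "square_int:\n"
    "    imulq %rdi, %rdi\n"   /* rdi = rdi * rdi */
    "    movq  %rdi, %rax\n"   /* integer return value goes in rax */
    "    ret\n"
);
long square_int(long x);       /* prototype so C code can call it */
#else
/* Fallback for other platforms (an assumption of this sketch). */
long square_int(long x) { return x * x; }
#endif
```

From C the routine is called like any other function, for example square_int(7). No prologue is needed here because the function touches neither the stack nor any callee-saved register.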

> The "sub rsp, 16" is used to make room for the first four (register)
> parameters.  Even though the first four parameters are passed in registers
> (rcx, rdx, r8, r9), you are required to allocate stack space for them AS IF
> they had been passed on the stack.  This gives the function you are calling
> a place to put the register parameters in memory, just as you are doing in
> the first line above.  That line is storing rcx in the stack area allocated
> by the function that called you.
> 
> You must ALWAYS allocate room for four parameters, even if you don't pass
> any parameters at all.

Furthermore, the doc tell me I have six registers where I get arguments.
Does that mean I need to "sub rsp, 24"?

And what about floating point arguments, do I need to reserve space for them
as well?


-- 
echo "cbyrzba@cbyrzba.bet" | tr '[a-z]' '[n-za-m]'

Reply polemon 4/8/2010 2:46:48 PM

I wrote:

> In 64 bit Linux, the first 14 parameters are passed
> in RDI, RSI, RDX, RCX, R8, R9, XMM0...XMM7. I doubt
> too many functions pass more than 14 parameters. ;)

Correction (After reading some other documents):

RDI through R9 are used to pass integer arguments,
while XMM0 through XMM7 are used to pass floating
point arguments. A function like

double MyFunction(_int64 a, double b, _int64 c, double d);

passes a in RDI, b in XMM0, c in RSI and d in XMM1
(in Windows, there are other conventions). Results
of type double are returned in XMM0, integer types
in RAX.
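
A hedged sketch of such a function written in assembly and called from C, assuming GCC or Clang on x86-64 Linux. The name my_function, the use of long long in place of _int64, and the body (which simply adds the four arguments) are illustrative choices, not something from the ABI documents:

```c
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
/* System V AMD64: a in RDI, b in XMM0, c in RSI, d in XMM1.
   A double result is returned in XMM0. XMM2 is a scratch register. */
__asm__(
    ".text\n"
    ".globl my_function\n"
    ".type my_function, @function\n"
    "my_function:\n"
    "    cvtsi2sdq %rdi, %xmm2\n"   /* xmm2 = (double)a */
    "    addsd     %xmm2, %xmm0\n"  /* xmm0 = b + a */
    "    cvtsi2sdq %rsi, %xmm2\n"   /* xmm2 = (double)c */
    "    addsd     %xmm2, %xmm0\n"  /* xmm0 += c */
    "    addsd     %xmm1, %xmm0\n"  /* xmm0 += d, result stays in xmm0 */
    "    ret\n"
);
double my_function(long long a, double b, long long c, double d);
#else
/* Fallback for other platforms (an assumption of this sketch). */
double my_function(long long a, double b, long long c, double d)
{
    return (double)a + b + (double)c + d;
}
#endif
```

The integer and floating-point arguments are counted separately, which is why c lands in RSI (the second integer register) even though it is the third argument.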


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/8/2010 3:31:14 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message 
news:hpkgpl$nvi$1@news.eternal-september.org...
> polemon wrote:
>
>> This, and using gcc -S inttest.c and having a look at how registers are 
>> used
>> when calling a function, did the trick.
>
> Surely the best way to learn how things work (I did
> the same thing some years ago).
>
>> http://pix.cs.olemiss.edu/csci223/x86-64CpuRegs.gif is helpful as well.
>
> Some other resources:
>
> http://developer.amd.com/documentation/guides/Pages/default.aspx
> http://www.intel.com/products/processor/manuals/
> http://www.agner.org/optimize/
>
>> I will now try managing floating point values and then passing arguments 
>> on
>> the stack (I still don't know exactly how I should do that).
>
> In 64 bit Linux, the first 14 parameters are passed
> in RDI, RSI, RDX, RCX, R8, R9, XMM0...XMM7. I doubt
> too many functions pass more than 14 parameters. ;)
>
> 64 bit FP values fit into 64 bit integer registers.
> To learn how this is done for real, code a function
> fptest.c and compile it.
>
>> As for the prologue, I don't get why I need the first and last two lines:
>>
>> mov [rsp + 8], rcx
>
> This looks suspicious - it copies parameter four to
> the location where parameter 15 had to be passed if
> such a parameter existed.
>
> Is this really the first line of that function?
>

it is a Win64 prologue...

this saves the first argument (passed in as a register) to the stack.
Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of 
stack for holding these arguments (larger items are passed as pointers to 
the item).

nevermind that in the vast majority of cases, the first thing the function 
does is to store any register-based arguments to the stack.


I think the OP was mixing Win64 and Linux code, hence part of his problem.


>> push r15
>> push r14
>> push r13
>
> Saves those registers on the stack.
>
>> sub rsp, 16
>
> Reserves another 16 byte on the stack. We moved RSP
> 40 byte down if we reach this point.
>
>> lea r13, [rsp + 128]
>
> Doesn't make sense in this context, except there is
> some code preceding the first line you posted. The
> 128 byte below the current RSP are called red zone,
> but the red zone is below RSP, not above. When this
> line was executed, R13 points to a location 80 byte
> above the return address of your function - an area
> 'belonging' to the calling function(s).
>
> Are you sure you posted the entire code?
>

it is a complete prologue at least...


that is the last line of the prologue, which tells Win64 where the 
base-pointer is.
it is also used in the epilogue, and often to index arguments and locals.

however, it is optional, and one can instead choose to refer to everything 
relative to RSP.


>
> Greetings from Augsburg
>
> Bernhard Schornak 


Reply BGB 4/9/2010 1:32:32 PM

On Fri, 9 Apr 2010 06:32:32 -0700
"BGB / cr88192" <cr88192@nospicedham.hotmail.com> wrote:

> it is a Win64 prologue...
> 
> this saves the first argument (passed in as a register) to the stack.
> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of 
> stack for holding these arguments (larger items are passed as pointers to 
> the item).

Does that mean, I don't need to do that on Linux?

> nevermind that in the vast majority of cases, the first thing the function 
> does is to store any register-based arguments to the stack.

This is getting a bit confusing. So, I *do* need to reserve space on the stack
for each register parameter (that my function uses)?

I read in a book (which works with 32bit coding conventions, where everything
is passed on the stack), and there I was told, that I need to save the ESP to
EBP and put the old value of EBP on the stack. This is used to get a fixed
pointer to stack position where arguments are passed, and where space should
be reserved for local variables.

How does that work in AMD64/Linux?

> I think the OP was mixing Win64 and Linux code, hence part of his problem.

Yeah, the source I used, was not clear about Win64 and Linux conventions.

-- 
echo "cbyrzba@cbyrzba.bet" | tr '[a-z]' '[n-za-m]'

Reply polemon 4/9/2010 3:42:47 PM

polemon <cbyrzba@nospicedham.cbyrzba.bet> wrote:
>
>OK, but according to http://www.nasm.us/doc/nasmdo11.html , first argument is
>passed in RDI, and in fact, I use RDI to retrieve my argument:

Yes, apparently the Windows x86-64 calling convention (which I described)
is quite different from the Linux x86-64 calling convention.  Color me
embarrassed for not realizing that.
-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
Reply Tim 4/10/2010 3:36:26 AM

On 04/09/2010 08:42 AM, polemon wrote:
> On Fri, 9 Apr 2010 06:32:32 -0700
> "BGB / cr88192" <cr88192@nospicedham.hotmail.com> wrote:
> 
>> it is a Win64 prologue...
>>
>> this saves the first argument (passed in as a register) to the stack.
>> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of 
>> stack for holding these arguments (larger items are passed as pointers to 
>> the item).
> 
> Does that mean, I don't need to do that on Linux?
> 

You do not need to reserve space for register arguments on the stack in
Linux.

> I read in a book (which works with 32bit coding conventions, where everything
> is passed on the stack), and there I was told, that I need to save the ESP to
> EBP and put the old value of EBP on the stack. This is used to get a fixed
> pointer to stack position where arguments are passed, and where space should
> be reserved for local variables.
>  
> How does that work in AMD64/Linux?

For a leaf function (which doesn't call others), you can just use 128
bytes below RSP directly.  Otherwise you have to drop RSP.  You don't
need a frame pointer (RBP) -- even for 32 bits -- unless you're using
variable-sized allocations on the stack, but using a frame pointer
sometimes makes debugging easier.  It means losing a register, though.
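
As a rough illustration of the leaf-function case, here is a sketch that keeps two locals in the red zone without moving RSP. It assumes GCC or Clang on x86-64 Linux in user space; the name add_via_red_zone and the plain-C fallback are invented for the example:

```c
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
/* Leaf function: it calls nothing, so the 128 bytes below RSP
   (the red zone) can hold locals without adjusting RSP. */
__asm__(
    ".text\n"
    ".globl add_via_red_zone\n"
    ".type add_via_red_zone, @function\n"
    "add_via_red_zone:\n"
    "    movq %rdi, -8(%rsp)\n"     /* local 1 = a */
    "    movq %rsi, -16(%rsp)\n"    /* local 2 = b */
    "    movq -8(%rsp), %rax\n"     /* rax = local 1 */
    "    addq -16(%rsp), %rax\n"    /* rax += local 2 */
    "    ret\n"
);
long add_via_red_zone(long a, long b);
#else
/* Fallback for other platforms (an assumption of this sketch). */
long add_via_red_zone(long a, long b) { return a + b; }
#endif
```

As soon as the function contains a call, the pushed return address lands at -8(%rsp) and overwrites the first local, which is why a non-leaf function has to lower RSP instead.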

	-hpa
Reply H 4/10/2010 6:01:19 AM

polemon wrote:

>> it is a Win64 prologue...
>>
>> this saves the first argument (passed in as a register) to the stack.
>> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of
>> stack for holding these arguments (larger items are passed as pointers to
>> the item).
>
> Does that mean, I don't need to do that on Linux?

It means you *must not* do it on Linux. If you do,
strange things might happen. In the best case, the
program crashes. It is very hard to track down the
cause of an error if your function writes to stack
locations belonging to other functions.

>> nevermind that in the vast majority of cases, the first thing the function
>> does is to store any register-based arguments to the stack.
>
> This is getting a bit confusing. So, I *do* need to reserve space on the stack
> for each register parameter (that my function uses)?

Windows passes parameters in RCX, RDX, R8 and R9 -
Linux in RDI, RSI, RDX, RCX, R8 and R9.

Windows passes FP values in XMM0...XMM3 - Linux in
XMM0...XMM7.

If a mix of integer and FP values is to pass, Win-
dows keeps order (e.g. ECX, XMM1, R8, XMM3), Linux
does not (same example: EDI, XMM0, ESI, XMM1).

For Windows, you have to subtract 32 byte from RSP
before you call another function (they are used to
store ECX, EDX, R8 and R9 if required). Linux does
not know such a thing.

For Linux, there's a 128 byte 'red zone' below RSP
reserved for local data. No such thing in Windows.
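
The register assignment described above can be sketched as a small lookup in C. This is only an illustration for plain integer/pointer and float/double arguments; it ignores structs, varargs and long double, and the function names and the 'i'/'f' encoding are made up:

```c
#include <stddef.h>
#include <string.h>

static const char *const SYSV_INT[] = {"RDI", "RSI", "RDX", "RCX", "R8", "R9"};
static const char *const SYSV_FP[]  = {"XMM0", "XMM1", "XMM2", "XMM3",
                                       "XMM4", "XMM5", "XMM6", "XMM7"};
static const char *const WIN_INT[]  = {"RCX", "RDX", "R8", "R9"};
static const char *const WIN_FP[]   = {"XMM0", "XMM1", "XMM2", "XMM3"};

/* kinds: one character per argument, 'i' = integer/pointer, 'f' = float/double.
   Linux (System V): integer and FP arguments are counted separately. */
const char *sysv_register(const char *kinds, size_t index)
{
    size_t ints = 0, fps = 0;
    for (size_t i = 0; kinds[i] != '\0'; i++) {
        int is_fp = (kinds[i] == 'f');
        size_t slot = is_fp ? fps++ : ints++;
        if (i == index) {
            if (is_fp)
                return slot < 8 ? SYSV_FP[slot] : "stack";
            return slot < 6 ? SYSV_INT[slot] : "stack";
        }
    }
    return NULL; /* index is past the last argument */
}

/* Windows: the argument position alone selects the register slot. */
const char *win64_register(const char *kinds, size_t index)
{
    if (index >= strlen(kinds))
        return NULL; /* index is past the last argument */
    if (index >= 4)
        return "stack";
    return kinds[index] == 'f' ? WIN_FP[index] : WIN_INT[index];
}
```

For the argument list "ifif" (integer, FP, integer, FP) this reproduces the example given above: RDI, XMM0, RSI, XMM1 on Linux and RCX, XMM1, R8, XMM3 on Windows.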

> I read in a book (which works with 32bit coding conventions, where everything
> is passed on the stack), and there I was told, that I need to save the ESP to
> EBP and put the old value of EBP on the stack. This is used to get a fixed
> pointer to stack position where arguments are passed, and where space should
> be reserved for local variables.
>
> How does that work in AMD64/Linux?

If your function saves registers or puts temporary
variables onto the stack, you may need a reference
point where those data are stored. This is done by
copying rSP to rBP (or any other register!) before
you start to push or move anything onto the stack.

If you use push instructions, rSP automatically is
updated by the processor with every push. With mov
instructions, you have to subtract the size of all
saved data from rSP before you start to write them
to the stack. Data between the current rSP and the
stack element the basepointer 'points to' are safe
from being overwritten by called functions.

The reference (aka basepointer) is used to address
elements in your local stack area. To access these
elements, you have to use *negative* offsets - all
elements are *below* that reference point! the 1st
element is at [REF - 8], the second at [REF - 16],
and so on (64 bit = 8 byte).

>> I think the OP was mixing Win64 and Linux code, hence part of his problem.
>
> Yeah, the source I used, was not clear about Win64 and Linux conventions.

If you want to preserve RCX in Linux, just push it
onto the stack like R13...R15.

BTW - I still don't have a clue what the last line

lea r13, [rsp + 128]

is good for. If you want to access data in the so-
called 'red zone' (the 128 suggests this), then it
should be

lea r13, [rsp - 128]

The 'red zone' is below, not above RSP. You should
not use it, anyway (your function pushes registers
onto the stack).

A skeleton for your function might look like this:

subq $0x78,%rsp        # create stack frame
movq %r13,0x58(%rsp)   # store registers
movq %r14,0x60(%rsp)
movq %r15,0x68(%rsp)
movq %rcx,0x70(%rsp)
....                    # function code
....
....
movq 0x58(%rsp),%r13   # restore registers
movq 0x60(%rsp),%r14
movq 0x68(%rsp),%r15
movq 0x70(%rsp),%rcx
addq $0x78,%rsp        # destroy stack frame
ret

Local variables go to [RSP + 0x00]...[RSP + 0x50]. When
more are required, increase the value you subtract
from RSP as well as all offsets related to RSP. To
shrink the reserved area, just decrease that value
and all offsets related to RSP.

Caution: Do not use PUSH or POP instructions - RSP
must not change under any circumstances! All stack
elements are addressed via [RSP + x], where x must
be smaller than the value you subtract from RSP in
the first line. If you call another function, then
the return address is pushed to [RSP - 8], leaving
your local stack frame untouched.
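
A small runnable sketch of this 'frozen RSP' style, assuming GCC or Clang on x86-64 Linux. The names sum_of_squares and square_c, the 0x18 frame size and the plain-C fallback are chosen for illustration only:

```c
/* Ordinary C function that the assembly routine calls. */
long square_c(long x) { return x * x; }

#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
/* long sum_of_squares(long a, long b): a in RDI, b in RSI, result in RAX.
   RSP is lowered once and stays fixed; everything is addressed as x(%rsp). */
__asm__(
    ".text\n"
    ".globl sum_of_squares\n"
    ".type sum_of_squares, @function\n"
    "sum_of_squares:\n"
    "    subq $0x18, %rsp\n"        /* create frame; 0x18 ends in 8, so RSP is 16-byte aligned */
    "    movq %r13, 0x08(%rsp)\n"   /* save callee-saved r13 */
    "    movq %rsi, 0x00(%rsp)\n"   /* local: keep b across the call */
    "    call square_c@PLT\n"       /* rax = a * a (a is already in rdi) */
    "    movq %rax, %r13\n"         /* r13 survives the next call */
    "    movq 0x00(%rsp), %rdi\n"   /* reload b as first argument */
    "    call square_c@PLT\n"       /* rax = b * b */
    "    addq %r13, %rax\n"         /* rax = a*a + b*b */
    "    movq 0x08(%rsp), %r13\n"   /* restore r13 */
    "    addq $0x18, %rsp\n"        /* destroy frame */
    "    ret\n"
);
long sum_of_squares(long a, long b);
#else
/* Fallback for other platforms (an assumption of this sketch). */
long sum_of_squares(long a, long b) { return square_c(a) + square_c(b); }
#endif
```

The value of b has to be parked in the frame because RSI may be overwritten by the callee, while R13 keeps its value across calls and therefore must be saved and restored by this function.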

RSP automatically is aligned to a 16 byte boundary
if it was properly aligned before the function was
called. Return addresses are pushed onto the stack
before the 1st instruction in a called function is
executed. If RSP was 0x1230 before the call, it is
0x1228 after the call. Subtracting any number end-
ing with 8 automatically aligns RSP: 1228 - 0078 =
11B0. This is the bottom of the local stack frame,
used to address local variables / saved registers.
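
The arithmetic in this paragraph can be checked with a few lines of C. The helper names are hypothetical; the sketch only models the 8-byte return address pushed by the call and the subtraction done in the first line of the function:

```c
#include <stdint.h>

/* RSP inside the callee after its frame has been created:
   the call pushes an 8-byte return address, then the function
   subtracts frame_size from RSP. */
uint64_t frame_bottom(uint64_t rsp_before_call, uint64_t frame_size)
{
    return rsp_before_call - 8 - frame_size;
}

/* Nonzero if the address is a multiple of 16. */
int is_16_byte_aligned(uint64_t address)
{
    return (address & 0xF) == 0;
}
```

With the numbers from the text, frame_bottom(0x1230, 0x78) gives 0x11B0, which is 16-byte aligned. A frame size of 0x80 would give 0x11A8, which is not.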


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/10/2010 1:50:58 PM

BGB / cr88192 wrote:

[snip]

>>> As for the prologue, I don't get why I need the first and last two lines:
>>>
>>> mov [rsp + 8], rcx
>>
>> This looks suspicious - it copies parameter four to
>> the location where parameter 15 had to be passed if
>> such a parameter existed.
>>
>> Is this really the first line of that function?
>>
>
> it is a Win64 prologue...
>
> this saves the first argument (passed in as a register) to the stack.
> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of
> stack for holding these arguments (larger items are passed as pointers to
> the item).
>
> nevermind that in the vast majority of cases, the first thing the function
> does is to store any register-based arguments to the stack.
>
>
> I think the OP was mixing Win64 and Linux code, hence part of his problem.

Yes. Quite weird, so I did not see the obvious
- or sorted it out as impossible... ;)

>>> push r15
>>> push r14
>>> push r13
>>
>> Saves those registers on the stack.
>>
>>> sub rsp, 16
>>
>> Reserves another 16 byte on the stack. We moved RSP
>> 40 byte down if we reach this point.
>>
>>> lea r13, [rsp + 128]
>>
>> Doesn't make sense in this context, except there is
>> some code preceding the first line you posted. The
>> 128 byte below the current RSP are called red zone,
>> but the red zone is below RSP, not above. When this
>> line was executed, R13 points to a location 80 byte
>> above the return address of your function - an area
>> 'belonging' to the calling function(s).
>>
>> Are you sure you posted the entire code?
>>
>
> it is a complete prologue at least...
>
> that is the last line of the prologue, which tells Win64 where the
> base-pointer is.
>
> it is also used in the epilogue, and often to index arguments and locals.

Agreed. The problem is - R13 points to a loca-
tion 80 byte above the return address. If this
register is used as basepointer, all writes go
to stack reserved by a preceding function. In
the best case, a return address is overwritten
with arbitrary data, causing a crash. In worse
cases, the program runs with bogus data. It is
hard to track down an error if functions alter
working parameters of other functions.

> however, it is optional, and one can instead choose to refer to everything
> relative to RSP.

That's what I do. On modern processors, MOV is
faster than PUSH and up to three MOV's execute
simultaneously on an Athlon. If proper offsets
are subtracted from RSP, stack alignment is no
issue, as well. Working with a 'frozen' RSP is
much easier than working with negative offsets
to an additional register (which is missing as
general purpose register for real work).


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/10/2010 1:55:36 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message 
news:hppvob$pbg$1@news.eternal-september.org...
> polemon wrote:
>
>>> it is a Win64 prologue...
>>>
>>> this saves the first argument (passed in as a register) to the stack.
>>> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of
>>> stack for holding these arguments (larger items are passed as pointers 
>>> to
>>> the item).
>>
>> Does that mean, I don't need to do that on Linux?
>
> It means you *must not* do it on Linux. If you do,
> strange things might happen. In the best case, the
> program crashes. It is very hard to track down the
> cause of an error if your function writes to stack
> locations belonging to other functions.
>

(directed to the OP):
the issue is more meta than this.

try thinking in terms of what the processor does, not what an HLL does and 
maps to the processor (or, what is going on, and not what you intend to 
accomplish).

how the processor sees things is a little different, and this is where one 
needs to understand the processor (the world of registers, opcodes, and 
memory, generally absent any higher structure than these themselves) to 
really understand what all is going on here (and why Win64 and Linux are so 
different despite both using the same processor).

to the processor, functions and function calls don't actually exist, but are 
instead an abstraction built on top of the processor (much as "the computer" 
is an abstraction built on top of electronic components, which are an 
abstraction over particular configurations of matter and energy, ...).


>>> nevermind that in the vast majority of cases, the first thing the 
>>> function
>>> does is to store any register-based arguments to the stack.
>>
>> This is getting a bit confusing. So, I *do* need to reserve space on the 
>> stack
>> for each register parameter (that my function uses)?
>
> Windows passes parameters in RCX, RDX, R8 and R9 -
> Linux in RDI, RSI, RDX, RCX, R8 and R9.
>
> Windows passes FP values in XMM0...XMM3 - Linux in
> XMM0...XMM7.
>
> If a mix of integer and FP values is to pass, Win-
> dows keeps order (e.g. ECX, XMM1, R8, XMM3), Linux
> does not (same example: EDI, XMM0, ESI, XMM1).
>
> For Windows, you have to subtract 32 byte from RSP
> before you call another function (they are used to
> store ECX, EDX, R8 and R9 if required). Linux does
> not know such a thing.
>
> For Linux, there's a 128 byte 'red zone' below RSP
> reserved for local data. No such thing in Windows.
>

yep, so far so good...
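
The register assignment described in the quoted text can be sketched in C.
This is only an illustration, not taken from either ABI document: the type
arg_kind and the functions win64_arg_reg and sysv_arg_reg are made-up names,
and the sketch covers only plain integer/pointer and floating-point scalars
(no structs, no varargs).

```c
#include <stddef.h>

/* Argument kinds for this sketch: integer/pointer or floating point. */
typedef enum { ARG_INT, ARG_FP } arg_kind;

static const char *const win64_int[] = { "RCX", "RDX", "R8", "R9" };
static const char *const win64_fp[]  = { "XMM0", "XMM1", "XMM2", "XMM3" };
static const char *const sysv_int[]  = { "RDI", "RSI", "RDX", "RCX", "R8", "R9" };
static const char *const sysv_fp[]   = { "XMM0", "XMM1", "XMM2", "XMM3",
                                         "XMM4", "XMM5", "XMM6", "XMM7" };

/* Win64: the position of the argument alone selects the slot, so an
 * FP argument in position 1 uses XMM1 even if it is the first FP value. */
const char *win64_arg_reg(const arg_kind *kinds, size_t index)
{
    if (index >= 4)
        return "stack";
    return kinds[index] == ARG_FP ? win64_fp[index] : win64_int[index];
}

/* SysV (Linux): integer and FP arguments are counted separately, so each
 * class fills its own register list from the start. */
const char *sysv_arg_reg(const arg_kind *kinds, size_t index)
{
    size_t n_int = 0, n_fp = 0;

    /* Count how many arguments of each class precede this one. */
    for (size_t i = 0; i < index; i++) {
        if (kinds[i] == ARG_FP)
            n_fp++;
        else
            n_int++;
    }
    if (kinds[index] == ARG_FP)
        return n_fp < 8 ? sysv_fp[n_fp] : "stack";
    return n_int < 6 ? sysv_int[n_int] : "stack";
}
```

For the mix (int, double, int, double), the Win64 function yields RCX, XMM1,
R8, XMM3 and the SysV function yields RDI, XMM0, RSI, XMM1, which matches the
example quoted above (written there with the 32-bit names ECX/EDI/ESI).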


>> I read in a book (which works with 32bit coding conventions, where 
>> everything
>> is passed on the stack), and there I was told, that I need to save the 
>> ESP to
>> EBP and put the old value of EBP on the stack. This is used to get a 
>> fixed
>> pointer to stack position where arguments are passed, and where space 
>> should
>> be reserved for local variables.
>>
>> How does that work in AMD64/Linux?
>
> If your function saves registers or puts temporary
> variables onto the stack, you may need a reference
> point where those data are stored. This is done by
> copying rSP to rBP (or any other register!) before
> you start to push or move anything onto the stack.
>
> If you use push instructions, rSP automatically is
> updated by the processor with every push. With mov
> instructions, you have to subtract the size of all
> saved data from rSP before you start to write them
> to the stack. Data between the current rSP and the
> stack element the basepointer 'points to' are safe
> from being overwritten by called functions.
>
> The reference (aka basepointer) is used to address
> elements in your local stack area. To access these
> elements, you have to use *negative* offsets - all
> elements are *below* that reference point! the 1st
> element is at [REF - 8], the second at [REF - 16],
> and so on (64 bit = 8 byte).

yep, but things differ here a little in the Win64 case...


>>> I think the OP was mixing Win64 and Linux code, hence part of his 
>>> problem.
>>
>> Yeah, the source I used, was not clear about Win64 and Linux conventions.
>
> If you want to preserve RCX in Linux, just push it
> onto the stack like R13...R15.
>

yep.


> BTW - I still don't have a clue what the last line
>
> lea r13, [rsp + 128]
>
> is good for. If you want to access data in the so-
> called 'red zone' (the 128 suggests this), then it
> should be
>
> lea r13, [rsp - 128]
>

this is also part of the Win64 ABI.
Win64 defines a number of things which are different from the 
Linux/AMD64/SysV calling convention.

in particular, there are a number of rules for forming proper prologues and 
epilogues, mostly since the code for the prologues and epilogues needs also 
to be able to be processed as data, which means it has to be formed 
according to certain rules (this is to allow such things as exception 
handling and stack unwinding, so the prologue and epilogue, as well as some 
aspects of stack and register usage, are carefully controlled, such that the 
stack can be automatically unwound and registers automatically restored).


lea r13, [rsp+128]

is a single example of such rules.
this operation is optional, but if present, it is defined.
the offset is because the offset is required to be present, and also a 
multiple of 16 (for whatever unknown reason).


in x86 cdecl, there was EBP, but code often bypassed it, leaving 
no way to perform a proper backtrace.

in Win64, certain prologue and epilogue rules are required, and to do 
otherwise is a violation of the ABI.

one can feel happy though that Win64 addresses matters of exception handling 
at the ABI and calling convention levels, rather than requiring debugging 
info (as is generally done on Linux, or IOW, one needs to deal with DWARF in 
order to safely have stack unwinding...).


> The 'red zone' is below, not above RSP. You should
> not use it, anyway (your function pushes registers
> onto the stack).
>
> A skeleton for your function might look like this:
>
> subq $0x78,%rsp        # create stack frame
> movq %r13,0x58(%rsp)   # store registers
> movq %r14,0x60(%rsp)
> movq %r15,0x68(%rsp)
> movq %rcx,0x70(%rsp)
> ...                    # function code
> ...
> ...
> movq 0x58(%rsp),%r13   # restore registers
> movq 0x60(%rsp),%r14
> movq 0x68(%rsp),%r15
> movq 0x70(%rsp),%rcx
> addq $0x78,%rsp        # destroy stack frame
> ret
>

yep.


> Local variables go to [RSP + 0x00]...[RSP + 0x50]. When
> more are required, increase the value you subtract
> from RSP as well as all offsets related to RSP. To
> shrink the reserved area, just decrease that value
> and all offsets related to RSP.
>
> Caution: Do not use PUSH or POP instructions - RSP
> must not change under any circumstances! All stack
> elements are addressed via [RSP + x], where x must
> be smaller than the value you subtract from RSP in
> the first line. If you call another function, then
> the return address is pushed to [RSP - 8], leaving
> your local stack frame untouched.
>
> RSP automatically is aligned to a 16 byte boundary
> if it was properly aligned before the function was
> called. Return addresses are pushed onto the stack
> before the 1st instruction in a called function is
> executed. If RSP was 0x1230 before the call, it is
> 0x1228 after the call. Subtracting any number end-
> ing with 8 automatically aligns RSP: 1228 - 0078 =
> 11B0. This is the bottom of the local stack frame,
> used to address local variables / saved registers.
>

yep, fair enough.
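
The alignment arithmetic in the quoted paragraph can be checked with a few
lines of C. This is just a sketch of the numbers from the post; the function
names rsp_at_entry, frame_bottom and is_aligned16 are invented for the
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* CALL pushes an 8-byte return address, so RSP is 8 lower on entry. */
uint64_t rsp_at_entry(uint64_t rsp_before_call)
{
    return rsp_before_call - 8;
}

/* Bottom of the local frame after "subq $frame_size,%rsp". */
uint64_t frame_bottom(uint64_t rsp_before_call, uint64_t frame_size)
{
    return rsp_at_entry(rsp_before_call) - frame_size;
}

/* True if the address is a multiple of 16. */
bool is_aligned16(uint64_t addr)
{
    return (addr & 0xF) == 0;
}
```

With RSP = 0x1230 before the call and a frame size of 0x78, the frame bottom
is 0x11B0, which is 16-byte aligned. A frame size ending in 0 (for example
0x70) would leave RSP at 0x11B8, which is not.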



Reply BGB 4/10/2010 9:34:53 PM

BGB / cr88192 wrote:

> "Bernhard Schornak"<schornak@nospicedham.web.de>  wrote in message
> news:hppvob$pbg$1@news.eternal-september.org...
>> polemon wrote:
>>
>>>> it is a Win64 prologue...
>>>>
>>>> this saves the first argument (passed in as a register) to the stack.
>>>> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes of
>>>> stack for holding these arguments (larger items are passed as pointers
>>>> to
>>>> the item).
>>>
>>> Does that mean, I don't need to do that on Linux?
>>
>> It means you *must not* do it on Linux. If you do,
>> strange things might happen. In the best case, the
>> program crashes. It is very hard to track down the
>> cause of an error if your function writes to stack
>> locations belonging to other functions.
>
> (directed to the OP):
> the issue is more meta than this.
>
> try thinking in terms of what the processor does, not what an HLL does and
> maps to the processor (or, what is going on, and not what you intend to
> accomplish).

I am one of the few people who write libraries and
apps entirely with (G)AS (since 1993 or so)... ;)

> how the processor sees things is a little differently, and this is where one
> needs to understand the processor (the world of registers, opcodes, and
> memory, generally absent any higher structure than these themselves) to
> really understand what all is going on here (and why Win64 and Linux are so
> different despite both using the same processor).

Having a closer look at both, I think the MS folks
did a better job (in this case). Linux is a little
bit confusing and adds more bloat to executables.

> to the processor, functions and function calls don't actually exist, but are
> instead an abstraction built on top of the processor (much as "the computer"
> is an abstraction built on top of electronic components, which are an
> abstraction over particular configurations of matter and energy, ...).

Where the thing we call matter is energy, as well.
Everything in our Universe either is energy, or it
is Nothing - the Emptiness where all energy moves,
interacts and expands to.

Nevertheless, this does not address the problem we
might encounter if we execute the posted code. The
program's stack might look like this:

-------------------------------

1308 function 0: return address
1300
..... function 0: local area
1280

1278 function 1: return address
1270
..... function 1: local area
1240

1238 function 2: return address
1230 stored R15
1228 stored R14
1220 stored R13
1210 variables

-------------------------------

RSP holds 1210 at this point. The next instruction

leaq 0x80(%rsp),%r13

loads 1290 into R13. Address 1290 holds local data
belonging to function 0 (the function who called a
function calling our function...).

As you can see, this cannot be valid. It would be valid
code if R13 held a parameter passed by function 0,
pointing to e. g. a string variable that had to be
processed, and function 1 called helpers, here our
function, to do the job. But R13 is loaded with an
offset to RSP - not with the content of a register
holding the address of a string stored in function
0's local stack area.

As long as we don't allow functions to guess where
something to process might be stored, the piece of
code violates some basic principles of interaction
between single parts building a greater whole.
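
The address arithmetic from the stack diagram above can be replayed in C. This
is only a sketch: the frame table is copied from the diagram (addresses are
hexadecimal, and each function's return address is counted as part of its
region, as the diagram labels it), and the names frame, frame_owner and
lea_target are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One region of the example stack: [low, end) belongs to `function`. */
typedef struct {
    int      function;
    uint64_t low;   /* lowest address, inclusive */
    uint64_t end;   /* one past the highest byte */
} frame;

/* Taken from the diagram above. */
static const frame frames[] = {
    { 0, 0x1280, 0x1310 },  /* function 0: locals + return address */
    { 1, 0x1240, 0x1280 },  /* function 1: locals + return address */
    { 2, 0x1210, 0x1240 },  /* function 2: variables, R13..R15, return address */
};

/* Returns the number of the function whose region holds addr, or -1. */
int frame_owner(uint64_t addr)
{
    for (size_t i = 0; i < sizeof frames / sizeof frames[0]; i++) {
        if (addr >= frames[i].low && addr < frames[i].end)
            return frames[i].function;
    }
    return -1;
}

/* Address computed by "leaq disp(%rsp),%reg". */
uint64_t lea_target(uint64_t rsp, int64_t disp)
{
    return rsp + (uint64_t)disp;
}
```

With RSP = 0x1210, lea_target(0x1210, 0x80) gives 0x1290, and frame_owner
reports function 0 for that address, two frames above the running function.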

>>>> nevermind that in the vast majority of cases, the first thing the
>>>> function
>>>> does is to store any register-based arguments to the stack.
>>>
>>> This is getting a bit confusing. So, I *do* need to reserve space on the
>>> stack
>>> for each register parameter (that my function uses)?
>>
>> Windows passes parameters in RCX, RDX, R8 and R9 -
>> Linux in RDI, RSI, RDX, RCX, R8 and R9.
>>
>> Windows passes FP values in XMM0...XMM3 - Linux in
>> XMM0...XMM7.
>>
>> If a mix of integer and FP values is to pass, Win-
>> dows keeps order (e.g. ECX, XMM1, R8, XMM3), Linux
>> does not (same example: EDI, XMM0, ESI, XMM1).
>>
>> For Windows, you have to subtract 32 byte from RSP
>> before you call another function (they are used to
>> store ECX, EDX, R8 and R9 if required). Linux does
>> not know such a thing.
>>
>> For Linux, there's a 128 byte 'red zone' below RSP
>> reserved for local data. No such thing in Windows.
>>
>
> yep, so far so good...
>
>
>>> I read in a book (which works with 32bit coding conventions, where
>>> everything
>>> is passed on the stack), and there I was told, that I need to save the
>>> ESP to
>>> EBP and put the old value of EBP on the stack. This is used to get a
>>> fixed
>>> pointer to stack position where arguments are passed, and where space
>>> should
>>> be reserved for local variables.
>>>
>>> How does that work in AMD64/Linux?
>>
>> If your function saves registers or puts temporary
>> variables onto the stack, you may need a reference
>> point where those data are stored. This is done by
>> copying rSP to rBP (or any other register!) before
>> you start to push or move anything onto the stack.
>>
>> If you use push instructions, rSP automatically is
>> updated by the processor with every push. With mov
>> instructions, you have to subtract the size of all
>> saved data from rSP before you start to write them
>> to the stack. Data between the current rSP and the
>> stack element the basepointer 'points to' are safe
>> from being overwritten by called functions.
>>
>> The reference (aka basepointer) is used to address
>> elements in your local stack area. To access these
>> elements, you have to use *negative* offsets - all
>> elements are *below* that reference point! the 1st
>> element is at [REF - 8], the second at [REF - 16],
>> and so on (64 bit = 8 byte).
>
> yep, but things differ here a little in the Win64 case...

Okay, but polemon is writing code for Linux. It is
superfluous to explain how this is done on another
platform. He surely isn't interested in that if he
wants to create a working app for Linux within the
current decade... ;)

By the way, there is more to it. Conventional code
requires the base pointer because it is the *only*
reference to the return address of a running func-
tion. In most environments, rSP isn't properly up-
dated between function calls, so it often holds an
arbitrary value whenever the function logic issues
a return to the caller. Without a basepointer, the
function would not know where its return address is
stored...

An alternative is to work with a fixed rSP, but it
excludes using PUSH, POP and any other instruction
where rSP is updated automatically.

>> BTW - I still don't have a clue what the last line
>>
>> lea r13, [rsp + 128]
>>
>> is good for. If you want to access data in the so-
>> called 'red zone' (the 128 suggests this), then it
>> should be
>>
>> lea r13, [rsp - 128]
>>
>
> this is also part of the Win64 ABI.
> Win64 defines a number of things which are different from the
> Linux/AMD64/SysV calling convention.

128 byte is the size of Linux' red zone - I guess,
polemon wanted to set R13 to its beginning, but it
definitely is the wrong direction (upwards instead
of downwards). Moreover, the red zone shouldn't be
used if a function PUSHes data onto the stack.

> in particular, there are a number of rules for forming proper prologues and
> epilogues, mostly since the code for the prologues and epilogues needs also
> to be able to be processed as data, which means it has to be formed
> according to certain rules (this is to allow such things as exception
> handling and stack unwinding, so the prologue and epilogue, as well as some
> aspects of stack and register usage, are carefully controlled, such that the
> stack can be automatically unwound and registers automatically restored).
>
> lea r13, [rsp+128]
>
> is a single example of such rules.
> this operation is optional, but if present, it is defined.

As shown above, it is not valid code in our special
case. Even if this were a Windows app, it would address
stack areas not belonging to the running function.

Stack above RSP on function entry is taboo for any
called function. The only exception from this rule
is 64 bit Win, where the so called shadow space at
08...27[RSP] is defined as a temporary storage for
the called function. But: 80[RSP] is far above the
shadow space of the running function, isn't it?

Even if we introduced exception handling and other
things, this line would be invalid. Assuming we run on
Win64, our shadow space is located at 1240...125F.

Our function's stack, again:

1260 caller's stack (bottom)
1240 shadow space
1238 return address
1230 stored R15
1228 stored R14
1220 stored R13
1210 local area (subq $0x10,%rsp)

125F is the topmost byte belonging to us, anything
above belongs to our caller (or even its callers).

The only thing we know is where the stack frame of
our caller begins (1260). We neither have any info
about our caller's stack size nor about which data
are stored where in it. Consequently, we should not
touch anything beyond the last byte inside our own
local stack area. Accessing data outside our frame
is a very bad idea. Passing pointers to such areas
to an error handler (or any other external entity)
is even worse: It pretends ownership of 'property'
not belonging to us.
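
The Win64 bounds used above can be checked with a small C sketch. It assumes
the layout from the listing (RSP = 0x1238 on function entry, pointing at the
return address); the function names are invented for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Win64: the 32-byte shadow space sits directly above the return
 * address, i.e. at [entry_rsp + 0x08] ... [entry_rsp + 0x27]. */
bool in_shadow_space(uint64_t entry_rsp, uint64_t addr)
{
    return addr >= entry_rsp + 0x08 && addr <= entry_rsp + 0x27;
}

/* Topmost byte the called function may touch under this rule. */
uint64_t highest_owned_byte(uint64_t entry_rsp)
{
    return entry_rsp + 0x27;
}
```

For entry_rsp = 0x1238 this gives the range 0x1240...0x125F, and the address
0x1290 produced by the lea (0x1210 + 0x80) falls outside of it.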

> the offset is because the offset is required to be present, and also a
> multiple of 16 (for whatever unknown reason).

The stack should be aligned to multiples of 16, so
we can use XMM instructions without adding tons of
(superfluous) alignments preceding them.

> in x86 cdecl, there was EBP, but code often bypassed it, leaving there being
> no way to perform a proper backtrace.
>
> in Win64, certain prologue and epilogue rules are required, and to do
> otherwise is a violation of the ABI.
>
> one can feel happy though that Win64 addresses matters of exception handling
> at the ABI and calling convention levels, rather than requiring debugging
> info (as is generally done on Linux, or IOW, one needs to deal with DWARF in
> order to safely have stack unwinding...).

Agreed. The MS folks are one nose ahead with their
conventions at the moment. Nevertheless - it still
is too much bloat. 1k code and 2k resources result
in a 20 k application (compiled with mingw64). 17k
never used runtime stuff and exception handling is
a bad ratio between 'payload' and 'inert mass'. ;)


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/11/2010 8:20:37 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message 
news:hptaup$no3$1@news.eternal-september.org...
> BGB / cr88192 wrote:
>
>> "Bernhard Schornak"<schornak@nospicedham.web.de>  wrote in message
>> news:hppvob$pbg$1@news.eternal-september.org...
>>> polemon wrote:
>>>
>>>>> it is a Win64 prologue...
>>>>>
>>>>> this saves the first argument (passed in as a register) to the stack.
>>>>> Win64 may pass up to 4 arguments in registers, and reserves 32-bytes 
>>>>> of
>>>>> stack for holding these arguments (larger items are passed as pointers
>>>>> to
>>>>> the item).
>>>>
>>>> Does that mean, I don't need to do that on Linux?
>>>
>>> It means you *must not* do it on Linux. If you do,
>>> strange things might happen. In the best case, the
>>> program crashes. It is very hard to track down the
>>> cause of an error if your function writes to stack
>>> locations belonging to other functions.
>>
>> (directed to the OP):
>> the issue is more meta than this.
>>
>> try thinking in terms of what the processor does, not what an HLL does 
>> and
>> maps to the processor (or, what is going on, and not what you intend to
>> accomplish).
>
> I am one of the few people who write libraries and
> apps entirely with (G)AS (since 1993 or so)... ;)
>

yep.


>> how the processor sees things is a little differently, and this is where 
>> one
>> needs to understand the processor (the world of registers, opcodes, and
>> memory, generally absent any higher structure than these themselves) to
>> really understand what all is going on here (and why Win64 and Linux are 
>> so
>> different despite both using the same processor).
>
> Having a closer look at both, I think the MS folks
> did a better job (in this case). Linux is a little
> bit confusing and adds more bloat to executables.
>

I agree...

the SysV/AMD64 calling convention is a good deal more complicated than it 
needs to be, and I don't believe it is as well-suited to the sorts of coding 
practices used in most apps (nor, for that matter, the typical performance 
behavior on typical processors).

so, Win64 may be a little better in several ways, although in general I am 
not as impressed with MSVC's code output.


>> to the processor, functions and function calls don't actually exist, but 
>> are
>> instead an abstraction built on top of the processor (much as "the 
>> computer"
>> is an abstraction built on top of electronic components, which are an
>> abstraction over particular configurations of matter and energy, ...).
>
> Where the thing we call matter is energy, as well.
> Everything in our Universe either is energy, or it
> is Nothing - the Emptiness where all energy moves,
> interacts and expands to.
>
> Nevertheless, this does not address the problem we
> might encounter if we execute the posted code. The
> program's stack might look like this:
>
> -------------------------------
>
> 1308 function 0: return address
> 1300
> .... function 0: local area
> 1280
>
> 1278 function 1: return address
> 1270
> .... function 1: local area
> 1240
>
> 1238 function 2: return address
> 1230 stored R15
> 1228 stored R14
> 1220 stored R13
> 1210 variables
>
> -------------------------------
>
> RSP holds 1210 at this point. The next instruction
>
> leaq 0x80(%rsp),%r13
>
> loads 1290 into R13. Address 1290 holds local data
> belonging to function 0 (the function who called a
> function calling our function...).
>
> As you can see, this cannot be valid. It was valid
> code if R13 held a parameter passed by function 0,
> pointing to e. g. a string variable that had to be
> processed, and function 1 called helpers, here our
> function, to do the job. But R13 is loaded with an
> offset to RSP - not with the content of a register
> holding the address of a string stored in function
> 0's local stack area.
>
> As long as we don't allow functions to guess where
> something to process might be stored, the piece of
> code violates some basic principles of interaction
> between single parts building a greater whole.
>

all this is only an issue if one assumes that R13 needs to point within the 
local function's stack frame.
AFAIK, this is not assumed by the Win64 ABI, but it is assumed that one 
adds/subtracts an offset.

there is no problem if one doesn't access any space outside the current 
stack frame, regardless of how far offset the base-pointer is.

in the case of my compiler, I think the offset is fairly arbitrary, just 
there so that the unwinder code/... is happy.

>>>>> nevermind that in the vast majority of cases, the first thing the
>>>>> function
>>>>> does is to store any register-based arguments to the stack.
>>>>
>>>> This is getting a bit confusing. So, I *do* need to reserve space on 
>>>> the
>>>> stack
>>>> for each register parameter (that my function uses)?
>>>
>>> Windows passes parameters in RCX, RDX, R8 and R9 -
>>> Linux in RDI, RSI, RDX, RCX, R8 and R9.
>>>
>>> Windows passes FP values in XMM0...XMM3 - Linux in
>>> XMM0...XMM7.
>>>
>>> If a mix of integer and FP values is to pass, Win-
>>> dows keeps order (e.g. ECX, XMM1, R8, XMM3), Linux
>>> does not (same example: EDI, XMM0, ESI, XMM1).
>>>
>>> For Windows, you have to subtract 32 byte from RSP
>>> before you call another function (they are used to
>>> store ECX, EDX, R8 and R9 if required). Linux does
>>> not know such a thing.
>>>
>>> For Linux, there's a 128 byte 'red zone' below RSP
>>> reserved for local data. No such thing in Windows.
>>>
>>
>> yep, so far so good...
>>
>>
>>>> I read in a book (which works with 32bit coding conventions, where
>>>> everything
>>>> is passed on the stack), and there I was told, that I need to save the
>>>> ESP to
>>>> EBP and put the old value of EBP on the stack. This is used to get a
>>>> fixed
>>>> pointer to stack position where arguments are passed, and where space
>>>> should
>>>> be reserved for local variables.
>>>>
>>>> How does that work in AMD64/Linux?
>>>
>>> If your function saves registers or puts temporary
>>> variables onto the stack, you may need a reference
>>> point where those data are stored. This is done by
>>> copying rSP to rBP (or any other register!) before
>>> you start to push or move anything onto the stack.
>>>
>>> If you use push instructions, rSP automatically is
>>> updated by the processor with every push. With mov
>>> instructions, you have to subtract the size of all
>>> saved data from rSP before you start to write them
>>> to the stack. Data between the current rSP and the
>>> stack element the basepointer 'points to' are safe
>>> from being overwritten by called functions.
>>>
>>> The reference (aka basepointer) is used to address
>>> elements in your local stack area. To access these
>>> elements, you have to use *negative* offsets - all
>>> elements are *below* that reference point! the 1st
>>> element is at [REF - 8], the second at [REF - 16],
>>> and so on (64 bit = 8 byte).
>>
>> yep, but things differ here a little in the Win64 case...
>
> Okay, but polemon is writing code for Linux. It is
> superfluous to explain how this is done on another
> platform. He surely isn't interested in that if he
> wants to create a working app for Linux within the
> current decade... ;)
>
> By the way, there is more to it. Conventional code
> requires the base pointer because it is the *only*
> reference to the return address of a running func-
> tion. In most environments, rSP isn't properly up-
> dated between function calls, so it often holds an
> arbitrary value whenever the function logic issues
> a return to the caller. Without a basepointer, the
> function did not know where its return address was
> stored...
>
> An alternative is to work with a fixed rSP, but it
> excludes using PUSH, POP and any other instruction
> where rSP is updated automatically.
>

actually, it is not that much of a problem, since all one really has to do 
is keep track of where rSP is at compile time. push/pop/... are safe so long 
as the compiler knows about it, and can update its internal state 
accordingly.


the great limit though is that, yes, it does require the stack frame to be 
fixed-size at compile time. the point of splitting rSP-relative and rBP/... 
relative indexing is that it allows for an indeterminate amount of stack 
space to exist between the current top of the stack and the fixed-position 
locals and arguments.

it also helps notably with hand-written code, or with naive compilers, since 
it doesn't require exactly keeping track of where everything is located.
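
The bookkeeping described here (tracking rSP at compile time so that
push/pop stay safe) might look roughly like the following C sketch. It is not
taken from any real compiler; rsp_tracker and its helper functions are
hypothetical names, and it assumes every push/pop moves rSP by 8 bytes.

```c
#include <stdint.h>

/* Compile-time state: how many bytes rSP currently sits below the
 * value it had on function entry. */
typedef struct {
    int64_t depth;
} rsp_tracker;

/* Record the effect of each stack-changing instruction that is emitted. */
void track_push(rsp_tracker *t)                { t->depth += 8; }
void track_pop(rsp_tracker *t)                 { t->depth -= 8; }
void track_sub(rsp_tracker *t, int64_t bytes)  { t->depth += bytes; }

/* Translate an offset relative to the entry rSP (what a base pointer
 * would use) into a displacement from the current rSP. */
int64_t rsp_disp(const rsp_tracker *t, int64_t entry_offset)
{
    return entry_offset + t->depth;
}
```

After "sub rsp, 16" a local at [entry - 8] is addressed as [rsp + 8]; after
one further push the same local is at [rsp + 16], and the tracker hands out
the adjusted displacement without needing rBP.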



>>> BTW - I still don't have a clue what the last line
>>>
>>> lea r13, [rsp + 128]
>>>
>>> is good for. If you want to access data in the so-
>>> called 'red zone' (the 128 suggests this), then it
>>> should be
>>>
>>> lea r13, [rsp - 128]
>>>
>>
>> this is also part of the Win64 ABI.
>> Win64 defines a number of things which are different from the
>> Linux/AMD64/SysV calling convention.
>
> 128 byte is the size of Linux' red zone - I guess,
> polemon wanted to set R13 to its beginning, but it
> definitely is the wrong direction (upwards instead
> of downwards). Moreover, the red zone shouldn't be
> used if a function PUSHes data onto the stack.
>

I don't know, the OP didn't himself say anything about the red-zone from 
what I remember.
this number could have very well been coincidence.


>> in particular, there are a number of rules for forming proper prologues 
>> and
>> epilogues, mostly since the code for the prologues and epilogues needs 
>> also
>> to be able to be processed as data, which means it has to be formed
>> according to certain rules (this is to allow such things as exception
>> handling and stack unwinding, so the prologue and epilogue, as well as 
>> some
>> aspects of stack and register usage, are carefully controlled, such that 
>> the
>> stack can be automatically unwound and registers automatically restored).
>>
>> lea r13, [rsp+128]
>>
>> is a single example of such rules.
>> this operation is optional, but if present, it is defined.
>
> As shown above, it is no valid code in our special
> case. Even if this was a Windows app, it addressed
> stack areas not belonging to the running function.
>
> Stack above RSP on function entry is taboo for any
> called function. The only exception from this rule
> is 64 bit Win, where the so called shadow space at
> 08...27[RSP] is defined as a temporary storage for
> the called function. But: 80[RSP] is far above the
> shadow space of the running function, isn't it?
>

it is not as simple as this, especially when one gets to matters of passing 
and returning structs (this is where the calling conventions really 
differ...).

but, yeah, normally one would not write into space outside the current 
function, but little is to say that the base-pointer can't point there, or 
for that matter, damn near anywhere else within the confines of the ABI.

in fact, the most notable thing about 128 is that it is the smallest value 
which would force a 32-bit displacement to be used. [rsp+64], for example, 
would most likely use an 8-bit displacement (it is unlikely NASM would 
otherwise infer that it needs 32 bits for the offset...).
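
For reference: in 64-bit mode, x86 memory operands encode a displacement in
either 8 or 32 bits (there is no 16-bit form), so 128 is the smallest positive
offset that no longer fits the short encoding. A minimal C sketch of that
rule, with the made-up name disp_size, and assuming the base register is not
RBP or R13 (those always need a displacement byte):

```c
#include <stdint.h>

/* Number of displacement bytes an assembler would typically pick for
 * [base + disp] in 64-bit mode:
 *   0 -> no displacement, 1 -> signed 8-bit, 4 -> signed 32-bit. */
int disp_size(int32_t disp)
{
    if (disp == 0)
        return 0;
    if (disp >= -128 && disp <= 127)
        return 1;
    return 4;
}
```

So [rsp+64] and [rsp+127] get a one-byte displacement, while [rsp+128] needs
four bytes; [rsp-128] still fits in one byte because the field is signed.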



> Even if we introduced exception handling and other
> things, this line was invalid. Assuming, we run on
> Win64, our shadow space is located at 1240...125F.
>
> Our function's stack, again:
>
> 1260 caller's stack (bottom)
> 1240 shadow space
> 1238 return address
> 1230 stored R15
> 1228 stored R14
> 1220 stored R13
> 1210 local area (subq 0x10,%rsp)
>
> 125F is the topmost byte belonging to us, anything
> above belongs to our caller (or even its callers).
>
> The only thing we know is where the stack frame of
> our caller begins (1260). We neither have any info
> about our callers stack size nor about where which
> data are stored in it. Consequently, we should not
> touch anything beyond the last byte inside our own
> local stack area. Accessing data outside our frame
> is a very bad idea. Passing pointers to such areas
> to an error handler (or any other external entity)
> is even worse: It pretends ownership of 'property'
> not belonging to us.
>

if nothing there is addressed, nothing matters here.

lea r13, [rsp+4096]
mov rax, [r13-4092]

this is in fact often how the basepointer is used in Win64 code...

typically the same offset is reversed as part of the epilogue:
lea rsp, [r13-4096]
pop ...
....
ret
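
The arithmetic behind this biased base pointer can be written out in C. This
is only a sketch of the address calculation, with invented function names and
the bias of 4096 from the snippet above; it does not model the unwinder.

```c
#include <stdint.h>

/* lea r13, [rsp + bias]: the base pointer is rSP plus a fixed bias. */
uint64_t set_base_pointer(uint64_t rsp, uint64_t bias)
{
    return rsp + bias;
}

/* mov rax, [r13 + disp]: the address actually accessed. */
uint64_t base_access(uint64_t base, int64_t disp)
{
    return base + (uint64_t)disp;
}

/* lea rsp, [r13 - bias]: the epilogue undoes the bias. */
uint64_t restore_rsp(uint64_t base, uint64_t bias)
{
    return base - bias;
}
```

With rsp = 0x1210 the base pointer becomes 0x2210, but [r13-4092] resolves to
0x1214, i.e. [rsp+4], which is inside the local frame. The pointer itself is
far away; the accesses made through it are not.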


>> the offset is because the offset is required to be present, and also a
>> multiple of 16 (for whatever unknown reason).
>
> The stack should be aligned to multiples of 16, so
> we can use XMM instructions without adding tons of
> (superfluous) alignments preceding them.
>

the offset itself though is required to be a multiple of 16, but it is not 
necessarily the case AFAIK that the stack is aligned by 16 at the time the 
lea is used.


>> in x86 cdecl, there was EBP, but code often bypassed it, leaving there 
>> being
>> no way to perform a proper backtrace.
>>
>> in Win64, certain prologue and epilogue rules are required, and to do
>> otherwise is a violation of the ABI.
>>
>> one can feel happy though that Win64 addresses matters of exception 
>> handling
>> at the ABI and calling convention levels, rather than requiring debugging
>> info (as is generally done on Linux, or IOW, one needs to deal with DWARF 
>> in
>> order to safely have stack unwinding...).
>
> Agreed. The MS folks are one nose ahead with their
> conventions at the moment. Nevertheless - it still
> is too much bloat. 1k code and 2k resources result
> in a 20 k application (compiled with mingw64). 17k
> never used runtime stuff and exception handling is
> a bad ratio between 'payload' and 'inert mass'. ;)
>

however, it can also be noted that MSVC has exception-handling in C code, 
and Windows itself throws internal exceptions occasionally.

thus, broken exception handling risks crashing...

so, it was either all this, or probably the strategy used in Win32 SEH, 
which was notable in that it made exception handling fairly expensive, and 
also didn't readily allow generalized unwinding.

this may matter some as it is common in non-C languages (such as C++, Java, 
and C#) to use exception handling as a sort of control-flow mechanism (IOW: 
for non-exceptional situations), and so there is some value in trying to 
have fast exceptions...



Reply BGB 4/11/2010 10:40:50 PM

On Sun, 11 Apr 2010 15:40:50 -0700
"BGB / cr88192" <cr88192@nospicedham.hotmail.com> wrote:

> I don't know, the OP didn't himself say anything about the red-zone from
> what I remember.
> this number could have very well been coincidence.

I heard about 'red-zone' for the first time in this thread.

From what I understand, using the registers for passing data is quite
straightforward (I tested around a bit and it seems to work the way I
imagined it to work).

But I am still quite unsure about using the stack.
From what I understand, I should avoid PUSHing/POPing to/from the stack if
possible. Should I just use labels in .bss for variables instead?

I was searching the web for information about the red-zone, but there hardly
is any.

I'm programming for Linux, just to clear the confusion about Windows/Linux
conventions.
The prologue I had in the first (referenced) post was what I'd found on the
Internet as "a prologue". I wasn't aware of the fact that Windows and Linux
handle parameters differently.

Since the first six parameters for floats and integers are passed via regis=
ters,
the 7th paramter should be on the stack, but at which offset of RSP?
Does the offset depend on the type of the passed value? Since this is for
64 bit, I'd assume, that all parameters passed on the stack should be 64bits
in length.

Since the red-zone is below the stack pointer, how should I push a 7th
parameter on the stack for a function call? From what I understand, I'd
be pushing that into the red-zone, since the pointer decreases with every
push.

While playing around with passing values with registers, I was asking myself
if I actually need much of a prologue and epilogue, since I can 'save' the
registers I need in my function, in variables in .bss section.
It seems as if I shouldn't use the stack wherever possible.

Anyways, thanks a lot for your help so far!
I hope however, this thread doesn't get derailed into OS-wars of some sorts...

Regards,
--polemon

-- 
echo "cbyrzba@cbyrzba.bet" | tr '[a-z]' '[n-za-m]'

0
Reply polemon 4/12/2010 2:51:11 AM

On 04/11/2010 07:51 PM, polemon wrote:
> 
> Since the first six parameters for floats and integers are passed via registers,
> the 7th parameter should be on the stack, but at which offset of RSP?
> Does the offset depend on the type of the passed value? Since this is for
> 64 bit, I'd assume, that all parameters passed on the stack should be 64bits
> in length.
> 
> Since the red-zone is below the stack pointer, how should I push a 7th
> parameter on the stack for a function call? From what I understand, I'd
> be pushing that into the red-zone, since the pointer decreases with every
> push.
> 

No, the 7th parameter is above the stack pointer; the redzone is below
the stack pointer.  Specifically, the 7th parameter is at [rsp+8] at the
entry to the function.
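
To make the register/stack split concrete from the C side, here is a minimal sketch, assuming x86-64 Linux (System V AMD64 ABI). The function name sum8 is made up for illustration. The register and stack locations in the comment are what that ABI specifies for integer arguments; the C code itself cannot check them, it only provides a function whose entry point you could inspect in a debugger or implement in assembly.

```c
#include <assert.h>

/* Hypothetical example function with eight integer arguments.
 * Under the System V AMD64 ABI (Linux) they arrive as:
 *   a -> RDI, b -> RSI, c -> RDX, d -> RCX, e -> R8, f -> R9
 *   g -> [rsp+8]  at function entry (7th argument, first one on the stack)
 *   h -> [rsp+16] at function entry
 * [rsp+0] holds the return address pushed by CALL. */
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h)
{
    return a + b + c + d + e + f + g + h;
}
```

An assembly replacement for sum8 would read g and h from those two stack slots before it pushes anything or adjusts RSP.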

> While playing around with passing values with registers, I was asking myself 
> if I actually need much of a prologue and epilogue, since I can 'save' the
> registers I need in my function, in variables in .bss section.
> It seems as if I shouldn't use the stack wherever possible.

Urk, no, that's actually both slower and makes your function non-reentrant.

	-hpa
0
Reply H 4/12/2010 4:47:06 AM

BGB / cr88192 wrote:

> Bernhard Schornak wrote:
>
>> BGB / cr88192 wrote:
>>
>>> Bernhard Schornak wrote:

[snip]

>>> to the processor, functions and function calls don't actually exist, but
>>> are
>>> instead an abstraction built on top of the processor (much as "the
>>> computer"
>>> is an abstraction built on top of electronic components, which are an
>>> abstraction over particular configurations of matter and energy, ...).
>>
>> Where the thing we call matter is energy, as well.
>> Everything in our Universe either is energy, or it
>> is Nothing - the Emptiness where all energy moves,
>> interacts and expands to.
>>
>> Nevertheless, this does not address the problem we
>> might encounter if we execute the posted code. The
>> program's stack might look like this:
>>
>> -------------------------------
>>
>> 1308 function 0: return address
>> 1300
>> .... function 0: local area
>> 1280
>>
>> 1278 function 1: return address
>> 1270
>> .... function 1: local area
>> 1240
>>
>> 1238 function 2: return address
>> 1230 stored R15
>> 1228 stored R14
>> 1220 stored R13
>> 1210 variables
>>
>> -------------------------------
>>
>> RSP holds 1210 at this point. The next instruction
>>
>> leaq 0x80(%rsp),%r13
>>
>> loads 1290 into R13. Address 1290 holds local data
>> belonging to function 0 (the function who called a
>> function calling our function...).
>>
>> As you can see, this cannot be valid. It was valid
>> code if R13 held a parameter passed by function 0,
>> pointing to e. g. a string variable that had to be
>> processed, and function 1 called helpers, here our
>> function, to do the job. But R13 is loaded with an
>> offset to RSP - not with the content of a register
>> holding the address of a string stored in function
>> 0's local stack area.
>>
>> As long as we don't allow functions to guess where
>> something to process might be stored, the piece of
>> code violates some basic principles of interaction
>> between single parts building a greater whole.
>>
>
> all this is only an issue if one assumes that R13 needs to point within the
> local functions stack frame.

If we did not need R13 for any reason, we could omit
the entire line...

> AFAIK, this is not assumed by the Win64 ABI, but it is assumed that one
> adds/subtracts an offset.

I have not seen such a demand in any document. Could
you post a link where such conventions are defined?

> there is no problem if one doesn't access any space outside the current
> stack frame, regardless of how far offset the base-pointer is.

It does not make sense to provide 'speculative' data
anywhere. What should it be good for? Actually, it's
wasting precious clock cycles which could be used to
perform real work.

> in the case of my compiler, I think the offset is fairly arbitrary, just
> there so that the unwinder code/... is happy.

Unfortunately, I haven't seen it, yet, so I am quite
'uninformed' at the moment. If you post a link, I'll
have a look at it.

[snip]

>>>>> I read in a book (which works with 32bit coding conventions, where
>>>>> everything
>>>>> is passed on the stack), and there I was told, that I need to save the
>>>>> ESP to
>>>>> EBP and put the old value of EBP on the stack. This is used to get a
>>>>> fixed
>>>>> pointer to stack position where arguments are passed, and where space
>>>>> should
>>>>> be reserved for local variables.
>>>>>
>>>>> How does that work in AMD64/Linux?
>>>>
>>>> If your function saves registers or puts temporary
>>>> variables onto the stack, you may need a reference
>>>> point where those data are stored. This is done by
>>>> copying rSP to rBP (or any other register!) before
>>>> you start to push or move anything onto the stack.
>>>>
>>>> If you use push instructions, rSP automatically is
>>>> updated by the processor with every push. With mov
>>>> instructions, you have to subtract the size of all
>>>> saved data from rSP before you start to write them
>>>> to the stack. Data between the current rSP and the
>>>> stack element the basepointer 'points to' are safe
>>>> from being overwritten by called functions.
>>>>
>>>> The reference (aka basepointer) is used to address
>>>> elements in your local stack area. To access these
>>>> elements, you have to use *negative* offsets - all
>>>> elements are *below* that reference point! the 1st
>>>> element is at [REF - 8], the second at [REF - 16],
>>>> and so on (64 bit = 8 byte).
>>>
>>> yep, but things differ here a little in the Win64 case...
>>
>> Okay, but polemon is writing code for Linux. It is
>> superfluous to explain how this is done on another
>> platform. He surely isn't interested in that if he
>> wants to create a working app for Linux within the
>> current decade... ;)
>>
>> By the way, there is more to it. Conventional code
>> requires the base pointer because it is the *only*
>> reference to the return address of a running func-
>> tion. In most environments, rSP isn't properly up-
>> dated between function calls, so it often holds an
>> arbitrary value whenever the function logic issues
>> a return to the caller. Without a basepointer, the
>> function did not know where its return address was
>> stored...
>>
>> An alternative is to work with a fixed rSP, but it
>> excludes using PUSH, POP and any other instruction
>> where rSP is updated automatically.
>>
>
> actually, it is not that much of a problem, since all one really has to do
> is keep track of where rSP is at compile time. push/pop/... are safe so long
> as the compiler knows about it, and can update its internal state
> accordingly.

Agreed. On the other hand, keeping track of the con-
tent of rSP adds some avoidable overhead to the com-
piler's code as well as to each executable.

> the great limit though is that, yes, it does require the stack frame to be
> fixed-size at compile time, so the split between rSP relative and rBP/...
> relative indexing is that it allows for an indeterminate amount of stack
> space to exist between the local stack spot, and the fixed-position locals
> and arguments.

Applies to a fixed rSP, as well. There's nothing you
could not do with the one register solution - except
that ENTER, LEAVE, POP and PUSH *must not* be used.

A properly designed 'fixed rSP environment' benefits
greatly from recent processors' accelerating mecha-
nisms where the equivalent conventional code suffers
from stalls introduced with the required rSP updates
which are part of the PUSH and POP instruction. PUSH
executes at a rate of one PUSH every two clocks, POP
at a rate of one POP every three clocks. Replacement
of PUSH / POP sequences with simple MOV instructions
reduces execution time to one MOV per cycle. In case
of ordered writes to continuous stack locations, the
write combining mechanism is triggered - allowing to
store 8 qwords (64 byte) in one gulp (three clocks).
Those 64 byte must reside in one cache line, though,
but it is no problem to align stack frames to entire
cache lines. If an OS is designed for high speed, it
should provide such mechanisms with appropriate ABI-
definitions.

Given clocks are for AMD Athlon. Even if there are 3
execution units, memory reads and writes are limited
to one access per clock cycle - this is a limitation
of the memory interface, not the processor itself.

> it also helps notably with hand-written code, or with naive compilers, since
> it doesn't require exactly keeping track of where everything is located.

A questionable argument... ;)

Using a fixed rSP allows programmers to use positive
offsets rather than poking around with negative off-
sets to the basepointer (which is a common source of
many 'not that obvious' and 'hard to find' errors).

[snip]

>>> in particular, there are a number of rules for forming proper prologues
>>> and
>>> epilogues, mostly since the code for the prologues and epilogues needs
>>> also
>>> to be able to be processed as data, which means it has to be formed
>>> according to certain rules (this is to allow such things as exception
>>> handling and stack unwinding, so the prologue and epilogue, as well as
>>> some
>>> aspects of stack and register usage, are carefully controlled, such that
>>> the
>>> stack can be automatically unwound and registers automatically restored).
>>>
>>> lea r13, [rsp+128]
>>>
>>> is a single example of such rules.
>>> this operation is optional, but if present, it is defined.
>>
>> As shown above, it is no valid code in our special
>> case. Even if this was a Windows app, it addressed
>> stack areas not belonging to the running function.
>>
>> Stack above RSP on function entry is taboo for any
>> called function. The only exception from this rule
>> is 64 bit Win, where the so called shadow space at
>> 08...27[RSP] is defined as a temporary storage for
>> the called function. But: 80[RSP] is far above the
>> shadow space of the running function, isn't it?
>>
>
> it is not as simple as this, especially when one gets to matters of passing
> and returning structs (this is where the calling conventions really
> differ...).
>
> but, yeah, normally one would not write into space outside the current
> function, but little is to say that the base-pointer can't point there, or
> for that matter, damn near anywhere else within the confines of the ABI.

I disagree. Properly written code never should point
to any location outside its private frame if it does
not reference an address passed by the caller (where
processing those data is expected).

It simply violates basic programming principles. Not
more, not less. In other words: It's very bad coding
style and should not be used in real code.

> in fact, the most notable thing about 128 is that it is the smallest value
> which would force a 32-bit displacement to be used. [rsp+64], for example,
> would most likely use an 8-bit displacement (it is unlikely NASM would
> otherwise infer that it needs 32 bits for the offset...).

I don't get this. Why should anyone bother about the
size of the displacement? I assume the compiler uses
the smallest possible size (keeping code as small as
possible), regardless of any calling conventions. Why
should calling conventions explicitly define a size
for displacements?
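
As a side note on displacement sizes: in 64-bit mode a memory operand encodes its displacement as either a signed 8-bit or a signed 32-bit value, so +127 is the largest positive offset that still fits the short form. A minimal C sketch of that rule follows; the helper name disp_bytes is made up, and it assumes the encoding carries a displacement at all (a zero offset can sometimes be encoded without one).

```c
#include <assert.h>
#include <stdint.h>

/* Number of displacement bytes an assembler would pick for [base + offset]
 * in 64-bit mode, assuming a displacement is emitted at all:
 * 1 (disp8) if the offset fits in a signed byte, otherwise 4 (disp32). */
int disp_bytes(int32_t offset)
{
    return (offset >= INT8_MIN && offset <= INT8_MAX) ? 1 : 4;
}
```

So [rsp+64] takes the one-byte form, while [rsp+128] and [rsp+4096] both take the four-byte form.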

>> Even if we introduced exception handling and other
>> things, this line was invalid. Assuming, we run on
>> Win64, our shadow space is located at 1240...125F.
>>
>> Our function's stack, again:
>>
>> 1260 callee's stack (bottom)
>> 1240 shadow space
>> 1238 return address
>> 1230 stored R15
>> 1228 stored R14
>> 1220 stored R13
>> 1210 local area (subq 0x10,%rsp)
>>
>> 125F is the topmost byte belonging to us, anything
>> above belongs to our caller (or even its callers).
>>
>> The only thing we know is where the stack frame of
>> our caller begins (1260). We neither have any info
>> about our callers stack size nor about where which
>> data are stored in it. Consequently, we should not
>> touch anything beyond the last byte inside our own
>> local stack area. Accessing data outside our frame
>> is a very bad idea. Passing pointers to such areas
>> to an error handler (or any other external entity)
>> is even worse: It pretends ownership of 'property'
>> not belonging to us.
>>
>
> if nothing there is addressed, nothing matters here.

If nothing was addressed, this line was not required
at all. Are we executing instructions just for fun?

> lea r13, [rsp+4096]
> mov rax, [r13-4092]
>
> this is in fact often how the basepointer is used in Win64 code...
>
> typically the same offset is reversed as part of the epilogue:
> lea rsp, [r13-4096]
> pop ...
> ...
> ret

Does it help to understand code better than the more
obvious

pushq %r15            # save regs
....
movq %rsp,%r13        # copy reference point
....
....
movq -8(%r13),%rax    # address 1st local
....
....
movq %r13,%rsp        # restore reference point
popq %r15             # restore regs
....
ret

where MOV reg,reg is some bytes shorter than LEA reg?

>>> the offset is because the offset is required to be present, and also a
>>> multiple of 16 (for whatever unknown reason).
>>
>> The stack should be aligned to multiples of 16, so
>> we can use XMM instructions without adding tons of
>> (superfluous) alignments preceding them.
>>
>
> the offset itself though is required to be a multiple of 16, but it is not
> necessarily the case AFAIK that the stack is aligned by 16 at the time the
> lea is used.

If it wasn't, it had to be aligned in every function
manually (adding a lot of bloat). At least, Win64ABI
defines 'RSP must be aligned to a paragraph boundary
before another function is called', so RSP ends with
an 8 when it enters the callee's code. That's why we
have to subtract something ending with 8 to create a
'fixed RSP' stack frame.

BTW - the usual prologue looks like this

pushq %(whatever we use as basepointer)
movq %rsp,%(whatever we use as basepointer)

where the initial PUSH automatically aligns RSP to a
multiple of 16 (if it was paragraph-aligned prior to
the call). Hence, no extra code is required to align
RSP.
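
The alignment arithmetic above can be checked with a small C sketch. The function names are made up for illustration, and the sample address 0x1240 is borrowed from the stack diagrams earlier in the thread. It assumes the rule stated above: RSP is a multiple of 16 immediately before a CALL.

```c
#include <assert.h>
#include <stdint.h>

/* RSP as seen by the callee: CALL pushed the 8-byte return address. */
uint64_t rsp_after_call(uint64_t rsp_before_call)
{
    assert(rsp_before_call % 16 == 0);  /* ABI rule: aligned before CALL */
    return rsp_before_call - 8;
}

/* RSP after a single 8-byte PUSH (e.g. saving the base pointer). */
uint64_t rsp_after_push(uint64_t rsp)
{
    return rsp - 8;
}

/* Size to subtract for a 'fixed RSP' frame: round the locals up to a
 * multiple of 16, then add 8 so that RSP ends up 16-byte aligned again. */
uint64_t fixed_frame_size(uint64_t local_bytes)
{
    return ((local_bytes + 15) & ~(uint64_t)15) + 8;
}
```

With RSP = 0x1240 before the call, the callee sees 0x1238 (ending in 8); one PUSH gives 0x1230, and alternatively subtracting fixed_frame_size(0x10) = 0x18 gives 0x1220. Both results are multiples of 16.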

>>> in x86 cdecl, there was EBP, but code often bypassed it, leaving there
>>> being
>>> no way to perform a proper backtrace.
>>>
>>> in Win64, certain prologue and epilogue rules are required, and to do
>>> otherwise is a violation of the ABI.
>>>
>>> one can feel happy though that Win64 addresses matters of exception
>>> handling
>>> at the ABI and calling convention levels, rather than requiring debugging
>>> info (as is generally done on Linux, or IOW, one needs to deal with DWARF
>>> in
>>> order to safely have stack unwinding...).
>>
>> Agreed. The MS folks are one nose ahead with their
>> conventions at the moment. Nevertheless - it still
>> is too much bloat. 1k code and 2k resources result
>> in a 20 k application (compiled with mingw64). 17k
>> never used runtime stuff and exception handling is
>> a bad ratio between 'payload' and 'inert mass'. ;)
>>
>
> however, it can also be noted that MSVC has exception-handling in C code,
> and Windows itself throws internal exceptions occasionally.
>
> thus, broken exception handling risks crashing...

A question of the belief system one is bound to. For
my taste, bad code simply should crash and be debug-
ged until it runs error-free. This prevents program-
mers from publishing pre-beta software and leaving
the debugging part to the end-user.

> so, it was either all this, or probably the strategy used in Win32 SEH,
> which was notable in that it made exception handling fairly expensive, and
> also didn't readily allow generalized unwinding.

As well as it allowed everyone to add some home-made
exception handling via 'hooks'.

> this may matter some as it is common in non-C languages (such as C++, Java,
> and C#) to use exception handling as a sort of control-flow mechanism (IOW:
> for non-exceptional situations), and so there is some value in trying to
> have fast exceptions...

I'm curious what comes up with the next 'generation'
of so called 'high level' languages. Probably, there
is some more bloat introducing some more performance
brakes. Everything can be slowed down while it still
is moving... ;)


Greetings from Augsburg

Bernhard Schornak
0
Reply Bernhard 4/12/2010 5:18:54 PM

polemon wrote:

> But I am still quite unsure about using the stack.
>  From what I understand, I should omit PUSH-/POPing to/from the stack if
> possible, should I just use labels in .bss for variables instead?

Both methods (PUSH/POP and 'fixed RSP') have their
pros and cons, but working with a 'fixed RSP' is
(generally) faster on modern processors.

Unfortunately, you have to use the stack - as long
as your function is called by other functions with
changing parameters, there is no way to use memory
instead of the stack. Consider that your function
might be called by another thread while it is exe-
cuting. If your local variables were stored in me-
mory, an incoming call would overwrite the parame-
ters of the currently running 'instance' with those
passed by the new call. Using the stack creates a
private area for each call, *regardless* of how many
times your function is called.

http://en.wikipedia.org/wiki/Reentrant_(subroutine)
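
The effect can be reproduced in C without any threads: a function that calls itself runs into the same problem as soon as it keeps per-call state in one fixed memory location. This is only an illustrative sketch with made-up names; saved_n plays the role of a label in .bss.

```c
#include <assert.h>

static int saved_n;  /* one shared slot, like a variable in .bss */

/* Non-reentrant: every nested call overwrites saved_n. */
int sum_to_static(int n)
{
    assert(n >= 0);
    saved_n = n;
    if (n == 0)
        return 0;
    int rest = sum_to_static(n - 1);  /* clobbers saved_n */
    return saved_n + rest;            /* saved_n was reset to 0 by the innermost call */
}

/* Reentrant: each call keeps its own n in a register or stack slot. */
int sum_to_stack(int n)
{
    assert(n >= 0);
    if (n == 0)
        return 0;
    int rest = sum_to_stack(n - 1);
    return n + rest;
}
```

sum_to_stack(4) yields 1+2+3+4 = 10, while sum_to_static(4) yields 0, because by the time the outer calls read saved_n the innermost call has already overwritten it.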

> I was searching the web for information about the red-zone, but there hardly is
> any.

http://www.cs.cmu.edu/~fp/courses/15411-f09/lectures/10-calling.pdf
http://www.x86-64.org/documentation.html

> I'm programming for Linux, just to clear the confusion about Windows/Linux
> conventions.
> The prologue I had in the first (referenced) post was what I've found on the
> Internet as "a prologue". I wasn't aware of the fact that Windows and Linux
> handle parameters differently.
>
> Since the first six parameters for floats and integers are passed via registers,
> the 7th parameter should be on the stack, but at which offset of RSP?

You may pass up to 8 FP parameters in XMM0...XMM7.

On function entry, RSP holds the address where the
return address of your function is stored. This is
always a value ending with 8 (Win or Linux doesn't
matter). Parameters passed on the stack (on Linux,
everything that did not fit into the registers) are
stored directly above that address:

....
Parameter 2 @ [RSP + 0x10]
Parameter 1 @ [RSP + 0x08]
Return Adr. @ [RSP + 0x00]
....

and so on. Keep in mind: Every time you PUSH some-
thing onto your stack, the processor automatically
subtracts 8 from RSP, so the offset to those para-
meters grows by 8 with every PUSH. If you subtract
an appropriate number of bytes for local variables,
you have to add the same value to the offsets used
to access passed parameters.

When you PUSH two registers (or subtract 0x10 from
RSP), the offsets now are:

Parameter 2 @ [RSP + 0x20]
Parameter 1 @ [RSP + 0x18]
Return Adr. @ [RSP + 0x10]
Register #1 @ [RSP + 0x08]
Register #2 @ [RSP + 0x00]
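
The two tables follow one formula, sketched here in C. The helper name stack_param_offset is made up; k counts the stack-passed parameters starting at 1, and the second argument is the number of bytes pushed or subtracted from RSP since function entry.

```c
#include <assert.h>

/* Offset from the current RSP to the k-th stack-passed parameter.
 * At entry, [rsp+0] is the return address and parameter k sits at
 * [rsp + 8*k]; everything moves up by the bytes reserved below entry. */
long stack_param_offset(int k, long bytes_below_entry)
{
    assert(k >= 1);
    assert(bytes_below_entry >= 0 && bytes_below_entry % 8 == 0);
    return 8L * k + bytes_below_entry;
}
```

At entry this gives 0x08 and 0x10 for parameters 1 and 2; after two PUSHes (0x10 bytes) it gives 0x18 and 0x20, matching the listings above.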

> Does the offset depend on the type of the passed value?

No - each parameter occupies 8 bytes = 64 bits. Data
of smaller size are passed as 64 bit quantities.
Sometimes it may be required to sign-extend values
to fit into 64 bits (depends on what a function ex-
pects as input).

> Since this is for 64 bit, I'd assume, that all parameters
> passed on the stack should be 64 bits in length.

Applies to integer registers. I could not find any
document defining how FP values are to be passed on
the stack, though. Assuming 64 bit is safe, I think,
but it might be 128 bit (size of an XMM register),
as well. OTOH - the double FP format is 64 bit, so
64 bit is reasonable (and consistent).

> Since the red-zone is below the stack pointer, how should I push a 7th
> parameter on the stack for a function call? From what I understand, I'd
> be pushing that into the red-zone, since the pointer decreases with every
> push.

Forget about the 'red zone'. It is only meant for
leaf functions = functions not calling other func-
tions. If your function passes parameters to other
functions (that is: it does call other functions),
you have to create a stack frame, where RSP *must*
point to its lowest element. This can be done with
PUSH instructions - decreasing RSP automatically -
or by subtracting the required stackspace from RSP
and using RSP as reference to those reserved stack
elements.

Remember that a CALL instruction pushes the return
address onto the stack @ [current RSP - 8] and RSP
is used by the called function similar to what you
do with it in your function. Local variables which
are stored below the current RSP will be destroyed
when a called function is executed.
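
A toy model in C shows why. This is not real stack access, just an array standing in for stack memory with an index standing in for RSP; all names are made up for illustration. A leaf function never executes the CALL step, which is why data below RSP is safe there.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_QWORDS 16

/* Toy stack: rsp is an index into mem and moves towards lower indices,
 * just as the real stack grows towards lower addresses. */
struct toy_stack {
    uint64_t mem[STACK_QWORDS];
    int rsp;
};

/* What CALL does to the stack: move RSP down one slot and store the
 * return address there. */
void toy_call(struct toy_stack *s, uint64_t return_address)
{
    assert(s->rsp > 0 && s->rsp <= STACK_QWORDS);
    s->rsp -= 1;
    s->mem[s->rsp] = return_address;
}

/* Local kept one slot below RSP without reserving it: returns 1 if the
 * value is still intact after a call, 0 if it was overwritten. */
int local_below_rsp_survives(void)
{
    struct toy_stack s = { {0}, 8 };
    s.mem[s.rsp - 1] = 0x1111;   /* stored at [rsp-8], nothing reserved */
    toy_call(&s, 0xCAFE);        /* return address lands in the same slot */
    return s.mem[7] == 0x1111;
}

/* Same local, but the slot is reserved first (like sub rsp, 8). */
int local_in_frame_survives(void)
{
    struct toy_stack s = { {0}, 8 };
    s.rsp -= 1;                  /* reserve one slot */
    s.mem[s.rsp] = 0x1111;       /* stored at [rsp] */
    toy_call(&s, 0xCAFE);        /* return address lands one slot lower */
    return s.mem[7] == 0x1111;
}
```

The first helper reports that the local was destroyed, the second that it survived.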

> While playing around with passing values with registers, I was asking myself
> if I actually need much of a prologue and epilogue, since I can 'save' the
> registers I need in my function, in variables in .bss section.
> It seems as if I shouldn't use the stack wherever possible.

See top...

> Anyways, thanks a lot for your help so far!

You're welcome!

> I hope however, this thread doesn't get derailed into OS-wars of some sorts...

No way. ;)


Greetings from Augsburg

Bernhard Schornak
0
Reply Bernhard 4/12/2010 6:23:43 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message
news:hpvofh$9nv$1@news.eternal-september.org...
>
> Unfortunately, you have to use the stack - as long
> as your function is called by other functions with
> changing parameters, there is no way to use memory
> instead of the stack.
>

Sure there is.  Use memory as a stack...  x86 has instructions for just
that: PUSH/POP to/from memory.  Most people forget about the memory form of
these instructions.  But, you don't have to use the stack or memory as a
stack for parameters.  You just have to allocate new parameter space per
function call.


Rod Pemberton


0
Reply Rod 4/12/2010 7:45:22 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message 
news:hpvkm0$rl6$1@news.eternal-september.org...
> BGB / cr88192 wrote:
>
>> Bernhard Schornak wrote:
>>
>>> BGB / cr88192 wrote:
>>>
>>>> Bernhard Schornak wrote:
>
> [snip]
>
>>>> to the processor, functions and function calls don't actually exist, 
>>>> but
>>>> are
>>>> instead an abstraction built on top of the processor (much as "the
>>>> computer"
>>>> is an abstraction built on top of electronic components, which are an
>>>> abstraction over particular configurations of matter and energy, ...).
>>>
>>> Where the thing we call matter is energy, as well.
>>> Everything in our Universe either is energy, or it
>>> is Nothing - the Emptiness where all energy moves,
>>> interacts and expands to.
>>>
>>> Nevertheless, this does not address the problem we
>>> might encounter if we execute the posted code. The
>>> program's stack might look like this:
>>>
>>> -------------------------------
>>>
>>> 1308 function 0: return address
>>> 1300
>>> .... function 0: local area
>>> 1280
>>>
>>> 1278 function 1: return address
>>> 1270
>>> .... function 1: local area
>>> 1240
>>>
>>> 1238 function 2: return address
>>> 1230 stored R15
>>> 1228 stored R14
>>> 1220 stored R13
>>> 1210 variables
>>>
>>> -------------------------------
>>>
>>> RSP holds 1210 at this point. The next instruction
>>>
>>> leaq 0x80(%rsp),%r13
>>>
>>> loads 1290 into R13. Address 1290 holds local data
>>> belonging to function 0 (the function who called a
>>> function calling our function...).
>>>
>>> As you can see, this cannot be valid. It was valid
>>> code if R13 held a parameter passed by function 0,
>>> pointing to e. g. a string variable that had to be
>>> processed, and function 1 called helpers, here our
>>> function, to do the job. But R13 is loaded with an
>>> offset to RSP - not with the content of a register
>>> holding the address of a string stored in function
>>> 0's local stack area.
>>>
>>> As long as we don't allow functions to guess where
>>> something to process might be stored, the piece of
>>> code violates some basic principles of interaction
>>> between single parts building a greater whole.
>>>
>>
>> all this is only an issue if one assumes that R13 needs to point within 
>> the
>> local functions stack frame.
>
> If we did not need R13 for any reason, we could omit
> the entire line...
>

yes, that would lead to an alternate prologue/epilogue form, and require the 
use of a fixed-size stack frame (defined by a final "sub rsp, imm", and 
where the "add rsp, imm" would trigger the start of the epilogue).


>> AFAIK, this is not assumed by the Win64 ABI, but it is assumed that one
>> adds/subtracts an offset.
>
> I have not seen such a demand in any document. Could
> you post a link where such conventions are defined?
>

there is one on MSDN:
http://msdn.microsoft.com/en-us/library/tawsa7cb.aspx

some pages are around which go into a lot more detail, but I would have to 
find them.

here is some general info:
http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx


>> there is no problem if one doesn't access any space outside the current
>> stack frame, regardless of how far offset the base-pointer is.
>
> It does not make sense to provide 'speculative' data
> anywhere. What should it be good for? Actually, it's
> wasting precious clock cycles which could be used to
> perform real work.
>

but, it is required so that the automatic unwinder code doesn't get 
confused, which if this happens can cause one's app to crash, for little 
other reason than one not following some seemingly picky little rules 
related to the particular combinations and forms of opcodes in the prologue 
and epilogue.

the reason for this is that MS's code may actually disassemble parts of the 
running app, and may make use of some logic which actually does pattern 
recognition against one's machine code. if the sequences aren't quite right, 
even if they are functionally equivalent, the code may get confused and 
promptly crash the running process...


>> in the case of my compiler, I think the offset is fairly arbitrary, just
>> there so that the unwinder code/... is happy.
>
> Unfortunately, I haven't seen it, yet, so I am quite
> 'uninformed' at the moment. If you post a link, I'll
> have a look at it.
>

ok.

> [snip]
>
>>>>>> I read in a book (which works with 32bit coding conventions, where
>>>>>> everything
>>>>>> is passed on the stack), and there I was told, that I need to save 
>>>>>> the
>>>>>> ESP to
>>>>>> EBP and put the old value of EBP on the stack. This is used to get a
>>>>>> fixed
>>>>>> pointer to stack position where arguments are passed, and where space
>>>>>> should
>>>>>> be reserved for local variables.
>>>>>>
>>>>>> How does that work in AMD64/Linux?
>>>>>
>>>>> If your function saves registers or puts temporary
>>>>> variables onto the stack, you may need a reference
>>>>> point where those data are stored. This is done by
>>>>> copying rSP to rBP (or any other register!) before
>>>>> you start to push or move anything onto the stack.
>>>>>
>>>>> If you use push instructions, rSP automatically is
>>>>> updated by the processor with every push. With mov
>>>>> instructions, you have to subtract the size of all
>>>>> saved data from rSP before you start to write them
>>>>> to the stack. Data between the current rSP and the
>>>>> stack element the basepointer 'points to' are safe
>>>>> from being overwritten by called functions.
>>>>>
>>>>> The reference (aka basepointer) is used to address
>>>>> elements in your local stack area. To access these
>>>>> elements, you have to use *negative* offsets - all
>>>>> elements are *below* that reference point! the 1st
>>>>> element is at [REF - 8], the second at [REF - 16],
>>>>> and so on (64 bit = 8 byte).
>>>>
>>>> yep, but things differ here a little in the Win64 case...
>>>
>>> Okay, but polemon is writing code for Linux. It is
>>> superfluous to explain how this is done on another
>>> platform. He surely isn't interested in that if he
>>> wants to create a working app for Linux within the
>>> current decade... ;)
>>>
>>> By the way, there is more to it. Conventional code
>>> requires the base pointer because it is the *only*
>>> reference to the return address of a running func-
>>> tion. In most environments, rSP isn't properly up-
>>> dated between function calls, so it often holds an
>>> arbitrary value whenever the function logic issues
>>> a return to the caller. Without a basepointer, the
>>> function did not know where its return address was
>>> stored...
>>>
>>> An alternative is to work with a fixed rSP, but it
>>> excludes using PUSH, POP and any other instruction
>>> where rSP is updated automatically.
>>>
>>
>> actually, it is not that much of a problem, since all one really has to 
>> do
>> is keep track of where rSP is at compile time. push/pop/... are safe so 
>> long
>> as the compiler knows about it, and can update its internal state
>> accordingly.
>
> Agreed. On the other hand, keeping track of the con-
> tent of rSP adds some avoidable overhead to the com-
> piler's code as well as to each executable.
>

fair enough. my compiler keeps track of rSP though, and may choose push/pop 
or raw move opcodes depending on the particular situation (relative location 
of the real and virtual rSP, ...).

so, push/pop "allow" one to both move a value to/from a register (or even 
memory/memory) and adjust rSP at the same time, so they are not always a bad 
thing (although always using them would not be great for performance either, 
....).


>> the great limit though is that, yes, it does require the stack frame to 
>> be
>> fixed-size at compile time, so the split between rSP relative and rBP/...
>> relative indexing is that it allows for an indeterminate amount of stack
>> space to exist between the local stack spot, and the fixed-position 
>> locals
>> and arguments.
>
> Applies to a fixed rSP, as well. There's nothing you
> could not do with the one register solution - except
> that ENTER, LEAVE, POP and PUSH *must not* be used.
>
> A properly designed 'fixed rSP environment' benefits
> greatly from recent processors' accelerating mecha-
> nisms where the equivalent conventional code suffers
> from stalls introduced with the required rSP updates
> which are part of the PUSH and POP instruction. PUSH
> executes at a rate of one PUSH every two clocks, POP
> at a rate of one POP every three clocks. Replacement
> of PUSH / POP sequences with simple MOV instructions
> reduces execution time to one MOV per cycle. In case
> of ordered writes to continuous stack locations, the
> write combining mechanism is triggered - allowing to
> store 8 qwords (64 byte) in one gulp (three clocks).
> Those 64 byte must reside in one cache line, though,
> but it is no problem to align stack frames to entire
> cache lines. If an OS is designed for high speed, it
> should provide such mechanisms with appropriate ABI-
> definitions.
>
> Given clocks are for AMD Athlon. Even if there are 3
> execution units, memory reads and writes are limited
> to one access per clock cycle - this is a limitation
> of the memory interface, not the processor itself.
>

however, although a fixed-frame is generally better (and my compiler often 
does so), it is not readily compatible with common implementations of 
certain C99 and C++ features (such as variable-length arrays, ...).

for example, one would have to either:
not implement or support variable-length arrays;
internally use malloc/free calls or similar to reserve and free the space 
for said arrays;
....
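
as a rough sketch of the two options above (C99; the function names are hypothetical and only for illustration): the first function uses a variable-length array, so its frame size is known only at run time, while the second keeps the frame fixed by taking the buffer from malloc/free.

```c
#include <stddef.h>
#include <stdlib.h>

/* Option 1: variable-length array. The frame size depends on n, so rSP
   has to be adjusted at run time. */
long sum_squares_vla(size_t n)
{
    long total = 0;
    if (n == 0)
        return 0;              /* a zero-length VLA is undefined behaviour */
    long tmp[n];               /* run-time sized, lives on the stack */
    for (size_t i = 0; i < n; i++)
        tmp[i] = (long)i * (long)i;
    for (size_t i = 0; i < n; i++)
        total += tmp[i];
    return total;
}

/* Option 2: fixed-size frame, buffer on the heap.
   Returns -1 if the allocation fails. */
long sum_squares_heap(size_t n)
{
    long total = 0;
    if (n == 0)
        return 0;
    long *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        tmp[i] = (long)i * (long)i;
    for (size_t i = 0; i < n; i++)
        total += tmp[i];
    free(tmp);
    return total;
}
```

both functions compute the same result; only the place where the temporary array lives differs.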


>> it also helps notably with hand-written code, or with naive compilers, 
>> since
>> it doesn't require exactly keeping track of where everything is located.
>
> A questionable argument... ;)
>
> Using a fixed rSP allows programmers to use positive
> offsets rather than poking around with negative off-
> sets to the basepointer (which is a common source of
> many 'not that obvious' and 'hard to find' errors).
>

a compiler doesn't really need to care though whether its offsets are 
positive or negative.
it only really matters much for hand-written code or for naive compilers.


> [snip]
>
>>>> in particular, there are a number of rules for forming proper prologues
>>>> and
>>>> epilogues, mostly since the code for the prologues and epilogues needs
>>>> also
>>>> to be able to be processed as data, which means it has to be formed
>>>> according to certain rules (this is to allow such things as exception
>>>> handling and stack unwinding, so the prologue and epilogue, as well as
>>>> some
>>>> aspects of stack and register usage, are carefully controlled, such 
>>>> that
>>>> the
>>>> stack can be automatically unwound and registers automatically 
>>>> restored).
>>>>
>>>> lea r13, [rsp+128]
>>>>
>>>> is a single example of such rules.
>>>> this operation is optional, but if present, it is defined.
>>>
>>> As shown above, it is no valid code in our special
>>> case. Even if this was a Windows app, it addressed
>>> stack areas not belonging to the running function.
>>>
>>> Stack above RSP on function entry is taboo for any
>>> called function. The only exception from this rule
>>> is 64 bit Win, where the so called shadow space at
>>> 08...27[RSP] is defined as a temporary storage for
>>> the called function. But: 80[RSP] is far above the
>>> shadow space of the running function, isn't it?
>>>
>>
>> it is not as simple as this, especially when one gets to matters of 
>> passing
>> and returning structs (this is where the calling conventions really
>> differ...).
>>
>> but, yeah, normally one would not write into space outside the current
>> function, but little is to say that the base-pointer can't point there, 
>> or
>> for that matter, damn near anywhere else within the confines of the ABI.
>
> I disagree. Properly written code never should point
> to any location outside its private frame if it does
> not reference an address passed by the caller (where
> processing those data is expected).
>
> It simply violates basic programming principles. Not
> more, not less. In other words: It's very bad coding
> style and should not be used in real code.
>

however, compilers are free to, and often do, generate some fairly nasty 
ASM-code fragments (be it for performance, or fitting to various ABI 
requirements, ...).

it is common practice with ASM to disregard what an instruction is 
(conceptually) doing, and instead focus primarily on its external 
characteristics (what it does to the bits, whether or not it zeroes the 
high-order bits of a register, ...).


>> in fact, the most notable thing about 128 is that it is the smallest value
>> which would force a 32-bit displacement to be used; [rsp+64], for example,
>> would most likely use an 8-bit displacement (it is unlikely NASM would
>> otherwise infer that it needs 32 bits for the offset...).
>
> I don't get this. Why should anyone bother about the
> size of the displacement? I assume the compiler uses
> the smallest possible size (keeping code as small as
> possible), regardless of any calling conventions. Why
> should calling conventions explicitly define a size
> for displacements?
>

because the stack-frame unwinder may pattern-match against the machine 
code...

actually, some of this MS-style code-processing trickery is also used by 
some of my code, such as using multi-byte NOP instructions as a 
means of encoding some metadata annotations into the machine-code, ...
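
a minimal sketch of what such an annotation could look like, in C. this is only an assumption for illustration, not the actual scheme used by any compiler: it takes the 7-byte NOP form (0F 1F 80 disp32, i.e. "nop dword [rax+disp32]") and stores a 32-bit tag in the displacement field, which the processor ignores. the names nop7_encode and nop7_decode are made up.

```c
#include <stdint.h>

enum { NOP7_LEN = 7 };

/* Write a 7-byte NOP whose disp32 field carries `tag` (little-endian). */
void nop7_encode(uint8_t out[NOP7_LEN], uint32_t tag)
{
    out[0] = 0x0F;                     /* two-byte opcode 0F 1F = NOP r/m */
    out[1] = 0x1F;
    out[2] = 0x80;                     /* ModRM: mod=10, reg=0, rm=rax */
    out[3] = (uint8_t)(tag);
    out[4] = (uint8_t)(tag >> 8);
    out[5] = (uint8_t)(tag >> 16);
    out[6] = (uint8_t)(tag >> 24);
}

/* Returns 1 and fills *tag if `in` holds the NOP form above, else 0. */
int nop7_decode(const uint8_t in[NOP7_LEN], uint32_t *tag)
{
    if (in[0] != 0x0F || in[1] != 0x1F || in[2] != 0x80)
        return 0;
    *tag = (uint32_t)in[3]
         | ((uint32_t)in[4] << 8)
         | ((uint32_t)in[5] << 16)
         | ((uint32_t)in[6] << 24);
    return 1;
}
```

a tool scanning the machine code would call nop7_decode on candidate positions and treat a match as an annotation; executing the bytes does nothing.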


>>
>> if nothing there is addressed, nothing matters here.
>
> If nothing was addressed, this line was not required
> at all. Are we executing instructions just for fun?
>

no, but we do use them as part of the ABI requirements.

one can omit the 'lea' line, but then the ABI requirements change and one 
has to follow some other rules.


>> lea r13, [rsp+4096]
>> mov rax, [r13-4092]
>>
>> this is in fact often how the basepointer is used in Win64 code...
>>
>> typically the same offset is reversed as part of the epilogue:
>> lea rsp, [r13-4096]
>> pop ...
>> ...
>> ret
>
> Does it help to understand code better than the more
> obvious
>
> pushq %r15            # save regs
> ...
> movq %rsp,%r13        # copy reference point
> ...
> ...
> movq -8(%r13),%rax    # address 1st local
> ...
> ...
> movq %r13,%rsp        # restore reference point
> popq %r15             # restore regs
> ...
> ret
>
> where MOV reg,reg is some byte shorter than LEA reg?
>

this, however, is not an accepted prologue/epilogue form for the Win64 ABI.

actually, I would really have liked to be able to just write:
push rbp
mov rbp, rsp
....
mov rsp, rbp
pop rbp
ret

but, alas, this is not valid as per the ABI either...


>>>> the offset is because the offset is required to be present, and also a
>>>> multiple of 16 (for whatever unknown reason).
>>>
>>> The stack should be aligned to multiples of 16, so
>>> we can use XMM instructions without adding tons of
>>> (superfluous) alignments preceding them.
>>>
>>
>> the offset itself though is required to be a multiple of 16, but it is 
>> not
>> necessarily the case AFAIK that the stack is aligned by 16 at the time 
>> the
>> lea is used.
>
> If it wasn't, it had to be aligned in every function
> manually (adding a lot of bloat). At least, Win64ABI
> defines 'RSP must be aligned to a paragraph boundary
> before another function is called', so RSP ends with
> an 8 when it enters the callee's code. That's why we
> have to subtract something ending with 8 to create a
> 'fixed RSP' stack frame.
>
> BTW - the usual prologue looks like this
>
> pushq %(whatever we use as basepointer)
> movq %rsp,%(whatever we use as basepointer)
>
> where the initial PUSH automatically aligns RSP to a
> multiple of 16 (if it was paragraph-aligned prior to
> the call). Hence, no extra code is required to align
> RSP.
>

sadly, one can't do it this way.


>>>> in x86 cdecl, there was EBP, but code often bypassed it, leaving there
>>>> being
>>>> no way to perform a proper backtrace.
>>>>
>>>> in Win64, certain prologue and epilogue rules are required, and to do
>>>> otherwise is a violation of the ABI.
>>>>
>>>> one can feel happy though that Win64 addresses matters of exception
>>>> handling
>>>> at the ABI and calling convention levels, rather than requiring 
>>>> debugging
>>>> info (as is generally done on Linux, or IOW, one needs to deal with 
>>>> DWARF
>>>> in
>>>> order to safely have stack unwinding...).
>>>
>>> Agreed. The MS folks are one nose ahead with their
>>> conventions at the moment. Nevertheless - it still
>>> is too much bloat. 1k code and 2k resources result
>>> in a 20 k application (compiled with mingw64). 17k
>>> never used runtime stuff and exception handling is
>>> a bad ratio between 'payload' and 'inert mass'. ;)
>>>
>>
>> however, it can also be noted that MSVC has exception-handling in C code,
>> and Windows itself throws internal exceptions occasionally.
>>
>> thus, broken exception handling risks crashing...
>
> A question of the belief system one is bound to. For
> my taste, bad code simply should crash and be debug-
> ged until it runs error-free. This prevents program-
> mers from publishing pre-beta software and leave the
> debugging part to the end-user.
>

catching the exceptions though may be a good idea, since the usual issue is 
that a lot of these exceptions are for non-exceptional situations...


>> so, it was either all this, or probably the strategy used in Win32 SEH,
>> which was notable in that it made exception handling fairly expensive, 
>> and
>> also didn't readily allow generalized unwinding.
>
> As well as it allowed everyone to add some home-made
> exception handling via 'hooks'.
>

yep, this has some merit...


>> this may matter some as it is common in non-C languages (such as C++, 
>> Java,
>> and C#) to use exception handling as a sort of control-flow mechanism 
>> (IOW:
>> for non-exceptional situations), and so there is some value in trying to
>> have fast exceptions...
>
> I'm curious what comes up with the next 'generation'
> of so called 'high level' languages. Probably, there
> is some more bloat introducing some more performance
> brakes. Everything can be slowed down while it still
> is moving... ;)
>

mandatory RTTI for all objects, maybe?...
everything is name mangled?...


this is not too far off from coding in C on .NET via C++/CLI...



Reply BGB 4/13/2010 1:21:31 AM

On Mon, 12 Apr 2010 18:21:31 -0700, "BGB /
cr88192" <cr88192@nospicedham.hotmail.com> wrote:

<snip>

>the reason for this is that MS's code may actually disassemble parts of the 
>running app, and may make use of some logic which actually does pattern 
>recognition against one's machine code. if the sequences aren't quite right, 
>even if they are functionally equivalent, the code may get confused and 
>promptly crash the running process...

This seems rather amazing, and totally out of
character with what I'd expect from Microsoft. 
(But then they've done, ahh, "unusual" things
before!) 

Can you point me to a reference?

Best regards,


Bob Masta
 
              DAQARTA  v5.10
   Data AcQuisition And Real-Time Analysis
              www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
    Frequency Counter, FREE Signal Generator
           Pitch Track, Pitch-to-MIDI 
         DaqMusic - FREE MUSIC, Forever!
             (Some assembly required)
     Science (and fun!) with your sound card!
Reply N0Spam 4/13/2010 11:49:44 AM

BGB / cr88192 wrote:

> "Bernhard Schornak" wrote:
>
>> BGB / cr88192 wrote:
>>
>>> Bernhard Schornak wrote:
>>>
>>>> BGB / cr88192 wrote:
>>>>
>>>>> Bernhard Schornak wrote:
>>
>>>>> to the processor, functions and function calls don't actually exist,
>>>>> but
>>>>> are
>>>>> instead an abstraction built on top of the processor (much as "the
>>>>> computer"
>>>>> is an abstraction built on top of electronic components, which are an
>>>>> abstraction over particular configurations of matter and energy, ...).
>>>>
>>>> Where the thing we call matter is energy, as well.
>>>> Everything in our Universe either is energy, or it
>>>> is Nothing - the Emptiness where all energy moves,
>>>> interacts and expands to.
>>>>
>>>> Nevertheless, this does not address the problem we
>>>> might encounter if we execute the posted code. The
>>>> program's stack might look like this:
>>>>
>>>> -------------------------------
>>>>
>>>> 1308 function 0: return address
>>>> 1300
>>>> .... function 0: local area
>>>> 1280
>>>>
>>>> 1278 function 1: return address
>>>> 1270
>>>> .... function 1: local area
>>>> 1240
>>>>
>>>> 1238 function 2: return address
>>>> 1230 stored R15
>>>> 1228 stored R14
>>>> 1220 stored R13
>>>> 1210 variables
>>>>
>>>> -------------------------------
>>>>
>>>> RSP holds 1210 at this point. The next instruction
>>>>
>>>> leaq 0x80(%rsp),%r13
>>>>
>>>> loads 1290 into R13. Address 1290 holds local data
>>>> belonging to function 0 (the function who called a
>>>> function calling our function...).
>>>>
>>>> As you can see, this cannot be valid. It was valid
>>>> code if R13 held a parameter passed by function 0,
>>>> pointing to e. g. a string variable that had to be
>>>> processed, and function 1 called helpers, here our
>>>> function, to do the job. But R13 is loaded with an
>>>> offset to RSP - not with the content of a register
>>>> holding the address of a string stored in function
>>>> 0's local stack area.
>>>>
>>>> As long as we don't allow functions to guess where
>>>> something to process might be stored, the piece of
>>>> code violates some basic principles of interaction
>>>> between single parts building a greater whole.
>>>>
>>>
>>> all this is only an issue if one assumes that R13 needs to point within
>>> the
>>> local functions stack frame.
>>
>> If we did not need R13 for any reason, we could omit
>> the entire line...
>
> yes, that would lead to an alternate prologue/epilogue form, and require the
> use of a fixed-size stack frame (defined by a final "sub rsp, imm", and
> where the "add rsp, imm" would trigger the start of the epilogue).

Not what I meant. If R13 is loaded with a proper
address, this is valid (and probably necessary!)
code.

>>> AFAIK, this is not assumed by the Win64 ABI, but it is assumed that one
>>> adds/subtracts an offset.
>>
>> I have not seen such a demand in any document. Could
>> you post a link where such conventions are defined?
>>
>
> there is one on MSDN:
> http://msdn.microsoft.com/en-us/library/tawsa7cb.aspx
>
> some pages are around which go into a lot more detail, but I would have to
> find them.
>
> here is some general info:
> http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx

It says:

"This prolog stores the argument register RCX in
its home location, saves nonvolatile registers
R13-R15, allocates the fixed part of the stack
frame, and establishes a frame pointer that
points 128 bytes into the fixed allocation area.
Using an offset allows more of the fixed
allocation area to be addressed with one-byte
offsets."

As far as I understand it, "a frame pointer that
points 128 bytes into the fixed allocation area"
definitely states that R13 doesn't point to any-
where *outside* the private area of the running
function.

BTW: This surely is the document you referred to
while speaking of displacements. As I said, it is
done to keep displacements as small as possible.
A one-byte displacement addresses memory between
-128 and +127 bytes from the base - hence the 128
bytes are added, allowing access to the register
home area as well as the bottom-most local variable.
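
A small C sketch of that arithmetic (the helper names are made up for illustration): with the frame pointer biased by 128, every slot from RSP+0 up to RSP+255 is reachable with a one-byte displacement, while an unbiased pointer only reaches RSP+0 to RSP+127 that way.

```c
#include <stdbool.h>

/* True if `disp` fits the signed one-byte displacement form. */
bool fits_disp8(long disp)
{
    return disp >= -128 && disp <= 127;
}

/* Displacement needed to reach RSP + rsp_offset from a frame pointer
   that was set up as RSP + bias. */
long fp_disp(long rsp_offset, long bias)
{
    return rsp_offset - bias;
}
```

For example, fp_disp(0, 128) is -128 and fp_disp(255, 128) is +127, both of which still fit into one byte.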

>>> there is no problem if one doesn't access any space outside the current
>>> stack frame, regardless of how far offset the base-pointer is.
>>
>> It does not make sense to provide 'speculative' data
>> anywhere. What should it be good for? Actually, it's
>> wasting precious clock cycles which could be used to
>> perform real work.
>
> but, it is required so that the automatic unwinder code doesn't get
> confused, which if this happens can cause one's app to crash, for little
> other reason than one not following some seemingly picky little rules
> related to the particular combinations and forms of opcodes in the prologue
> and epilogue.
>
> the reason for this is that MS's code may actually disassemble parts of the
> running app, and may make use of some logic which actually does pattern
> recognition against one's machine code. if the sequences aren't quite right,
> even if they are functionally equivalent, the code may get confused and
> promptly crash the running process...

I never forbade anyone to use this kind of code -
I just deny functions access to stack space that
does not belong to them. If you read Microsoft's
document with care: it says the same in other
words.

>>> in the case of my compiler, I think the offset is fairly arbitrary, just
>>> there so that the unwinder code/... is happy.
>>
>> Unfortunately, I haven't seen it, yet, so I am quite
>> 'uninformed' at the moment. If you post a link, I'll
>> have a look at it.
>
> ok.

I mismatched "my compiler" with "the compiler I
wrote"... ;)

>> [snip]
>>
>>>>>>> I read in a book (which works with 32bit coding conventions, where
>>>>>>> everything
>>>>>>> is passed on the stack), and there I was told, that I need to save
>>>>>>> the
>>>>>>> ESP to
>>>>>>> EBP and put the old value of EBP on the stack. This is used to get a
>>>>>>> fixed
>>>>>>> pointer to stack position where arguments are passed, and where space
>>>>>>> should
>>>>>>> be reserved for local variables.
>>>>>>>
>>>>>>> How does that work in AMD64/Linux?
>>>>>>
>>>>>> If your function saves registers or puts temporary
>>>>>> variables onto the stack, you may need a reference
>>>>>> point where those data are stored. This is done by
>>>>>> copying rSP to rBP (or any other register!) before
>>>>>> you start to push or move anything onto the stack.
>>>>>>
>>>>>> If you use push instructions, rSP automatically is
>>>>>> updated by the processor with every push. With mov
>>>>>> instructions, you have to subtract the size of all
>>>>>> saved data from rSP before you start to write them
>>>>>> to the stack. Data between the current rSP and the
>>>>>> stack element the basepointer 'points to' are safe
>>>>>> from being overwritten by called functions.
>>>>>>
>>>>>> The reference (aka basepointer) is used to address
>>>>>> elements in your local stack area. To access these
>>>>>> elements, you have to use *negative* offsets - all
>>>>>> elements are *below* that reference point! the 1st
>>>>>> element is at [REF - 8], the second at [REF - 16],
>>>>>> and so on (64 bit = 8 byte).
>>>>>
>>>>> yep, but things differ here a little in the Win64 case...
>>>>
>>>> Okay, but polemon is writing code for Linux. It is
>>>> superfluous to explain how this is done on another
>>>> platform. He surely isn't interested in that if he
>>>> wants to create a working app for Linux within the
>>>> current decade... ;)
>>>>
>>>> By the way, there is more to it. Conventional code
>>>> requires the base pointer because it is the *only*
>>>> reference to the return address of a running func-
>>>> tion. In most environments, rSP isn't properly up-
>>>> dated between function calls, so it often holds an
>>>> arbitrary value whenever the function logic issues
>>>> a return to the caller. Without a basepointer, the
>>>> function did not know where its return address was
>>>> stored...
>>>>
>>>> An alternative is to work with a fixed rSP, but it
>>>> excludes using PUSH, POP and any other instruction
>>>> where rSP is updated automatically.
>>>
>>> actually, it is not that much of a problem, since all one really has to
>>> do
>>> is keep track of where rSP is at compile time. push/pop/... are safe so
>>> long
>>> as the compiler knows about it, and can update its internal state
>>> accordingly.
>>
>> Agreed. On the other hand, keeping track of the con-
>> tent of rSP adds some avoidable overhead to the com-
>> piler's code as well as to each executable.
>
> fair enough. my compiler keeps track of rSP though, and may choose push/pop
> or raw move opcodes depending on the particular situation (relative location
> of the real and virtual rSP, ...).

Taking care of rSP is crucial in any case - for
compilers as well as for assembler programmers.

However, the fixed rSP causes fewer headaches and
less overhead (and the MS authorities gave their
blessing...). ;)

> so, push/pop "allow" one to both move a value to/from a register (or even
> memory/memory) and adjust rSP at the same time, so they are not always a bad
> thing (although always using them would not be great for performance either,
> ...).

push mem (3 clocks, DP double)

is the only thing you cannot do with MOV, right,
but:

With PUSH you cannot do this

....
movq par8_1,0x38(%rsp)
movq par7_s,0x30(%rsp)
movq par6_1,0x28(%rsp)
movq par5_s,0x20(%rsp)
call _whatever
movq par8_2,0x38(%rsp)
call _whatever
movq par8_3,0x38(%rsp)
call _whatever
....

With PUSH, you would have to push all four parameters three times.
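
A rough C model of the store counts, just to illustrate the point. The struct and all names are assumptions for this sketch: the array stands in for the slots 0x00(%rsp) to 0x38(%rsp), and whatever() is a dummy callee that sums its 5th to 8th parameters.

```c
#include <stdint.h>

/* Models the outgoing argument area 0x00(%rsp) .. 0x38(%rsp). */
typedef struct {
    uint64_t slot[8];
    unsigned stores;           /* number of memory writes performed */
} arg_area;

static void store_arg(arg_area *a, unsigned byte_offset, uint64_t value)
{
    a->slot[byte_offset / 8] = value;
    a->stores++;
}

/* Hypothetical callee: just sums its 5th to 8th parameters. */
static uint64_t whatever(const arg_area *a)
{
    return a->slot[4] + a->slot[5] + a->slot[6] + a->slot[7];
}

/* Three calls that differ only in the 8th parameter.
   Returns the number of stores needed with a fixed rSP. */
unsigned three_calls_fixed_rsp(uint64_t result[3])
{
    arg_area a = { {0}, 0 };
    store_arg(&a, 0x38, 10);   /* par8_1 */
    store_arg(&a, 0x30, 7);    /* par7_s */
    store_arg(&a, 0x28, 6);    /* par6_1 */
    store_arg(&a, 0x20, 5);    /* par5_s */
    result[0] = whatever(&a);
    store_arg(&a, 0x38, 20);   /* par8_2: only this slot changes */
    result[1] = whatever(&a);
    store_arg(&a, 0x38, 30);   /* par8_3 */
    result[2] = whatever(&a);
    return a.stores;
}

/* With PUSH, every call needs all four stack parameters pushed again. */
unsigned three_calls_push_count(void)
{
    return 3u * 4u;
}
```

In this model the fixed rSP version needs six stores for the three calls, the PUSH version twelve.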

>>> the great limit though is that, yes, it does require the stack frame to
>>> be
>>> fixed-size at compile time, so the split between rSP relative and rBP/...
>>> relative indexing is that it allows for an indeterminate amount of stack
>>> space to exist between the local stack spot, and the fixed-position
>>> locals
>>> and arguments.
>>
>> Applies to a fixed rSP, as well. There's nothing you
>> could not do with the one register solution - except
>> that ENTER, LEAVE, POP and PUSH *must not* be used.
>>
>> A properly designed 'fixed rSP environment' benefits
>> greatly from recent processors' accelerating mechanisms
>> where the equivalent conventional code suffers
>> from stalls introduced with the required rSP updates
>> which are part of the PUSH and POP instruction. PUSH
>> executes at a rate of one PUSH every two clocks, POP
>> at a rate of one POP every three clocks. Replacement
>> of PUSH / POP sequences with simple MOV instructions
>> reduces execution time to one MOV per cycle. In case
>> of ordered writes to continuous stack locations, the
>> write combining mechanism is triggered - allowing to
>> store 8 qwords (64 byte) in one gulp (three clocks).
>> Those 64 byte must reside in one cache line, though,
>> but it is no problem to align stack frames to entire
>> cache lines. If an OS is designed for high speed, it
>> should provide such mechanisms with appropriate ABI-
>> definitions.
>>
>> Given clocks are for AMD Athlon. Even if there are 3
>> execution units, memory reads and writes are limited
>> to one access per clock cycle - this is a limitation
>> of the memory interface, not the processor itself.
>>
>
> however, although a fixed frame is generally better (and my compiler often
> uses one), it is not readily compatible with common implementations of
> certain C99 and C++ features (such as "variadic", i.e. variable-length,
> arrays, ...).

That is: They are outdated. As an assembler-only
programmer, I have the freedom to utilise all of
those goodies modern processors provide (as long
as AS knows the required opcodes).

> for example, one would have to either:
> not implement or support "variadic" (variable-length) arrays;
> internally use malloc/free calls or similar to reserve and free the space
> for said arrays;
> ...

After googling for a while, I got a glimpse of a
clue what variadic arrays are. But what the heck
are they good for? It probably is easier to pack
those data into an array with sets of structures
and process their members individually.

In Bavaria, we say "Wolpertinger"

http://en.wikipedia.org/wiki/Wolpertinger

in the USA it's called "Jackalope"... ;)

>>> it also helps notably with hand-written code, or with naive compilers,
>>> since
>>> it doesn't require exactly keeping track of where everything is located.
>>
>> A questionable argument... ;)
>>
>> Using a fixed rSP allows programmers to use positive
>> offsets rather than poking around with negative off-
>> sets to the basepointer (which is a common source of
>> many 'not that obvious' and 'hard to find' errors).
>
> a compiler doesn't really need to care though whether its offsets are
> positive or negative.
> it only really matters much for hand-written code or for naive compilers.

Yes. What you forget is the fact that a compiler
does some additional work if it has to take care
of two entities - a stack and a base pointer (if
we use one) - simultaneously. (As a human does.)
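
The two addressing schemes can be compared with a few lines of C. This is only a sketch with made-up helper names; it assumes 8-byte slots and a reference point copied from rSP before the frame was allocated, as described further up.

```c
/* Offset of the n-th 8-byte local (n = 1, 2, ...) relative to a base
   pointer copied from rSP before the locals were allocated. */
long bp_offset(unsigned n)
{
    return -8L * (long)n;
}

/* Offset of the same slot relative to a fixed rSP, after frame_bytes
   were subtracted from rSP. */
long sp_offset(unsigned n, unsigned long frame_bytes)
{
    return (long)frame_bytes - 8L * (long)n;
}
```

Both offsets name the same memory location: with rSP = REF - frame_bytes, REF + bp_offset(n) equals rSP + sp_offset(n, frame_bytes). The difference is just that one set of offsets is negative and the other is not.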

>>>>> in particular, there are a number of rules for forming proper prologues
>>>>> and
>>>>> epilogues, mostly since the code for the prologues and epilogues needs
>>>>> also
>>>>> to be able to be processed as data, which means it has to be formed
>>>>> according to certain rules (this is to allow such things as exception
>>>>> handling and stack unwinding, so the prologue and epilogue, as well as
>>>>> some
>>>>> aspects of stack and register usage, are carefully controlled, such
>>>>> that
>>>>> the
>>>>> stack can be automatically unwound and registers automatically
>>>>> restored).
>>>>>
>>>>> lea r13, [rsp+128]
>>>>>
>>>>> is a single example of such rules.
>>>>> this operation is optional, but if present, it is defined.
>>>>
>>>> As shown above, it is no valid code in our special
>>>> case. Even if this was a Windows app, it addressed
>>>> stack areas not belonging to the running function.
>>>>
>>>> Stack above RSP on function entry is taboo for any
>>>> called function. The only exception from this rule
>>>> is 64 bit Win, where the so called shadow space at
>>>> 08...27[RSP] is defined as a temporary storage for
>>>> the called function. But: 80[RSP] is far above the
>>>> shadow space of the running function, isn't it?
>>>
>>> it is not as simple as this, especially when one gets to matters of
>>> passing
>>> and returning structs (this is where the calling conventions really
>>> differ...).
>>>
>>> but, yeah, normally one would not write into space outside the current
>>> function, but little is to say that the base-pointer can't point there,
>>> or
>>> for that matter, damn near anywhere else within the confines of the ABI.
>>
>> I disagree. Properly written code never should point
>> to any location outside its private frame if it does
>> not reference an address passed by the caller (where
>> processing those data is expected).
>>
>> It simply violates basic programming principles. Not
>> more, not less. In other words: It's very bad coding
>> style and should not be used in real code.
>
> however, compilers are free to, and often do, generate some fairly nasty
> ASM-code fragments (be it for performance, or fitting to various ABI
> requirements, ...).

A compiler, in particular, follows very strict
rules. It might emit code looking quite strange
to our eyes, but it surely never violates basic
rules - e.g. it never will load addresses which
are outside a valid range.

> it is common practice with ASM to disregard what an instruction is
> (conceptually) doing, and instead focus primarily on its external
> characteristics (what it does to the bits, whether or not it zeroes the
> high-order bits of a register, ...).

Of course. Its 'abstraction level' is quite low,
just on top of pure byte code (the special field
of Wolfgang). It requires some knowledge about a
machine to use it efficiently. Without that, one
better starts coding in C or 'higher' languages.

>>> in fact, the most notable thing about 128 is that it is the smallest value
>>> which would force a 32-bit displacement to be used; [rsp+64], for example,
>>> would most likely use an 8-bit displacement (it is unlikely NASM would
>>> otherwise infer that it needs 32 bits for the offset...).
>>
>> I don't get this. Why should anyone bother about the
>> size of the displacement? I assume the compiler uses
>> the smallest possible size (keeping code as small as
>> possible), regardless of any calling conventions. Why
>> should calling conventions explicitly define a size
>> for displacements?
>
> because the stack-frame unwinder may pattern-match against the machine
> code...

When I started designing a new OS (half a year
ago), I didn't think about such things, but
meanwhile I consider this something of concern.
As far as I have put it on paper, it's not a
big deal to detect a RET within a piece of code.
With a well-defined ABI, it is quite easy to get
the stack size - the 1st or 2nd line before RET,
depending on where RAX/XMM0 is loaded with a
return value. This stack size determines where the
return address (and saved registers) are stored.
Easy as pie...

> actually, some of this MS-style code-processing trickery is used some by
> some of my code as well, such as using multi-byte NOP instructions as a
> means of encoding some metadata annotations into the machine-code, ...

As long as you don't forget where which info can
be found, it's a sound thing.

>>> if nothing there is addressed, nothing matters here.
>>
>> If nothing was addressed, this line was not required
>> at all. Are we executing instructions just for fun?
>
> no, but we do use them as part of the ABI requirements.
>
> one can omit the 'lea' line, but then the ABI requirements change and one
> has to follow some other rules.

It's not about the code itself (which is valid),
it is about the value we load into R13. And this
value definitely is *not* valid in this special
case. The first write to anything addressed with
R13 corrupts the local area of another function.

It's a fact - not a Sony! ;)
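
To make the arithmetic explicit, here is a small C sketch using the (hexadecimal) addresses from the stack diagram quoted further up. The frame table and the function names are only assumptions for illustration; the bounds are read off the diagram, with 8 bytes per slot.

```c
#include <stdint.h>
#include <stddef.h>

/* Address ranges [lo, hi) of the three frames in the diagram. */
typedef struct {
    uint64_t lo, hi;
    int owner;                 /* function number */
} frame_range;

static const frame_range frames[] = {
    { 0x1280, 0x1310, 0 },     /* function 0: locals + return address */
    { 0x1240, 0x1280, 1 },     /* function 1: locals + return address */
    { 0x1210, 0x1240, 2 },     /* function 2: variables, R13-R15, return */
};

/* Returns the number of the function owning addr, or -1 if none. */
int frame_owner(uint64_t addr)
{
    for (size_t i = 0; i < sizeof frames / sizeof frames[0]; i++)
        if (addr >= frames[i].lo && addr < frames[i].hi)
            return frames[i].owner;
    return -1;
}

/* What "leaq disp(%rsp),%reg" computes. */
uint64_t lea_target(uint64_t rsp, int64_t disp)
{
    return rsp + (uint64_t)disp;
}
```

With RSP = 0x1210, lea_target(0x1210, 0x80) gives 0x1290, and frame_owner(0x1290) reports function 0, not the running function 2.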

>>> lea r13, [rsp+4096]
>>> mov rax, [r13-4092]
>>>
>>> this is in fact often how the basepointer is used in Win64 code...
>>>
>>> typically the same offset is reversed as part of the epilogue:
>>> lea rsp, [r13-4096]
>>> pop ...
>>> ...
>>> ret
>>
>> Does it help to understand code better than the more
>> obvious
>>
>> pushq %r15            # save regs
>> ...
>> movq %rsp,%r13        # copy reference point
>> ...
>> ...
>> movq -8(%r13),%rax    # address 1st local
>> ...
>> ...
>> movq %r13,%rsp        # restore reference point
>> popq %r15             # restore regs
>> ...
>> ret
>>
>> where MOV reg,reg is some byte shorter than LEA reg?
>
> this, however, is not an accepted prologue/epilogue form for the Win64 ABI.

Read both again. Even if I did not know MS' doc,
it is exactly the same. Any other order of these
instructions would not make sense.

Only one thing is missing in my snippet: I don't
allocate a stackframe (subq $whatever,%rsp).

> actually, I would really have liked it had I just been able to be like:
> push rbp
> mov rbp, rsp
> ...
> mov rsp, rbp
> pop rbp
> ret

Just do it that way.

> but, alas, this is not valid as per the ABI either...

As long as RSP does not change, it is valid.

>>>>> the offset is because the offset is required to be present, and also a
>>>>> multiple of 16 (for whatever unknown reason).
>>>>
>>>> The stack should be aligned to multiples of 16, so
>>>> we can use XMM instructions without adding tons of
>>>> (superfluous) alignments preceding them.
>>>>
>>>
>>> the offset itself though is required to be a multiple of 16, but it is
>>> not
>>> necessarily the case AFAIK that the stack is aligned by 16 at the time
>>> the
>>> lea is used.
>>
>> If it wasn't, it had to be aligned in every function
>> manually (adding a lot of bloat). At least, Win64ABI
>> defines 'RSP must be aligned to a paragraph boundary
>> before another function is called', so RSP ends with
>> an 8 when it enters the callee's code. That's why we
>> have to subtract something ending with 8 to create a
>> 'fixed RSP' stack frame.
>>
>> BTW - the usual prologue looks like this
>>
>> pushq %(whatever we use as basepointer)
>> movq %rsp,%(whatever we use as basepointer)
>>
>> where the initial PUSH automatically aligns RSP to a
>> multiple of 16 (if it was paragraph-aligned prior to
>> the call). Hence, no extra code is required to align
>> RSP.
>
> sadly, one can't do it this way.

Hm, I just omitted pushing nonvolatile registers
prior to allocating a stack frame. As MS's paper
says, we have to push registers first, then it's
allowed to grab some stack.
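
A small C sketch of the alignment arithmetic, assuming the Win64 rule quoted earlier (RSP is 16-byte aligned just before the CALL, so it ends with 8 on entry) and 8-byte pushes. frame_alloc is a hypothetical helper, not something taken from MS's paper: it returns the amount to subtract from RSP after the pushes so that RSP is a multiple of 16 again.

```c
/* pushes:      number of nonvolatile registers pushed in the prologue
   local_bytes: space wanted for locals and outgoing parameters
   Returns the value for "sub rsp, imm" that provides at least
   local_bytes and leaves RSP 16-byte aligned. */
unsigned long frame_alloc(unsigned pushes, unsigned long local_bytes)
{
    unsigned long used  = 8UL + 8UL * pushes;   /* return address + pushes */
    unsigned long total = (used + local_bytes + 15UL) & ~15UL;
    return total - used;
}
```

With no pushes and 32 bytes of locals this gives 40 (0x28, "something ending with 8"); with one push, e.g. the basepointer, it gives 32, because the push already restored the alignment.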

>>>>> in x86 cdecl, there was EBP, but code often bypassed it, leaving there
>>>>> being
>>>>> no way to perform a proper backtrace.
>>>>>
>>>>> in Win64, certain prologue and epilogue rules are required, and to do
>>>>> otherwise is a violation of the ABI.
>>>>>
>>>>> one can feel happy though that Win64 addresses matters of exception
>>>>> handling
>>>>> at the ABI and calling convention levels, rather than requiring
>>>>> debugging
>>>>> info (as is generally done on Linux, or IOW, one needs to deal with
>>>>> DWARF
>>>>> in
>>>>> order to safely have stack unwinding...).
>>>>
>>>> Agreed. The MS folks are one nose ahead with their
>>>> conventions at the moment. Nevertheless - it still
>>>> is too much bloat. 1k code and 2k resources result
>>>> in a 20 k application (compiled with mingw64). 17k
>>>> never used runtime stuff and exception handling is
>>>> a bad ratio between 'payload' and 'inert mass'. ;)
>>>>
>>>
>>> however, it can also be noted that MSVC has exception-handling in C code,
>>> and Windows itself throws internal exceptions occasionally.
>>>
>>> thus, broken exception handling risks crashing...
>>
>> A question of the belief system one is bound to. For
>> my taste, bad code simply should crash and be debug-
>> ged until it runs error-free. This prevents program-
>> mers from publishing pre-beta software and leave the
>> debugging part to the end-user.
>
> catching the exceptions though may be a good idea, since the usual issue is
> that a lot of these exceptions are for non-exceptional situations...

The most common exception is INT0D

"[Program Name/Description] has stopped working"
(Win7 message)

this is a 'General Protection Fault' aka 'Access
Violation' for real. All other exceptions (INT00
through INT1F) occur less, but are more critical
from the system's view.

>>> so, it was either all this, or probably the strategy used in Win32 SEH,
>>> which was notable in that it made exception handling fairly expensive,
>>> and also didn't readily allow generalized unwinding.
>>
>> As well as it allowed everyone to add some home-made
>> exception handling via 'hooks'.
>
> yep, this has some merit...

Not for real. The system's exception handler has
to wade through tons of 'exception handlers' put
into the ISR chain by 'It_was_me' or programmers
using rSP as GP register:

http://www.virtualdub.org/blog/pivot/entry.php?id=85

>>> this may matter some as it is common in non-C languages (such as C++,
>>> Java, and C#) to use exception handling as a sort of control-flow
>>> mechanism (IOW: for non-exceptional situations), and so there is some
>>> value in trying to have fast exceptions...
>>
>> I'm curious what comes up with the next 'generation'
>> of so called 'high level' languages. Probably, there
>> is some more bloat introducing some more performance
>> brakes. Everything can be slowed down while it still
>> is moving... ;)
>
> mandatory RTTI for all objects, maybe?...
> everything is name mangled?...

In German, I would write "Bahnhof" now. In English,
I can only state "I don't have a clue what you are
talking about!"...

(Bahnhof = Railway Station)

> this is not too far off from coding in C on .NET via C++/CLI...

Goto ([this_line] - 8)... ;)


BTW - this thread convinced me to publish a tiny
part of my former homepage:

http://st-intelligentdesign.blogspot.com/2010/04/introduction.html

(This is a link to the first of its 13 parts; to
read 'em in the proper order, you might consider
to use the directory ... it's a blog.)

That paper explains the stack in some detail. It
analyses major weaknesses of conventional C code
and introduces an advanced programming technique
called 'Intelligent Design' (including many code
samples to back up all claims).


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/13/2010 7:26:43 PM

Rod Pemberton wrote:

> "Bernhard Schornak"<schornak@nospicedham.web.de>  wrote in message
> news:hpvofh$9nv$1@news.eternal-september.org...
>>
>> Unfortunately, you have to use the stack - as long
>> as your function is called by other functions with
>> changing parameters, there is no way to use memory
>> instead of the stack.
>
> Sure there is.  Use memory as a stack...  x86 has instructions for just
> that: PUSH/POP to/from memory.

Good point!

Unfortunately, POP and PUSH increase/decrease rSP
when they are executed and store data where? Yes,
the stack.

Hence, we had to provide an extra register (named
RSP2) where the address of our secondary stack is
stored. Whenever our function is called, our RSP1
is exchanged with RSP2, so the real RSP points to
our private stack.

Um - is this thought as serious solution for this
problem? I think we had to do the same thing with
any incarnation of RSP(x) and stack(y) - creating
a stack frame, storing registers, et cetera.

> Most people forget about the memory form of
> these instructions.  But, you don't have to use the stack or memory as a
> stack for parameters.  You just have to allocate new parameter space per
> function call.

And where should that be (if neither on stack nor
in memory)? ;)

I think, polemon is at beginners level at the mo-
ment. It is counterproductive to confuse him with
such weird ideas. Do you agree with that?


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/13/2010 7:52:58 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message
news:hq2i34$mpe$1@news.eternal-september.org...
> Rod Pemberton wrote:
> > "Bernhard Schornak"<schornak@nospicedham.web.de>  wrote in message
> > news:hpvofh$9nv$1@news.eternal-september.org...
> >>
> >> Unfortunately, you have to use the stack - as long
> >> as your function is called by other functions with
> >> changing parameters, there is no way to use memory
> >> instead of the stack.
> >
> > Sure there is.  Use memory as a stack...  x86 has instructions for just
> > that: PUSH/POP to/from memory.
>
> Good point!
>
> Unfortunately, POP and PUSH increase/decrease rSP
> when they are executed and store data where? Yes,
> the stack.
>

If you don't like PUSH/POP for stack-to-register, nor like PUSH/POP mem for
stack-to-memory, then use MOV or XCHG for register-to-memory, or MOVS for
memory-to-memory...  I.e., there are still other ways to collect values into
a block of data, i.e., "stack frame", which contains the parameters for a
function call.

> Hence, we had to provide an extra register (named
> RSP2) where the address of our secondary stack is
> stored. Whenever our function is called, our RSP1
> is exchanged with RSP2, so the real RSP points to
> our private stack.
>
> Um - is this thought as serious solution for this
> problem? I think we had to do the same thing with
> any incarnation of RSP(x) and stack(y) - creating
> a stack frame, storing registers, et cetera.

Uh, if not an XCHG with PUSH/POP, then perhaps MOV relative to SP (or BP
with your frame pointer) to simulate PUSH/POP... ?

> > But, you don't have to use the stack or memory as a
> > stack for parameters.  You just have to allocate new
> > parameter space per
> > function call.
>
> And where should that be (if neither on stack nor
> in memory)? ;)
>

A) in memory - from your memory allocator...

I said "memory as a stack".  You reclassified that as "memory" in general.
One - memory as a stack - has a specified format: contiguous memory, stack
layout.  The other - memory - may or may not be contiguous, or formatted, or
used as a stack.  Stack based languages only require a "stack frame" to call
a function.  There's no requirement for more than one "stack frame" to be
placed together contiguously as a stack.  In fact, for C, the "C Rationale"
says that even the function arguments are not required to be stored
contiguously.  I.e., you don't even need a "stack frame" for calling C
functions...  An early version of C was implemented using a Fortran calling
convention with a "discontiguous stack" (CLCS R&J 1981).  I.e., a bunch of
"stack frames", but not allocated together in a stack.  The point being: as
long as you can access your function parameters and those parameters are
allocated on a per function call basis, you can call functions without
worrying about overwriting parameters from prior calls to the same function.
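
The point about per-call parameter blocks can be sketched in C. This is only an illustration under stated assumptions: the names fact_params, factorial_call and factorial are hypothetical, and a C compiler will still use the CPU stack for return addresses. What the sketch shows is that the arguments and the result of every call level live in a separately allocated block, and that these blocks need not sit next to each other in memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parameter block: one is allocated per call,
 * wherever the allocator happens to place it. */
struct fact_params {
    unsigned n;            /* input argument */
    unsigned long result;  /* slot for the return value */
};

/* Computes n! recursively. Every call level receives its own block,
 * so a nested call cannot overwrite the parameters of its caller.
 * Returns 0 on success, -1 if an allocation failed. */
int factorial_call(struct fact_params *p)
{
    assert(p != NULL);
    if (p->n <= 1) {
        p->result = 1;
        return 0;
    }
    struct fact_params *child = malloc(sizeof *child);
    if (child == NULL)
        return -1;
    child->n = p->n - 1;
    child->result = 0;
    int rc = factorial_call(child);
    if (rc == 0)
        p->result = p->n * child->result;
    free(child);           /* release the block when the call is done */
    return rc;
}

/* Convenience wrapper: allocates the outermost block as well.
 * Returns 0 if an allocation failed. */
unsigned long factorial(unsigned n)
{
    struct fact_params *top = malloc(sizeof *top);
    if (top == NULL)
        return 0;
    top->n = n;
    top->result = 0;
    unsigned long r = (factorial_call(top) == 0) ? top->result : 0;
    free(top);
    return r;
}
```

Calling factorial(5) walks through five separately allocated blocks. The price is one allocator round trip per call, which is presumably why real calling conventions prefer registers and the contiguous stack.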

> I think, polemon is at beginners level at the mo-
> ment. It is counterproductive to confuse him with
> such weird ideas. Do you agree with that?

He may be at a beginners level, but he doesn't need to stick to a
stack-based model of parameter passing for function calls for his own code.
He's coding in assembly where he can do whatever he wants - as long as he's
not interfacing to code from other sources.  Since I was just commenting on
necessity of using the cpu stack, I don't know if interfacing outside code
was discussed earlier in the thread.


Rod Pemberton



Reply Rod 4/13/2010 9:49:47 PM

Rod Pemberton wrote:

> Bernhard Schornak wrote:
>
>> Rod Pemberton wrote:
>>
>>> Bernhard Schornak wrote:
>>>
>>>> Unfortunately, you have to use the stack - as long
>>>> as your function is called by other functions with
>>>> changing parameters, there is no way to use memory
>>>> instead of the stack.
>>>
>>> Sure there is.  Use memory as a stack...  x86 has instructions for just
>>> that: PUSH/POP to/from memory.
>>
>> Good point!
>>
>> Unfortunately, POP and PUSH increase/decrease rSP
>> when they are executed and store data where? Yes,
>> the stack.
>>
>
> If you don't like PUSH/POP for stack-to-register, nor like PUSH/POP mem for
> stack-to-memory, then use MOV or XCHG for register-to-memory, or MOVS for
> memory-to-memory...  I.e., there are still other ways to collect values into
> a block of data, i.e., "stack frame", which contains the parameters for a
> function call.

This is what I suggested as alternative code

subq $0x78,%rsp        # create stack frame
movq %r13,0x58(%rsp)   # store registers
movq %r14,0x60(%rsp)
movq %r15,0x68(%rsp)
movq %rcx,0x70(%rsp)
....                    # function code
....
....
movq 0x58(%rsp),%r13   # restore registers
movq 0x60(%rsp),%r14
movq 0x68(%rsp),%r15
movq 0x70(%rsp),%rcx
addq $0x78,%rsp        # destroy stack frame
ret

in one of my previous posts. Everything can be
addressed via RSP, so no basepointer is needed
at all. PUSH and POP are mutually exclusive with
this technique!
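
Why 0x78 and not 0x80? The alignment arithmetic behind the frame size can be checked with a small C model. This is a sketch, not part of any ABI document: the helper names are hypothetical, and the only assumptions are the ones stated in this thread (RSP is a multiple of 16 before the CALL, and the CALL pushes an 8 byte return address).

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of 16 (paragraph-aligned). */
int is_paragraph_aligned(uint64_t addr)
{
    return (addr & 0xF) == 0;
}

/* Models RSP on entry to the callee: the caller's RSP was aligned,
 * then CALL pushed the 8 byte return address. */
uint64_t rsp_at_entry(uint64_t caller_rsp)
{
    assert(is_paragraph_aligned(caller_rsp));
    return caller_rsp - 8;
}

/* Models `subq $frame_size,%rsp`. */
uint64_t rsp_after_frame(uint64_t entry_rsp, uint64_t frame_size)
{
    return entry_rsp - frame_size;
}
```

With a caller RSP of 0x7fff0000, the entry value is 0x7ffefff8 (ends with 8). Subtracting 0x78 gives 0x7ffeff80, which is aligned again, while subtracting 0x80 would leave RSP ending with 8.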

>> Hence, we had to provide an extra register (named
>> RSP2) where the address of our secondary stack is
>> stored. Whenever our function is called, our RSP1
>> is exchanged with RSP2, so the real RSP points to
>> our private stack.
>>
>> Um - is this thought as serious solution for this
>> problem? I think we had to do the same thing with
>> any incarnation of RSP(x) and stack(y) - creating
>> a stack frame, storing registers, et cetera.
>
> Uh, if not an XCHG with PUSH/POP, then perhaps MOV relative to SP (or BP
> with your frame pointer) to simulate PUSH/POP... ?

Anything other than the usual prologue sequence
adds superfluous bloat to the function. Either
we use the conventional way with PUSH/POP and
a basepointer or the above method without one.
Both methods reduce overhead to the necessary
minimum.

>>> But, you don't have to use the stack or memory as a
>>> stack for parameters.  You just have to allocate new
>>> parameter space per
>>> function call.
>>
>> And where should that be (if neither on stack nor
>> in memory)? ;)
>>
> A) in memory - from your memory allocator...

Okay. Your original sentence confused me here,
because it implies that neither memory nor the
stack is used (leaves the option to store some
variables in XMM or other registers).

> I said "memory as a stack".  You reclassified that as "memory" in general.
> One - memory as a stack - has a specified format: contiguous memory, stack
> layout.  The other - memory - may or may not be contiguous, or formatted, or
> used as a stack.  Stack based languages only require a "stack frame" to call
> a function.  There's no requirement for more than one "stack frame" to be
> placed together contiguously as a stack.  In fact, for C, the "C Rationale"
> says that even the function arguments are not required to be stored
> contiguously.  I.e., you don't even need a "stack frame" for calling C
> functions...  An early version of C was implemented using a Fortran calling
> convention with a "discontiguous stack" (CLCS R&J 1981).  I.e., a bunch of
> "stack frames", but not allocated together in a stack.  The point being: as
> long as you can access your function parameters and those parameters are
> allocated on a per function call basis, you can call functions without
> worrying about overwriting parameters from prior calls to the same function.

I agree that we can take a million paths to go
to a desired destination. Unfortunately, there
is an operating system between us and our tar-
get, denying us to take most of those possible
paths, forcing us to use some standard methods
compatible with its calling conventions. These
conventions don't leave much room to implement
our own methods to bypass them. Hence, we have
to obey and apply those standard methods if we
want to create a function which is callable by
other functions. That is: It can interact with
any other function provided by someone else.

I disagree with your claim that parameters can
be passed in arbitrary order. The C convention
for x86(-64) clearly states the parameters are
put on the stack in ascending order - we leave
various x86-64 convention(s) to pass the first
few parameters in defined registers out of the
frame. My point is - no 'gaps' between any two
parameters are allowed. If we insert some gaps
intentionally, the called function reads bogus
data from the stack. If these are numeric data
to process, the returned results are void, but
nothing else happens. In case the callee tries
to use those bogus parameters as addresses for
writes, it either corrupts data somewhere else
or triggers a GP (what it did when it tried to
read there, as well).

http://en.wikipedia.org/wiki/X86_calling_conventions

>> I think, polemon is at beginners level at the mo-
>> ment. It is counterproductive to confuse him with
>> such weird ideas. Do you agree with that?
>
> He may be at a beginners level, but he doesn't need to stick to a
> stack-based model of parameter passing for function calls for his own code.
> He's coding in assembly where he can do whatever he wants - as long as he's
> not interfacing to code from other sources.  Since I was just commenting on
> necessity of using the cpu stack, I don't know if interfacing outside code
> was discussed earlier in the thread.

Actually, I do not know what polemon wanted to
code, either - I just know that he writes code
for Linux. Linux' calling conventions are very
strictly defined, leaving little room for code
bypassing them. The majority of x86(-64) OSes
define the stack as local storage for running
functions. This concept allows functions to be
reentrant. We might implement methods to avoid
accessing the stack, at all, but these methods
probably are unknown to any other function. If
no other function can communicate with us, our
function is unusable in the given environment.

In our special case, polemon was looking for a
way to skip the entire prologue and store some
local variables in memory, because creation of
a stack frame is a quite complicated issue for
beginners... ;)


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/14/2010 4:50:36 PM

On 04/14/2010 09:50 AM, Bernhard Schornak wrote:
> 
> in one of my previous posts. Everything can be
> addressed via RSP, so no basepointer is needed
> at all. PUSH and POP are mutually exclusive with
> this technique!
> 

No, they're not.

	-hpa
Reply H 4/14/2010 5:02:51 PM

"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message
news:hq4rp0$csm$1@news.eternal-september.org...
>
> Unfortunately, there
> is an operating system between us and our tar-
> get, denying us to take most of those possible
> paths, forcing us to use some standard methods
> compatible with its calling conventions.
>

That's not a good assumption to make.  In the case of many here, including
me, an OS is not always between us and the hardware.  We've coded
bootloaders.  We've coded our own minimal OSes.

> I disagree with your claim that parameters can
> be passed in arbitrary order. The C convention
> for x86(-64) clearly states the parameters are
> put on the stack in ascending order

What "C convention"?  What "C calling convention"?

There's no such thing for C.

I believe you've confused an application binary interface (ABI) required for
making calls to an operating system (OS) function with the binary
implementation of a function calling convention for a compilable high-level
language, such as C.  They can be the same, but are not required to be.

ANSI and ISO standardized C.  The K&R, ANSI, and ISO C specifications
do not specify a calling convention for C.  The ANSI C89 Rationale and ISO
C99 Rationale specify the following (and a few non-quoted tidbits):


From the C Rationale:

"The C89 Committee considered it important that a function taking a variable
number of arguments, printf for example, be expressible portably in C.
Hence, the C89 Committee devoted much time to exploring methods of
traversing variable argument lists. One proposal was to require arguments to
be passed as a 'brick,' that is, a contiguous area of memory, the layout of
which would be sufficiently well specified that a portable method of
traversing the brick could be determined."

"Several diverse implementations, however, can implement argument passing
more efficiently if the arguments are not required to be contiguous. Thus,
the C89 Committee decided to hide the implementation details of determining
the location of successive elements of an argument list behind a standard
set of macros."

Also, from the C Rationale:

"... Implementations in environments where efficient function calling
mechanisms are available must, in effect, use the efficient calling sequence
either in all 'fixed argument list' calls or in none. Since efficient
calling sequences often do not allow for variable argument functions, the
fixed part of a variable argument list may be passed in a completely
different fashion than in a fixed argument list with the same number and
types of arguments."


That's as close as C spec.'s come to specifying a calling convention for
C.  There's nothing requiring arguments to be passed in a contiguous 'brick'.
They concluded it wasn't the best solution.  I.e., there can be gaps.
There's nothing requiring right-to-left parameter passing.  In fact, it
expects you to use fixed left-to-right for functions with fixed argument
lists, if that calling convention is more efficient.
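
The "standard set of macros" from the Rationale quote is what <stdarg.h> provides. A minimal sketch (the function name sum_ints is made up for illustration): the code never assumes where the variable arguments are stored, it only asks va_arg for the next one.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments. Where the arguments actually live
 * (registers, stack, contiguous or not) is hidden behind va_list. */
int sum_ints(int count, ...)
{
    assert(count >= 0);
    va_list ap;
    int total = 0;

    va_start(ap, count);           /* start after the last fixed parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);  /* fetch the next argument as int */
    va_end(ap);

    return total;
}
```

The same source compiles unchanged for conventions that pass everything on the stack and for those that pass the first arguments in registers, which is the portability the committee was after.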


What you're referring to as the "C convention" is the "System V Application
Binary Interface" and "System V Application Binary Interface x86-64(tm)
Architecture Processor Supplement" extensions.  That's different.  It's not
a C calling convention.  It's a binary interface for callable OS functions.
It specifies the OS system binary interface that compiled programs on AT&T
UNIX System V must use to call OS functions.  It doesn't specify the calling
convention used by a high-level language like C.  It was designed for
calling an OS function from compiled C, but applies to all executables which
call OS functions on SysV.  SysV ABI recommends that the C language use the
same method as the OS, but does not require it:

"The standard calling sequence requirements apply only to global functions.
Local functions that are not reachable from other compilation units may use
different conventions. Nevertheless, it is recommended that all functions
use the standard calling sequence when possible."


From SysV ABI intro:

"The System V Application Binary Interface, or ABI, defines a system
interface for compiled application programs and a minimal environment for
support of installation scripts. Its purpose is to document a standard
binary interface for application programs on systems that implement an
operating system that complies with the X/Open Common Application
Environment Specification, Issue 4.2 and the System V Interface Definition,
fourth Edition."

"The ABI defines a binary interface for application programs that are
compiled and packaged for System V implementations on many different
hardware architectures.  Since a binary specification must include
information specific to the computer processor architecture for which it is
intended, it is not possible for a single document to specify the interface
for all possible System V implementations. Therefore, the System V ABI is a
family of specifications, rather than a single one."

> My point is - no 'gaps' between any two
> parameters are allowed.

See C Rationale quotes above.  It's authoritative for C.  SysV ABI is
authoritative for OS calls, but not other function calls.


Rod Pemberton



Reply Rod 4/14/2010 7:51:55 PM

H. Peter Anvin wrote:

> On 04/14/2010 09:50 AM, Bernhard Schornak wrote:
>>
>> in one of my previous posts. Everything can be
>> addressed via RSP, so no basepointer is needed
>> at all. PUSH and POP are mutually exclusive with
>> this technique!
>
> No, they're not.

If you use RSP for addressing, you have to ad-
just it after each PUSH / POP (or use changing
offsets). It's easier to use MOV, only.

BTW: On recent processors, no one wants to use
PUSH or POP, anyway. Put this into a test loop
and execute it:

....                   # requires 64 byte stack
....
rdtsc
movl %eax,0x30(%rsp)  # store start
movl %edx,0x34(%rsp)
pushq %rax
pushq %rbx
pushq %rcx
pushq %rdx
pushq %rdi
pushq %rsi
rdtsc
subl 0x60(%rsp),%eax  # calc difference
sbbl 0x64(%rsp),%edx
addq $0x30,%rsp
movl %eax,0x30(%rsp)  # store result 1
movl %edx,0x34(%rsp)
rdtsc
movl %eax,0x38(%rsp)  # store start
movl %edx,0x3C(%rsp)
movq %rax,0x00(%rsp)
movq %rbx,0x08(%rsp)
movq %rcx,0x10(%rsp)
movq %rdx,0x18(%rsp)
movq %rdi,0x20(%rsp)
movq %rsi,0x28(%rsp)
rdtsc
subl 0x38(%rsp),%eax  # calc difference
sbbl 0x3C(%rsp),%edx
....                   # display or save

The time required for the PUSH sequence is stored
at 0x30/0x34(%rsp), the time for the MOV-only
sequence is left in EAX/EDX. The code is optimised
for both iNTEL and AMD processors (2 or 3 pipes).
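
The "changing offsets" are visible in the test code: the start value is stored at 0x30(%rsp), but after six 8 byte PUSHes it has to be read back at 0x60(%rsp). A toy model in C makes this explicit. It is only a sketch with hypothetical names (toy_stack and friends) and simulates a downward growing stack in a byte array; it does not touch the real RSP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a downward growing stack. `rsp` is a byte index
 * into `mem`, playing the role of the stack pointer. */
struct toy_stack {
    uint8_t  mem[256];
    uint32_t rsp;
};

void toy_init(struct toy_stack *s)
{
    memset(s->mem, 0, sizeof s->mem);
    s->rsp = 128;                      /* arbitrary start in the middle */
}

/* Models `movq value,offset(%rsp)`. */
void toy_store64(struct toy_stack *s, uint32_t offset, uint64_t value)
{
    assert(s->rsp + offset + 8 <= sizeof s->mem);
    memcpy(&s->mem[s->rsp + offset], &value, 8);
}

/* Models `movq offset(%rsp),reg`. */
uint64_t toy_load64(const struct toy_stack *s, uint32_t offset)
{
    uint64_t value;
    assert(s->rsp + offset + 8 <= sizeof s->mem);
    memcpy(&value, &s->mem[s->rsp + offset], 8);
    return value;
}

/* Models `pushq reg`: RSP drops by 8, then the value is stored. */
void toy_push64(struct toy_stack *s, uint64_t value)
{
    assert(s->rsp >= 8);
    s->rsp -= 8;
    memcpy(&s->mem[s->rsp], &value, 8);
}
```

Store a value at offset 0x30, push six times, and the same value is found at 0x30 + 6 * 8 = 0x60. With a fixed RSP and MOV only, the offset stays 0x30 throughout the function.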


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/14/2010 8:23:37 PM

Rod Pemberton wrote:

> Bernhard Schornak wrote:
>>
>> Unfortunately, there
>> is an operating system between us and our tar-
>> get, denying us to take most of those possible
>> paths, forcing us to use some standard methods
>> compatible with its calling conventions.
>
> That's not a good assumption to make.  In the case of many here, including
> me, an OS is not always between us and the hardware.  We've coded
> bootloaders.  We've coded our own minimal OSes.

I do not question your competence... :)

>> I disagree with your claim that parameters can
>> be passed in arbitrary order. The C convention
>> for x86(-64) clearly states the parameters are
>> put on the stack in ascending order
>
> What "C convention"?  What "C calling convention"?
>
> There's no such thing for C.
>
> I believe you've confused an application binary interface (ABI) required for
> making calls to an operating system (OS) function with the binary
> implementation of a function calling convention for a compilable high-level
> language, such as C.  They can be the same, but are not required to be.
>
> ANSI and ISO standardized C.  The K&R, ANSI, and ISO C specifications
> do not specify a calling convention for C.  The ANSI C89 Rationale and ISO
> C99 Rationale specify the following (and a few non-quoted tidbits):
>
>
>  From the C Rationale:
>
> "The C89 Committee considered it important that a function taking a variable
> number of arguments, printf for example, be expressible portably in C.
> Hence, the C89 Committee devoted much time to exploring methods of
> traversing variable argument lists. One proposal was to require arguments to
> be passed as a 'brick,' that is, a contiguous area of memory, the layout of
> which would be sufficiently well specified that a portable method of
> traversing the brick could be determined."
>
> "Several diverse implementations, however, can implement argument passing
> more efficiently if the arguments are not required to be contiguous. Thus,
> the C89 Committee decided to hide the implementation details of determining
> the location of successive elements of an argument list behind a standard
> set of macros."
>
> Also, from the C Rationale:
>
> "... Implementations in environments where efficient function calling
> mechanisms are available must, in effect, use the efficient calling sequence
> either in all 'fixed argument list' calls or in none. Since efficient
> calling sequences often do not allow for variable argument functions, the
> fixed part of a variable argument list may be passed in a completely
> different fashion than in a fixed argument list with the same number and
> types of arguments."
>
>
> That's as close as C spec.'s come to specifying a calling convention for
> C.  There's nothing requiring arguments to be passed in a contiguous 'brick'.
> They concluded it wasn't the best solution.  I.e., there can be gaps.
> There's nothing requiring right-to-left parameter passing.  In fact, it
> expects you to use fixed left-to-right for functions with fixed argument
> lists, if that calling convention is more efficient.
>
>
> What you're referring to as the "C convention" is the "System V Application
> Binary Interface" and "System V Application Binary Interface x86-64(tm)
> Architecture Processor Supplement" extensions.  That's different.  It's not
> a C calling convention.  It's a binary interface for callable OS functions.
> It specifies the OS system binary interface that compiled programs on AT&T
> UNIX System V must use to call OS functions.  It doesn't specify the calling
> convention used by a high-level language like C.  It was designed for
> calling an OS function from compiled C, but applies to all executables which
> call OS functions on SysV.  SysV ABI recommends that the C language use the
> same method as the OS, but does not require it:
>
> "The standard calling sequence requirements apply only to global functions.
> Local functions that are not reachable from other compilation units may use
> different conventions. Nevertheless, it is recommended that all functions
> use the standard calling sequence when possible."
>
>
>  From SysV ABI intro:
>
> "The System V Application Binary Interface, or ABI, defines a system
> interface for compiled application programs and a minimal environment for
> support of installation scripts. Its purpose is to document a standard
> binary interface for application programs on systems that implement an
> operating system that complies with the X/Open Common Application
> Environment Specification, Issue 4.2 and the System V Interface Definition,
> fourth Edition."
>
> "The ABI defines a binary interface for application programs that are
> compiled and packaged for System V implementations on many different
> hardware architectures.  Since a binary specification must include
> information specific to the computer processor architecture for which it is
> intended, it is not possible for a single document to specify the interface
> for all possible System V implementations. Therefore, the System V ABI is a
> family of specifications, rather than a single one."
>
>> My point is - no 'gaps' between any two
>> parameters are allowed.
>
> See C Rationale quotes above.  It's authoritative for C.  SysV ABI is
> authoritative for OS calls, but not other function calls.

First: Thank you for the time you spent to write
your reply - I learned a lot of things I did not
know before. Many of them will be forgotten over
time, because I don't need them for my own work.

While C never was 'my thing', interaction with C
functions is of concern for e.g. my libraries or
apps. I probably confuse the 'C Rationale' with the
functions I compile with GCC, but - GCC's output
is the only code I see and my (or anyone else's)
functions must comply with GCC's way of passing
parameters. This is not only valid for calls to some
OS functions - it also is valid for any function
not touching OS services at all, e.g. all library
functions like qsort. If we do this

....
pushl $compare_fnct
pushl $whatever3     # insert gap 3
pushl $whatever2     # insert gap 2
pushl $size
pushl $whatever      # insert gap 1
pushl $num
pushl array_address
call  _qsort
....

then qsort definitely loads whatever as size and
size as address of the compare function. This is
the way it is done in real life. If theory knows
other ways, that's fine, but it does not work in
the given environment.
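
For comparison, the same call written in C. The caller only names the four arguments in the order of the qsort prototype; where they end up (pushed on the stack for 32 bit cdecl, in RDI, RSI, RDX and RCX for the x86-64 System V convention) is decided by the compiler for the platform it targets. The helper names compare_ints and sort_ints are made up for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback in the form qsort expects:
 * negative, zero or positive for a < b, a == b, a > b. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids overflow of x - y */
}

/* Sorts `num` ints in ascending order. The arguments are passed
 * in prototype order: base, element count, element size, callback. */
void sort_ints(int *array, size_t num)
{
    assert(array != NULL || num == 0);
    qsort(array, num, sizeof array[0], compare_ints);
}
```

An assembly caller has to reproduce exactly the placement the compiler would have chosen, which is why extra values between the arguments break the call.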

I think you agree the sample code won't run in a
Linux environment nor on any other x86 platform.
Therefore, 'gaps', even if 'C Rationale'  allows
such a thing, are not working in existing imple-
mentations of 'C Rationale' like GCC. Of course,
I tend to overweight the practical side, leaving
the theory behind alone...

It takes many lifespans to get in touch with the
basic concepts of every computer, operating sys-
tem and programming language. Unfortunately, any
human has only one lifespan to learn... ;)


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/14/2010 10:48:56 PM

Bernhard Schornak wrote:
> H. Peter Anvin wrote:
>>> in one of my previous posts. Everything can be
>>> addressed via RSP, so no basepointer is needed
>>> at all. PUSH and POP are mutually exclusive with
>>> this technique!

>> No, they're not.

> If you use RSP for addressing, you have to ad-
> just it after each PUSH / POP (or use changing
> offsets). It's easier to use MOV, only.
>
> BTW: On recent processors, no one wants to use
> PUSH or POP, anyway. Put this into a test loop
> and execute it:

> ...                   # requires 64 byte stack
> ...
> rdtsc
> movl %eax,0x30(%rsp)  # store start
> movl %edx,0x34(%rsp)
> pushq %rax
> pushq %rbx
> pushq %rcx
> pushq %rdx
> pushq %rdi
> pushq %rsi
> rdtsc
> subl 0x60(%rsp),%eax  # calc difference
> sbbl 0x64(%rsp),%edx
> addq $0x30,%rsp
> movl %eax,0x30(%rsp)  # store result 1
> movl %edx,0x34(%rsp)
> rdtsc
> movl %eax,0x38(%rsp)  # store start
> movl %edx,0x3C(%rsp)
> movq %rax,0x00(%rsp)
> movq %rbx,0x08(%rsp)
> movq %rcx,0x10(%rsp)
> movq %rdx,0x18(%rsp)
> movq %rdi,0x20(%rsp)
> movq %rsi,0x28(%rsp)
> rdtsc
> subl 0x38(%rsp),%eax  # calc difference
> sbbl 0x3C(%rsp),%edx
> ...                   # display or save
>
> The time required for the PUSH sequence is stored
> at 0x30/0x34(%rsp), the time for the MOV-only
> sequence is left in EAX/EDX. The code is optimised
> for both iNTEL and AMD processors (2 or 3 pipes).

I can't check this yet, because my 64-bit OS is still on
the to-do list, even though I have worked with a Phenom II
for a while.

What was the difference in your test?

timing performance of this may depend on alignment
to cache bounds and on busy pipes and fetch status.
PUSH/POP are quite shorter opcodes than MOVes are,
and can easier reside within one code fetch then.

You save on a few micro-ops per instruction, but
on cost of code size ...

Unfortunately we haven't got PUSHAQ/POPAQ for 64bit,
PUSHAD/POPAD were as fast as three discrete PUSH/POP.

__
wolfgang


Reply wolfgang 4/15/2010 10:01:58 AM

On Thu, 15 Apr 2010 12:01:58 +0200
"wolfgang kern" <nowhere@never.at> wrote:

> 
> timing performance of this may depend on alignment
> to cache bounds and on busy pipes and fetch status.
> PUSH/POP are quite shorter opcodes than MOVes are,
> and can easier reside within one code fetch then.
> 
> You save on a few micro-ops per instruction, but
> on cost of code size ...
> 
> Unfortunately we haven't got PUSHAQ/POPAQ for 64bit,
> PUSHAD/POPAD were as fast as three discrete PUSH/POP.

I use the most natural way for humans. The compiler
performs unnatural assembly coding, IMO.
I use push/pop and registers, sometimes relative
to sp or bp ;)
There is no definitive rule; it depends on the situation.
Also, you can use xmm registers instead of the stack ;)

You can put four 32-bit ints in one xmm register, e.g.
There are also mm registers.
Assembly is freedom of choice; you only need
to follow conventions when calling an HLL function or
when you expect to be called from an HLL.

Greets!

-- 
http://maxa.homedns.org/

Sometimes online sometimes not




Reply Branimir 4/15/2010 12:19:21 PM

wolfgang kern wrote:

> Bernhard Schornak wrote:
>> H. Peter Anvin wrote:
>>
>> BTW: On recent processors, no one wants to use
>> PUSH or POP, anyway. Put this into a test loop
>> and execute it:
>
>> ...                   # requires 64 byte stack
>> ...
>> rdtsc
>> movl %eax,0x30(%rsp)  # store start
>> movl %edx,0x34(%rsp)
>> pushq %rax
>> pushq %rbx
>> pushq %rcx
>> pushq %rdx
>> pushq %rdi
>> pushq %rsi
>> rdtsc
>> subl 0x60(%rsp),%eax  # calc difference
>> sbbl 0x64(%rsp),%edx
>> addq $0x30,%rsp
>> movl %eax,0x30(%rsp)  # store result 1
>> movl %edx,0x34(%rsp)
>> rdtsc
>> movl %eax,0x38(%rsp)  # store start
>> movl %edx,0x3C(%rsp)
>> movq %rax,0x00(%rsp)
>> movq %rbx,0x08(%rsp)
>> movq %rcx,0x10(%rsp)
>> movq %rdx,0x18(%rsp)
>> movq %rdi,0x20(%rsp)
>> movq %rsi,0x28(%rsp)
>> rdtsc
>> subl 0x38(%rsp),%eax  # calc difference
>> sbbl 0x3C(%rsp),%edx
>> ...                   # display or save
>>
>> The time required for the PUSH sequence is stored
>> at 0x30/0x34(%rsp), the time for the MOV-only
>> sequence is left in EAX/EDX. The code is optimised
>> for both iNTEL and AMD processors (2 or 3 pipes).
>
> I can't check this yet, because my 64-bit OS is still on
> the to-do list, even though I have been working with a
> Phenom II for a while.
>
> What was the difference in your test?

I ran similar tests on my older machine (a dual
core Athlon64) two years ago, with just two and
three registers being pushed. The MOVed version
was always one to three clocks faster than the
conventional PUSHes. The differences were probably
caused by not synchronizing RDTSC and by some less
obvious cache issues (OS/2 has a 'flying' ESP).
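
As a rough sketch of what a better synchronized measurement could look
like from C (this is not the code used for the tests above): the read
of the time stamp counter is fenced with LFENCE, and the minimum over
many runs is taken to filter out interrupts and cache misses. The names
ticks, min_ticks and sample_work are hypothetical, and it is an
assumption that LFENCE acts as an execution barrier on the CPU in use.

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* Fenced time stamp read: LFENCE on both sides keeps neighbouring
   instructions from drifting across RDTSC (assumption: LFENCE behaves
   as an execution barrier on this CPU). */
static inline uint64_t ticks(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
/* Fallback for non-x86 builds: coarse CPU time, not cycles. */
static inline uint64_t ticks(void)
{
    return (uint64_t)clock();
}
#endif

/* Run fn() `runs` times and return the smallest tick count seen. */
static uint64_t min_ticks(void (*fn)(void), unsigned runs)
{
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint64_t t0 = ticks();
        fn();
        uint64_t t1 = ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Placeholder workload; the code under test would go here. */
static unsigned calls;
static void sample_work(void)
{
    calls++;
}
```

Calling min_ticks(sample_work, 1000) once with an empty workload gives
the overhead of the measurement itself, which can then be subtracted
from the result for the real code.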

> timing performance of this may depend on alignment
> to cache bounds and on busy pipes and fetch status.
> PUSH/POP are quite shorter opcodes than MOVes are,
> and can easier reside within one code fetch then.

Possible. The MOV version benefits greatly from
the way parameters are MOVed. Several writes to
strictly ascending addresses trigger the write
combining mechanism. PUSHes are done 'top down'
and cannot benefit from built-in acceleration
mechanisms. However, PUSH and POP are optimised
at the hardware level. They will keep pace, but
are always slower than simple MOVs. Because RSP
must be updated as part of these instructions,
they cannot run simultaneously (each one depends
on the previous one), while MOVs can be arranged
in patterns without dependencies.

> You save on a few micro-ops per instruction, but
> on cost of code size ...

Yes & No. Apart from the clock counts, designs
with a fixed RSP and a 'MOV only' restriction have
a lot of other benefits which are not as obvious
as a pure clock cycle comparison. In fact, code
becomes denser and executes faster with such
designs - have a look at Optimisation 04 in my ID
paper:

http://st-intelligentdesign.blogspot.com/2010/04/13-appendix-2.html

The stack frame is created on demand - LDinit()
does not expect parameters, so no stack frame is
required if no error occurs. I just spotted a
major flaw: the conditional jump triggers a
penalty (easy to avoid with a few corrections).

The only place where size really matters is the
code we have to write for MBR boot sectors; any
other code is not restricted to a few bytes more
or less. My 'PUSH and POP free' MBR is 220 bytes
(another 70 bytes are occupied by messages).

> Unfortunately we haven't got PUSHAQ/POPAQ for 64bit,
> PUSHAD/POPAD were as fast as three discrete PUSH/POP.

AMD probably took into account that they would
have to save/restore R8-R15 as well. After all,
the mnemonic suggests *all* registers are
saved/restored. ;)


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/15/2010 2:25:20 PM

Bernhard Schornak wrote:
....
>> What was the difference in your test?

> I ran similar tests on my older machine (a dual
> core Athlon64) two years ago, with just two and
> three registers being pushed. The MOVed version
> always was one to three clocks faster than con-
> ventional PUSHes. The differences probably were
> caused by not synchronizing RDTSC and some less
> obvious cache issues (OS/2 has a 'flying' ESP).

>> timing performance of this may depend on alignment
>> to cache bounds and on busy pipes and fetch status.
>> PUSH/POP are quite shorter opcodes than MOVes are,
>> and can easier reside within one code fetch then.

> Possible. The MOV version greatly benefits from
> the way parameters are MOVed. Several writes to
> straight ascending addresses trigger the write
> combining mechanism. PUSHes are done 'top down'
> and cannot benefit from built-in acceleration
> mechanisms. However, PUSH and POP are optimised
> on hardware level. They will keep pace, but are
> always slower than simple MOVs. Because the RSP
> must be updated as a part of these instructions,
> they cannot run simultaneously (introducing de-
> pendencies) while MOV can be paired to patterns
> without dependencies.

>> You save on a few micro-ops per instruction, but
>> on cost of code size ...

> Yes & No. Besides counting clocks, designs with
> fixed RSP and 'MOV only restriction' bear a lot
> of other benefits which are not as obvious as a
> pure clock cycle comparison. In fact, code will
> be more dense and executes faster with such de-
> signs - have a look at Optimisation 04 in my ID
> paper:

> http://st-intelligentdesign.blogspot.com/2010/04/13-appendix-2.html

Well done Bernhard,
and really good arguments against the belief that compilers
do anything smart in terms of optimisation!

__
wolfgang


Reply wolfgang 4/16/2010 2:52:15 PM

Branimir Maksimovic wrote:

>> Unfortunately we haven't got PUSHAQ/POPAQ for 64bit,
>> PUSHAD/POPAD were as fast as three discrete PUSH/POP.

> I use most natural way for humans. Compiler performs
> unnatural assembly coding, IMO.
> I use push/pop and registers. Sometimes relative
> to sp or bp;)
> No definitive rule but depending on situation.
> Also you can use xmm registers instead of stack ;)

Yeah,
and I avoid passing parameters by copy whenever possible, so
my OS functions work on the caller's instance-owned data
structure through just one pointer register. Why copy? Just
read the original and write the results back to it, or hold
them in up to seven (fifteen if I ever go 64-bit) registers,
in addition to the eight status combinations given by the
Cy/S/Z flags.
Isn't this more return information than eax==0 or else ? ;)
No stack frames or locals are required, and the routines
become reentrant too.
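
For those who want to compare this with the HLL side, a C approximation
of the same idea might look like the sketch below. The struct layout,
the status codes and the name divmod are invented for illustration, and
a C compiler is of course free to create a stack frame anyway. Under
the System V x86-64 convention the single pointer would arrive in RDI.

```c
#include <stdint.h>

/* Hypothetical caller-owned request block; field names are made up. */
struct req {
    uint32_t a, b;       /* inputs, read in place          */
    uint32_t quot, rem;  /* results, written back in place */
};

/* Status codes standing in for the flag combinations an assembly
   routine could hand back in Cy/S/Z. */
enum status { ST_OK = 0, ST_DIV_ZERO = 1 };

/* Works only on *r: no copies of the arguments and no static state,
   so it is reentrant as long as each caller owns its own block. */
static enum status divmod(struct req *r)
{
    if (r->b == 0)
        return ST_DIV_ZERO;     /* results are left untouched */
    r->quot = r->a / r->b;
    r->rem  = r->a % r->b;
    return ST_OK;
}
```

The caller fills in a and b, passes the address of its own block, and
finds the results in the same block afterwards; only the status travels
back in a register.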

> You can put 4 32 bit ints in one xmm register eg.
> There are also mm registers.
> Assembly is freedom of choice, you only need
> to follow conventions when calling hll function or expect
> to be called from hll.

Agreed. Really sad that windoze 'and' Linux are HLL-based.
__
wolfgang


Reply wolfgang 4/16/2010 3:44:34 PM

wolfgang kern wrote:

[snip]

> Well done Bernhard,
> and really good arguments against the belief that compilers
> do anything smart in terms of optimisation!

GCC probably isn't the only compiler emitting
less than optimal code. I only use its AS part,
and generally don't care what its C part emits.
I just use it sometimes to translate the
'spaghetti+sauce' found in C files into
human-readable mnemonics (using GCC as 'sauce
separator' and 'spaghetti unwinder')... ;)


Have a nice weekend!

Bernhard Schornak
Reply Bernhard 4/16/2010 5:55:11 PM

Branimir Maksimovic wrote:

> wolfgang kern wrote:
>
>> timing performance of this may depend on alignment
>> to cache bounds and on busy pipes and fetch status.
>> PUSH/POP are quite shorter opcodes than MOVes are,
>> and can easier reside within one code fetch then.
>>
>> You save on a few micro-ops per instruction, but
>> on cost of code size ...
>>
>> Unfortunately we haven't got PUSHAQ/POPAQ for 64bit,
>> PUSHAD/POPAD were as fast as three discrete PUSH/POP.
>
> I use most natural way for humans. Compiler performs
> unnatural assembly coding, IMO.

Yes - any of them.

> I use push/pop and registers. Sometimes relative
> to sp or bp;)
> No definitive rule but depending on situation.
> Also you can use xmm registers instead of stack ;)
>
> You can put 4 32 bit ints in one xmm register eg.
> There are also mm registers.

Depending on the OS, using MM registers might
cause problems. The MM registers are aliased onto
the x87 FP registers st(0)...st(7), and older
operating systems, e.g. OS/2, still use those in
kernel and API routines. Most 64-bit OSes do not
use FP or MM registers at all, leaving a loophole
to abuse them as temporary storage.

> Assembly is freedom of choice, you only need
> to follow conventions when calling hll function or expect
> to be called from hll.

Most functions do. Hence, you're bound to the
register usage dictated by the OS.


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 4/16/2010 6:18:31 PM
