While looking at disassembled code generated by gas, I found something
like:
66 data16
66 data16
90 nop
66 data16
90 nop
I don't have an example of my own to show right now, but you can find one
at http://www.google.com/codesearch/p?hl=en#kVt6p8A1K_o/bttv-0.7.107/driver/bttv.asm&q=data16%20lang:assembly
The number of data16 prefixes and nops may differ. What's the purpose of
the assembler generating such code?
I have read the gas manual, but it just states that "Operand/Address size
prefixes data16 and addr16 change 32-bit operands/addresses into 16-bit
operands/addresses". I cannot relate that to the code above.
Thanks
Posted by omkarenator on 5/26/2009 2:04:18 PM
On Tue, 26 May 2009 15:04:18 +0100, omkarenator <omkarenator@gmail.com>
wrote:
> While looking at disassembled code generated by gas, found something
> like :
>
> 66 data16
> 66 data16
> 90 nop
> 66 data16
> 90 nop
>
> Currently I dont have an example to show, but you can find an example
> at
> http://www.google.com/codesearch/p?hl=en#kVt6p8A1K_o/bttv-0.7.107/driver/bttv.asm&q=data16%20lang:assembly
Looks like padding to me. The number of data16 prefixes used is always
chosen to ensure that the next basic block starts on a 16-byte boundary.
The actual code looks unreachable.
That might be a performance win if it meant fewer cache misses on the
"usual" execution path. Then again, it might just cause cache misses
further on in the code. Is there much evidence that this sort of thing
is worthwhile?
Personally I'd have used INT3s or just repeated NOPs, but perhaps there
is some cunning reason for this choice.
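
The padding arithmetic behind that guess can be sketched in a few lines of Python. This is only an illustration: the function name `padding_to_boundary` and the sample addresses are made up here, and the 16-byte boundary is the figure suggested above, not something checked against the gas sources.

```python
def padding_to_boundary(address: int, boundary: int = 16) -> int:
    """Return how many filler bytes are needed so that `address`
    advances to the next multiple of `boundary` (0 if already aligned)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    # (-address) mod boundary is the distance up to the next multiple.
    return -address % boundary


# Hypothetical addresses, chosen only for illustration.
for addr in (0x1000, 0x100B, 0x100F):
    pad = padding_to_boundary(addr)
    print(f"{addr:#x}: {pad} filler byte(s), next block at {addr + pad:#x}")
```

A block that would otherwise start at an address ending in Bh needs five filler bytes, which is the length of the sequence in the original question.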
Posted by Ken on 5/26/2009 2:46:04 PM
omkarenator <omkarenator@gmail.com> writes:
>While looking at disassembled code generated by gas, found something
>like :
>
> 66 data16
> 66 data16
> 90 nop
> 66 data16
> 90 nop
>
>Currently I dont have an example to show, but you can find an example
>at http://www.google.com/codesearch/p?hl=en#kVt6p8A1K_o/bttv-0.7.107/driver/bttv.asm&q=data16%20lang:assembly
>
>The number of data16 and nop may differ. Whats the purpose of
>assembler to generate such code?
That's just longer nop sequences (IIRC this is the one recommended by
AMD), probably to align the code that follows (typically a loop entry
point) to a cache line boundary.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Posted by anton on 5/26/2009 3:20:56 PM
Ken Hagan wrote:
> On Tue, 26 May 2009 15:04:18 +0100, omkarenator <omkarenator@gmail.com>
> wrote:
>
>> While looking at disassembled code generated by gas, found something
>> like :
>>
>> 66 data16
>> 66 data16
>> 90 nop
>> 66 data16
>> 90 nop
> Personally I'd have used INT3s or just repeated NOPs, but perhaps there
> is some cunning reason for choice.
The AMD decoder can swallow up to N prefix bytes as part of each
instruction, with no speed penalty?
I.e. if this code is ever executed, the running time will be for two
NOPs instead of 5.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Posted by Terje on 5/26/2009 8:11:56 PM
omkarenator wrote:
> While looking at disassembled code generated by gas, found something
> like :
>
> 66 data16
> 66 data16
> 90 nop
> 66 data16
> 90 nop
>
> Currently I dont have an example to show, but you can find an example
> at http://www.google.com/codesearch/p?hl=en#kVt6p8A1K_o/bttv-0.7.107/driver/bttv.asm&q=data16%20lang:assembly
>
> The number of data16 and nop may differ. Whats the purpose of
> assembler to generate such code?
>
> I have read the gas manual it just states that "Operand/Address size
> prefixes data16 and addr16 change 32-bit operands/addresses into 16-bit
> operands /addresses". I cannot relate that to above mentioned code.
The 66h prefix normally does exactly what the manual says, but that
isn't so helpful in this case...
The basic problem is that there aren't different instruction encodings
for 16-bit and 32-bit operations; a code segment (or is that a page?)
has a "default" operand size, and if you want the "other" size, you need
the 66h prefix. For instance, in 16-bit mode a "MOV AX, BX" can be
turned into "MOV EAX, EBX" by adding a 66h prefix. Conversely, in
32-bit mode a "MOV EAX, EBX" can be turned into "MOV AX, BX" by adding
the same 66h prefix.
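
As a rough illustration of that toggle, here is a small Python sketch. It is not a disassembler: the function name `effective_operand_bits` is invented for this example, it only looks at a leading 66h byte, and it ignores 64-bit mode and every other prefix. The byte pair 89 D8 is the register-to-register MOV with (E)AX as destination and (E)BX as source.

```python
OPERAND_SIZE_PREFIX = 0x66


def effective_operand_bits(code: bytes, default_bits: int) -> int:
    """Operand size of one instruction, given the segment's default size.

    Sketch only: a leading 66h selects the non-default size; anything
    else about the instruction is ignored.
    """
    if default_bits not in (16, 32):
        raise ValueError("default_bits must be 16 or 32")
    prefixed = code[:1] == bytes([OPERAND_SIZE_PREFIX])
    other_bits = 48 - default_bits  # 16 <-> 32
    return other_bits if prefixed else default_bits


MOV_AX_BX = bytes.fromhex("89d8")            # MOV (E)AX, (E)BX
PREFIXED = bytes([OPERAND_SIZE_PREFIX]) + MOV_AX_BX

for default in (16, 32):
    for code in (MOV_AX_BX, PREFIXED):
        bits = effective_operand_bits(code, default)
        print(f"{default}-bit default, bytes {code.hex(' ')}: {bits}-bit operands")
```

The same three bytes 66 89 D8 therefore mean a 32-bit move in a 16-bit segment and a 16-bit move in a 32-bit segment.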
However, NOP doesn't take any operands, so the 66h prefix has absolutely
no effect. Why use it? Because this sequence:
90 NOP
90 NOP
90 NOP
90 NOP
90 NOP
is five instructions, while this sequence:
66 DATA16
66 DATA16
90 NOP
66 DATA16
90 NOP
is only two instructions and therefore may decode/execute faster, even
though they're both exactly five bytes of code.
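
To make the byte counting concrete, here is a hedged Python sketch that builds filler of a given length out of prefixed NOPs. The function name `nop_padding` and the limit of two 66h prefixes per NOP are assumptions picked so that five bytes reproduce the sequence quoted above; the real limit depends on the CPU and on what the assembler chooses to emit.

```python
from typing import List

DATA16 = 0x66  # operand-size prefix
NOP = 0x90


def nop_padding(n_bytes: int, max_prefixes: int = 2) -> List[bytes]:
    """Split n_bytes of filler into NOP instructions, each carrying
    at most max_prefixes 66h prefix bytes (assumed limit)."""
    if n_bytes < 0 or max_prefixes < 0:
        raise ValueError("arguments must be non-negative")
    instructions = []
    remaining = n_bytes
    while remaining > 0:
        # One byte is always the 90h opcode itself.
        prefixes = min(max_prefixes, remaining - 1)
        instructions.append(bytes([DATA16] * prefixes + [NOP]))
        remaining -= prefixes + 1
    return instructions


plain = nop_padding(5, max_prefixes=0)
padded = nop_padding(5)
print(len(plain), "instructions:", b"".join(plain).hex(" "))
print(len(padded), "instructions:", b"".join(padded).hex(" "))
```

Both results are five bytes long; only the number of instructions the decoder has to handle differs.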
S
--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Isaac Jaffe
Posted by Stephen on 5/27/2009 6:38:47 PM
Stephen Sprunk wrote:
>
> However, NOP doesn't take any operands, so the 66h prefix has absolutely
> no effect. Why use it? Because this sequence:
>
> 90 NOP
> 90 NOP
> 90 NOP
> 90 NOP
> 90 NOP
>
> is five instructions, while this sequence:
>
> 66 DATA16
> 66 DATA16
> 90 NOP
> 66 DATA16
> 90 NOP
>
> is only two instructions and therefore may decode/execute faster, even
> though they're both exactly five bytes of code.
>
> S
>
The second example only requires 2 retires, assuming that
NOPs can't be completely absorbed by the decoder for some reason.
Is NOP guaranteed to take at least 1 clock for timing loops?
If it is, why does PAUSE exist?
Eric
Posted by EricP on 5/27/2009 7:15:13 PM
Stephen Sprunk wrote:
> The 66h prefix normally does exactly what the manual says, but that
> isn't so helpful in this case...
>
> The basic problem is that there aren't different instruction encodings
> for 16-bit and 32-bit operations; a code segment (or is that a page?)
> has a "default" operand size, and if you want the "other" size, you need
> the 66h prefix. For instance, in 16-bit mode a "MOV AX, BX" can be
> turned into "MOV EAX, EBX" by adding a 66h prefix. Conversely, in
> 32-bit mode a "MOV EAX, EBX" can be turned into "MOV AX, BX" by adding
> the same 66h prefix.
>
> However, NOP doesn't take any operands, so the 66h prefix has absolutely
> no effect. Why use it? Because this sequence:
NOP does in fact take a register operand: (E)AX!
90h is the opcode for XCHG AX,AX (or XCHG EAX,EAX in 32-bit mode), so
adding one or more 66h prefix bytes changes the code from 16 to 32 bit or
vice versa. :-)
OTOH, this isn't really true any longer: Even though the asm code for
XCHG EAX,EAX is indeed 90h, by the time a modern x86 cpu sees the code,
it knows that this doesn't modify anything, and therefore it can
disregard the current status of the EAX register.
I.e. NOP today is a "true No-operation", not just a regular one-byte
opcode that happens to not modify anything.
BTW, MOVing any register to itself (MOV EAX,EAX) is a two-byte NO-OP,
but unlike the 66h/90h combination it isn't special-cased by the decoder,
so it does affect the reorder buffer register pressure.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Posted by Terje on 5/27/2009 7:39:07 PM
EricP wrote:
> Stephen Sprunk wrote:
>> However, NOP doesn't take any operands, so the 66h prefix has
>> absolutely no effect. Why use it? Because this sequence:
>>
>> 90 NOP
>> 90 NOP
>> 90 NOP
>> 90 NOP
>> 90 NOP
>>
>> is five instructions, while this sequence:
>>
>> 66 DATA16
>> 66 DATA16
>> 90 NOP
>> 66 DATA16
>> 90 NOP
>>
>> is only two instructions and therefore may decode/execute faster, even
>> though they're both exactly five bytes of code.
>
> The second example only requires 2 retires, assuming that
> NOPs can't be completely absorbed by the decoder for some reason.
>
> Is NOP guaranteed to take at least 1 clock for timing loops?
Nope. It can be absorbed entirely by the decoder -- but modern decoders
in particular will be able to swallow the second sequence one clock
faster than they can absorb the first.
(Even if the NOP doesn't actually get absorbed, it would still race to
the head of the OoO queues since it has no dependencies, which means it
still won't consume any "real" clocks in most circumstances.)
> If it is, why does PAUSE exist?
PAUSE is another name for REP NOP. Prior to the Intel P4, it is
equivalent to NOP; on the P4 (and, I assume, later Intel chips), PAUSE
causes "a finite delay" in execution. It is recommended over NOP
specifically for spin locks because:
1. One hardware thread spinning on NOP instructions could starve another
hardware thread on the same core from the resources it needs to break
the spin lock.
2. Spinning on NOPs consumes a lot of power and generates a lot of heat
for no real purpose.
The "finite delay" is allowed to be zero clocks if there are no other
hardware threads running on that core or SMT is disabled, giving the
same behavior as pre-P4 CPUs. OTOH, it may still be more than zero
clocks as part of a power-saving strategy. And it can vary, so it's not
useful for timing loops...
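
The encodings discussed in this thread differ only in their prefix bytes: 90h is NOP, 66h 90h is still a NOP, and F3h 90h (REP NOP) is PAUSE. A minimal Python sketch that labels such a byte stream follows. The function name `split_nop_stream` is hypothetical, and the code deliberately understands nothing except 66h/F3h prefixes followed by 90h.

```python
from typing import List, Tuple

PREFIXES = (0x66, 0xF3)  # operand-size prefix, REP prefix
NOP_OPCODE = 0x90


def split_nop_stream(code: bytes) -> List[Tuple[bytes, str]]:
    """Split filler bytes into (instruction bytes, mnemonic) pairs.

    Sketch only: every instruction must be zero or more 66h/F3h
    prefixes followed by the 90h opcode.
    """
    result = []
    i = 0
    while i < len(code):
        start = i
        while i < len(code) and code[i] in PREFIXES:
            i += 1
        if i >= len(code) or code[i] != NOP_OPCODE:
            raise ValueError("only prefixes followed by 90h are understood")
        i += 1
        insn = code[start:i]
        # An F3h prefix turns the NOP into PAUSE (REP NOP).
        name = "pause" if 0xF3 in insn[:-1] else "nop"
        result.append((insn, name))
    return result


for insn, name in split_nop_stream(bytes.fromhex("6666906690f390")):
    print(f"{insn.hex(' '):<10} {name}")
```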
S
--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Isaac Jaffe
Posted by Stephen on 5/29/2009 5:50:31 PM