Consider the following pseudo-code function:
function myAND(float a, float b)
if a = 0 then test_a = 0 else test_a = 1
if b = 0 then test_b = 0 else test_b = 1
result = test_a AND test_b
Or, in "normal" language, zero stays zero,
anything else becomes 1.
Compute the logical "and" of these two booleans.
(i.e. return "[0|1] and [0|1]").
All numbers are 64bit floats.
Careful (sort of) thinking led to the
following FPU code:
fld1 ; load 0 and 1 for later...
fldz
fld1
fldz
fld qword [someplace] ; get first value
fcomip st1 ; compare to 0 and "forget" it
fcmovne st0,st1 ; st0 = 0 if [someplace] = 0, 1 otherwise
fxch ; cleanup
fstp st0
fld1
fldz
fld qword [someplace + 8] ; idem for value 2
fcomip st1
fcmovne st0,st1
fxch
fstp st0
; ..."later" starts here:
fcomip st1 ; compare the 0/1 of values 1 and 2
fcmovne st0,st1 ; if they differ the result is 0, else it is their common value
fxch st0,st2 ; heavy cleaning
fcompp
; final result in st0
Now, this code has two rather interesting properties:
1. it works :)
2. it has no "jmp"s
However, it sort of looks sub-optimal,
and, well, I was hoping that someone here
would see some improvement to it.
(Less code, something faster, ...)
FWIW, the final code will have to run on P3s.
Any comments and suggestions welcome
cu
P
[posted by PatD, 7/7/2004 7:32:08 PM]

"PatD" <spamtrap@crayne.org> wrote in message
news:26ioe0tfm7p15gamvr3gvb6t51poc9vdn9@4ax.com...
>
> Consider the following pseudo-code function:
>
> function myAND(float a, float b)
> if a = 0 then test_a = 0 else test_a = 1
> if b = 0 then test_b = 0 else test_b = 1
> result = test_a AND test_b
That looks like:
int myAND(double a, double b) {
return (a && b);
}
...
> Careful (sort of) thinking led to the
> following FPU code:
...
> Now, this code has two rather interesting properties:
> 1. it works :)
> 2. it has no "jmp"s
Here's what gcc gives me:
myAND:
pushl %ebp
movl %esp,%ebp
subl $4,%esp
movl $0,-4(%ebp)
fldl 8(%ebp)
fldz
fucompp
fnstsw %ax
andb $69,%ah
cmpb $64,%ah
je .L2
fldl 16(%ebp)
fldz
fucompp
fnstsw %ax
andb $69,%ah
cmpb $64,%ah
je .L2
movl $1,-4(%ebp)
.L2:
movl -4(%ebp),%eax
jmp .L1
.p2align 4,,7
.L1:
leave
ret
Now, it does have a few jumps in there (one of which seems unnecessary), but
it has 17 instructions compared to your 21, and it splits the load across
the ALUs as well as the FPUs. The jumps could probably be eliminated by
removing C's short-circuit evaluation or by using CMOVcc.
> FWIW, the final code will have to run on P3s.
That knocks out SSE2 (SSE1 is 32-bit FP only), which would have been my
other suggestion.
S
--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
[posted by spamtrap2, 7/7/2004 8:49:11 PM]

PatD <spamtrap@crayne.org> wrote:
>
> function myAND(float a, float b)
> if a = 0 then test_a = 0 else test_a = 1
> if b = 0 then test_b = 0 else test_b = 1
> result = test_a AND test_b
>
> Or, in "normal" language, zero stays zero, anything
> else becomes 1. Compute the logical "and" of these
> two booleans. (i.e. return "[0|1] and [0|1]").
> All numbers are 64bit floats.
As a general rule, floats should not be tested
for equality, but rather for sufficiently small
differences. floats are "dirty" numbers.
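As a rough illustration of "sufficiently small differences", a tolerance compare might look like the sketch below. The helper name `nearly_equal` and the choice of an absolute tolerance are assumptions for the example; a suitable tolerance depends on the magnitudes involved.

```c
/* Hypothetical helper: true if a and b differ by at most eps. */
int nearly_equal(double a, double b, double eps)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;          /* absolute difference without needing libm */
    return diff <= eps;
}
```

For example, `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` holds even though the two sides are not guaranteed to be bit-identical.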
If you really want to test for zero equality,
the fastest way is with the regular ALU because
zero is stored as [surprise] 64 zero bits. So:
mov eax, [a]
and eax, [a+4]
jz num_is_zero
mov [a], low_wd_of_fl1 # something like 00000000h
mov [a+4], hi_wd_of_fl1 # something like 3FF00000h
num_is_zero: # could be optimized for other case
-- Robert
[posted by Robert, 7/7/2004 9:45:07 PM]

"PatD" <spamtrap@crayne.org> wrote in message
news:26ioe0tfm7p15gamvr3gvb6t51poc9vdn9@4ax.com...
>
> Consider the following pseudo-code function:
>
> function myAND(float a, float b)
> if a = 0 then test_a = 0 else test_a = 1
> if b = 0 then test_b = 0 else test_b = 1
> result = test_a AND test_b
>
> Or, in "normal" language, zero stays zero,
> anything else becomes 1.
> Compute the logical "and" of these two booleans.
> (i.e. return "[0|1] and [0|1]").
<snip>
Along the lines of what Robert was thinking:
mov eax, [a+4]
mov edx, [b+4]
or eax, [a]
or edx, [b]
cmp eax, 1
setnc al
cmp edx, 1
setnc dl
and al, dl
If you want to handle -0.0 correctly, then you can modify the first two
lines like so:
mov eax, 0x7FFFFFFF
mov edx, 0x7FFFFFFF
and eax, [a+4]
and edx, [b+4]
There is no need to use the (slow) FPU when testing floating-point numbers
against 0. The standard optimization trick is to load the data into integer
registers and check against integer 0. On an Athlon, the above code should
retire in 10 cycles. Pentium-III should be very similar. Add 2 cycles to
handle -0.0.
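The same idea can be sketched in C, assuming `double` is a 64-bit IEEE-754 value. The helper names are hypothetical, and the sign-bit mask corresponds to the -0.0 variant above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 1 if x is neither +0.0 nor -0.0, judged from its bits. */
static int is_nonzero_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the 64-bit pattern */
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);    /* drop the sign bit so -0.0 counts as zero */
    return bits != 0;
}

/* Hypothetical wrapper: logical AND of the two 0/1 flags. */
int my_and_bits(double a, double b)
{
    return is_nonzero_bits(a) & is_nonzero_bits(b);
}
```

No floating-point comparison is involved; everything happens on the integer bit pattern.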
If you can cope with inverted logic, this is perhaps slightly better on
Intel processors:
mov eax, [a+4]
mov edx, [b+4]
or eax, [a]
or edx, [b]
cmp eax, 1
sbb eax, eax
cmp edx, 1
sbb edx, edx
or eax, edx
This code computes the following:
eax = (a != 0.0 && b != 0.0 ? 0 : -1);
One will immediately recognize that this is just De Morgan's Law, so the
computation is now a == 0.0 || b == 0.0. It avoids the use of partial
registers. This doesn't affect AMD chips, but for a Pentium-III processor it
may be faster.
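In C terms, the inverted result could be sketched as below. The function name is hypothetical, and this version uses ordinary floating-point compares rather than integer loads, so unlike the unmasked assembly it also treats -0.0 as zero.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the cmp/sbb pattern:
   returns 0 if both inputs are nonzero, -1 otherwise. */
int32_t either_zero_mask(double a, double b)
{
    int32_t mask_a = -(int32_t)(a == 0.0);   /* -1 if a is zero, else 0 */
    int32_t mask_b = -(int32_t)(b == 0.0);   /* -1 if b is zero, else 0 */
    return mask_a | mask_b;                  /* De Morgan: !(x && y) == (!x || !y) */
}
```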
-Matt
[posted by Matt, 7/7/2004 10:54:53 PM]

> Consider the following pseudo-code function:
>
> function myAND(float a, float b)
> if a = 0 then test_a = 0 else test_a = 1
> if b = 0 then test_b = 0 else test_b = 1
> result = test_a AND test_b
>
> Or, in "normal" language, zero stays zero,
> anything else becomes 1.
> Compute the logical "and" of these two booleans.
> (i.e. return "[0|1] and [0|1]").
>
> All numbers are 64bit floats.
Hm. What about
float_1 DQ +1.0
pxor mm7, mm7
movq mm0, [someplace]
movq mm1, [someplace + 8]
pcmpeqd mm0, mm7 ; each dword becomes all 1s if it is zero
pcmpeqd mm1, mm7
pshufw mm2, mm0, 4eh ; exchange dwords
pshufw mm3, mm1, 4eh
pand mm0, mm2 ; mm0 now is set to 1s if a is zero, to 0s otherwise
pand mm1, mm3 ; same about mm1 and b
por mm0, mm1 ; mm0 is set to 1s if either a or b was zero
; and to 0s otherwise
pandn mm0, [float_1] ; (not mm0) and 1.0: 1.0 if both were non-zero
movq [retval], mm0
emms
Fast enough and doesn't use FPU at all!
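The final masking step, which turns an all-ones/all-zeros mask into 1.0 or 0.0 by combining it with the bit pattern of 1.0, might be sketched in C as follows. It assumes a 64-bit IEEE-754 `double`, and the function name is made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1.0 if both inputs are nonzero, 0.0 otherwise. */
double my_and_as_double(double a, double b)
{
    const double one = 1.0;
    uint64_t one_bits, mask, out_bits;
    double result;

    memcpy(&one_bits, &one, sizeof one_bits);   /* bit pattern of 1.0 */

    /* All ones if both inputs are nonzero, all zeros otherwise. */
    mask = (uint64_t)0 - (uint64_t)((a != 0.0) & (b != 0.0));

    out_bits = mask & one_bits;                 /* keeps 1.0 or leaves 64 zero bits */
    memcpy(&result, &out_bits, sizeof result);
    return result;
}
```

Since 0.0 is stored as 64 zero bits, an all-zeros mask yields exactly 0.0.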
--
Ivan
[posted by Ivan, 7/7/2004 10:55:04 PM]

> zero is stored as [surprise] 64 zero bits. So:
Right. so,
> mov eax, [a]
> and eax, [a+4]
This should be 'or', not 'and'.
========================================================================
(Mr.) IIJIMA Hiromitsu, mailto:delmonta@ht.sakura.ne.jp
aka Delmonta http://www.ht.sakura.ne.jp/~delmonta/
[posted by IIJIMA, 7/8/2004 2:02:15 AM]

IIJIMA Hiromitsu <spamtrap@crayne.org> wrote:
> Right. so,
>
>> mov eax, [a]
>> and eax, [a+4]
>
> This should be 'or', not 'and'.
Ooops. Thanks for the correction!
-- Robert
[posted by Robert, 7/8/2004 12:37:53 PM]

In a land long ago, in a time far away
PatD <spamtrap@crayne.org> wrote:
<unrelated note>
Actually, when I see lots of different names,
but all with the very same address,
that's spam for me :)
</unrelated note>
[cut]
>All numbers are 64bit floats.
>
>
>Careful (sort of) thinking led to the
>following FPU code:
[cut]
>
>Any comments and suggestions welcome
[insert interesting bits of all other replies prior to this one here]
(Robert, Matt, others... (too lazy))
Indeed, zero is zero!
I have this now on a sticker on my monitor until it sinks in :)
That's what happens, I guess, when you write too much FPU code
in a row...
You're, sort of, not seeing any alternatives anymore.
Fortunately this group has it all :)
Thanks a lot for all the suggestions, corrections, code,
you name it, ...
Now, both ALUs and the FPU busy all at the same time...
Sounds like a plan :)
Off to my editor,
Some notes:
- MMX _does_ use the FPU! :)
- Floats are "dirty"? Yep, sorta.
However, "0.0" exactly and "anything else", the point being on
"anything", is "good enough" for my project.
- Any "NASM" experts here?
Macros have a label-is-local-for-every-call feature.
Is there a way to get this label auto-numbering outside
of a macro?
- Why do I need the floats?
I'm writing a compiler for a language whose only variable types
are "number" and "string"... :)
Yours
P
[posted by PatD, 7/8/2004 5:41:37 PM]

PatD <spamtrap@crayne.org> wrote:
> - Floats are "dirty"? Yep, sorta. However, "0.0" exactly
> and "anything else", the point being on "anything" is
> "good enough" for my project.
Someone said: "Floats are like sandpiles. Anytime you do
anything with them, you lose some sand and pick up some dirt."
-- Robert
[posted by Robert, 7/8/2004 6:24:16 PM]

> - MMX _does_ use the FPU! :)
It shares registers with FPU. (Actually, since all execution units deal
with aliased register pool, it does not matter). But that doesn't involve
using FP units which are generally slower than integer. Aren't they?
--
Ivan
[posted by Ivan, 7/8/2004 7:17:38 PM]

"Ivan Korotkov" <koroNOSPAMtkov2@ztelDOT.ru> wrote in message
news:opsat18yvw75ztla@news.zebratelecom.ru...
> > - MMX _does_ use the FPU! :)
>
> It shares registers with FPU. (Actually, since all execution units deal
> with aliased register pool, it does not matter). But that doesn't involve
> using FP units which are generally slower than integer. Aren't they?
Doing what? On x86 CPUs, FP multipliers are a lot faster (and a lot more
complex) than integer multipliers. In the Pentium-IV Intel even reuses the
FP multiplier for integer multiplies. On Athlon, a dword-precision integer
multiply takes 4 cycles. Extended-precision is 6 cycles. A full 64-bit FP
multiply takes 4 cycles. This does not hold for add/sub vs. fadd/fsub.
-Matt
[posted by Matt, 7/8/2004 11:13:32 PM]

> Doing what?
Well, the code we were talking about.
> On x86 CPUs, FP multipliers are a lot faster (and a lot more
> complex) than integer multipliers. In the Pentium-IV Intel even reuses
> the
> FP multiplier for integer multiplies. On Athlon, a dword-precision
> integer
> multiply takes 4 cycles. Extended-precision is 6 cycles. A full 64-bit FP
> multiply takes 4 cycles. This does not hold for add/sub vs. fadd/fsub.
You mean that such instructions like fcmp, fcmove, fld, etc. are faster
than movq, pand, pshufw, etc.? I thought that (at least on P4) MMX is
always faster than FPU.
--
Ivan
[posted by Ivan, 7/9/2004 5:57:41 PM]

"Ivan Korotkov" <koroNOSPAMtkov2@ztelDOT.ru> wrote in message
news:opsavsrogr75ztla@news.zebratelecom.ru...
> > Doing what?
>
> Well, the code we were talking about.
>
> > On x86 CPUs, FP multipliers are a lot faster (and a lot more
> > complex) than integer multipliers. In the Pentium-IV Intel even reuses
> > the FP multiplier for integer multiplies. On Athlon, a dword-precision
> > integer multiply takes 4 cycles. Extended-precision is 6 cycles. A full
> > 64-bit FP multiply takes 4 cycles. This does not hold for add/sub vs.
> > fadd/fsub.
>
> You mean that such instructions like fcmp, fcmove, fld, etc. are faster
> than movq, pand, pshufw, etc.? I thought that (at least on P4) MMX is
> always faster than FPU.
I had actually meant that fmul is faster than imul/mul, not realizing that
you were talking about MMX. What was said about MMX using the FPU is true --
they share a lot of logic, even more so between SSE and MMX. Instructions
like andps & andnps probably execute internally as a pair of pand & pandn
instructions.
On Pentium-IV, it so happens that mulss (7 clk) is 1-2 cycles faster than
MMX multiply instructions. On Athlon, the latter are 1 cycle faster than the
former. They might as well be the same; the FP multiplier is very big, so it
gets shared between the two. Many instructions (e.g. shufps/shufpd/pshufd)
are only distinct because the FP register file, which is shared between x87,
MMX, and SSE, keeps track of the type of data held by each register. The
only operation I can think of that is completely separate is
addition/subtraction. MMX additionally has shifts, and there is no direct
analog for FP data, though fscale is similar.
-Matt
[posted by Matt, 7/10/2004 8:40:47 AM]

> I had actually meant that fmul is faster than imul/mul, not realizing
> that
> you were talking about MMX. What was said about MMX using the FPU is true
> -- they share a lot of logic, even more so between SSE and MMX.
> Instructions
> like andps & andnps probably execute internally as a pair of pand & pandn
> instructions.
But I hope that that example in calculating ((a != 0) && (b != 0)) should
be still faster when implemented with MMX rather than FPU. Moreover, SISD
FPU instructions should execute a little slower than SIMD SSE, isn't it so?
P.S. Matt, you seem to be a real instruction timing and optimization pro.
Is that your interest area? ;)
--
Ivan
[posted by Ivan, 7/10/2004 2:41:46 PM]

"Ivan Korotkov" <koroNOSPAMtkov2@ztelDOT.ru> wrote in message
news:opsaxd7lrk75ztla@news.zebratelecom.ru...
> > I had actually meant that fmul is faster than imul/mul, not realizing
> > that you were talking about MMX. What was said about MMX using the FPU
> > is true -- they share a lot of logic, even more so between SSE and MMX.
> > Instructions like andps & andnps probably execute internally as a pair
> > of pand & pandn instructions.
>
> But I hope that that example in calculating ((a != 0) && (b != 0)) should
> be still faster when implemented with MMX rather than FPU. Moreover, SISD
> FPU instructions should execute a little slower than SIMD SSE, isn't it
> so?
Both SIMD SSE and MMX will have the SIMD advantage over x87/SISD SSE on
parallelizable code. SSE isn't going to be very helpful because Pentium-III
did not support SSE 2. I suppose, though, a comisd instruction could be
approximated using only SSE. MMX is likely much faster in that case. On K8
and Pentium-M, the comisd instruction should make SISD SSE superior. The
comisd instruction takes 1 (!) cycle on Pentium-M and 4-5 on K8.
Hmm...using just SSE 1 with cmpps may be a tad slower than MMX, but it
should be pretty close on most CPUs. I don't have a reference for
Pentium-III, so I'm going off of my Pentium-M reference -- should be the
same though. It looks like the pcmp* instructions take 1 cycle while cmpps
takes 3; however, it also takes multiple MMX instructions to do the same
thing cmpneqps does, so that may stand in SSE's favor. SSE may end up being
faster only because there is no need for an emms instruction. ;-)
> P.S. Matt, you seem to be a real instruction timing and optimization pro.
> Is that your interest area? ;)
I spent a little over a year playing with Athlon and doing assembly
optimization for it. I used to have a print-out on my desk with the
instruction timings, but eventually I memorized the more common instructions
since I was always looking them up. I've studied Pentium and Pentium-IV
also, but I know Athlon far better than anything else.
Not much has changed lately, so there isn't really anything new to know...
-Matt
[posted by Matt, 7/15/2004 2:20:01 PM]