Posted something similar before, but now I'm convinced there is either
a compiler or a FP bug:
#include <cstdio>   // puts
#include <cstdlib>  // atof

int main()
{
    float y1 = float(atof("-6.20001"));
    float y2 = float(atof("-6.20001"));
    if (y2 < y1)
        puts("yes");
    else
        puts("no");
    if (y2 < y1)
        puts("yes");
    else
        puts("no");
}
In debug mode, this prints "no no" as expected. But in release mode,
it prints "yes no"!?!
That means that essentially the same comparison, i.e. y2 < y1, is
giving two different results, even though both values should have
exactly the same bit pattern.
The assembly code is as follows:
6: float y1 = float(atof("-6.20001"));
push offset string "-6.20001" (00409038)
call _atof (004010e7)
fstp dword ptr [esp+8]
7: float y2 = float(atof("-6.20001"));
push offset string "-6.20001" (00409038)
call _atof (004010e7)
fst dword ptr [esp+8]
9: if (y2 < y1)
fcomp dword ptr [esp+0Ch]
add esp,8
fnstsw ax
test ah,1
je main+34h (00401034)
10: puts("yes");
push offset string "yes" (00409034)
11: else
jmp main+39h (00401039)
12: puts("no");
push offset string "no" (00409030)
call _puts (00401080)
14: if (y2 < y1)
fld dword ptr [esp+4]
fcomp dword ptr [esp+8]
add esp,4
fnstsw ax
test ah,1
je main+61h (00401061)
15: puts("yes");
push offset string "yes" (00409034)
16: else
etc.etc.
The code for y2 < y1 seems to be different in the second case. I'm not
familiar enough with the x87 instruction set to determine what the
difference means, but it's obviously significant.
If you insert some code that manipulates y2 before the first
comparison, then the result is correct.
Likewise if you change everything to use doubles, the result is fine
(which is what I've done to fix the problem).
Just wondering if anyone knew whether this was a VC (6.0) compiler
problem, or an actual chip bug?
Dylan
wizofaus | 11/10/2003 10:42:26 PM

In comp.lang.asm.x86 Dylan Nicholson <wizofaus@hotmail.com> wrote:
> Posted something similar before, but now I'm convinced there is either
> a compiler or a FP bug:
> int main()
> {
> float y1 = float(atof("-6.20001"));
> float y2 = float(atof("-6.20001"));
> if (y2 < y1)
> puts("yes");
> else
> puts("no");
> if (y2 < y1)
> puts("yes");
> else
> puts("no");
> }
> In debug mode, this prints "no no" as expected. But in release mode,
> it prints "yes no"!?!
> That means that essentially exactly the same comparison, ie, y2 < y1,
> is giving two different results, even though both values should have
> exactly the same bit pattern.
> The assembly code is as follows:
> 6: float y1 = float(atof("-6.20001"));
> push offset string "-6.20001" (00409038)
> call _atof (004010e7)
> fstp dword ptr [esp+8]
> 7: float y2 = float(atof("-6.20001"));
> push offset string "-6.20001" (00409038)
> call _atof (004010e7)
> fst dword ptr [esp+8]
I don't get this exact code when I compile your program with VC(7). C compilers usually
address stack variables through the EBP register instead of ESP! Are you using some weird flags?
> 9: if (y2 < y1)
> fcomp dword ptr [esp+0Ch]
> add esp,8
> fnstsw ax
> test ah,1
> je main+34h (00401034)
This is where the problem is. atof() returns a double, not a
float. However, this code compares the *double* value of y2 (which is
left on the coprocessor stack by atof) with the *float* value of
y1. Why is this important? Because -6.20001 cannot be represented
exactly as a float or a double.
The IEEE binary representation of -6.20001 as a float is C0 C6 66 7B, which represents the
number -6.200009823. The double representation of -6.20001 is C0 18 CC CF 6B E3 7D E9,
or -6.2000099999999997991. The double is slightly more negative than the float, so
y2 < y1 is true in this comparison.
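To make the two representations concrete, here is a minimal C++ sketch (not part of the original program) that prints both bit patterns and repeats the mixed double-versus-float comparison. The helper names float_bits, double_bits, unrounded_less_than_rounded and show are made up for illustration, and the sketch assumes IEEE 754 float and double with the same byte order for integers and floating-point values.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Raw bit pattern of a float (assumes a 32-bit IEEE 754 float).
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Raw bit pattern of a double (assumes a 64-bit IEEE 754 double).
std::uint64_t double_bits(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// Mimics the first comparison in the release build: y2 is still the
// unrounded double, while y1 has already been rounded to float.
bool unrounded_less_than_rounded(double y2_wide, float y1) {
    return y2_wide < static_cast<double>(y1);
}

// Prints both representations of the parsed text and the outcome of
// the mixed comparison.
void show(const char* text) {
    double wide = std::atof(text);
    float narrow = static_cast<float>(wide);
    std::printf("float  %08lX = %.10g\n",
                static_cast<unsigned long>(float_bits(narrow)),
                static_cast<double>(narrow));
    std::printf("double %016llX = %.20g\n",
                static_cast<unsigned long long>(double_bits(wide)), wide);
    std::printf("double < float: %s\n",
                unrounded_less_than_rounded(wide, narrow) ? "yes" : "no");
}
```

Calling show("-6.20001") from main should report the two hex patterns quoted above on an IEEE 754 platform, and "yes" for the mixed comparison.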
> 10: puts("yes");
> push offset string "yes" (00409034)
> 11: else
> jmp main+39h (00401039)
> 12: puts("no");
> push offset string "no" (00409030)
> call _puts (00401080)
> 14: if (y2 < y1)
> fld dword ptr [esp+4]
> fcomp dword ptr [esp+8]
> add esp,4
> fnstsw ax
> test ah,1
> je main+61h (00401061)
Here both operands are the float values read back from memory, so
the comparison comes out as expected.
> 15: puts("yes");
> push offset string "yes" (00409034)
> 16: else
> etc.etc.
> The code for y2 < y1 seems to be different in the second case. I'm not
> familiar enough with the x87 instruction set to determine what the
> difference means, but it's obviously significant.
> If you insert some code that manipulates y2 before the first
> comparison, then the result is correct.
This is because the compiler then must reload the float value of y2 onto
the coprocessor stack.
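The same reload can be requested explicitly. The following is only a sketch of one common workaround, not something taken from the original program: the helper names store_as_float and less_after_store are hypothetical, and the sketch relies on the compiler honouring the volatile store so that the value is rounded to 32 bits in memory before it is compared.

```cpp
#include <cstdlib>

// Rounds a wide value to float and forces the result through memory,
// so that later code sees the stored 32-bit value rather than a wider
// value still sitting in an FPU register.
float store_as_float(double wide) {
    volatile float stored = static_cast<float>(wide);
    return stored;
}

// Parses both strings, rounds each to float via memory, then evaluates
// y2 < y1 on the rounded values.
bool less_after_store(const char* y1_text, const char* y2_text) {
    float y1 = store_as_float(std::atof(y1_text));
    float y2 = store_as_float(std::atof(y2_text));
    return y2 < y1;
}
```

With both arguments set to "-6.20001" the two floats are identical, so less_after_store returns false regardless of how the comparison itself is compiled.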
> Likewise if you change everything to use doubles, the result is fine
> (which is what I've done to fix the problem).
Here all comparisons are done as doubles, so there is no problem. Actually, there is
usually little point in using float over double. float operations are not faster than
double on the x86 chip. The only exception I can think of is for programs that
deal with LOTS of floating point data where double precision is not required
and the memory size difference between float and double becomes significant.
> Just wondering if anyone knew whether this was a VC (6.0) compiler
> problem, or an actual chip bug?
> Dylan
--
Paul Carter (http://www.drpaulcarter.com/pcasm)
pacman128 | 11/11/2003 1:15:30 AM

<pacman128@hotmail.com> wrote in message
news:SKWrb.3201$vJ3.2145@okepread05...
> Here all comparisons are done as doubles, so there is no problem. Actually, there is
> usually little point in using float over double. float operations are not faster than
> double on the x86 chip. The only exception I can think of is for programs that
> deal with LOTS of floating point data where double precision is not required
> and the memory size difference between float and double becomes significant.
The above may be the case for C programs, but Fortran or assembly
programs don't require intermediate values in float calculations
to be converted to double, so it would seem to me that it would
be possible at least in principle to execute float calculations
twice as fast as double calculations using the SIMD capabilities
available on x86 chips. Of course this requires the variables to
be properly aligned and you or the compiler have to set things up
so that the desired operations can actually be carried out in
parallel... lots of the time this just isn't possible.
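As a rough illustration of the "twice as fast in principle" argument, here is a small C++ sketch using the SSE and SSE2 intrinsics. It is not from the thread: the function names add4_float and add2_double are made up, it only builds on x86 targets with SSE2, and it uses unaligned loads and stores so that the alignment requirement mentioned above does not apply to this particular sketch.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE ones

// One 128-bit register holds four 32-bit floats, so a single packed
// add produces four sums.
void add4_float(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);   // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// The same register holds only two 64-bit doubles, so a single packed
// add produces two sums.
void add2_double(const double* a, const double* b, double* out) {
    __m128d va = _mm_loadu_pd(a);  // unaligned load of 2 doubles
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

Whether the float version is actually twice as fast depends on the processor and on memory traffic; the sketch only shows that each packed instruction covers twice as many float elements as double elements.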
--
write(*,*) transfer((/17.392111325966148d0,3.6351694777236872d228, &
6.0134700169991705d-154/),(/'x'/)); end
James | 11/11/2003 2:43:56 AM

On Tue, 11 Nov 2003 02:43:56 GMT, "James Van Buskirk"
<not_valid@comcast.net> wrote in comp.lang.asm.x86:
> <pacman128@hotmail.com> wrote in message
> news:SKWrb.3201$vJ3.2145@okepread05...
>
> > Here all comparisons are done as doubles, so there is no problem. Actually, there is
> > usually little point in using float over double. float operations are not faster than
> > double on the x86 chip. The only exception I can think of is for programs that
> > deal with LOTS of floating point data where double precision is not required
> > and the memory size difference between float and double becomes significant.
>
> The above may be the case for C programs, but Fortran or assembly
> programs don't require intermediate values in float calculations
> to be converted to double, so it would seem to me that it would
> be possible at least in principle to execute float calculations
> twice as fast as double calculations using the SIMD capabilities
> available on x86 chips. Of course this requires the variables to
> be properly aligned and you or the compiler have to set things up
> so that the desired operations can actually be carried out in
> parallel... lots of the time this just isn't possible.
C programs have not required that floating point operations be carried
out in double precision since the original 1989 ANSI standard. The
cause of the problem, as pacman128 already said, is the fact that the
Intel x86 coprocessor/FPU always keeps internal values in either 64-bit
or 80-bit precision, regardless of the language in use.
Unless the compiler for a language never dumps to RAM, or always dumps
all values to RAM, this will happen in every language.
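Later C and C++ standards (C99 and C++11, so after the compilers discussed in this thread) expose this behaviour through the FLT_EVAL_METHOD macro. The sketch below is an assumption-laden illustration rather than anything from the posts: the function names eval_method and describe_eval_method are made up, and the descriptions paraphrase the standard's meanings for the values 0, 1 and 2.

```cpp
#include <cfloat>

// Reports how this implementation says it evaluates floating-point
// expressions (FLT_EVAL_METHOD comes from <cfloat>).
int eval_method() {
    return FLT_EVAL_METHOD;
}

// Human-readable meaning of an FLT_EVAL_METHOD value.
const char* describe_eval_method(int method) {
    switch (method) {
        case 0:  return "each operation is evaluated in its own type";
        case 1:  return "float and double are evaluated as double";
        case 2:  return "everything is evaluated as long double";
        default: return "indeterminable or implementation-defined";
    }
}
```

A compiler that targets the x87 unit with extended-precision registers would typically report 2, while one that uses SSE2 for scalar arithmetic would typically report 0.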
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq
Jack | 11/11/2003 4:30:45 AM

"Jack Klein" <jackklein@spamcop.net> wrote in message
news:ubp0rvc46njpd42dbgj0m64k07rm934ais@4ax.com...
> C programs have not required that floating point operations be carried
> out in double precision since the original 1989 ANSI standard.
Cool. Learned something new here... don't follow C much...
> The cause of the problem, as pacman128 already said, is the fact that the
> Intel x86 coprocessor/FPU always keeps internal values in either 64-bit
> or 80-bit precision, regardless of the language in use.
> Unless the compiler for a language never dumps to RAM, or always dumps
> all values to RAM, this will happen in every language.
Now, this last paragraph isn't true: some compilers are smart
enough to use SSE, SSE2, or 3DNow!. ifl can use SSE2, for example.
--
write(*,*) transfer((/17.392111325966148d0,3.6351694777236872d228, &
6.0134700169991705d-154/),(/'x'/)); end
James | 11/11/2003 8:46:34 AM

"James Van Buskirk" <not_valid@comcast.net> wrote in message
news:Kl1sb.122987$9E1.612128@attbi_s52...
> "Jack Klein" <jackklein@spamcop.net> wrote in message
> news:ubp0rvc46njpd42dbgj0m64k07rm934ais@4ax.com...
>
> > C programs have not required that floating point operations be carried
> > out in double precision since the original 1989 ANSI standard.
>
> Cool. Learned something new here... don't follow C much...
>
> > The cause of the problem, as pacman128 already said, is the fact that the
> > Intel x86 coprocessor/FPU always keeps internal values in either 64-bit
> > or 80-bit precision, regardless of the language in use.
>
> > Unless the compiler for a language never dumps to RAM, or always dumps
> > all values to RAM, this will happen in every language.
>
> Now, this last paragraph isn't true: some compilers are smart
> enough to use SSE, SSE2, or 3DNow!. ifl can use SSE2, for e.g.
Compilers will use SSE and SSE2 for scalar operations. Intel worked very
hard to build a compiler that would automatically vectorize loops, but I
don't think anything ever came out of it (i.e. it was heavily special-cased
and not very useful in general).
Visual C++ has the /Op option that guarantees precision at the cost of some
performance. The compiler will force FP operands back to memory to clip them
to whatever precision the user is working in. The introduction of /arch:SSE
and /arch:SSE2 will hopefully fix this red herring completely.
It is also untrue that the FP unit only works in 64-bit and 80-bit modes.
The control word can be set to 32-bit mode as well. The C runtime library
will initialize it to 64 bits by default. As was said earlier in the thread,
most x87 ops execute at the same speed regardless. The exceptions are
functions which utilize iterative algorithms: transcendental functions,
division, and square root.
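For anyone who wants to check which mode is active, the precision-control field lives in bits 8 and 9 of the x87 control word. The sketch below is illustrative only: the function names precision_from_control_word and x87_precision_bits are made up, the inline assembly is GCC/Clang syntax for x86 (other targets simply get 0), and the result is expressed as significand width, so the 32-bit, 64-bit and 80-bit modes mentioned above show up as 24, 53 and 64 bits.

```cpp
// Decodes the precision-control field (bits 8-9) of an x87 control word
// into a significand width: 24 (single), 53 (double) or 64 (extended).
int precision_from_control_word(unsigned short control_word) {
    switch ((control_word >> 8) & 3) {
        case 0:  return 24;  // single precision
        case 2:  return 53;  // double precision
        case 3:  return 64;  // extended precision
        default: return 0;   // encoding 01 is reserved
    }
}

// Reads the live x87 control word where that is possible.
// Returns 0 when the target or compiler is not supported by this sketch.
int x87_precision_bits() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    unsigned short control_word = 0;
    __asm__ __volatile__("fnstcw %0" : "=m"(control_word));
    return precision_from_control_word(control_word);
#else
    return 0;
#endif
}
```

A control word of 0x027F decodes to 53 bits (double precision) and 0x037F decodes to 64 bits (extended precision).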
-Matt
Matt | 11/11/2003 10:49:40 AM

"Matt Taylor" <para@tampabay.rr.com> wrote in message news:893sb.85182$jW5.911816@twister.tampabay.rr.com...
[...]
> Compilers will use SSE and SSE2 for scalar operations. Intel worked very
> hard to build a compiler that would automatically vectorize loops, but I
> don't think anything ever came out of it. (i.e. it was heavily special-cased
> and not very useful in general.)
Actually the vectorization in the Intel compilers works reasonably well.
Vectorizing isn't easy, and even compilers for dedicated vector computers
have a hard time vectorizing every loop that could be vectorized by hand.
Usually there are tight restrictions on which loops can be vectorized in
all such compilers. In addition, SSE(2) is limited to vectorization of unit
stride loops as there is no scatter/gather logic.
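To illustrate the unit-stride restriction, here are two hypothetical loops in C++ (the function names scale_unit_stride and scale_strided are made up for this sketch). The first touches consecutive elements, which maps directly onto packed SSE loads and stores; the second skips through memory, which SSE cannot load or store as one packed operation without extra shuffling. Whether a given compiler actually vectorizes either loop is not something this sketch can promise.

```cpp
#include <cstddef>

// Unit stride: elements x[0], x[1], x[2], ... are adjacent in memory,
// so four of them can be handled by one packed load, multiply and store.
void scale_unit_stride(float* x, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= k;
    }
}

// Non-unit stride: only every `stride`-th element is touched, so the
// operands are scattered and would need gather/scatter support.
// `stride` must be nonzero.
void scale_strided(float* x, std::size_t n, std::size_t stride, float k) {
    for (std::size_t i = 0; i < n; i += stride) {
        x[i] *= k;
    }
}
```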
On 2002-02-17 Tim Prince reported in comp.lang.fortran that the Intel beta
6 compiler was able to vectorize 58 out of 122 loops in a vectorization
benchmark available from netlib.org:
tennessee/vector.ps
by: David Levine, David Callahan, & Jack Dongarra,
title: A Comparative Study of Automatic Vectorizing Compilers,
ref: Parallel Computing 17 (1991), pp. 1223-1244.
for: We compare the capabilities of several commercially
available, vectorizing Fortran compilers using a test suite of Fortran
loops. We present the results of compiling and executing these loops
on a variety of supercomputers, mini-supercomputers, and mainframes.
The compilers in the report achieved about 70% vectorization versus about
50% for the Intel compiler. I would assume that Intel has since improved
the vectorization (the compilers are currently at version 7.x as far as I know).
[...]
> most x87 ops execute at the same speed regardless. The exceptions are
> functions which utilize iterative algorithms: transcendental functions,
> division, and square root.
The transcendental instructions are not controlled by precision control,
they always return extended precision results. Divide and square root
instructions typically take advantage of reduced precision by exiting
iterations early for increased performance, but I don't think this is
strictly true for all x86 processors (I seem to recall a processor which
always had the same execution time for square root regardless of the
precision control setting).
-- Norbert
Norbert | 11/12/2003 7:14:37 AM

"Norbert Juffa" <juffa@earthlink.net> wrote in message
news:x5lsb.8164$nz.7655@newsread2.news.pas.earthlink.net...
> "Matt Taylor" <para@tampabay.rr.com> wrote in message news:893sb.85182$jW5.911816@twister.tampabay.rr.com...
<snip>
> tennessee/vector.ps
> by: David Levine, David Callahan, & Jack Dongarra,
> title: A Comparative Study of Automatic Vectorizing Compilers,
> ref: Parallel Computing 17 (1991), pp. 1223-1244.
> for: We compare the capabilities of several commercially
> available, vectorizing Fortran compilers using a test suite of Fortran
> loops. We present the results of compiling and executing these loops
> on a variety of supercomputers, mini-supercomputers, and mainframes.
>
> The compilers in the report achieved about 70% vectorization versus about
> 50% for the Intel compiler. I would assume that Intel has since improved
> the vectorization (the compilers are currently at version 7.x best I know).
Interesting. I would think that C would be a little more difficult to
vectorize than Fortran, but 50% is better than what I had heard.
> [...]
> > most x87 ops execute at the same speed regardless. The exceptions are
> > functions which utilize iterative algorithms: transcendental functions,
> > division, and square root.
>
> The transcendental instructions are not controlled by precision control,
> they always return extended precision results. Divide and square root
> instructions typically take advantage of reduced precision by exiting
> iterations early for increased performance, but I don't think this is
> strictly true for all x86 processors (I seem to recall a processor which
> always had the same execution time for square root regardless of the
> precision control setting).
Interesting. The P4 manuals say nothing about requiring extended precision,
but I checked and it takes identical time on my Athlon regardless of the
precision setting. Additionally, AMD lists the cycle counts for fdiv and
fsqrt in single, double, and extended precision. There is only a range for
fsin and fcos.
Anyway, fdiv and fsqrt should be iterative with early exit on all desktop
processors in circulation today.
-Matt
Matt | 11/12/2003 10:20:15 AM

"Matt Taylor" <para@tampabay.rr.com> wrote in message news:zPnsb.93414$jW5.1060231@twister.tampabay.rr.com...
> "Norbert Juffa" <juffa@earthlink.net> wrote in message
> > The transcendental instructions are not controlled by precision control,
> > they always return extended precision results. Divide and square root
[...]
>
> Interesting. The P4 manuals say nothing about requiring extended precision,
> but I checked and it takes identical time on my Athlon regardless of the
> precision setting. Additionally, AMD lists the cycle counts for fdiv and
> fsqrt in single, double, and extended precision. There is only a range for
> fsin and fcos.
IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic
Architecture, section 8.1.4.2
"The precision-control bits only affect the results of the following
floating-point instructions: FADD, FADP, FIADD, FSUB, FSUBP, FISUB,
FSUBR, FSUBRP, FISUBR, FMUL, FMULP, FIMUL, FDIV, FDIVP, FIDIV, FDIVR,
FDIVRP, FIDIVR, and FSQRT."
-- Norbert
Norbert | 11/12/2003 6:14:19 PM