What is acceptable for -ffast-math?

I know many of the audience here are interested in floating point and optimizations and this post by Linus Torvalds even touches on hardware sympathy:

https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html

Discussion on proggit is also interesting: https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/

Hope this brings some cheer and engagement to the group ;)
William
12/23/2016 7:45:25 AM

William Edwards wrote:
> I know many of the audience here are interested in floating point and
> optimizations and this post by Linus Torvalds even touches on
> hardware sympathy:
>
> https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html
>
> Discussion on proggit is also interesting:
>
https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/
>
>  Hope this brings some cheer and engagement to the group ;)
>
Indeed, even if the original post was from 2001!

Most of the observations made (i.e. that -ffast-math doesn't matter because it 
is the IO that kills you) are even more correct today.

Around the same time frame (or even earlier?) Andy Glew told me to 
"assume all fp operations are free, the total time is determined by the 
cycles needed to load the operands".

The one problem they mention is denormals which is something we have 
discussed here as well:

Mitch has shown/stated that he can do denormals with a small fraction of 
a cycle in total overhead, i.e. effectively free.

In the Mill FP emulation code I'm writing I've found that I can handle 
denormal input and/or output with less than 5% total overhead even for 
FADD/FSUB.
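If you want to see the denormal hit on your own (x86) hardware, here's a 
quick throwaway benchmark (mine, nothing to do with the Mill code; compile 
with plain gcc -O2, _without_ -ffast-math, since on x86 that flag links in 
startup code that sets the FTZ/DAZ bits and hides the effect):

#include <stdio.h>
#include <time.h>

/* Time a dependent FP chain; a tiny start/increment keeps the
   running value in the denormal range for the whole loop. */
static double time_chain(double start, double inc, long iters)
{
    volatile double sum = start;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        sum = sum * 0.5 + inc;
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    long n = 100000000;
    printf("normal:    %.3fs\n", time_chain(1.0, 1.0, n));
    printf("subnormal: %.3fs\n", time_chain(1e-310, 1e-320, n));
    return 0;
}

On cores that trap to microcode for denorms the second line is several 
times slower; on hardware like Mitch describes the two should be nearly 
identical.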

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/23/2016 9:20:30 AM
On 23/12/16 10:20, Terje Mathisen wrote:
> William Edwards wrote:
>> I know many of the audience here are interested in floating point and
>> optimizations and this post by Linus Torvalds even touches on
>> hardware sympathy:
>>
>> https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html
>>
>> Discussion on proggit is also interesting:
>>
> https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/
> 
>>
>>  Hope this brings some cheer and engagement to the group ;)
>>
> Indeed, even if the original post was from 2001!
> 
> Most of the observations made (i.e. that -ffast-math doesn't matter because
> it is the IO that kills you) are even more correct today.
> 
> Around the same time frame (or even earlier?) Andy Glew told me to
> "assume all fp operations are free, the total time is determined by the
> cycles needed to load the operands".
> 

That may be true in some circumstances, but it is certainly /not/ true
in other cases.

In particular, it is /far/ from true on small processors and
microcontrollers.  If these have hardware floating point at all, it is
usually somewhat limited.  Typically you have no or very limited support
for things like infinities, denormals, etc.  Full IEEE compatible
floating point can involve doing significant library work even when the
chip has the basic floating point operations in hardware.  "-ffast-math"
can make a big difference on these devices by relaxing strictness in
details which are usually totally irrelevant for typical programs on
such devices.  It also allows more re-arrangement of expressions,
combined "multiply-and-add" instructions, etc.

It is also far from true when you are using SIMD instructions (or, even
more so, GPU accelerators).  Again, these typically do not implement all
of IEEE, which means that without "-ffast-math" the compiler may not be
able to vectorise calculations efficiently.
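As a minimal illustration of the vectorisation point (my toy example): a
plain sum reduction has a serial dependence on the accumulator, and strict
IEEE semantics forbid the compiler from reassociating it into partial sums:

float sum(const float *a, long n)
{
    float s = 0.0f;
    for (long i = 0; i < n; i++)
        s += a[i];    /* serial dependence: IEEE fixes the addition order */
    return s;
}

/* gcc -O3 -S sum.c             -> scalar addss loop
   gcc -O3 -ffast-math -S sum.c -> vectorised addps loop over partial sums */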

The bandwidth to load the operands is certainly important for a lot of
big calculations, but it is not the only thing to consider.


David
12/23/2016 10:20:42 AM
On Friday, 23 December 2016 10:20:32 UTC+1, Terje Mathisen  wrote:
> Indeed, even if the original post was from 2001!
> 
> Most of the observations made (i.e. that -ffast-math doesn't matter
> because it is the IO that kills you) are even more correct today.

Yeah, I missed the 2001 when I first saw the link on reddit, as it all rings so true even today :)

On the proggit discussion we have people talking about the hit from denormals, and on HN we have this excellent post:

> In my experience, -ffast-math has become vastly more useful over the last 10-15 years. I suppose this was a somewhat reasonable discussion to have in 2001, when most performance critical code still heavily relied on hand-optimized assembly, but nowadays -ffast-math is often the only reasonably convenient way to get compilers to autovectorize code properly, use FMA instructions, etc. I have production code that literally runs 4x faster purely by adding -ffast-math due to autovectorization (and subsequent ILP and other improvements).

https://news.ycombinator.com/item?id=13243489
William
12/23/2016 11:24:04 AM
On Friday, December 23, 2016 at 11:20:32 AM UTC+2, Terje Mathisen wrote:
> William Edwards wrote:
> > I know many of the audience here are interested in floating point and
> > optimizations and this post by Linus Torvalds even touches on
> > hardware sympathy:
> >
> > https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html
> >
> > Discussion on proggit is also interesting:
> >
> https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/
> >
> >  Hope this brings some cheer and engagement to the group ;)
> >
> Indeed, even if the original post was from 2001!
> 
> Most of the observations made (i.e. that -ffast-math doesn't matter because it 
> is the IO that kills you) are even more correct today.
> 
> Around the same time frame (or even earlier?) Andy Glew told me to 
> "assume all fp operations are free, the total time is determined by the 
> cycles needed to load the operands".
> 
> The one problem they mention is denormals which is something we have 
> discussed here as well:
> 
> Mitch has shown/stated that he can do denormals with a small fraction of 
> a cycle in total overhead, i.e. effectively free.

Mitch's statement was limited to FMA-based FPUs with no lower-latency bypassing for special cases of FMUL [which, IMHO, makes sense] or FADD [which is, IMHO, unfortunate, because low-latency FADD is very useful].

> 
> In the Mill FP emulation code I'm writing I've found that I can handle 
> denormal input and/or output with less than 5% total overhead even for 
> FADD/FSUB.
>

Do you mean the "normal" case is 5% slower due to the need to support subnormals?
Or that all subnormal cases, inputs and outputs, are at most 5% slower than "normal"?

> Terje
> 
> -- 
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

already5chosen
12/23/2016 1:50:50 PM
On Friday, December 23, 2016 at 12:20:44 PM UTC+2, David Brown wrote:
> On 23/12/16 10:20, Terje Mathisen wrote:
> > William Edwards wrote:
> >> I know many of the audience here are interested in floating point and
> >> optimizations and this post by Linus Torvalds even touches on
> >> hardware sympathy:
> >>
> >> https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html
> >>
> >> Discussion on proggit is also interesting:
> >>
> > https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/
> > 
> >>
> >>  Hope this brings some cheer and engagement to the group ;)
> >>
> > Indeed, even if the original post was from 2001!
> > 
> > Most of the observations made (i.e. that -ffast-math doesn't matter because
> > it is the IO that kills you) are even more correct today.
> > 
> > Around the same time frame (or even earlier?) Andy Glew told me to
> > "assume all fp operations are free, the total time is determined by the
> > cycles needed to load the operands".
> > 
> 
> That may be true in some circumstances, but it is certainly /not/ true
> in other cases.
> 
> In particular, it is /far/ from true on small processors and
> microcontrollers.  If these have hardware floating point at all, it is
> usually somewhat limited.  Typically you have no or very limited support
> for things like infinities, denormals, etc.  Full IEEE compatible
> floating point can involve doing significant library work even when the
> chip has the basic floating point operations in hardware.  "-ffast-math"
> can make a big difference on these devices by relaxing strictness in
> details which are usually totally irrelevant for typical programs on
> such devices.  It also allows more re-arrangement of expressions,
> combined "multiply-and-add" instructions, etc.
> 
> It is also far from true when you are using SIMD instructions (or, even
> more so, GPU accelerators).  Again, these typically do not implement all
> of IEEE, which means that without "-ffast-math" the compiler may not be
> able to vectorise calculations efficiently.

Only for a very special definition of "typically".
It is true for ARM NEON.
It is true for PPC Embedded Floating-Point, which is present on several Freescale PPC cores, but in light of the acquisition by NXP and the even more recent announced acquisition by Qualcomm, these product lines are unlikely to have a bright future.

I'd guess it was true for a few old GPUs no longer used as GPGPUs.
Maybe it was true for AltiVec, I don't remember. Anyway, today AltiVec is almost dead. Its only living form, the IBM "Vector Facility", supports full IEEE-754.

It's not true for almost anything else, including modern GPGPUs, Intel/AMD SSE/AVX, AArch64 Advanced SIMD, AArch64 SVE, IBM POWER VSX, IBM BlueGene SIMD, the Fujitsu SIMD of the K computer, etc...

> 
> The bandwidth to load the operands is certainly important for a lot of
> big calculations, but it is not the only thing to consider.

already5chosen
12/23/2016 2:24:37 PM
already5chosen@yahoo.com wrote:
> On Friday, December 23, 2016 at 11:20:32 AM UTC+2, Terje Mathisen wrote:
>> In the Mill FP emulation code I'm writing I've found that I can handle
>> denormal input and/or output with less than 5% total overhead even for
>> FADD/FSUB.
>>
> Do you mean the "normal" case is 5% slower due to the need to support subnormals?
> Or that all subnormal cases, inputs and outputs, are at most 5% slower than "normal"?

Both:

I add a single cycle to the normal case in order to support denormals 
with branchless code that doesn't need to special-case denorms at either end.

Testing for denormal and branching would be far slower unless you could 
guarantee effectively no denorms at all, while "denorms_are_zero" would 
only run those ~5% faster.
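The core trick can be sketched in portable C (the real code is 
Mill-specific; this is just the idea): derive the implicit leading bit 
arithmetically from the exponent field instead of branching on it.

#include <stdint.h>

/* Unpack a double into a 53-bit significand and unbiased exponent with
   no branch on the denormal case. Inf/NaN (e == 0x7FF) still needs the
   separate parallel path mentioned elsewhere in the thread. */
static inline uint64_t unpack(uint64_t bits, int *exp)
{
    uint64_t mant = bits & ((1ULL << 52) - 1);
    int e = (int)((bits >> 52) & 0x7FF);
    uint64_t implicit = (uint64_t)(e != 0);   /* 0 only for denormals */
    *exp = e + (int)(1 - implicit) - 1023;    /* denorms share the minimum exponent */
    return mant | (implicit << 52);
}

The downstream add/normalize path then works identically for normal and
denormal inputs; the only cost is computing "implicit", which is where
the single extra cycle above comes from.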

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/23/2016 2:34:19 PM
In article <o3iq8u$v20$1@gioia.aioe.org>,
Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>William Edwards wrote:
>> I know many of the audience here are interested in floating point and
>> optimizations and this post by Linus Torvalds even touches on
>> hardware sympathy:
>>
>> https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html
>>
>> Discussion on proggit is also interesting:
>>
>https://www.reddit.com/r/programming/comments/5jrljd/linus_torvalds_what_is_acceptable_for_ffastmath/
>>
>>  Hope this brings some cheer and engagement to the group ;)
>>
>Indeed, even if the original post was from 2001!
>
>Most of the observations made (i.e. that -ffast-math doesn't matter because it 
>is the IO that kills you) are even more correct today.
>
>Around the same time frame (or even earlier?) Andy Glew told me to 
>"assume all fp operations are free, the total time is determined by the 
>cycles needed to load the operands".

Which is all true, to a first approximation, but does NOT imply that
-ffast-math isn't significantly faster.  Let's ignore the issue that
some operations with some operands glitch the pipeline, important
though that is.  Obeying dogmatic non-optimisation rules (as required
by an extreme reading of IEEE 754 etc.) often prevents a reduction in
the number of data accesses, by forbidding some reorganisations that
make better use of registers and non-trivial common expression
elimination.
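A trivial example of the kind of reorganisation I mean (mine): factoring
out a common multiplicand changes the rounding, so a compiler may only do
it under -ffast-math, even though it saves an operation:

double f(double a, double b, double c)
{
    /* Mathematically a*(b+c), but the rounding can differ, so without
       -ffast-math the compiler must keep two multiplies and an add. */
    return a*b + a*c;
}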

>The one problem they mention is denormals which is something we have 
>discussed here as well:
>
>Mitch has shown/stated that he can do denormals with a small fraction of 
>a cycle in total overhead, i.e. effectively free.
>
>In the Mill FP emulation code I'm writing I've found that I can handle 
>denormal input and/or output with less than 5% total overhead even for 
>FADD/FSUB.

Average time, worst-case time, code size or number of branches?  As
you know, my objections to denormalised numbers are on the grounds
that they increase complexity, and hence make bugs more likely, for
a negligible improvement in numerical accuracy, rather than mainly
their performance.


Regards,
Nick Maclaren.
nmm
12/23/2016 2:35:35 PM
William Edwards <willvarfar@gmail.com> writes:
>On the proggit discussion we have people talking about the hit from denormals, and on HN we have this excellent post:
>
>> In my experience, -ffast-math has become vastly more useful over the last 10-15 years. I suppose this was a somewhat reasonable discussion to have in 2001, when most performance critical code still heavily relied on hand-optimized assembly, but nowadays -ffast-math is often the only reasonably convenient way to get compilers to autovectorize code properly, use FMA instructions, etc. I have production code that literally runs 4x faster purely by adding -ffast-math due to autovectorization (and subsequent ILP and other improvements).
>
>https://news.ycombinator.com/item?id=13243489

Here's the example that we manually vectorized a few weeks ago (on a
Core i3-3227U)

[b8:~/pub/anton/lvas/effizienz:15138] gcc -O3 tsp9.c -lm
[b8:~/pub/anton/lvas/effizienz:15139] perf stat -e cycles:u a.out 10000
sumdist = 88.609717

 Performance counter stats for 'a.out 10000':

         226964842      cycles:u                                                    

       0.158383019 seconds time elapsed

[b8:~/pub/anton/lvas/effizienz:15140] gcc -O3 -ffast-math tsp9.c -lm
[b8:~/pub/anton/lvas/effizienz:15141] perf stat -e cycles:u a.out 10000
sumdist = 88.609717

 Performance counter stats for 'a.out 10000':

         226982633      cycles:u                                                    

       0.156984167 seconds time elapsed

I.e., no speedup from -ffast-math at all.  It's just one program, but
so is the case that you cited.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
anton
12/23/2016 5:02:24 PM
William Edwards <willvarfar@gmail.com> writes:
>I know many of the audience here are interested in floating point and optimizations and this post by Linus Torvalds even touches on hardware sympathy:
>
>https://gcc.gnu.org/ml/gcc/2001-07/msg02150.html

There are several issues: Denormals, and applying transformations
based on the associative law (which does not hold for FP math).

Since we are on comp.arch here, let's look at the hardware aspect
(denormals) first: Hardware engineers started out with the usual
"supercomputer mindset" (i.e., only implement a part of the
functionality, and leave it to the software guys to avoid the
unimplemented/slow cases) and did not implement denormal numbers or
implemented them in a slow way; and some told us that it's impossible
to implement denormals efficiently.  It turned out that the software
guys were not able to avoid denormals, so eventually the hardware guys
put their mind to it and implemented denormals efficiently.  Next:
sequential consistency.

Concerning the associative law, some FP software is written without
putting much thought into the specific properties of FP arithmetic,
assuming that it is just like arithmetic on reals.  For such code it
may be harmless to perform transformations based on the associative
law; except that the program has been tested with a certain way of
being compiled, it might be sheer luck that it works as intended, and
compiling it with associative-law transformations may or may not break
that; but anyway, you can run the program with the transformations,
and see if the results are satisfactory (ok, that only validates the
specific input, but so did the original tests).

And then there is code that is carefully designed with the properties
of FP math (and possibly IEEE FP math) in mind; for such code you most
likely don't want transformations based on arithmetic laws that simply
don't hold for FP math (such as the associative law).
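The canonical example is Kahan's compensated summation: the compensation
term c is algebraically always zero, so a compiler that assumes
associativity is entitled to delete it and silently turn the algorithm
back into naive summation:

double kahan_sum(const double *a, long n)
{
    double s = 0.0, c = 0.0;      /* c compensates for lost low-order bits */
    for (long i = 0; i < n; i++) {
        double y = a[i] - c;
        double t = s + y;
        c = (t - s) - y;          /* algebraically 0; in FP, the rounding error */
        s = t;
    }
    return s;
}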

At least with -ffast-math you have the choice, and it's not on by
default, unlike the "optimizations" based on assuming that a program
does not perform undefined behaviour.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
anton
12/23/2016 5:15:28 PM
On Friday, December 23, 2016 at 7:50:53 AM UTC-6, already...@yahoo.com wrote:
> > 
> > The one problem they mention is denormals which is something we have 
> > discussed here as well:
> > 
> > Mitch has shown/stated that he can do denormals with a small fraction of 
> > a cycle in total overhead, i.e. effectively free.
> 
> Mitch's statement was limited to FMA-based FPUs with no lower-latency bypassing for special cases of FMUL [which, IMHO, makes sense] or FADD [which is, IMHO, unfortunate, because low-latency FADD is very useful].

My statement was indeed directed at FMAC-based FP calculation units.
In an FMAC, in order to deliver a single rounding error from A*B+C,
the data path has to have enough circuitry that adding a few special
touches makes the FMAC unit handle denorms at full speed. It actually
takes more circuitry to propagate a NaN through the unit than to process
denorms at full speed.

Basically, one bothers to detect a zero exponent and uses this information
to suppress the hidden one bit at the top of the fraction. Then over in
the normalizer, one invents a hidden bit so the LZA or FF1 does not over-
normalize the <potentially denormalized> pre-normal result.

Nothing in my statement means that it is hard to do the same in a FMUL
unit or an FADD unit. The same tricks are available, one just has to 
"get in there and do it". The amount of circuitry is negligible (in the
large scheme of things).
MitchAlsup
12/23/2016 5:44:44 PM
Nick Maclaren wrote:
> In article <o3iq8u$v20$1@gioia.aioe.org>,
> Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>> Around the same time frame (or even earlier?) Andy Glew told me to
>> "assume all fp operations are free, the total time is determined by the
>> cycles needed to load the operands".
>
> Which is all true, to a first approximation, but does NOT imply that
> --fast-math isn't significantly faster.  Let's ignore the issue that
> some operations with some operands glitch the pipeline, important
> though that is.  Obeying dogmatic non-optimisation rules (as required
> by an extreme reading of IEEE 754 etc.) often prevents a reduction in
> the number of data accesses, by forbidding some reorganisations that
> make better use of registers and non-trivial common expression
> elimination.

As you well know, I have made if not my living then at least my online 
reputation by optimizing low-level asm/C code; the main idea here is of 
course to minimize the number of ops and maximize cache reuse. I have 
found that this is a lot more fun to do on relatively small kernels with 
a lot of operations.

Codecs, crypto, complicated math, vectorizable by disregarding the 
normal C rules for operation order and overlap.

I've found that I can often rewrite the C code so that it becomes 
obvious to the compiler that it is in fact legal to optimize the heck 
out of it.

I.e. copy everything into local variables to get rid of all aliasing 
considerations.
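A stereotypical before/after (simplified from memory, not actual
production code):

/* Before: the store through dst may alias w, so the compiler must
   reload w[0] and w[1] on every iteration. */
void mix(float *dst, const float *src, const float *w, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * w[0] + src[i + 1] * w[1];
}

/* After: copying the invariants into locals removes the aliasing
   question for w, so the loads hoist out of the loop. */
void mix2(float *dst, const float *src, const float *w, int n)
{
    float w0 = w[0], w1 = w[1];
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * w0 + src[i + 1] * w1;
}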
>
>> The one problem they mention is denormals which is something we have
>> discussed here as well:
>>
>> Mitch has shown/stated that he can do denormals with a small fraction of
>> a cycle in total overhead, i.e. effectively free.
>>
>> In the Mill FP emulation code I'm writing I've found that I can handle
>> denormal input and/or output with less than 5% total overhead even for
>> FADD/FSUB.
>
> Average time, worst-case time, code size or number of branches?  As
> you know, my objections to denormalised numbers are on the grounds
> that they increase complexity, and hence make bugs more likely, for
> a negligible improvement in numerical accuracy, rather than mainly
> their performance.

My Mill code tends to be branchless/constant-time specifically so that it 
can work unmodified on SIMD data. This is the main reason I started 
looking for inline ways to handle denorms with the lowest possible overhead.

I do need to have a parallel path for peculiar inputs, i.e. Inf/NaN. 
This is computed at the same time as the mainline function, then a 
pick() operation selects the proper result at the end.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/23/2016 6:35:58 PM
In article <o3jqqc$mvp$1@gioia.aioe.org>,
Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>
>As you well know, I have made if not my living then at least my online 
>reputation by optimizing low-level asm/C code; the main idea here is of 
>course to minimize the number of ops and maximize cache reuse. I have 
>found that this is a lot more fun to do on relatively small kernels with 
>a lot of operations.

Absolutely.  But there is a great deal of code where the time goes
in necessarily complicated code - i.e. where the mathematics is
itself complicated.  And not all programmers are of your calibre.

>Codecs, crypto, complicated math, vectorizable by disregarding the 
>normal C rules for operation order and overlap.
>
>I've found that I can often rewrite the C code so that it becomes 
>obvious to the compiler that it is in fact legal to optimize the heck 
>out of it.
>
>I.e. copy everything into local variables to get rid of all aliasing 
>considerations.

Indeed, you can.  I certainly could, and probably still could, but
that's not really the point.  It is extra effort, often makes the
code a lot less clear and/or different from the mathematics, and
that is NOT good software engineering.  Even excluding the ones who
can't do it, there are many who will make mistakes and/or fail to
find the ones they have made, and it is a nightmare for maintenance,
especially by other people (the usual form).

I have often found it easier to throw such code away and start from
scratch than to try to reverse engineer whether the optimisations
were (a) correct and (b) beneficial.  That is particularly the case
when you need to add a new form of optimisation, usually SIMD or
threading, into highly manually optimised code.  The optimal code
structure is often very different in the different cases.  Or when
you need to change one complex formula into a very similar one;
the changes may be simple in the original mathematics, but they very
often aren't in the optimised form.

The standard recommendation for software engineering is NOT to hand
optimise unless you really have to - write clean, clear code, and
turn up the compiler optimisation.


Regards,
Nick Maclaren.
nmm
12/23/2016 8:03:47 PM
Nick Maclaren wrote:
> In article <o3jqqc$mvp$1@gioia.aioe.org>,
> Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>> I.e. copy everything into local variables to get rid of all aliasing
>> considerations.
>
> Indeed, you can.  I certainly could, and probably still could, but
> that's not really the point.  It is extra effort, often makes the
> code a lot less clear and/or different from the mathematics, and
> that is NOT good software engineering.  Even excluding the ones who
> can't do it, there are many who will make mistakes and/or fail to
> find the ones they have made, and it is a nightmare for maintenance,
> especially by other people (the usual form).

My daytime job finally changed last year, after 34 years in (more or 
less) the same company: I am now CTO of Open iT and that means that I 
get to figure out how we can try to deliver correct results from 
error-prone input data.

This is actually quite fun:

Given a data format with a small amount of redundancy and a (usually 
smaller) chance of dropped lines it is sometimes possible to figure out 
exactly what the dropped line must have looked like, other times I can 
get very close, with provable error limits, and sometimes (particularly 
with paired errors) I can only detect that something bad has happened.
>
> I have often found it easier to throw such code away and start from
> scratch than to try to reverse engineer whether the optimisations
> were (a) correct and (b) beneficial.  That is particularly the case
> when you need to add a new form of optimisation, usually SIMD or
> threading, into highly manually optimised code.  The optimal code
> structure is often very different in the different cases.  Or when
> you need to change one complex formula into a very similar one;
> the changes may be simple in the original mathematics, but they very
> often aren't in the optimised form.

This is very good advice; it is _far_ easier to start from a 
mathematical description of what's to be done than from 10k lines of 
deeply nested C++ inheritance.
>
> The standard recommendation for software engineering is NOT to hand
> optimise unless you really have to - write clean, clear code, and
> turn up the compiler optimisation.

Rule 1 for optimization: Don't do it.
Rule 2 for optimization: Don't do it yet.

I do like to add a third rule though:

Don't write your code so as to make it significantly harder to optimize.

I.e. in a code base I looked at recently everything was written on top 
of the C-lib fgetc():

On most unix versions this is in fact a macro that just returns the next 
char from a local buffer, only making a real call to a support function 
for refill when needed.

When compiled with Microsoft Visual Studio in a multi-threaded 
environment you get something far, far heavier:

Each fgetc() turns into at least 3 kernel calls because the MT libs wrap 
every IO operation in Lock()/Unlock() semaphore operations.

This meant that the exact same code ran 3-5 times faster on Linux than 
on Windows.

Fixing it just meant moving the interface up from the character to the 
line level, at which point the overhead didn't matter nearly as much.
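In sketch form (not the actual code), the fix was essentially this: one
library call per line instead of one per character, so any locking cost
is amortized over the whole line:

#include <stdio.h>
#include <string.h>

long count_bytes(FILE *f)
{
    char line[4096];
    long total = 0;
    /* One (possibly locked) CRT call per line instead of per byte. */
    while (fgets(line, sizeof line, f) != NULL)
        total += (long)strlen(line);
    return total;
}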

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/23/2016 9:54:34 PM
In article <o3k6eo$19v7$1@gioia.aioe.org>,
Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>
>My daytime job finally changed last year, after 34 years in (more or 
>less) the same company: I am now CTO of Open iT and that means that I 
>get to figure out how we can try to deliver correct results from 
>error-prone input data.
>
>This is actually quite fun:
>
>Given a data format with a small amount of redundancy and a (usually 
>smaller) chance of dropped lines it is sometimes possible to figure out 
>exactly what the dropped line must have looked like, other times I can 
>get very close, with provable error limits, and sometimes (particularly 
>with paired errors) I can only detect that something bad has happened.

Oh, yes, indeed!  There's lots of interesting mathematics, dating from
the 1920s and still ongoing, much of which is remarkably practical.
I remember some colleagues recovering the source of a program where
the bottom bit of every fifth character had been lost :-)  And I have
had to do this on occasion (who hasn't?)

One of the things that I despair about is the way that so many data
protocols and the tools that manage them either ignore this issue or
make a complete pig's ear of it.  One can't expect excellence, but
one might at least hope for competence.  There are signs that things
may improve: at least computer scientists are now teaching the use
of invariants, rather than saying that redundancy is undesirable.
You can guess what I say in my courses :-)  It looks like your
company's tools will be better, even if you can't fix the protocols.

>This is very good advice, it is _far_ easier to start from a 
>mathematical description of what's to be done than 10k lines of deeply 
>nested C++ inheritance.

Or, worse, interoperating template code.  SO much more efficient,
because it is done at compile-time :-(

>> The standard recommendation for software engineering is NOT to hand
>> optimise unless you really have to - write clean, clear code, and
>> turn up the compiler optimisation.
>
>Rule 1 for optimization: Don't do it.
>Rule 2 for optimization: Don't do it yet.

Right.  But, as you say below:

>I do like to add a third rule though:
>
>Don't write your code so as to make it significantly harder to optimize.

The two areas where I do advise early optimisation are (a) in the
choice of algorithm and (b) in the choice of data structures (mainly
an access issue, of course).  Not in detail, but enough that you don't
have to gut the code later.

>I.e. in a code base I looked at recently everything was written on top 
>of the C-lib fgetc():
>
>On most unix versions this is in fact a macro that just returns the next 
>char from a local buffer, only making a real call to a support function 
>for refill when needed.
>
>When compiled with Microsoft Visual Studio in a multi-threaded 
>environment you get something far, far heavier:
>
>Each fgetc() turns into at least 3 kernel calls because the MT libs wrap 
>every IO operation in Lock()/Unlock() semaphore operations.
>
>This meant that the exact same code ran 3-5 times faster on Linux than 
>on Windows.

Blame SC22WG21, SC22WG14 and POSIX.  I tried to stop that lunacy
being required, but failed.  I think that you will find that modern
versions of gcc call a library function, not a macro, because they
need to achieve the same effect.


Regards,
Nick Maclaren.
nmm
12/24/2016 10:43:06 AM
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

>I.e. in a code base I looked at recently everything was written on top 
>of the C-lib fgetc():
>
>On most unix versions this is in fact a macro that just returns the next 
>char from a local buffer, only making a real call to a support function 
>for refill when needed.

I don't think so. The C standard says it should be interlocked.

>When compiled with Microsoft Visual Studio in a multi-threaded 
>environment you get something far, far heavier:
>
>Each fgetc() turns into at least 3 kernel calls because the MT libs wrap 
>every IO operation in Lock()/Unlock() semaphore operations.

I wonder why the standards group defined it that way.  I had to write
my own buffering in a program (many short records) to get performance.
Bengt
12/24/2016 3:02:02 PM
Bengt Larsson wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> I.e. in a code base I looked at recently everything was written on top
>> of the C-lib fgetc():
>>
>> On most unix versions this is in fact a macro that just returns the next
>> char from a local buffer, only making a real call to a support function
>> for refill when needed.
>
> I don't think so. The C standard says it should be interlocked.

That probably explains why the Linux version was only 3-5X faster:

They still did the locking but probably using user-land optimistic 
primitives.
>
>> When compiled with Microsoft Visual Studio in a multi-threaded
>> environment you get something far, far heavier:
>>
>> Each fgetc() turns into at least 3 kernel calls because the MT libs wrap
>> every IO operation in Lock()/Unlock() semaphore operations.
>
> I wonder why the standards group defined it that way.  I had to write
> my own buffering in a program (many short records) to get performance.
>

To add insult to injury, the main/only access pattern actually used in 
that case was full sequential read-only processing of entire files, with 
no other simultaneous access.

A small amount of buffering, similar to what you did, was sufficient to 
solve the problem.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/24/2016 3:45:58 PM
In article <u53t5c167ss9f9412e6j8vp2ivo4n1a7u0@4ax.com>,
Bengt Larsson  <bengtl12.net@telia.NOSPAMcom> wrote:
>Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>>I.e. in a code base I looked at recently everything was written on top 
>>of the C-lib fgetc():
>>
>>On most unix versions this is in fact a macro that just returns the next 
>>char from a local buffer, only making a real call to a support function 
>>for refill when needed.
>
>I don't think so. The C standard says it should be interlocked.
>
>>When compiled with Microsoft Visual Studio in a multi-threaded 
>>environment you get something far, far heavier:
>>
>>Each fgetc() turns into at least 3 kernel calls because the MT libs wrap 
>>every IO operation in Lock()/Unlock() semaphore operations.
>
>I wonder why the standards group defined it that way.  I had to write
>my own buffering in a program (many short records) to get performance.

I could answer that but, because it is the season of goodwill,
I won't.  But think 50% politics, 30% Dunning-Kruger and the rest
just plain ignorance.  It's not just a performance issue, either,
but a rather nasty ambiguity/inconsistency one.


Regards,
Nick Maclaren.
nmm
12/24/2016 3:58:40 PM
On Saturday, December 24, 2016 at 5:46:00 PM UTC+2, Terje Mathisen wrote:
> Bengt Larsson wrote:
> > Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> >
> >> I.e. in a code base I looked at recently everything was written on top
> >> of the C-lib fgetc():
> >>
> >> On most unix versions this is in fact a macro that just returns the next
> >> char from a local buffer, only making a real call to a support function
> >> for refill when needed.
> >
> > I don't think so. The C standard says it should be interlocked.
> 
> That probably explains why the Linux version was only 3-5 X faster:
> 
> They still did the locking but probably using user-land optimistic 
> primitives.

Optimistic vs pessimistic likely makes zero difference.

What I don't understand is why the MSVC MT lib cannot implement its Lock()/Unlock() as EnterCriticalSection()/LeaveCriticalSection().


> >
> >> When compiled with Microsoft Visual Studio in a multi-threaded
> >> environment you get something far, far heavier:
> >>
> >> Each fgetc() turns into at least 3 kernel calls because the MT libs wrap
> >> every IO operation in Lock()/Unlock() semaphore operations.
> >
> > I wonder why the standards group defined it that way.  I had to write
> > my own buffering in a program (many short records) to get performance.
> >
> 
> To add insult to injury, the main/only access pattern actually used in 
> that case was full sequential read-only processing of entire files, with 
> no other simultaneous access.
> 
> A small amount of buffering, similar to what you did, was sufficient to 
> solve the problem.
> 
> Terje
> 
> -- 
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

already5chosen
12/24/2016 4:30:01 PM
On Friday, December 23, 2016 at 7:44:47 PM UTC+2, MitchAlsup wrote:
> On Friday, December 23, 2016 at 7:50:53 AM UTC-6, already...@yahoo.com wrote:
> > > 
> > > The one problem they mention is denormals which is something we have 
> > > discussed here as well:
> > > 
> > > Mitch has shown/stated that he can do denormals with a small fraction of 
> > > a cycle in total overhead, i.e. effectively free.
> > 
> > Mitch's statement was limited to FMA-based FPUs with no lower-latency bypassing for special cases of FMUL [which, IMHO, makes sense] or FADD [which is, IMHO, unfortunate, because low-latency FADD is very useful].
> 
> My statement was indeed directed at FMAC-based FP calculation units.
> In an FMAC, in order to deliver a single rounding error from A*B+C,
> the data path has to have enough circuitry that adding a few special
> touches makes the FMAC unit handle denorms at full speed. It actually
> takes more circuitry to propagate a NaN through the unit than to process
> denorms at full speed.
> 
> Basically, one bothers to detect a zero exponent and uses this information
> to suppress the hidden one bit at the top of the fraction. Then over in
> the normalizer, one invents a hidden bit so the LZA or FF1 does not over-
> normalize the <potentially denormalized> pre-normal result.
> 
> Nothing in my statement means that it is hard to do the same in a FMUL
> unit or an FADD unit. The same tricks are available, one just has to 
> "get in there and do it". The amount of circuitry is negligible (in the
> large scheme of things).

In the large scheme of things the whole FPU is negligible unless you are a GPU or a Xeon Phi.

But at a slightly smaller scheme of things, I'd expect that on a non-FMAC-based FMUL, full-speed subnormals will cost several percent (>5%?) both in logic and in latency, mostly within the final normalization step.

For FADD I am less sure. I should think a little more about it. 
But right now I have better things to do than to think about it.
already5chosen
12/24/2016 4:38:49 PM
On Friday, December 23, 2016 at 4:34:21 PM UTC+2, Terje Mathisen wrote:
> already5chosen@yahoo.com wrote:
> > On Friday, December 23, 2016 at 11:20:32 AM UTC+2, Terje Mathisen wrote:
> >> In the Mill FP emulation code I'm writing I've found that I can handle
> >> denormal input and/or output with less than 5% total overhead even for
> >> FADD/FSUB.
> >>
> > Do you mean the "normal" case is 5% slower due to the need to support subnormals?
> > Or that all subnormal cases, inputs and outputs, are at most 5% slower than "normal"?
> 
> Both:
> 
> I add a single cycle to the normal case in order to support denormal 
> with branchless code that doesn't need to specialcase denorms at either end.
> 

5% suggests that the whole FADD or FMUL takes 20 clocks.
From a 4-8-wide 64-bit CPU I would expect 1.5x-2x better.
Unless the 20 clocks include the overhead of an invalid-opcode exception+dispatch.
But I was under the impression that, since the Mill has no requirement of binary compatibility between core variants, such overhead is avoided.

> Testing for denormal and branching would be far slower unless you could 
> guarantee effectively no denorms at all, 

I don't see why.
According to my understanding, low-end Mill cores do not have unusually long pipelines, so a mispredicted taken branch is not too expensive. 8-10 clocks or something like that? But even 30-50 clocks would not really change the trade-offs.
So, a branch appears to make a lot of sense, unless you expect that tens of percent of your inputs or outputs are subnormals.

> while "denorms_are_zero" would only run those ~5% faster.

The good reference point is not "denorms_are_zero", but "VAX floating point", although the difference between the two is not necessarily big.

> 
> Terje
> -- 
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

already5chosen
12/24/2016 4:53:00 PM
On Friday, December 23, 2016 at 7:08:11 PM UTC+2, Anton Ertl wrote:
> William Edwards <willvarfar@gmail.com> writes:
> >On the proggit discussion we have people talking about the hit from denormals, and on HN we have this excellent post:
> >
> >> In my experience, -ffast-math has become vastly more useful over the last 10-15 years. I suppose this was a somewhat reasonable discussion to have in 2001, when most performance critical code still heavily relied on hand-optimized assembly, but nowadays -ffast-math is often the only reasonably convenient way to get compilers to autovectorize code properly, use FMA instructions, etc. I have production code that literally runs 4x faster purely by adding -ffast-math due to autovectorization (and subsequent ILP and other improvements).
> >
> >https://news.ycombinator.com/item?id=13243489
> 
> Here's the example that we manually vectorized a few weeks ago (on a
> Core i3-3227U)
> 
> [b8:~/pub/anton/lvas/effizienz:15138] gcc -O3 tsp9.c -lm
> [b8:~/pub/anton/lvas/effizienz:15139] perf stat -e cycles:u a.out 10000
> sumdist = 88.609717
> 
>  Performance counter stats for 'a.out 10000':
> 
>          226964842      cycles:u                                                    
> 
>        0.158383019 seconds time elapsed
> 
> [b8:~/pub/anton/lvas/effizienz:15140] gcc -O3 -ffast-math tsp9.c -lm
> [b8:~/pub/anton/lvas/effizienz:15141] perf stat -e cycles:u a.out 10000
> sumdist = 88.609717
> 
>  Performance counter stats for 'a.out 10000':
> 
>          226982633      cycles:u                                                    
> 
>        0.156984167 seconds time elapsed
> 
> I.e., no speedup from -ffast-math at all.  It's just one program, but
> so is the case that you cited.
> 
> - anton
> -- 
> M. Anton Ertl                    Some things have to be seen to be believed
> anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
> http://www.complang.tuwien.ac.at/anton/home.html

I think you are missing the point.
Nobody says that fast-math can speed up manually vectorized (and manually optimized) code.
The claim is that fast-math can make the original scalar code closer in speed to the manually vectorized version.


already5chosen
12/24/2016 4:58:00 PM
On Sat, 24 Dec 2016 08:58:00 -0800 (PST)
already5chosen@yahoo.com wrote:

> On Friday, December 23, 2016 at 7:08:11 PM UTC+2, Anton Ertl wrote:
> > William Edwards <willvarfar@gmail.com> writes:  
> > >On the proggit discussion we have people talking about the hit from denormals, and on HN we have this excellent post:
> > >
> > >> In my experience, -ffast-math has become vastly more useful over the last 10-15 years. I suppose this was a somewhat reasonable discussion to have in 2001, when most performance critical code still heavily relied on hand-optimized assembly, but nowadays -ffast-math is often the only reasonably convenient way to get compilers to autovectorize code properly, use FMA instructions, etc. I have production code that literally runs 4x faster purely by adding -ffast-math due to autovectorization (and subsequent ILP and other improvements).
> > >
> > >https://news.ycombinator.com/item?id=13243489
> > 
> > Here's the example that we manually vectorized a few weeks ago (on a
> > Core i3-3227U)
> > 
> > [b8:~/pub/anton/lvas/effizienz:15138] gcc -O3 tsp9.c -lm
> > [b8:~/pub/anton/lvas/effizienz:15139] perf stat -e cycles:u a.out 10000
> > sumdist = 88.609717
> > 
> >  Performance counter stats for 'a.out 10000':
> > 
> >          226964842      cycles:u
> > 
> >        0.158383019 seconds time elapsed
> > 
> > [b8:~/pub/anton/lvas/effizienz:15140] gcc -O3 -ffast-math tsp9.c -lm
> > [b8:~/pub/anton/lvas/effizienz:15141] perf stat -e cycles:u a.out 10000
> > sumdist = 88.609717
> > 
> >  Performance counter stats for 'a.out 10000':
> > 
> >          226982633      cycles:u
> > 
> >        0.156984167 seconds time elapsed
> > 
> > I.e., no speedup from -ffast-math at all.  It's just one program,
> > but so is the case that you cited.
> > 
> > - anton
> > -- 
> > M. Anton Ertl                    Some things have to be seen to be believed
> > anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
> > http://www.complang.tuwien.ac.at/anton/home.html
> 
> I think you are missing the point.
> Nobody says that fast-math can speed up manually vectorized (and
> manually optimized) code. The claim is that fast-math can make the
> original scalar code closer in speed to the manually vectorized version.
> 
> 

One thing I know is that gcc won't replace `x/5.0` with `x * (1.0/5.0)` if
fast-math is not on.
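That particular rewrite is gated by -freciprocal-math, which -ffast-math
turns on. Easy to check with -S:

/* gcc -O3:                    divsd (exact IEEE division)
   gcc -O3 -freciprocal-math:  mulsd by 0.2, which can round
                               differently in the last bit */
double scale(double x) { return x / 5.0; }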


-- 
press any key to continue or any other to quit...
Melzzzzz
12/24/2016 5:06:21 PM
Terje Mathisen wrote:
> Bengt Larsson wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>
>>> I.e. in a code base I looked at recently everything was written on top
>>> of the C-lib fgetc():
>>>
>>> On most unix versions this is in fact a macro that just returns the next
>>> char from a local buffer, only making a real call to a support function
>>> for refill when needed.
>>
>> I don't think so. The C standard says it should be interlocked.
> 
> That probably explains why the Linux version was only 3-5 X faster:
> 
> They still did the locking but probably using user-land optimistic 
> primitives.

I've never used these but there are non-locking versions
of some C library calls.

_fgetc_nolock

Also you can define the symbol

#define _CRT_DISABLE_PERFCRIT_LOCKS

and it will switch all CRT IO calls to the non-locking version.

Multithreaded Libraries Performance
https://msdn.microsoft.com/en-us/library/ms235505.aspx

_nolock Functions
https://msdn.microsoft.com/en-us/library/ms235386.aspx
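A minimal usage sketch (untested, per those MSDN pages; only valid when
no other thread touches the stream):

#include <stdio.h>

long count_chars(FILE *f)
{
    long n = 0;
    /* MSVC-specific: skips the per-call stream lock entirely. */
    while (_fgetc_nolock(f) != EOF)
        n++;
    return n;
}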

As to why the Linux version would be 3-5X faster,
I can't think of a reason.
They both should be using light weight user mode critical
sections to guard user mode read ahead buffers so there
should be essentially zero locking cost.

So _fgetc_nolock may not make a difference if the Windows
overhead is due to other things, maybe TLS table accesses.

Eric

EricP
12/24/2016 5:16:54 PM
EricP wrote:
> 
> As to why the Linux version would be 3-5X faster,
> I can't think of a reason.
> They both should be using light weight user mode critical
> sections to guard user mode read ahead buffers so there
> should be essentially zero locking cost.

Or maybe the default size of the read ahead buffer
is smaller on Windows so it makes more OS calls.
The Windows docs say the default size is 4kB.
You could use setvbuf to change the buffer.
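For example (buffer size illustrative), before the first read:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("data.txt", "rb");
    if (!f) return 1;
    /* _IOFBF = fully buffered; NULL lets the CRT allocate the 1 MB
       buffer. Must be called before the first read or write. */
    setvbuf(f, NULL, _IOFBF, 1 << 20);
    /* ... sequential processing ... */
    fclose(f);
    return 0;
}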

Eric

EricP
12/24/2016 5:43:48 PM
already5chosen@yahoo.com writes:
>On Friday, December 23, 2016 at 7:08:11 PM UTC+2, Anton Ertl wrote:
>> Here's the example that we manually vectorized a few weeks ago (on a
>> Core i3-3227U)
>> 
>> [b8:~/pub/anton/lvas/effizienz:15138] gcc -O3 tsp9.c -lm
>> [b8:~/pub/anton/lvas/effizienz:15139] perf stat -e cycles:u a.out 10000
>> sumdist = 88.609717
>> 
>>  Performance counter stats for 'a.out 10000':
>> 
>>          226964842      cycles:u                                                    
>> 
>>        0.158383019 seconds time elapsed
>> 
>> [b8:~/pub/anton/lvas/effizienz:15140] gcc -O3 -ffast-math tsp9.c -lm
>> [b8:~/pub/anton/lvas/effizienz:15141] perf stat -e cycles:u a.out 10000
>> sumdist = 88.609717
>> 
>>  Performance counter stats for 'a.out 10000':
>> 
>>          226982633      cycles:u                                                    
>> 
>>        0.156984167 seconds time elapsed
>> 
>> I.e., no speedup from -ffast-math at all.  It's just one program, but
>> so is the case that you cited.
....
>I think you are missing the point.
>Nobody says that fast-math can speed up manually vectorized (and manually optimized) code.
>The claim is that fast-math can make the original scalar code closer in speed to the manually vectorized version.

I am sorry to have expressed myself so ambiguously.  tsp9.c is the
scalar code that is the starting point of our vectorization
experiments.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
anton
12/24/2016 8:46:13 PM
On Saturday, December 24, 2016 at 10:49:58 PM UTC+2, Anton Ertl wrote:
> already5chosen@yahoo.com writes:
> >On Friday, December 23, 2016 at 7:08:11 PM UTC+2, Anton Ertl wrote:
> >> Here's the example that we manually vectorized a few weeks ago (on a
> >> Core i3-3227U)
> >> 
> >> [b8:~/pub/anton/lvas/effizienz:15138] gcc -O3 tsp9.c -lm
> >> [b8:~/pub/anton/lvas/effizienz:15139] perf stat -e cycles:u a.out 10000
> >> sumdist = 88.609717
> >> 
> >>  Performance counter stats for 'a.out 10000':
> >> 
> >>          226964842      cycles:u                                                    
> >> 
> >>        0.158383019 seconds time elapsed
> >> 
> >> [b8:~/pub/anton/lvas/effizienz:15140] gcc -O3 -ffast-math tsp9.c -lm
> >> [b8:~/pub/anton/lvas/effizienz:15141] perf stat -e cycles:u a.out 10000
> >> sumdist = 88.609717
> >> 
> >>  Performance counter stats for 'a.out 10000':
> >> 
> >>          226982633      cycles:u                                                    
> >> 
> >>        0.156984167 seconds time elapsed
> >> 
> >> I.e., no speedup from -ffast-math at all.  It's just one program, but
> >> so is the case that you cited.
> ...
> >I think you are missing the point.
> >Nobody says that fast-math can speed up manually vectorized (and manually optimized) code.
> >The claim is that fast-math can make the original scalar code closer in speed to the manually vectorized version.
> 
> I am sorry to have expressed myself so ambiguously.  tsp9.c is the
> scalar code that is the starting point of our vectorization
> experiments.
> 
> - anton
> -- 
> M. Anton Ertl                    Some things have to be seen to be believed
> anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
> http://www.complang.tuwien.ac.at/anton/home.html

Got it.

Did you do your experiment with the same old gcc 4.9 that you used a few weeks ago?
I think that between 4.9 and now gcc put a lot of work into vectorization.
Although probably not enough to do something about your tsp code. It is not an easy case.
Maybe, if you change the order of array access from top-down to bottom-up, the compiler will have a better chance...
Or maybe not. It's very possible that a data-dependent if {} is still above the abilities of the relatively immature gcc vectorizer.
already5chosen
12/24/2016 9:35:45 PM
already5chosen@yahoo.com wrote:
> On Friday, December 23, 2016 at 4:34:21 PM UTC+2, Terje Mathisen
>> Testing for denormal and branching would be far slower unless you
>> could guarantee effectively no denorms at all,
>
> I don't see why. According to my understanding, low-end Mill cores do
> not have unusually long pipelines, so a mispredicted taken branch is
> not too expensive. 8-10 clocks or something like that? But even 30-50
> clocks would not really change the trade-offs. So, a branch appears to
> make a lot of sense, unless you expect that tens of percent of your
> inputs or outputs are subnormals.

Think SIMD with a single denorm among N-1 normal values: If you handle 
this with test & branch then you need a SIMD test + a branch on any 
non-zero + check of each value and scalar evaluation of any denorms. 
This is so expensive that just 1% denorms would totally destroy your 
performance.

Terje
PS. Merry Christmas everyone, we've already finished the main 
celebration here in Oslo.

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje
12/24/2016 9:36:43 PM
On Saturday, December 24, 2016 at 11:36:46 PM UTC+2, Terje Mathisen wrote:
> already5chosen@yahoo.com wrote:
> > On Friday, December 23, 2016 at 4:34:21 PM UTC+2, Terje Mathisen
> >> Testing for denormal and branching would be far slower unless you
> >> could guarantee effectively no denorms at all,
> >
> > I don't see why. According to my understanding, low-end Mill cores do
> > not have unusually long pipelines, so a mispredicted taken branch is
> > not too expensive. 8-10 clocks or something like that? But even 30-50
> > clocks would not really change the trade-offs. So, a branch appears to
> > make a lot of sense, unless you expect that tens of percent of your
> > inputs or outputs are subnormals.
> 
> Think SIMD with a single denorm among N-1 normal values: If you handle 
> this with test & branch then you need a SIMD test + a branch on any 
> non-zero + check of each value and scalar evaluation of any denorms. 
> This is so expensive that just 1% denorms would totally destroy your 
> performance.

First, 1% of subnormals is plenty. Typically they happen much more rarely, at least for DP. Yes, if your test vectors have uniformly distributed exponents, then you'll see subnormals at several percent, but such a distribution is not representative of the real world.

Second, I was under the impression that the low-end Mill has no 256-bit SIMD. So, for double precision, we are talking about N=2. I am inclined to think that if a branch is good for N=1 it's very likely good for N=2 too.


> 
> Terje
> PS. Merry Christmas everyone, we've already finished the main 
> celebration here in Oslo.
> 
> -- 
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

already5chosen
12/24/2016 10:13:59 PM
On 12/24/2016 2:13 PM, already5chosen@yahoo.com wrote:

> Second, I was under the impression that the low-end Mill has no 256-bit
> SIMD. So, for double precision, we are talking about N=2. I am inclined
> to think that if a branch is good for N=1 it's very likely good for N=2
> too.

The Mill member's width (number of slots, belt length; in general the 
number of concurrent operands), and its height (number of bytes per 
operand) are independently configurable specifications, as is the 
presence or absence of floating point or any other special-purpose 
functional unit. Thus a game-oriented configuration for Bethesda (which 
uses fixed-point in the engine I think) might have no floating point 
functional units yet be very high (128 byte). In the rare case where it 
would execute floating-point code the ops would be emulated, and might 
indeed be working on vectors of 16 doubles in parallel.

I grant you that this is an unlikely customer configuration, and in a 
conventional design flow model we should remove this option. However, 
the Mill does not use a conventional design flow model; we are 
specification driven, and the componentry, including software, should 
support whatever is specified.

Consequently we use the same piece of emulation code, whether it is 
float[1] (a.k.a scalar), quad[4], or double[16]. Yes, we instantiate for 
special cases, when we have to, but the architecture is designed to make 
that rarely necessary. The reliability and TTM advantages of using a 
single code outweigh any inefficiency in corner cases.

Or so we believe.
Ivan
12/25/2016 1:31:55 AM
On Sat, 24 Dec 2016 08:30:01 -0800 (PST), already5chosen@yahoo.com
wrote:

>What I don't understand is why the MSVC MT lib cannot implement 
>its Lock()/Unlock() as EnterCriticalSection()/LeaveCriticalSection().

The MT library *does* use critical sections.  The problem is that,
under contention, critical sections can devolve into semaphores.


See Critical Section Objects:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682530(v=vs.85).aspx

  :
On multiprocessor systems, if the critical section is unavailable, the
calling thread spins dwSpinCount times before performing a wait
operation on a semaphore that is associated with the critical section.
If the critical section becomes free during the spin operation, the
calling thread avoids the wait operation.
  :
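So if the stream lock really is contended, one mitigation (a sketch,
based on the same page) is to give the critical section a spin count so
that short holds never reach the kernel:

#include <windows.h>

CRITICAL_SECTION cs;

void init(void)
{
    /* Spin up to 4000 times in user mode before falling back to the
       kernel semaphore; 4000 is the figure the docs cite for the
       heap manager's critical section. */
    InitializeCriticalSectionAndSpinCount(&cs, 4000);
}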
 

George
George
12/25/2016 2:46:55 AM