No need to optimize in assembly anymore

Given that optimizing in assembly for one processor will have no effect (or
negative effect) on a different processor it seems that low level
optimization is becoming a waste of time.  As processors become more
sophisticated and diverse in the way they execute code this trend is likely
to continue.

Anyone agree?


DayStar.

0
Reply a 5/16/2004 6:37:44 PM

> Given that optimizing in assembly for one processor will have no effect
> (or negative effect) on a different processor it seems that low level
> optimization is becoming a waste of time.  As processors become more
> sophisticated and diverse in the way they execute code this trend is
> likely to continue.

Everything in our life is optimized for a particular entity. But
optimization rules for all P6 CPUs are almost the same. For P4, PM and
succeeding few models they might also be nearly the same. Each CPU family
has similar architecture.
Sometimes it's necessary to optimize for a specific CPU. A friend of mine
(he's a physician) asked me to write a small program that models planet
motion in a 2-star system (the famous 3-body problem). It had to run on
a... Pentium (133 MHz). Mathcad and other general math & engineering apps
run VERY slowly compared to special-case programs (and especially on such a
CPU) so I wrote my own ODE solver. I'm not a specialist in optimizing for P5
and I relied on the compiler's optimizer (VC 7.1). The code actually ran
faster with optimizations, and this app has never run on any CPU but P5
(and, of course, mine - but for testing purposes). Thus you can't say that
optimization is always undesirable.

Ivan


0
Reply Ivan 5/16/2004 11:06:37 PM


In comp.lang.asm.x86 Ivan Korotkov <koroNOSPAMtkov2@zteldot.ru> wrote:
>> Given that optimizing in assembly for one processor will have no effect
>> (or negative effect) on a different processor it seems that low level
>> optimization is becoming a waste of time.  As processors become more
>> sophisticated and diverse in the way they execute code this trend is
>> likely to continue.

> Everything in our life is optimized for a particular entity. But
> optimization rules for all P6 CPUs are almost the same. For P4, PM and
> succeeding few models they might also be nearly the same. Each CPU family
> has similar architecture.
> Sometimes it's necessary to optimize for a specific CPU. A friend of mine
> (he's a physician) asked me to write a small program that models planet
> motion in a 2-star system (the famous 3-body problem). It had to run on
> a... Pentium (133 MHz). Mathcad and other general math & engineering apps
> run VERY slowly compared to special-case programs (and especially on such
> a CPU) so I wrote my own ODE solver. I'm not a specialist in optimizing
> for P5 and I relied on the compiler's optimizer (VC 7.1). The code
> actually ran faster with optimizations, and this app has never run on any
> CPU but P5 (and, of course, mine - but for testing purposes). Thus you
> can't say that optimization is always undesirable.

> Ivan

I think you mean physicist, not physician. :-)

-- 
Paul Carter (former physicist)

0
Reply pacman128 5/16/2004 11:33:17 PM

>Reply to article by: "a" <a@a.com>
>Date written: Sun, 16 May 2004 18:37:44 +0000 (UTC)
>MsgID:<e4Npc.89$JL5.48@newsfe1-win>

>Given that optimizing in assembly for one processor will have no effect (or
>negative effect) on a different processor it seems that low level
>optimization is becoming a waste of time.  As processors become more
>sophisticated and diverse in the way they execute code this trend is likely
>to continue.

>Anyone agree?

Right on.

Of course, there is still lots of room for algorithm optimization.

The Sage

=============================================================
My Home Page :                http://members.cox.net/the.sage

"My friend plans to make a fortune with his invention. It's a
big metal box with a slot on one side and a sign that says
'How gullible are you? To find out, insert $50.'" -- COMEDY
COMES CLEAN, by Bill Jones
=============================================================

0
Reply The_Sage 5/17/2004 1:57:06 AM

On Sun, 16 May 2004 18:37:44 +0000 (UTC), "a" <a@a.com> wrote:

>
>Given that optimizing in assembly for one processor will have no effect (or
>negative effect) on a different processor it seems that low level
>optimization is becoming a waste of time.  As processors become more
>sophisticated and diverse in the way they execute code this trend is likely
>to continue.
>
>Anyone agree?

Low level optimization is only a waste of time if the cost in time and
effort is not worth the end result.  I'm sure few would argue that the
difference between 9ms and 10ms is significant enough to worry
about.  But if that difference scales linearly to hours... who would
argue that the difference between 9hrs and 10hrs is not worth the
effort to optimize?

Or... if a 3D renderer takes 9 minutes to render a complex scene,
as opposed to 10 minutes.  Weeelll... That may or may not matter to
Joe Somebody.  But, when we're talking *thousands* of frames to
be rendered... how much does a comparatively small difference need
to accumulate... before it doesn't look so small anymore?

In other words (and, as always) - it depends.

Something else to consider - if current technology runs out of steam
(which is inevitable), will the demand for performance increases also
just... stop?  Or will the suits be forced to look at the waste that's been
taking place all these years...

Jeff

http://www.jefftturner.com


0
Reply pH 5/17/2004 7:36:27 AM

"a" <a@a.com> wrote:

| Given that optimizing in assembly for one processor will have no effect (or
| negative effect) on a different processor it seems that low level
| optimization is becoming a waste of time.  As processors become more
| sophisticated and diverse in the way they execute code this trend is likely
| to continue.
|
| Anyone agree?

I see ASM as "the only" solution for hardware-near programming.
If you have ever disassembled a hw-driver DLL written in C or any other
HLL, you'll have seen the obvious difference.

Sure, hw-optimised code isn't portable from one chip to another.
So whenever a new CPU joins the market, some of us can make
money by writing new optimised code for it.

HLL source can be altered fast and cheaply, at the cost of performance.

But as HLL-code is portable and most compilers are written in
HLL anyway, the 'new,faster,smarter' CPU would just be slowed
down by detouring with HLL to make the programs work at all.

Finally Yes, who cares about optimised code?
The millions of office-clerks use the wasted
"wait for the mouse unfrozen"-time for a coffee-break anyway.
And the few who really need top speed in real time,
... are my clients already :)

__
wolfgang



0
Reply wolfgang 5/17/2004 1:06:44 PM

In alt.lang.asm wolfgang kern <nowhere@nevernet.at> wrote:
> I see ASM as "the only" solution for hardware-near programming.

Entirely true.  Look at the embedded market.  I think large
portions of cellphone, PDA, DVD player, etc programming is
done in ASM.  CPU is limited, and both speed and size matter
both for their impact on cost and battery life.  

For larger machines, known CPU-bound tasks get ASM coding:
MPEG4 encoding, DVD playback (especially using the GPU)
Size & CPU are generally much less important.

-- Robert

0
Reply Robert 5/17/2004 1:58:40 PM

"a" <a@a.com> wrote in message news:e4Npc.89$JL5.48@newsfe1-win...
>
> Given that optimizing in assembly for one processor will have no effect
> (or negative effect) on a different processor it seems that low level
> optimization is becoming a waste of time.  As processors become more
> sophisticated and diverse in the way they execute code this trend is
> likely to continue.
>
> Anyone agree?

Maybe when hardware performance leapfrogs software entropy. Maybe also when
compilers start generating reasonably efficient code.

Certain algorithms (e.g. extended-precision arithmetic) can't even be
implemented efficiently in most high level languages, in particular C/C++
and derivatives. A 128-bit add is 4 instructions in assembly. A compiler
that doesn't know what you are doing (basically all of them) can't even
compute the carry from one addition in 4 instructions.
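
To make the carry problem concrete, here is a minimal sketch of a 128-bit
add in portable C. The u128 type and the function names are made up for
this example. C has no carry flag, so each carry has to be recovered by
widening the partial sum to 64 bits; on 32-bit x86 the same work is one
ADD followed by three ADCs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned integer: four 32-bit limbs,
   least significant limb first. */
typedef struct { uint32_t limb[4]; } u128;

/* Adds two 128-bit values, wrapping on overflow.  The carry out of each
   limb is bit 32 of a 64-bit partial sum. */
u128 u128_add(u128 a, u128 b)
{
    u128 r;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sum = (uint64_t)a.limb[i] + b.limb[i] + carry;
        r.limb[i] = (uint32_t)sum;  /* low 32 bits of the partial sum */
        carry = sum >> 32;          /* 0 or 1, fed into the next limb */
    }
    return r;
}

/* Small self-check: the carry must ripple through two full limbs. */
void check_u128_add(void)
{
    u128 a = {{ 0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u }};
    u128 b = {{ 1u, 0u, 0u, 0u }};
    u128 r = u128_add(a, b);
    assert(r.limb[0] == 0u && r.limb[1] == 0u);
    assert(r.limb[2] == 1u && r.limb[3] == 0u);
}
```

Whether a given compiler turns the loop back into an ADD/ADC chain is not
guaranteed; that depends on the compiler and its settings.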

And anyway, your premise was incorrect. Optimizing for one processor does
not necessarily make others run more slowly. It is all relative to some
ideal; ideal code for one processor will never be ideal on another, but that
doesn't mean it is slower than the original baseline. With respect to the
difference between Pentium-IV and other x86 microprocessors, I have
encountered two schools of thought: 1) I'll optimize for whichever CPU is
running in my system 2) I'll optimize for whichever CPU I expect my
customers to be using (generally Pentium-IV since Intel dominates the
market).

-Matt

0
Reply Matt 5/17/2004 3:06:49 PM

"Robert Redelmeier" <redelm@ev1.net.invalid> wrote in message
news:Pg3qc.725$RH1.143@newssvr23.news.prodigy.com...
> In alt.lang.asm wolfgang kern <nowhere@nevernet.at> wrote:
> > I see ASM as "the only" solution for hardware-near programming.
>
> Entirely true.  Look at the embedded market.  I think large
> portions of cellphone, PDA, DVD player, etc programming is
> done in ASM.  CPU is limited, and both speed and size matter
> both for their impact on cost and battery life.
<snip>

AFAIK Java is in wide-spread use in embedded markets, even in cell phones.
How bizarre.

-Matt

0
Reply Matt 5/17/2004 3:25:03 PM

a wrote:
> Given that optimizing in assembly for one processor will have
> no effect (or negative effect) on a different processor it seems
> that low level optimization is becoming a waste of time.  As
> processors become more sophisticated and diverse in the way they
> execute code this trend is likely to continue.
> 
> Anyone agree?

Partially.  If you are talking about cycle counting then yes,
because these counts are non-deterministic (due to out-of-order
processing) and inconsistent across different processor
generations and manufacturers (due to different goals in
choosing hardware optimisations).

However, there are other forms of low level optimisation
which work well across many processors -- such as vectorising,
improving cache utilisation and branch elimination.  Current
HLL compilers are still not very effective at these operations.

Modern compilers are only beginning to utilise vectorising
(the Intel C Compiler is one of the forerunners in this
technology).  Improving cache utilisation requires a knowledge
of how the data structure is accessed -- this view can be
obscured by a HLL, though it is always readily apparent in
assembly.  Branch elimination (one of my favourites) only
gives improvements on processors using branch prediction and
relies on arranging the data to remove the need for conditional
branches -- while compilers often get the simple cases, I have
not seen one which can do complex eliminations.  E.g.

C code...

if( a < b ) c += e;
else        d += e;

Naive assembly...

     cmp eax, ebx       ; if( a < b ), unsigned compare
     jae .s             ; a >= b: take the else branch
       add ecx, esi     ; c += e;
       jmp .x
.s:    add edx, esi     ; d += e;
.x:

Branch elimination optimisation...

     cmp eax, ebx       ; CF = ( a < b ), unsigned compare
     sbb ebp, ebp       ; ebp = a < b ? -1 : 0
     and ebp, esi       ; create mask { p = a < b ? e : 0 }
     add ecx, ebp       ; do c += e if a < b
     add edx, esi       ; do d += e (may undo below)
     sub edx, ebp       ; undo d += e if a < b

I doubt many (if any) compilers would do this optimisation
-- but such an optimisation will produce an improvement
across _all_ processors.

And, of course, algorithmic and structural programming
optimisations are always important whatever programming
language you choose.

C
2004-05-17

0
Reply C 5/17/2004 6:23:44 PM

> I think you mean physicist, not physician. :-)

Oops, sorry, didn't know there're two similar words :)

Ivan


0
Reply Ivan 5/17/2004 6:23:47 PM

Matt Taylor wrote:
> 
> AFAIK Java is in wide-spread use in embedded markets, even in cell phones.
> How bizarre.

Why is that bizarre?  That was what Java was originally intended for.

0
Reply T 5/17/2004 6:23:50 PM

Robert Redelmeier wrote:

> In alt.lang.asm wolfgang kern <nowhere@nevernet.at> wrote:
>> I see ASM as "the only" solution for hardware-near programming.
> 
> Entirely true.  Look at the embedded market.  I think large
> portions of cellphone, PDA, DVD player, etc programming is
> done in ASM.  CPU is limited, and both speed and size matter
> both for their impact on cost and battery life.

Yes, nothing is more annoying than a cellphone that's 1 or 2 seconds behind
you when you're entering phone numbers and so on. If you ever get to use a
Windows-based cell phone you'll see what I mean.
Eric

> 
> For larger machines, known CPU-bound tasks get ASM coding:
> MPEG4 encoding, DVD playback (especially using the GPU)
> Size & CPU are generally much less important.
> 
> -- Robert

0
Reply Eric 5/17/2004 7:52:23 PM

pH wrote:

> On Sun, 16 May 2004 18:37:44 +0000 (UTC), "a" <a@a.com> wrote:
> 
>>
>>Given that optimizing in assembly for one processor will have no effect
>>(or negative effect) on a different processor it seems that low level
>>optimization is becoming a waste of time.  As processors become more
>>sophisticated and diverse in the way they execute code this trend is
>>likely to continue.
>>
>>Anyone agree?
> 
> Low level optimization is only a waste of time if the cost in time and
> effort is not worth the end result.  I'm sure few would argue that the
> difference between 9ms and 10ms is significant enough to worry
> about.  But if that difference scales linearly to hours... who would
> argue that the difference between 9hrs and 10hrs is not worth the
> effort to optimize?
> 
> Or... if a 3D renderer takes 9 minutes to render a complex scene,
> as opposed to 10 minutes.  Weeelll... That may or may not matter to
> Joe Somebody.  But, when we're talking *thousands* of frames to
> be rendered... how much does a comparatively small difference need
> to accumulate... before it doesn't look so small anymore?
> 
> In other words (and, as always) - it depends.
> 
> Something else to consider - if current technology runs out of steam
> (which is inevitable), will the demand for performance increases also
> just... stop?  Or will the suits be forced to look at the waste that's
> been taking place all these years...
> 
> Jeff
> 
> http://www.jefftturner.com

And (how could I forget this one?): simulating new CPU designs before
first silicon.
Eric

0
Reply Eric 5/17/2004 7:52:26 PM

pH wrote:

> On Sun, 16 May 2004 18:37:44 +0000 (UTC), "a" <a@a.com> wrote:
> 
>>
>>Given that optimizing in assembly for one processor will have no effect
>>(or negative effect) on a different processor it seems that low level
>>optimization is becoming a waste of time.  As processors become more
>>sophisticated and diverse in the way they execute code this trend is
>>likely to continue.
>>
>>Anyone agree?
> 
> Low level optimization is only a waste of time if the cost in time and
> effort is not worth the end result.  I'm sure few would argue that the
> difference between 9ms and 10ms is significant enough to worry
> about.  But if that difference scales linearly to hours... who would
> argue that the difference between 9hrs and 10hrs is not worth the
> effort to optimize?
> 
> Or... if a 3D renderer takes 9 minutes to render a complex scene,
> as opposed to 10 minutes.  Weeelll... That may or may not matter to
> Joe Somebody.  But, when we're talking *thousands* of frames to
> be rendered... how much does a comparatively small difference need
> to accumulate... before it doesn't look so small anymore?
> 
> In other words (and, as always) - it depends.
> 
> Something else to consider - if current technology runs out of steam
> (which is inevitable), will the demand for performance increases also
> just... stop?  Or will the suits be forced to look at the waste that's
> been taking place all these years...
> 
> Jeff
> 
> http://www.jefftturner.com

Yes, and in that light, a few microseconds can make a world of difference
when you're modeling the behavior of things such as the weather, solar
system evolution, nuclear weapons, stars, galaxy interactions to name just
a few, that require very large numbers of iterations over the same code.
Eric

0
Reply Eric 5/17/2004 7:52:30 PM

Robert Redelmeier wrote:

> In alt.lang.asm wolfgang kern <nowhere@nevernet.at> wrote:
> 
>>I see ASM as "the only" solution for hardware-near programming.
> 
> 
> Entirely true.  Look at the embedded market.  I think large
> portions of cellphone, PDA, DVD player, etc programming is
> done in ASM.  CPU is limited, and both speed and size matter
> both for their impact on cost and battery life.  
> 
> For larger machines, known CPU-bound tasks get ASM coding:
> MPEG4 encoding, DVD playback (especially using the GPU)
> Size & CPU are generally much less important.
> 
> -- Robert
> 

A lot of embedded work is still 16 bit, or even 8 bit, for which
half decent compilers were never available.

0
Reply Scott 5/17/2004 7:52:40 PM

pH wrote:

> Low level optimization is only a waste of time if the cost in time and
> effort is not worth the end result.  I'm sure few would argue that the
> difference between 9ms and 10ms is significant enough to worry
> about.  But if that difference scales linearly to hours... who would
> argue that the difference between 9hrs and 10hrs is not worth the
> effort to optimize?

Back to basics. If you have a program that takes time N per run, that you
plan to run K times, and optimizing would cut the run time to N', then the
time required to optimize it, O (as in your own time), needs to be less
than (N-N')*K, i.e., the total time you save by speeding up the program
must be greater than the time you take optimizing it. Otherwise you could
accomplish the same goal by simply performing longer runs.

This applies to batch mode problems only. Realtime problems, such as getting
all elements drawn within a frame time, are "deadline" related and must run
within a certain period of time.
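
The batch-mode rule above (time saved over all runs must exceed the time
spent optimizing) can be written down as a small check. This is only a
sketch: the function and parameter names are invented for illustration, and
the sample figures (a 10-minute run cut to 9 minutes, one day of effort)
are assumptions, not measurements.

```c
#include <assert.h>

/* All times are in minutes.
   run_before / run_after: duration of one run before and after optimizing
   runs:                   how many times the program will be run
   effort:                 time spent doing the optimization */
int optimization_pays_off(double run_before, double run_after,
                          double runs, double effort)
{
    double saved = (run_before - run_after) * runs;
    return saved > effort;
}

/* Sample figures: one day of effort (1440 minutes) saves 1 minute per
   run, so the effort is only recovered after more than 1440 runs. */
void check_break_even(void)
{
    assert(!optimization_pays_off(10.0, 9.0, 1000.0, 1440.0));
    assert(optimization_pays_off(10.0, 9.0, 2000.0, 1440.0));
}
```

The check says nothing about deadline-driven code, where the run either
fits the time budget or it does not.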

0
Reply Scott 5/17/2004 7:53:07 PM

"C" <cc-news@hermes.mirlex.com> wrote in message
news:zb2qc.88$uu2.24@newsfe2-gui.server.ntli.net...
> a wrote:
> > Given that optimizing in assembly for one processor will have
> > no effect (or negative effect) on a different processor it seems
> > that low level optimization is becoming a waste of time.  As
> > processors become more sophisticated and diverse in the way they
> > execute code this trend is likely to continue.
> >
> > Anyone agree?
>
> Partially.  If you are talking about cycle counting then yes,
> because these counts are non-deterministic (due to out-of-order
> processing) and inconsistent across different processor
> generations and manufacturers (due to different goals in
> choosing hardware optimisations).
<snip>

Cycle-counting isn't quite that non-deterministic. The code usually falls
into the same cadence regardless of the initial state upon entry due to
dependencies. Cache misses are unpredictable, but there isn't really
anything you can do at that level to avoid them.

Code scheduling tends to improve performance across all architectures. Even
heavily pipelined machines like the Pentium-IV with massive capacity for
in-flight ops see improvement when poorly-scheduled code is optimized in
this fashion. Out-of-order processing helps to hide the differences between
CPUs, but it doesn't make a very good crutch.

Cycle-counting is also useful since most modern processors have similar
weaknesses and strengths. Multiplies & shifts are a classic example; convert
a constant divide to a constant multiply, and some constant multiplies will
convert into shifts. Pentium-IV is a little bit different, but otherwise x86
processors generally favor the same simple operations.
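
A small sketch of the two rewrites mentioned above, in C. The function
names are made up for the example. The constant 0xCCCCCCCD is
ceil(2^35 / 10), the usual reciprocal for an unsigned 32-bit divide by 10;
a compiler will often do this for you when the divisor is a compile-time
constant, so treat this as an illustration of the transformation rather
than something to hand-write everywhere.

```c
#include <assert.h>
#include <stdint.h>

/* x / 10 without a divide instruction: multiply by ceil(2^35 / 10) in
   64-bit arithmetic, then shift the product right by 35. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x * 8 as a shift; the result wraps modulo 2^32 exactly as the
   multiply would. */
uint32_t mul8(uint32_t x)
{
    return x << 3;
}

/* Compare both rewrites against the plain operators on a few values,
   including the extremes. */
void check_strength_reduction(void)
{
    const uint32_t samples[] = { 0u, 9u, 10u, 99u, 1000000007u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(div10(samples[i]) == samples[i] / 10u);
        assert(mul8(samples[i]) == samples[i] * 8u);
    }
}
```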

-Matt

0
Reply Matt 5/17/2004 9:05:50 PM

Matt Taylor wrote:
> "C" <cc-news@hermes.mirlex.com> wrote in message
> news:zb2qc.88$uu2.24@newsfe2-gui.server.ntli.net...
> 
>>Partially.  If you are talking about cycle counting then yes,
>>because these counts are non-deterministic (due to out-of-order
>>processing) and inconsistent across different processor
>>generations and manufacturers (due to different goals in
>>choosing hardware optimisations).
> 
> <snip>
> 
> Cycle-counting isn't quite that non-deterministic. The code usually
> falls into the same cadence regardless of the initial state upon entry
> due to dependencies.

You are correct: I oversimplified there.  Though for most purposes,
especially when considering other or future processors, the exact
time for a given sequence is too difficult to determine accurately.
Essentially, though one can get a rough idea of how well a piece of
code will perform vs. a similar piece, it is often little more than
an educated guess -- especially if one is considering whether a
minor replacement / reshuffle will produce an improvement.

> Cache misses are unpredictable, but there isn't really
> anything you can do at that level to avoid them.

Yes, though we are not totally helpless there either, provided
the algorithm requires multiple accesses to memory.  In these
cases you can try to localise memory accesses to increase the
probability of a cache hit.  This is normally most effective
in loops which process large amounts of data, for example an
implementation of the FFT.  Aligning data can likewise improve
the probability of a cache hit when multiple accesses must be
made to the same data structure.
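
As a rough illustration of localising memory accesses, here is a sketch in
C that sums the same matrix in two traversal orders. The names and the
256x256 size are arbitrary choices for the example. Both functions return
the same value; the first walks memory in address order, which is usually
the cache-friendly one, but how large the gap is (or whether the compiler
swaps the loops itself) depends on the machine and the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* C stores a 2-D array row by row, so this loop nest touches
   consecutive addresses. */
long sum_row_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += m[r][c];
    return total;
}

/* Same sum, but each inner-loop step jumps a whole row ahead
   (COLS * sizeof(int) bytes). */
long sum_col_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += m[r][c];
    return total;
}

/* The traversal order must not change the result, only the access
   pattern. */
void check_traversals(void)
{
    static int m[ROWS][COLS];
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            m[r][c] = (int)(r + c);
    assert(sum_row_major(m) == sum_col_major(m));
}
```

Timing the two functions on the target machine is the only reliable way to
see how much the access pattern matters there.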

> Code scheduling tends to improve performance across all architectures.
> Even heavily pipelined machines like the Pentium-IV with massive
> capacity for in-flight ops see improvement when poorly-scheduled
> code is optimized in this fashion.

Yes, the P4 trace cache is an interesting concept -- it is a
pity other hardware constraints (such as it only having a
single decoder for input, or the slow shifts) reduced its
effectiveness and therefore the potency of the processor
overall.

> Out-of-order processing helps to hide the differences between
> CPUs, but it doesn't make a very good crutch.

True, this, of course, being due to the limited look ahead in OoO
hardware.  Though there is a big difference between unoptimised
code and well optimised code, OoO does reduce the difference
between lightly optimised and heavily optimised code considerably.
Indeed, having OoO hardware is _definitely_ not an excuse to
completely avoid optimisation, only to avoid squeezing every last
cycle out of the code. (That style of optimisation, while
taking the most programmer effort, is the one whose gains the
OoO hardware does most to erase.)  [Though in some cases, such as
heavily used inner loops, such optimisations may be justified
despite their inherent non-portability.]

> Cycle-counting is also useful since most modern processors have similar
> weaknesses and strengths. Multiplies & shifts are a classic example;
> convert a constant divide to a constant multiply, and some constant
> multiplies will convert into shifts. Pentium-IV is a little bit
> different, but otherwise x86 processors generally favor the same
> simple operations.

Yes, that was an aspect I did not address in my post, primarily
because compilers and some assemblers will do this automatically.
And I guess that doing strength reduction has become so second
nature to me, even in HLL programming, that I no longer realise I
am doing it at all :-)

I also failed to mention those algorithmic implementations which
are only apparent / possible in assembler, such as extended precision
arithmetic.  (Something I note you address elsewhere in this thread
and indeed, I too have discussed it recently [alt.lang.asm].)  Well,
'tis either miss a few details or do a Beth style 'postius maximus'.
:-)

Anyway, I think that the replies here have mostly disproved the
original poster's hypothesis.  Since more reasons have been given
than I would have thought of immediately, this is turning into a
fairly interesting thread.

C
2004-05-18

0
Reply C 5/17/2004 10:54:31 PM

C wrote:
 
 
> C code...
> 
> if( a < b ) c += e;
> else        d += e;
 
> Branch elimination optimisation...
> 
>      cmp eax, ebx       ; CF = ( a < b ), unsigned compare
>      sbb ebp, ebp       ; ebp = a < b ? -1 : 0
>      and ebp, esi       ; create mask { p = a < b ? e : 0 }
>      add ecx, ebp       ; do c += e if a < b
>      add edx, esi       ; do d += e (may undo below)
>      sub edx, ebp       ; undo d += e if a < b
> 
> I doubt many (if any) compilers would do this optimisation
> -- but such an optimisation will produce an improvement
> across _all_ processors.

That's not a problem of the compiler. If you don't want a 
conditional branch use different C code:

c+=  (e&(-(a<b)));
d+=e-(e&(-(a<b)));

The difference in the generated code is that "sbb" has
to be replaced by "xor", "setl" and "neg", because there
are no flags in C:

; 8    : c+=  (e&(-(a<b)));

  00000 8b 0d 00 00 00
        00               mov     ecx, DWORD PTR _a
  00006 8b 15 00 00 00
        00               mov     edx, DWORD PTR _b
  0000c 33 c0            xor     eax, eax
  0000e 3b ca            cmp     ecx, edx
  00010 8b 0d 00 00 00
        00               mov     ecx, DWORD PTR _e
  00016 8b 15 00 00 00
        00               mov     edx, DWORD PTR _c
  0001c 0f 9c c0         setl    al
  0001f f7 d8            neg     eax
  00021 23 c1            and     eax, ecx
  00023 03 d0            add     edx, eax

; 9    : d+=e-(e&(-(a<b)));

  00025 2b c8            sub     ecx, eax
  00027 a1 00 00 00 00   mov     eax, DWORD PTR _d
  0002c 89 15 00 00 00
        00               mov     DWORD PTR _c, edx
  00032 03 c1            add     eax, ecx
  00034 a3 00 00 00 00   mov     DWORD PTR _d, eax
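
A quick way to convince yourself that the mask form computes the same thing
as the if/else is to run both side by side. The sketch below does that in
C; the function names are invented for the example, and c, d and e are made
unsigned here so that wrap-around is well defined (an assumption, since the
original snippet does not show its declarations). Whether the compiler
emits branch-free code for it is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: the original if/else. */
void update_branchy(int32_t a, int32_t b, uint32_t e,
                    uint32_t *c, uint32_t *d)
{
    if (a < b) *c += e;
    else       *d += e;
}

/* Mask version: mask is all ones when a < b and zero otherwise, so
   exactly one of c and d receives e. */
void update_branchless(int32_t a, int32_t b, uint32_t e,
                       uint32_t *c, uint32_t *d)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);
    *c += e & mask;
    *d += e - (e & mask);
}

/* Run both versions over every pair of a few boundary values. */
void check_equivalence(void)
{
    const int32_t vals[] = { INT32_MIN, -1, 0, 1, INT32_MAX };
    const int n = (int)(sizeof vals / sizeof vals[0]);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            uint32_t c1 = 100u, d1 = 200u, c2 = 100u, d2 = 200u;
            update_branchy(vals[i], vals[j], 7u, &c1, &d1);
            update_branchless(vals[i], vals[j], 7u, &c2, &d2);
            assert(c1 == c2 && d1 == d2);
        }
    }
}
```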

0
Reply Herbert 5/17/2004 10:54:34 PM

"T.M. Sommers" <tms@nj.net> wrote in message
news:jrycnUUwiO4lbzXdUSdV9g@telcove.net...
> Matt Taylor wrote:
> >
> > AFAIK Java is in wide-spread use in embedded markets, even in cell
> > phones.  How bizarre.
>
> Why is that bizarre?  That was what Java was originally intended for.

I know it is possible to generate native binaries, but I've not seen much
development in that area; gcj is the only compiler I know of that will do
it, and its native binaries aren't much better than those translated at
runtime. I would expect that Java is usually interpreted on these systems
since runtime translation has a large memory footprint. The interpreter is
quite slow already, and it's being run on systems that don't have a lot of
horsepower.

The other problem is that until Java 1.5, Java does not support generics
(its counterpart to templates). Since the standard data structures hold
only objects, the programmer must use object wrappers for primitives. The
object wrappers add a lot of memory overhead. Java is not known for having
a light footprint.

Another charming aspect of Java is its complete uselessness for real-time
programming. While this isn't a big deal for cell phones (since, as I
gather, the Java apps are typically interactive games), it is a big deal for
other devices. I suppose it is possible to write real-time Java programs if
one is clever with respect to garbage collection, but it seems easier to me
to simply use a non-garbage-collected language.

-Matt

0
Reply Matt 5/18/2004 5:19:20 AM

Scott Moore wrote:
> Back to basics. If you have a program that takes time N per run, that
> you plan to run K times, and optimizing would cut the run time to N',
> then the time required to optimize it, O (as in your own time), needs
> to be less than (N-N')*K, i.e., the total time you save by speeding up
> the program must be greater than the time you take optimizing it.
> Otherwise you could accomplish the same goal by simply performing
> longer runs.

...on the presumption that _only_ you are making use of this
software, of course...and on the presumption you're only running
the program once...

Because, if you have thousands or millions of customers, then
the _TOTAL_ time you save by speeding up the program is
_multiplied_ by all of the users of your software...

So, if I reduce the 10 minutes down to 9 minutes - per the
example - then I save 1 minute..._PER RUN_...this minute is also
saved for all, ooh, ten thousand users of my rather popular
little program...and, during its lifetime, each of those users
_on average_ (some more, some less, of course ;) might run the
program a couple of hundred times...

The _TOTAL_ run-time saved is thus approximately: 1 minute *
10,000 * 200 = 2 million minutes...which is around 1,388 days or
roughly 3 and a bit years...
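
The arithmetic is easy to check with a few lines of C. The function names
are invented for this sketch; the figures are the ones from the example
above (1 minute saved per run, 10,000 users, 200 runs each).

```c
#include <assert.h>

/* Total run time saved across all users, in minutes. */
long total_minutes_saved(long minutes_per_run, long users,
                         long runs_per_user)
{
    return minutes_per_run * users * runs_per_user;
}

/* Whole days contained in a number of minutes (24 * 60 = 1440 per day). */
long minutes_to_whole_days(long minutes)
{
    return minutes / (24 * 60);
}

/* The figures from the post: 2 million minutes, i.e. 1,388 whole days. */
void check_savings(void)
{
    long saved = total_minutes_saved(1, 10000, 200);
    assert(saved == 2000000);
    assert(minutes_to_whole_days(saved) == 1388);
}
```

At 365 days to the year, 1,388 days comes to about 3.8 years.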

On the other hand, you're writing a program for _yourself only_
which you intend to only run just _once_ before throwing it forever
into the trashcan?

And you're preaching to us about wasting our time?

I took a day to optimise the program from 10 to 9 minutes, let's
say...the TOTAL time saved from this is roughly 3 to 4 years...

You didn't take that day so you get 24 hours back...mind you,
you lose 1 minute...but, okay, if this program is for _only_ you
and you're _only_ going to run it once then this makes complete
sense, trade in 1 minute to get back 24 hours...just wondering
how many pieces of software we're talking about here fall into
this category?

If it ain't for only you and you're running it more than
once...well, first, you lose that minute every time you run the
program...and you lose it for every user...if the total number
of runs across all those users exceeds 1,440 (144 users running
it 10 times each? 10 users running it 144 times each? ;) then
that's your precious 24 hours swallowed right back up...

Or were you only thinking of _developer time_?

Just writing off all the time wasted out in the "real world"
being used by your users because, at that time, you've signed it
off and collected your paycheque?

Yeah, that wouldn't be unusual...I've got a hard drive full of
bloated, slow crap that's just the same...developers only
concerned for their own time and what they save themselves to get
"time to market" and all that...consequently, users have to
spend money on hard drives bigger than they need to be, CPUs
faster than they need to be, RAM sizes bigger than they need to
be, etc....across _MILLIONS_ of machines out there...we're not
just talking about an astronomically large amount of time wasted
but also a similarly gargantuan amount of money to pay for all
this "supercomputer" equipment...

There's even more costs attached...think about Microsoft's
general profile with the public...their reputation for BSODs,
their reputation for gargantuan bloatware, for needing to sit
around all the time for it to go anywhere...that's an awful lot
of "PR" cost with the public to be known as _CRAP_...to be
synonymous in their eyes with "complete amateur"...think of all
that cost in "word of mouth"...think of all that cost in
_SALES_...

But, indeed, 24 hours of developer time is so precious in
comparison, isn't it? I think NOT...

You doubt that these "costs" I speak of actually exist (because
your "textbook" does not refer to them)?

Well, I cite the fact that this topic is being discussed _at
all_ (if there weren't a problem, we wouldn't even be
considering talking about it, as the thought would not enter our
minds)...I cite all the conversations all over the world
regarding the inefficient nature of software...I cite the fact
that the term "Bloatware" even exists (if software was
acceptable like this, then a term of insult would not have been
coined to refer to it in contemptuous tones)...I cite a simple
comparison of older software to newer software, measured against
the hardware improvements that have happened (a _negative_
measure...software is simply _getting worse_ all the time)...

I would also cite something like Carmack writing his 3D games,
_ALWAYS_ re-writing a brand new game engine for every new game
release...I mean, what a waste of "developer time", eh? Hasn't
he heard of "code re-use" and "library code"? On the other hand,
he's stupidly rich, building his own spaceship for _fun_...sales
of his games - which are always literally _STATE OF THE ART_
because he does not sit on his fat arse re-using code and living
on past glories but always takes his time to always keep his
game engines at the very cutting edge of technology, as and when
it appears - don't compare to any other game on the
planet...DOOM was estimated to have been installed on _15
million_ machines...

I would also question why programmers these days are so
desperate to find any excuse not to actually do any programming?

Or is this kind of the point here, maybe? Not people who are
doing it for the Love of the art at all - people who hate
programming so much that any and every opportunity NOT to do
any programming is taken in a heartbeat and defended as "good
practice" so that they can get away with doing it next time
too - but people who showed up in the "programmer for hire"
column simply because they heard from a friend that: "that's
where all the money is...if you want to be rich, be a computer
programmer or something"...and that there has been some "spill
over" from the "get rich quick" wannabes, who have no
appreciation for the art whatsoever, infecting those who might
have some Love for programming, because they are surrounded by
these "get rich quickers" who just want the money without the
work and who repeat mantras over and over about how "money for
nothing" _IS_ the right way to go about living one's life...

Okay, I can accept that such people exist and probably abound in
this industry...but, sorry, I'm not one of them and I'm not
going to defend _bad work_...I'm not going to defend people
who've not learnt about _service to their customers_...I'm not
going to defend _money for nothing_ attitudes...

Much like Bush and Blair (who could just say "look, we really
did do this with the noblest of intentions at first...you know,
kick out a dictator, install a democracy...a brilliant little
plan there...but, fair enough, it's clear now that we've screwed
up big-time in some serious and important places in actually
doing what we intended to do originally" rather than all that
"I've never made a mistake in my life" / "There are WMD...just
wait until 2015 for us to find them" / "nothing has ever gone
wrong, nothing ever will go wrong...I refuse, refuse, refuse to
hear that there could be any problems ever with anything I've
ever done or will do" DENIAL after DENIAL), I'd far prefer it if
people were just _honest_ about things...you know, if it's how
you feel then just say: "Oh, that's too much work for me...I'm
going to do it this other way which makes less work for
me...it's a touch crap, slow and bloated like this, mind you,
but it'll do...I mean, users are used to such crap so I'll get
away with that and save myself a whole lot of effort...probably
take the afternoon off, in fact...get in some golf practice,
perhaps"...that's honest and people might not agree but everyone
can work with that, understand that and no-one's under any false
illusions about what's going on...

But it's false, dangerous and divisive to have this kind of
thing about: "no, really...writing no code at all is the _BEST_
way to do things...everyone should learn how not to do any
programming at all...everyone use HLLs! Don't need to understand
a blessed thing about the machine to do that, you see...you can
get away with the least effort possible and that's a 'Good
Thing'"...

I mean, by this kind of logic, the very best programmers press
one key...the power switch...and off goes the machine and they
never switch it on ever again...wow! Think of all that
"developer time" saved!

Repeat the mantra: "no such thing as a free lunch, no such thing
as a free lunch, no such thing as a..." ;)

Beth :)

Reply Beth 5/18/2004 5:52:47 AM

Robert Redelmeier <redelm@ev1.net.invalid> wrote:

>In alt.lang.asm wolfgang kern <nowhere@nevernet.at> wrote:
>> I see ASM as "the only" solution for hardware-near programming.
>
>Entirely true.  Look at the embedded market.  I think large
>portions of cellphone, PDA, DVD player, etc programming is
>done in ASM. 

You think wrong.  Most of the code in the typical cell phone, PDA, or DVD
player is C or C++.  The innermost loops sometimes use assembler
optimizations, but the compilers for those embedded chips are very good at
what they do.

>For larger machines, known CPU-bound task get ASM coding:
>MPEG4 encoding, DVD playback (especially using the GPU)

MPEG-4 encoding is far too complicated to be done in assembler, except for
the innermost loops.  The code would be incomprehensible.[1]  It is almost
too complicated to be done in C.

[1] I apologize in advance for firing up this debate yet again.
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.

Reply Tim 5/18/2004 6:24:39 AM

On 2004-05-18, Matt Taylor <para@tampabay.rr.com> wrote:
> The other problem is that until Java 1.5, Java does not support templates.
> Since no templated data structures exist, the programmer must use object
> wrappers for primitives. The object wrappers add a lot of memory overhead.

Note that templates (generics) in Java 1.5 are implemented without
changing the VM, so integers etc. will still be converted to objects and
the memory overhead will be the same as doing things manually now.

lupus

-- 
-----------------------------------------------------------------
lupus@debian.org                                     debian/rules
lupus@ximian.com                             Monkeys do it better

Reply Paolo 5/18/2004 6:25:13 PM

"C" <cc-news@hermes.mirlex.com> wrote in message
news:Nxbqc.189$LH3.14@newsfe1-win...
> Matt Taylor wrote:
<snip>
> > Cache misses are unpredictable, but there isn't really
> > anything you can do at that level to avoid them.
>
> Yes, though we are not totally hopeless there either, provided
> the algorithm requires multiple accesses to memory.  In these
> cases you can try to localise memory accesses to increase the
> probability of a cache hit.  This is normally most effective
> in loops which process large amounts of data, for example an
> implementation of the FFT.  Similarly, aligning data can have
> a similar effect on improving the probability of a cache hit
> when multiple accesses must be made to the same data structure.
<snip>

Yeah, but that has more to do with data organization than optimizing in
assembly. One usually runs into these problems when learning assembly,
but the same optimizations can be applied in C/C++.
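
As a rough illustration of that point, here is a minimal C++ sketch. The function names and the flat row-major layout are made up for this example, not taken from anyone's code in the thread. Both functions compute the same sum; the first walks memory sequentially, the second jumps a full row on every step. How much that matters depends on the cache and the matrix size, so it should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: a rows x cols matrix stored row-major in one flat vector.

// Sequential traversal: consecutive iterations touch adjacent elements.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Strided traversal: same result, but each step skips ahead by cols elements.
long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Nothing here needs assembly: the only change between the two versions is the loop order, i.e. how the data is walked.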

-Matt

Reply Matt 5/18/2004 6:26:28 PM

"Beth" <BethStone21@hotmail.NOSPICEDHAM.com> wrote in message
news:Hkiqc.7$%R1.3@newsfe2-gui.server.ntli.net...
> Scott Moore wrote:
> > Back to basics. If you have program N that you plan to run K times, then
> > the time required to optimize that, O (as in your own time), needs to be
> > less than (N-O)*K, ie., the total time you save by speeding up the program
> > must be greater than the time you take optimizing it. Otherwise you could
> > accomplish the same goal by simply performing longer runs.
>
> ...on the presumption that _only_ you are making use of this
> software, of course...and on the presumption you're only running
> the program once...
>
> Because, if you have thousands or millions of customers, then
> the _TOTAL_ time you save by speeding up the program is
> _multiplied_ by all of the users of your software...

That's what the K factor is for, although the formula should be NK-O since
optimization work is not applied iteratively.
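
To make the NK-O arithmetic concrete, here is a minimal C++ sketch. It assumes one reading of the symbols, since the posts do not define them precisely: n is the time saved per run, k is the total number of runs across all users, and o is the developer time spent optimizing, all in seconds. The function names are made up for illustration.

```cpp
#include <cassert>

// Assumed reading of the symbols (the thread does not pin them down):
//   n = seconds saved per run
//   k = total number of runs, summed over all users
//   o = seconds of developer time spent optimizing
double net_seconds_saved(double n, double k, double o) {
    assert(n >= 0 && k >= 0 && o >= 0);
    return n * k - o;   // the optimization cost is paid once, not per run
}

// Under this model, optimizing pays off when the net saving is positive.
bool worth_optimizing(double n, double k, double o) {
    return net_seconds_saved(n, k, o) > 0.0;
}
```

For example, with n = 0.5 s, k = 1,000,000 runs and o = 86,400 s (24 hours of developer time), the net saving is 413,600 s. With n = 1 s, k = 10 runs and o = 3,600 s it is negative, so the model says not to bother.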

It would, however, be more appropriate to factor in cost. I've never heard
of anyone looking at the problem with this formula. More typically the cost
of optimizing determines minimum requirements, and those are always set high
to reduce the amount of effort necessary to optimize.

<snip>
> Yeah, that wouldn't be unusual...I've got a hard drive full of
> bloated, slow crap that's just the same...developers only
> concern for their own time and what they save themselves to get
> "time to market" and all that...consequently, users have to
> spend money on hard drives bigger than they need to be, CPUs
> faster than they need to be, RAM sizes bigger than they need to
> be, etc....across _MILLIONS_ of machines out there...we're not
> just talking about an astronomically large amount of time wasted
> but also a similarly gargantuan amount of money to pay for all
> this "supercomputer" equipment...

In other words, software is what keeps Intel in business. My sentiments are
quite similar.

Project schedules are the root of all evil. Universities and their crappy CS
curricula are the other root of all evil. As I posted in another thread, I
have seen a number of my peers and coworkers do silly things like:

// ...
char *buf = new char[256];
FILE *fp;
sprintf(buf, "%s%s", path, filename);
fp = fopen(buf, "rb");
delete[] buf;
// ...

While project managers make the decision to ship crappy software instead of
spending time making it better, developers have their role to play as well.
It would be nice if compilers trapped these sorts of mistakes, but alas.

<snip>
> I would also question why programmers these days are so
> desperate to find any excuse not to actually do any programming?
<snip>

While it is easy to criticize this behavior, there are practical concerns.
Heavily used code (e.g. msvcrt.dll -- the C runtime library) is also heavily
tested code. Constantly rewriting code can introduce and reintroduce bugs.
You go on to cite Carmack's work; however, he isn't highly respected for his
code being bug-free and secure. The Quake server has been known to have
buffer overrun exploits in it. (Admittedly abuse of software reuse leads to
similar problems; however, I'm only pointing out the existence of a happy
medium.)

A lot of code does not need to be rewritten if it is written well the first
time. There is generally no sense in rewriting code such as STL's std::sort.
The same is mostly true of containers; however, container classes are
ridiculously inefficient with data structures whose size is known or
bounded. Replacing containers in these situations always leads to orders of
magnitude of improvement. In one case I know of, code was sped up by a
factor of 20,000x in 30 minutes of developer time simply by replacing
containers with stack allocation.
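
A minimal sketch of the kind of replacement meant here, assuming the size bound is known at compile time. The names and the bound kMax are invented for this example, and it makes no claim about how large the difference is in any real program; that has to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Heap-backed version: std::vector asks the allocator for storage as it grows.
int sum_of_squares_vector(std::size_t n) {
    std::vector<int> squares;
    for (std::size_t i = 0; i < n; ++i) {
        int v = static_cast<int>(i) + 1;
        squares.push_back(v * v);
    }
    int total = 0;
    for (int v : squares) total += v;
    return total;
}

// Stack-backed version for a known upper bound (kMax is a made-up limit).
constexpr std::size_t kMax = 16;

int sum_of_squares_stack(std::size_t n) {
    assert(n <= kMax);                  // the bound must really hold
    std::array<int, kMax> squares{};    // fixed-size storage, no allocator call
    for (std::size_t i = 0; i < n; ++i) {
        int v = static_cast<int>(i) + 1;
        squares[i] = v * v;
    }
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += squares[i];
    return total;
}
```

The trade-off is that the stack version only works when the bound is genuinely known; the assert documents that assumption instead of silently overrunning the array.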

On that note, it is interesting to me that Microsoft claims .NET programs
run at 70% of native performance. Although I think the 70% figure is a
stretch of the imagination, they do achieve fairly good performance. I think
that is largely because .NET has fast memory allocation, and malloc() is
slow. That says quite a lot about software.

-Matt

Reply Matt 5/19/2004 7:23:24 AM

pH wrote:

> Something else to consider - if current technology runs out of steam
> (which is inevitable), will the demand for performance increases also
> just... stop?  Or will the suits be forced to look at the waste that's been
> taking place all these years...
> 
> Jeff
> 
> http://www.jefftturner.com
> 

No it won't. The only reason we continue to invest in process improvements
to the exclusion of all else is that this is a cheap way to increase performance
that does not involve any new thinking. If process/technology improvements
were to stop tomorrow, we would have our hands full for the foreseeable future
with parallelism speedups. In fact, the speed increases we get from process
improvements are probably retarding parallel processing research quite a bit.

Reply Scott 5/19/2004 9:04:12 AM

"Scott Moore" <samiam@moorecad.com> wrote in message
news:tfEqc.115093$Ik.9534698@attbi_s53...
> pH wrote:
[snip]
>
> No it won't. The only reason we continue to invest in process improvements
> to the exclusion of all else is that this is a cheap way to increase performance
> that does not involve any new thinking. If process/technology improvements
> were to stop tomorrow, we would have our hands full for the foreseeable future
> with parallelism speedups. In fact, the speed increases we get from process
> improvements are probably retarding parallel processing research quite a bit.
>

Exactly!  I put my hip-boots on whenever I hear the folks on the Tech TV
shows (http://www.techtv.com) promote technologies like USB and Serial ATA
with quips like "serial is FASTER than parallel" -- this is just a bunch of
baloney!  Sending 32 bits at a time is ALWAYS going to be faster than sending
1 bit at a time.  The trick these technologies use to achieve higher rates
is that they've gone and severed the electrical connections to the device
and have resorted to broadcasting the signal at the device.  This is
basically the same technology that a Network Interface Card (NIC) uses for
your LAN or broadband internet connection.  The NIC is really like a
low-power radio station sending a radio signal to the other computer but
instead of broadcasting through the wide-open air, it sends the signal
through a "conduit" such as a co-axial cable or a twisted-pair bundle.  The
conduit maintains signal strength and reduces outside interference.  [the
outside woven wire on a co-axial cable both acts as a conduit to "pipe" the
signal to the destination and also acts as a Faraday cage to block
interference {if you completely surround yourself in a mesh of connected
wires (like the window and door screens used to keep flies and bugs outside
during the summer), then you are sitting inside a Faraday cage and no amount
of electrical potential or magnetic flux can reach you <although
nuclear-type radiation can still get through>}]  ...I am running out of
parenthetical bracketing symbols in this damnable ASCII... [twisted-pair
reduces interference because an outside signal will induce a positive
voltage in one wire and a negative voltage in the other wire and these two
opposing voltages cancel out at the destination (that's a bit
over-simplified, but you get the gist)]

Now, I am not suggesting that we design a keyboard with 32 USB connections
to keep up with fast typists (although, if we receive alien visitors from
the Andromeda Galaxy who are equipped with 42 fingers, then this would not be
a bad idea ;-)  But, having 32 SATA channels (thus making it parallel so
that the "S" for "serial" becomes hypocritical) to your hard drive sounds
kwel.  You'll notice that the PCI-X or PCI-Express bus standard (also
follows this "broadcast" methodology) is taking this approach and being
flexible about it so that MOBOs can provide different bandwidth expansion
slots for different purposes (something like a 32-channel connector for your
graphics card, a 16-channel for maybe HD and NIC, an 8-channel for
modem --or something along those lines).  Even seeing this at the CPU level
with hypertransport.

L8r
Nathan.


Reply Evenbit 5/19/2004 5:41:31 PM

"Evenbit" <nbakerNOSPAM2328@charter.net> wrote in message
news:10an52ookui95c@corp.supernews.com...
>
> Exactly!  I put my hip-boots on whenever I hear the folks on the Tech TV
> shows (http://www.techtv.com) promote technologies like USB and Serial ATA
> with quips like "serial is FASTER than parallel" -- this is just a bunch of
> baloney!  Sending 32 bits at a time is ALWAYS going to be faster than sending
> 1 bit at a time.

Uh....
You're ignoring a few physical realities here.
There is this thing called signal skew that creates problems with wide data
buses. Also, for the same money you can build a *much* higher quality cable
that can transmit data more than 32x faster serially than had you built a
parallel cable with 32+ lines. Then, of course, there are the issues of weight
and bulk (which don't affect data transfer rates, but are important attributes
nonetheless). Also, issues of cross-talk (related to cable quality) are much,
much lower with serial transmission media. Because of all the bad issues
related to parallel cables, there are usually severe length limitations on
parallel cables (e.g., the ATA spec calls out for a maximum of 18", SATA goes
much farther).

So, no. Sending data in parallel is *not* always faster than sending it in
serial when you consider real-world constraints. With sufficient limitations,
you *can* build fast parallel connections (e.g., SCSI 320), but even
10 Gigabit ethernet has got that beat...
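
For a back-of-the-envelope comparison, here is a small C++ sketch of the nominal peak-rate arithmetic. The function names are hypothetical, and the figures in the note after the code are nominal numbers as commonly quoted, not measurements and not taken from this thread; protocol overhead is ignored.

```cpp
#include <cassert>

// Nominal peak rate of a parallel bus: bus width times transfers per second.
// width_bits is the data width; mega_transfers_per_s is millions of transfers
// per second. The result is in MB/s.
double parallel_mb_per_s(int width_bits, double mega_transfers_per_s) {
    assert(width_bits > 0 && mega_transfers_per_s > 0);
    return width_bits / 8.0 * mega_transfers_per_s;
}

// Nominal payload rate of one serial lane, in MB/s. The line code puts
// coded_bits on the wire for every payload_bits of data (8b/10b sends 10
// bits for every 8 bits of payload).
double serial_mb_per_s(double gbit_per_s, int payload_bits, int coded_bits) {
    assert(gbit_per_s > 0 && payload_bits > 0 && coded_bits >= payload_bits);
    return gbit_per_s * 1000.0 * payload_bits / coded_bits / 8.0;
}
```

Assuming a 16-bit Ultra320 bus at 160 million transfers per second, parallel_mb_per_s(16, 160.0) gives 320 MB/s. Assuming 1.5 Gbit/s SATA with 8b/10b coding, serial_mb_per_s(1.5, 8, 10) gives 150 MB/s on a single lane, and 10 Gbit/s of payload, serial_mb_per_s(10.0, 1, 1), gives 1250 MB/s.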



> The trick these technologies use to achieve higher rates
> is that they've gone and severed the electrical connections to the device
> and have resorted to broadcasting the signal at the device.  This is
> basically the same technology that a Network Interface Card (NIC) uses for
> your LAN or broadband internet connection.  The NIC is really like a
> low-power radio station sending a radio signal to the other computer but
> instead of broadcasting through the wide-open air, it sends the signal
> through a "conduit" such as a co-axial cable or a twisted-pair bundle.  The
> conduit maintains signal strength and reduces outside interference.  [the
> outside woven wire on a co-axial cable both acts as a conduit to "pipe" the
> signal to the destination and also acts as a Faraday cage to block
> interference {if you completely surround yourself in a mesh of connected
> wires (like the window and door screens used to keep flies and bugs outside
> during the summer), then you are sitting inside a Faraday cage and no amount
> of electrical potential or magnetic flux can reach you <although
> nuclear-type radiation can still get through>}]  ...I am running out of
> parenthetical bracketing symbols in this damnable ASCII... [twisted-pair
> reduces interference because an outside signal will induce a positive
> voltage in one wire and a negative voltage in the other wire and these two
> opposing voltages cancel out at the destination (that's a bit
> over-simplified, but you get the gist)]

And it is exactly this interference that shoots high-speed transmissions
over parallel cables in the foot. All those signals travelling in parallel
interfere with one another. This is why serial is so attractive.


>
> Now, I am not suggesting that we design a keyboard with 32 USB connections
> to keep up with fast typists (although, if we receive alien visitors from
> the Andromeda Galaxy who are equipped with 42 fingers, then this would not be
> a bad idea ;-)  But, having 32 SATA channels (thus making it parallel so
> that the "S" for "serial" becomes hypocritical) to your hard drive sounds
> kwel.

You're better off with SCSI than a crazy number of SATA drives (at least,
if performance is what you're seeking).
A *parallel* version of SCSI (note that SCSI has three physical layers:
parallel, SCSI over SATA, and SCSI over firewire; currently parallel
SCSI is the highest performing version).

Cheers,
Randy Hyde

Reply Randall 5/20/2004 3:29:27 AM

"Randall Hyde" <randyhyde@earthlink.net> wrote in message
news:xpVqc.2074$Tn6.1682@newsread1.news.pas.earthlink.net...
>
> "Evenbit" <nbakerNOSPAM2328@charter.net> wrote in message
> news:10an52ookui95c@corp.supernews.com...
<snip>
> > Now, I am not suggesting that we design a keyboard with 32 USB connections
> > to keep up with fast typists (although, if we receive alien visitors from
> > the Andromeda Galaxy who are equipped with 42 fingers, then this would not be
> > a bad idea ;-)  But, having 32 SATA channels (thus making it parallel so
> > that the "S" for "serial" becomes hypocritical) to your hard drive sounds
> > kwel.
>
> You're better off with SCSI than a crazy number of SATA drives (at least,
> if performance is what you're seeking).
> A *parallel* version of SCSI (note that SCSI has three physical layers:
> parallel, SCSI over SATA, and SCSI over firewire; currently parallel
> SCSI is the highest performing version).

Aren't SCSI and fibrechannel intimately related as well?

I would expect 1394 to underperform since the protocol has relatively high
overhead and low transfer speeds. However, I am curious to know how far
apart U320 and SCSI-over-SATA are. SATA seems competitive at least, though
at the moment most drives and chipsets use the converter chips. Supposedly
the conversion adds something like 5% overhead; I presume that is latency.

Medium aside, you're not going to find 15K rpm SATA disks.

-Matt

Reply Matt 5/20/2004 5:27:13 AM

Evenbit wrote:

> "Scott Moore" <samiam@moorecad.com> wrote in message
> news:tfEqc.115093$Ik.9534698@attbi_s53...
> 
>>pH wrote:
> 
> [snip]
> 
>>No it won't. The only reason we continue to invest in process improvements
>>to the exclusion of all else is that this is a cheap way to increase performance
>>that does not involve any new thinking. If process/technology improvements
>>were to stop tomorrow, we would have our hands full for the foreseeable future
>>with parallelism speedups. In fact, the speed increases we get from process
>>improvements are probably retarding parallel processing research quite a bit.
> 
> Exactly!  I put my hip-boots on whenever I hear the folks on the Tech TV
> shows (http://www.techtv.com) promote technologies like USB and Serial ATA
> with quips like "serial is FASTER than parallel" -- this is just a bunch of
> baloney!  Sending 32 bits at a time is ALWAYS going to be faster than sending

No, they are correct. The issue there is communication, not processing power.
Even at OC-192 we do use parallel communication, but it is done by ganging up
multiple serial connections. The issue is clocking. It becomes more trouble
than it is worth to coordinate clocks over multiple paths, so one serial line
is going to be worth several (conventional) parallel lines because that serial
line can be operated at a much higher rate. At OC-192 speeds, the difference is
such that multiple serial lines must have both separate clocking and a short
buffer to allow resyncing the data once it reaches the inside of the receiver
and gets converted back to parallel form.

So why do chips use parallel communication internally? Simple. Chips work
on a single clock domain, and the silicon area is small enough to allow that.
However, even that is becoming troublesome at higher and higher clock speeds,
so more and more designers are going to use multiple clock domains with
resynchronisation between them.

> 1 bit at a time.
> [snip]

Reply Scott 5/20/2004 8:59:09 AM

"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:PfXqc.2409$6%6.1638@twister.tampabay.rr.com...
> "Randall Hyde" <randyhyde@earthlink.net> wrote in message
> news:xpVqc.2074$Tn6.1682@newsread1.news.pas.earthlink.net...
>
> Aren't SCSI and fibrechannel intimately related as well?

Yep, forgot about that one.

>
> I would expect 1394 to underperform since the protocol has relatively high
> overhead and low transfer speeds.

Maybe FW400, but don't forget that they're shipping FW800 today and
FW1600 is around the bend.

> However, I am curious to know how far
> apart U320 and SCSI-over-SATA are. SATA seems competitive at least, though
> at the moment most drives and chipsets use the converter chips. Supposedly
> the conversion adds something like 5% overhead; I presume that is latency.

Again, as long as the transport protocol is faster than a disk drive can
physically transfer data, the important issue is going to be concurrent
operation for RAID processing. SATA, alas, loses big time to SCSI in this
department from what I hear (I've not personally tried to set up a SATA RAID
device, so I can't personally vouch for this, but I've talked to some people
who have and they say it wasn't worth the effort; OTOH, I am running SCSI RAID
and it makes a *big* difference in performance; SATA's problems may be due
to the current technologies in use, I don't know).

> Medium aside, you're not going to find 15K rpm SATA disks.

They are available today. Indeed, I believe you can even get 15K PATA
drives if you look hard enough. A project I'm working on briefly considered
the 15K SATA drives, but went with 7,200 RPM because the data transfer
rates were actually better on the 7.2K drives.

Cheers,
Randy Hyde

Reply Randall 5/20/2004 12:30:55 PM

Matt,

    I don't understand your point.  You state that container classes, or
just classes, are slower than structs.  Many assembler programmers and C
(not C++) programmers stick with structs because they think structs perform
better than classes.  Is that true?

-- 
Bryan Parkoff
"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:aRDqc.633861$Po1.207271@twister.tampabay.rr.com...
> <snip>
Reply Bryan 5/21/2004 4:36:27 PM

"Bryan Parkoff" <bryan.nospam.parkoff@nospam.com> wrote in message
news:H%prc.9302$Hh.4606@fe1.texas.rr.com...
> Matt,
>
>     I don't understand your point.  You state that container classes, or
> just classes, are slower than structs.  Many assembler programmers and C
> (not C++) programmers stick with structs because they think structs perform
> better than classes.  Is that true?
<snip>

I am criticizing the *universal* use of container classes. I am saying that
std::vector is acceptable in some cases, but in other cases it is better to
create an array of 5 elements on the stack. (Aside: the one thing I do
dislike about STL containers is that they use C++ exception handling. With
MSVC, this results in bloated code.)

It is like the example I gave:

// ...
char *buf = new char[256];
FILE *fp;
sprintf(buf, "%s%s", path, filename);
fp = fopen(buf, "rb");
delete[] buf;
// ...

This code could be rewritten like this:
// ...
char buf[256];
FILE *fp;
sprintf(buf, "%s%s", path, filename);
fp = fopen(buf, "rb");
// ...

Fast code should avoid malloc & new whenever possible.
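
One caveat with a fixed stack buffer is that sprintf will happily write past the end of it if path plus filename is too long. Below is a minimal sketch of a bounded variant, assuming snprintf is available (C99 / C++11). The helper name open_joined is made up for this example; it is a variation on the snippet above, not a drop-in from any real code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: joins path and filename in a stack buffer and opens
// the result for binary reading. Returns nullptr if the joined name does not
// fit in the buffer or if fopen fails.
std::FILE* open_joined(const char* path, const char* filename) {
    assert(path != nullptr && filename != nullptr);
    char buf[256];                           // stack storage, no new/delete
    int n = std::snprintf(buf, sizeof buf, "%s%s", path, filename);
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof buf)
        return nullptr;                      // encoding error or name too long
    return std::fopen(buf, "rb");
}
```

The buffer still lives on the stack, so there is no allocator call, but an over-long name is now rejected instead of overrunning buf. The caller is responsible for fclose on a non-null result.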

-Matt

0
Reply Matt 5/23/2004 12:12:16 AM

Matt,

    Okay, are you claiming that plain variables and functions inside classes
are better than container classes and the STL?  You would want to avoid
malloc or new if the allocation is less than 1MB.  It is not really
necessary to use malloc or new if the memory is not very big.
    What is the minimum size at which malloc or new should be used?  I
suspect 10MB?
    Is it true that malloc or new is never used when segments are used
instead of the flat model, under 640KB?  It sounds like malloc or new
allocates virtual memory, but not below 640KB.  Correct me if I am wrong.

-- 
Bryan Parkoff

0
Reply Bryan 5/23/2004 2:10:12 AM

>     Okay, do you claim that variables and functions inside classes are
> better than container classes and STL?  You would want to avoid malloc or
> new function if it is less than 1MB.

He didn't give an exact value.

> It is not really necessary to use
> malloc or new functions if memory is not very big.
>     What is the minimum MB that malloc or new function should be used?  I
> suspect 10MB.?????

What's the minimum number of hairs to form a beard? I would stop at a few
kilobytes (of memory, not of hairs :)). Allocating more than 4KB on the stack
cannot be done by just subtracting from esp (on Windows at least; I don't
know how *nix organizes the stack). Otherwise, the stack pointer would skip
past the guard page and land in reserved memory. The compiler calls a special
function (_chkstk, IIRC) which touches the stack page by page. This
causes as many page faults and kernel transitions as pages you allocate.
Thus, the efficiency of stack allocation is limited not by the memory wasted
but by the time the #PF handler needs to handle so many page faults.
Let's imagine that you allocate 10MB of memory (10 times more than the
default stack reserve, so suppose you have raised it to 15MB). That causes
2560 page faults. A P4 @ 2.4GHz needs about 0.9ms just to enter the interrupt
handler that many times, and the handler itself takes some time too. So it
will take a few milliseconds. I think that's *much* slower than calling malloc.
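
The arithmetic behind those figures can be checked with a small sketch. It assumes 4KB pages; the constant and function names are made up for illustration. Taking the 0.9ms estimate at face value, it works out to roughly 844 cycles (about 350ns) per fault entry.

```cpp
#include <cassert>
#include <cstdint>

// Assumed page size for 32-bit x86 Windows.
constexpr std::uint64_t kPageSize   = 4096;
constexpr std::uint64_t kAllocBytes = 10ull * 1024 * 1024;   // 10MB
constexpr std::uint64_t kPages      = kAllocBytes / kPageSize;

static_assert(kPages == 2560, "10MB of 4KB pages is 2560 pages");

// Average cycles spent per fault if 'total_seconds' are spent on
// 'faults' page faults at a clock rate of 'clock_hz'.
inline double cycles_per_fault(double total_seconds, double clock_hz, std::uint64_t faults)
{
    assert(faults > 0);
    return total_seconds * clock_hz / static_cast<double>(faults);
}
```

For example, cycles_per_fault(0.9e-3, 2.4e9, kPages) gives about 843.75.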

>     Is it true that malloc or new function is never used if segments are
> used instead of flat model under 640KB?

Who - told - you - that?

> It sounds like malloc or new
> function uses to allocate virtual memory,

Under Windows.

> but below 640KB does not.

Does. Under DOS.

Ivan


0
Reply Ivan 5/23/2004 6:04:05 PM

"Ivan Korotkov" <koroNOSPAMtkov2@ztelDOT.ru> wrote in message
news:c8qbfj$21uj$1@news1.macomnet.ru...
> >     Okay, do you claim that variables and functions inside classes are
> > better than container classes and STL?  You would want to avoid malloc
> > or new function if it is less than 1MB.
>
> He didn't tell this exact value.

Right. I think Bryan misunderstood what I said. The functions malloc() and
new are quite slow. Stack allocation is 1 instruction: sub esp, xx.

> > It is not really necessary to use
> > malloc or new functions if memory is not very big.
> >     What is the minimum MB that malloc or new function should be used?
> > I suspect 10MB.?????
>
> What's the minimum number of hairs to form a beard? I would stop at few
> kilobytes (of memory, not of hairs :)). Allocating more than 4KB on stack
> cannot be done by just subtracting from esp (on Windows at least, don't
> know how *nix organizes stack). Otherwise, stack pointer will bypass the
> guard page and fall into reserved memory space. Compiler calls a
> special function (_chkstk, IIRC) which allocates memory on stack
> page-by-page. This will cause as many page faults and transitions as
> many pages you allocate.
> Thus, the efficiency of stack allocations is limited not by the memory
> waste but by the time that PF handler requires to handle so many page
> faults.

This is a one-time cost.

> Let's imagine that you allocate 10MB of memory (it is 10 times more than
> the default stack reserve, but you have set it to 15MB). That will cause
> 2560 page faults. P4 @ 2.4GHz needs about 0.9ms to call the interrupt
> handler so many times. And the handler itself requires some time. So it
> will take about a few milliseconds. I think it's *much* slower than
> calling malloc.
<snip>

I suppose I gave insufficient context. In every case I have dealt with, the
allocation has been relatively small (< 64KB), and it has been allocated
repeatedly (in a loop). In this case, the cost of #PF is paid once, and
afterward the allocation occurs in <1 ns on modern multi-GHz processors. I
believe it is also possible for large allocations to avoid the #PF penalty by
calling VirtualAlloc (NtAllocateVirtualMemory underneath) manually, though
MSVC's runtime does not. (It is also possible to speed up heap allocation
with fixed-size/bucket allocators, but most apps do not use custom memory
management.)
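
A minimal sketch of what a fixed-size allocator can look like, for anyone who has not seen one. FixedPool is a hypothetical name, and this version is single-threaded and never grows. Both allocate and release are a couple of pointer moves, which is the whole point.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size pool: hands out blocks of BlockSize bytes from
// storage embedded in the object. Free blocks form an intrusive list.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    static_assert(BlockSize >= sizeof(void *), "block must hold a free-list link");

    union Block {
        Block *next;                                          // valid while free
        alignas(std::max_align_t) unsigned char bytes[BlockSize]; // valid while in use
    };

    Block blocks_[BlockCount];
    Block *free_ = nullptr;

public:
    FixedPool()
    {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            blocks_[i].next = free_;
            free_ = &blocks_[i];
        }
    }
    FixedPool(const FixedPool &) = delete;
    FixedPool &operator=(const FixedPool &) = delete;

    // Pops a block off the free list; returns nullptr when exhausted.
    void *allocate()
    {
        if (!free_) return nullptr;
        Block *b = free_;
        free_ = b->next;
        return b;
    }

    // Pushes a block (previously returned by allocate) back on the list.
    void release(void *p)
    {
        assert(p != nullptr);
        Block *b = static_cast<Block *>(p);
        b->next = free_;
        free_ = b;
    }
};
```

A real one would need a policy for exhaustion (fall back to the heap, or chain another pool) and some thought about threads.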

For every piece of commercial software on Windows that I have tested,
RtlAllocateHeap is one of the top 16 time consuming functions. Generally it
is even in the top 5. The slower the code is, the higher up on the list that
function is. One app in particular calls it >300,000 times while starting
up. This is a user-measurable amount of time.

-Matt

0
Reply Matt 5/23/2004 10:52:46 PM

> I suppose I gave insufficient context. In every case I have dealt with,
> the allocation has been relatively small (< 64KB), and it has been
> allocated iteratively. In this case, the cost of #PF is paid once, and
> afterward the allocation occurs in <1 ns on modern multi-GHz processors.

I understood you. Bryan didn't ;). So I explained to him why allocating
huge memory blocks on the stack is inefficient.

Ivan


0
Reply Ivan 5/24/2004 6:00:01 PM

I didn't say optimization is undesirable.  In fact for a compiler it should
always be done.  What I meant was that hand-optimized code in assembly may
no longer be useful as it might run well for one processor but poorly for
another.

The only optimizations that will be useful are the high level ones that work
across CPU families such as loop unrolling, data alignment, function
inlining etc. - precisely the ones that a good compiler will perform.

Optimizations such as instruction rescheduling to prevent stalling the
pipeline and reorganisation of code for better use of cache will be
considered a waste of time as they will only work for a particular
processor.

Perhaps assembly's days are numbered?.....


DayStar.

0
Reply a 5/26/2004 5:58:07 PM

"a" <a@a.com> wrote in message news:bSTpc.306$JL5.73@newsfe1-win...
>
> Perhaps assembly's days are numbered?.....

And they've been numbered since at least 1975, when I first
heard this phrase uttered (in my assembly class at the time) :-)

I guess that as processors have gone from eight bits, to 16 bits,
to 32 bits, to 64 bits, etc., that the data type we're using to
"number assembly's days" just keeps getting larger and larger :-)

Cheers,
Randy Hyde

0
Reply Randall 5/27/2004 4:10:55 AM

a wrote:

> I didn't say optimization is undesirable.  In fact for a compiler it should
> always be done.  What I meant was that hand optimized code in assembly may
> no longer be useful as it might run well for one processor but poorly for
> another.
> 
> The only optimizations that will be useful are the high level ones that work
> across CPU families such as loop unrolling, data alignment, function
> inlining etc. - precisely the ones that a good compiler will perform.
> 
> Optimizations such as instruction rescheduling to prevent stalling the
> pipeline and reorganisation of code for better use of cache will be
> considered a waste of time as they will only work for a particular
> processor.
> 
> Perhaps assembly's days are numbered?.....
> 
> 
> DayStar.

The formula is the amount of work spent making a particular program run faster
divided by the number of times the program will be run.

Compilers add to the classic formula by amortizing the work over a large number
of different programs. That "waste of time instruction reschedule" is being
amortized over thousands of programs and all of their runs. Even considering
only programs for a particular CPU, the returns are large.

Programmers gave up the need to program directly in assembly for the most part.
That's good, if compilers are good. But if you hand your work to an idiot,
or an idiot compiler, you aren't getting a return on your money.

I do think there are problems with overly targeting a single CPU model, mainly
because of the likelihood that the program will later be run on a different
model CPU. However, there *is* no irrelevant optimization. Since a considerable
amount of the world runs on GCC or VC++, these compilers deserve to get every
last optimization possible. Before deciding that optimizations are irrelevant,
go talk to the cancer researcher who is investigating protein folding. Try
to tell HIM that the 5% boost he gets from instruction scheduling is not
worth it.

0
Reply Scott 5/28/2004 5:51:00 PM

Thus spake "C" <cc-news@hermes.mirlex.com>
> However, there are other forms of low level optimisation
> which work well across many processors -- such as vectorising,
> improving cache utilisation and branch elimination.  Current
> HLL compilers are still not very effective at these operations.
>
> ... Branch elimination (one of my favourites) only
> gives improvements on processors using branch prediction and
> relies on arranging the data to remove the need for conditional
> branches -- while compilers often get the simple cases, I have
> not seen one which can do complex eliminations. I.e.
>
> C code...
>
> if( a < b ) c += e;
> else        d += e;
>
> Naive assembly...
>
>      cmp eax, ebx ; if( a < b )
>      jae .s       ; take the else path when a >= b (unsigned)
>        add ecx, esi ; c += e;
>        jmp .x
> .s:   add edx, esi ; d += e;
> .x:
>
> Branch elimination optimisation...
>
>      add edx, esi ; do d += e (may undo later)
>      cmp eax, ebx ; if( a < b ), sets CF (unsigned compare)
>      sbb ebp, ebp ; create mask { p = a < b ? -1 : 0 }
>      and ebp, esi ; p = a < b ? e : 0
>      sub edx, ebp ; undo d += e if a < b
>      add ecx, ebp ; do c += e if a < b
>
> I doubt many (if any) compilers would do this optimisation
> -- but such an optimisation will produce an improvement
> across _all_ processors.

How about this:

; c, d, e stand for registers here (cmov needs a register destination)
mov ecx, c      ; tmp1 = c
mov edx, d      ; tmp2 = d
add ecx, e      ; tmp1 += e
add edx, e      ; tmp2 += e
cmp eax, ebx    ; if (a < b)
cmovl c, ecx    ; then c = tmp1
cmovge d, edx   ; else d = tmp2

One more instruction than yours (unless the temp registers are already
available), but it should execute in 3 cycles...
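
For comparison, a rough C++ sketch of both ideas: a mask form in the spirit of the sbb trick and a select form in the spirit of cmov. The function names are made up, and whether a given compiler actually emits branch-free code for either is an assumption that has to be checked in the generated assembly. Unsigned arithmetic is used for c, d and e so that wraparound is well defined.

```cpp
#include <cassert>
#include <cstdint>

// Mask form: mask is all ones when a < b (unsigned), zero otherwise.
inline void add_masked(std::uint32_t a, std::uint32_t b, std::uint32_t e,
                       std::uint32_t &c, std::uint32_t &d)
{
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    c += e & mask;     // c += e only when a < b
    d += e & ~mask;    // d += e only when a >= b
}

// Select form: compute both results, then pick one (signed compare,
// like cmovl/cmovge).
inline void add_select(std::int32_t a, std::int32_t b, std::uint32_t e,
                       std::uint32_t &c, std::uint32_t &d)
{
    std::uint32_t c1 = c + e;
    std::uint32_t d1 = d + e;
    c = (a < b) ? c1 : c;
    d = (a < b) ? d : d1;
}
```

Note the two forms differ in signedness of the comparison, just as the sbb and cmovl versions do.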

S

Stephen Sprunk        "Stupid people surround themselves with smart
CCIE #3723           people.  Smart people surround themselves with
K5SSS         smart people who disagree with them."  --Aaron Sorkin


0
Reply Stephen 5/29/2004 9:43:33 AM

"a" <a@a.com> wrote in message news:bSTpc.306$JL5.73@newsfe1-win...
> I didn't say optimization is undesirable.  In fact for a compiler it
> should always be done.  What I meant was that hand optimized code in
> assembly may no longer be useful as it might run well for one processor
> but poorly for another.

I answered this comment already and directly stated that it was false.

> The only optimizations that will be useful are the high level ones that
> work across CPU families such as loop unrolling, data alignment, function
> inlining etc. - precisely the ones that a good compiler will perform.

Compilers are also egregiously shortsighted. Until MSVC 7, the following
code would generate a mul and a div:

int x, y, z;
// ...
y = x / 10;
z = x % 10;

Now in MSVC 7.1 it generates a single div. This is several times slower than
the code that GCC 3.2 produces. Of course, MSVC gets it right if you remove
the modulo calculation.

Another case which I found interesting involves STL's min function:

int min_int(int a, int  b)
{
 return std::min(a, b);
}

MSVC 7.1:
 mov eax, DWORD PTR _b$[esp-4]
 cmp eax, DWORD PTR _a$[esp-4]
 lea eax, DWORD PTR _b$[esp-4]
 jl SHORT $L4471
 lea eax, DWORD PTR _a$[esp-4]
$L4471:
 mov eax, DWORD PTR [eax]
 ret 0

GCC 3.2:
 mov eax, DWORD PTR [esp+4]
 cmp DWORD PTR [esp+8], eax
 jge L2
 lea eax, [esp+8]
L3:
 mov eax, DWORD PTR [eax]
 ret
 .p2align 4,,7
L2:
 lea eax, [esp+4]
 jmp L3

This is particularly egregious because the compiler is inlining the std::min
function, and both GCC and MSVC make the mistake of preserving
pass-by-reference semantics in the inlined code. Even an assembly novice
would not likely do this.
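
As a point of comparison, here is a by-value version that has no references to preserve in the first place. min_by_value is a hypothetical name, and what any particular compiler emits for it (a branch, cmov, or something else) is not something this sketch claims.

```cpp
#include <cassert>

// Hypothetical by-value minimum: both operands are plain ints, so nothing
// forces the compiler to take their addresses.
inline int min_by_value(int a, int b)
{
    return (b < a) ? b : a;
}
```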

Let's also not forget MMX and SSE which many popular compilers cannot take
advantage of except through intrinsics. AFAIK Intel C++ is the only popular
C/C++ compiler that can vectorize loops. MSVC can't, and I haven't seen such
an option in GCC yet either. Last I checked, Borland can't. Watcom can't.
Etc.
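
For readers who have not used intrinsics, a minimal sketch of what "through intrinsics" means: adding two float arrays four elements at a time with SSE. The function name add_floats and the SKETCH_HAVE_SSE macro are made up for this example; the intrinsics themselves come from <xmmintrin.h>. A scalar loop handles the tail, and the whole array when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// dst[i] = a[i] + b[i] for i in [0, n). Pointers need not be aligned.
void add_floats(float *dst, const float *a, const float *b, std::size_t n)
{
    assert(dst != nullptr && a != nullptr && b != nullptr);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); // 4 additions at once
    }
#endif
    for (; i < n; ++i)                              // scalar tail (or fallback)
        dst[i] = a[i] + b[i];
}
```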

Try casting a 32-bit int to a 64-bit int and multiplying by another 32-bit
int, for example. You would expect the compiler to use the extended multiply
instruction (mul/imul with 1 operand). Usually MSVC emits a call to _allmul;
it is very picky about that particular idiom. There are also many, many
cases involving templates or references/pointers where the compiler does a
piss-poor job of optimizing.
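
The idiom in question, written out as a sketch. The function names are hypothetical. The source only says what one would hope the compiler emits; whether a given compiler turns this into a single mul/imul or a helper call has to be checked in its output.

```cpp
#include <cassert>
#include <cstdint>

// Signed 32x32 -> 64 multiply: widen one operand, the other is promoted.
inline std::int64_t mul_wide(std::int32_t a, std::int32_t b)
{
    return static_cast<std::int64_t>(a) * b;
}

// Unsigned 32x32 -> 64 multiply.
inline std::uint64_t mul_wide_u(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint64_t>(a) * b;
}
```

Casting after the multiplication, as in static_cast<std::int64_t>(a * b), is not the same thing: the product is then computed in 32 bits first.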

Compilers don't reason about code like humans do. Compilers are built from
cases, and the cases are never exhaustive. If your software has no
particular need for performance, then you probably don't have a particular
need to be writing assembly for it. There are, however, a large number of
niche cases where the compiler performs poorly.

> Optimizations such as instruction rescheduling to prevent stalling the
> pipeline and reorganisation of code for better use of cache will be
> considered a waste of time as they will only work for a particular
> processor.

Optimal code for one processor is not optimal for another; however, optimal
code for one processor is still probably better than unscheduled code for
another processor. Timings are often very similar for primitive ops (ALU,
bswap, others). Scheduled code will try to minimize the amount of time it
takes to get through the critical path, and since timings are similar this
will be the same path for other processors. This will result in an
improvement for *all* processors.

> Perhaps assembly's days are numbered?.....
<snip>

People have been saying this for years and assembly is still actively used.
Unless compilers make a dramatic leap forward, this is unlikely to ever
become true--software is always pushing the envelope, though some people
tend to forget this because it isn't true of every market.

-Matt

0
Reply Matt 5/31/2004 1:50:29 AM

"Matt Taylor" <para@tampabay.rr.com> writes:
> Compilers are also egregiously shortsighted. Until MSVC 7, the following
> code would generate a mul and a div:
> 
> int x, y, z;
> // ...
> y = x / 10;
> z = x % 10;
> 
> Now in MSVC 7.1 it generates a single div. This is several times slower than
> the code that GCC 3.2 produces. Of course, MSVC gets it right if you remove
> the modulo calculation.

I just checked GCC 2.95, and even that (from the late 90s) only has one mul.


Phil



-- 
1st bug in MS win2k source code found after 20 minutes: scanline.cpp
2nd and 3rd bug found after 10 more minutes: gethost.c
Both non-exploitable. (The 2nd/3rd ones might be, depending on the CRTL)

0
Reply Phil 5/31/2004 11:29:22 AM

Matt Taylor wrote:


> Compilers are also egregiously shortsighted. Until MSVC 7, the following
> code would generate a mul and a div:
> 
> int x, y, z;
> // ...
> y = x / 10;
> z = x % 10;
> 

Why "compilers" in general? You already admit there are exceptions, such as
the vectorizing C++ compiler from Intel.

So your example compilers are a free one that is supported by people working
in their spare time and another that the maker principally ships to support its
OS monopoly.

Compiler technology is surprisingly good for a day and age where people run
around asking "where can I get a vectorizing compiler for my machine with
full visual environment... for free?". Nobody wanders into the dealership
asking for a free Porsche.

MS compilers are typically low performing. IBM, back when they had full rights
to the Windows code, said that they were able to make a 10%-20% faster version
of Windows simply by recompiling with Watcom. Not all compilers need be judged
by this poor yardstick.

0
Reply Scott 5/31/2004 6:01:00 PM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:

>"Matt Taylor" <para@tampabay.rr.com> writes:
>> Compilers are also egregiously shortsighted. Until MSVC 7, the following
>> code would generate a mul and a div:
>> 
>> int x, y, z;
>> // ...
>> y = x / 10;
>> z = x % 10;
>> 
>> Now in MSVC 7.1 it generates a single div. This is several times slower than
>> the code that GCC 3.2 produces. Of course, MSVC gets it right if you remove
>> the modulo calculation.
>
>I just checked GCC 2.95, and even that (from the late 90s) only has one mul.

I believe the point was that there shouldn't be a "mul" at all.  The x86
"div" instruction happily produces the quotient AND remainder in one fell
swoop.

What, exactly, does gcc 2.95 produce for this?
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.

0
Reply Tim 5/31/2004 11:44:38 PM

Tim Roberts <timr@probo.com> writes:

> Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> 
> >"Matt Taylor" <para@tampabay.rr.com> writes:
> >> Compilers are also egregiously shortsighted. Until MSVC 7, the following
> >> code would generate a mul and a div:
> >> 
> >> int x, y, z;
> >> // ...
> >> y = x / 10;
> >> z = x % 10;
> >> 
> >> Now in MSVC 7.1 it generates a single div. This is several times slower than
> >> the code that GCC 3.2 produces. Of course, MSVC gets it right if you remove
> >> the modulo calculation.
> >
> >I just checked GCC 2.95, and even that (from the late 90s) only has one mul.
> 
> I believe the point was that there shouldn't be a "mul" at all.  The x86
> "div" instruction happily produces the quotient AND remainder in one fell
> swoop.

Do what?!?!? 

I think _my_ point was that there shouldn't be a "div" at all. The x86 
"mul" instruction happily produces the quotient AND remainder in one fell 
swoop. 

When you multiply by a scaled reciprocal, and shift appropriately, that is.
You don't want any divs as they are interminably slow on most implementations.

Here's what GCC 2.95, and maybe prior, provides. 3.3 isn't substantially different.


Unsigned:

.LC1:
        .long 0xcccccccd
...

        movl 4(%esp),%ecx
        movl %ecx,%eax
        mull .LC1
        shrl $3,%edx
        movl %edx,y
        leal (%edx,%edx,4),%eax
        addl %eax,%eax
        subl %eax,%ecx
        movl %ecx,z
        ret

Signed:
        movl 4(%esp),%ecx
        movl $1717986919,%edx
        movl %edx,%eax
        imull %ecx
        sarl $2,%edx
        movl %ecx,%eax
        sarl $31,%eax
        subl %eax,%edx
        movl %edx,y
        leal (%edx,%edx,4),%eax
        addl %eax,%eax
        subl %eax,%ecx
        movl %ecx,z
        ret
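
The unsigned listing, rewritten as a C++ sketch for anyone who does not read AT&T syntax. The names DivMod10 and divmod10 are made up. The constant 0xCCCCCCCD and the total shift of 35 (32 from taking the high half of the product, plus the shrl by 3) are taken straight from the listing. The signed listing applies the same idea with the constant 1717986919 and a sign correction.

```cpp
#include <cassert>
#include <cstdint>

struct DivMod10 {
    std::uint32_t quot;
    std::uint32_t rem;
};

// Unsigned x / 10 and x % 10 with one multiply and no div.
inline DivMod10 divmod10(std::uint32_t x)
{
    // High 32 bits of x * 0xCCCCCCCD, shifted right by 3 more bits.
    std::uint32_t q = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(x) * 0xCCCCCCCDu) >> 35);
    // Remainder from the quotient: x - q * 10 (the lea/add/sub sequence).
    std::uint32_t r = x - q * 10u;
    assert(r < 10u);
    return {q, r};
}
```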


Phil
-- 
1st bug in MS win2k source code found after 20 minutes: scanline.cpp
2nd and 3rd bug found after 10 more minutes: gethost.c
Both non-exploitable. (The 2nd/3rd ones might be, depending on the CRTL)

0
Reply Phil 6/1/2004 6:10:03 AM
