65nm news from Intel

http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883

    Yousuf Khan


Reply bbbl67 (44) 8/30/2004 5:39:10 AM

"Yousuf Khan" <bbbl67@ezrs.com> wrote in message news:<24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com>...
> http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
> 
>     Yousuf Khan

official press release,

http://crew.tweakers.net/Wouter/Press65nm804a.pdf

more publicity,

http://www.extremetech.com/article2/0,1558,1640647,00.asp
http://news.com.com/Intel+to+throttle+power+by+enhancing+silicon/2100-1006_3-5327521.html?tag=nefd.top
http://cbs.marketwatch.com/news/story.asp?guid=%7BE706E5AF-C144-4A19-936D-1943CB16A394%

Looks damn good on paper.
Reply mas769 8/30/2004 11:05:25 AM


"Yousuf Khan" <bbbl67@ezrs.com> wrote in message 
news:24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com...
> http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
>
>    Yousuf Khan
>
>

I don't know, maybe it's just me, but it seems like this article puts way too
much importance on the manufacturing process a CPU is made on. Not that
these things aren't important at all, but the fact that my Athlon64 3000+
is still made on a .13 process really didn't discourage me at all. My
system still performs extremely well despite being a "generation behind"
Intel's Prescott.

Carlo 


Reply Carlo 8/30/2004 10:54:23 PM

It looks like AMD is progressing nicely with its .09 micron process.
This website shows the Athlon 64 4000+ and 3800+, as
well as the FX-55, as scheduled for release in October.

http://www.c627627.com/AMD/Athlon64/

Mobile Athlon 64 chips for thin and light notebooks are
being made now on .09

Carlo Razzeto wrote:

> "Yousuf Khan" <bbbl67@ezrs.com> wrote in message
> news:24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com...
> > http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
> >
> >    Yousuf Khan
> >
> >
>
> I don't know, maybe it's just me but it seems like this article puts way to
> much importance on the manufacturing process a CPU is made on.. Not that
> these things aren't important at all... But the fact that my Athlon64 3000+
> is still made on a .13 process really didn't discourage me at all.. My
> system still performs extremely well despite being a "generation behind"
> Intel's Prescott.
>
> Carlo

Reply JK 8/30/2004 11:03:00 PM

Carlo Razzeto wrote:
> I don't know, maybe it's just me but it seems like this article puts
> way to much importance on the manufacturing process a CPU is made
> on.. Not that these things aren't important at all... But the fact
> that my Athlon64 3000+ is still made on a .13 process really didn't
> discourage me at all.. My system still performs extremely well
> despite being a "generation behind" Intel's Prescott.

Shhh! Intel needs a little bit of a pick-me-up. Let it enjoy its usual
fawning coverage, like from yesteryear. :-)

    Yousuf Khan


Reply Yousuf 8/31/2004 12:53:30 AM

"Carlo Razzeto" <crazzeto@hotmail.com> wrote ...
> I don't know, maybe it's just me but it seems like this article puts way 
> to much importance on the manufacturing process a CPU is made on..

Well, it _is_ important.  The rules change when the process
changes, and a microarchitecture that excelled on, say, 180nm
may be non-competitive on 65nm.

I'd go into detail, but I work for Intel. Sorry.
--
Dennis M. O'Connor    dmoc@primenet.com 


Reply Dennis 8/31/2004 1:26:26 AM

On Mon, 30 Aug 2004 18:54:23 -0400, "Carlo Razzeto"
<crazzeto@hotmail.com> wrote:
>
>"Yousuf Khan" <bbbl67@ezrs.com> wrote in message 
>news:24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com...
>> http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
>>
>
>I don't know, maybe it's just me but it seems like this article puts way to 
>much importance on the manufacturing process a CPU is made on.. Not that 
>these things aren't important at all... But the fact that my Athlon64 3000+ 
>is still made on a .13 process really didn't discourage me at all.. My 
>system still performs extremely well despite being a "generation behind" 
>Intel's Prescott.

The important difference is that the Athlon64 3000+ costs AMD more to
build than a Prescott 3.0GHz chip costs Intel, yet it sells for less.  These
days a new process generation is as much a financial matter as a
technological one (case in point: Intel is very aggressively moving the
low-end Celeron to the newest manufacturing process rather than just
focusing on high-end chips first).

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
Reply Tony 8/31/2004 2:47:37 AM


JK wrote:

> It looks like AMD is progressing nicely with  .09
> This website shows the Athlon 64 4000+ and 3800+

The 3800+ on .09 that is. The 3800+ on .13 was released earlier.

> as
> well as the FX-55 as scheduled for release in October.
>
> http://www.c627627.com/AMD/Athlon64/
>
> Mobile Athlon 64 chips for thin and light notebooks are
> being made now on .09
>
> Carlo Razzeto wrote:
>
> > "Yousuf Khan" <bbbl67@ezrs.com> wrote in message
> > news:24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com...
> > > http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
> > >
> > >    Yousuf Khan
> > >
> > >
> >
> > I don't know, maybe it's just me but it seems like this article puts way to
> > much importance on the manufacturing process a CPU is made on.. Not that
> > these things aren't important at all... But the fact that my Athlon64 3000+
> > is still made on a .13 process really didn't discourage me at all.. My
> > system still performs extremely well despite being a "generation behind"
> > Intel's Prescott.
> >
> > Carlo

Reply JK 8/31/2004 3:02:24 AM

"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message 
news:r5n7j0ljl4vhlnr2t1nmfmdklpbgf62f6p@4ax.com...
> On Mon, 30 Aug 2004 18:54:23 -0400, "Carlo Razzeto"
> <crazzeto@hotmail.com> wrote:
>
> The important difference is that Athlon64 3000+ costs AMD more to
> build than Intel's Prescott 3.0GHz chips, yet sells for less.  New
> process generation is equally one part technology, one part financial
> these days (case-in-point, Intel is very aggressively moving the
> low-end Celeron to the newest manufacturing product rather than just
> focusing on high-end chips first).
>
> -------------
> Tony Hill
> hilla <underscore> 20 <at> yahoo <dot> ca

This I realize, and I'm not trying to take that away. I'm just saying that
if I didn't know any better and I were to read the article, I might
automatically assume that a .13 chip is worse than a .09 chip, and so on, when
the truth is that the manufacturing process is not really going to have a huge
impact on performance (unless, of course, it means they can get more MHz out
of it).

Carlo 


Reply Carlo 8/31/2004 3:55:47 AM

On Mon, 30 Aug 2004 23:55:47 -0400, "Carlo Razzeto"
<crazzeto@hotmail.com> wrote:
>
>"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message 
>news:r5n7j0ljl4vhlnr2t1nmfmdklpbgf62f6p@4ax.com...
>> On Mon, 30 Aug 2004 18:54:23 -0400, "Carlo Razzeto"
>> <crazzeto@hotmail.com> wrote:
>>
>> The important difference is that Athlon64 3000+ costs AMD more to
>> build than Intel's Prescott 3.0GHz chips, yet sells for less.  New
>> process generation is equally one part technology, one part financial
>> these days (case-in-point, Intel is very aggressively moving the
>> low-end Celeron to the newest manufacturing product rather than just
>> focusing on high-end chips first).
>
>This I realize and I'm not trying to take that away... I'm just saying that 
>if I didn't know any better and I were to read the article I might tend to 
>automatically assume that a .13 chip is worse than a .09 chip etc.... When 
>the truth is the manufacturing process is not really going to have a huge 
>impact in performance (unless of course it means they can get more MHz out 
>of it).

Well, until very recently a new manufacturing process DID mean that
they could get more MHz out of it, usually quite a bit more.  On
the old 180nm process the P4 struggled to reach 2.0GHz, while on the
130nm process Intel has managed to push the chip up to 3.4GHz.
Previously the gains were even larger, with the 250nm PIII topping out
at 600MHz and the 180nm version eventually managing 1.13GHz.

However, the new 90nm fab process has thrown this automatic
assumption of much higher clock speeds into question, at least for the
time being.  Intel is still having trouble getting the "Prescott" P4 up
to 3.6GHz and has pushed back the release dates of its 3.8 and
4.0GHz P4 chips multiple times.  This might just be a specific
situation, as the Prescott is a VERY different chip from the
Northwood, beyond simply the process shrink.  However, IBM doesn't seem
to be doing much better with its PowerPC chips.  The PPC 970 (130nm)
made it to 2.0GHz and might have had some headroom left, while
IBM is currently struggling to get decent production of the 2.5GHz PPC
970FX (90nm).
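
To put the figures above side by side, here is a small Python sketch that turns the quoted peak clock speeds into per-shrink gain ratios. The numbers are the ones cited in this post; the dictionary layout and the function names (clock_gain, summarize) are illustrative assumptions, and treating the 3.6GHz Prescott and 2.5GHz 970FX as "peak" figures is a simplification, since both were still ramping at the time.

```python
# Peak clock speeds (MHz) quoted in the post, per design and process shrink.
# Pairing them as (before, after) is this sketch's own framing.
SHRINKS = {
    "Pentium III, 250nm -> 180nm": (600, 1130),
    "Pentium 4, 180nm -> 130nm": (2000, 3400),
    "Pentium 4, 130nm -> 90nm": (3400, 3600),
    "PowerPC 970, 130nm -> 90nm": (2000, 2500),
}


def clock_gain(before_mhz, after_mhz):
    """Return the ratio of the newer clock speed to the older one."""
    return after_mhz / before_mhz


def summarize(shrinks):
    """Map each shrink to its clock gain, rounded to two decimals."""
    return {name: round(clock_gain(*pair), 2) for name, pair in shrinks.items()}


if __name__ == "__main__":
    for name, gain in summarize(SHRINKS).items():
        print(f"{name}: {gain:.2f}x")
```

On these numbers the two older shrinks come out well above 1.5x, while both 90nm transitions stay at or below 1.25x, which is the contrast the post is drawing.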


So... err.. what was the point I was trying to get at here again?!
Ohh yeah, I think I'm basically agreeing with you :>

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
Reply Tony 8/31/2004 6:41:55 AM

This is just extra publicity for what has already been
known for months, i.e. the drive to 65nm is moving at a fast
pace, things are looking good, much more straining of
silicon, better internal power management, etc. The really
exciting transistor designs will happen at 45nm, using high-k
gate dielectrics. Though that's still three years away. And there
is interesting research going on at 15nm, for the next decade.

What's not known is exactly how Intel is going to design
the silicon. How are the multiple cores going to work, especially
with the one bus? Even more significantly, how are applications going
to benefit from the 2+ cores? Are they going to have to explicitly
code multithreading to benefit, which after all ain't easy to pull off,
or will the feeding of the multiple cores be handled effectively by the
compilers, or maybe even the OS? I see that Intel has released a thread
checking tool; hopefully MS incorporates something like it in their next
Studio.

So far, it looks like the upcoming multi-core chip designs will depend heavily
on how applications are developed, more so than ever before. We
already saw some of this with the branch predictors, and the results
weren't impressive at all. If the thread-related logic issues can't somehow be
handled at the tool, OS, compiler, or chip level, then it's going to be a long
day reaping the full potential of 2+ cores. 2+ cores may end up like the
386, full of potential but without enough software support.
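
As a rough illustration of what "explicitly coding multithreading" involves, the Python sketch below divides one computation into chunks by hand and submits them to a thread pool. Everything here is hypothetical: the workload (a sum of squares) and the names split_range and threaded_sum_squares are made up for the example. Note also that in CPython the global interpreter lock keeps pure-Python threads from running CPU-bound work in parallel, so this shows only the restructuring a developer has to do, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_squares(start, stop):
    """Serial work unit: sum of i*i for start <= i < stop."""
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Cut range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    bounds, start = [], 0
    for k in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def threaded_sum_squares(n, workers=2):
    """The programmer, not the compiler or OS, decides how to divide the work."""
    chunks = split_range(n, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum_squares(*chunk), chunks)
        return sum(partials)
```

Even in this trivial case the chunking, the pool size, and the final reduction all had to be written out explicitly; none of it was derived automatically from the serial sum_squares.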



"Yousuf Khan" <bbbl67@ezrs.com> wrote in message
news:24zYc.102338$UTP.50876@twister01.bloor.is.net.cable.rogers.com...
> http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
>
>     Yousuf Khan
>
>


Reply Raymond 8/31/2004 7:54:52 AM

In article <g9WYc.3239$OQ6.1732@trnddc09>, "Raymond" <no@all.net> writes:
|>  This is just extra publicity for what has already been
|> known for months, ie the drive to 65nm is on a fast
|> pace, things are looking good, much more straining of
|> silicon, better internal power management, etc. The really
|> exciting transistor designs will happen at 45nm, using the high-k
|> interconnects. Though that's still three years away. And there
|> is interesting research going on at 15nm, for the next decade.

Oh, really?  I did a quick Web search, but couldn't find when
the comparable announcement was made for 90 nm.  I vaguely
remember mid-2001, which was a little matter of 3 years before
90 nm hit the streets in quantity.

If my recollection is correct, it isn't looking good at all for
65 nm, as the passive leakage problems are even worse.  Mid-2007
for mass production isn't what Intel are hoping for (or claiming),
but IS what ITRS are predicting ....

I shall not be holding my breath for 65 nm; you are welcome to
hold yours for it :-)


Regards,
Nick Maclaren.
Reply nmm1 8/31/2004 9:17:50 AM

Dennis M. O'Connor wrote:

> Carlo Razzeto wrote:
> 
>> I don't know, maybe it's just me but it seems like this article puts way
>> too much importance on the manufacturing process a CPU is made on..
> 
> Well, it _is_ important.  The rules change when the process
> changes, and a microarchitecture that excelled on, say, 180nm
> may be non-competitive on 65nm.

Such as P6, and thus Pentium M? :-)

That would spell bad news, hehe.

> I'd go into detail, but I work for Intel. Sorry

You gave us a taste, now we want more :-)
Reply Grumble 8/31/2004 12:22:16 PM

Nick Maclaren wrote:
> If my recollection is correct, it isn't looking good at all for
> 65 nm, as the passive leakage problems are even worse.  Mid-2007
> for mass production isn't what Intel are hoping for (or claiming),
> but IS what ITRS are predicting ....

If you read the article, the statement is that leakage is dealt with to 
a degree by straining the silicon lattice.  I don't know how much that 
changes things, but they want us to think it solves the problem (which 
it probably doesn't).

I thought 2005 was too soon for 65nm, but that's what I read.  That 
Pentium 4 will be shipping in 2005 on 65nm.  Which, thankfully, gives 
that embarrassment that is Prescott just one year of life.

Alex
-- 
My words are my own.  They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright.  (I do not speak for my employer.)

Reply Alex 8/31/2004 12:45:50 PM

"Raymond" <no@all.net> wrote :

[cut]

> reaping the full potential of 2+ cores. 2+ cores may end up like
> the 386, full of potential but not enough software support.

yes, like all the rest of SMP boxes, obsolete and unsupported ...

Pozdrawiam.
-- 
RusH   //
 http://randki.o2.pl/profil.php?id_r=352019
Like ninjas, true hackers are shrouded in secrecy and mystery.
You may never know -- UNTIL IT'S TOO LATE.
Reply RusH 8/31/2004 2:03:34 PM

In article <ch1rtu$4hc$1@news01.intel.com>,
Alex Johnson <compuwiz@jhu.edu> writes:
|> 
|> If you read the article, the statement is that leakage is dealt with to 
|> a degree by straining the silicon lattice.  I don't know how much that 
|> changes things, but they want us to think it solves the problem (which 
|> it probably doesn't).

One of the most reliable sources in the industry has told me that
it doesn't.  Yes, it helps, but only somewhat.

|> I thought 2005 was too soon for 65nm, but that's what I read.  That 
|> Pentium 4 will be shipping in 2005 on 65nm.  Which, thankfully, gives 
|> that embarrassment that is Prescott just one year of life.

If you believe that ordinary customers will be able to buy 65 nm
Pentium 4s at commodity prices in mid-2005, I have this bridge for
sale ....


Regards,
Nick Maclaren.
Reply nmm1 8/31/2004 2:37:13 PM


Raymond wrote:

> 
> So far, looks like the new upcoming multi-core chip designs will depend heavily
> on how applications are developed, more so than ever before. We
> already saw some of this with the branch-predictors, the results
> weren't impressive at all. If the thread related logic issues can't somehow be
> handled at the tool, OS, compiler, or chip level, then it's going to be a long
> day reaping the full potential of 2+ cores. 2+ cores may end up like the
> 386, full of potential but not enough software support.

Well, the various SMT, CMP, hyperthreading, etc. solutions have various tradeoffs.
The OSes and applications that can't do a rewrite every time a new variation comes
out will basically just ignore the differences.  You just use POSIX pthreads and
use high granularity concurrency.  So the only benefit the CPU vendors will have is
short term: what they can get with model-dependent device drivers and marketing hype.

I'm curious as to what Sun is up to with their Throughput Computing, which would seem
to be based on a low granularity concurrency model.  It may end up being a closed
model based on proprietary hardware, in which case it may not amount to much.  What
they should do is create a platform-independent API that allows use of low granularity
concurrency and use that to show the benefits of their hw solution over other hw.
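
A minimal sketch of the "ignore the differences" approach described in the first paragraph, written in Python rather than directly against pthreads: the application asks only how many logical CPUs there are and hands whole, independent jobs to that many worker threads, without caring whether those CPUs are SMT siblings, separate cores, or separate sockets. The names worker_count and run_jobs are hypothetical. os.cpu_count() reports logical CPUs and may return None when the count cannot be determined, hence the fallback.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(default=1):
    """Number of logical CPUs, or `default` if the platform cannot say."""
    count = os.cpu_count()
    return count if count else default


def run_jobs(jobs):
    """Run each whole job on some worker thread; jobs share no state.

    `jobs` is a list of zero-argument callables. Results come back in
    the same order as the jobs were given.
    """
    with ThreadPoolExecutor(max_workers=worker_count()) as pool:
        return list(pool.map(lambda job: job(), jobs))
```

Because each job is a large self-contained unit, the same code runs unchanged on any of the hardware variations mentioned above; the cost is that it cannot exploit whatever a particular design does well at finer granularity.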

Joe Seigh
Reply Joe 8/31/2004 2:53:40 PM

On Tue, 31 Aug 2004 02:41:55 -0400, Tony Hill
<hilla_nospam_20@yahoo.ca> wrote:

>However the new 90nm fab process has maybe thrown this automatic
>assumption of much higher clock speeds into question, at least for the
>time being.  Intel's still having trouble getting the "Prescott" P4 up
>to 3.6GHz and have pushed back the release date of their 3.8 and
>4.0GHz P4 chips multiple times.

As I understand it, you could indeed hit, say, 5 GHz with a 90 nm
process (and Prescott's design - longer pipeline, etc - indicates
Intel were hoping to do just that), except that the chip would melt?

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
Reply wallacethinmintr 8/31/2004 4:05:08 PM

In article <4134a157.322981650@news.eircom.net>,
wallacethinmintr@eircom.net (Russell Wallace) writes:
|> On Tue, 31 Aug 2004 02:41:55 -0400, Tony Hill
|> <hilla_nospam_20@yahoo.ca> wrote:
|> 
|> >However the new 90nm fab process has maybe thrown this automatic
|> >assumption of much higher clock speeds into question, at least for the
|> >time being.  Intel's still having trouble getting the "Prescott" P4 up
|> >to 3.6GHz and have pushed back the release date of their 3.8 and
|> >4.0GHz P4 chips multiple times.
|> 
|> As I understand it, you could indeed hit, say, 5 GHz with a 90 nm
|> process (and Prescott's design - longer pipeline, etc - indicates
|> Intel were hoping to do just that), except that the chip would melt?

I am pretty sure that Intel could cool the chip, even at that speed.
A factory-fitted silver heatsink, with high-speed water-cooling to
a heat exchanger in front of a large and fast fan, bolted into a
heavy chassis, should do the job.

As a demonstration of virtuosity, it would be excellent.  As a
system to sell in large numbers, perhaps not.


Regards,
Nick Maclaren.
Reply nmm1 8/31/2004 4:23:33 PM

> Carlo Razzeto wrote:
> 
>> I don't know, maybe it's just me but it seems like this article puts way
>> too much importance on the manufacturing process a CPU is made on..
> 
> Well, it _is_ important.  The rules change when the process
> changes, and a microarchitecture that excelled on, say, 180nm
> may be non-competitive on 65nm.

Take Alpha 21264, as an example. It ran 600MHz in the DEC FAB at
(what was it) 0.25 micron. It also ran at 600 MHz in the Samsung
FAB at 0.18 microns. This basically indicates that DEC at 0.25µ
was as high performing as Samsung at 0.18µ.

In general, the speed of a part optimized for a high performance 
logic FAB* can run as much as 2X the frequency of that same design 
taped out for a merchant market 'rent-a-FAB'**. So, yes, manufacturing
capability is a very very big lever in the performance of a CPU.

Mitch

[*] such as an Intel FAB
[**] if you can get it to run at all (timed race conditions)
Reply MitchAlsup 8/31/2004 11:06:08 PM

Mitch Alsup <MitchAlsup@aol.com> wrote:
> Take Alpha 21264, as an example. It ran 600MHz in the DEC FAB at
> (what was it) 0.25 micron. It also ran at 600 MHz in the Samsung
> FAB at 0.18 microns. This basically indicates that DEC at 0.25µ
> was as high performing as Samsung at 0.18µ.
>
> In general, the speed of a part optimized for a high performance
> logic FAB* can run as much as 2X the frequency of that same design
> taped out for a merchant market 'rent-a-FAB'**. So, yes, manufacturing
> capability is a very very big lever in the performance of a CPU.
>
> Mitch
>
> [*] such as an Intel FAB
> [**] if you can get it to run at all (timed race conditions)

Is it perhaps possible that that particular generation of Alpha was locked
at 600MHz, due to internal timing concerns, or what have you?

    Yousuf Khan


Reply Yousuf 8/31/2004 11:59:08 PM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ch1fnu$9vv$1@pegasus.csx.cam.ac.uk...
>
> In article <g9WYc.3239$OQ6.1732@trnddc09>, "Raymond" <no@all.net> writes:
> |>  This is just extra publicity for what has already been
> |> known for months, ie the drive to 65nm is on a fast
> |> pace, things are looking good, much more straining of
> |> silicon, better internal power management, etc. The really
> |> exciting transistor designs will happen at 45nm, using the high-k
> |> interconnects. Though that's still three years away. And there
> |> is interesting research going on at 15nm, for the next decade.
>
> Oh, really?  I did a quick Web search, but couldn't find when
> the comparable announcement was made for 90 nm.  I vaguely
> remember mid-2001, which was a little matter of 3 years before
> 90 nm hit the streets in quantity.

If you read exactly what Intel said after they achieved 90nm
SRAM, they weren't anywhere near as rosy as they are now with
65nm.

> If my recollection is correct, it isn't looking good at all for
> 65 nm, as the passive leakage problems are even worse.  Mid-2007
> for mass production isn't what Intel are hoping for (or claiming),
> but IS what ITRS are predicting ....
>
> I shall not be holding my breath for 65 nm; you are welcome to
> hold yours for it :-)

I am holding my breath! :-)


Reply Raymond 9/1/2004 4:01:59 AM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ch22ep$qkl$1@pegasus.csx.cam.ac.uk...
>
> In article <ch1rtu$4hc$1@news01.intel.com>,
> Alex Johnson <compuwiz@jhu.edu> writes:
> |>
> |> If you read the article, the statement is that leakage is dealt with to
> |> a degree by straining the silicon lattice.  I don't know how much that
> |> changes things, but they want us to think it solves the problem (which
> |> it probably doesn't).
>
> One of the most reliable sources in the industry has told me that
> it doesn't.  Yes, it helps, but only somewhat.
>
> |> I thought 2005 was too soon for 65nm, but that's what I read.  That
> |> Pentium 4 will be shipping in 2005 on 65nm.  Which, thankfully, gives
> |> that embarrassment that is Prescott just one year of life.
>
> If you believe that ordinary customers will be able to buy 65 nm
> Pentium 4s at commodity prices in mid-2005, I have this bridge for
> sale ....

What they're saying is first production in 2005, and high volume by
2006, perhaps even high enough to overtake that of 90nm.


Reply Raymond 9/1/2004 4:01:59 AM

"Raymond" <no@all.net> wrote in message news:<g9WYc.3239$OQ6.1732@trnddc09>...
> 
> What's not known is exactly how Intel is going to design
> the silicon. How are the multiple cores going to work, especially
> with the one bus? Even more significantly, how are applications going
> to benefit from the 2+ cores; are they going to have to explicitly
> code multiple-threading to benefit,  which afterall ain't easy to pull off,
> or will the feeding of the multiple cores be handled effectively by the
> compilers,
> or may be even the OS? I see that Intel has released a thread checking
> tool, hopefully MS incorporates something like it in their next Studio.

Every version of Windows based on NT (NT, 2000, XP, Server 2k3,
Longhorn, etc) has gotten progressively better at utilizing multiple
CPUs. MS keeps tweaking things to a finer level of granularity. So
at minimum, a single-threaded application could still hog 1 CPU, but at
least the OS underneath will do its best to make use of the other
CPU.

Also, I suspect your comments about languages are true when it comes
to C/C++. But the newer languages like Java, C# and VB.Net make
working with threads MUCH easier. I'm not exactly sure what MS could
"incorporate in their next Studio" that could possibly make it any
easier to write multi-threaded managed code. And with a lot more of
Longhorn itself written as managed code, including the new Avalon/XAML
UI stuff, I suspect that even traditional message-driven GUI code will
make better use of multiple cores. Of course the cynics will claim
that amounts to Windows yet again sucking all possible power out of
even the latest & greatest hardware, but I guess that's inevitable.

IMO the bigger debate will be: Do I go for a faster single core or
slower dual core CPU? All things being equal (including cost), I think
a dual core chip has to be clocked slower and/or have less cache???
Not confusing the market will be a real challenge if that's the case.
Reply gaf1234567890 9/1/2004 5:17:28 AM

"G" <gaf1234567890@hotmail.com> wrote in message
news:b7eb1fbe.0408312117.79f43277@posting.google.com...
> "Raymond" <no@all.net> wrote in message news:<g9WYc.3239$OQ6.1732@trnddc09>...

> Every version of Windows based on NT (NT, 2000, XP, Server 2k3,
> Longhorn, etc) has gotten progressively better at utilizing multiple
> CPU's. MS keeps tweaking things to a finer level of granularity. So
> minimally, a single threaded application could still hog 1 CPU, but at
> least the OS underneath will do it's best to make use of the other
> CPU.
>
> Also, I suspect your comments about languages are true when it comes
> to C/C++. But the newer languages like Java, C# and VB.Net make
> working with threads MUCH easier. I'm not exactly sure what MS could
> "incorporate in their next Studio" that could possibly make it any
> easier to write multi-threaded managed code. And with alot more of
> Longhorn written itself as managed code, inculding the new Avalon/XAML
> UI stuff, I suspect that even traditional message driven GUI code will
> make better use of multiple cores. Of course the cynics will claim
> that amounts to Windows yet again sucking all possible power out of
> even the latest & greatest hardware, but I guess that's inevitable.
>
> IMO the bigger debate will be: Do I go for a faster single core or
> slower dual core CPU? All things being equal (including cost), I think
> a dual core chip has to be clocked slower and/or have less cache???
> Not confusing the market will be a real challenge if that's the case.

I like the idea of at least 2 cores for desktops, as long as it's implemented
well. There is enough multi-threading and multi-tasking going on today
for some real benefit, but it won't be even close to x2 performance. Beyond
2 cores, I don't see much benefit in adding more cores for desktops, not today,
and not tomorrow, notwithstanding a lot more intense use of multi-threading.
I just don't see how the OS, or any compiler, can possibly deal with the main
logical issues involved in synchronization and concurrency, automagically
turning an otherwise mostly STA program into a multi-threaded one.  .NET has
some nice features for multi-threading, but other than the garbage collector,
they don't run by themselves. It's still up to the developer to handle the
logical issues involved, and debugging them is still quite a challenge.
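
One common way to put a number on "some real benefit, but not close to x2" is Amdahl's law, which the post does not name; it is brought in here as an assumption. If a fraction p of a program's run time can be spread across n cores and the rest stays serial, the best-case speedup is 1 / ((1 - p) + p / n). The function name and the sample fractions below are illustrative only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup under Amdahl's law.

    parallel_fraction: share of run time that can use all cores (0.0 to 1.0).
    cores: number of cores available (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical desktop workloads with different parallel shares.
    for p in (0.25, 0.5, 0.75):
        for n in (2, 4):
            print(f"p={p:.2f}, cores={n}: {amdahl_speedup(p, n):.2f}x")
```

For a program that is half parallelizable, two cores give at most 4/3 of the single-core speed, and the gain from going beyond two cores shrinks further, which matches the intuition in the post.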



Reply Raymond 9/1/2004 6:30:13 AM

MitchAlsup@aol.com (Mitch Alsup) writes:
>Take Alpha 21264, as an example. It ran 600MHz in the DEC FAB at
>(what was it) 0.25 micron. It also ran at 600 MHz in the Samsung
>FAB at 0.18 microns. This basically indicates that DEC at 0.25µ
>was as high performing as Samsung at 0.18µ.

AFAIK the DEC Fab was sold to Intel before they switched from 0.35u to
0.25u (and someone here (Doug Siebert?) claimed that they would switch
to 0.28u, not 0.25u).

In 0.35u, the 21264 was generally sold at 500MHz, although there were
versions in some machines at 575MHz (which came out before the general
availability of the 21164).  The 21264 in 0.25u was sold at 667MHz.
The 21264 in 0.18u was sold at 800MHz, later 1001MHz, and eventually
1250MHz.

I don't know where they were fabbed, but I guess that the 800MHz chips
in our Samsung UP1500 boards were fabbed by Samsung.

They never sold that board with faster parts; possible explanations
for that are: Samsung stopped working on Alphas before the faster
designs became available (this was around the time when the
end-of-life of Alpha was announced); or the faster designs were tuned
for the Intel process and would have required retuning for the Samsung
process (if at all possible); or the board was not designed to deliver
enough power for the faster parts (that would have been relatively
easy to fix, though).  Samsung had announced EV68s with faster clock
and on-chip L2 cache, though.

>In general, the speed of a part optimized for a high performance 
>logic FAB* can run as much as 2X the frequency of that same design 
>taped out for a merchant market 'rent-a-FAB'**. So, yes, manufacturing
>capability is a very very big lever in the performance of a CPU.
>
>Mitch
>
>[*] such as an Intel FAB

Well, in the example above the Intel fab produced at most a frequency
advantage factor of 1.56 over the Samsung fab.
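
For reference, the 1.56 figure appears to be the fastest 0.18u clock mentioned above (1250MHz) divided by the 800MHz of the Samsung-board parts. The short Python sketch below redoes that division; the constant and function names are illustrative, and attributing the two clocks to the Intel and Samsung fabs respectively follows the guess made in this post rather than any confirmed sourcing.

```python
# Clock speeds (MHz) for the 0.18u 21264 as quoted in the post.
SAMSUNG_BOARD_MHZ = 800    # parts in the Samsung UP1500 boards
FASTEST_018U_MHZ = 1250    # fastest 0.18u part eventually sold

# Upper bound claimed earlier in the thread for a high-performance fab.
CLAIMED_MAX_FACTOR = 2.0


def frequency_ratio(fast_mhz, slow_mhz):
    """Return how many times faster the first clock is than the second."""
    return fast_mhz / slow_mhz


if __name__ == "__main__":
    ratio = frequency_ratio(FASTEST_018U_MHZ, SAMSUNG_BOARD_MHZ)
    print(f"observed factor: {ratio:.4f}, claimed upper bound: {CLAIMED_MAX_FACTOR}")
```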

Also, we should be able to see a frequency advantage of Intel over
Nvidia and ATI in graphics chips and maybe over Nvidia and VIA in
chipsets (although for chipsets the effects would not be very
visible).

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Reply anton 9/1/2004 7:12:55 AM

In article <XQbZc.44$A63.6@trnddc09>, Raymond <no@all.net> wrote:
>>
>> Oh, really?  I did a quick Web search, but couldn't find when
>> the comparable announcement was made for 90 nm.  I vaguely
>> remember mid-2001, which was a little matter of 3 years before
>> 90 nm hit the streets in quantity.
>
>If you read exactly what Intel said after they achieved 90nm
>SRAM, they weren't anywhere as rosy as they are now with
>65nm.

I need to correct what I said - it was 2 years.  March 2002.

Actually, I remember them being every bit as optimistic.  Anyway,
such claims are worth almost as much as the hot air that carries
them.

>> I shall not be holding my breath for 65 nm; you are welcome to
>> hold yours for it :-)
>
>I am holding my breath! :-)

You have better lungs than I do :-)


Regards,
Nick Maclaren.
Reply nmm1 9/1/2004 8:34:41 AM

In article <XQbZc.45$A63.43@trnddc09>, Raymond <no@all.net> wrote:
>"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>news:ch22ep$qkl$1@pegasus.csx.cam.ac.uk...
>> In article <ch1rtu$4hc$1@news01.intel.com>,
>> Alex Johnson <compuwiz@jhu.edu> writes:
>> |>
>> |> I thought 2005 was too soon for 65nm, but that's what I read.  That
>> |> Pentium 4 will be shipping in 2005 on 65nm.  Which, thankfully, gives
>> |> that embarrassment that is Prescott just one year of life.
>>
>> If you believe that ordinary customers will be able to buy 65 nm
>> Pentium 4s at commodity prices in mid-2005, I have this bridge for
>> sale ....
>
>What they're saying is first production in 2005, and high volume by
>2006, perhaps even high enough to overtake that of 90nm.

Even if that were so, it would give Prescott a lot more than a year
to hold the fort.

Anyway, once upon a time when knights were bold and press statements
were intended to convey information, "production" meant the delivery
of products, and "products" meant goods sold to ordinary customers.
At least in this context.

Yes, I believe that Intel (and IBM) will be able to make 65 nm CPUs
in early 2005, perhaps even late 2004.  But small numbers of chips
made for testing do not constitute production in any meaningful
sense.

Regards,
Nick Maclaren.
Reply nmm1 9/1/2004 8:39:19 AM

> I am pretty sure that Intel could cool the chip, even at that speed.
> A factory-fitted silver heatsink, with high-speed water-cooling to
> a heat exchanger in front of a large and fast fan, bolted into a
> heavy chassis, should do the job.

A heat pipe is better at moving heat than any solid material, and quite
easy to use.

Dumping all those watts in the environment, absent water cooling, is more
of a problem. I'd rather not have several hundred watts heating the air in
my office, thank you.

	Jan
0
Reply ISO 9/1/2004 8:40:00 AM

In article <b7eb1fbe.0408312117.79f43277@posting.google.com>,
G <gaf1234567890@hotmail.com> wrote:
>
>Also, I suspect your comments about languages are true when it comes
>to C/C++. But the newer languages like Java, C# and VB.Net make
>working with threads MUCH easier. I'm not exactly sure what MS could
>"incorporate in their next Studio" that could possibly make it any
>easier to write multi-threaded managed code. And with a lot more of
>Longhorn itself written as managed code, including the new Avalon/XAML
>UI stuff, I suspect that even traditional message driven GUI code will
>make better use of multiple cores. Of course the cynics will claim
>that amounts to Windows yet again sucking all possible power out of
>even the latest & greatest hardware, but I guess that's inevitable.

I am afraid not.  I haven't looked at them in detail, but a quick
glance indicates that they give the appearance of making the design
and coding of threaded applications easier, while not tackling the
most important problems.

But your last remark is correct.  It isn't hard to separate GUIs
into multiple components, separated by message passing (whether
using thread primitives or not), and those are a doddle to schedule
on multi-core systems.  And that is the way that things are going.


Regards,
Nick Maclaren.
0
Reply nmm1 9/1/2004 8:43:25 AM
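
The structure described in the post above (GUI components that only
talk to each other through messages) can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not
how any real display manager works: the names display_manager,
application and run_pipeline, and the message format, are all
hypothetical. Each component owns its own state and communicates only
through queues, which is what makes the pieces easy to place on
separate cores.

```python
import queue
import threading


def display_manager(raw_events, app_inbox):
    """Translate raw input events and forward them to the application."""
    while True:
        event = raw_events.get()
        if event is None:  # sentinel: shut down and tell the next stage
            app_inbox.put(None)
            return
        x, y = event
        app_inbox.put({"type": "mouse_move", "x": x, "y": y})


def application(app_inbox, handled):
    """Consume translated events; only this thread touches `handled`."""
    while True:
        msg = app_inbox.get()
        if msg is None:
            return
        handled.append((msg["x"], msg["y"]))


def run_pipeline(events):
    """Run both components as threads and feed them the given events."""
    raw_events, app_inbox, handled = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=display_manager, args=(raw_events, app_inbox)),
        threading.Thread(target=application, args=(app_inbox, handled)),
    ]
    for w in workers:
        w.start()
    for e in events:
        raw_events.put(e)
    raw_events.put(None)  # end of input
    for w in workers:
        w.join()
    return handled
```

Note that in standard CPython the global interpreter lock limits how
much of this actually runs in parallel, so the sketch shows the shape
of the design rather than a speedup.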

In article <2plg70Fm26psU1@uni-berlin.de>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote:
>> I am pretty sure that Intel could cool the chip, even at that speed.
>> A factory-fitted silver heatsink, with high-speed water-cooling to
>> a heat exchanger in front of a large and fast fan, bolted into a
>> heavy chassis, should do the job.
>
>A heat pipe is better at moving heat than any solid material, and quite
>easy to use.

Hang on - I never said that the silver heatsink was solid!  It should
be silver for the conductivity and resistance to corrosion, but I was
assuming circulating water inside it.  Sorry about omitting that
critical point :-(

>Dumping all those watts in the environment, absent water cooling, is more
>of a problem. I'd rather not have several hundred watts heating the air in
>my office, thank you.

Or 1,000 of them dumping heat in my machine room ....


Regards,
Nick Maclaren.
0
Reply nmm1 9/1/2004 8:55:06 AM

G wrote:
>
> Every version of Windows based on NT (NT, 2000, XP, Server 2k3,
> Longhorn, etc) has gotten progressively better at utilizing multiple
> CPUs. MS keeps tweaking things to a finer level of granularity. So
> minimally, a single-threaded application could still hog 1 CPU, but at
> least the OS underneath will do its best to make use of the other
> CPU.

A data point. I'm doing nothing much except reading this group and yet
the XP performance monitor shows a queue of 7 or 8 threads ready to run.

I think applications like WORD and Excel already do things like spell-
checking and recalculation in worker threads. I don't find it hard to
believe that a typical Windows box would benefit from 4+ "processors". 


0
Reply Ken 9/1/2004 10:21:52 AM


Nick Maclaren wrote:
> 
> But your last remark is correct.  It isn't hard to separate GUIs
> into multiple components, separated by message passing (whether
> using thread primitives or not), and those are a doddle to schedule
> on multi-core systems.  And that is the way that things are going.
> 

I'm not sure that the gui by itself is enough to justify a multi-core
cpu.  And there are problems enough in multi-threaded gui, even apart
from deadlocks caused by inexperienced programmer mixing threads and OO
callbacks.  Consider mouse events queued before but received after a
resize operation.  The mouse coordinates are in the wrong frame of reference
and all wrong.  Gui designers design as if the event queue was <= 1 at all
times.

What would be more likely to utilize concurrency is the database-like
Longhorn filesystem that MS is supposed to be doing.  Except that I don't
think MS has the expertise to do lock-free concurrent programming like that.
If they have, they've been keeping a low profile.

Joe Seigh
0
Reply Joe 9/1/2004 11:08:07 AM
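
The stale-coordinate problem described in the post above can be made
concrete with a small Python sketch. It assumes one possible remedy,
which the post itself does not propose: stamp every queued event with
a "layout generation" counter and drop events whose stamp no longer
matches after a resize. The Window and MouseEvent classes and the
layout_generation field are hypothetical names used only for
illustration; remapping the coordinates instead of discarding the
event would be an equally plausible policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MouseEvent:
    x: int
    y: int
    layout_generation: int  # hypothetical stamp: layout the coordinates refer to


class Window:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.generation = 0     # bumped on every resize
        self.pending = deque()  # events queued but not yet delivered

    def post_mouse(self, x, y):
        """Queue a mouse event, tagged with the layout it was measured in."""
        self.pending.append(MouseEvent(x, y, self.generation))

    def resize(self, width, height):
        """Change the layout; coordinates queued earlier are now stale."""
        self.width, self.height = width, height
        self.generation += 1

    def drain(self):
        """Return only events whose coordinates match the current layout."""
        fresh = [e for e in self.pending
                 if e.layout_generation == self.generation]
        self.pending.clear()
        return fresh
```

With this in place, a click queued before a resize is filtered out
rather than being interpreted in the wrong frame of reference.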

In article <4135ADBB.722B70F9@xemaps.com>,
Joe Seigh <jseigh_01@xemaps.com> writes:
|> 
|> I'm not sure that the gui by itself is enough to justify a multi-core
|> cpu.  And there are problems enough in multi-threaded gui, even apart
|> from deadlocks caused by inexperienced programmer mixing threads and OO
|> callbacks.  Consider mouse events queued before but received after a
|> resize operation.  The mouse coordinates are in the wrong frame of reference
|> and all wrong.  Gui designers design as if the event queue was <= 1 at all
|> times.

Take a mouse event in an unrealistically simple design.  This is picked
up by the kernel, and passed to the display manager, which converts it
into another form and passes it to the application.  That does something
with it, passes a message to the display manager, which calls the kernel
to update the screen.  The user does not see any effect until that has
completed.

At best, you have 4 context switches, 2 of which are between user-level
contexts, and it is common for there to be MANY more.  Now, consider
that being done as part of drag-and-drop - you want the process to
happen in under 2 milliseconds (certainly under 5), or it will start to
be visible.  That can be 1,000+ context switches a second, and some
of those contexts have large working sets, so you are reloading a
lot of cache and TLBs.

One of the advantages of a multi-core system is that you don't need to
switch context just to pass a message if the threads or processes are
on different cores.  You just pass the message.


Regards,
Nick Maclaren.
0
Reply nmm1 9/1/2004 12:27:34 PM
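
The arithmetic behind the "1,000+ context switches a second" figure in
the post above can be checked with a short Python sketch. The function
name is hypothetical, and the model is deliberately crude: it assumes
events arrive back to back and that every event costs exactly the
stated number of switches.

```python
def switches_per_second(switches_per_event, budget_ms):
    """Context switches per second if one event must complete every budget_ms."""
    events_per_second = 1000 / budget_ms
    return switches_per_event * events_per_second


# Best case from the post: 4 context switches per mouse event.
rate_at_2ms = switches_per_second(4, 2)  # 2 ms budget per event
rate_at_5ms = switches_per_second(4, 5)  # 5 ms budget per event
```

Under these assumptions the 2 ms budget gives 2,000 switches a second
and the 5 ms budget gives 800, which brackets the figure quoted. Any
extra hops in a more realistic design scale the result linearly.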

gaf1234567890@hotmail.com (G) writes:
> Every version of Windows based on NT (NT, 2000, XP, Server 2k3,
> Longhorn, etc) has gotten progressively better at utilizing multiple
> CPUs. MS keeps tweaking things to a finer level of granularity. So
> minimally, a single-threaded application could still hog 1 CPU, but
> at least the OS underneath will do its best to make use of the
> other CPU.

long ago and far away i was told that the people in beaverton had done
quite a bit of the NT smp work ... since all they had was smp (while
redmond concentrated on their primary customer base ... which was
mostly all non-smp).

-- 
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
0
Reply Anne 9/1/2004 2:54:09 PM

> I think applications like WORD and Excel already do things like spell-
> checking and recalculation in worker threads. I don't find it hard to

I also see a lot of background processes from GUI thingies on my Mac.
This sucks because it happens even for applications that are currently
"idle".  E.g. there are two other people "logged in" but currently inactive,
but they use up a lot of resident pages, thus making me page a lot more.
I suspect that with 10 users logged in at the same time and only 768MB of
RAM, the machine would be brought to its knees :-(


        Stefan
0
Reply Stefan 9/1/2004 3:03:02 PM

On 31 Aug 2004 16:23:33 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>I am pretty sure that Intel could cool the chip, even at that speed.
>A factory-fitted silver heatsink, with high-speed water-cooling to
>a heat exchanger in front of a large and fast fan, bolted into a
>heavy chassis, should do the job.

Indeed, I read a while ago that someone actually did crank a P4 to 5
GHz with the aid of a custom-built liquid cooling system. Of course,
it was a "because it's there" personal project rather than a
commercial product.

>As a demonstration of virtuosity, it would be excellent.  As a
>system to sell in large numbers, perhaps not.

Quite.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/1/2004 3:34:21 PM

On Wed, 01 Sep 2004 10:40:00 +0200, Jan Vorbrüggen
<jvorbrueggen-not@mediasec.de> wrote:

>Dumping all those watts in the environment, absent water cooling, is more
>of a problem. I'd rather not have several hundred watts heating the air in
>my office, thank you.

For me, that would be an advantage: I need the heat anyway; it might
as well be doing useful work on the way. It's the cost of the system
that'd be a problem.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/1/2004 3:35:00 PM

On Wed, 01 Sep 2004 06:30:13 GMT, "Raymond" <no@all.net> wrote:

>Beyond 2 cores, I don't see much benefit adding more cores for desktops,
>not today, and not tomorrow, notwithstanding a lot more intense use of
>multi-threading. I just don't see how the OS, or any compiler, can
>possibly deal with the main logical issues involved in synchronization
>and concurrency, automagically turning an otherwise mostly STA program
>into a multi-threaded one.

We had exactly that argument 15 years ago with regard to parallel
processing on servers and supercomputers.

It won't surprise me in the least if 15 years from now, when the
conversation is about multiple cores in digital watches or whatever,
someone says "we had exactly that argument 15 years ago with regard to
parallel processing on desktops" :)

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/1/2004 7:35:36 PM

In article <41362416.1434651@news.eircom.net>,
Russell Wallace <wallacethinmintr@eircom.net> wrote:
>On Wed, 01 Sep 2004 06:30:13 GMT, "Raymond" <no@all.net> wrote:
>
>>Beyond 2 cores, I don't see much benefit adding more cores for desktops,
>>not today, and not tomorrow, notwithstanding a lot more intense use of
>>multi-threading. I just don't see how the OS, or any compiler, can
>>possibly deal with the main logical issues involved in synchronization
>>and concurrency, automagically turning an otherwise mostly STA program
>>into a multi-threaded one.
>
>We had exactly that argument 15 years ago with regard to parallel
>processing on servers and supercomputers.

And 30 years ago.  I wasn't in this game 45 years ago.

>It won't surprise me in the least if 15 years from now, when the
>conversation is about multiple cores in digital watches or whatever,
>someone says "we had exactly that argument 15 years ago with regard to
>parallel processing on desktops" :)

Nor would it surprise me.  Raymond makes one good point, though he
gets it slightly wrong!

There is effectively NO chance of automatic parallelisation working
on serial von Neumann code of the sort we know and, er, love.  Not
in the near future, not in my lifetime and not as far as anyone can
predict.  Forget it.

This has the consequence that large-scale parallelism is not a viable
general-purpose architecture until and unless we move to a paradigm
that isn't so intractable.  There are such paradigms (functional
programming is a LITTLE better, for a start), but none have taken
off as general models.  The HPC world is sui generis, and not relevant
in this thread.

So he would be right if he replaced "beyond 2 cores" by "beyond a
small number of cores".  At least for the next decade or so.


Regards,
Nick Maclaren.
0
Reply nmm1 9/1/2004 7:50:09 PM

nmm1@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<ch4f7m$sq6$1@pegasus.csx.cam.ac.uk>...
> In article <4135ADBB.722B70F9@xemaps.com>,
> Joe Seigh <jseigh_01@xemaps.com> writes:
> |> 
> |> I'm not sure that the gui by itself is enough to justify a multi-core
> |> cpu.  And there are problems enough in multi-threaded gui, even apart
> |> from deadlocks caused by inexperienced programmer mixing threads and OO
> |> callbacks.  Consider mouse events queued before but received after a
> |> resize operation.  The mouse coordinates are in the wrong frame of reference
> |> and all wrong.  Gui designers design as if the event queue was <= 1 at all
> |> times.
> 
> Take a mouse event in an unrealistically simple design.  This is picked
> up by the kernel, and passed to the display manager, which converts it
> into another form and passes it to the application.  That does something
> with it, passes a message to the display manager, which calls the kernel
> to update the screen.  The user does not see any effect until that has
> completed.
> 
> At best, you have 4 context switches, 2 of which are between user-level
> contexts, and it is common for there to be MANY more.  Now, consider
> that being done as part of drag-and-drop - you want the process to
> happen in under 2 milliseconds (certainly under 5), or it will start to
> be visible.  That can be 1,000+ context switches a second, and some
> of those contexts have large working sets, so you are reloading a
> lot of cache and TLBs.
> 
> One of the advantages of a multi-core system is that you don't need to
> switch context just to pass a message if the threads or processes are
> on different cores.  You just pass the message.
> 
> 
> Regards,
> Nick Maclaren.


Actually I wasn't even thinking about anything remotely as complicated
as that.

What I thought is that since XAML is declarative in nature, that an
"inexperienced programmer mixing threads and OO callbacks" (Joe's
comment) wouldn't really be doing the coding at all. It would be done
(and theoretically optimized) by the implementation that sits behind
it.

With respect to both threaded apps and GUI development, my only point
is that it's one possible benefit of the newer higher level
languages/tools. In fact I seem to remember the exact same case being
made a long time ago for things like the UCSD P-System... Whether it's
true or not I can't say.
0
Reply gaf1234567890 9/1/2004 8:45:05 PM

We are on track for mass shipment of a
billion (that's with a B) transistor die by '08.

We shall all now bow toward Santa Clara.

Moore Rules!!!!


"Raymond" <no@all.net> wrote in message news:XQbZc.44$A63.6@trnddc09...
>
> "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
> news:ch1fnu$9vv$1@pegasus.csx.cam.ac.uk...
> >
> > In article <g9WYc.3239$OQ6.1732@trnddc09>, "Raymond" <no@all.net> writes:
> > |>  This is just extra publicity for what has already been
> > |> known for months, ie the drive to 65nm is on a fast
> > |> pace, things are looking good, much more straining of
> > |> silicon, better internal power management, etc. The really
> > |> exciting transistor designs will happen at 45nm, using the high-k
> > |> interconnects. Though that's still three years away. And there
> > |> is interesting research going on at 15nm, for the next decade.
> >
> > Oh, really?  I did a quick Web search, but couldn't find when
> > the comparable announcement was made for 90 nm.  I vaguely
> > remember mid-2001, which was a little matter of 3 years before
> > 90 nm hit the streets in quantity.
>
> If you read exactly what Intel said after they achieved 90nm
> SRAM, they weren't anywhere as rosy as they are now with
> 65nm.
>
> > If my recollection is correct, it isn't looking good at all for
> > 65 nm, as the passive leakage problems are even worse.  Mid-2007
> > for mass production isn't what Intel are hoping for (or claiming),
> > but IS what ITRS are predicting ....
> >
> > I shall not be holding my breath for 65 nm; you are welcome to
> > hold yours for it :-)
>
> I am holding my breath! :-)
>
>


0
Reply spinlock 9/1/2004 8:52:11 PM

In article <b7eb1fbe.0409011245.52e96dd3@posting.google.com>,
G <gaf1234567890@hotmail.com> wrote:
>
>Actually I wasn't even thinking about anything remotely as complicated
>as that.

Don't ever try to track down a bug in a GUI system, then :-(  I was
not joking when I said that was unrealistically simple.

>What I thought is that since XAML is declarative in nature, that an
>"inexperienced programmer mixing threads and OO callbacks" (Joe's
>comment) wouldn't really be doing the coding at all. It would be done
>(and theoretically optimized) by the implementation that sits behind
>it.

Grrk.  I don't know XAML, but that sends shivers up my spine.  It is
FAR harder to get that sort of thing right than it appears, unless
the language is designed to ensure that such parallelism cannot
create an inconsistency.  And VERY few are.

>With respect to both threaded apps and GUI development, my only point
>is that it's one possible benefit of the newer higher level
>languages/tools. In fact I seem to remember the exact same case being
>made a long time ago for things like the UCSD P-System... Whether it's
>true or not I can't say.

It has been claimed more often than I care to think, and I have been
inflicted with such claims since the 1960s.  Yes, it is a possible
benefit, but it is rarely delivered.  Such languages typically make
one of three errors:

    Relying on the user not making an error - not one.

    Being so restrictive that they can't be used for real work.

    Being so incomprehensible that nobody can understand them.


Regards,
Nick Maclaren.
0
Reply nmm1 9/1/2004 8:55:08 PM

On Wed, 01 Sep 2004 13:52:11 -0700, spinlock wrote:

> We are on track for mass shipment of a billion (that's with a B) transistor
> die by '08.
> 
> We shall all now bow toward Santa Clara.
> 
> Moore Rules!!!!

Ummm... the 'lock' fell off your 'spin'

0
Reply AD 9/1/2004 11:28:50 PM

On 1 Sep 2004 19:50:09 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>There is effectively NO chance of automatic parallelisation working
>on serial von Neumann code of the sort we know and, er, love.  Not
>in the near future, not in my lifetime and not as far as anyone can
>predict.  Forget it.

At least as far as your typical spaghetti C++ is concerned, yeah, not
going to happen anytime in the near future.

>This has the consequence that large-scale parallelism is not a viable
>general-purpose architecture until and unless we move to a paradigm
>that isn't so intractable.

And yet, by that argument there should be no market for the big
parallel servers and supercomputers; yet there is. The solution is
that for things that need the speed, people just write the parallel
code by hand.

If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
out is a chip with 1024 individually slow cores, then those games will
be written to use 1024-way parallelism, just as weather forecasting
and quantum chemistry programs are today. Ditto for Photoshop, 3D
modelling, movie editing, speech recognition etc. There's certainly no
shortage of parallelism in the problem domains. The reason things like
games don't use parallel code today whereas weather forecasting does
isn't because of any software issue, it's because gamers don't have
the money to buy massively parallel supercomputers whereas
organizations doing weather forecasting do. When that changes, so will
the software.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/2/2004 6:39:34 AM

"Dennis M. O'Connor" <dmoc@primenet.com> writes:

> "Carlo Razzeto" <crazzeto@hotmail.com> wrote ...
>> I don't know, maybe it's just me but it seems like this article puts way 
>> too much importance on the manufacturing process a CPU is made on..
>
> Well, it _is_ important.  The rules change when the process
> changes, and a microarchitecture that excelled on, say, 180nm
> may be non-competitive on 65nm.

As an end-user, I don't care which process is used.  What matters are
performance, price, and power consumption (not necessarily in that
order) of the end-product.  You could use wet cardboard, if it worked.

But as a design engineer, yes, process definitely does matter.


Kai
-- 
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
0
Reply Kai 9/2/2004 8:13:18 AM

spinlock wrote:

> We are on track for mass shipment of a billion (that's with a B)
> transistor die by '08.

Who's "we" ?

I have read that there will be ~1.7e9 transistors in Montecito.
Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
transistor count. Montecito is expected next year.

At 90 nm, please correct me if I am wrong, the chip would occupy
between 650 mm^2 and 750 mm^2. Is that possible?

> We shall all now bow toward Santa Clara.

Whatever floats your boat.

-- 
Regards, Grumble
0
Reply Grumble 9/2/2004 8:39:39 AM
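
The "~90% cache" estimate in the post above can be sanity-checked with
a rough Python calculation. Two assumptions here are not in the post
and are only for illustration: a conventional 6-transistor SRAM cell
per bit, and optionally 8 ECC bits per 64 data bits. Tag arrays,
redundancy and decode logic are ignored, so the result is a lower
bound on the cache share rather than a real die budget.

```python
MIB = 2**20  # bytes per megabyte of cache (binary)


def sram_transistors(cache_bytes, transistors_per_bit=6, ecc_overhead=0.0):
    """Transistors in the storage cells alone (6T cell is an assumption)."""
    bits = cache_bytes * 8 * (1 + ecc_overhead)
    return bits * transistors_per_bit


# Cache sizes quoted in the post: 2*1 MB L2 + 2*12 MB L3.
cache_bytes = (2 * 1 + 2 * 12) * MIB
total_transistors = 1.7e9  # figure quoted in the post

data_only = sram_transistors(cache_bytes)
with_ecc = sram_transistors(cache_bytes, ecc_overhead=8 / 64)  # assumed ECC ratio

share_data_only = data_only / total_transistors
share_with_ecc = with_ecc / total_transistors
```

Under these assumptions the storage cells alone come to roughly 1.3e9
transistors (about 77% of the total), or about 87% with the assumed
ECC bits, so a figure near 90% once tags are included looks plausible.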

In article <ch6m8b$grg$1@news-rocq.inria.fr>, Grumble  <a@b.c> wrote:
>spinlock wrote:
>
>> We are on track for mass shipment of a billion (that's with a B)
>> transistor die by '08.
>
>Who's "we" ?

A good question.  But note that "by '08" includes "in 2005".

>I have read that there will be ~1.7e9 transistors in Montecito.
>Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
>transistor count. Montecito is expected next year.

By whom is it expected?  And how is it expected to appear?  Yes,
someone will wave a chip at IDF and claim that it is a Montecito,
but are you expecting it to be available for internal testing,
to all OEMS, to special customers, or on the open market?


Regards,
Nick Maclaren.
0
Reply nmm1 9/2/2004 8:54:54 AM

In article <4136bd3e.40649206@news.eircom.net>,
Russell Wallace <wallacethinmintr@eircom.net> wrote:
>
>At least as far as your typical spaghetti C++ is concerned, yeah, not
>going to happen anytime in the near future.

Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
as bad as it gets, but the SAME applies to the cleanest of Fortran,
if it is using the same programming paradigms.  I can't get excited
over factors of 5-10 difference in optimisability, when we are
talking about improvements over decades.

>>This has the consequence that large-scale parallelism is not a viable
>>general-purpose architecture until and unless we move to a paradigm
>>that isn't so intractable.
>
>And yet, by that argument there should be no market for the big
>parallel servers and supercomputers; yet there is. The solution is
>that for things that need the speed, people just write the parallel
>code by hand.

Sigh.  Look, I am in that area.  If it were only so simple :-(

>If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
>out is a chip with 1024 individually slow cores, then those games will
>be written to use 1024-way parallelism, just as weather forecasting
>and quantum chemistry programs are today. Ditto for Photoshop, 3D
>modelling, movie editing, speech recognition etc. There's certainly no
>shortage of parallelism in the problem domains. The reason things like
>games don't use parallel code today whereas weather forecasting does
>isn't because of any software issue, it's because gamers don't have
>the money to buy massively parallel supercomputers whereas
>organizations doing weather forecasting do. When that changes, so will
>the software.

Oh, yeah.  Ha, ha.  I have been told that more-or-less continually
since about 1970.  Except for the first two thirds of your first
sentence, it is nonsense.

Not merely do people sweat blood to get such parallelism, they
often have to change their algorithms (sometimes to ones that are
less desirable, such as being less accurate), and even then only
SOME problems can be parallelised.


Regards,
Nick Maclaren.
0
Reply nmm1 9/2/2004 9:01:35 AM

Nick Maclaren wrote:

> Grumble wrote:
> 
>> spinlock wrote:
>> 
>>> We are on track for mass shipment of a billion (that's with a B)
>>> transistor die by '08.
>> 
>> Who's "we" ?
> 
> A good question.  But note that "by '08" includes "in 2005".

I took "by 2008" to mean "sometime in 2008". Otherwise he would have 
said "by 2005" or "by 2006", don't you think?

>> I have read that there will be ~1.7e9 transistors in Montecito.
>> Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
>> transistor count. Montecito is expected next year.
> 
> By whom is it expected?  And how is it expected to appear?  Yes,
> someone will wave a chip at IDF and claim that it is a Montecito,
> but are you expecting it to be available for internal testing,
> to all OEMS, to special customers, or on the open market?

In November 2003, Intel's roadmap claimed Montecito would appear in 
2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel 
supposedly showcased Montecito dies, and claimed that testing had begun.

http://www.theinquirer.net/?article=15917
http://www.xbitlabs.com/news/cpu/display/20040219125800.html
http://www.xbitlabs.com/news/cpu/display/20040619180753.html

Perhaps Intel is being overoptimistic, but, as far as I understand, they 
claim Montecito will be ready in 2005.

-- 
Regards, Grumble
0
Reply Grumble 9/2/2004 10:16:37 AM

In article <ch6s4q$ict$1@news-rocq.inria.fr>, Grumble <a@b.c> writes:
|> > 
|> > By whom is it expected?  And how is it expected to appear?  Yes,
|> > someone will wave a chip at IDF and claim that it is a Montecito,
|> > but are you expecting it to be available for internal testing,
|> > to all OEMS, to special customers, or on the open market?
|> 
|> In November 2003, Intel's roadmap claimed Montecito would appear in 
|> 2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel 
|> supposedly showcased Montecito dies, and claimed that testing had begun.
|> 
|> Perhaps Intel is being overoptimistic, but, as far as I understand, they 
|> claim Montecito will be ready in 2005.

I am aware of that.  Given that Intel failed to reduce the power
going to 90 nm for the Pentium 4, that implies it will need 200
watts.  Given that HP have already produced a dual-CPU package,
they will have boards rated for that.  Just how many other vendors
will have?

Note that Intel will lose more face if they produce the Montecito
and OEMs respond by dropping their IA64 lines than if they make
it available only on request to specially favoured OEMs.


Regards,
Nick Maclaren.
0
Reply nmm1 9/2/2004 11:02:48 AM

Nick Maclaren wrote:
>>Montecito is expected next year.
> 
> By whom is it expected?  And how is it expected to appear?  Yes,
> someone will wave a chip at IDF and claim that it is a Montecito,
> but are you expecting it to be available for internal testing,
> to all OEMS, to special customers, or on the open market?

By Intel and everyone who has been believing their repeated, unwavering 
claims that mid-2005 will see commercial revenue shipments of Montecito. 
Based on all the past releases in IPF, I expect a "launch" in June '05 
and customers will have systems running in their environments around 
August.  There should be Montecito demonstrations at this coming IDF. 
There were wafers shown at the last IDF.  If my anticipated schedule is 
correct, OEMs will have test chips soon.

Alex
-- 
My words are my own.  They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright.  (I do not speak for my employer.)

0
Reply Alex 9/2/2004 12:21:07 PM

On 2 Sep 2004 09:01:35 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
>as bad as it gets, but the SAME applies to the cleanest of Fortran,
>if it is using the same programming paradigms.  I can't get excited
>over factors of 5-10 difference in optimisability, when we are
>talking about improvements over decades.

"Cleanest of Fortran" usually means vector-style code, which is a
reasonable target for autoparallelization. I'll grant you if you took
a pile of spaghetti C++ and translated line-for-line to Fortran, the
result wouldn't autoparallelize with near-future technology any more
than the original did.

>>And yet, by that argument there should be no market for the big
>>parallel servers and supercomputers; yet there is. The solution is
>>that for things that need the speed, people just write the parallel
>>code by hand.
>
>Sigh.  Look, I am in that area.  If it were only so simple :-(

I didn't claim it was simple. I claimed that, even though it's
complicated, it still happens.

>>If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
>>out is a chip with 1024 individually slow cores, then those games will
>>be written to use 1024-way parallelism, just as weather forecasting
>>and quantum chemistry programs are today. Ditto for Photoshop, 3D
>>modelling, movie editing, speech recognition etc. There's certainly no
>>shortage of parallelism in the problem domains. The reason things like
>>games don't use parallel code today whereas weather forecasting does
>>isn't because of any software issue, it's because gamers don't have
>>the money to buy massively parallel supercomputers whereas
>>organizations doing weather forecasting do. When that changes, so will
>>the software.
>
>Oh, yeah.  Ha, ha.  I have been told that more-or-less continually
>since about 1970.  Except for the first two thirds of your first
>sentence, it is nonsense.

So you claim weather forecasting and quantum chemistry _don't_ use
parallel processing today? Or that gamers would be buying 1024-CPU
machines today if Id would only get around to shipping parallel code?

>Not merely do people sweat blood to get such parallelism, they
>often have to change their algorithms (sometimes to ones that are
>less desirable, such as being less accurate), and even then only
>SOME problems can be parallelised.

I didn't claim sweating blood and changing algorithms weren't
required. However, I'm not aware of any CPU-intensive problems of
practical importance that _can't_ be parallelized; do you have any
examples of such?

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/2/2004 2:15:52 PM

In article <4137299e.68397045@news.eircom.net>,
wallacethinmintr@eircom.net (Russell Wallace) writes:
|> 
|> "Cleanest of Fortran" usually means vector-style code, which is a
|> reasonable target for autoparallelization. ...

Not in my world, it doesn't.  There are lots of other extremely
clean codes.

|> >Oh, yeah.  Ha, ha.  I have been told that more-or-less continually
|> >since about 1970.  Except for the first two thirds of your first
|> >sentence, it is nonsense.
|> 
|> So you claim weather forecasting and quantum chemistry _don't_ use
|> parallel processing today? Or that gamers would be buying 1024-CPU
|> machines today if Id would only get around to shipping parallel code?

I am claiming that a significant proportion of the programs don't.
In a great many cases, people have simply given up attempting the
analyses, and have moved to less satisfactory ones that can be
parallelised.  In some cases, they have abandoned whole lines of
research!  Your statement was that the existing programs would
be parallelised:

    then those games will be written to use 1024-way parallelism,
    just as weather forecasting and quantum chemistry programs are
    today

|> >Not merely do people sweat blood to get such parallelism, they
|> >often have to change their algorithms (sometimes to ones that are
|> >less desirable, such as being less accurate), and even then only
|> >SOME problems can be parallelised.
|> 
|> I didn't claim sweating blood and changing algorithms weren't
|> required. However, I'm not aware of any CPU-intensive problems of
|> practical importance that _can't_ be parallelized; do you have any
|> examples of such?

Yes.  Look at ODEs for one example that is very hard to parallelise.
Anything involving sorting is also hard to parallelise, as are many
graph-theoretic algorithms.  Ones that are completely hopeless are
rarer, but exist - take a look at the "Spectral Test" in Knuth for
a possible candidate.

The characteristic of the most common class of unparallelisable
algorithm is that they are iterative, each step is small (i.e.
effectively scalar), yet it makes global changes (and where the
cost of that is very small).  This means that steps are never
independent, and are therefore serialised.

What I can't say is how many CPU-intensive problems of practical
importance are intrinsically unparallelisable - i.e. they CAN'T
be converted to a parallelisable form by changing the algorithms.
But that is not what I claimed.


Regards,
Nick Maclaren.
Reply nmm1 9/2/2004 2:39:45 PM

On 2 Sep 2004 14:39:45 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>Your statement was that the existing programs would
>be parallelised:
>
>    then those games will be written to use 1024-way parallelism,
>    just as weather forecasting and quantum chemistry programs are
>    today

Oh! I think we've been talking at cross purposes then.

I'm not at all talking about taking existing code and tweaking it to
run in parallel. I agree that isn't always feasible. I'm talking about
taking an existing problem domain and writing new code to solve it
with parallel algorithms.

>What I can't say is how many CPU-intensive problems of practical
>importance are intrinsically unparallelisable - i.e. they CAN'T
>be converted to a parallelisable form by changing the algorithms.
>But that is not what I claimed.

Okay, I'm specifically talking about using different algorithms where
necessary.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
Reply wallacethinmintr 9/2/2004 2:48:38 PM

Russell Wallace wrote:

> On 1 Sep 2004 19:50:09 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
> 
> 
>>There is effectively NO chance of automatic parallelisation working
>>on serial von Neumann code of the sort we know and, er, love.  Not
>>in the near future, not in my lifetime and not as far as anyone can
>>predict.  Forget it.
> 
> 
> At least as far as your typical spaghetti C++ is concerned, yeah, not
> going to happen anytime in the near future.
> 

The statement is wrong in any case. C can be translated to hardware
(which is de facto parallelism) by "constraints", i.e., refusing to
translate its worst features (look up SystemC, C to hardware and
similar). Other languages can do it without constraints. Finally,
any code, no matter how bad, could be so translated by executing it
(simulating it), and then translating what it does dynamically and
not statically. This simulation can then give the programmer a report
of what was not executed, and the programmer modifies the test cases
until all code has been so translated.

> 
>>This has the consequence that large-scale parallelism is not a viable
>>general-purpose architecture until and unless we move to a paradigm
>>that isn't so intractable.
> 
> 
> And yet, by that argument there should be no market for the big
> parallel servers and supercomputers; yet there is. The solution is
> that for things that need the speed, people just write the parallel
> code by hand.
> 
> If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
> out is a chip with 1024 individually slow cores, then those games will
> be written to use 1024-way parallelism, just as weather forecasting
> and quantum chemistry programs are today. Ditto for Photoshop, 3D
> modelling, movie editing, speech recognition etc. There's certainly no
> shortage of parallelism in the problem domains. The reason things like
> games don't use parallel code today whereas weather forecasting does
> isn't because of any software issue, it's because gamers don't have
> the money to buy massively parallel supercomputers whereas
> organizations doing weather forecasting do. When that changes, so will
> the software.
> 


-- 
Samiam is Scott A. Moore

Personal web site: http://www.moorecad.com/scott
My electronics engineering consulting site: http://www.moorecad.com
ISO 7185 Standard Pascal web site: http://www.moorecad.com/standardpascal
Classic Basic Games web site: http://www.moorecad.com/classicbasic
The IP Pascal web site, a high performance, highly portable ISO 7185 Pascal
compiler system: http://www.moorecad.com/ippas

Being right is more powerful than large corporations or governments.
The right argument may not be pervasive, but the facts eventually are.
Reply Scott 9/2/2004 4:34:08 PM

wallacethinmintr@eircom.net (Russell Wallace) writes:
> >This has the consequence that large-scale parallelism is not a viable
> >general-purpose architecture until and unless we move to a paradigm
> >that isn't so intractable.
> 
> And yet, by that argument there should be no market for the big
> parallel servers and supercomputers; yet there is. The solution is
> that for things that need the speed, people just write the parallel
> code by hand.

More accurately, they try to. Whether they succeed is a different question
(it's clear that they do succeed sometimes, but there's no reason to
believe that just because you'd like it, you'll succeed).

-- 
David Gay
dgay@acm.org
Reply David 9/2/2004 4:40:13 PM

Scott Moore <samiam@moorecad.com> wrote:

> Russell Wallace wrote:
> 
> > On 1 Sep 2004 19:50:09 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
> > 
> > 
> >>There is effectively NO chance of automatic parallelisation working
> >>on serial von Neumann code of the sort we know and, er, love.  Not
> >>in the near future, not in my lifetime and not as far as anyone can
> >>predict.  Forget it.
> > 
> > 
> > At least as far as your typical spaghetti C++ is concerned, yeah, not
> > going to happen anytime in the near future.
> > 
> 
> The statement is wrong in any case. C can be translated to hardware
> (which is de facto parallelism) by "constraints", i.e., refusing to
> translate its worst features (look up SystemC, C to hardware and
> similar).

That is wildly optimistic, especially if you want to translate to
hardware. Efficient automatic parallelization of an imperative language
is terribly hard in any case, but to a platform with static fine-grained
parallelism?

Yes, there is SystemC and SpecC, but both sneak in a parallel
programming language that you must use for any reasonable performance.
And neither is exactly a runaway success...

Also note that you have quietly moved from C++ to C, and that even a
constrained version of C is already another language, especially since
one of the constraints is likely to be `no malloc'. You cannot compile
legacy code and expect a useful result, you must write specifically for
that compiler and its constraints.

>Other languages can do it without constraints.

Not likely. Imperative languages have too little inherent parallelism.
The best bet in that respect are functional languages (Haskell, ML,
etc.), but they are useless without dynamic memory allocation.

There are hardware description languages, of course, but they are very
specialized beasts.

> Finally,
> any code, no matter how bad, could be so translated by executing it
> (simulating it), and then translating what it does dynamically and
> not statically.

JIT-ing to hardware is not even done by any /research/ group AFAIK. 

> This simulation can then give the programmer a report
> of what was not executed, and the programmer modifies the test cases
> until all code has been so translated.

That is only workable for very small programs, and even then I have my
doubts. You leave all the hard work to the programmer, namely rewriting
the program to use a parallelisable algorithm.

Reply reeuwijk 9/2/2004 6:17:08 PM

In article <s718ybshbf6.fsf@beryl.CS.Berkeley.EDU>,
David Gay  <dgay@beryl.CS.Berkeley.EDU> wrote:
>
>> >This has the consequence that large-scale parallelism is not a viable
>> >general-purpose architecture until and unless we move to a paradigm
>> >that isn't so intractable.
>> 
>> And yet, by that argument there should be no market for the big
>> parallel servers and supercomputers; yet there is. The solution is
>> that for things that need the speed, people just write the parallel
>> code by hand.
>
>More accurately, they try to. Whether they succeed is a different question
>(it's clear that they do succeed sometimes, but there's no reason to
>believe that just because you'd like it, you'll succeed).

Precisely.  As far as the easiness of doing it is concerned, the
question to ask is how the proportion of systems/money/effort/etc.
spent on large scale parallel applications is varying over time,
relative to that on all performance-limited applications.

If we exclude the modern equivalents of the Manhattan project,
and include the traditional vector systems as parallel (as they
were), my guess is that it has remained pretty constant for the
past 30 or 40 years.  The number of performance-limited tasks that
can be parallelised is continually (if slowly) increasing, but
probably no faster than the number of tasks people would like to
do that are limited by performance.

Highly parallel systems were specialist in 1974, and they are STILL
specialist.  We know how to do a LOT more in parallel than we
did then, but it is still a small proportion of what we would like
to do.  Still, it keeps people like me off the streets :-)


Regards,
Nick Maclaren.
Reply nmm1 9/2/2004 7:27:15 PM

> Precisely.  As far as the easiness of doing it is concerned, the
> question to ask is how the proportion of systems/money/effort/etc.
> spent on large scale parallel applications is varying over time,
> relative to that on all performance-limited applications.

Getting back to the issue of multiprocessors for "desktops" or even
laptops: I agree that parallelizing Emacs is going to be excruciatingly
painful so I don't see it happening any time soon.  But that's not really
the question.

I think that as SMP and SMT progress on those machines (first as
bi-processors), you'll see more applications use *very* coarse grain
parallelism.  It won't make much difference performancewise: the extra
processor will be used for unrelated tasks like "background foo" which isn't
done now because it would slow things down too much on a uniprocessor.
Existing things mostly won't be parallelized, but the extra CPU will be used
for new things of dubious value.

Your second CPU will be mostly idle, of course, but so is the first CPU
anyway ;-)



        Stefan
Reply Stefan 9/2/2004 8:14:16 PM

In article <jwvvfewh1wk.fsf-monnier+comp.arch@gnu.org>,
Stefan Monnier  <monnier@iro.umontreal.ca> wrote:
>
>I think that as SMP and SMT progress on those machines (first as
>bi-processors), you'll see more applications use *very* coarse grain
>parallelism.  It won't make much difference performancewise: the extra
>processor will be used for unrelated tasks like "background foo" which isn't
>done now because it would slow things down too much on a uniprocessor.
>Existing things mostly won't be parallelized, but the extra CPU will be used
>for new things of dubious value.

I regret to say that I agree with you :-(


Regards,
Nick Maclaren.
Reply nmm1 9/2/2004 8:27:34 PM

Stefan Monnier wrote:

> 
> Getting back to the issue of multiprocessors for "desktops" or even
> laptops: I agree that parallelizing Emacs is going to be excruciatingly
> painful so I don't see it happening any time soon.  But that's not really
> the question.
> 
> I think that as SMP and SMT progress on those machines (first as
> bi-processors), you'll see more applications use *very* coarse grain
> parallelism.  It won't make much difference performancewise: the extra
> processor will be used for unrelated tasks like "background foo" which isn't
> done now because it would slow things down too much on a uniprocessor.
> Existing things mostly won't be parallelized, but the extra CPU will be used
> for new things of dubious value.
> 
> Your second CPU will be mostly idle, of course, but so is the first CPU
> anyway ;-)
> 

I sometimes think: no one experienced the microprocessor revolution.  Or 
perhaps: everyone has adjusted his recollection so that he thinks he saw 
things much more clearly than he did.  Or perhaps: the world is divided 
between those whose world-view was built before the revolution and are 
never going to acknowledge exactly what they missed, anyway, and those 
whose world-view was built too late to have enough perspective to see 
just how badly everybody missed it.

The world of programming is about to change in ways that no big-iron or 
cluster megaspending program ever could accomplish.  I'm tempted to say: 
get used to it, but it would be socially unacceptable and we're going to 
have a repeat of what happened with the microprocessor revolution: 
almost no one is going to put his hand to his forehead and say, "I 
should have seen that coming, but I didn't."

RM

Reply Robert 9/2/2004 8:47:32 PM

Robert Myers wrote:

[SNIP]

> The world of programming is about to change in ways that no big-iron or 
> cluster megaspending program ever could accomplish.  I'm tempted to say: 
> get used to it, but it would be socially unacceptable and we're going to 
> have a repeat of what happened with the microprocessor revolution: 
> almost no one is going to put his hand to his forehead and say, "I 
> should have seen that coming, but I didn't."

More CPUs per chunk of memory ?

Back in 1990 as a PFY at INMOS I asked about why they took the
approach they did (OCCAM/CSP/Transputers). I was given an explanation 
that included trends in heat dissipation, memory latency, clock rates,
leakage etc. By and large it's panning out as predicted, although the
timescales have proven to be a little longer (kudos to the guys doing
the chip design and silicon physics).


Cheers,
Rupert

Reply Rupert 9/2/2004 9:45:03 PM

Rupert Pigott wrote:
> Robert Myers wrote:
> 
> [SNIP]
> 
>> The world of programming is about to change in ways that no big-iron 
>> or cluster megaspending program ever could accomplish.  I'm tempted to 
>> say: get used to it, but it would be socially unacceptable and we're 
>> going to have a repeat of what happened with the microprocessor 
>> revolution: almost no one is going to put his hand to his forehead and 
>> say, "I should have seen that coming, but I didn't."
> 
> 
> More CPUs per chunk of memory ?
> 
> Back in 1990 as a PFY at INMOS I asked about why they took the
> approach they did (OCCAM/CSP/Transputers). I was given an explanation 
> that included trends in heat dissipation, memory latency, clock rates,
> leakage etc. By and large it's panning out as predicted, although the
> timescales have proven to be a little longer (kudos to the guys doing
> the chip design and silicon physics).
> 

Yes, indeed.

That's a powerful insight, but I would characterize it as the hardware 
driver for what I see as a more profound revolution in software.  Who 
knows, maybe the day of Occam is at hand.  :-).

The smallest unit that anyone will ever program for non-embedded 
applications will support I hesitate to guess how many execution pipes, 
but certainly more than one.  Single-pipe programming, using tools 
appropriate for single-pipe programming, will come to seem just as 
natural as doing physics without vectors and tensors.

The fact that this reality is finally percolating into the lowly but 
ubiquitous PC is what I'm counting on for magic.

RM

Reply Robert 9/2/2004 10:01:06 PM

On 2 Sep 2004 19:27:15 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>In article <s718ybshbf6.fsf@beryl.CS.Berkeley.EDU>,
>David Gay  <dgay@beryl.CS.Berkeley.EDU> wrote:
>
>>More accurately, they try to. Whether they succeed is a different question
>>(it's clear that they do succeed sometimes, but there's no reason to
>>believe that just because you'd like it, you'll succeed).

Just like programming in general, really :)

>Highly parallel systems were specialist in 1974, and they are STILL
>specialist.  We know how to do a LOT more in parallel than we
>did then, but it is still a small proportion of what we would like
>to do.  Still, it keeps people like me off the streets :-)

What are some examples of important and performance-limited
computation tasks that aren't run in parallel?

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
Reply wallacethinmintr 9/2/2004 11:31:17 PM

Russell Wallace wrote:
> On 2 Sep 2004 19:27:15 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
> 
> 
>>In article <s718ybshbf6.fsf@beryl.CS.Berkeley.EDU>,
>>David Gay  <dgay@beryl.CS.Berkeley.EDU> wrote:
>>Highly parallel systems were specialist in 1974, and they are STILL
>>specialist.  We know how to do a LOT more in parallel than we
>>did then, but it is still a small proportion of what we would like
>>to do.  Still, it keeps people like me off the streets :-)
> 
> What are some examples of important and performance-limited
> computation tasks that aren't run in parallel?

I.e., that run fastest on a one-processor Itanium or Opteron or 
Xeon workstation...

On the other hand, who isn't drooling over these:

http://www.orionmulti.com/products/

Quoting the press release on Transmeta's web site:

"The specifications for Orion's DS-96 deskside Cluster Workstation 
include 96 nodes with 300 Gflops peak performance (150 sustained), 
up to 192 gigabytes of memory and up to 9.6 terabytes of storage. 
It consumes less than 1500 watts and fits unobtrusively under a 
desk. Orion's DT-12 desktop Cluster Workstation has 12 nodes with 
36 Gflops peak performance (18 sustained), up to 24 gigabytes of 
DDR SDRAM memory and up to 1 terabyte of internal disk storage. 
The DT-12 consumes less than 220 watts and is scalable to 48 nodes 
by stacking up to four systems.

"Orion's desktop model will be available in October 2004, and the 
deskside model will be available during the latter part of Q4. For 
more information about Orion Multisystems and its products, visit 
www.orionmultisystems.com."

Have to wonder why all of those nodes are hooked together (inside
the box, presumably on the motherboard) with gigabit ethernet,
rather than something like the Horus chipset that's been spoken
about here recently, given that the processors have HyperTransport
interfaces.  My guess is that it let them offload system software
development onto the open source cluster community, without having
to even do device drivers.  I guess that the HyperTransport is for
peripherals, and doesn't do interprocessor cache coherency anyway.
Still, you'd think that they could have come up with something
lighter-weight than gigabit ethernet, switched or not.

R Clint Whaley and others have been playing with ATLAS on Efficeons
recently, too.  They don't look to be too bad, although there seem to
be some code-vs-data cache pressure issues.  Two flops/clock peak
(2 GFlops at 1 GHz), realizing between 60% and 90% of peak on various
ATLAS kernels.
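
The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in this post and in the press release above; the function and variable names are made up for illustration.

```python
def peak_gflops(flops_per_clock, clock_ghz):
    """Peak rate in GFlops: floating-point operations per clock times clock rate in GHz."""
    return flops_per_clock * clock_ghz


# Two flops/clock at 1 GHz, as quoted in the post.
peak = peak_gflops(2, 1.0)

# Range quoted for the ATLAS kernels: 60% to 90% of peak.
achieved_low = 0.60 * peak
achieved_high = 0.90 * peak

# Per-node peak implied by the press release totals.
ds96_per_node = 300 / 96   # DS-96: 300 Gflops peak over 96 nodes
dt12_per_node = 36 / 12    # DT-12: 36 Gflops peak over 12 nodes

# Sustained-to-peak ratio claimed for both machines.
ds96_sustained_ratio = 150 / 300
dt12_sustained_ratio = 18 / 36

print(peak, achieved_low, achieved_high)
print(ds96_per_node, dt12_per_node)
print(ds96_sustained_ratio, dt12_sustained_ratio)
```

Both Orion models work out to roughly 3 Gflops peak per node, with a claimed sustained rate of half of peak.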

Cheers,

-- 
Andrew
Reply Andrew 9/3/2004 12:48:21 AM

In article <4137ad0c.102048234@news.eircom.net>,
wallacethinmintr@eircom.net (Russell Wallace) writes:
|> 
|> >Highly parallel systems were specialist in 1974, and they are STILL
|> >specialist.  We know how to do a LOT more in parallel than we
|> >did then, but it is still a small proportion of what we would like
|> >to do.  Still, it keeps people like me off the streets :-)
|> 
|> What are some examples of important and performance-limited
|> computation tasks that aren't run in parallel?

ODEs, to a great extent.

A great deal of transaction processing.

A great deal of I/O.

Event handling in GUIs.


Regards,
Nick Maclaren.
Reply nmm1 9/3/2004 9:15:26 AM

> ODEs, to a great extent.

Parallelize over initial conditions?

> A great deal of transaction processing.

Parallelize over transactions? OK, the commit phase needs to be
serialized.
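
A minimal sketch of that idea, assuming a made-up `Ledger` class and a made-up fee calculation as stand-ins for real transaction work: independent transactions run concurrently, and only the commit is taken under a lock. Threads are used to keep the sketch short; it illustrates the structure, not a real transaction monitor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Ledger:
    """Hypothetical shared store; only commit() touches shared state."""

    def __init__(self):
        self.balance = 0
        self._commit_lock = threading.Lock()

    def commit(self, delta):
        with self._commit_lock:      # the serialized commit phase
            self.balance += delta


def run_transaction(ledger, amount):
    fee = amount // 100              # per-transaction work, no shared state
    ledger.commit(amount - fee)


def run_all(amounts, workers=4):
    ledger = Ledger()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Transactions are independent of each other until they commit.
        list(pool.map(lambda a: run_transaction(ledger, a), amounts))
    return ledger.balance


print(run_all([100, 200, 300, 400]))
```

The result does not depend on the order in which the workers happen to commit, because the commits here commute.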

> A great deal of I/O.

Why that?

> Event handling in GUIs.

Is that really limiting performance in any way, nowadays?

	Jan
Reply ISO 9/3/2004 9:52:01 AM

On 3 Sep 2004 09:15:26 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>In article <4137ad0c.102048234@news.eircom.net>,
>wallacethinmintr@eircom.net (Russell Wallace) writes:
>
>|> What are some examples of important and performance-limited
>|> computation tasks that aren't run in parallel?
>
>ODEs, to a great extent.

Can you be more specific? What sort of jobs using ODEs? Why can't they
be parallelized?

>A great deal of transaction processing.

Any references? Everyone I've heard of with heavy transaction
processing workloads is buying SMP servers; I haven't heard of anyone
saying "well we could afford a 32-way box, and our workload sure as
hell needs it, but we're just sticking with a 1-way box because the
software can't handle more".

Nor have I seen anyone advertise "our server has only 1 processor
because your software probably can't use more, but it has the storage,
reliability etc for heavy-duty transaction processing"; there should
be a big market for such if your statement is correct.

>A great deal of I/O.

I was under the impression I/O was I/O limited, not CPU limited?

>Event handling in GUIs.

That's not CPU-limited either; it runs plenty fast enough on a single
processor.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
Reply wallacethinmintr 9/3/2004 10:03:07 AM

Yousuf Khan wrote:
> http://www.reuters.com/locales/c_newsArticle.jsp?type=technologyNews&localeKey=en_IN&storyID=6098883
> 
>     Yousuf Khan
> 
> 


"This is evidence that Moore's Law continues," said Mark Bohr (Intel's
director of process architecture and integration).


I remember reading, some months ago, an interesting study by two Intel
researchers who argued that Moore's Law was ending: they said that we
now face a real wall that we cannot cross.

Can a company contradict itself like this (within one year)?

S

Reply TOUATI 9/3/2004 10:11:38 AM

In article <2pqt63Fo07glU2@uni-berlin.de>,
=?ISO-8859-1?Q?Jan_Vorbr=FCggen?= <jvorbrueggen-not@mediasec.de> writes:
|> 
|> > ODEs, to a great extent.
|> 
|> Parallelize over initial conditions?

If that is what you are doing.  If they are a component of a more
complex application, you will find that unhelpful :-)
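
Both halves of this exchange can be seen in a small sketch. The equation (y' = -y), the explicit Euler stepper, and the function names are assumptions chosen for illustration. Separate trajectories can be farmed out to workers, but each trajectory is still a serial loop, so nothing is gained when only one trajectory is needed as part of a larger computation. Threads keep the sketch portable; a process pool or separate machines would be needed for real CPU-bound speedup in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def integrate(y0, h=0.5, steps=3):
    """Serial explicit Euler for y' = -y from one initial condition."""
    y = y0
    for _ in range(steps):
        y = y - h * y   # inherently sequential within one trajectory
    return y


def integrate_many(initial_conditions, workers=4):
    # Trajectories do not interact, so each can go to a different worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(integrate, initial_conditions))


print(integrate_many([1.0, 2.0, 4.0, 8.0]))
```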

|> > A great deal of transaction processing.
|> 
|> Parallelize over transactions? OK, the commit phase needs to be
|> serialized.

Think multi-component and parallelising WITHIN transactions.  In
theory, it can often be done.  In practice, doing it and maintaining
consistency is hard enough that it isn't.  Why do you think that so
many electronic transactions are so slow, and often getting slower?

Note that this is not a CPU limitation as such, but is a different
level of parallelism.  But it is the same class of problem.

|> > A great deal of I/O.
|> 
|> Why that?

Incompetence and historical, unparallelisable specifications.

|> > Event handling in GUIs.
|> 
|> Is that really limiting performance in any way, nowadays?

Yes.  I am sent gibbering up the wall by it, and am not alone in
that.  The reason is that I am using some fairly ancient machines
with more modern software.  Answers:

    Never upgrade software, and don't connect to parts of the net
that need newer versions.

    Upgrade your system.  Oops.  A few years down the line, you
will have the same problem.  And remember that Not-Moore's Law
has reached the end of the line - so, while I can upgrade by a
healthy factor and remain serial, people with the latest and
greatest systems can't.


Regards,
Nick Maclaren.
Reply nmm1 9/3/2004 10:14:37 AM

In article <41384006.139679505@news.eircom.net>,
wallacethinmintr@eircom.net (Russell Wallace) writes:
|> >
|> >|> What are some examples of important and performance-limited
|> >|> computation tasks that aren't run in parallel?
|> >
|> >ODEs, to a great extent.
|> 
|> Can you be more specific? What sort of jobs using ODEs? Why can't they
|> be parallelized?

You need to ask an ODE expert.  I am not one, and am relying largely
on information provided by one.  A Web search will help (I checked
my memory that way).

|> >A great deal of transaction processing.
|> 
|> Any references? ...

See my other posting.  I am talking about the latency of a single
transaction.

|> >A great deal of I/O.
|> 
|> I was under the impression I/O was I/O limited, not CPU limited?

(a) Not if it is Ethernet and TCP/IP, it isn't.

(b) Parallelism is parallelism.  The same issues arise and similar
approaches work.

(c) People (well, Microsoft, at least) are starting to put that
level of I/O into hardware - God help us all.

|> >Event handling in GUIs.
|> 
|> That's not CPU-limited either; it runs plenty fast enough on a single
|> processor.

(a) See (b) above.

(b) Don't bet on it and, no, it doesn't.  Every system I have used
is slow enough that it misses and mishandles events, and that has
included the latest and greatest workstations.  Yes, my reactions
are unusually fast for an old fogey.


Regards,
Nick Maclaren.
Reply nmm1 9/3/2004 10:20:53 AM

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Getting back to the issue of multiprocessors for "desktops" or even
> laptops: I agree that parallelizing Emacs is going to be
> excruciatingly painful so I don't see it happening any time soon.
> But that's not really the question.

In fact, Emacs IS a good candidate. Very little context that is not
buffer or window/frame local. Going into that swamp is another issue!

-- 
Paul Repacholi                               1 Crescent Rd.,
+61 (08) 9257-1001                           Kalamunda.
                                             West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
Reply Paul 9/3/2004 10:54:45 AM

> Think multi-component and parallelising WITHIN transactions.  In
> theory, it can often be done.  In practice, doing it and maintaining
> consistency is hard enough that it isn't.

What kind of transaction - by itself - would take long enough to warrant
that?

 > Why do you think that so
> many electronic transactions are so slow, and often getting slower?

I wonder myself. I put it down to general incompetence - in particular,
because so much data is unnecessarily slung around over none-too-fast
networks. Of course, anything XML-based will make things only worse.

> |> > Event handling in GUIs.
> |> 
> |> Is that really limiting performance in any way, nowadays?
> 
> Yes.  I am sent gibbering up the wall by it, and am not alone in
> that.  The reason is that I am using some fairly ancient machines
> with more modern software. 

Ancient as in a 30 MHz (IIRC) 68040 running NeXtStep - which has the most
responsive UI I've ever seen? That is to say: any performance problem
with UIs is a problem of design and/or implementation, not the problem
as such. Not that that helps you any if the application you are using
is programmed on such a UI...cue WIN32 woes...

	Jan
Reply ISO 9/3/2004 12:24:29 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> In article <ch6m8b$grg$1@news-rocq.inria.fr>, Grumble  <a@b.c> wrote:
> >spinlock wrote:
> >
> >> We are on track for mass shipment of a billion (that's with a B)
> >> transistor die by '08.
> >
> >Who's "we" ?
> 
> A good question.  But note that "by '08" includes "in 2005".
> 
> >I have read that there will be ~1.7e9 transistors in Montecito.
> >Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
> >transistor count. Montecito is expected next year.
> 
> By whom is it expected?  And how is it expected to appear?  Yes,
> someone will wave a chip at IDF and claim that it is a Montecito,
> but are you expecting it to be available for internal testing,
> to all OEMS, to special customers, or on the open market?

Is any kind of Itanium actually available on the open market (and
I mean the open market for new chips, not resale of systems)?

> 
> 
> Regards,
> Nick Maclaren.

-- 
	Sander

+++ Out of cheese error +++
Reply Sander 9/3/2004 12:31:23 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> 
> Event handling in GUIs.
> 

GUIs and event processing, and the inability to trivially allow for
at least top-level-window parallelism, are just a complete screwup.
It's made worse by middleware (like, say, Java and Swing) exporting
such braindeadness to the application level.

So instead of "write to tolerate or take advantage of parallelism
if present" everybody writes for "serial everything in GUI is the
one and only true way".
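
A minimal sketch of what top-level-window parallelism could look like, assuming a hypothetical `Window` class that owns its own event queue and thread. This is not modelled on any real toolkit; the event strings and window names are invented, and "handling" an event is reduced to recording it.

```python
import queue
import threading


class Window:
    """Hypothetical top-level window with its own event queue and thread."""

    def __init__(self, name):
        self.name = name
        self.events = queue.Queue()
        self.handled = []
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def _loop(self):
        while True:
            event = self.events.get()
            if event is None:            # sentinel: close the window
                break
            self.handled.append(event)   # stand-in for real event handling

    def post(self, event):
        self.events.put(event)

    def close(self):
        self.events.put(None)
        self.thread.join()


windows = [Window("editor"), Window("browser")]
windows[0].post("key:a")
windows[1].post("click")
windows[0].post("key:b")
for w in windows:
    w.close()

print([(w.name, w.handled) for w in windows])
```

Events stay ordered within each window, while a slow handler in one window would not hold up the other.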

> 
> Regards,
> Nick Maclaren.

-- 
	Sander

+++ Out of cheese error +++
Reply Sander 9/3/2004 12:36:29 PM

In article <2pr63uFnlr8lU1@uni-berlin.de>,
=?ISO-8859-1?Q?Jan_Vorbr=FCggen?= <jvorbrueggen-not@mediasec.de> writes:
|> > Think multi-component and parallelising WITHIN transactions.  In
|> > theory, it can often be done.  In practice, doing it and maintaining
|> > consistency is hard enough that it isn't.
|> 
|> What kind of transaction - by itself - would take long enough to warrant
|> that?

Anything that is built up of a couple of dozen steps, with the
various components scattered from here to New Zealand!

In practice, the cumulative latency issue bites earlier, but that
one is imposed by physical limits.  Again, I am not denying the
overriding cause of incompetence.
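
The cumulative-latency point can be made concrete with a small model. The step names, latencies (in milliseconds), and dependency graph below are invented for illustration. Run strictly one after another, a transaction's latency is the sum of its steps; if independent steps are overlapped, it drops to the longest dependency chain, which is what parallelising within a transaction would buy.

```python
def serial_latency(latency):
    """Total latency when every step waits for the previous one."""
    return sum(latency.values())


def critical_path(latency, deps):
    """Finish time when each step starts as soon as its prerequisites are done."""
    finish = {}

    def done(step):
        if step not in finish:
            start = max((done(d) for d in deps.get(step, ())), default=0)
            finish[step] = start + latency[step]
        return finish[step]

    return max(done(s) for s in latency)


# Hypothetical four-step transaction; latencies in milliseconds.
latency = {"auth": 40, "stock": 120, "credit": 200, "commit": 30}
deps = {"stock": ["auth"], "credit": ["auth"], "commit": ["stock", "credit"]}

print(serial_latency(latency))
print(critical_path(latency, deps))
```

With these made-up numbers the serial figure is 390 ms and the overlapped figure is 270 ms; the saving is bounded by the longest chain, and keeping the overlapped steps consistent is the hard part the post refers to.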

|>  > Why do you think that so
|> > many electronic transactions are so slow, and often getting slower?
|> 
|> I wonder myself. I put it down to general incompetence - in particular,
|> because so much data is unnecessarily slung around over none-too-fast
|> networks. Of course, anything XML-based will make things only worse.

There is no doubt that General Incompetence is in overall command,
but the question is what form the incompetence takes :-)

|> Ancient as in a 30 MHz (IIRC) 68040 running NeXtStep - which has the most
|> responsive UI I've ever seen? That is to say: any performance problem
|> with UIs is a problem of design and/or implementation, not the problem
|> as such. Not that that helps you any if the application you are using
|> is programmed on such a UI...cue WIN32 woes...

No :-(

Ancient as in a 250 MHz processor with lashings of memory, and the
need to run Netscape 6 or beyond, because of the ghastly Web pages
I need to access.

Look, I was asked

    What are some examples of important and performance-limited
    computation tasks that aren't run in parallel?

not WHY are they not run in parallel, nor WHY they are performance-
limited, nor WHETHER that is unavoidable.  As you point out, it
is due to misdesigns at various levels.  But it IS an example of
what I was asked for.


Regards,
Nick Maclaren.
Reply nmm1 9/3/2004 12:55:18 PM

In article <1094215030.368528@haldjas.folklore.ee>,
Sander Vesik <sander@haldjas.folklore.ee> writes:
|> 
|> GUIs and event processing, and the inability to trivially allow for
|> at least top-level-window parallelism, are just a complete screwup.
|> It's made worse by middleware (like, say, Java and Swing) exporting
|> such braindeadness to the application level.
|> 
|> So instead of "write to tolerate or take advantage of parallelism
|> if present" everybody writes for "serial everything in GUI is the
|> one and only true way".

I should like to be able to disagree, but regret that I am unable
to.  The one niggle that I have is that a FEW applications do
allow for parallelism at the top level which is, as you say,
trivial.

There is no reason why most of the underlying morass ("layers"
implies a degree of structure that it does not possess) should
not be fully asynchronous and parallel.  Well, no good reason.
But it isn't in most modern designs.


Regards,
Nick Maclaren.
Reply nmm1 9/3/2004 12:58:40 PM

Robert Myers <rmyers1400@comcast.net> wrote in message news:<EFLZc.105418$Fg5.9550@attbi_s53>...
> Stefan Monnier wrote:
> > 
> > Your second CPU will be mostly idle, of course, but so is the first CPU
> > anyway ;-)
> > 
> 
> I sometimes think: no one experienced the microprocessor revolution.

Indeed. One thing we noticed in the RISC revolution (may it rest in 
peace) was that a dual processor workstation did not get an application 
done any faster, but it made the person interacting with the application 
a lot happier!

One of the big benefits of a dual processor that is difficult to measure
is the improvement in hand-eye coordination with the application. Let's
say a heavy CAD application is using 10% of one CPU for keyboard and mouse
activity, and 100% of the other CPU for application processing. This dual
processor arrangement gives much better hand->(KB->app->graphics)->eye
coordination than a single CPU with 110% of the processing power.
0
Reply MitchAlsup 9/3/2004 3:09:45 PM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ch9pjm$fu0$1@pegasus.csx.cam.ac.uk...

snip

> Look, I was asked
>
>     What are some examples of important and performance-limited
>     computation tasks that aren't run in parallel?
>
> not WHY are they not run in parallel, nor WHY they are performance-
> limited, nor WHETHER that is unavoidable.  As you point out, it
> is due to misdesigns at various levels.  But it IS an example of
> what I was asked for.

OK, let me rephrase the original question to more reflect what I think the
OP was asking.

What are some examples of important, CPU bound applications that are limited
by not being parallelized?

I mean this to eliminate answers that depend on improving the latency
between the UK and New Zealand, which is a different sort of research
program.  :-)  I also mean to eliminate transaction processing, at least as
most commercial systems use it as it is already highly parallel between
transactions and very few individual transactions use enough CPU to benefit
much by within transaction CPU parallelism.  I also mean to eliminate I/O as
that has been parallelized for decades (as you well know).

So from your original list we still have ODEs and perhaps UIs, though there
the benefit may be limited to relatively simple things like what was
mentioned earlier - dedicating a CPU to user interactions to assure
responsiveness.  Are there others?

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/3/2004 3:42:17 PM

TOUATI Sid wrote:
> "This is evidence that Moore's Law continues," said Mark Bohr( Intel's
> director of process architecture and integration).
>
>
> I remember that I read some months ago an interesting study by two
> Intel researchers who had shown the end of Moore's Law:  they said that
> we have now a real wall that we cannot cross.
>
> Can a company contradict itself like this (within one year) ?

Only if one of the company reps is an executive. :-)

    Yousuf Khan


0
Reply Yousuf 9/3/2004 3:50:01 PM

Stephen Fuld wrote:

<snip>

> 
> So from your original list we still have ODEs and perhaps UIs, though there
> the benefit may be limited to relatively simple things like what was
> mentioned earlier - dedicating a CPU to user interactions to assure
> responsiveness.  Are there others?
> 

I take the question to be: how many applications have been created for 
which appropriate hardware doesn't yet exist?

In the broad class of applications that will spring into existence when 
appropriate resources become available, I would place those that depend 
on brute force search.

RM

0
Reply Robert 9/3/2004 4:29:55 PM

In article <th0_c.548848$Gx4.476461@bgtnsc04-news.ops.worldnet.att.net>,
"Stephen Fuld" <s.fuld@PleaseRemove.att.net> writes:
|> 
|> OK, let me rephrase the original question to more reflect what I think the
|> OP was asking.
|> 
|> What are some examples of important, CPU bound applications that are limited
|> by not being parallelized?
|> 
|> I mean this to eliminate answers that depend on improving the latency
|> between the UK and New Zealand, which is a different sort of research
|> program.  :-)  I also mean to eliminate transaction processing, at least as
|> most commercial systems use it as it is already highly parallel between
|> transactions and very few individual transactions use enough CPU to benefit
|> much by within transaction CPU parallelism.  I also mean to eliminate I/O as
|> that has been parallelized for decades (as you well know).

Actually, no, it doesn't eliminate it.  I am not an expert on what
is normally known as transaction processing, but most of the things
that I have seen that fall under that have various steps.  Now, in
many cases, many of those steps could be done in parallel, but aren't
(for the reasons I gave).  Locking is all very well for some problems,
but not for others; Alpha-style load-locked/store-conditional
(LDx_L/STx_C) designs can be applied more generally; and so on.

Also, some I/O has been parallelised for decades, but modern forms
typically aren't.  TCP/IP over Ethernet is usually dire, and that is
today's de facto standard.

If, however, you are referring to problem areas where there is no
known way of parallelising them, and yet they are bottlenecks, I
should have to think harder.  I am certain that there are some, but
(as I said) a lot of people will have abandoned them as intractable.
So I should have to think about currently untackled requirements.

|> So from your original list we still have ODEs and perhaps UIs, though there
|> the benefit may be limited to relatively simple things like what was
|> mentioned earlier - dedicating a CPU to user interactions to assure
|> responsiveness.  Are there others?

Protein folding comes close.  It is parallelisable in space, but
not easily in time.  There are quite a lot of problems like that.



Regards,
Nick Maclaren.
0
Reply nmm1 9/3/2004 4:39:00 PM

On 3 Sep 2004 16:39:00 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

>In article <th0_c.548848$Gx4.476461@bgtnsc04-news.ops.worldnet.att.net>,
>"Stephen Fuld" <s.fuld@PleaseRemove.att.net> writes:
>|> 
>|> OK, let me rephrase the original question to more reflect what I think the
>|> OP was asking.
>|> 
>|> What are some examples of important, CPU bound applications that are limited
>|> by not being parallelized?

Yes, that would be a better way of phrasing it.

>|> I mean this to eliminate answers that depend on improving the latency
>|> between the UK and New Zealand, which is a different sort of research
>|> program.  :-)

Right :) I'll agree it's an answer to the question I asked, but it's
not the sort of problem I'm interested in here.

>If, however, you are referring to problem areas where there is no
>known way of parallelising them, and yet they are bottlenecks, I
>should have to think harder.  I am certain that there are some, but
>(as I said) a lot of people will have abandoned them as intractable.
>So I should have to think about currently untackled requirements.

Okay.

>Protein folding comes close.  It is parallelisable in space, but
>not easily in time.  There are quite a lot of problems like that.

Speaking of which: It seems to me that a big problem with protein
folding and similar jobs (e.g. simulating galaxy collisions) is:

- If you want N digits of accuracy in the numerical calculations, you
just need to use N digits of numerical precision, for O(N^2)
computational effort.

- However, quantizing time produces errors; if you want to reduce
these to N digits of accuracy, you need to use exp(N) time steps.

Is this right? Or is there any way to put a bound on the total error
introduced by time quantization over many time steps?

(Fluid dynamics simulation has this problem too, but in both the space
and time dimensions; I suppose there's definitely no way of solving it
for the space dimension, at least, other than by brute force.)

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/3/2004 6:12:16 PM

Sander Vesik wrote:
> Is any kind of itanium actually available on the open market (and
> i mean openmarket for new chips, not resale of systems)?

Searching PriceWatch I got 11 offers for boxed Itanium 2 CPUs.

-- 
My words are my own.  They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright.  (I do not speak for my employer.)

0
Reply Alex 9/3/2004 6:13:42 PM

In article <4138b28e.169004334@news.eircom.net>,
Russell Wallace <wallacethinmintr@eircom.net> wrote:
>
>Speaking of which: It seems to me that a big problem with protein
>folding and similar jobs (e.g. simulating galaxy collisions) is:
>
>- If you want N digits of accuracy in the numerical calculations, you
>just need to use N digits of numerical precision, for O(N^2)
>computational effort.

More or less.

>- However, quantizing time produces errors; if you want to reduce
>these to N digits of accuracy, you need to use exp(N) time steps.
>
>Is this right? Or is there any way to put a bound on the total error
>introduced by time quantization over many time steps?

There are ways, but they aren't very reliable.  The worse problem
is that many such analyses are numerically unstable (a.k.a. chaotic),
and that the number of digits you need in your calculations is
exponential in the number of time steps.  Also, reducing the size
of steps reduces one cause of error and increases this one.
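
A minimal sketch of that instability, under assumptions that are not
in the discussion above: the logistic map x -> 4x(1-x) stands in for
one step of a chaotic simulation, and the starting value and the 1e-12
perturbation are arbitrary illustrative choices.  Two runs that agree
to about twelve digits at the start part company after a few dozen
steps, because the gap between them grows by roughly a factor of two
per step on average.

```python
def logistic_gap(x0, delta, steps):
    """Iterate x -> 4x(1-x) from x0 and from x0+delta; return the gap per step."""
    a, b = x0, x0 + delta
    gaps = []
    for _ in range(steps):
        gaps.append(abs(a - b))
        a = 4.0 * a * (1.0 - a)
        b = 4.0 * b * (1.0 - b)
    return gaps

# Hypothetical starting point and perturbation, chosen only for illustration.
gaps = logistic_gap(x0=0.2, delta=1e-12, steps=100)
for n in (0, 10, 20, 30, 40, 50):
    print(f"step {n:3d}: gap = {gaps[n]:.3e}")
```

Carrying more digits only postpones the point at which the two runs
diverge; it does not remove it.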

You don't usually have to mince time as finely as you said, but
the problem remains.  This is alleviated by the fact that most
numerical errors merely change one possible solution into another,
which is harmless.  Unfortunately, there is (in general) no way of
telling whether that is happening or whether they are changing a
possible solution into an impossible one.
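
As a hedged illustration of why time need not be minced quite that
finely, the sketch below integrates the toy problem y' = -y (picked
only because its exact solution is known; it is not one of the
simulations under discussion) with forward Euler and with classical
fourth-order Runge-Kutta.  With a first-order method each extra digit
costs about ten times the steps; with a fourth-order method ten times
the steps buys about four digits.  It says nothing about the stability
problem, which remains.

```python
import math

def euler(f, y, t_end, n):
    """Forward Euler: global error shrinks like h."""
    h = t_end / n
    t = 0.0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y

def rk4(f, y, t_end, n):
    """Classical Runge-Kutta: global error shrinks like h**4."""
    h = t_end / n
    t = 0.0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

def decay(t, y):
    """Right-hand side of the toy ODE y' = -y."""
    return -y

exact = math.exp(-1.0)
errors = {}
for name, method in (("euler", euler), ("rk4", rk4)):
    for n in (10, 100):
        errors[name, n] = abs(method(decay, 1.0, 1.0, n) - exact)
        print(f"{name:5s} n={n:4d} error={errors[name, n]:.3e}")
```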

>(Fluid dynamics simulation has this problem too, but in both the space
>and time dimensions; I suppose there's definitely no way of solving it
>for the space dimension, at least, other than by brute force.)

The same applies to the other problems.  The formulae are different,
but the problems have a similar structure.

All this is why doing such things is a bit of a black art.  I know
enough to know the problems in principle, but can't even start to
tackle serious problems in practice.


Regards,
Nick Maclaren.
0
Reply nmm1 9/3/2004 8:01:04 PM

In article <e90782f7.0409030709.3ece20b5@posting.google.com>,
 MitchAlsup@aol.com (Mitch Alsup) wrote:

> Robert Myers <rmyers1400@comcast.net> wrote in message 
> news:<EFLZc.105418$Fg5.9550@attbi_s53>...
> > Stefan Monnier wrote:
> > > 
> > > Your second CPU will be mostly idle, of course, but so is the first CPU
> > > anyway ;-)
> > I sometimes think: no one experienced the microprocessor revolution.
> 
> Indeed. One thing we noticed in the RISC revolution (may it rest in 
> peace) was that a dual processor workstation did not get an application 
> done any faster, but it made the person interacting with the application 
> a lot happier!

Hmmm. My experience was a bit different; back in the early 1990's, based 
on watching SMP clients' experiences, I proposed a rule something along 
the lines that "disappointment with multiprocessor systems scales at 
least linearly with the number of CPUs". But that was when a lot of the 
clients were engineering-savvy....

    Hamish
0
Reply Hamish 9/3/2004 9:31:24 PM

In comp.arch Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote:
> > Think multi-component and parallelising WITHIN transactions.  In
> > theory, it can often be done.  In practice, doing it and maintaining
> > consistency is hard enough that it isn't.
> 
> What kind of transaction - by itself - would take long enough to warrant
> that?

any transaction that goes off and does some data mining in the middle?

> 
>  > Why do you think that so
> > many electronic transactions are so slow, and often getting slower?
> 
> I wonder myself. I put it down to general incompetence - in particular,
> because so much data is unnecessarily slung around over none-too-fast
> networks. Of course, anything XML-based will make things only worse.

Just a symptom of database-centric things being designed and run by
people who just want to add another tier to fix all problems. Oh,
and clusters are cool and should be used at all costs.

-- 
	Sander

+++ Out of cheese error +++
0
Reply Sander 9/3/2004 10:37:13 PM


On Fri, 3 Sep 2004, Russell Wallace wrote:

> >Protein folding comes close.  It is parallelisable in space, but
> >not easily in time.  There are quite a lot of problems like that.

Protein folding comes in two forms - which I would describe as

i) "Hamiltonian evolution of a  (semi-)classical approximation"
   - this is similar to galactic dynamics etc... and necessarily
   serialises in time.

ii) Statistical mechanical prediction of the probability
    distribution of a protein configuration.
    This does not have the time problem, and can be
    done using exact Monte-Carlo algorithms, such as Metropolis,
    Hybrid Monte-Carlo, etc...

The latter has very nice parallelisation and numerical error
insensitivity properties. It also accounts better for quantum effects.
No doubt, it is much more numerically expensive.
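
A minimal sketch of the second form, with everything specific in it
assumed for illustration: a one-dimensional double-well energy
U(x) = (x^2 - 1)^2 stands in for a real configuration energy, and the
temperature, step size, chain length and seeds are arbitrary.  The
point is only that independent Metropolis chains share no state, which
is where the easy parallelisation comes from.

```python
import math
import random

def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def metropolis_chain(seed, n_steps, temperature=0.5, step=0.5):
    """Return the samples of x visited by one Metropolis chain."""
    rng = random.Random(seed)          # each chain owns its generator
    x = rng.uniform(-1.5, 1.5)
    samples = []
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)
        d_e = energy(trial) - energy(x)
        # Downhill moves are always accepted, uphill moves with
        # Boltzmann probability exp(-dE / T).
        if d_e <= 0.0 or rng.random() < math.exp(-d_e / temperature):
            x = trial
        samples.append(x)
    return samples

# Run serially here for simplicity; each chain could go to its own processor.
chains = [metropolis_chain(seed, 20000) for seed in range(4)]
mean_x2 = [sum(x * x for x in c) / len(c) for c in chains]
print(mean_x2)
```

A rounding error in one step just nudges which configuration gets
sampled next, so it does not accumulate the way it does along a
trajectory.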

Peter


> Speaking of which: It seems to me that a big problem with protein
> folding and similar jobs (e.g. simulating galaxy collisions) is:
>
> - If you want N digits of accuracy in the numerical calculations, you
> just need to use N digits of numerical precision, for O(N^2)
> computational effort.
>
> - However, quantizing time produces errors; if you want to reduce
> these to N digits of accuracy, you need to use exp(N) time steps.
>
> Is this right? Or is there any way to put a bound on the total error
> introduced by time quantization over many time steps?
>
> (Fluid dynamics simulation has this problem too, but in both the space
> and time dimensions; I suppose there's definitely no way of solving it
> for the space dimension, at least, other than by brute force.)
>
> --
> "Sore wa himitsu desu."
> To reply by email, remove
> the small snack from address.
>

Peter Boyle	pboyle@physics.gla.ac.uk


0
Reply Peter 9/4/2004 8:15:51 AM

Kees van Reeuwijk wrote:

> Scott Moore <samiam@moorecad.com> wrote:
> 
> 
>>Russell Wallace wrote:
>>
>>
>>>On 1 Sep 2004 19:50:09 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
>>>
>>>
>>>
>>>>There is effectively NO chance of automatic parallelisation working
>>>>on serial von Neumann code of the sort we know and, er, love.  Not
>>>>in the near future, not in my lifetime and not as far as anyone can
>>>>predict.  Forget it.
>>>
>>>
>>>At least as far as your typical spaghetti C++ is concerned, yeah, not
>>>going to happen anytime in the near future.
>>>
>>
>>The statement is wrong in any case. C can be translated to hardware
>>(which is de facto parallelism) by "constraints", i.e., refusing to
>>translate its worst features (look up SystemC, C to hardware and
>>similar).
> 
> 
> That is wildly optimistic, especially if you want to translate to
> hardware. Efficient automatic parallelization of an imperative language
> is terribly hard in any case, but to a platform with static fine-grained
> parallelism?

Everyone agrees that the present von Neumann-oriented languages are
poor matches for parallelism. Everyone agrees that there should be an
ideal parallel language out there somewhere. But nobody seems to be
able to find it. Von Neumann was the compromise we made to move forward
at all. Programming patch panels may have been very much parallel, but
it was a waste of time, literally drowning in too much flexibility.

> 
> Yes, there is SystemC and SpecC, but both sneak in a parallel
> programming language that you must use for any reasonable performance.
> And neither is exactly a runaway success...

For hardware applications, I am told that C/C++, in any case, does not
have the kinds of controls and statements hardware engineers want and
need, so it's too simplistic to state that the languages had to be
extended to be efficient. It is equally likely that they were extended
for the task at hand.

> 
> Also note that you have quietly moved from C++ to C, and that even a
> constrained version of C is already another language, especially since
> one of the constraints is likely to be `no malloc'. You cannot compile
> legacy code and expect a useful result, you must write specifically for
> that compiler and its constraints.
> 

C++ can be considered to be simply a high level language implemented over
the top of C, as indeed its first implementation was (cfront).

> 
>>Other languages can do it without constraints.
> 
> 
> Not likely. Imperative languages have too little inherent parallelism.
> The best bet in that respect are functional languages (Haskell, ML,
> etc.), but they are useless without dynamic memory allocation.

The reason SMP exists is that programmers don't want to change. Hillis
advocated the need to throw the present computing structures out with
the bathwater to get to "perfect" parallelism.

I'm not arguing that the present languages are bad for parallelism.
Just that nobody feels like starting over, and any approach (like SMP)
based on the way things work now, instead of ideally, is going to
deliver more results if only because the state of the art is already
so far along.

> 
> There are hardware description languages, of course, but they are very
> specialized beasts.

Which is why they are not mainstream. Why is Verilog a success? Because
it looks like C. QED.

"I stayed up all night, getting nowhere. But finally, the king
of the monkeys came and told me the answer. As soon as I can transcribe
their strange and beautiful language, I will have it" - Dilbert

> 
> 
>>Finally,
>>any code, no matter how bad, could be so translated by executing it
>>(simulating it), and then translating what it does dynamically and
>>not statically.
> 
> 
> JIT-ing to hardware is not even done by any /research/ group AFAIK. 
> 
> 
>>This simulation can then give the programmer a report
>>of what was not executed, and the programmer modifies the test cases
>>until all code has been so translated.
> 
> 
> That is only workable for very small programs, and even then I have my
> doubts. You leave all the hard work to the programmer, namely rewriting
> the program to use a parallelisable algorithm.
> 

I think you answered a question I didn't pose. The original question I
answered was "why is C not able to parallelize as well as other languages",
not "how can any procedural language be parallelized...."

In any case, thanks for the debate.

-- 
Samiam is Scott A. Moore

Personal web site: http://www.moorecad.com/scott
My electronics engineering consulting site: http://www.moorecad.com
ISO 7185 Standard Pascal web site: http://www.moorecad.com/standardpascal
Classic Basic Games web site: http://www.moorecad.com/classicbasic
The IP Pascal web site, a high performance, highly portable ISO 7185 Pascal
compiler system: http://www.moorecad.com/ippas

Being right is more powerful than large corporations or governments.
The right argument may not be pervasive, but the facts eventually are.
0
Reply Scott 9/4/2004 9:05:27 AM

In article <Pine.GSO.4.58.0409040908080.13233@holyrood.ed.ac.uk>,
Peter Boyle  <pboyle@holyrood.ed.ac.uk> wrote:
>
>Protein folding comes in two forms - which I would describe as
>
>i) "Hamiltonian evolution of a  (semi-)classical approximation"
>   - this is similar to galactic dynamics etc... and necessarily
>   serialises in time.

Like any other form of "extended ODE" with a time component.

>ii) Statistical mechanical prediction of the probability
>    distribution of a protein configuration.
>    This does not have the time problem, and can be
>    done using exact Monte-Carlo algorithms, such as Metropolis,
>    Hybrid Monte-Carlo, etc...
>
>The latter has very nice parallelisation and numerical error
>insensitivity properties. It also accounts better for quantum effects.
>No doubt, it is much more numerically expensive.

It also doesn't deal with the multiple minimum problem - which is
critical for at least some proteins, such as prions.  You need to
know how the protein folds to be sure that there isn't a barrier,
and/or to estimate the probability and conditions for folding into
different configurations.


Regards,
Nick Maclaren.
0
Reply nmm1 9/4/2004 9:28:59 AM

Scott Moore wrote:

> Which is why they are not mainstream. Why is Verilog a success? Because
> it looks like C. QED.

As a softie who was recently introduced to Verilog, I would vehemently
dispute that. In fact I originally wrote "Does it buggery look like
C !". :)


Cheers,
Rupert

0
Reply Rupert 9/4/2004 9:48:20 AM

Rupert Pigott wrote:

> Scott Moore wrote:
> 
>> Which is why they are not mainstream. Why is Verilog a success?
>> Because it looks like C. QED.
> 
> As a softie who was recently introduced to Verilog, I would vehemently
> dispute that. In fact I originally wrote "Does it buggery look like
> C !". :)

IMHO, Verilog is a success (with the competitor VHDL) because it offers
the things you need, and nothing more. VHDL is an overloaded language,
which allows you to do everything, but usually in a somewhat clumsy way.
But actually, you don't want to do everything (you just want to design
chips), so the features are in the way.

And if you want more features, you can use SystemVerilog.

What I don't like about any HDL is that there is no explicit register
declaration. "reg" in Verilog is misleading, it's just a variable.
Registers depend on context, and register assignment styles also depend
on context (like synchronous vs. asynchronous resets). An explicit
register declaration should declare the register's clock, the reset and
reset value (optionally a gating), and leave the reset and gating style
(synchronous vs. asynchronous/gating vs. multiplexer) to the synthesis
tool settings.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
0
Reply Bernd 9/4/2004 11:51:47 AM

Peter Boyle wrote:
> 
> On Fri, 3 Sep 2004, Russell Wallace wrote:
> 
> 
>>>Protein folding comes close.  It is parallelisable in space, but
>>>not easily in time.  There are quite a lot of problems like that.
> 
> 
> Protein folding comes in two forms - which I would describe as
> 
> i) "Hamiltonian evolution of a  (semi-)classical approximation"
>    - this is similar to galactic dynamics etc... and necessarily
>    serialises in time.
> 

Given the way that massive parallelism is currently most commonly done, 
this is a communication problem masquerading as a computational problem.

The real problem here is that you need to take a large number of time
steps (~10^11, in the example from IBM's Blue Gene document) and,
because of long range forces, you need nearly global communication at
every step (since every particle needs to know where every other
particle within any arbitrary cutoff for long-range forces is).  The
expensive sum over particles can be made parallel, but at the cost of
putting a copy of nearly every particle position into the memory space 
of every processor over which the sum is made parallel (you can take out 
the nearly if there is no long range cutoff).

The last time this problem was mentioned, the ops count (estimated 1000
machine instructions per force calculation) from Table 1 of Allen et al.

www.research.ibm.com/journal/sj/402/allen.pdf

was discussed here.  There are other ways you can organize the 
calculation, but if you follow the naive sum-over-particles calculation 
implied by Table 1, then the ops count should include the cost of moving 
the particle position for that particular particle from the memory space 
of the processor where it was most recently updated to the place where 
its position will be used to update the position of another particle.

If you have a processor fast enough (possibly by being 
multiply-threaded) to handle updating multiple particles in a single 
memory space, you can amortize that communication cost over the number 
of particles being updated.  The usual issues with shared memory don't 
arise because the shared variables can be read-only.  A large number of 
threads or cores on a chip with a shared L3 would be just dandy.  The 
entire shared space (six degrees of freedom for 32000 particles) would 
probably fit into L3.
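
A back-of-envelope check of that last figure, assuming double-precision
(8-byte) storage for each degree of freedom; the precision is my
assumption, and the helper name is made up for illustration.

```python
def shared_state_bytes(particles, dof_per_particle=6, bytes_per_value=8):
    """Bytes needed to hold every particle's degrees of freedom."""
    return particles * dof_per_particle * bytes_per_value

size = shared_state_bytes(32000)
print(size, "bytes =", round(size / 2**20, 2), "MiB")
```

That comes to about 1.5 MB in double precision, or half that in single
precision.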

RM

0
Reply Robert 9/4/2004 2:03:03 PM

Paul Repacholi <prep@prep.synonet.com> wrote in message news:<87llfroc5m.fsf@k9.prep.synonet.com>...
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
> 
> > Getting back to the issue of multiprocessors for "desktops" or even
> > laptops: I agree that parallelizing Emacs is going to be
> > excrutiatingly painful so I don't see it happening any time soon.
> > But that's not really the question.
> 
> In fact, Emacs IS a good candidate. Very little context that is not
> buffer or window/frame local. Going into that swamp is another issue!

I have a better reason why Emacs is a great candidate for
parallelization.
It's written in Lisp, and in reality it's a Lisp operating system with
an embedded word processor included as a major app. Now, the Lisp
code could be autoparallelized by an autoparallelizing compiler, so you
would need to do some work to improve the underlying Lisp compiler/OS
to handle multiprocessing needs. BTW: I think that Emacs is going to
be one of the desktop applications that gets parallelized
well. [If it hasn't been already.] Simply because parallelizing it is a
geeky enough trick that someone in OSS development may want to do it just
for the kicks, and most of it is written in a parallelizable language.

Jouni Osmala
0
Reply josmala 9/4/2004 4:14:36 PM

"Scott Moore" <samiam@moorecad.com> wrote in message
news:rzf_c.32443$_g7.1885@attbi_s52...

snip

> The reason SMP exists is that programmers don't want to change. Hillis
> advocated the need to throw the present computing structures out with
> the bathwater to get to "perfect" parallelism.
>
> I'm not arguing that the present languages are bad for parallelism.
> Just that nobody feels like starting over, and any approach (like SMP)
> based on the way things work now, instead of ideally, is going to
> deliver more results if only because the state of the art is already
> so far along.

I freely admit that I may be way off base here, but I am very much reminded
of an analogous situation in a somewhat earlier age.  Perhaps it can best be
described with the paraphrase "SMP considered harmful to parallel
programming progress".  That is, SMP is like the use of the goto statement in
that it is very useful in modest-sized applications (think perhaps quick
and dirty) but as things scale up, neither works well and both seem to have
unintended consequences that make further progress much harder.  Do we need
to bite the bullet and "throw out" the SMP code, just like we mostly did
with goto-filled code, and thus regress in order to make more progress later?
I very well remember the resistance to eliminating goto, the projected cost
in terms of inefficient programs, the cost of rewriting, etc.  But now, few
would go back.

Just a thought, but I find it interesting.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/4/2004 4:35:33 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:

> It also doesn't deal with the multiple minimum problem - which is
> critical for at least some proteins, such as prions.  You need to
> know how the protein folds to be sure that there isn't a barrier,
> and/or to estimate the probability and conditions for folding into
> different configurations.

No current method adequately deals with the multiple minimum problem
because they can't adequately explore the space to find even a single
minimum.

I think there are very useful things that could be done by molecular
dynamics on a large ensemble of starting configurations for fairly
modest-length trajectories.  That would be trivially parallelizable
with current codes.
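
A rough sketch of why that is trivially parallel, with all specifics
assumed for illustration: a unit-mass particle in the toy double-well
energy U(x) = (x^2 - 1)^2, integrated with velocity Verlet, stands in
for a real molecular dynamics code.  Each trajectory depends only on
its own starting point, so the ensemble can be farmed out with no
communication between runs.

```python
from concurrent.futures import ThreadPoolExecutor

def force(x):
    """Force from the toy double-well energy U(x) = (x^2 - 1)^2."""
    return -4.0 * x * (x * x - 1.0)

def total_energy(x, v):
    """Kinetic plus potential energy for unit mass."""
    return 0.5 * v * v + (x * x - 1.0) ** 2

def run_trajectory(x0, steps=1000, dt=0.01):
    """Velocity-Verlet trajectory from rest at x0; returns (x, v, drift)."""
    x, v = x0, 0.0
    e0 = total_energy(x, v)
    a = force(x)
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = force(x)
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v, abs(total_energy(x, v) - e0)

starts = [-1.5 + 0.1 * i for i in range(31)]   # ensemble of starting points

serial = [run_trajectory(x0) for x0 in starts]
# Trajectories share no state, so no locking is needed.  Threads only show
# the structure here; CPU-bound CPython code would need processes (or
# separate machines) for a real speedup.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(run_trajectory, starts))

print(parallel == serial, max(drift for _, _, drift in parallel))
```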

-- Dave
0
Reply dhinds 9/4/2004 4:49:18 PM

Stephen Fuld wrote:

> 
> I freely admit that I may be way off base here, but I am very much reminded
> of an analogous situation in a somewhat earlier age.  Perhaps it can best be
> described with the paraphrase "SMP considered harmful to parallel
> programming progress".  That is, SMP is like the use of the goto statement in
> that it is very useful in modest-sized applications (think perhaps quick
> and dirty) but as things scale up, neither works well and both seem to have
> unintended consequences that make further progress much harder.  Do we need
> to bite the bullet and "throw out" the SMP code, just like we mostly did
> with goto-filled code, and thus regress in order to make more progress later?
> I very well remember the resistance to eliminating goto, the projected cost
> in terms of inefficient programs, the cost of rewriting, etc.  But now, few
> would go back.
> 
> Just a thought, but I find it interesting.
> 

<with respect>

No. No. No. No. No. No.

Single-processor system images harmful to parallel programming.

SPSI = Everything has to cross user/kernel space boundaries and the 
communication stack for any kind of nontrivial parallelism.

Better ways to do it than classic SMP?  I'm sure there are, but tens of 
thousands of instances of the Linux kernel aren't the answer, either.

RM

0
Reply Robert 9/4/2004 5:24:35 PM

> >It won't surprise me in the least if 15 years from now, when the
> >conversation is about multiple cores in digital watches or whatever,
> >someone says "we had exactly that argument 15 years ago with regard to
> >parallel processing on desktops" :)
> 
> Nor would it surprise me.  Raymond makes one good point, though he
> gets it slightly wrong!
> 
> There is effectively NO chance of automatic parallelisation working
> on serial von Neumann code of the sort we know and, er, love.  Not
> in the near future, not in my lifetime and not as far as anyone can
> predict.  Forget it.
> 
> This has the consequence that large-scale parallelism is not a viable
> general-purpose architecture until and unless we move to a paradigm
> that isn't so intractable.  There are such paradigms (functional
> programming is a LITTLE better, for a start), but none have taken
> off as general models.  The HPC world is sui generis, and not relevant
> in this thread.
> 
> So he would be right if he replaced "beyond 2 cores" by "beyond a
> small number of cores".  At least for the next decade or so.
> 
> 
> Regards,
> Nick Maclaren.

There are a few hints as to why parallel CPUs are beneficial on the HOME
desktop.
Most programs do run fast enough. Look for the exceptions:
a) Games
b) 3D editing software
c) Video editing

Now what about the future and the parallelization of said things?

a) Well, one thread for the UI, a couple of threads for physics [split by
area], and a HUGE number of threads available for AI. [Think: each
monster only READS the shared area while it writes to its own specific
area, so no sharing, only synchronization with the physics and game
mechanics threads, and that synchronization could be handled by keeping
the LAST frame intact for AI threads that work with a one-frame delay.] I
have reasons to believe that the main benefit of adding more than 4
cores is improving AI algorithms in games.
b) Already shows some parallelism; I don't know how much inherent
parallelism there is for the CPU that is reasonably easy to get. [For
gains.]
c) Same here.
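
A small sketch of the "keep the LAST frame intact" idea from a), with
everything concrete in it made up for illustration: monsters live on a
one-dimensional line, the AI just walks towards the player, and a
thread pool stands in for AI threads on separate cores.  Each task
only reads the frozen previous frame and produces the value for its
own slot of the next one, so the AI tasks need no locks among
themselves.

```python
from concurrent.futures import ThreadPoolExecutor

def monster_ai(index, last_frame, player_x):
    """Decide one monster's next position from the frozen previous frame."""
    x = last_frame[index]                 # read-only access to shared data
    if x < player_x:
        return x + 1
    if x > player_x:
        return x - 1
    return x

def next_frame(last_frame, player_x, pool):
    """Build a new frame; each task fills only its own slot."""
    futures = [pool.submit(monster_ai, i, last_frame, player_x)
               for i in range(len(last_frame))]
    return tuple(f.result() for f in futures)

frame = (0, 3, 10, 7)                     # monster positions in frame N-1
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(3):
        # The old tuple is immutable, so it stays intact while the AI runs.
        frame = next_frame(frame, player_x=5, pool=pool)
print(frame)                              # prints (3, 5, 7, 5)
```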

But the main point for desktop parallelism isn't what cannot be
parallelized, but whether there ARE enough important applications that CAN
be parallelized. And I'm saying yes, there are, and the wall it hits
is that at some point no desktop apps need more computing power,
not that additional cores could not be put to parallel use. You don't need
to parallelize every task for the desktop to embrace the multicore
paradigm, just the games.

Besides, people who write software will typically have TWO years for
each doubling of the number of cores ;) [Except there is probably one
shrink that goes towards increasing cache instead of the number of cores,
and that is more probably earlier than later.]

And as for languages, hmmm. They will evolve, simply because using C to
write an app for 16 cores doesn't look promising as a general case. SOME
apps, like games, are parallel. [No, you don't need to parallelize every
single small task, just run different TASKS in parallel (physics,
ai1, ai2 ... ai[n], ui), and if some task takes more than 1/60th of a
second then there is a need to parallelize that.]
But in many cases things will adapt. Secondly, there are multiple
processes for the CPUs to be utilized by. For instance, the OS, a P2P
app and an MP3 player in the background while the actual game is in the
foreground.
If it's 2 cores in 2005, it's 4 cores in 2007 or 2009 and 8 in 2011,
assuming the Intel roadmap holds for processes. And on this time scale I
think there is reasonable parallelism available to the cores within a
year of their introduction. For two cores it's mostly background
processes; for 4 cores it's that there are applications that do work in 3
threads, plus background stuff; and for 8 cores there will have been
several years for desktop application writers to deal with it. Some are
fast enough anyway and don't need the parallelism, while those who do
need it will have to find a way to use it or their competitors will.
Yes, there are problems that are inherently
non-parallelizable, but as long as the tasks that drive the sales
are parallelizable there will be an exponential trend of increasing
numbers of cores per die. And no, what we learned from supercomputers
won't hold for the desktop:
a) Desktop runs n aplications and background processes for sametime
and there is benefit for them having a CPU so that foreground is not
stalled, then the fore ground process may have some coarse grain
parallerism available like games have...
b) Faster desktop processors are sold by numbers for brainwashed
masses.
c) There is need for more processors as long as there is ANY tasks
that could utilize more processors.
d) There is NO long latency communcation problem between the nodes as
with supercomputers, your inter process communication latency happens
in a single die, which makes it a LOT faster than the supercomputers
so you DON'T need to replicate the read only data for different
processes.
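
A minimal sketch of the "keep the LAST frame intact" idea for AI threads,
written in Python purely for illustration. The names ai_step and
next_frame and the toy movement rule are hypothetical, not taken from any
real game engine. Each monster only reads the frozen previous frame and
produces a value for its own slot, so the AI tasks need no locks among
themselves. (In CPython the threads would not actually run CPU work in
parallel; the sketch only shows the data-sharing pattern.)

```python
from concurrent.futures import ThreadPoolExecutor


def ai_step(monster_id, last_frame):
    """Decide one monster's next position from the frozen previous frame.

    The monster only READS last_frame (shared, read-only) and returns a
    value destined for its own slot, so no locking is needed.
    """
    my_pos = last_frame[monster_id]
    # Hypothetical toy rule: move one step toward the average position.
    centre = sum(last_frame) / len(last_frame)
    if my_pos < centre:
        return my_pos + 1
    if my_pos > centre:
        return my_pos - 1
    return my_pos


def next_frame(last_frame, workers=4):
    """Compute the next frame; every AI task sees the same old frame."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in monster order, one slot per monster.
        return tuple(pool.map(lambda i: ai_step(i, last_frame),
                              range(len(last_frame))))


frame0 = (0, 4, 10, 2)   # an immutable tuple stands in for the last frame
frame1 = next_frame(frame0)
print(frame1)
```

Because the old frame is never modified, the result does not depend on
how many workers run the AI tasks or in which order they finish.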


Jouni Osmala
-I know I should write more software and write less in comp.arch ...
0
Reply josmala 9/4/2004 7:47:52 PM

Jouni Osmala wrote:

[SNIP]

> There are a few hints as to why parallel CPUs are beneficial on the HOME
> desktop.
> Most programs do run fast enough. Look for the exceptions..
> a) Games
> b) 3D editing software.
> c) Video editing
> 
> Now what about the future and the parallelisation of said things.
> 
> a) Well, one thread for the UI, a couple of threads for physics [split by
> area], and a HUGE number of threads available for AI. [Think: each
> monster only READS the shared area while it writes to its own specific
> area, so no sharing, only synchronization with the physics and
> game-mechanics threads, and that synchronization could be handled by
> keeping the LAST frame intact for AI threads that work with a one-frame
> delay.] I have reason to believe that the main benefit of adding more
> than 4 cores is improved AI algorithms in games.

[SNIP]

None of this is new. One of the early apps written for the transputer
was in fact the fabled "Flight Simulator". The thing is these days
there is a lot of work off-loaded to specialised hardware (eg: your
shiny new Nvidia 6800), the question I have to ask is : How much do
CPUs hold games back now ?

My suspicion is : In the general case, not much because developers
target median hardware in order to maximise their potential market.

AI as it's done in games now is basically scripting, I haven't seen
many signs of that changing. Game developers really don't seem to be
much interested in anything else, genuine adaptive AI would be a
complete bastard to test (and they do play-test quite heavily).

We shall see how it pans out, the next cut of XBox may well confirm
your hypothesis.

Cheers,
Rupert

0
Reply Rupert 9/4/2004 8:09:49 PM

On Sat, 04 Sep 2004 09:05:27 GMT, Scott Moore <samiam@moorecad.com>
wrote:

>Everyone agrees that the present, Von Neumann oriented languages are
>poor matches for parallelism. Everyone agrees that there should be an
>ideal parallel language out there somewhere. But nobody seems to be
>able to find it.

There's some AI stuff I'm thinking about doing, where I can quite
easily find, say, a quadrillion things that can be done in parallel,
and so specify. Unfortunately I don't have a quadrillion CPUs, and
there won't be enough time for one to several CPUs to iterate through
them all; so the part I'm currently stuck on is figuring out how to
prioritize them so the important ones get done first and a solution of
acceptable quality can be found in a reasonable time. In a sense, in
this context, parallelism is easy and serializing is hard.
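
To make the "serializing is hard" point concrete, here is a small sketch
in Python. It assumes each candidate task already comes with a priority
estimate, which is exactly the part described above as unsolved; the
function name run_best_first and the toy tasks are hypothetical. Given a
budget of how many tasks can be afforded, it runs the most promising ones
first and leaves the rest undone.

```python
import heapq


def run_best_first(tasks, budget):
    """Run at most `budget` tasks, highest estimated priority first.

    tasks: iterable of (priority, name, func); a larger priority means
    more important. Returns (name, result) pairs for the tasks that ran.
    """
    # heapq is a min-heap, so negate the priority; the index breaks ties
    # so that the functions themselves are never compared.
    heap = [(-prio, i, name, func)
            for i, (prio, name, func) in enumerate(tasks)]
    heapq.heapify(heap)
    done = []
    while heap and len(done) < budget:
        _, _, name, func = heapq.heappop(heap)
        done.append((name, func()))
    return done


tasks = [(0.2, "c", lambda: 3), (0.9, "a", lambda: 1), (0.5, "b", lambda: 2)]
print(run_best_first(tasks, budget=2))
```

The mechanics are trivial once the priorities exist; the quality of the
answer depends entirely on how good those estimates are.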

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/4/2004 11:54:31 PM

On Sat, 04 Sep 2004 21:09:49 +0100, Rupert Pigott
<roo@try-removing-this.darkboong.demon.co.uk> wrote:

>None of this is new. One of the early apps written for the transputer
>was in fact the fabled "Flight Simulator". The thing is these days
>there is a lot of work off-loaded to specialised hardware (eg: your
>shiny new Nvidia 6800), the question I have to ask is : How much do
>CPUs hold games back now ?

Lots. Pathfinding alone would in many cases be quite adequate to keep
tens of CPUs pegged at 100% load.

-- 
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
0
Reply wallacethinmintr 9/4/2004 11:57:57 PM

In article <9538122f.0409041147.1fb5e535@posting.google.com>,
Jouni Osmala <josmala@cc.hut.fi> wrote:
>
>But the main point for desktop parallelism isn't what cannot be
>parallelised, but whether there ARE enough important applications that CAN
>be parallelised. And I'm saying yes, there are, and the wall it hits
>is that at some point no desktop apps need more computing power,
>not that extra cores couldn't be used in parallel. You don't need to
>parallelise every task for the desktop to embrace the multicore paradigm,
>just the games.

For some fairly trivial meaning of the word "important".  While the
game market was minuscule until a couple of decades back, it has now
reached saturation.  Yes, it dominates the benchmarketing, but it
doesn't dominate CPU design, and there are very good economic reasons
for that.  Sorry, that one won't fly.

>Besides, people who write software will typically have TWO years for each
>doubling of the number of cores ;) [Except there is probably one shrink
>that goes to increasing cache instead of the number of cores, and that is
>more probably earlier than later.]

Hmm.  I have been told that more times than I care to think, over a
period of 30+ years.  It's been said here, in the context of 'cheap'
computers at least a dozen times over the past decade.  That one
won't even start moving.

>And as for languages, hmmm. They will evolve, simply because using C to
>write an app for 16 cores doesn't look promising as a general case. ...

Ditto, but redoubled in spades.  If that were an aircraft, it would be
a Vickers Viscount that has been sitting at Kinshasa airport since
the Belgians left.

There was essentially NO progress in the 1970s, and the 'progress'
since then has been AWAY FROM parallelism.  With the minor exception
of Fortran 90/95.

>But in many cases things will adapt. Secondly, there are multiple
>processes for the CPUs to run. For instance, the OS, a P2P app and an MP3
>player in the background while the actual game is in the foreground.

Your first sentence is optimistic, but not impossible.  Your second
is what most of the experienced people have been saying.  Multiple
cores will be used to run multiple processes (or semi-independent
threads) on desktops, for the foreseeable future.


Regards,
Nick Maclaren.
0
Reply nmm1 9/5/2004 10:52:54 AM


Nick Maclaren wrote:

> Your first sentence is optimistic, but not impossible.  Your second
> is what most of the experienced people have been saying.  Multiple
> cores will be used to run multiple processes (or semi-independent
> threads) on desktops, for the foreseeable future.
> 

Well, Sun is supposedly working on Thruput Computing.  The only example
they gave was speeding up network stacks.  I'm not sure how much will
come out of it since Sun is off working in their own la-la land.  I
suspect it will be closer to what someone said earlier, if you build
it, the apps will come.  That is, the creative ideas will appear when
you get these multi-core SMP machines into the hands of other than
those with preconceived ideas about the applications of parallel and
concurrent programming.

Joe Seigh
0
Reply Joe 9/5/2004 11:57:50 AM

In article <413AFF6F.28EC2D7@xemaps.com>,
Joe Seigh  <jseigh_01@xemaps.com> wrote:
>
>Well, Sun is supposedly working on Thruput Computing.  The only example
>they gave was speeding up network stacks.  I'm not sure how much will
>come out of it since Sun is off working in their own la-la land.  I
>suspect it will be closer to what someone said earlier, if you build
>it, the apps will come.  That is, the creative ideas will appear when
>you get these multi-core SMP machines into the hands of other than
>those with preconceived ideas about the applications of parallel and
>concurrent programming.

Obviously, I can't speak for Sun.  But I am pretty sure that their
intent with that is to kick-start some radical rethinking, and they
are hoping to shake up the industry rather than follow a predicted
path.


Regards,
Nick Maclaren.
0
Reply nmm1 9/5/2004 1:42:29 PM

Nick Maclaren wrote:

> 
> There was essentially NO progress in the 1970s, and the 'progress'
> since then has been AWAY FROM parallelism.  With the minor exception
> of Fortran 90/95.
> 

So, how bad is it?  As bad as hot fusion, on which the US has finally 
given up (except in the context of international cooperation, which is a 
sure sign the US feels it has no future)?

You don't count Ada as at least a feint in the right direction?  Some
people actually use it, and it can be formally analyzed.  If people
aren't using better tools than C, it's not because those tools aren't
available.

RM

0
Reply Robert 9/5/2004 2:08:02 PM


Nick Maclaren wrote:
> 
> In article <413AFF6F.28EC2D7@xemaps.com>,
> Joe Seigh  <jseigh_01@xemaps.com> wrote:
> >
> >Well, Sun is supposedly working on Thruput Computing.  The only example
> >they gave was speeding up network stacks.  I'm not sure how much will
> >come out of it since Sun is off working in their own la-la land.  I
> >suspect it will be closer to what someone said earlier, if you build
> >it, the apps will come.  That is, the creative ideas will appear when
> >you get these multi-core SMP machines into the hands of other than
> >those with preconceived ideas about the applications of parallel and
> >concurrent programming.
> 
> Obviously, I can't speak for Sun.  But I am pretty sure that their
> intent with that is to kick-start some radical rethinking, and they
> are hoping to shake up the industry rather than follow a predicted
> path.
> 

I agree that's probably what their intent is.  The problem is Sun is
stuck in that proprietary hardware and software business model and that
kind of narrows their viewpoint.

What they should be doing is creating an open API that runs well enough
on current hardware that it gets widespread adoption but runs sufficiently
better on their proprietary hardware that Sun has a competitive advantage
over commodity hardware.

That's sort of what hw vendors do with existing APIs, but with new APIs
you have the advantage of patenting all the more obvious implementations
before they can occur to anyone else.

It's a bit of a timing thing.  You want general adoption of the API before
everyone realizes you've locked up the competitive advantage.

Joe Seigh
0
Reply Joe 9/5/2004 2:30:04 PM

Rupert Pigott <roo@try-removing-this.darkboong.demon.co.uk> writes:
> The thing is these days
>there is a lot of work off-loaded to specialised hardware (eg: your
>shiny new Nvidia 6800), the question I have to ask is : How much do
>CPUs hold games back now ?

Quite a bit.  See results at

http://www.complang.tuwien.ac.at/anton/umark/

Well, I better reproduce the results here:

               --------------Machines-------------------
qual Resolut.  Markus Anton Franz calis5 calis2a calis2b
 Low 1280x1024 59.2   67.9  44.6  39.4
 Low   800x600 62.0   72.0  46.5  44.7   37.4
High 1280x1024 40.0   21.3  31.7  18.0   10.8    22.7
High   800x600 47.2   44.1  33.9  32.2   23.5    25.3

Machines:
Markus: Pentium 4 3000MHz (512KB L2), GeForce FX 5900 256MB, i875, 1GB DDR400 dual channel
Anton: Athlon 64 3200+ (2000MHz, 1MB L2), GeForce4 Ti 4200 64MB, K8T800, 512MB DDR333 ECC
Franz: Athlon XP 2800+ (2083MHz, 512KB L2), Radeon 9600 XT 128MB?, KT400A, 512MB DDR333
calis5: Athlon XP 2700+ (2166MHz, 256KB L2), GeForce FX 5600
calis2a: Athlon XP 1900+ (1600MHz, 256KB L2), GeForce FX 5600, KT266A, 256MB RAM
calis2b: Athlon XP 1900+ (1600MHz, 256KB L2), Radeon 9600, KT266A, 256MB RAM

All the low-quality results seem to be CPU-limited (little difference
between resolutions).  And even for the high-quality results, the
results with a Radeon 9600, 9600 XT, and GeForce FX 5900 are mostly
CPU-limited, and only those with GeForce4 Ti 4200 and FX 5600 are
graphics-card-limited; and comparing "calis5" to "Franz" and "calis2a"
to "calis2b" at "High 800x600", these machines are probably also
CPU-limited at these settings.  And that's with machines with pretty
fast CPUs (there is certainly less difference between these CPUs than
between a Radeon 9600 and a GeForce FX 5900).
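
The "little difference between resolutions" test can be written down as
a short Python sketch. The scores are copied from the table above; the
function name classify and the 1.2 cut-off are my own assumptions for
illustration, not part of the benchmark.

```python
# Scores copied from the table above: (1280x1024, 800x600) per machine.
LOW = {"Markus": (59.2, 62.0), "Anton": (67.9, 72.0),
       "Franz": (44.6, 46.5), "calis5": (39.4, 44.7)}
HIGH = {"Markus": (40.0, 47.2), "Anton": (21.3, 44.1),
        "Franz": (31.7, 33.9), "calis5": (18.0, 32.2),
        "calis2a": (10.8, 23.5), "calis2b": (22.7, 25.3)}


def classify(big_res, small_res, threshold=1.2):
    """Guess the bottleneck from how much a lower resolution helps.

    If dropping the resolution barely raises the score, the graphics
    card was not the limit. The 1.2 threshold is an arbitrary assumption.
    """
    gain = small_res / big_res
    return "CPU-limited" if gain < threshold else "graphics-limited"


for name, (big, small) in HIGH.items():
    print(f"{name:8s} {small / big:4.2f} {classify(big, small)}")
```

With that cut-off, every low-quality row and the high-quality rows for
Markus, Franz and calis2b come out CPU-limited, while Anton, calis5 and
calis2a come out graphics-limited.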

One other interesting point in these results is that the Athlon 64
outdoes similarly-clocked Athlon XPs by a factor >1.5 on the
CPU-limited low-quality settings.  Looking at the Doom3 results at
<http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2149>, a similar
thing happens for Doom 3 and that's not due to the cache size.

Followups set to comp.arch

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
0
Reply anton 9/5/2004 3:18:42 PM

Rupert Pigott wrote:
> AI as it's done in games now is basically scripting, I haven't seen
> many signs of that changing. Game developers really don't seem to be

I think that's at least partly wrong. Even Quake3 had some man-year++ 
effort in developing different kinds of robots that you could play with 
or against, and that was 4-5 (?) years ago.

> much interested in anything else, genuine adaptive AI would be a
> complete bastard to test (and they do play-test quite heavily).

According to John Cash (who wrote most of that Q3 code), it was indeed a 
lot of work to test it, not least since John Carmack tended to tear 
apart all the internals of his engine every two or three weeks. :-(

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
0
Reply Terje 9/5/2004 3:24:46 PM

Anton Ertl wrote:
> Rupert Pigott <roo@try-removing-this.darkboong.demon.co.uk> writes:
> 
>>The thing is these days
>>there is a lot of work off-loaded to specialised hardware (eg: your
>>shiny new Nvidia 6800), the question I have to ask is : How much do
>>CPUs hold games back now ?
> 
> 
> Quite a bit.  See results at
> 
> http://www.complang.tuwien.ac.at/anton/umark/
> 
> Well, I better reproduce the results here:
> 
>                --------------Machines-------------------
> qual Resolut.  Markus Anton Franz calis5 calis2a calis2b
>  Low 1280x1024 59.2   67.9  44.6  39.4
>  Low   800x600 62.0   72.0  46.5  44.7   37.4
> High 1280x1024 40.0   21.3  31.7  18.0   10.8    22.7
> High   800x600 47.2   44.1  33.9  32.2   23.5    25.3
> 
> Machines:
> Markus: Pentium 4 3000MHz (512KB L2), GeForce FX 5900 256MB, i875, 1GB DDR400 dual channel
> Anton: Athlon 64 3200+ (2000MHz, 1MB L2), GeForce4 Ti 4200 64MB, K8T800, 512MB DDR333 ECC
> Franz: Athlon XP 2800+ (2083MHz, 512KB L2), Radeon 9600 XT 128MB?, KT400A, 512MB DDR333
> calis5: Athlon XP 2700+ (2166MHz, 256KB L2), GeForce FX 5600
> calis2a: Athlon XP 1900+ (1600MHz, 256KB L2), GeForce FX 5600, KT266A, 256MB RAM
> calis2b: Athlon XP 1900+ (1600MHz, 256KB L2), Radeon 9600, KT266A, 256MB RAM

Looking at that last line...

franz has ~33% more fps(?) than calis2b. franz has 30% more clock
and 100% more cache. Now look at the GFX cards here :
http://www.ati.com/products/radeon9600/radeon9600pro/compare.html
The card in franz is running at >50% more engine and memory clock
too, the fillrate is claimed to be >50% faster too.

Then we look at calis 2a and 5, 37% more fps(?) with approx 35%
more clock...

So maybe you are right : CPUs are still where the performance
comes from and the GFX card makers are just fooling with us. I
guess we might learn more from profiling the drivers and
application... Not going to happen soon. :)
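
The percentages above can be rechecked with a few lines of Python. This
assumes the comparison uses the "High 800x600" row of the quoted table
and the CPU clocks from the machine list; pct_more is a hypothetical
helper name.

```python
def pct_more(a, b):
    """Return by how many percent a exceeds b."""
    return (a / b - 1.0) * 100.0


# "High 800x600" scores and CPU clocks (MHz) from the quoted table.
franz_vs_calis2b_score = pct_more(33.9, 25.3)
franz_vs_calis2b_clock = pct_more(2083, 1600)
calis5_vs_calis2a_score = pct_more(32.2, 23.5)
calis5_vs_calis2a_clock = pct_more(2166, 1600)

print(round(franz_vs_calis2b_score), round(franz_vs_calis2b_clock))
print(round(calis5_vs_calis2a_score), round(calis5_vs_calis2a_clock))
```

In both pairs the score gain tracks the clock gain to within a few
percentage points, which is what the argument above rests on.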

Cheers,
Rupert

0
Reply Rupert 9/5/2004 4:08:43 PM

In article <65F_c.378468$%_6.351073@attbi_s01>,
Robert Myers  <rmyers1400@comcast.net> wrote:
>Nick Maclaren wrote:
>> 
>> There was essentially NO progress in the 1970s, and the 'progress'
>> since then has been AWAY FROM parallelism.  With the minor exception
>> of Fortran 90/95.
>
>So, how bad is it?  As bad as hot fusion, on which the US has finally 
>given up (except in the context of international cooperation, which is a 
>sure sign the US feels it has no future)?

Yes, precisely.

The approaches of trying to autoparallelise arbitrary serial von
Neumann code are probably closer to cold fusion, though ....

>You don't count Ada as at least a feint in the right direction?  Some
>people actually use it, and it can be formally analyzed.  If people
>aren't using better tools than C, it's not because those tools aren't
>available.

I suppose so, but it is negligibly better than many of the late
1960s languages.  And, even if it were wholly positive, it is
outweighed by the C/C++/Java/etc. regressions.


Regards,
Nick Maclaren.
0
Reply nmm1 9/5/2004 5:41:13 PM

Rupert Pigott <roo@try-removing-this.darkboong.demon.co.uk> writes:

> [SNIP]

> None of this is new. One of the early apps written for the
> transputer was in fact the fabled "Flight Simulator". The thing is
> these days there is a lot of work off-loaded to specialised hardware
> (eg: your shiny new Nvidia 6800), the question I have to ask is :
> How much do CPUs hold games back now ?

If I remember the discussions on the PS2 here when details first came
out, there was concern expressed by some that the non-video CPU was
marginal for some game play. That would have got a lot worse I suspect
with time.

> My suspicion is : In the general case, not much because developers
> target median hardware in order to maximise their potential market.

Plus they have to predict what the mainstream WILL BE :(

-- 
Paul Repacholi                               1 Crescent Rd.,
+61 (08) 9257-1001                           Kalamunda.
                                             West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
0
Reply Paul 9/5/2004 6:14:49 PM

Robert Myers wrote:

> So, how bad is it?  As bad as hot fusion, on which the US has finally 
> given up (except in the context of international cooperation, which is a 
> sure sign the US feels it has no future)?

There's been a great deal of progress in magnetic fusion in the last
few decades.  Confinement parameters for current machines are orders
of magnitude ahead of where they were in the 1970s; understanding of
plasmas has also greatly advanced.

Whether tokamaks are going to be economically competitive is another
matter.  Fortunately, there are exciting ideas for more compact
reactors.

	Paul
0
Reply Paul 9/5/2004 11:42:04 PM

Paul F. Dietz wrote:

> Robert Myers wrote:
> 
>> So, how bad is it?  As bad as hot fusion, on which the US has finally 
>> given up (except in the context of international cooperation, which is 
>> a sure sign the US feels it has no future)?
> 
> 
> There's been a great deal of progress in magnetic fusion in the last
> few decades.  Confinement parameters for current machines are orders
> of magnitude ahead of where they were in the 1970s; understanding of
> plasmas has also greatly advanced.
> 
> Whether tokamaks are going to be economically competitive is another
> matter.  Fortunately, there are exciting ideas for more compact
> reactors.
> 

I chose hot fusion as an example of a problem you'd _really_ like to be 
able to solve, that you think that you ought to be able to solve, that 
significant effort has gone into solving, but that you just haven't been 
able to solve so far...to the point where it has become questionable 
whether it is reasonable to expect a satisfactory solution within the
foreseeable future.

RM

0
Reply Robert 9/6/2004 12:42:21 AM

> >But the main point for desktop parallelism isn't what cannot be
> >parallelised, but whether there ARE enough important applications that CAN
> >be parallelised. And I'm saying yes, there are, and the wall it hits
> >is that at some point no desktop apps need more computing power,
> >not that extra cores couldn't be used in parallel. You don't need to
> >parallelise every task for the desktop to embrace the multicore paradigm,
> >just the games.
> 
> For some fairly trivial meaning of the word "important".  While the
> game market was minuscule until a couple of decades back, it has now
> reached saturation.  Yes, it dominates the benchmarketing, but it
> doesn't dominate CPU design, and there are very good economic reasons
> for that.  Sorry, that one won't fly.

My view is that benchmarketing WILL dominate consumer CPUs. Twice the
cores, twice the performance that Joe Consumer will see. What I was
talking about is what is important from the HOME USER perspective, which
is a substantial part of the x86 market, don't you think? Normal businesses
have already stopped looking for top performance in their desktop
PCs and go for low-cost options. The workstation market is
different, but many workstation apps do parallelise, at least for a couple
of threads. My view is that they will deliver what sells. It's
important because those numbers will determine whose computer is
faster in the home-user market, and business desktops have long since
given up on the idea of whose is faster. They just look at the OEM name
and price and some other variables, like whether it is Intel Inside, and
go for a Celeron.
 
> >Besides, people who write software will typically have TWO years for each
> >doubling of the number of cores ;) [Except there is probably one shrink
> >that goes to increasing cache instead of the number of cores, and that is
> >more probably earlier than later.]
> 
> Hmm.  I have been told that more times than I care to think, over a
> period of 30+ years.  It's been said here, in the context of 'cheap'
> computers at least a dozen times over the past decade.  That one
> won't even start moving.

Paying an exponential amount of die area for a logarithmic performance
increase, versus increasing the number of cores, is something that should
get attention once on-die caches are big enough. The reason why it should
happen now is NOT that software people want it to happen, but that
doubling the number of cores versus gaining 20% single-thread performance
is a really important difference. First, do you really think an x86 core
could go much wider and get a big performance gain from that? Do you think
lengthening the pipeline for better clock speed would be possible
(beyond the P4)? No: there is not enough ILP in x86 code, and the cost of
the circuitry for extracting ILP goes up so much faster than the ILP
gained that it's a dead end too. So they have to turn to what they can do:
increase caches, put memory controllers on die, and add multiple cores.
But after dual core, what are they going to do?
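
One way to put a number on "doubling the cores versus 20% more
single-thread performance" is Amdahl's law. That is a technique I am
bringing in for illustration, not something the argument above spells
out, and the function names are hypothetical. The Python sketch asks what
fraction of a program's work must be parallel before a second core beats
a 20% faster single core.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the work uses all cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def break_even_fraction(target_speedup, cores):
    """Parallel fraction at which `cores` cores reach target_speedup."""
    # Solve 1 / ((1 - p) + p / n) = s for p.
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / cores)


p = break_even_fraction(1.2, 2)
print(f"{p:.3f}")
```

Under this simple model, a program only has to be about one third
parallel for two cores to match the 20% single-thread gain; anything
more parallel than that favours the extra core.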

It's the ILP vs clock speed vs core size kind of question.
Interconnect delays and power-density issues hurt, so that after a
certain point a bigger core extracts less ILP than it loses in
clock speed. What does that have to do with this? Well, interconnect
delays relative to transistor speed will increase, AND that reduces the
optimal size for the core heavily. The trends are there; I can give you
figures that Mr. Demone deduced, which have to be taken with a grain of
salt.

http://www.realworldtech.com/page.cfm?ArticleID=RWT062004172947&p=7

But if this happens as it looks, then at 0.045u (45nm), which is 2007 on
the Intel roadmap, the optimal core size would be 20mm²; the rest is L2
cache, other cores, and whatever else they bring onto the die. And Intel
seems to keep desktop CPU die sizes at about 100-200mm². So that's 4 cores
and their caches.

The reason for multicore is not that multithreading becomes
extremely useful, but that gaining single-threaded performance
becomes MUCH harder.

> >And as for languages, hmmm. They will evolve, simply because using C to
> >write an app for 16 cores doesn't look promising as a general case. ...

> There was essentially NO progress in the 1970s, and the 'progress'
> since then has been AWAY FROM parallelism.  With the minor exception
> of Fortran 90/95.

There are already companies that use internal parallel languages for
their consumer products to cope with SSE, 3DNow! and SMP. There ARE
parallel languages that are easy to use for application development.
I'd say that when there are >500 million desktops with multicore CPUs
(n>2) and a highly competitive software market running on them that
needs performance as a differentiator, SOMEONE will see a business
opportunity. What I see is that there are millions of coders out there
who are looking for a solution, and on the desktop the
synchronisation latencies will still make their problem much easier than
that of the supercomputer folks, as multicore systems will have
synchronisation latencies way lower than main-memory latency. Progress in
the other direction comes out of opportunity and necessity, not because of
what the previous trend was. When two cores are mainstream and 4 cores are
on the roadmap, people who need the power on the DESKTOP will go looking
for how to use more threads. And at some point there will be a parallel
language, out of necessity. Perhaps when there are 16 cores or more in the
mainstream. But for 16 cores to happen there has to be a situation where
going from 8->16 gives more performance than a doubling of L2 or L3
cache would in the average case. Yes, that's the real reason: there is not
much available for improving single-threaded performance while keeping
the x86 ISA, and scaling trends hurt even more.

> >But in many cases things will adapt. Secondly, there are multiple
> >processes for the CPUs to run. For instance, the OS, a P2P app and an MP3
> >player in the background while the actual game is in the foreground.
> 
> Your first sentence is optimistic, but not impossible.  Your second
> is what most of the experienced people have been saying.  Multiple
> cores will be used to run multiple processes (or semi-independent
> threads) on desktops, for the foreseeable future.

What I'm saying is that performance-limited applications on HIGH-END
systems will have semi-independent threads for 8 cores as soon as there
is motivation to utilize that. People keep looking for ways to make
things semi-independent, but at some point there has to be a better
way to write parallel code, or there will be nothing the extra
transistors can do to improve performance other than doubling the
on-die caches. In 6-10 years there will be 16 cores on the
desktop, unless there is some really disruptive technology that gives
us a MUCH better use for transistors. Like Intel making a 4-core EV8 for
the desktop ;) Or quantum computing getting to the desktop and making
normal semiconductor devices obsolete. [Very improbable ;]

Jouni Osmala
0
Reply josmala 9/6/2004 7:24:09 AM

> - If you want N digits of accuracy in the numerical calculations, you
> just need to use N digits of numerical precision, for O(N^2)
> computational effort.
> 
> - However, quantizing time produces errors; if you want to reduce
> these to N digits of accuracy, you need to use exp(N) time steps.
> 
> Is this right? Or is there any way to put a bound on the total error
> introduced by time quantization over many time steps?

However, in many cases you're not interested in the values of the result
variables as such: you want to categorize the outcome of the experiment
in some way - e.g., the final configuration the protein you are simulating
is in, and an approximate time until a stable configuration is reached.
Whether that result corresponds to exactly those initial conditions you
set up or any of the simulated intermediate conditions is not really relevant,
as long as you can convince yourself that through simulating some set of
initial conditions, you get a statistically accurate view of the outcome in
a qualitative sense. Thus, it would be very valuable to be able to say, for
instance, that the presence of a certain "wrong" configuration of the 
Alzheimer protein "catalyses" the folding of newly-made such protein into
the same "wrong" configuration - or to refute this hypothesis.

	Jan
0
Reply ISO 9/6/2004 8:39:52 AM

>>What kind of transaction - by itself - would take long enough to warrant
>>that?
> any transaction that goes off and does some data mining in the middle?

Ugh - another case of bad or incompetent design, then?

	Jan
0
Reply ISO 9/6/2004 8:41:46 AM

> In a sense, in
> this context, parallelism is easy and serializing is hard.

Yes, I quite agree. As a programmer, you need to make arbitrary decisions
on what to serialize and what not to serialize. It's similar to writing
explicit loops instead of using array language.
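
A tiny Python illustration of the loop-versus-array-language comparison,
with hypothetical variable names. Python's built-in map still evaluates
in order, so this only shows how much ordering each notation forces the
programmer to write down, not any actual parallel execution.

```python
values = [1, 2, 3, 4]

# Explicit loop: the programmer has fixed an order (first element, then
# the second, ...), even though nothing in the problem requires it.
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# Array style: the expression says only "square every element" and does
# not itself spell out an order in which to do it.
squares_map = list(map(lambda v: v * v, values))

print(squares_loop == squares_map)
```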

	Jan
0
Reply ISO 9/6/2004 8:46:20 AM

"Jouni Osmala" <josmala@cc.hut.fi> wrote in message
news:9538122f.0409052324.3ce9651c@posting.google.com...

snip

> There are already companies that use internal parallel languages for
> their consumer products to cope with SSE, 3DNow! and SMP. There ARE
> parallel languages that are easy to use for application development.

Can you give some examples of languages in each of these categories?  And
speculate about why, if they are easy to use and make parallel programming
much easier, they aren't the "standard" for high performance
computing?

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/6/2004 3:42:12 PM

In article <oz%_c.317571$OB3.179975@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
>
>"Jouni Osmala" <josmala@cc.hut.fi> wrote in message
>news:9538122f.0409052324.3ce9651c@posting.google.com...
>
>> There are already companies that use internal parallel languages for
>> their consumer products to cope with SSE, 3DNow! and SMP. There ARE
>> parallel languages that are easy to use for application development.
>
>Can you give some examples of languages in each of these categories?  And
>speculate about why, if they are easy to use and make parallel programming
>much easier, they aren't the "standard" for high performance
>computing?

Or even used significantly in that area!  Yes, PLEASE tell me about
those languages, as it really is rather relevant to my work.


Regards,
Nick Maclaren.
0
Reply nmm1 9/6/2004 6:31:49 PM

Robert Myers wrote:

(snip)

> I chose hot fusion as an example of a problem you'd _really_ like to be 
> able to solve, that you think that you ought to be able to solve, that 
> significant effort has gone into solving, but that you just haven't been 
> able to solve so far...to the point where it has become questionable 
> whether it is reasonable to expect a satisfactory solution within the
> foreseeable future.

Low oil prices over some years have decreased interest.

If oil prices stay near or higher than they are now, that would
be a big incentive to fusion work.

-- glen

0
Reply glen 9/6/2004 7:01:17 PM

In article <1u2%c.142744$mD.19432@attbi_s02>,
glen herrmannsfeldt  <gah@ugcs.caltech.edu> wrote:
>Robert Myers wrote:
>
>> I chose hot fusion as an example of a problem you'd _really_ like to be 
>> able to solve, that you think that you ought to be able to solve, that 
>> significant effort has gone into solving, but that you just haven't been 
>> able to solve so far...to the point where it has become questionable 
>> whether it is reasonable to expect a satisfactory solution within the
>> foreseeable future.
>
>Low oil prices over some years have decreased interest.
>
>If oil prices stay near or higher than they are now, that would
>be a big incentive to fusion work.

Our Lords and Masters in Washington and Whitehall are doing their
level best to arrange that.  But Robert Myers is right - there has
been enough work that there are grounds for believing that the
problem is effectively insoluble.


Regards,
Nick Maclaren.
0
Reply nmm1 9/6/2004 8:21:12 PM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:chigro$g4i$1@pegasus.csx.cam.ac.uk...
>  But Robert Myers is right - there has
> been enough work that there are grounds for believing that the
> problem is effectively insoluble.
>

You mean "at the present time", correct?   ;-).

Regards,
    Dean

>
> Regards,
> Nick Maclaren.


0
Reply Dean 9/6/2004 8:24:17 PM

In article <RH3%c.16887$Hw6.3652@newssvr27.news.prodigy.com>,
Dean Kent <dkent@realworldtech.com> wrote:
>"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>news:chigro$g4i$1@pegasus.csx.cam.ac.uk...
>
>>  But Robert Myers is right - there has
>> been enough work that there are grounds for believing that the
>> problem is effectively insoluble.
>
>You mean "at the present time", correct?   ;-).

No.  I said "insoluble", not "unsolved".


Regards,
Nick Maclaren.
0
Reply nmm1 9/6/2004 8:32:52 PM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:chihhk$gj7$1@pegasus.csx.cam.ac.uk...
>
> No.  I said "insoluble", not "unsolved".
>

So, it is your position that it cannot be solved now, nor anytime in the
future?

Regards,
    Dean

>
> Regards,
> Nick Maclaren.


0
Reply Dean 9/6/2004 8:47:55 PM

In article <%14%c.16897$iG6.5495@newssvr27.news.prodigy.com>,
Dean Kent <dkent@realworldtech.com> wrote:
>"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>news:chihhk$gj7$1@pegasus.csx.cam.ac.uk...
>>
>> No.  I said "insoluble", not "unsolved".
>
>So, it is your position that it cannot be solved now, nor anytime in the
>future?

Reread my posting.  I said that there is good evidence that may
well be the case.


Regards,
Nick Maclaren.
0
Reply nmm1 9/6/2004 9:25:16 PM

Dean Kent wrote:

> "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
> news:chihhk$gj7$1@pegasus.csx.cam.ac.uk...
> 
>>No.  I said "insoluble", not "unsolved".
>>
> 
> 
> So, it is your position that it cannot be solved now, nor anytime in the
> future?
> 

Hot fusion plainly has its ready defenders.  The expectations for 
programming multiprocessors are apparently low, with no apparent and 
certainly no strenuous dissent.

RM

0
Reply Robert 9/6/2004 10:41:38 PM

Robert Myers wrote:
> Dean Kent wrote:
> 
>> "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>> news:chihhk$gj7$1@pegasus.csx.cam.ac.uk...
>>
>>> No.  I said "insoluble", not "unsolved".
>>>
>>
>>
>> So, it is your position that it cannot be solved now, nor anytime in the
>> future?
>>
> 
> Hot fusion plainly has its ready defenders.  The expectations for 
> programming multiprocessors are apparently low, with no apparent and 
> certainly no strenuous dissent.

Doesn't seem to stop people from trying.  At least the cost of
admission is much lower than tokamak fusion research :-)

My previous message about these guys didn't elicit any response, 
which I thought odd.  I was sure it would raise at least a few 
hackles.  I think that I could do something useful with at least 
the small version:

http://www.orionmulti.com/products/

My original post made it as far as google, at least:
http://groups.google.com/groups?q=author:Andrew+author:Reilly&hl=en&lr=&ie=UTF-8&selm=2ppt79FniplsU1%40uni-berlin.de&rnum=1

Cheers,

-- 
Andrew
0
Reply Andrew 9/6/2004 11:21:04 PM

Andrew Reilly wrote:

<snip>

> 
> My previous message about these guys didn't elicit any response, which I 
> thought odd.  I was sure it would raise at least a few hackles.  I think 
> that I could do something useful with at least the small version:
> 
> http://www.orionmulti.com/products/
> 
> My original post made it as far as google, at least:
> http://groups.google.com/groups?q=author:Andrew+author:Reilly&hl=en&lr=&ie=UTF-8&selm=2ppt79FniplsU1%40uni-berlin.de&rnum=1 
> 

What's the figure of merit that makes this product attractive?  It's an 
x-86 cluster with gigabit ethernet interconnect.

RM

0
Reply Robert 9/7/2004 12:24:29 AM

Robert Myers wrote:

> Andrew Reilly wrote:
> 
> <snip>
> 
>>
>> My previous message about these guys didn't elicit any response, which 
>> I thought odd.  I was sure it would raise at least a few hackles.  I 
>> think that I could do something useful with at least the small version:
>>
>> http://www.orionmulti.com/products/
>>
>> My original post made it as far as google, at least:
>> http://groups.google.com/groups?q=author:Andrew+author:Reilly&hl=en&lr=&ie=UTF-8&selm=2ppt79FniplsU1%40uni-berlin.de&rnum=1 
>>
> 
> 
> What's the figure of merit that makes this product attractive?  It's an 
> x-86 cluster with gigabit ethernet interconnect.

Flops/dollar, perhaps, but mostly flops/watt.  Ultimately 
flops/standard-wall-socket.  Flops/cubic meter are probably pretty 
good too.  Oh, and you get to run your x86 cluster Linux code on 
it, rather than recoding for the DSP farms that are the other 
alternative for that sort of compute/watt or compute/volume.  That 
includes double precision maths, which most of the DSP farms 
aren't good at.

Yeah, I do think that in-the-box gigabit ethernet was a weird
choice, as I said in my previous message.  I wonder if you could
usefully use something like the HyperTransport switches that have
been mentioned here recently in a cache-incoherent mode, instead?
That could be even more interesting.

Cheers,

-- 
Andrew
0
Reply Andrew 9/7/2004 1:29:54 AM

On Fri, 03 Sep 2004 10:48:21 +1000, Andrew Reilly
<areilly-newspost@areilly.bpc-users.org> wrote:
>
>Russell Wallace wrote:
>> On 2 Sep 2004 19:27:15 GMT, nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
>> 
>> 
>>>In article <s718ybshbf6.fsf@beryl.CS.Berkeley.EDU>,
>>>David Gay  <dgay@beryl.CS.Berkeley.EDU> wrote:
>>>Highly parallel systems were specialist in 1974, and they are STILL
>>>specialist.  We know how to do a LOT more in parallel than we
>>>did then, but it is still a small proportion of what we would like
>>>to do.  Still, it keeps people like me off the streets :-)
>> 
>> What are some examples of important and performance-limited
>> computation tasks that aren't run in parallel?
>
>I.e., that run fastest on a one-processor Itanium or Opteron or 
>Xeon workstation...
>
>On the other hand, who isn't drooling over these:
>
>http://www.orionmulti.com/products/

I, for one, am *not* drooling.  About the only thing this system has
going for it is a fairly low power consumption for the performance it
gets, but even then we're talking about ~200W vs. ~400W.  Once Intel
and AMD get their dual-core chips out then this advantage will
disappear.

12 processors seems fast until you realize that the processors max out
at about 1/3rd of the performance of top-end processors and are often
down closer to 1/6th or worse!  Even for their Linpack scores (a
fairly best-case sort of situation for Transmeta chips) you could
match the performance of the 12-processor system with a 4-processor
Opteron or Xeon setup.
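
A back-of-envelope check of that comparison, as a sketch only: it assumes perfect linear scaling across processors (no interconnect or memory losses), which is an idealisation and not something claimed above. The function name is made up for illustration.

```python
from fractions import Fraction

def equivalent_top_end(n_slow, relative_speed):
    """Top-end processors needed to match n_slow slower ones.

    Assumes perfect linear scaling, which real clusters do not achieve.
    """
    return n_slow * relative_speed

# 12 processors at 1/3 of top-end speed (the best case quoted above)
best = equivalent_top_end(12, Fraction(1, 3))
# 12 processors at 1/6 of top-end speed (the "closer to 1/6th" case)
worse = equivalent_top_end(12, Fraction(1, 6))

print(best, worse)  # 4 2
```

So under that generous assumption the 12-processor box lands somewhere between a 2-processor and a 4-processor top-end system.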

>Have to wonder why all of those nodes are hooked together (inside
>the box, presumably on the motherboard) with gigabit ethernet,
>rather than something like the Horus chipset that's been spoken
>about here recently, given that the processors have HyperTransport
>interfaces.  My guess is that it let them offload system software
>development onto the open source cluster community, without having
>to even do device drivers.

Both software and hardware development are being offloaded here.  It's
the cheap solution that will kinda-sorta work ok for the intended
task.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
0
Reply Tony 9/7/2004 4:13:14 AM

In article <CI5%c.44980$3l3.13915@attbi_s03>,
Robert Myers <rmyers1400@comcast.net> writes:
|> > 
|> >>No.  I said "insoluble", not "unsolved".
|> > 
|> > So, it is your position that it cannot be solved now, nor anytime in the
|> > future?
|> 
|> Hot fusion plainly has its ready defenders.  The expectations for 
|> programming multiprocessors are apparently low, with no apparent and 
|> certainly no strenuous dissent.

Eh?  I will dissent, strenuously, against such a sweeping statement!
My comment was about such programming by the mass of 'ordinary'
programmers, not about its use in HPC and embedded work (including
games).

And then there is Jouni Osmala ....


Regards,
Nick Maclaren.
0
Reply nmm1 9/7/2004 8:12:13 AM

Nick Maclaren wrote:

> In article <RH3%c.16887$Hw6.3652@newssvr27.news.prodigy.com>,
> Dean Kent <dkent@realworldtech.com> wrote:
> 
>>"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>>news:chigro$g4i$1@pegasus.csx.cam.ac.uk...
>>
>>
>>> But Robert Myers is right - there has
>>>been enough work that there are grounds for believing that the
>>>problem is effectively insoluble.
>>
>>You mean "at the present time", correct?   ;-).
> 
> 
> No.  I said "insoluble", not "unsolved".

You might be right, but that's still in the 'famous last words'
category. :-)

I believe the relevant quote is something like this:

"When an established expert in a field tells you that something is
possible, he is almost certainly right, but when he tells you that
something is impossible, he is very likely wrong."

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
0
Reply Terje 9/7/2004 10:16:58 AM

In article <chk1qr$dii$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
|> > 
|> > No.  I said "insoluble", not "unsolved".
|> 
|> You might be right, but that's still in the 'famous last words'
|> category. :-)

When the water in my kettle undergoes spontaneous cold fusion,
I will undoubtedly have spoken my last words :-)

But PLEASE remember that I said:

    There has been enough work that there are grounds for believing
    that the problem is effectively insoluble.

That is a MUCH weaker statement than saying that it is insoluble.

|> I believe the relevant quote is something like this:
|>
|> "When an established expert in a field tells you that something is
|> possible, he is almost certainly right, but when he tells you that
|> something is impossible, he is very likely wrong."

Yes.  Apply that recursively :-)

More seriously, (a) I am not an expert (and a neutral lay appraiser
is often more likely to be correct than an expert), (b) far too many
experts in this field have been telling us for 50 years that all
they need for a solution is just a little more time and money and
(c) there are lashings of counter-examples to Clarke's Law.  What
he SHOULD have said is more like:

    When an established expert in a field tells you that something is
    possible, he is almost certainly right, but when he tells you
    that something is impossible, without giving a clear, simple,
    draft proof why it is, he is very likely wrong.

There are people who have given such proofs, and have been wrong,
but it is relatively rare.


Regards,
Nick Maclaren.
0
Reply nmm1 9/7/2004 11:04:06 AM

Terje Mathisen wrote:
> Nick Maclaren wrote:
> 
>> In article <RH3%c.16887$Hw6.3652@newssvr27.news.prodigy.com>,
>> Dean Kent <dkent@realworldtech.com> wrote:
>>
>>> "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
>>> news:chigro$g4i$1@pegasus.csx.cam.ac.uk...
>>>
>>>
>>>> But Robert Myers is right - there has
>>>> been enough work that there are grounds for believing that the
>>>> problem is effectively insoluble.
>>>
>>>
>>> You mean "at the present time", correct?   ;-).
>>
>>
>>
>> No.  I said "insoluble", not "unsolved".
> 
> 
> You might be right, but that's still in the 'famous last words'
> category. :-)
>
> I believe the relevant quote is something like this:
>
> "When an established expert in a field tells you that something is
> possible, he is almost certainly right, but when he tells you that
> something is impossible, he is very likely wrong."
> 

I'm sure I don't have to say this for your benefit, but, as to what I 
said on the subject, I really want to stick with my own exact words, 
which I chose with some care.

RM

0
Reply Robert 9/7/2004 11:28:20 AM

Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
+---------------
| I believe the relevant quote is something like this:
|
| "When an established expert in a field tells you that something is
| possible, he is almost certainly right, but when he tells you that
| something is impossible, he is very likely wrong."
+---------------

You're probably thinking of Clarke's First Law:

    When a distinguished but elderly scientist states that something
    is possible, he is almost certainly right. When he states that
    something is impossible, he is very probably wrong.
	Arthur C Clarke, "Profiles of the Future" (1962; rev. 1973)
	``Hazards of Prophecy: The Failure of Imagination''

But one should always temper that with Isaac Asimov's comment:

    When, however, the lay public rallies round an idea that is
    denounced by distinguished but elderly scientists and supports
    that idea with great fervor and emotion--the distinguished but
    elderly scientists are then, after all, probably right.
	Isaac Asimov (1920-1992), in "Fantasy & Science Fiction" 1977
	[In answer to Clarke's First Law]


-Rob

Refs: <http://www.phantazm.dk/sf/arthur_c_clarke/s.htm>
      <http://www.xs4all.nl/~jcdverha/scijokes/8_4.html>
      and many others...

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607


0
Reply rpw3 9/7/2004 11:35:56 AM

Rob Warnock wrote:

> Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
> +---------------
> | I believe the relevant quote is something like this:
> |
> | "When an established expert in a field tells you that something is
> | possible, he is almost certainly right, but when he tells you that
> | something is impossible, he is very likely wrong."
> +---------------
> 
> You're probably thinking of Clarke's First Law:
> 
>     When a distinguished but elderly scientist states that something
>     is possible, he is almost certainly right. When he states that
>     something is impossible, he is very probably wrong.
> 	Arthur C Clarke, "Profiles of the Future" (1962; rev. 1973)
> 	``Hazards of Prophecy: The Failure of Imagination''

Right, thanks for the reference!
> 
> But one should always temper that with Isaac Asimov's comment:
> 
>     When, however, the lay public rallies round an idea that is
>     denounced by distinguished but elderly scientists and supports
>     that idea with great fervor and emotion--the distinguished but
>     elderly scientists are then, after all, probably right.
> 	Isaac Asimov (1920-1992), in "Fantasy & Science Fiction" 1977
> 	[In answer to Clarke's First Law]

Disproof by popular acclaim?

:-)

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
0
Reply Terje 9/7/2004 11:54:05 AM

Nick Maclaren wrote:

> In article <CI5%c.44980$3l3.13915@attbi_s03>,
> Robert Myers <rmyers1400@comcast.net> writes:
> |> > 
> |> >>No.  I said "insoluble", not "unsolved".
> |> > 
> |> > So, it is your position that it cannot be solved now, nor anytime in the
> |> > future?
> |> 
> |> Hot fusion plainly has its ready defenders.  The expectations for 
> |> programming multiprocessors are apparently low, with no apparent and 
> |> certainly no strenuous dissent.
> 
> Eh?  I will dissent, strenuously, against such a sweeping statement!
> My comment was about such programming by the mass of 'ordinary'
> programmers, not about its use in HPC and embedded work (including
> games).
> 
> And then there is Jouni Osmala ....
> 

I wouldn't want to discourage Jouni or anyone else from optimism, even 
overoptimism, even naive overoptimism.

I was, on the other hand, trying to provoke a dissent from you to the 
extent that you would say what you do think is possible.

If SGI can get 1024 Itanium processors to cooperate in a single Linux
system image, then somebody must know what they're doing.
NASA Ames apparently has enough confidence in its ability to program big
SMP boxes that it is buying 20 with 512 processors apiece.

On the other hand, the suggestion was recently made here that maybe we 
should just banish SMP as an unacceptable programming style (meaning, I 
think, that multiprocessor programming should not be done in a 
globally-shared memory space, or at least that the shared space should 
be hidden behind something like MPI).

The situation is _so_ bad that it doesn't seem embarrassing, apparently, 
for Orion Multisystems to take a lame processor, to hobble it further 
with a lame interconnect, and to call it a workstation.  If the future 
of computing really is slices of Wonder Bread in a plastic bag and not a 
properly cooked meal, then the Orion box makes some sense.  Might as 
well get used to it and start programming on an architecture that at 
least has the right topology and instruction set, as I believe Andrew 
Reilly is suggesting.

If big computers are to be used to solve problems, they are inevitably 
going to fall into the hands of people who are more interested in 
solving problems than they are in the computers...as should be.  If we 
really can't conjure tools for programming them that are reliable in the 
hands of relative amateurs, I see it as a more pressing issue than not 
being able to do hot fusion (the prospects for wind and solar having 
come along very nicely).

RM

0
Reply Robert 9/7/2004 12:05:24 PM

In article <8uh%c.47892$3l3.16380@attbi_s03>,
Robert Myers <rmyers1400@comcast.net> writes:
|> 
|> I was, on the other hand, trying to provoke a dissent from you to the 
|> extent that you would say what you do think is possible.
|> 
|> If SGI can get 1024 Itanium processors to cooperate in a single Linux
|> system image, then somebody must know what they're doing.
|> NASA Ames apparently has enough confidence in its ability to program big
|> SMP boxes that it is buying 20 with 512 processors apiece.

Yes, it can be done.

|> On the other hand, the suggestion was recently made here that maybe we 
|> should just banish SMP as an unacceptable programming style (meaning, I 
|> think, that multiprocessor programming should not be done in a 
|> globally-shared memory space, or at least that the shared space should 
|> be hidden behind something like MPI).

My view is that, if it is to be done, it should be done properly.
And currently, it isn't.  There are hardware issues where the
primitives provided are unsuitable, the operating system ones are
definitely unsuitable, and the language situation beggars belief.
All soluble, in theory.

Whether it is the BEST approach is unclear.  Explicit synchronisation
of incoherent shared memory is a good model, too, as is message
passing.  I can live with any of them, and so can most good parallel
programmers.

|> If big computers are to be used to solve problems, they are inevitably 
|> going to fall into the hands of people who are more interested in 
|> solving problems than they are in the computers...as should be.  If we 
|> really can't conjure tools for programming them that are reliable in the 
|> hands of relative amateurs, I see it as a more pressing issue than not 
|> being able to do hot fusion (the prospects for wind and solar having 
|> come along very nicely).

And we need to start by developing some defined parallel programming
languages and paradigms that are acceptable to such users.


Regards,
Nick Maclaren.
0
Reply nmm1 9/7/2004 12:21:51 PM


Robert Myers wrote:
> 
> On the other hand, the suggestion was recently made here that maybe we
> should just banish SMP as an unacceptable programming style (meaning, I
> think, that multiprocessor programming should not be done in a
> globally-shared memory space, or at least that the shared space should
> be hidden behind something like MPI).

With the latter presenting a different api to the programmer or do you
mean doing the shared memory virtualization in software rather than hardware?
Distributed algorithms are attractive from a hardware point of view because
they force some nastier error checking into software.  Do you really want
programmers who can't handle shared memory doing distributed programming?

Joe Seigh
0
Reply Joe 9/7/2004 12:39:17 PM

Nick Maclaren wrote:

<snip>

> All soluble, in theory.

<snip>

> And we need to start by developing some defined parallel programming
> languages and paradigms that are acceptable to such users.

It _does_ seem rather like hot fusion.

RM

0
Reply Robert 9/7/2004 12:58:20 PM

Robert Myers wrote:
> On the other hand, the suggestion was recently made here that maybe we 
> should just banish SMP as an unacceptable programming style (meaning, I 
> think, that multiprocessor programming should not be done in a 
> globally-shared memory space, or at least that the shared space should 
> be hidden behind something like MPI).

I wonder how much SMP style, and the uniform address spaces that 
go with it, can be hidden under VM, pointer swizzling and layers 
of software-based caching.  Probably not much, really.

> The situation is _so_ bad that it doesn't seem embarrassing, apparently, 
> for Orion Multisystems to take a lame processor, to hobble it further 
> with a lame interconnect, and to call it a workstation.  If the future 
> of computing really is slices of Wonder Bread in a plastic bag and not a 
> properly cooked meal, then the Orion box makes some sense.  Might as 
> well get used to it and start programming on an architecture that at 
> least has the right topology and instruction set, as I believe Andrew 
> Reilly is suggesting.

Well, I think that the specific instruction set is probably a red 
herring.  I reckon that an object code specifically designed to be 
a target for JIT compilation to a register-to-register VLIW engine 
of indeterminate dimensions will turn out to be better ultimately. 
  There are projects moving in that direction: 
http://llvm.cs.uiuc.edu/, and from long, long ago: TAO-group's VM.
Stack-based VM's like JVM and MS-IL might or might not be the 
right answer.  I guess we'll find out soon enough.

Code portability and density is important, of course, but the main 
thing is winning back with dynamic recompilation some of the 
unknowables that plain VLIW in-order RISC visits on code.

The Transmeta Efficeon is just the first widely available
processor with embedded-levels of integration (memory and some
peripheral interfaces and HyperTransport for other peripherals) and
power consumption that can do pipelined double-precision floating
point multiply/additions at two flops/clock at an interesting
clock rate.  1.5GHz is significantly faster than the DSP
competitors.  The TI C6700 tops out at 300MHz and only does single
precision at the core rate.  PowerPC+Altivec doesn't have the
memory controller or the peripheral interconnect to drive up the
areal density.  The BlueGene core is about the right shape, but I
haven't seen any industrial/embedded boxes with a few dozen of
them in it, yet.  The MIPS and ARM processors that have the
integration don't have the floating point chops.  Modern versions
of VIA C3 might be getting interesting (or not: I haven't looked
at their double-precision performance), but have neither the
memory controller nor the HyperTransport, nor quite the MHz.  Of
course, Opterons fit that description too, and clock much faster,
but I thought that they consumed considerably more power, too. 
Maybe their MIPS/watt is closer than I've given it credit for.
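
As a rough check on the numbers quoted above, peak throughput is just clock rate times flops per clock. The sketch below uses only the figures given in this post (1.5 GHz and two flops/clock for the Transmeta part, 300 MHz for the DSP); the helper name is hypothetical, and a peak figure says nothing about sustained performance or about watts.

```python
def peak_gflops(clock_hz, flops_per_clock):
    """Theoretical peak in GFLOPS: clock rate times flops issued per clock."""
    return clock_hz * flops_per_clock / 1e9

# Double precision, per processor, from the figures quoted in the post
transmeta_peak = peak_gflops(1.5e9, 2)

# Raw clock advantage over the 300 MHz DSP mentioned in the post
clock_ratio = 1.5e9 / 300e6

print(transmeta_peak, clock_ratio)  # 3.0 5.0
```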


> If big computers are to be used to solve problems, they are inevitably 
> going to fall into the hands of people who are more interested in 
> solving problems than they are in the computers...as should be.  If we 
> really can't conjure tools for programming them that are reliable in the 
> hands of relative amateurs, I see it as a more pressing issue than not 
> being able to do hot fusion (the prospects for wind and solar having 
> come along very nicely).

For such people, I suspect that the appropriate level of 
programming is that of science fiction starship bridge computers: 
"here's what I want: make it so".  I wonder if anyone has looked 
at something like simulated annealing or genetic optimisation to 
drive memory access patterns revealed by problems expressed at an 
APL or Matlab (or higher) level.  For most of the "big science" 
problems, I suspect that the "what I want" is not terribly 
difficult to express (once you've done the science-level thinking, 
of course).  The tricky part, at the moment, is having a human 
understand the redundancies and dataflows (and numerical 
stability) issues well enough to map the direct-form of the 
solution to something efficient (on one or on a bunch of 
processors).  I think that from a sufficient altitude, that looks 
like an annealing problem, with dynamic recompilation being the 
lower tier mechanism of the optimisation target.  The lucky thing 
about "big science" problems is that by definition they have big 
data, and run for a long time.  That time and that amount of data 
might as well be used by the machine itself to try to speed the 
process up as by a bunch of humans attempting the same thing 
without as intimate access to the actual values in the data sets 
and computations.

It's late, I've had a few glasses of a nice red and I'm rambling. 
  Sorry about that.  Hope the ramble sparks some other ideas.

-- 
Andrew
0
Reply Andrew 9/7/2004 1:10:16 PM

Joe Seigh wrote:
> 
> Robert Myers wrote:
> 
>>On the other hand, the suggestion was recently made here that maybe we
>>should just banish SMP as an unacceptable programming style (meaning, I
>>think, that multiprocessor programming should not be done in a
>>globally-shared memory space, or at least that the shared space should
>>be hidden behind something like MPI).
> 
> 
> With the latter presenting a different api to the programmer or do you
> mean doing the shared memory virtualization in software rather than hardware?

I mean that one writes modules as if for a von Neumann 
architecture--never any possibility of a variable being corrupted 
because of concurrency.  Data that fall outside the purview of the 
module are received from or sent to an outside agent through a perfectly 
encapsulated interface.  How that agent does its work, whether in 
hardware or software, is immaterial, so long as it does it according to 
specification without intervention or oversight from the application 
programmer.
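
One way to picture that, as a sketch under stated assumptions and not a claim about any real system: the module below is ordinary sequential code that alone owns its variable, and everything from outside arrives through a queue. Python threads and `queue.Queue` stand in for the "outside agent", whatever it really is; the names `counter_module`, `client` and `run` are invented for the example.

```python
import queue
import threading

def counter_module(inbox, replies):
    """Sequential module: it alone owns `count`, so no locking appears here."""
    count = 0
    while True:
        msg = inbox.get()
        if msg == "stop":
            replies.put(count)
            return
        count += msg

def client(inbox, n):
    """An outside party that can only send messages, never touch `count`."""
    for _ in range(n):
        inbox.put(1)

def run(clients=4, per_client=1000):
    inbox, replies = queue.Queue(), queue.Queue()
    module = threading.Thread(target=counter_module, args=(inbox, replies))
    module.start()
    workers = [threading.Thread(target=client, args=(inbox, per_client))
               for _ in range(clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    inbox.put("stop")  # sent only after every client has finished
    total = replies.get()
    module.join()
    return total

print(run())  # 4000
```

The concurrency has not gone away; it has been pushed into the queue, which is exactly the part the application programmer is supposed not to see.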

> Distributed algorithms are attractive from a hardware point of view because
> they force some nastier error checking into software.  Do you really want
> programmers who can't handle shared memory doing distributed programming?
> 

I believe that it is possible to write formally incorrect programs in 
any language currently in practical use.  It seems likely that anyone 
using such a language, no matter how competent, will eventually write a 
formally incorrect program and introduce a bug that will prove to be 
very hard to find.

Artificial boundaries (separate processors, separate memory spaces, 
separate processes, separate threads, separate system images) might help 
in debugging and create an illusion of safety, but, without formal 
verification, an illusion is what it is.

RM

0
Reply Robert 9/7/2004 1:29:19 PM

Joe Seigh wrote:
> 
> Robert Myers wrote:
> 
>>On the other hand, the suggestion was recently made here that maybe we
>>should just banish SMP as an unacceptable programming style (meaning, I
>>think, that multiprocessor programming should not be done in a
>>globally-shared memory space, or at least that the shared space should
>>be hidden behind something like MPI).
> 
> 
> With the latter presenting a different api to the programmer or do you
> mean doing the shared memory virtualization in software rather than hardware?
> Distributed algorithms are attractive from a hardware point of view because
> they force some nastier error checking into software.  Do you really want
> programmers who can't handle shared memory doing distributed programming?

Too late to worry about whether we want them to be doing that kind
of stuff, they already have... Outlook has been terrorizing the
Internet for many years now, surely a decade by now in fact. :)

Cheers,
Rupert

0
Reply Rupert 9/7/2004 1:57:11 PM

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:nTm_c.371401$%_6.4568@attbi_s01...
> Stephen Fuld wrote:
>
> >
> > I freely admit that I may be way off base here, but I am very much
> > reminded of an analogous situation in a somewhat earlier age.  Perhaps it
> > can best be described with the paraphrase "SMP considered harmful to
> > parallel programming progress".  That is, SMP is like the use of the Goto
> > statement in that it is very useful in modest sized applications (think
> > perhaps quick and dirty) but as things scale up, neither works well and
> > both seem to have unintended consequences that make further progress much
> > harder.  Do we need to bite the bullet and "throw out" the SMP code, just
> > like we mostly did with goto filled code and thus regress in order to make
> > more progress later?  I very well remember the resistance to eliminating
> > goto, the projected cost in terms of inefficient programs, the cost of
> > rewriting, etc.  But now, few would go back.
> >
> > Just a thought, but I find it interesting.
> >
>
> <with respect>
>
> No. No. No. No. No. No.
>
> Single-processor system images harmful to parallel programming.
>
> SPSI = Everything has to cross user/kernel space boundaries and
> communication stack for any kind of nontrivial parallelism.

The two seem related.  While you can certainly implement message passing on
an SMP hardware design, you can't realistically use shared memory semantics
on things like clusters.  NUMAs are sort of in the middle.

Perhaps I should refine my claim.  How about something like software designs
that assume shared memory semantics will have to go?  A good piece of
software would work well and without any source changes on a single or
multiple (up to some limit) CPUs.

> Better ways to do it than classic SMP?  I'm sure there are, but tens of
> thousands of instances of the Linux kernel aren't the answer, either.

Of course!  But just as it took a while, and a few failed attempts, to come
up with a successful strategy to get rid of GOTOs (remember "Structured
Programming"?), it will probably take the same to come up with a reasonable
successor for existing parallel programming paradigms.  And that will
require both hardware and software to make it work well. (In that I agree
with Nick).  I am thinking it is something with a better interconnect
architecture than using NICs of some kind with an I/O type interface.

I am intrigued by what was done with the transputer in that area and also by
some of the hypercube designs like NCube.  We need simple primitives to get
information from one "process" to another, with no code changes no matter
whether the two processes are on the same CPU or not.  Then the underlying
physical mechanism can be optimized for particular sizes and technologies.
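
A minimal sketch of what such a primitive might look like from the program's side, with everything in it hypothetical: `summer` is written once against `send`/`recv`, and the transport underneath is swapped without touching it. Both transports here run inside one Python process for simplicity, though a pipe could equally link two processes. It says nothing about how transputer or NCube links actually worked.

```python
import queue
import threading
from multiprocessing import Pipe

class QueueChannel:
    """In-process transport: a thread-safe queue."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, item):
        self._q.put(item)
    def recv(self):
        return self._q.get()

class PipeChannel:
    """OS-level transport: a one-way pipe, usable between processes too."""
    def __init__(self):
        self._rx, self._tx = Pipe(duplex=False)
    def send(self, item):
        self._tx.send(item)
    def recv(self):
        return self._rx.recv()

def summer(inbox, outbox):
    """Worker: add numbers until None arrives, then report the total."""
    total = 0
    while True:
        item = inbox.recv()
        if item is None:
            break
        total += item
    outbox.send(total)

def run(channel_type):
    """Drive the same worker over whichever transport is requested."""
    inbox, outbox = channel_type(), channel_type()
    worker = threading.Thread(target=summer, args=(inbox, outbox))
    worker.start()
    for n in (1, 2, 3, 4):
        inbox.send(n)
    inbox.send(None)
    result = outbox.recv()
    worker.join()
    return result

print(run(QueueChannel), run(PipeChannel))  # 10 10
```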

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/7/2004 5:07:04 PM

Nick,

Many scientists disagree with you, there are
6 different hot fusion projects at MIT alone.

http://web.mit.edu/ned/www/research/fusion&plasmaphysics.html

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:chikjs$iuk$1@pegasus.csx.cam.ac.uk...
> In article <%14%c.16897$iG6.5495@newssvr27.news.prodigy.com>,
> Dean Kent <dkent@realworldtech.com> wrote:
> >"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
> >news:chihhk$gj7$1@pegasus.csx.cam.ac.uk...
> >>
> >> No.  I said "insoluble", not "unsolved".
> >
> >So, it is your position that it cannot be solved now, nor anytime in the
> >future?
>
> Reread my posting.  I said that there is good evidence that may
> well be the case.
>
>
> Regards,
> Nick Maclaren.


0
Reply spinlock 9/7/2004 7:35:32 PM

Stephen Fuld wrote:

> "Robert Myers" <rmyers1400@comcast.net> wrote in message
> news:nTm_c.371401$%_6.4568@attbi_s01...
> 

<snip>

>>
>>Single-processor system images harmful to parallel programming.
>>
>>SPSI = Everything has to cross user/kernel space boundaries and
>>communication stack for any kind of nontrivial parallelism.
> 
> 
> The two seem related.  While you can certainly implement message passing on
> an SMP hardware design, you can't realistically use shared memory semantics
> on things like clusters.  NUMAs are sort of in the middle.
> 
> Perhaps I should refine my claim.  How about something like software designs
> that assume shared memory semantics will have to go?  A good piece of
> software would work well and without any source changes on a single or
> multiple (up to some limit) CPUs.
> 

That would seem to close off some of the most interesting possibilities 
for Whitefield (four Banias cores, shared L2).

RM

0
Reply Robert 9/7/2004 7:42:59 PM

> I think that EMACS is going to be one of the desktop applications that are
> going to be parallelized well.

That statement is simply hilarious,


        Stefan "an Emacs maintainer"
0
Reply Stefan 9/7/2004 7:44:46 PM

In article <2q6griFr7i3oU1@uni-berlin.de>, spinlock <NullVoid@att.net> wrote:
>>
>> >> No.  I said "insoluble", not "unsolved".
>> >
>> >So, it is your position that it cannot be solved now, nor anytime in the
>> >future?
>>
>> Reread my posting.  I said that there is good evidence that may
>> well be the case.
>
>Many scientists disagree with you, there are
>6 different hot fusion projects at MIT alone.

This is why it is such an appropriate analogy.  A hell of a lot of
money has been poured into it over many decades; there has been
some progress; there are a lot of people who claim that there is
a breakthrough just round the corner, and that has been true all along;
there are some very solid analyses that cast such claims into doubt;
the benefits of a breakthrough would be considerable; and there are
probably a few more similarities.

I am not intending to hold my breath in either case.


Regards,
Nick Maclaren.
0
Reply nmm1 9/7/2004 8:02:43 PM

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:7bo%c.49387$3l3.3074@attbi_s03...
> Stephen Fuld wrote:
>
> > "Robert Myers" <rmyers1400@comcast.net> wrote in message
> > news:nTm_c.371401$%_6.4568@attbi_s01...
> >
>
> <snip>
>
> >>
> >>Single-processor system images harmful to parallel programming.
> >>
> >>SPSI = Everything has to cross user/kernel space boundaries and
> >>communication stack for any kind of nontrivial parallelism.
> >
> >
> > The two seem related.  While you can certainly implement message passing
> > on an SMP hardware design, you can't realistically use shared memory
> > semantics on things like clusters.  NUMAs are sort of in the middle.
> >
> > Perhaps I should refine my claim.  How about something like software
> > designs that assume shared memory semantics will have to go?  A good piece
> > of software would work well and without any source changes on a single or
> > multiple (up to some limit) CPUs.
> >
>
> That would seem to close off some of the most interesting possibilities
> for Whitefield (four Banias cores, shared L2).

I don't think so.  Any implementation "under the covers" would be fine, and
I perceive that shared memory could very easily be used to implement some
sort of message passing (NOT MPI) mechanism underneath.  What I am trying to
avoid is all the programming difficulties of locks, protected variables,
etc. that seem to give people so many headaches and don't scale very well.
I am thinking of something like some transaction systems that break up the
work into rather small chunks and pass the results of one chunk/transaction
to another.  The first is then free to start work on another piece of work
in parallel with the second part working on the first.  It is sort of like
David DiNucci's software cabling, though there are parts of that where I
have some problems seeing how it would work well.  Thus the software
wouldn't rely on an architecture like you described, but it could be
implemented on such an architecture pretty easily.
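
As a rough illustration only (this is not DiNucci's software cabling, nor
any real transaction monitor), the chunk-passing idea can be sketched in
Python with two stages joined by queues.  The names `stage`, `run_pipeline`
and the `DONE` marker are invented for the sketch, and threads on a single
machine stand in for whatever transport would sit underneath.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not from any library


def stage(func, inbox, outbox):
    """Apply func to each chunk taken from inbox; pass the result to outbox."""
    while True:
        chunk = inbox.get()
        if chunk is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(chunk))


def run_pipeline(chunks, first, second):
    """Run chunks through two overlapping stages; user code holds no locks."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(first, q_in, q_mid)),
        threading.Thread(target=stage, args=(second, q_mid, q_out)),
    ]
    for w in workers:
        w.start()
    for c in chunks:
        q_in.put(c)
    q_in.put(DONE)

    results = []
    while True:
        item = q_out.get()
        if item is DONE:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

Once the first stage has handed a chunk on, it is free to pick up the next
one while the second stage is still busy.  The only synchronisation lives
inside the queues, which is the "under the covers" part; whether those
queues are shared memory or a link is invisible to `first` and `second`.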

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/7/2004 8:34:00 PM

Stephen Fuld wrote:
> "Robert Myers" <rmyers1400@comcast.net> wrote in message
> news:nTm_c.371401$%_6.4568@attbi_s01...
> 
>>Stephen Fuld wrote:
>>
>>
>>>I freely admit that I may be way off base here, but I am very much
>>>reminded of an analogous situation in a somewhat earlier age.  Perhaps it
>>>can best be described with the paraphrase "SMP considered harmful to
>>>parallel programming progress".  That is, SMP is like the use of the Goto
>>>statement in that it is very useful in modest sized applications (think
>>>perhaps quick and dirty) but as things scale up, neither works well and
>>>both seem to have unintended consequences that make further progress much
>>>harder.  Do we need to bite the bullet and "throw out" the SMP code, just
>>>like we mostly did with goto filled code and thus regress in order to make
>>>more progress later?  I very well remember the resistance to eliminating
>>>goto, the projected cost in terms of inefficient programs, the cost of
>>>rewriting, etc.  But now, few would go back.
>>>
>>>Just a thought, but I find it interesting.
>>>
>>
>><with respect>
>>
>>No. No. No. No. No. No.
>>
>>Single-processor system images harmful to parallel programming.
>>
>>SPSI = Everything has to cross user/kernel space boundaries and
>>communication stack for any kind of nontrivial parallelism.
> 
> 
> The two seem related.  While you can certainly implement message passing on
> an SMP hardware design, you can't realistically use shared memory semantics
> on things like clusters.  NUMAs are sort of in the middle.
> 
> Perhaps I should refine my claim.  How about something like software designs
> that assume shared memory semantics will have to go?  A good piece of
> software would work well and without any source changes on a single or
> multiple (up to some limit) CPUs.

I love the idea of making truly portable || (as Eugene would put
it) source, and I think it is possible...

The big problem as I see it is the load & run bit.  To make that
truly viable I have come to the conclusion that you'd need the
OS & SysAdmin to make calls on how the workload is distributed in
terms of data & processes.

Even worse, in a general purpose system you'd have to do this in
real time, and of course for extra credit you'd have to avoid the
profile driven optimisation pitfalls.

[SNIP]

> I am intrigued by what was done with the transputer in that area and also by
> some of the hypercube designs like NCube.  We need simple primitives to get
> information from one "process" to another, with no code changes no matter
> whether the two processes are on the same CPU or not.  Then the underlying
> physical mechanism can be optimized for particular sizes and technologies.

That's old hat. To reiterate : The problem is then taking that
network of processes and mapping it onto a tuple of machine
resources & dataset. That tuple can change over time too. :(

This is something that has been nagging at the back of my mind
for the past 6 years or so, but I've not come up with any nice
solutions yet. I'm getting to the point where I really need to
just bite the bullet and put together a playpen. :)

Cheers,
Rupert

0
Reply Rupert 9/7/2004 11:01:03 PM

"Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
message news:1094598063.600563@teapot.planet.gong...
> Stephen Fuld wrote:
> > "Robert Myers" <rmyers1400@comcast.net> wrote in message
> > news:nTm_c.371401$%_6.4568@attbi_s01...
> >
> >>Stephen Fuld wrote:
> >>
> >>
> >>>I freely admit that I may be way off base here, but I am very much
> >>>reminded of an analogous situation in a somewhat earlier age.  Perhaps
> >>>it can best be described with the paraphrase "SMP considered harmful to
> >>>parallel programming progress".  That is, SMP is like the use of the
> >>>Goto statement in that it is very useful in modest sized applications
> >>>(think perhaps quick and dirty) but as things scale up, neither works
> >>>well and both seem to have unintended consequences that make further
> >>>progress much harder.  Do we need to bite the bullet and "throw out" the
> >>>SMP code, just like we mostly did with goto filled code and thus regress
> >>>in order to make more progress later?  I very well remember the
> >>>resistance to eliminating goto, the projected cost in terms of
> >>>inefficient programs, the cost of rewriting, etc.  But now, few would go
> >>>back.
> >>>
> >>>Just a thought, but I find it interesting.
> >>>
> >>
> >><with respect>
> >>
> >>No. No. No. No. No. No.
> >>
> >>Single-processor system images harmful to parallel programming.
> >>
> >>SPSI = Everything has to cross user/kernel space boundaries and
> >>communication stack for any kind of nontrivial parallelism.
> >
> >
> > The two seem related.  While you can certainly implement message passing
> > on an SMP hardware design, you can't realistically use shared memory
> > semantics on things like clusters.  NUMAs are sort of in the middle.
> >
> > Perhaps I should refine my claim.  How about something like software
> > designs that assume shared memory semantics will have to go?  A good piece
> > of software would work well and without any source changes on a single or
> > multiple (up to some limit) CPUs.
>
> I love the idea of making truly portable || (as Eugene would put
> it) source, and I think it is possible...
>
> The big problem as I see it is the load & run bit.  To make that
> truly viable I have come to the conclusion that you'd need the
> OS & SysAdmin to make calls on how the workload is distributed in
> terms of data & processes.
>
> Even worse, in a general purpose system you'd have to do this in
> real time, and of course for extra credit you'd have to avoid the
> profile driven optimisation pitfalls.
>
> [SNIP]
>
> > I am intrigued by what was done with the transputer in that area and
> > also by some of the hypercube designs like NCube.  We need simple
> > primitives to get information from one "process" to another, with no code
> > changes no matter whether the two processes are on the same CPU or not.
> > Then the underlying physical mechanism can be optimized for particular
> > sizes and technologies.
>
> That's old hat. To reiterate : The problem is then taking that
> network of processes and mapping it onto a tuple of machine
> resources & dataset. That tuple can change over time too. :(

But if the "transactions" are small enough, then it may matter less.  There
is a lot of work on "load balancing" with things like web server farms.  The
load leveling of modest sized "transactions" can be done, and even if it isn't
optimal, it can be made to be pretty close fairly easily.  Yes, it does
require some kind of "supervisor/monitor" to make those decisions, but it
can monitor and update as things change.
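
To make the "small transactions are easy to level" point concrete, here is
a minimal Python sketch of a greedy dispatcher: each transaction goes to
whichever worker currently has the least accumulated work.  The function
name `assign` and the idea of knowing each transaction's cost up front are
assumptions made for the sketch; a real supervisor would be estimating.

```python
import heapq


def assign(costs, n_workers):
    """Greedy dispatch: send each transaction to the least-loaded worker.

    Returns (placement, loads): the worker chosen for each transaction,
    and the total cost that ended up on each worker.
    """
    heap = [(0, w) for w in range(n_workers)]  # (accumulated load, worker id)
    heapq.heapify(heap)
    placement = []
    loads = [0] * n_workers
    for cost in costs:
        load, w = heapq.heappop(heap)      # current least-loaded worker
        placement.append(w)
        loads[w] = load + cost
        heapq.heappush(heap, (loads[w], w))
    return placement, loads
```

With this rule the gap between the busiest and the idlest worker never
exceeds the cost of the largest single transaction, so the smaller the
transactions, the closer a simple scheme like this gets to an even spread.
It says nothing about where the data lives, which is the harder half of the
mapping problem.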

> This is something that has been nagging at the back of my mind
> for the past 6 years or so, but I've not come up with any nice
> solutions yet. I'm getting to the point where I really need to
> just bite the bullet and put together a playpen. :)

Sounds good.  Keep us up to date as you progress.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/7/2004 11:25:39 PM

Stephen Fuld wrote:

> "Robert Myers" <rmyers1400@comcast.net> wrote in message
> news:7bo%c.49387$3l3.3074@attbi_s03...
> 
>>Stephen Fuld wrote:


>>>Perhaps I should refine my claim.  How about something like software
>>>designs that assume shared memory semantics will have to go?  A good piece
>>>of software would work well and without any source changes on a single or
>>>multiple (up to some limit) CPUs.
>>>
>>
>>That would seem to close off some of the most interesting possibilities
>>for Whitefield (four Banias cores, shared L2).
> 
> 
> I don't think so.  Any implementation "under the covers" would be fine, and
> I perceive that shared memory could very easily be used to implement some
> sort of message passing (NOT MPI) mechanism underneath.  What I am trying to
> avoid is all the programming difficulties of locks, protected variables,
> etc. that seem to give people so many headaches and don't scale very well.
> I am thinking of something like some transaction systems that break up the
> work into rather small chunks and pass the results of one chunk/transaction
> to another.  The first is then free to start work on another piece of work
> in parallel with the second part working on the first.  It is sort of like
> David DiNucci's software cabling, though there are parts of that where I
> have some problems seeing how it would work well.  Thus the software
> wouldn't rely on an architecture like you described, but it could be
> implemented on such an architecture pretty easily.
> 

But what mechanism copes with the fact that inter-chunk communication 
between chunks on the same die would be very different from inter-chunk 
communication with the chunks on, say, separate system images in 
different boxes?  I can imagine delegating OS-like responsibilities to a 
task assigned to each die that would tell a chunk where to look for data 
from other chunks and how to get the data.  That way, chunks on the die 
can take care of business without invoking the heavy machinery required 
to communicate off-die.  The fact that you have this new kind of 
granularity (die affinity as well as processor affinity), though, would 
seem to make the problems that Rupert is worried about even worse.

With the shared memory model, you do have all the ugly stuff you're 
trying to avoid, but the memory hierarchy takes care of where to look 
for the data and exploits the efficiencies of the shared cache 
automatically.

RM

0
Reply Robert 9/8/2004 12:25:15 AM

Andrew Reilly wrote:

> Robert Myers wrote:
> 

<snip>

> 
>> The situation is _so_ bad that it doesn't seem embarrassing, 
>> apparently, for Orion Multisystems to take a lame processor, to hobble 
>> it further with a lame interconnect, and to call it a workstation.  If 
>> the future of computing really is slices of Wonder Bread in a plastic 
>> bag and not a properly cooked meal, then the Orion box makes some 
>> sense.  Might as well get used to it and start programming on an 
>> architecture that at least has the right topology and instruction set, 
>> as I believe Andrew Reilly is suggesting.
> 
> 
> Well, I think that the specific instruction set is probably a red 
> herring...

<snip>

> 
> The Transmeta Efficeon is just the first widely available processor with 
> embedded-levels of integration (memory and some peripheral interfaces 
> and hyper-channel for other peripherals) and power consumption that can 
> do pipelined double-precision floating point multiply/additions at two 
> flops/clock at an interesting clock rate.  1.5GHz is significantly 
> faster than the DSP competitors.  TI C6700 tops out at 300MHz and only 
> does single precision at the core rate.  PowerPC+Altivec doesn't have 
> the memory controller or the peripheral interconnect to drive up the 
> areal density.  The BlueGene core is about the right shape, but I 
> haven't seen any industrial/embedded boxes with a few dozen of them in 
> it, yet.  The MIPS and ARM processors that have the integration don't 
> have the floating point chops.  Modern versions of VIA C3 might be 
> getting interesting (or not: I haven't looked at their double-precision 
> performance), but have neither the memory controller nor the 
> hyperchannel, nor quite the MHz.  Of course, Opterons fit that 
> description too, and clock much faster, but I thought that they consumed 
> considerably more power, too. Maybe their MIPS/watt is closer than I've 
> given it credit for.
> 

Maybe by the time Whitefield and Niagara are available, Transmeta will 
have a similar product, too.  A ULV Whitefield is where I'd want to 
start, and I don't think I'd be too bothered by the separate controller, 
which I'd get to amortize over at least four cores.  By the time 
Whitefield is available, Intel should have more complete infrastructure 
like Advanced Switching as an interconnect.

A dual-processor box would already have eight pipes with very little 
fuss (and will probably be available as a standard workstation product). 
If I wanted to do something more exotic, I'm absolutely certain I 
wouldn't wind up with a gigabit ethernet cluster in a box.  I think I 
could do all that and still compete on performance/watt.

All told, I think you either have to have an application that's 
well-suited to the architecture and be cramped for space or power, 
and/or believe that the architecture (cluster with lame interconnect) 
really is the future of computing to find the box attractive.  The 
architecture manifestly _isn't_ the future of computing, though, with 
chips like Whitefield on the way.

RM

0
Reply Robert 9/8/2004 1:00:50 AM

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:Ljs%c.259916$8_6.122091@attbi_s04...
> Stephen Fuld wrote:
>
> > "Robert Myers" <rmyers1400@comcast.net> wrote in message
> > news:7bo%c.49387$3l3.3074@attbi_s03...
> >
> >>Stephen Fuld wrote:
>
>
> >>>Perhaps I should refine my claim.  How about something like software
> >>>designs that assume shared memory semantics will have to go?  A good
> >>>piece of software would work well and without any source changes on a
> >>>single or multiple (up to some limit) CPUs.
> >>>
> >>
> >>That would seem to close off some of the most interesting possibilities
> >>for Whitefield (four Banias cores, shared L2).
> >
> >
> > I don't think so.  Any implementation "under the covers" would be fine,
> > and I perceive that shared memory could very easily be used to implement
> > some sort of message passing (NOT MPI) mechanism underneath.  What I am
> > trying to avoid is all the programming difficulties of locks, protected
> > variables, etc. that seem to give people so many headaches and don't scale
> > very well.  I am thinking of something like some transaction systems that
> > break up the work into rather small chunks and pass the results of one
> > chunk/transaction to another.  The first is then free to start work on
> > another piece of work in parallel with the second part working on the
> > first.  It is sort of like David DiNucci's software cabling, though there
> > are parts of that where I have some problems seeing how it would work
> > well.  Thus the software wouldn't rely on an architecture like you
> > described, but it could be implemented on such an architecture pretty
> > easily.
> >
>
> But what mechanism copes with the fact that inter-chunk communication
> between chunks on the same die would be very different from inter-chunk
> communication with the chunks on, say, separate system images in
> different boxes?

What handled it in Transputers.  At least according to my understanding, a
program just did a "send" command to a process and something figured out if
that process was on the same die or required traversal of a link.  Am I
wrong about that?
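
For what it is worth, the behaviour being described (the sender names a
process, and something else works out the route) can be modelled in a few
lines of Python.  This is a toy under stated assumptions, not a description
of how transputer hardware actually did it: the class `Fabric`, its
`placement` map and the `link_hops` counter are all invented here.

```python
from collections import deque


class Fabric:
    """Toy model: processes live on numbered dies; senders never name a die."""

    def __init__(self, placement):
        self.placement = dict(placement)              # process name -> die id
        self.mailbox = {p: deque() for p in self.placement}
        self.link_hops = 0                            # messages that left a die

    def send(self, src, dest, payload):
        """Deliver payload; the fabric, not the caller, decides the route."""
        if self.placement[src] != self.placement[dest]:
            self.link_hops += 1                       # stands in for a link transfer
        self.mailbox[dest].append(payload)

    def receive(self, name):
        """Take the oldest pending message for the named process."""
        return self.mailbox[name].popleft()


fabric = Fabric({"a": 0, "b": 0, "c": 1})
fabric.send("a", "b", "same die")
fabric.send("a", "c", "other die")
```

The two `send` calls are identical from the program's point of view; only
the placement table knows that the second one crosses a link.  Moving "c"
onto die 0 would change the cost but not a line of the calling code.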

> I can imagine delegating OS-like responsibilities to a
> task assigned to each die that would tell a chunk where to look for data
> from other chunks and how to get the data.  That way, chunks on the die
> can take care of business without invoking the heavy machinery required
> to communicate off-die.  The fact that you have this new kind of
> granularity (die affinity as well as processor affinity), though, would
> seem to make the problems that Rupert is worried about even worse.

It may make tuning worse, but it makes getting reasonable scalability
without source code changes easier.  I should say that my assumption is that
there is code for every "transaction" on each "processor image" so the only
thing required is sending the data.  An OS function could certainly keep
track of processor utilization and "arrange" the routing to best utilize each
processor taking into account logical distances, etc.

Obviously, I have no finished design, just musings.  But I think we need to
start looking at different ideas than the basic ones we have now.  I am
trying to build on the "easy parallelism" of most transaction systems and
apply that to other situations.

> With the shared memory model, you do have all the ugly stuff you're
> trying to avoid, but the memory hierarchy takes care of where to look
> for the data and exploits the efficiencies of the shared cache
> automatically.

Yes, and it works well for modest sized systems.  But it starts to show the
strain as the number of CPUs grows and it exposes all the ugliness we have
discussed to the user.

BTW, I hadn't intended to get into this with this little thought through on
my part.  I appreciate your indulgence and civility with what may be a
harebrained idea.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/8/2004 3:34:11 AM

Russell Wallace wrote:
> On Tue, 31 Aug 2004 02:41:55 -0400, Tony Hill
> <hilla_nospam_20@yahoo.ca> wrote:
> 
> 
>>However the new 90nm fab process has maybe thrown this automatic
>>assumption of much higher clock speeds into question, at least for the
>>time being.  Intel's still having trouble getting the "Prescott" P4 up
>>to 3.6GHz and have pushed back the release date of their 3.8 and
>>4.0GHz P4 chips multiple times.
> 
> 
> As I understand it, you could indeed hit, say, 5 GHz with a 90 nm
> process (and Prescott's design - longer pipeline, etc - indicates
> Intel were hoping to do just that), except that the chip would melt?
> 
You say that as if it were a bad thing ;-)

-- 
bill davidsen (davidsen@darkstar.prodigy.com)
   SBC/Prodigy Yorktown Heights NY data center
   Project Leader, USENET news
   http://newsgroups.news.prodigy.com
0
Reply Bill 9/8/2004 4:27:16 AM

Sander Vesik wrote:
> In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> 
>>In article <ch6m8b$grg$1@news-rocq.inria.fr>, Grumble  <a@b.c> wrote:
>>
>>>spinlock wrote:
>>>
>>>
>>>>We are on track for mass shipment of a billion (that's with a B)
>>>>transistor die by '08.
>>>
>>>Who's "we" ?
>>
>>A good question.  But note that "by '08" includes "in 2005".
>>
>>
>>>I have read that there will be ~1.7e9 transistors in Montecito.
>>>Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
>>>transistor count. Montecito is expected next year.
>>
>>By whom is it expected?  And how is it expected to appear?  Yes,
>>someone will wave a chip at IDF and claim that it is a Montecito,
>>but are you expecting it to be available for internal testing,
>>to all OEMS, to special customers, or on the open market?
> 
> 
> Is any kind of itanium actually available on the open market (and
> i mean openmarket for new chips, not resale of systems)?

There are people advertising such. Who would buy one without a 
motherboard is an interesting question; I learned a long time ago that 
buying the CPU and M/B as a unit avoids finger pointing if there's any issue.

-- 
bill davidsen (davidsen@darkstar.prodigy.com)
   SBC/Prodigy Yorktown Heights NY data center
   Project Leader, USENET news
   http://newsgroups.news.prodigy.com
0
Reply Bill 9/8/2004 4:36:15 AM

Nick Maclaren wrote:
> In article <41362416.1434651@news.eircom.net>,
> Russell Wallace <wallacethinmintr@eircom.net> wrote:
> 
>>On Wed, 01 Sep 2004 06:30:13 GMT, "Raymond" <no@all.net> wrote:
>>
>>
>>>Beyond
>>>2 cores, I don't see much benefit adding more cores for desktops, not today,
>>>and not tomorrow, notwithstanding a lot more intense use of multi-threading.
>>>I just don't see how the OS, or any compiler, can possibly deal with the main
>>>logical issues involved in synchronization and concurrency, automagically
>>>turning an otherwise mostly STA program into a multi-threaded one.
>>
>>We had exactly that argument 15 years ago with regard to parallel
>>processing on servers and supercomputers.
> 
> 
> And 30 years ago.  I wasn't in this game 45 years ago.
> 
> 
>>It won't surprise me in the least if 15 years from now, when the
>>conversation is about multiple cores in digital watches or whatever,
>>someone says "we had exactly that argument 15 years ago with regard to
>>parallel processing on desktops" :)
> 
> 
> Nor would it surprise me.  Raymond makes one good point, though he
> gets it slightly wrong!
> 
> There is effectively NO chance of automatic parallelisation working
> on serial von Neumann code of the sort we know and, er, love.  Not
> in the near future, not in my lifetime and not as far as anyone can
> predict.  Forget it.

There are some problems which can not be made parallel. As in "can be 
proved not to be parallelizable" rather than "we don't know how yet." 
But the world is not full of those problems, and desktops are REALLY not 
running them. So in a practical sense, SMP is as good as a faster CPU 
*IF* you have multiple tasks or threads, and the overhead doesn't eat the 
extra power.
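
The "*IF*" can be put into rough numbers with an Amdahl's-law style
estimate.  The Python sketch below is only that: the function name
`speedup` is made up, and the fixed `overhead` term is an addition for this
sketch rather than part of the classic formula.

```python
def speedup(parallel_fraction, n_cpus, overhead=0.0):
    """Amdahl-style speedup estimate with an extra fixed overhead term.

    All times are fractions of the single-CPU run time, which is 1.0.
    overhead models synchronisation and scheduling cost added by going SMP.
    """
    serial = 1.0 - parallel_fraction
    time_on_n = serial + parallel_fraction / n_cpus + overhead
    return 1.0 / time_on_n
```

With 90% of the work spread over independent tasks, `speedup(0.9, 2)` comes
out a little over 1.8, so the second CPU is nearly as good as doubling the
clock.  At the other extreme, `speedup(1.0, 2, overhead=0.5)` is exactly
1.0: the overhead has eaten the whole gain.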
> 
> This has the consequence that large-scale parallelism is not a viable
> general-purpose architecture until and unless we move to a paradigm
> that isn't so intractable.  There are such paradigms (functional
> programming is a LITTLE better, for a start), but none have taken
> off as general models.  The HPC world is sui generis, and not relevant
> in this thread.
> 
> So he would be right if he replaced "beyond 2 cores" by "beyond a
> small number of cores".  At least for the next decade or so.

20-25 years ago there was a company called Convex which had a killer C 
compiler which did parallelizing. For many problems it would give 
Cray-like results for far fewer bucks. I have no idea why they 
concentrated on hardware when they had the best software of the time.

-- 
bill davidsen (davidsen@darkstar.prodigy.com)
   SBC/Prodigy Yorktown Heights NY data center
   Project Leader, USENET news
   http://newsgroups.news.prodigy.com
0
Reply Bill 9/8/2004 4:44:10 AM

Nick Maclaren wrote:

> Not merely do people sweat blood to get such parallelism, they
> often have to change their algorithms (sometimes to ones that are
> less desirable, such as being less accurate), and even then only
> SOME problems can be parallelised.

I think you are looking at huge problems, when the money is on the 
desktop. As network speeds go up you can be running several java apps, 
unpacking a jpg, pulling something out of a local database, updating the 
display... In other words programs don't need to be massively rewritten, 
opening a web page may generate enough totally autonomous tasks to make 
SMP useful. And games, thread per character?

I don't think we need to wait for any breakthroughs to benefit, the only 
question is how much.

-- 
bill davidsen (davidsen@darkstar.prodigy.com)
   SBC/Prodigy Yorktown Heights NY data center
   Project Leader, USENET news
   http://newsgroups.news.prodigy.com
0
Reply Bill 9/8/2004 4:53:26 AM

In article <afw%c.7042$837.3968@newssvr31.news.prodigy.com>,
Bill Davidsen <davidsen@darkstar.prodigy.com> writes:
|> Nick Maclaren wrote:
|> 
|> > Not merely do people sweat blood to get such parallelism, they
|> > often have to change their algorithms (sometimes to ones that are
|> > less desirable, such as being less accurate), and even then only
|> > SOME problems can be parallelised.
|> 
|> I think you are looking at huge problems, when the money is on the 
|> desktop. ...

You weren't following the thread.  I and others were pointing out
that small-scale process-level parallelism is useful on the desktop,
but serious parallelisation of applications is a wide blue yonder
project.  The context of the above is where I was telling someone
that the fact that HPC applications have been parallelised does not
mean that desktop ones can easily follow.



Regards,
Nick Maclaren.
0
Reply nmm1 9/8/2004 8:30:15 AM

In article <u6w%c.7041$e67.4327@newssvr31.news.prodigy.com>,
Bill Davidsen <davidsen@darkstar.prodigy.com> writes:
|> > 
|> > There is effectively NO chance of automatic parallelisation working
|> > on serial von Neumann code of the sort we know and, er, love.  Not
|> > in the near future, not in my lifetime and not as far as anyone can
|> > predict.  Forget it.
|> 
|> There are some problems which can not be made parallel. As in "can be 
|> proved not to be parallelizable" rather than "we don't know how yet." 
|> But the world is not full of those problems, and desktops are REALLY not 
|> running them. So in a practical sense, SMP is as good as a faster CPU 
|> *IF* you have multiple tasks or threads, and the overhead doesn't eat the 
|> extra power.

Yes, but there are many more that can be parallelised, but not
automatically - this is a variant of the halting problem - and it was
actually that one I was referring to.

|> 20-25 years ago there was a company called Convex which had a killer C 
|> compiler which did parallelizing. For many problems it would give 
|> Cray-like results for far fewer bucks. I have no idea why they 
|> concentrated on hardware when they had the best software of the time.

C-like compiler.  Semantically, it was very unlike C.  There has
been no problem about autoparallelising some codes (the ones that
I call vectorisable) for 30+ years.  Some language systems have
handled other classes of problem, but the state of the art has
not advanced much beyond that.
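
As a rough gloss on "vectorisable" (assumed for illustration, not a
definition taken from this thread), the easy case is a loop whose
iterations do not depend on one another.  The Python sketch below contrasts
that with a loop-carried dependence; the function names are invented.

```python
def scale(xs, k):
    """Every iteration is independent, so the loop can be split anywhere."""
    return [k * x for x in xs]


def running_sum(xs):
    """Each iteration needs the previous result: a loop-carried dependence."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out


data = [1, 2, 3, 4]

# Splitting the independent loop in half and gluing the halves back together
# gives the same answer as running it serially.
split_scale = scale(data[:2], 2) + scale(data[2:], 2)

# Doing the same to the dependent loop silently gives a different answer,
# because the second half starts without the first half's total.
split_sum = running_sum(data[:2]) + running_sum(data[2:])
```

A compiler can prove the first kind of split safe mechanically.  The second
needs a different algorithm (a parallel scan, say), which is a change a
person makes, not one a compiler can be relied on to find in general code.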


Regards,
Nick Maclaren.
0
Reply nmm1 9/8/2004 8:37:24 AM

Stefan Monnier wrote:
>> I think that EMACS is going to be one of the desktop applications
>> that are going to be parallelized well.
>
> That statement is simply hilarious,
>
>        Stefan "an Emacs maintainer"

Well I don't much care either way, myself, but in the interests of
maintaining a standard of debate in this newsgroup, would you care
to rebut the supporting arguments? 


0
Reply Ken 9/8/2004 8:58:17 AM

In article <chmhjc$1fs$1$8300dec7@news.demon.co.uk>,
"Ken Hagan" <K.Hagan@thermoteknix.co.uk> writes:
|> Stefan Monnier wrote:
|> >> I think that EMACS is going to be one of the desktop applications
|> >> that are going to be parallelized well.
|> >
|> > That statement is simply hilarious,
|> >
|> >        Stefan "an Emacs maintainer"
|> 
|> Well I don't much care either way, myself, but in the interests of
|> maintaining a standard of debate in this newsgroup, would you care
|> to rebut the supporting arguments? 

Yes.  No problem.  I humbly submit the source of Emacs as evidence,
and claim that the conclusion is obvious.

Note that Stefan Monnier did not say that Emacs could not be
parallelised well, at least in theory, but was responding to a
comment that it was going to be.


Regards,
Nick Maclaren.
0
Reply nmm1 9/8/2004 9:22:40 AM

Stephen Fuld wrote:

> "Robert Myers" <rmyers1400@comcast.net> wrote in message
> news:Ljs%c.259916$8_6.122091@attbi_s04...
> 
>>Stephen Fuld wrote:
>>
>>
>>>"Robert Myers" <rmyers1400@comcast.net> wrote in message
>>>news:7bo%c.49387$3l3.3074@attbi_s03...
>>>
>>>
>>>>Stephen Fuld wrote:
>>
>>
>>>>>Perhaps I should refine my claim.  How about something like software
>>>>>designs that assume shared memory semantics will have to go?  A good
>>>>>piece of software would work well and without any source changes on a single
>>>>>or multiple (up to some limit) CPUs.
>>>>>
>>>>
>>>>That would seem to close off some of the most interesting possibilities
>>>>for Whitefield (four Banias cores, shared L2).
>>>
>>>
>>>I don't think so.  Any implementation "under the covers" would be fine, 
>>>and I perceive that shared memory could very easily be used 
>>>to implement some sort of message passing (NOT MPI) mechanism underneath.  
>>>What I am trying to avoid is all the programming difficulties of locks, 
>>>protected variables, etc. that seem to give people so many headaches and 
>>>don't scale very well.
> 
>>>I am thinking of something like some transaction systems that break up
>>>the work into rather small chunks and pass the results of one
>>>chunk/transaction to another.  The first is then free to start work on
>>>another piece of work in parallel with the second part working on the
>>>first.  It is sort of like David DiNucci's software cabling, though
>>>there are parts of that where I have some problems seeing how it would
>>>work well.  Thus the software wouldn't rely on an architecture like you
>>>described, but it could be implemented on such an architecture pretty
>>>easily.
>>>
>>
>>But what mechanism copes with the fact that inter-chunk communication
>>between chunks on the same die would be very different from inter-chunk
>>communication with the chunks on, say, separate system images in
>>different boxes?
> 
> 
> What handled it in Transputers.  At least according to my understanding, a
> program just did a "send" command to a process and something figured out if
> that process was on the same die or required traversal of a link.  Am I
> wrong about that?
> 

Rupert?

> 
>>I can imagine delegating OS-like responsibilities to a
>>task assigned to each die that would tell a chunk where to look for data
>>from other chunks and how to get the data.  That way, chunks on the die
>>can take care of business without invoking the heavy machinery required
>>to communicate off-die.  The fact that you have this new kind of
>>granularity (die affinity as well as processor affinity), though, would
>>seem to make the problems that Rupert is worried about even worse.
> 
> 
> It may make tuning worse, but it makes getting reasonable scalability
> without source code changes easier.  I should say that my assumption is that
> there is code for every "transaction" on each "processor image" so the only
> thing required is sending the data.  An OS function could certainly keep
> track of processor utilization and "arrange" the routing to best utilize each
> processor taking into account logical distances, etc.
> 
> Obviously, I have no finished design, just musings.  But I think we need to
> start looking at different ideas than the basic ones we have now.  I am
> trying to build on the "easy parallelism" of most transaction systems and
> apply that to other situations.
> 
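
The chunked "transaction" idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `stage` and `run_pipeline` are hypothetical, Python threads and queues stand in for whatever message-passing mechanism would sit "under the covers", and CPython threads show the structure rather than any real speedup.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the work stream


def stage(func, inbox, outbox):
    """Run one stage: receive a chunk, process it, send the result on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Connect the stages with queues; stages share no mutable state."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in items:
        queues[0].put(item)  # the first stage is free to start on the next chunk
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results
```

The stage functions contain no locks and do not know how many workers exist or where they run; only the plumbing in `run_pipeline` would have to change to route a message across a link instead of through memory.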

I wonder if the "easy parallelism" of most transaction systems doesn't 
rest solidly on the slowness of disk drives, the slowness of the network 
interconnect, and the relatively low expectations that are the result.

> 
>>With the shared memory model, you do have all the ugly stuff you're
>>trying to avoid, but the memory hierarchy takes care of where to look
>>for the data and exploits the efficiencies of the shared cache
>>automatically.
> 
> 
> Yes, and it works well for modest sized systems.  But it starts to show the
> strain as the number of CPUs grows and it exposes all the ugliness we have
> discussed to the user.
> 
> BTW, I hadn't intended to get into this with this little thought through on
> my part.  I appreciate your indulgence and civility with what may be a hair
> brained idea.
> 

Probably not a good sign if I am being thanked merely for being civil. ;-).

Hare-brained or not, some ideas not currently in the mainstream canon 
are going to have to move there, unless computers are to be sold with 
all but one core disabled... Unless, as I believe some people are 
privately thinking, the second, third, and fourth or more cores will be 
there and actually do something useful at some point in their career, 
but for the most part, they will be marketing gimmicks for all but the 
applications where SMP is already useful.

RM


0
Reply Robert 9/8/2004 1:17:11 PM

>>> I think that EMACS is going to be one of the desktop applications
>>> that are going to be parallerized well.
>> 
>> That statement is simply hilarious,
>> Stefan "an Emacs maintainer"

> Well I don't much care either way, myself, but in the interests of
> maintaining a standard of debate in this newsgroups, would you care
> to rebut the supporting arguments? 

Well, as Nick points out, there's the source code, riddled with global
variables, dynamic data structures, indirections, ...

Then there's the Lisp, its dynamic scoping and its interaction with
buffer-local variables.  Most of the Lisp code doesn't rely on the precise
semantics of the current implementation of dynamic scoping (which relies
extensively on global variables), but some do and it's extremely difficult
to figure out which part does.
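
A minimal sketch of why dynamic scoping built on global variables resists parallel execution. It assumes a "shallow binding" scheme in which each symbol has one global value cell that is overwritten for the duration of a binding; the names `value_cells`, `let_bind` and `current_indent` are hypothetical and this is not Emacs code.

```python
# Hypothetical global value cells, one per symbol (shallow binding).
value_cells = {"indent-width": 4}


def let_bind(symbol, value, body):
    """Dynamically bind symbol to value while body runs, then restore."""
    saved = value_cells[symbol]      # remember the old global value
    value_cells[symbol] = value      # overwrite the single shared cell
    try:
        return body()
    finally:
        value_cells[symbol] = saved  # restore on exit, even on error


def current_indent():
    # Sees whichever binding is in effect in the caller's dynamic extent.
    return value_cells["indent-width"]
```

Because the temporary value lives in the one shared cell, a second thread calling `current_indent()` while `let_bind` is active elsewhere would observe the other thread's binding, which is the kind of hidden dependence described above.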

Then there's the display semantics: to redisplay a window, you need to walk
the buffer sequentially, interpreting each char and its associated
text-properties in sequence.  Why is that?  Because the current char might be
displayed as one big image, so you can't know whether the next char will
need to be displayed (and where) until you've processed the current char.
Of course, it can still be parallelised, using speculation (which might
work very well here).
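
A rough sketch of the speculation idea, under heavy simplifying assumptions: each character is reduced to a glyph width, the window only wraps lines, and each chunk guesses its starting cursor by assuming every earlier glyph has width 1. The functions `layout` and `speculative_layout` are hypothetical and bear no relation to the real redisplay code.

```python
from concurrent.futures import ThreadPoolExecutor


def layout(widths, start=(0, 0), window=10):
    """Place glyphs sequentially; return their (row, col) positions and the end cursor."""
    row, col = start
    placed = []
    for w in widths:
        if col + w > window and col > 0:  # glyph does not fit: wrap to next row
            row, col = row + 1, 0
        placed.append((row, col))
        col += w
    return placed, (row, col)


def speculative_layout(widths, chunk=4, window=10):
    """Lay out chunks independently from guessed cursors, then verify in order."""
    starts = range(0, len(widths), chunk)
    chunks = [widths[i:i + chunk] for i in starts]
    # Speculate that every preceding glyph is an ordinary width-1 character.
    guesses = [layout([1] * i, window=window)[1] for i in starts]
    with ThreadPoolExecutor() as pool:
        results = list(
            pool.map(lambda cg: layout(cg[0], cg[1], window), zip(chunks, guesses))
        )
    placed, cursor = [], (0, 0)
    for piece, guess, (pos, end) in zip(chunks, guesses, results):
        if guess != cursor:  # speculation failed: redo this chunk sequentially
            pos, end = layout(piece, cursor, window)
        placed.extend(pos)
        cursor = end
    return placed
```

When the buffer is plain text the guesses hold and every chunk's speculative result is kept; a single wide glyph (the "one big image" case) invalidates the guesses after it, and those chunks fall back to sequential work.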

...

BTW, does anyone know of any work on parallelizing regexp-matching?


        Stefan
0
Reply Stefan 9/8/2004 2:48:58 PM

In article <jwvzn40akzd.fsf-monnier+comp.arch@gnu.org>,
Stefan Monnier <monnier@iro.umontreal.ca> writes:
|> 
|> Then there's the display semantics: to redisplay a window, you need to walk
|> the buffer sequentially, interpreting each char and its associated
|> text-properties in sequence.  Why is that?  Because the current char might be
|> displayed as one big image, so you can't know whether the next char will
|> need to be displayed (and where) until you've processed the current char.
|> Of course, it can still be parallelised, using speculation (which might
|> work very well here).

It would for me - one line to show the style of my .emacs:

(setq auto-mode-alist '((".*" . fundamental-mode)))

|> BTW, does anyone know of any work on parallelizing regexp-matching?

I believe that I saw some once.  Anyway, I thought about it, and
felt that it should be straightforward.  In particular, it arose
out of the algorithm that I developed to check if two regular
expressions overlapped.

Unfortunately, that was only parallelising the NFA to give the same
sort of performance as the DFA, and needed a VASTLY more lightweight
parallel model than any I know of :-(
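
For comparison, one known way to split DFA matching across processors (not necessarily the approach described above) is to compute, for each chunk of input, a map from every possible entry state to the resulting exit state, and then compose the maps. The sketch below is an assumption-laden illustration: `chunk_map`, `compose` and `parallel_match` are hypothetical names, and the per-chunk maps are computed serially here although each is independent of the others.

```python
from functools import reduce


def chunk_map(delta, states, chunk):
    """For every possible entry state, find the state the DFA leaves the chunk in."""
    result = {}
    for s in states:
        cur = s
        for ch in chunk:
            cur = delta[(cur, ch)]
        result[s] = cur
    return result


def compose(f, g):
    """State map equivalent to running f's chunk and then g's chunk."""
    return {s: g[f[s]] for s in f}


def parallel_match(delta, states, start, accepting, text, n_chunks=4):
    """Decide acceptance by composing per-chunk state maps."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    maps = [chunk_map(delta, states, c) for c in chunks]  # independent work items
    identity = {s: s for s in states}
    total = reduce(compose, maps, identity)
    return total[start] in accepting
```

The price is that every chunk is scanned once per DFA state, so the approach only pays off when the state count is small relative to the number of processors.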

If you can give me a more detailed description of what sort of use
you are interested in parallelising, I may be able to help.


Regards,
Nick Maclaren.
0
Reply nmm1 9/8/2004 3:15:14 PM

Nick Maclaren wrote:
>
> Yes.  No problem.  I humbly submit the source of Emacs as evidence,
> and claim that the conclusion is obvious.
>
> Note that Stefan Monnier did not say that Emacs could not be
> parallelised well, at least in theory, but was responding to a
> comment that it was going to be.

I disagree. Jouni's post began...

    I have a better reason why emacs is a great candidate for
    parallerization.

...which is certainly starting from a "could" rather than "would"
viewpoint.

    Its written in lisp, and in reality its a lisp operating system
    with embedded wordprocessor included as a major app in it. Now
    the lisp code could be autoparallized by autoparallerizing compiler.
    So you would need to do some work to improve the underlying lisp
    compiler/OS to handle mutliprocessing needs.

Here he makes a specific supporting argument for his claim. When I
asked for rebuttals, I was rather hoping that someone would address
this one. Auto-parallelisation of Lisp may be significantly easier
than the same task for C (which I happily accept hasn't really
happened yet, despite efforts) so emacs may be much better placed
than "the average app".

    BTW: I think that EMACS is going to be one of the desktop
    applications that are going to be parallerized well. [If it
    hasn't already.]

OK, here he switches to "would" mode, but if he blows both ways in the
same post I think it's unfair to claim he went in just one direction.

    Simply because parallerizing it is geeky enough trick that someone
    in OSS developement may wan't to do just for the kicks [...]

Here's a second line of argument, differentiating emacs from the average
app. It is surely undeniable that "cult" OSS software gets ported and
twisted in far more ways than its intrinsic quality would justify. If
I had to place money on which applications would get ported first and
best to any new architecture, I'd bet on emacs and GNU C. 


0
Reply Ken 9/8/2004 3:52:15 PM

In article <u6w%c.7041$e67.4327@newssvr31.news.prodigy.com>,
Bill Davidsen  <davidsen@darkstar.prodigy.com> wrote:
>Nick Maclaren wrote:
>> In article <41362416.1434651@news.eircom.net>,
>> Russell Wallace <wallacethinmintr@eircom.net> wrote:
>> 
>>>On Wed, 01 Sep 2004 06:30:13 GMT, "Raymond" <no@all.net> wrote:
>>>
>>>
>>>>Beyond 2 cores, I don't see much benefit adding more cores for desktops,
>>>>not today, and not tomorrow, notwithstanding a lot more intense use of
>>>>multi-threading.  I just don't see how the OS, or any compiler, can
>>>>possibly deal with the main logical issues involved in synchronization
>>>>and concurrency, automagically turning an otherwise mostly STA program
>>>>into a multi-threaded one.
>>>
>>>We had exactly that argument 15 years ago with regard to parallel
>>>processing on servers and supercomputers.
>> 
>> 
>> And 30 years ago.  I wasn't in this game 45 years ago.
>> 
>> 
>>>It won't surprise me in the least if 15 years from now, when the
>>>conversation is about multiple cores in digital watches or whatever,
>>>someone says "we had exactly that argument 15 years ago with regard to
>>>parallel processing on desktops" :)
>> 
>> 
>> Nor would it surprise me.  Raymond makes one good point, though he
>> gets it slightly wrong!
>> 
>> There is effectively NO chance of automatic parallelisation working
>> on serial von Neumann code of the sort we know and, er, love.  Not
>> in the near future, not in my lifetime and not as far as anyone can
>> predict.  Forget it.
>
>There are some problems which can not be made parallel. As in "can be 
>proved not to be parallelizable" rather than "we don't know how yet." 
>But the world is not full of those problems, and desktops are REALLY not 
>running them. So in a practical sense, SMP is as good as a faster CPU 
>*IF* you have multiple tasks or threads, and the overhead doesn't eat the 
>extra power.
>> 
>> This has the consequence that large-scale parallelism is not a viable
>> general-purpose architecture until and unless we move to a paradigm
>> that isn't so intractable.  There are such paradigms (functional
>> programming is a LITTLE better, for a start), but none have taken
>> off as general models.  The HPC world is sui generis, and not relevant
>> in this thread.
>> 
>> So he would be right if he replaced "beyond 2 cores" by "beyond a
>> small number of cores".  At least for the next decade or so.
>
>20-25 years ago there was a company called Convex which had a killer C 
>compiler which did parallelizing. For many problems it would give 
>Cray-like results for far fewer bucks. I have no idea why they 
>concentrated on hardware when they had the best software of the time.

  The focus on hardware, at least early on, probably made some sense.
  They seemed to fill a niche for a "low-cost" mini-super, which coupled
  with great compilers (and VAX and CRAY source compatibility), could 
  attract VAX users (for example) wanting more performance without
  CRAY price.  By the early 1990's maybe it wasn't so sensible anymore...
  And getting started on a microprocessor-based parallel machine
  (SMP-like nodes, but with a crossbar) so late (and with little
  experience with such machines) was the end-- or, if you see the
  bright side of things, a new beginning after being acquired by HP.

  I agree that the compiler technology was a great strength
  of Convex, though ultimately I'm not sure we would have had much more
  success as a software company.  Looking back, we had great technical people,
  but I'm not so sure we had the right high-level decision makers to
  take the company any further.

  Interestingly, on the back patio of our Richardson campus, there
  used to be the "graveyard".  Whenever one of our "competitors"
  (Alliant, etc) flopped, they'd get a tombstone.  I bailed before I
  ever learned whether we received our own tombstone in the graveyard.

0
Reply jle 9/8/2004 4:14:57 PM


Stefan Monnier wrote:

> 
> BTW, does anyone know of any work on parallelizing regexp-matching?
> 

Last I knew, a while back, everybody (judging from a Google query that
returned a million PhD theses) was working on non-deterministic FSM based
regexp matching.  Unfortunately just the theses, no libraries.  You sure
can tell when you hit a fad for PhD dissertations.  The NDFSMs were interesting
since there was no backtracking, which was useful if you wanted to do Expect-like
regexp matching without having to rescan the entire buffer every time you
got input.  It's parallelization but not the way you are thinking.
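
A small sketch of the no-backtracking property, assuming a hand-built NFA without epsilon transitions; the class `IncrementalNFA` and the example pattern are hypothetical. The matcher keeps only the set of live states, so new input can be fed as it arrives and nothing already seen is rescanned.

```python
class IncrementalNFA:
    """Simulate an NFA by tracking the set of live states."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, char): {next states}}
        self.start = start
        self.accepting = accepting
        self.live = {start}

    def feed(self, text):
        """Consume more input; report whether a match completed in this piece."""
        matched = False
        for ch in text:
            nxt = {self.start}  # unanchored search: a match may begin anywhere
            for s in self.live:
                nxt |= self.transitions.get((s, ch), set())
            self.live = nxt
            if self.live & self.accepting:
                matched = True
        return matched
```

All live states advance on each character independently of one another, which is the sense in which this is parallelism, though at a much finer grain than threads.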

Joe Seigh
0
Reply Joe 9/8/2004 4:29:12 PM

"Ken Hagan" <K.Hagan@thermoteknix.co.uk> writes:

> Nick Maclaren wrote:
> >
> > Yes.  No problem.  I humbly submit the source of Emacs as evidence,
> > and claim that the conclusion is obvious.
> >
> > Note that Stefan Monnier did not say that Emacs could not be
> > parallelised well, at least in theory, but was responding to a
> > comment that it was going to be.
> 
> I disagree. Jouni's post began...
> 
>     I have a better reason why emacs is a great candidate for
>     parallerization.
> 
> ...which is certainly starting from a "could" rather than "would"
> viewpoint.
> 
>     Its written in lisp, and in reality its a lisp operating system
>     with embedded wordprocessor included as a major app in it. Now
>     the lisp code could be autoparallized by autoparallerizing compiler.
>     So you would need to do some work to improve the underlying lisp
>     compiler/OS to handle mutliprocessing needs.
> 
> Here he makes a specific supporting argument for his claim. When I
> asked for rebuttals, I was rather hoping that someone would address
> this one. Auto-parallelisation of Lisp may be significantly easier
> than the same task for C (which I happily accept hasn't really
> happened yet, despite efforts) so emacs may be much better placed
> than "the average app".

I don't buy this: alias analysis for lisp is not significantly easier (to
implement) than for C. The results might be slightly more precise (you
don't have to worry about some of the weird tricks you can play with
pointers and memory in C), but I doubt that it makes much difference in
practice (anybody know of a comparative study?). Without good alias
analysis, you're not going to do much auto-parallelisation of an imperative
language (yes, lisp is an imperative language before somebody claims
otherwise).
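
To illustrate why alias analysis matters here, the following sketch (hypothetical function names, Python lists standing in for memory) runs the same loop in program order and in an "all iterations at once" order. The two agree only when the compiler can prove that `dst` and `src` do not refer to the same storage.

```python
def shift_add_sequential(dst, src):
    """dst[i] = src[i - 1] + 1, executed in program order."""
    for i in range(1, len(dst)):
        dst[i] = src[i - 1] + 1


def shift_add_parallel(dst, src):
    """Same loop as if iterations ran together: all reads precede any write."""
    new_values = [src[i - 1] + 1 for i in range(1, len(dst))]
    dst[1:] = new_values
```

With distinct lists both versions give the same answer; called as `f(a, a)`, the sequential version carries each write into the next iteration while the parallel one does not, so a parallelising compiler that cannot rule out the alias has to keep the loop serial.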

> Here's a second line of argument, differentiating emacs from the average
> app. It is surely undeniable that "cult" OSS software gets ported and
> twisted in far more ways than its intrinsic quality would justify. If
> I had to place money on which applications would get ported first and
> best to any new architecture, I'd bet on emacs and GNU C. 

Speaking as someone who has also extensively hacked emacs, I'll have to
agree with Stefan. It won't happen (it would be easier to reimplement it
from scratch if you wanted a parallelised version). And what do you
mean by GNU C? An auto-parallelising gcc? Or a parallelised version of
gcc? (and why would you want the latter, when make -j gives you lots
of parallelism already?)

-- 
David Gay
dgay@acm.org
0
Reply David 9/8/2004 5:48:29 PM

Nick Maclaren <nmm1@cus.cam.ac.uk> wrote ...
> 
> In article <ch6s4q$ict$1@news-rocq.inria.fr>, Grumble <a@b.c> writes:
> |> > 
> |> > By whom is it expected?  And how is it expected to appear?  Yes,
> |> > someone will wave a chip at IDF and claim that it is a Montecito,
> |> > but are you expecting it to be available for internal testing,
> |> > to all OEMS, to special customers, or on the open market?
> |> 
> |> In November 2003, Intel's roadmap claimed Montecito would appear in 
> |> 2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel 
> |> supposedly showcased Montecito dies, and claimed that testing had begun.
> |> 
> |> Perhaps Intel is being overoptimistic, but, as far as I understand, they 
> |> claim Montecito will be ready in 2005.
> 
> I am aware of that.  Given that Intel failed to reduce the power
> going to 90 nm for the Pentium 4, that implies it will need 200
> watts.  Given that HP have already produced a dual-CPU package,
> they will have boards rated for that.  Just how many other vendors
> will have?

This is wrong.  As described by Paul Otellini during his keynote speech at 
IDF yesterday, and documented here (watch for URL wrap):

ftp://download.intel.com/pressroom/kits/events/idffall_
2004/otellini_presentation.pdf#page=38

the dual-core, multithreaded, Montecito package actually consumes *less* 
power than current Itanium 2 processors.

 -- Jim Hull
    Itanium Processor Architect at HP
0
Reply Jim 9/8/2004 6:20:06 PM

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:rDD%c.267294$8_6.207541@attbi_s04...

snip

> I wonder if the "easy parallelism" of most transaction systems doesn't
> rest solidly on the slowness of disk drives, the slowness of the network
> interconnect, and the relatively low expectations that are the result.

Well, there is probably some truth in that.  But I am also familiar with
systems that had large SSDs and were CPU bound on the largest available
processors of the time and benefitted from having multiple processors.  And
while the OS had to change and be aware of the multiple CPUs, I believe the
application programs (i.e. transaction code) didn't.

snip

> Hare-brained or not, some ideas not currently in the mainstream canon
> are going to have to move there, unless computers are to be sold with
> all but one core disabled... Unless, as I believe some people are
> privately thinking, the second, third, and fourth or more cores will be
> there and actually do something useful at some point in their career,
> but for the most part, they will be marketing gimmicks for all but the
> applications where SMP is already useful.

As has been stated here before, desktop systems can probably already benefit
from a modest number of cores.  And they do it using the "transaction" or
multiple process model, where multiple processes are running simultaneously.
And it probably wouldn't take much work to have some major applications take
better advantage of a modest number of cores.  But it is when you want to
scale up and things get more complex that current technology runs into
problems.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/8/2004 7:31:48 PM

In article <MPG.1ba8ed33652544ed9896c7@usenet01.boi.hp.com>,
Jim Hull  <jim.hull@hp.com> wrote:
>> |> 
>> |> Perhaps Intel is being overoptimistic, but, as far as I understand, they 
>> |> claim Montecito will be ready in 2005.
>> 
>> I am aware of that.  Given that Intel failed to reduce the power
>> going to 90 nm for the Pentium 4, that implies it will need 200
>> watts.  Given that HP have already produced a dual-CPU package,
>> they will have boards rated for that.  Just how many other vendors
>> will have?
>
>This is wrong.  As described by Paul Otellini during his keynote speech at 
>IDF yesterday, and documented here (watch for URL wrap):
>
>ftp://download.intel.com/pressroom/kits/events/idffall_
>2004/otellini_presentation.pdf#page=38

Most interesting.  Unfortunately, that failed to download.

It is amusing that, at the time I posted my statement, it was based
on the best available information, but that was negated within days.
I don't suppose that you can say WHY the Montecito manages to make
good use of the 90 nm process and the Prescott failed to?

I have a very similar confusion over the IBM G5, with reliable reports
of 200 watts and other ones of (if I recall) 50 watts.


Regards,
Nick Maclaren.
0
Reply nmm1 9/8/2004 8:19:16 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> In article <MPG.1ba8ed33652544ed9896c7@usenet01.boi.hp.com>,
> Jim Hull  <jim.hull@hp.com> wrote:

> >> I am aware of that.  Given that Intel failed to reduce the power
> >> going to 90 nm for the Pentium 4, that implies it will need 200
> >> watts.  Given that HP have already produced a dual-CPU package,
> >> they will have boards rated for that.  Just how many other vendors
> >> will have?

> >This is wrong.  As described by Paul Otellini during his keynote speech at 
> >IDF yesterday, and documented here (watch for URL wrap):

> >ftp://download.intel.com/pressroom/kits/events/idffall_
> >2004/otellini_presentation.pdf#page=38

> Most interesting.  Unfortunately, that failed to download.

> It is amusing that, at the time I posted my statement, it was based
> on the best available information, but that was negated within days.
> I don't suppose that you can say WHY the Montecito manages to make
> good use of the 90 nm process and the Prescott failed to?

Read this..

http://whatever.org.ar/~module/resources/computers/computer-arch/ia-64/vail_slides_2003.pdf

Re-posted from RWT.

http://www.realworldtech.com/forums/index.cfm?action=detail&PostNum=2668&Thread=1&entryID=37912&roomID=11

> I have a very similar confusion over the IBM G5, with reliable reports
> of 200 watts and other ones of (if I recall) 50 watts.

Perhaps system power draw versus CPU:typical power draw.





-- 
davewang202(at)yahoo(dot)com
0
Reply David 9/8/2004 8:44:16 PM

Jim Hull <jim.hull@hp.com> wrote in message news:<MPG.1ba8ed33652544ed9896c7@usenet01.boi.hp.com>...
> This is wrong.  As described by Paul Otellini during his keynote speech at 
> IDF yesterday, and documented here (watch for URL wrap):
> 
> ftp://download.intel.com/pressroom/kits/events/idffall_
> 2004/otellini_presentation.pdf#page=38
> 
> the dual-core, multithreaded, Montecito package actually consumes *less* 
> power than current Itanium 2 processors.
> 
>  -- Jim Hull
>     Itanium Processor Architect at HP

There is also:
    http://www28.cplan.com/cbi_export/MA_OSAS002_266814_68-1_v2.pdf
which gives the specific quote:
    2 cores, 2 threads, 26.5MByte of cache, and 1.72 billion
transistors at 100W
(2 threads means "2 threads per core" in case it is not clear. Slide
elsewhere indicates SMT.)

(The crypto folks will appreciate y'all adding an extra shifter per
core too. Its the little extra touches that count :-))

-Z-
0
Reply googlenews 9/8/2004 11:59:27 PM

Jim Hull <jim.hull@hp.com> wrote in message news:<MPG.1ba8ed33652544ed9896c7@usenet01.boi.hp.com>...
> Nick Maclaren <nmm1@cus.cam.ac.uk> wrote ...
> > 
> > In article <ch6s4q$ict$1@news-rocq.inria.fr>, Grumble <a@b.c> writes:
> > |> > 
> > |> > By whom is it expected?  And how is it expected to appear?  Yes,
> > |> > someone will wave a chip at IDF and claim that it is a Montecito,
> > |> > but are you expecting it to be available for internal testing,
> > |> > to all OEMS, to special customers, or on the open market?
> > |> 
> > |> In November 2003, Intel's roadmap claimed Montecito would appear in 
> > |> 2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel 
> > |> supposedly showcased Montecito dies, and claimed that testing had begun.
> > |> 
> > |> Perhaps Intel is being overoptimistic, but, as far as I understand, they 
> > |> claim Montecito will be ready in 2005.
> > 
> > I am aware of that.  Given that Intel failed to reduce the power
> > going to 90 nm for the Pentium 4, that implies it will need 200
> > watts.  Given that HP have already produced a dual-CPU package,
> > they will have boards rated for that.  Just how many other vendors
> > will have?
> 
> This is wrong.  As described by Paul Otellini during his keynote speech at 
> IDF yesterday, and documented here (watch for URL wrap):
> 
> ftp://download.intel.com/pressroom/kits/events/idffall_
> 2004/otellini_presentation.pdf#page=38
> 
> the dual-core, multithreaded, Montecito package actually consumes *less* 
> power than current Itanium 2 processors.
> 
>  -- Jim Hull
>     Itanium Processor Architect at HP

So all this stuff came good then ? ;-)

http://whatever.org.ar/~module/resources/computers/computer-arch/ia-64/vail_slides_2003.pdf
0
Reply mas769 9/9/2004 12:15:08 AM

Stephen Fuld wrote:
> "Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in

[SNIP]

>>This is something that has been nagging at the back of my mind
>>for the past 6 years or so, but I've not come up with any nice
>>solutions yet. I'm getting to the point where I really need to
>>just bite the bullet and put together a playpen. :)
> 
> 
> Sounds good.  Keep us up to date as you progress.

That's the tricky bit, I haven't progressed much in 6 years. :(

Cheers,
Rupert

0
Reply Rupert 9/9/2004 12:20:55 AM

Stephen Fuld wrote:

[SNIP]

> BTW, I hadn't intended to get into this with this little thought through on
> my part.  I appreciate your indulgence and civility with what may be a hair
> brained idea.

This harebrained idea has turned up several times before. Andy Glew,
myself and probably others have taken swipes at it over the past few
years.

My sticking point with it is overhead (that's before I even get into
the tarpit that is automagic resource management for parallel
workloads). Any way you slice it: the overhead will limit the
granularity, and the granularity pretty much defines what kinds
of problems you can tackle.

Personally I think for a lot of typical commercial tasks this stuff
will fit nicely (just as RDBs have shown).

Cheers,
Rupert

0
Reply Rupert 9/9/2004 12:32:03 AM

In article <ch6nhf$qbn$1@pegasus.csx.cam.ac.uk>,
	nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
> In article <4136bd3e.40649206@news.eircom.net>,
> Russell Wallace <wallacethinmintr@eircom.net> wrote:
>>
>>At least as far as your typical spaghetti C++ is concerned, yeah, not
>>going to happen anytime in the near future.
>
> Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
> as bad as it gets, but the SAME applies to the cleanest of Fortran,
> if it is using the same programming paradigms.  I can't get excited
> over factors of 5-10 difference in optimisability, when we are
> talking about improvements over decades.
>
Simple...

Let's all dust off our old APL manuals, and then practically ALL of
our code will be vectorizable/parallel.

GDR,
Dale Pontius
0
Reply dale 9/9/2004 12:36:11 AM

David Gay <dgay@beryl.CS.Berkeley.EDU> wrote in message news:<s714qm8hcsy.fsf@beryl.CS.Berkeley.EDU>...
[...]

Has anyone even done JIT to native code for elisp yet? That would be
much easier, and would provide more broadly applicable performance
gains. (At the cost of portability, though there are some fairly
portable JIT systems now. And it is an active area for research.)

As to Ken's kvetch, Stefan did excerpt a very specific line from
Jouni's post which he indicated as "hilarious." Also having had a fair
bit of experience with the emacs C source and elisp code, I found
Stefan's post dead on target.

To someone who knows emacs internals, Jouni's post comes across as
naively optimistic. Much of the performance critical stuff in emacs is
in C. (E.g. the regexp matcher.) And even if one takes the Amdahl's law
hit and ignores that, many functions written in native code have
interesting side-effects such as filesystem modifications. So just
getting correctness takes a lot of work.
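
For scale, the Amdahl's law hit can be worked out directly. The function below is the standard formula; the fractions fed to it are hypothetical and are not measurements of emacs.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only parallel_fraction of the run time scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)
```

If, say, half of the run time were in code that parallelised perfectly, four cores would give at most `amdahl_speedup(0.5, 4)`, a factor of 1.6, and no number of cores could reach a factor of 2.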

Beyond the emacs specific part, lisp dialects vary in how amenable
they are to automatic parallel execution. Even in the best of cases,
completely automatic exploitation of multiprocessor hardware has not
been widely used. Usually some sort of programmer visible concurrency
is exposed, such as futures.
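
As an illustration of programmer-visible concurrency of this kind, here is a minimal futures sketch using Python's standard `concurrent.futures` module rather than any Lisp dialect; `word_count` and `count_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Work item with no side effects, so it is safe to run concurrently."""
    return len(text.split())


def count_all(buffers):
    """Start one future per buffer, then collect the values in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(word_count, b) for b in buffers]  # returns at once
        return [f.result() for f in futures]  # blocks until each value is ready
```

The programmer, not the compiler, decides which calls are independent enough to hand off, which is the point being made about fully automatic exploitation.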

There are likely designs for editors and word processors that can
profitably use multiprocessors. E.g. partitioning user input, disk I/O,
source parsing, and display tasks into threads. Though this type of
structure is not going to happen via automatic parallelization of
dusty deck codes.

-Z-
0
Reply googlenews 9/9/2004 12:52:55 AM

Nick Maclaren wrote:
> In article <afw%c.7042$837.3968@newssvr31.news.prodigy.com>,
> Bill Davidsen <davidsen@darkstar.prodigy.com> writes:
> |> Nick Maclaren wrote:
> |> 
> |> > Not merely do people sweat blood to get such parallelism, they
> |> > often have to change their algorithms (sometimes to ones that are
> |> > less desirable, such as being less accurate), and even then only
> |> > SOME problems can be parallelised.
> |> 
> |> I think you are looking at huge problems, when the money is on the 
> |> desktop. ...
> 
> You weren't following the thread.  I and others were pointing out
> that small-scale process-level parallelism is useful on the desktop,
> but serious parallelisation of applications is a wide blue yonder
> project.  The context of the above is where I was telling someone
> that the fact that HPC applications have been parallelised does not
> mean that desktop ones can easily follow.

Why do they need to be? The typical desktop is running multiple threads 
most of the time. At a minimum the application and the kernel, but 
things like browsers do multiple things (I don't know if IE uses this, 
other browsers do). Clearly not always running many threads, but if you 
have multiple processes many CPUs are useful, even virtual ones.

Don't just think of making a single application run faster, the browser 
is the only low hanging fruit, but many things happen at once, and can 
run as separate processes as well as threads.

I follow what you say, but just breaking up a single app is not the only 
benefit.

-- 
bill davidsen (davidsen@darkstar.prodigy.com)
   SBC/Prodigy Yorktown Heights NY data center
   Project Leader, USENET news
   http://newsgroups.news.prodigy.com
0
Reply Bill 9/9/2004 4:09:43 AM

Bill Davidsen wrote:
> I follow what you say, but just breaking up a single app is not the only 
> benefit.

One (possibly) significant benefit is the increase in net L1 (and 
L2?) cache size, one per core, which will spend a higher portion 
of its time hot with the application/OS/whatever.  Assuming the OS 
knows about processor affinity.

-- 
Andrew
0
Reply Andrew 9/9/2004 6:25:13 AM

In article <chnqv0$jhm$2@grapevine.wam.umd.edu>,
David Wang <foo@bar.invalid> writes:
|> 
|> > >ftp://download.intel.com/pressroom/kits/events/idffall_
|> > >2004/otellini_presentation.pdf#page=38
|> 
|> > Most interesting.  Unfortunately, that failed to download.

I have now seen it.  Without Jim Hull's statement, I would have
regarded "lower power" as being normal executive waffle - i.e.
it didn't say what it was lower than ....

|> http://whatever.org.ar/~module/resources/computers/computer-arch/ia-64/vail_slides_2003.pdf

I am extremely impressed.  Foil 7 gives the same order of magnitude
as I got to, but my current understanding is that the power has
been reduced by 2.5-3 times below that.

From my point of view, that changes the IA64 line from something
that we would simply rule out of consideration to something that
we shall have to consider seriously.

|> > I have a very similar confusion over the IBM G5, with reliable reports
|> > of 200 watts and other ones of (if I recall) 50 watts.
|> 
|> Perhaps system power draw versus CPU:typical power draw.

Perhaps.  It makes a LOT of difference for the HPC people, where
the 'idle' mode savings typically don't help.


Regards,
Nick Maclaren.
0
Reply nmm1 9/9/2004 10:49:52 AM

In article <bIQ%c.7179$0L1.2527@newssvr31.news.prodigy.com>,
Bill Davidsen <davidsen@darkstar.prodigy.com> writes:
|> > 
|> > You weren't following the thread.  I and others were pointing out
|> > that small-scale process-level parallelism is useful on the desktop,
|> > ...
|> 
|> Don't just think of making a single application run faster, the browser 
|> is the only low hanging fruit, but many things happen at once, and can 
|> run as separate processes as well as threads.

Yes, that's what several of us had said earlier in the thread.

The consequence is that SMALL-SCALE parallelism (i.e. 2-8 way)
will be nearly universal within a few years.  LARGE-SCALE
parallelism is another matter.


Regards,
Nick Maclaren.
0
Reply nmm1 9/9/2004 10:52:07 AM

In article <r4l412xvf6.ln2@homer.edgehp.invalid>,
dale@edgehp.net () writes:
|> >
|> > Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
|> > as bad as it gets, but the SAME applies to the cleanest of Fortran,
|> > if it is using the same programming paradigms.  I can't get excited
|> > over factors of 5-10 difference in optimisability, when we are
|> > talking about improvements over decades.
|> >
|> Simple...
|> 
|> Let's all dust off our old APL manuals, and then practically ALL of
|> our code will be vectorizable/parallel.

Hmm.  Do you have a good APL Dirichlet tessellation code handy?


Regards,
Nick Maclaren.
0
Reply nmm1 9/9/2004 10:53:20 AM

> > Yes.  No problem.  I humbly submit the source of Emacs as evidence,
> > and claim that the conclusion is obvious.
> >
> > Note that Stefan Monnier did not say that Emacs could not be
> > parallelised well, at least in theory, but was responding to a
> > comment that it was going to be.
> 
> I disagree. Jouni's post began...
> 
>     I have a better reason why emacs is a great candidate for
>     parallelization.
> 
> ...which is certainly starting from a "could" rather than "would"
> viewpoint.
> 
>     It's written in lisp, and in reality it's a lisp operating system
>     with an embedded word processor included as a major app in it. Now
>     the lisp code could be auto-parallelized by an auto-parallelizing
>     compiler. So you would need to do some work to improve the underlying
>     lisp compiler/OS to handle multiprocessing needs.
> 
> Here he makes a specific supporting argument for his claim. When I
> asked for rebuttals, I was rather hoping that someone would address
> this one. Auto-parallelisation of Lisp may be significantly easier
> than the same task for C (which I happily accept hasn't really
> happened yet, despite efforts) so emacs may be much better placed
> than "the average app".
> 
>     BTW: I think that EMACS is going to be one of the desktop
>     applications that are going to be parallelized well. [If it
>     hasn't already.]
> 
> OK, here he switches to "could" mode, but if he blows both ways in the
> same post I think it's unfair to claim he went in just one direction.
> 
>     Simply because parallelizing it is a geeky enough trick that someone
>     in OSS development may want to do it just for the kicks [...]
> 
> Here's a second line of argument, differentiating emacs from the average
> app. It is surely undeniable that "cult" OSS software gets ported and
> twisted in far more ways than its intrinsic quality would justify. If
> I had to place money on which applications would get ported first and
> best to any new architecture, I'd bet on emacs and GNU C.

Okay. Engrish is my 2nd language and Finnish is my first, AND my
expression of ideas is not the clearest. In team work I typically have
to spend over 10 hours talking before other students get how my
algorithms work, and that is live, with pen & paper as an aid, in my
native language, even though they are excellent students near graduation
who DO coding as part of their studies. [Or is the problem that my
algorithms are so weird that others have a hard time understanding them?]

Let's make some simple claims.
a)
I think LISP is great for parallelization.
b)
The Emacs operating system has several applications running on top of
it, and at least SOME of them would benefit from parallelized lisp
execution.
c)
Someone is going to write a parallelized Lisp interpreter (or JIT) for
elisp, just for the kicks, after desktop multiprocessing becomes
mainstream.
d)
After that, others will improve the underlying lisp code for better
parallel execution IF there is a need for performance in that code.

Now I don't claim WHEN c happens, or how quickly d is going to happen,
nor that ALL the code is going to be parallelized. Heck, after reading
the posts of the other people on this matter, it might be that very
little current lisp code is going to be useful for parallelization. But
after the parallelization back end has been done, there will be gradual
improvement in that matter, or great jumps in different areas. And new
elisp applications will be written in a more functional form, perhaps
even a DOOM clone written in elisp that parallelizes over n
processors ;)


Jouni
0
Reply josmala 9/9/2004 12:20:10 PM

nmm1@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<chiael$bb5$1@pegasus.csx.cam.ac.uk>...
> In article <oz%_c.317571$OB3.179975@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
> >
> >"Jouni Osmala" <josmala@cc.hut.fi> wrote in message
> >news:9538122f.0409052324.3ce9651c@posting.google.com...
> >
> >> There are already companies that use internal parallel languages for
> >> their consumer products to cope with SSE, 3Dnow, and SMP. There ARE
> >> parallel languages that are easy to use for application development.
> >
> >Can you give some examples of languages in each of these categories?  And
> >speculate about why, if they are easy to use and make parallel programming
> >much easier, they aren't the "standard" for high performance
> >computing?
> 
> Or even used significantly in that area!  Yes, PLEASE tell me about
> those languages, as it really is rather relevant to my work.

Those things are used as sub-blocks by several BIG companies. The
company that delivers them wasn't interested in making a parallel
language, just something that made expressing their problems easier,
and the result was something that could parallelize well as long as
you kept everything in a single memory space. And I doubt that they want
their corporate secrets leaked via something that came up in our
university cafe discussion after a lecture one of their researchers
gave. There are limitations still: it won't scale to big systems,
because it cannot cope with multiple memory domains, but it scales
excellently with SMT, CMP and vector extensions, and can mix them
in any way needed to handle the code. [Now that's part of the reason
I'm optimistic about CMP, while not too optimistic about clusters.] And I
remember very little about it, as it was years ago, but they still
actively licence the product they used it with.
I must still reiterate my view of parallelism on the desktop as the
future. Utilizing CMP is much easier than clusters, simply because the
synchronization latencies are 3 orders of magnitude smaller compared to
Myrinet, for instance, and you can use shared memory where useful.

Jouni Osmala
0
Reply josmala 9/9/2004 1:14:46 PM

dale@edgehp.net () wrote in message news:<r4l412xvf6.ln2@homer.edgehp.invalid>...
> In article <ch6nhf$qbn$1@pegasus.csx.cam.ac.uk>,
> 	nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
> > In article <4136bd3e.40649206@news.eircom.net>,
> > Russell Wallace <wallacethinmintr@eircom.net> wrote:
> >>
> >>At least as far as your typical spaghetti C++ is concerned, yeah, not
> >>going to happen anytime in the near future.
> >
> > Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
> > as bad as it gets, but the SAME applies to the cleanest of Fortran,
> > if it is using the same programming paradigms.  I can't get excited
> > over factors of 5-10 difference in optimisability, when we are
> > talking about improvements over decades.
> >
> Simple...
> 
> Let's all dust off our old APL manuals, and then practically ALL of
> our code will be vectorizable/parallel.

Why not buy a book on Fortran 95 and learn about the array and
ELEMENTAL functions? There are many commercial compilers and a free
compiler called G95 for Linux and Unix at http://www.g95.org .
0
Reply beliavsky 9/9/2004 4:47:37 PM

"Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
message news:1094689922.364659@teapot.planet.gong...
> Stephen Fuld wrote:
>
> [SNIP]
>
> > BTW, I hadn't intended to get into this with this little thought through on
> > my part.  I appreciate your indulgence and civility with what may be a hair
> > brained idea.
>
> This harebrained idea has turned up several times before. Andy Glew,
> myself and probably others have taken swipes at it over the past few
> years.
>
> My sticking point with it is overhead (that's before I even get into
> the tarpit that is automagic resource management for parallel
> workloads). Anyway you slice it : the overhead will limit the
> granularity, and the granularity pretty much defines what kinds
> of problems you can tackle.

Yes, agreed.  That is why my thoughts include trying to come up with some
kind of low overhead message passing system to reduce that overhead.  I
remember the Elxsi system was aimed at HPC type applications and had
hardware support for interprocess message passing.  And I have mentioned the
transputer, which, I gather, had hardware/microcode support for something
similar.  I think a key is to limit message length to minimize resource
overcommitment, and handle the common cases without any OS intervention,
but perhaps have some when the queues get large, so that you can
prevent overflows, etc.  But again, I am just musing here.

I also want to make clear that when I am talking "transactions" I am talking
many thousands to millions of instructions.  I know that does put a lower
limit on the kinds of granularity you can reasonably have, but it also
limits the overhead.  Think in terms of a few hundred microseconds to a few
milliseconds of CPU time per "transaction".
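To put rough numbers on that trade-off, here is a minimal sketch of the arithmetic. The 10 microsecond per-transaction dispatch cost is purely an assumed figure for illustration, not a measurement of any real system, and `overhead_fraction` is a made-up helper name.

```python
def overhead_fraction(work_us: float, overhead_us: float) -> float:
    """Fraction of total time spent on per-transaction overhead."""
    return overhead_us / (work_us + overhead_us)


# Hypothetical fixed cost of dispatching one transaction, in microseconds.
ASSUMED_OVERHEAD_US = 10.0

# Transaction sizes from very fine-grained up to a few milliseconds of CPU.
for work_us in (10, 100, 1000, 5000):
    share = overhead_fraction(work_us, ASSUMED_OVERHEAD_US)
    print(f"{work_us:>5} us of work: {share:.1%} of the time is overhead")
```

With a fixed cost like this, the overhead share falls quickly as the transaction grows, which is the reason for keeping transactions in the hundreds-of-microseconds range or above.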

> Personally I think for a lot of typical commercial tasks this stuff
> will fit nicely (just as RDBs have shown).

And long before RDBs.  Think airline res systems since the 1960s.  It is the
success with that kind of workload that gives me hope the same ideas can be
used to help HPC.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/9/2004 5:15:46 PM

"Jouni Osmala" <josmala@cc.hut.fi> wrote in message
news:9538122f.0409090514.7d71c212@posting.google.com...

snip

> Those things are used as sub-blocks by several BIG companies. The
> company that delivers them wasn't interested in making a parallel
> language, just something that made expressing their problems easier,
> and the result was something that could parallelize well as long as
> you kept everything in a single memory space. And I doubt that they want
> their corporate secrets leaked via something that came up in our
> university cafe discussion after a lecture one of their researchers
> gave. There are limitations still: it won't scale to big systems,
> because it cannot cope with multiple memory domains, but it scales
> excellently with SMT, CMP and vector extensions, and can mix them
> in any way needed to handle the code. [Now that's part of the reason
> I'm optimistic about CMP, while not too optimistic about clusters.] And I
> remember very little about it, as it was years ago, but they still
> actively licence the product they used it with.

OK, given that you can't violate even an implied NDA, can you at least tell
us the name of the product that the company still licences (while still not
telling us anything about how it is written)?  Perhaps that might help us to
progress further in the discussion.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam




> I must still reiterate my view of parallelism on the desktop as the
> future. Utilizing CMP is much easier than clusters, simply because the
> synchronization latencies are 3 orders of magnitude smaller compared to
> Myrinet, for instance, and you can use shared memory where useful.
>
> Jouni Osmala


0
Reply Stephen 9/9/2004 5:15:51 PM

> Has anyone even done JIT to native code for elisp yet?  That would be
> much easier, and would provide more broadly applicable performance
> gains. (At the cost of portability, though there are some fairly
> portable JIT systems now. And it is an active area for research.)

The problem is that elisp is a very dynamic language.  E.g. dynamic scoping
together with buffer-local variables makes most optimizations very difficult
to perform.
And a naive approach results in very disappointing speedups (because the
interpretive overhead is often dwarfed by the slowness of even the most
basic operations such as "get the value of variable `foo'" or "create new
local var `bar'").
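A toy model may make the cost of a variable read more concrete. This is only a sketch of a naive dynamically scoped lookup with per-buffer overrides; it is not how Emacs actually implements variables, and the class name, the lookup order and the example variable are all assumptions made for illustration.

```python
from contextlib import contextmanager


class ToyDynamicEnv:
    """Naive dynamic scoping plus per-buffer variable overrides (toy model)."""

    def __init__(self):
        self.globals = {}            # default values
        self.stack = []              # dynamic bindings, innermost last
        self.buffer_locals = {}      # buffer name -> {variable: value}
        self.current_buffer = "scratch"

    @contextmanager
    def let(self, name, value):
        # A dynamic binding is visible to every function called inside the
        # block, not just to code written lexically inside it.
        self.stack.append((name, value))
        try:
            yield
        finally:
            self.stack.pop()

    def lookup(self, name):
        # Every read has to consult run-time state; nothing here can be
        # resolved to a fixed slot at compile time.
        local = self.buffer_locals.get(self.current_buffer, {})
        if name in local:
            return local[name]
        for bound_name, value in reversed(self.stack):
            if bound_name == name:
                return value
        return self.globals[name]


env = ToyDynamicEnv()
env.globals["fill-column"] = 70


def report():
    # Which value this returns depends entirely on the caller's context.
    return env.lookup("fill-column")
```

In this model the result of `report()` changes with the enclosing `let` and with the current buffer, so a compiler cannot replace the lookup with a register or a stack slot without first proving that no caller rebinds the variable.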


        Stefan
0
Reply Stefan 9/9/2004 6:05:12 PM

Stephen Fuld wrote:
> "Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
> message news:1094689922.364659@teapot.planet.gong...
> 
>>Stephen Fuld wrote:
>>
>>[SNIP]
>>
>>
>>>BTW, I hadn't intended to get into this with this little thought through on
>>>my part.  I appreciate your indulgence and civility with what may be a hair
>>>brained idea.
>>
>>This harebrained idea has turned up several times before. Andy Glew,
>>myself and probably others have taken swipes at it over the past few
>>years.
>>
>>My sticking point with it is overhead (that's before I even get into
>>the tarpit that is automagic resource management for parallel
>>workloads). Anyway you slice it : the overhead will limit the
>>granularity, and the granularity pretty much defines what kinds
>>of problems you can tackle.
> 
> 
> Yes, agreed.  That is why my thoughts include trying to come up with some
> kind of low overhead message passing system to reduce that overhead.  I
> remember the Elxsi system was aimed at HPC type applications and had
> hardware support for interprocess message passing.  And I have mentioned the
> transputer, which, I gather, had hardware/microcode support for something
> similar.  I think a key is to limit message length to minimize resource
> overcommitment, and handle the common cases without any OS intervention,
> but perhaps have some when the queues get large, so that you can
> prevent overflows, etc.  But again, I am just musing here.

I've been trying to find the paper that I found online that described
how the T4/8 series did IPC (local and remote). I'm damned if I am
going to copy the bloody databook again (I doubt anyone here would
learn much from it if I did).

Transputers didn't queue. It was a rendezvous kind of mechanism. The
sender process would block until the receiver was ready. When the
receiver was ready it would start copying data into its buffers. If
the receiver didn't read enough data the sender would remain blocked
(trying to send). If the receiver tried to read too much data it
would also block waiting for the extra bytes. I don't think that is
a big deal as long as you provide facilities for terminating those
processes. OCCAM helped a lot by checking what was being sent and
what was expected. :)
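The rendezvous behaviour described above can be sketched in software. This is a minimal illustration with Python threads, not a model of the Transputer hardware: it passes one whole value per rendezvous and ignores the byte-count matching mentioned above, and the class name is made up for the example.

```python
import threading


class RendezvousChannel:
    """Unbuffered channel: send() returns only after a receiver took the value."""

    def __init__(self):
        self._send_lock = threading.Lock()      # one sender in the slot at a time
        self._offered = threading.Semaphore(0)  # signalled when a value is offered
        self._taken = threading.Semaphore(0)    # signalled when it has been copied
        self._slot = None

    def send(self, value):
        with self._send_lock:
            self._slot = value
            self._offered.release()   # tell a receiver a value is waiting
            self._taken.acquire()     # block until the receiver has copied it

    def recv(self):
        self._offered.acquire()       # block until some sender offers a value
        value = self._slot
        self._slot = None
        self._taken.release()         # let the blocked sender continue
        return value


def demo():
    chan = RendezvousChannel()
    producer = threading.Thread(target=lambda: [chan.send(i) for i in range(3)])
    producer.start()
    received = [chan.recv() for _ in range(3)]
    producer.join()
    return received
```

There is no queue anywhere in this sketch: the only storage is the single slot, and the sender cannot run ahead of the receiver. Anything like buffering would have to be built on top by the application.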

In general I think queueing is a bad idea, let the app programmer do
that if they want it. All that's needed is a way of passing data from
one process to another and a way for the receiving process to know that
there is data waiting for it (or that something went horribly wrong).

I think you could classify the Transputer IPC stuff as being "RISC"
in concept. It didn't do a whole load of fancy stuff at all, it was
kept as simple as possible. While I'd love to see this happen in
this day and age with low $/port, I wonder if I might be asking for
too much. Perhaps the b/w & latency requirements really do demand
very complex protocols implemented by H/W.

> I also want to make clear that when I am talking "transactions" I am talking
> many thousands to millions of instructions.  I know that does put a lower
> limit on the kinds of granularity you can reasonably have, but it also
> limits the overhead.  Think in terms of a few hundred microseconds to a few
> milliseconds of CPU time per "transaction".

OK, that explains a bit. FWIW that's the way I've been heading. :)

>>Personally I think for a lot of typical commercial tasks this stuff
>>will fit nicely (just as RDBs have shown).
> 
> 
> And long before RDBs.  Think airline res systems since the 1960s.  It is the
> success with that kind of workload that gives me hope the same ideas can be
> used to help HPC.

I don't know enough about HPC workloads, which is why I keep sifting
through Toone & Nick's posts to get some hints because occasionally
they do let slip some useful info. :)

Cheers,
Rupert

0
Reply Rupert 9/9/2004 6:47:24 PM

"Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
message news:1094755645.912389@teapot.planet.gong...
> Stephen Fuld wrote:

snipped some details on how transputers work, for which I am very thankful.

> I think you could classify the Transputer IPC stuff as being "RISC"
> in concept. It didn't do a whole load of fancy stuff at all, it was
> kept as simple as possible. While I'd love to see this happen in
> this day and age with low $/port, I wonder if I might be asking for
> too much.

Why?  One could easily have more than 4 ports per chip and the speed would
naturally be a lot faster.  In fact, I suppose one could develop a
peripheral chip with the ports to avoid having to design it into the CPU,
but that would add some latency.

> Perhaps the b/w & latency requirements really do demand
> very complex protocols implemented by H/W.

Perhaps.  But the bandwidth requirement can be minimized by judicious
restrictions on message length.  One hopes one is not sending whole arrays
across the links!

> > I also want to make clear that when I am talking "transactions" I am talking
> > many thousands to millions of instructions.  I know that does put a lower
> > limit on the kinds of granularity you can reasonably have, but it also
> > limits the overhead.  Think in terms of a few hundred microseconds to a few
> > milliseconds of CPU time per "transaction".
>
> OK, that explains a bit. FWIW that's the way I've been heading. :)

Something about minds and the same gutter.  :-)

> >>Personally I think for a lot of typical commercial tasks this stuff
> >>will fit nicely (just as RDBs have shown).
> >
> >
> > And long before RDBs.  Think airline res systems since the 1960s.  It is the
> > success with that kind of workload that gives me hope the same ideas can be
> > used to help HPC.
>
> I don't know enough about HPC workloads, which is why I keep sifting
> through Toone & Nick's posts to get some hints because occasionally
> they do let slip some useful info. :)

Once again, agreed!  It seems that there should be a reasonably agreed upon
taxonomy of HPC application types that may be useful for at least discussion
purposes.  I am getting bits and pieces of it, but haven't seen a complete,
or even nearly complete one.  For example, processing of each element in a
regular array across many time steps (with varying degrees of interaction
among local and distant array elements).  Processing sparse or irregular
arrays (with some more subdivisions), etc.  Again, I am just musing here
and would appreciate it if someone who really knows this stuff added more
information.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/9/2004 7:14:08 PM

In article <1094755645.912389@teapot.planet.gong>,
Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:

>In general I think queueing is a bad idea, let the app programmer do
>that if they want it. All that's needed is a way of passing data from
>one process to another and a way for the receiving process to know that
>there is data waiting for it (or that something went horribly wrong).

Experience has shown that actual programs need more than this. MPI
didn't come about just because people wanted one way to pass messages,
it also helped to provide means to write correct programs that ran on
a variety of hardware. As an example, you need to be able to live with
finite memory. And have sequencing.

-- greg

0
Reply lindahl 9/9/2004 7:39:02 PM

In article <4Y10d.336514$OB3.232486@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:

>Why?  One could easily have more than 4 ports per chip and the speed would
>naturally be a lot faster.

Neither half of this sentence is true. They cost money, and in some
implementations (especially when you're trying to save money), they
have a performance hit.

>Once again, agreed!  It seems that there should be a reasonably agreed upon
>taxonomy of HPC application types that may be useful for at least discussion
>purposes.

Indeed; have you tried looking in the literature?

-- greg

0
Reply lindahl 9/9/2004 7:50:28 PM

"Greg Lindahl" <lindahl@pbm.com> wrote in message
news:4140b404$1@news.meer.net...
> In article <4Y10d.336514$OB3.232486@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
>
> >Why?  One could easily have more than 4 ports per chip and the speed would
> >naturally be a lot faster.
>
> Neither half of this sentence is true. They cost money, and in some
> implementations (especially when you're trying to save money), they
> have a performance hit.
>
> >Once again, agreed!  It seems that there should be a reasonably agreed upon
> >taxonomy of HPC application types that may be useful for at least discussion
> >purposes.
>
> Indeed; have you tried looking in the literature?
>
> -- greg

One can put some sort of proprietary high speed link or links on a processor
chip for relatively little silicon area/pins.  Said link would achieve
speeds in the 3-5 Gbit/second/differential pair range for distances of  a
few feet to a few meters depending on the details of the interconnect
construction.  This does not include a significant amount of buffering or
protocol logic.

Or one could put such links in the "chipset" hub, ala IBM and Intel with
their scalability port designs, or Cray with Black Widow or whatever.  It
seems to make little sense to put the links on the processor chip unless the
memory is also attached to the processor.

del cecchi
>


0
Reply Del 9/9/2004 8:14:45 PM

In article <2qbrtmFtnqbpU1@uni-berlin.de>,
Del  Cecchi <cecchinospam@us.ibm.com> wrote:
>
>One can put some sort of proprietary high speed link or links on a processor
>chip for relatively little silicon area/pins.  Said link would achieve
>speeds in the 3-5 Gbit/second/differential pair range for distances of  a
>few feet to a few meters depending on the details of the interconnect
>construction.  This does not include a significant amount of buffering or
>protocol logic.

Experience is that, for HPC work, you need very little such logic
(but it had better be the RIGHT logic, and it and the software had
better speak the same language).

>Or one could put such links in the "chipset" hub, ala IBM and Intel with
>their scalability port designs, or Cray with Black Widow or whatever.  It
>seems to make little sense to put the links on the processor chip unless the
>memory is also attached to the processor.

I would put it more strongly.  You are bonkers to put it anywhere
other than in the "memory controller" layer.  I.e. communication
is a memory access function, and not a computation one.


Regards,
Nick Maclaren.
0
Reply nmm1 9/9/2004 8:36:44 PM

>>Those things are used as sub-blocks by several BIG companies. The
>>company that delivers them wasn't interested in making a parallel
>>language, just something that made expressing their problems easier,
>>and the result was something that could parallelize well as long as
>>you kept everything in a single memory space. And I doubt that they want
>>their corporate secrets leaked via something that came up in our
>>university cafe discussion after a lecture one of their researchers
>>gave. There are limitations still: it won't scale to big systems,
>>because it cannot cope with multiple memory domains, but it scales
>>excellently with SMT, CMP and vector extensions, and can mix them
>>in any way needed to handle the code. [Now that's part of the reason
>>I'm optimistic about CMP, while not too optimistic about clusters.] And I
>>remember very little about it, as it was years ago, but they still
>>actively licence the product they used it with.
> 
> 
> OK, given that you can't violate even an implied NDA, can you at least tell
> us the name of the product that the company still licences (while still not
> telling us anything about how it is written)?  Perhaps that might help us to
> progress further in the discussion.

Perhaps I should rather ask them if they want to talk about it.
And after a little thinking I'm not 100% certain about how much they
used it. I remember what he told me about the benefits of doing it and
a little about what kind of language it is. But as it's a VERY small
company with a relatively big product portfolio, and BIG companies
licensing from them, they were resource limited, so I'm not sure whether
A) they utilize it as a competitive advantage and keep it a secret, or
B) it was a pure research project that got great results for a while,
but wasn't used later.

Although they seem to be VERY open about their other technology, I
haven't seen anything about this besides the conversation I was in after
the lecture. The problem with identifying a single specific product is
that I don't want to disclose the company unless they want to speak
about the thing. The examples he gave me of how great the language was
were mostly about how certain things related to their products were
written with minimal hassle in it, so it was implied that they actually
used it, especially since they didn't have too many coders. They still
license the products they made then, and something hints that they do
use the language in their later products. But I cannot remember if he
expressly said that it was used in a specific product, though its
usefulness for several things a couple of their products did was
expressed.

Perhaps I should STOP talking about all of these great things and
instead make the things I'm so optimistic about happen, since other
people are so pessimistic about whether anything will happen there, AND
there is real commercial benefit in making them happen.

Jouni Osmala
-Yes, Nick gave a push for starting to implement new code instead of 
just talking about things, but Comp.arch should be on the list of banned 
substances ;)
0
Reply Jouni 9/9/2004 8:43:11 PM

Greg Lindahl wrote:
> In article <1094755645.912389@teapot.planet.gong>,
> Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:
> 
> 
>>In general I think queueing is a bad idea, let the app programmer do
>>that if they want it. All that's needed is a way of passing data from
>>one process to another and a way for the receiving process to know that
>>there is data waiting for it (or that something went horribly wrong).
> 
> 
> Experience has shown that actual programs need more than this. MPI
> didn't come about just because people wanted one way to pass messages,
> it also helped to provide means to write correct programs that ran on
> a variety of hardware. As an example, you need to be able to live with
> finite memory. And have sequencing.

I'm thinking of the barest primitives. There's nothing stopping an
app/library/OS from building on top of those basics. I accept that
you may *want* HW & OS support for queueing.

I'll take a really good long look at the MPI-2 spec and see what's
going on with it. Last time I gave it any serious scrutiny it felt
quite immature and folks were still having their Cray vector
machines pried from their cold dead fingers.

Cheers,
Rupert

0
Reply Rupert 9/9/2004 9:54:27 PM

Greg Lindahl wrote:
> In article <1094755645.912389@teapot.planet.gong>,
> Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:
> 
> 
>>In general I think queueing is a bad idea, let the app programmer do
>>that if they want it. All that's needed is a way of passing data from
>>one process to another and a way for the receiving process to know that
>>there is data waiting for it (or that something went horribly wrong).
> 
> 
> Experience has shown that actual programs need more than this. MPI
> didn't come about just because people wanted one way to pass messages,

.... Having said that, reading chapter 6 of the MPI-2 spec it says in
section 6.1, line 32 :

"Message passing communication achieves two effects: communication of
data from sender to receiver; and synchronization of sender with
receiver."

Not entirely sure I can agree with that on the basis of my current
understanding of MPI-2 message passing. Non-blocking primitives
break the synchronization property by some intuitive definitions.
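A small sketch of the point about non-blocking sends. This uses a Python queue as a stand-in for a buffered message layer; it is not the MPI API (there, a call such as MPI_Isend returns before the transfer completes and is completed later with a call such as MPI_Wait), and the function name is made up for the example.

```python
import queue


def buffered_exchange():
    """Show a send completing before any receive has been posted."""
    mailbox = queue.Queue()      # buffer sitting between sender and receiver
    events = []

    mailbox.put_nowait("payload")        # "non-blocking send": returns at once
    events.append("send returned")       # sender carries on, receiver not involved yet

    value = mailbox.get()                # the receive happens some time later
    events.append("receive completed")
    return value, events
```

Here the data still gets from sender to receiver, but the return of the send tells the sender nothing about where the receiver is. Only the receiver learns something about ordering, namely that the send happened before its receive completed, so the synchronization is one-sided at best.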

The thing which strikes me about MPI is that it's really having to
jump through some huge hoops (and force the programmer to as well)
in order to be portable across languages...

I had a whacky idea a long time back that perhaps you might be able
to do a kind of OCCAM harness and embed bits of other languages
in it, like you do with C & inline assembler... In an ideal world
you would get the clarity, conciseness and safety of OCCAM, yet you
can re-use existing code from other languages fairly trivially too.

> it also helped to provide means to write correct programs that ran on
> a variety of hardware. As an example, you need to be able to live with
> finite memory. And have sequencing.

I can see how it would help with that. I'm going back to reading
the spec now. :)

Cheers,
Rupert

0
Reply Rupert 9/9/2004 11:35:06 PM

In article <1094772909.516103@teapot.planet.gong>,
Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:

>The thing which strikes me about MPI is that it's really having to
>jump through some huge hoops (and force the programmer to as well)
>in order to be portable across languages...

That's not the only thing it provides. If you don't understand that,
then you can't make a good criticism of it.

-- greg
0
Reply lindahl 9/9/2004 11:55:42 PM

Greg Lindahl wrote:

> In article <1094772909.516103@teapot.planet.gong>,
> Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:
> 
> 
>>The thing which strikes me about MPI is that it's really having to
>>jump through some huge hoops (and force the programmer to as well)
>>in order to be portable across languages...
> 
> 
> That's not the only thing it provides. If you don't understand that,
> then you can't make a good criticism of it.

Steady on Greg, I'm still reading the spec. I was focussing on that
partly because you highlighted the multiplatform nature of it, but
also because I figure that requirement must have a fairly major
impact on the design and implementation of MPI.

FWIW I'm getting into the meatier stuff now and soon I'll be hunting
for some code to see what people are actually doing with it. Can you
recommend any particular example of code that you consider important
& representative that I will be able to get my hands on *without*
signing NDAs/donating Kidneys etc ?

Cheers,
Rupert

0
Reply Rupert 9/10/2004 12:48:49 AM

nmm1@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<chqess$c2o$1@pegasus.csx.cam.ac.uk>...
> In article <2qbrtmFtnqbpU1@uni-berlin.de>,
> Del  Cecchi <cecchinospam@us.ibm.com> wrote:
> >
> >One can put some sort of proprietary high speed link or links on a processor
> >chip for relatively little silicon area/pins.  Said link would achieve
> >speeds in the 3-5 Gbit/second/differential pair range for distances of  a
> >few feet to a few meters depending on the details of the interconnect
> >construction.  This does not include a significant amount of buffering or
> >protocol logic.

> 
> Experience is that, for HPC work, you need very little such logic
> (but it had better be the RIGHT logic, and it and the software had
> better speak the same language).
> 
> >Or one could put such links in the "chipset" hub, ala IBM and Intel with
> >their scalability port designs, or Cray with Black Widow or whatever.  It
> >seems to make little sense to put the links on the processor chip unless the
> >memory is also attached to the processor.
> 
> I would put it more strongly.  You are bonkers to put it anywhere
> other than in the "memory controller" layer.  I.e. communication
> is a memory access function, and not a computation one.
> 
> 
> Regards,
> Nick Maclaren.



Absolutely, it would be crazy to separate link HW in the Transputer
model from the cpu chip when the links are so closely tied into the
fundamental model of pervasive Processes. They should be part of memory
HW layer, and are also part of the computation model to some degree
since they interact with the HW process scheduler and may include
routing computations of their own.

In the original Transputers the HW logic for 4 links was about the
same as the primary 32b datapath in area. That seems about the right
proportion when you want to send messages at such fine granularity.

Taking the links off chip would really turn the architecture back into
the poor process models we have now. Also of course the Transputer had
the mem interface on too, another good decision, it allowed building
glueless arrays of cpus with extra memory as an option/node.

regards

johnjakson_usa_com
0
Reply johnjakson 9/10/2004 3:20:36 AM

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message news:<p9m_c.552632$Gx4.320887@bgtnsc04-news.ops.worldnet.att.net>...
> "Scott Moore" <samiam@moorecad.com> wrote in message
> news:rzf_c.32443$_g7.1885@attbi_s52...
> 
> snip
> 
> > The reason SMP exists is that programmers don't want to change. Hillis
> > advocated the need to throw the present computing structures out with
> > the bathwater to get to "perfect" parallelism.
> >
> > I'm not arguing that the present languages are bad for parallelism.
> > Just that nobody feels like starting over, and any approach (like SMP)
> > based on the way things work now, instead of ideally, is going to
> > deliver more results if only because the state of the art is already
> > so far along.
> 
> I freely admit that I may be way off base here, but I am very much reminded
> of an analogous situation in a somewhat earlier age.  Perhaps it can best be
> described with the paraphrase "SMP considered harmful to parallel
> programming progress".  That is, SMP is like the use of the Goto statement in
> that it is very useful in modest sized applications (think perhaps quick
> and dirty) but as things scale up, neither works well and both seem to have
> unintended consequences that make further progress much harder.  Do we need
> to bite the bullet and "throw out" the SMP code, just like we mostly did
> with goto filled code and thus regress in order to make more progress later?
> I very well remember the resistance to eliminating goto, the projected cost
> in terms of inefficient programs, the cost of rewriting, etc.  But now, few
> would go back.
> 
> Just a thought, but I find it interesting.

Some good folks at U.Kent took Java to task when the weak thread model
(<v1.5) it espoused was first positioned as "threading for the rest of
us". Well, it wasn't.

The quote was something like "java classes considered harmful" on the
basis that OOP combined with the Java thread model produces one hell of a
monster of spaghetti control flow, maybe OK for a few threads, but
what about 1000s or even millions of threads?

Their answer was JavaCSP, which makes sense to me: it makes classes much
more like HW objects, only it made things a whole lot slower too. I wonder
why we don't hear more about CSP, or will it remain a euro cult thing?

On the point of HW, of course occam, being a process description language,
has also become synthesizable to HW via Handel-C. So if occam can be
used to describe SW processes and also HW processes, why can't other
HDLs, with some huge improvements, be used for the same?

I'm not too worried about gotos, it's the available thread models that
keep me away.

regards

johnjakson_usa_com
0
Reply johnjakson 9/10/2004 3:37:13 AM

In article <1094777332.490174@teapot.planet.gong>,
Rupert Pigott  <roo@try-removing-this.darkboong.demon.co.uk> wrote:
>
>Steady on Greg, I'm still reading the spec. I was focussing on that
>partly because you highlighted the multiplatform nature of it, but
>also because I figure that requirement must have a fairly major
>impact on the design and implementation of MPI.

Less than you might think, as the main languages in question were
the conceptually similar Fortran 77 and C90 (well, similar as far
as argument passing goes).

>FWIW I'm getting into the meatier stuff now and soon I'll be hunting
>for some code to see what people are actually doing with it. Can you
>recommend any particular example of code that you consider important
>& representative that I will be able to get my hands on *without*
>signing NDAs/donating Kidneys etc ?

ScaLAPACK.  I don't recommend it as clarifying anything.

I will send you a copy of my timer, which shows how to use the main
MPI-1 facilities, but is not an 'application'.  What it may also do
is help to show how MPI can be made pretty solid, as far as error
detection goes.


Regards,
Nick Maclaren.
0
Reply nmm1 9/10/2004 7:53:57 AM

In comp.arch Jouni Osmala <josmala@cc.hut.fi> wrote:
> Okay. As Engrish is my 2nd language and Finnish is my first, AND my
> expression of ideas is not the clearest: in team work I typically have
> to spend over 10 hours talking with other students before they get how my
> algorithms work, and that's live, with pen & paper as aids, in my native
> language, even if they are excellent students near graduating who DO
> coding as part of their studies. [Or is the problem that my algorithms
> are so weird that others have a hard time understanding them?]
> 
> Let's make some simple claims.
> a)
> I think LISP is great for parallelization.

Many dialects of Lisp are not. elisp is very probably
one such.

> b) 
> Emacs operating system has several applications running on top of it,
> and at least SOME of them benefit from parallelized lisp execution.

s/benefit/might benefit/

> c)
> Someone is going to write a parallelized Lisp interpreter (or JIT) just
> for kicks for eLisp after desktop multiprocessing becomes
> mainstream.

See, you need a parallelised eLisp engine; just a "generic" Lisp one
won't do you any good. The Lisps are legion.

> d) 
> After that, others will improve the underlying Lisp code for
> better parallel execution IF there is a need for performance in that code.
> 
> Now, I don't claim WHEN c) happens, or how quickly d) is going to
> happen, nor that ALL the code is going to be parallelized. Heck, it
> might be, after reading the posts of those other people on this
> matter, that very little current Lisp code is going to be useful for
> parallelization, but after the parallelization back end has been done,
> there will be gradual improvement in that matter, or great jumps in
> different areas. And new elisp applications written in a more functional
> form, perhaps even a DOOM clone written in elisp that parallelizes across n
> processors ;)

You have left the real world and gotten way lost in the dream one.

> 
> 
> Jouni

-- 
	Sander

+++ Out of cheese error +++
0
Reply Sander 9/10/2004 8:58:30 AM

Rupert Pigott wrote:
> "Message passing communication achieves two effects: communication of
> data from sender to receiver; and synchronization of sender with
> receiver."
> 
> Not entirely sure I can agree with that on the basis of my current
> understanding of MPI-2 message passing. Non-blocking primitives
> break the synchronization property by some intuitive definitions.

One-way message passing doesn't need synchronization. If you want
synchronization with non-blocking or buffered message passing, use round
trips (acknowledges).
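
A minimal sketch of the round-trip idea, assuming Python's standard `queue` and `threading` modules stand in for a reliable channel. The names `receiver`, `send_with_ack` and `demo` are made up for illustration and are not part of MPI or any other library. The send itself returns immediately; the sender only synchronizes when it chooses to wait for the acknowledgement.

```python
import queue
import threading

def receiver(data_ch, ack_ch, received):
    """Consume messages until the None sentinel, acknowledging each one."""
    while True:
        msg = data_ch.get()
        if msg is None:
            break
        received.append(msg)
        ack_ch.put(("ack", msg))        # the return leg of the round trip

def send_with_ack(data_ch, ack_ch, msg, timeout=5.0):
    """Buffered send followed by an explicit wait for the acknowledgement."""
    data_ch.put(msg)                    # returns at once: the queue is unbounded
    return ack_ch.get(timeout=timeout)  # the synchronization point

def demo():
    data_ch, ack_ch, received = queue.Queue(), queue.Queue(), []
    worker = threading.Thread(target=receiver, args=(data_ch, ack_ch, received))
    worker.start()
    acks = [send_with_ack(data_ch, ack_ch, m) for m in ("a", "b", "c")]
    data_ch.put(None)                   # tell the receiver to stop
    worker.join()
    return received, acks
```

Dropping the `ack_ch.get` call turns this back into pure one-way passing, at which point the sender no longer knows when (or whether) the receiver has seen the message.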

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
0
Reply Bernd 9/10/2004 9:07:52 AM

Sander Vesik wrote:
> In comp.arch Jouni Osmala <josmala@cc.hut.fi> wrote:
> 
>>Okay. As Engrish is my 2nd language and Finnish is my first, AND my
>>expression of ideas is not the clearest: in team work I typically have
>>to spend over 10 hours talking with other students before they get how my
>>algorithms work, and that's live, with pen & paper as aids, in my native
>>language, even if they are excellent students near graduating who DO
>>coding as part of their studies. [Or is the problem that my algorithms
>>are so weird that others have a hard time understanding them?]
>>
>>Let's make some simple claims.
>>a)
>>I think LISP is great for parallelization.
>  
> Many dialects of Lisp are not. elisp is very probably
> one such.

Ok.

>>b) 
>>Emacs operating system has several applications running on top of it,
>>and at least SOME of them benefit from parallelized lisp execution.
> 
> 
> s/benefit/might benefit/
> 
> 
>>c)
>>Someone is going to write a parallelized Lisp interpreter (or JIT) just
>>for kicks for eLisp after desktop multiprocessing becomes
>>mainstream.
> 
> 
> See, you need a parallelised eLisp engine; just a "generic" Lisp one
> won't do you any good. The Lisps are legion.

OK. I'm not specialized in eLisp; it looked like Scheme but having more 
in line things. But still, if there are 16 cores in every "normal" home 
computer, then eLisp will be extended/subsetted to something that can 
use them. (Of course, old unparallelisable code in Emacs will 
remain.) But there will be an eLisp mode that runs in parallel some time 
in the future, even if current eLisp is not parallelisable. So the transition 
will take time. Let's hope that people find some use for their 4 or 8 core 
CPUs before eLisp gets parallelized ;)


>>d) 
>>After that, others will improve the underlying Lisp code for
>>better parallel execution IF there is a need for performance in that code.
>>
>>Now, I don't claim WHEN c) happens, or how quickly d) is going to
>>happen, nor that ALL the code is going to be parallelized. Heck, it
>>might be, after reading the posts of those other people on this
>>matter, that very little current Lisp code is going to be useful for
>>parallelization, but after the parallelization back end has been done,
>>there will be gradual improvement in that matter, or great jumps in
>>different areas. And new elisp applications written in a more functional
>>form, perhaps even a DOOM clone written in elisp that parallelizes across n
>>processors ;)
> 
> 
> You have left the real world and gotten way lost in the dream one.

Why would this be a dream? There are plenty of Emacs games, including 
Elite. When Elite was a new thing, probably NO-ONE thought of making it run 
inside Emacs. But these days it's ported to it.
If Intel/AMD find that the biggest performance gain comes from increasing the 
number of cores, then the elisp version of Doom that will happen in the next 
3 decades will probably use whatever number of cores was 
available two years before its creation...

Jouni Osmala
0
Reply Jouni 9/10/2004 9:25:24 AM

In article <8g7812-80g.ln1@miriam.mikron.de>,
Bernd Paysan <bernd.paysan@gmx.de> writes:
|> 
|> One-way message passing doesn't need synchronization. If you want
|> synchronization with non-blocking or buffered message passing, use round
|> trips (acknowledges).

Yes, it does.  All that paradigm does is to separate the
synchronisation from the data transfer, thus increasing the
chance of making a logic error.  No matter HOW you cut it, there
has to be SOME way that both ends know when the transfer has
completed.

The way that most one-way designs reduce the number of basic
communication operations they need is by requiring or assuming
a serialisation model.  In neither case is this a great help
for efficiency and, if it is merely assumed, it is a disaster
for robustness.

I have NEVER seen a use of one-way message passing that was much
better than two-way, in the absence of assuming serialisation,
and if the two-way system had a decent multiple acknowledgement
primitive.

If you have a counter-example, please post.


Regards,
Nick Maclaren.
0
Reply nmm1 9/10/2004 9:51:08 AM

In article <chpcn0$d3v$1@pegasus.csx.cam.ac.uk>,
	nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
>
> In article <r4l412xvf6.ln2@homer.edgehp.invalid>,
> dale@edgehp.net () writes:
>|> >
>|> > Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
>|> > as bad as it gets, but the SAME applies to the cleanest of Fortran,
>|> > if it is using the same programming paradigms.  I can't get excited
>|> > over factors of 5-10 difference in optimisability, when we are
>|> > talking about improvements over decades.
>|> >
>|> Simple...
>|>
>|> Let's all dust off our old APL manuals, and then practically ALL of
>|> our code will be vectorizable/parallel.
>
> Hmm.  Do you have a good APL Dirichlet tesselation code handy?
>
I have two main memories of APL, both about 2.5 decades old.

To the APL programmer, every problem looks like a vector/matrix.
(To the man with a hammer, every problem looks like a nail.)

You can apply every monadic operator, in the correct sequence, to
zero, and the result is 42. (HHGTG reference)

And a few other snippets, like the general flavor of the language.
I could probably relearn it in short order, if I had a set of APL
keycaps and a manual.

Dale Pontius
0
Reply dale 9/10/2004 9:58:16 AM

Nick Maclaren wrote:

> 
> In article <8g7812-80g.ln1@miriam.mikron.de>,
> Bernd Paysan <bernd.paysan@gmx.de> writes:
> |> 
> |> One-way message passing doesn't need synchronization. If you want
> |> synchronization with non-blocking or buffered message passing, use
> |> round trips (acknowledges).
> 
> Yes, it does.  All that paradigm does is to separate the
> synchronisation from the data transfer, thus increasing the
> chance of making a logic error.  No matter HOW you cut it, there
> has to be SOME way that both ends know when the transfer has
> completed.

Send an EOT packet on the transfer side, receive an EOT packet on the
receive side. You only need to acknowledge an EOT if your connection is
unreliable.

> The way that most one-way designs reduce the number of basic
> communication operations they need is by requiring or assuming
> a serialisation model.  In neither case is this a great help
> for efficiency and, if it is merely assumed, it is a disaster
> for robustness.
> 
> I have NEVER seen a use of one-way message passing that was much
> better than two-way, in the absence of assuming serialisation,
> and if the two-way system had a decent multiple acknowledgement
> primitive.

Unreliable connections are definitely better with two-way systems (the only
alternative is to accept lost packets), but with a reliable transfer path,
you don't need acknowledgements. If you have a multipath network, you need
serialization, or at least offer enough information to the receiver that it
can serialize when necessary (example for something that doesn't have to be
serialized, but needs to know the transfer position: file transfer. Just
write the blocks as they arrive - you have to seek, but you don't have to
worry about which block comes first).
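
A small sketch of the file-transfer example, assuming each block carries its own byte offset. The function `write_blocks` and the `(offset, payload)` tuple layout are hypothetical, and an in-memory `io.BytesIO` stands in for the destination file. Blocks are written in whatever order they arrive; completion is detected by counting bytes rather than by ordering.

```python
import io

def write_blocks(blocks, total_size):
    """Write (offset, payload) blocks in arrival order; no reordering needed."""
    out = io.BytesIO(bytes(total_size))   # pre-sized, zero-filled "file"
    received = 0
    for offset, payload in blocks:
        out.seek(offset)                  # the seek replaces serialization
        out.write(payload)
        received += len(payload)
    # Assumes no duplicate or overlapping blocks on the reliable path.
    complete = received == total_size
    return out.getvalue(), complete
```

For example, `write_blocks([(4, b"5678"), (0, b"1234"), (8, b"9")], 9)` reassembles the nine bytes even though the middle block arrived first.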

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
0
Reply Bernd 9/10/2004 11:25:51 AM

In article <vif812-49g.ln1@miriam.mikron.de>,
Bernd Paysan <bernd.paysan@gmx.de> writes:
|> Nick Maclaren wrote:
|> 
|> > Yes, it does.  All that paradigm does is to separate the
|> > synchronisation from the data transfer, thus increasing the
|> > chance of making a logic error.  No matter HOW you cut it, there
|> > has to be SOME way that both ends know when the transfer has
|> > completed.
|> 
|> Send an EOT packet on the transfer side, receive an EOT packet on the
|> receive side. You only need to acknowledge an EOT if your connection is
|> unreliable.

Er, that assumes rather a lot!

If the receiver isn't ready or the channel is busy, the sender has
to either block or the communication channel has to buffer the whole
message.
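
A toy illustration of that choice, assuming a bounded `queue.Queue` stands in for a channel with finite buffering. `try_send` and `demo` are illustrative names only. With one slot of buffering, a second non-blocking send is refused until the receiver drains the first message, so the sender must either block or find somewhere else to keep the data.

```python
import queue

def try_send(channel, msg):
    """Attempt a non-blocking send; report whether the channel accepted it."""
    try:
        channel.put_nowait(msg)
        return True
    except queue.Full:
        return False

def demo():
    channel = queue.Queue(maxsize=1)   # room for exactly one message
    first = try_send(channel, "m1")    # accepted
    second = try_send(channel, "m2")   # refused: receiver has not drained m1
    channel.get_nowait()               # receiver finally takes m1
    third = try_send(channel, "m2")    # accepted now
    return first, second, third
```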

If you are demanding that the receiver guarantees readiness and then
waits until EOT, and that the send doesn't return until the transfer
has completed successfully, you are effectively using a two-way
communication model.

If you assume arbitrary, reliable buffering, then you are getting
into the realms of theory rather than practice.  If you are assuming
restricted buffering, then its constraints need specifying.  Not
least its synchronisation!

[[[ Consider a real network (i.e. a mesh of sorts).  A->B buffers,
A->C works, C->B works and tells B to process the message from A.
This is a COMMON cause of trouble with one-sided designs. ]]]
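
The bracketed scenario can be replayed deterministically in a few lines. This is only a sketch under stated assumptions: `Node`, `scenario` and the `"from_A"` tag are invented for illustration, and delivery order is set by hand rather than by a real network.

```python
class Node:
    """A receiver that records which tagged messages have actually arrived."""
    def __init__(self, name):
        self.name = name
        self.inbox = {}
        self.errors = []

    def deliver(self, tag, payload):
        self.inbox[tag] = payload

    def process(self, tag):
        """Act on a previously sent message; record an error if it is absent."""
        if tag not in self.inbox:
            self.errors.append(
                f"{self.name}: told to process {tag!r} before it arrived")
            return None
        return self.inbox[tag].upper()

def scenario(a_to_b_delayed):
    b = Node("B")
    in_flight = ("from_A", "payload")   # A -> B, possibly stuck in a buffer
    if not a_to_b_delayed:
        b.deliver(*in_flight)
    # A -> C and C -> B both succeed; C now tells B to go ahead.
    result = b.process("from_A")
    if a_to_b_delayed:
        b.deliver(*in_flight)           # arrives too late
    return result, b.errors
```

`scenario(False)` processes the payload normally, while `scenario(True)` records an error because B was told about a message that was still sitting in a buffer.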

|> > I have NEVER seen a use of one-way message passing that was much
|> > better than two-way, in the absence of assuming serialisation,
|> > and if the two-way system had a decent multiple acknowledgement
|> > primitive.
|> 
|> Unreliable connections are definitely better with two-way systems
|> (the only alternative is to accept lost packets), but with a
|> reliable transfer path, you don't need acknowledgements. If you
|> have a multipath network, you need serialization, or at least
|> offer enough information to the receiver that it can serialize
|> when necessary (example for something that doesn't have to be
|> serialized, but needs to know the transfer position: file
|> transfer. Just write the blocks as they arrive - you have to seek,
|> but you don't have to worry about which block comes first).

Grrk.  In a real, complex application, you don't have to deal with
JUST communication failures, but nodes stopping unexpectedly (e.g.
the user interrupts a long operation and cancels it).

By the time you have added the logic necessary to get that sort of
thing right on a one-sided model, you usually end up with as many
or more handshakes as a two-sided one.  Yes, I agree that, IF you
have a perfectly reliable network AND you can assume no asynchronous
exceptions, THEN one-sided communication can be a little easier.
But that doesn't counter my statement.


Regards,
Nick Maclaren.
0
Reply nmm1 9/10/2004 11:51:05 AM

"john jakson" <johnjakson@yahoo.com> wrote in message
news:adb3971c.0409091937.33467e55@posting.google.com...

snip

> I'm not too worried about gotos, it's the available thread models that
> keep me away.

I agree.  Part of the motivation for what I am thinking about is to get rid
of a lot of the idea of threads in programming languages.  They are very
hard to get right (for the programmer) and lead to lots of confusion.  They
work fine if confined to a few small uses, but when you try to go to many
threads on many processors, it gets really "icky".  That was the analogy
with Gotos.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/10/2004 2:37:07 PM


Stephen Fuld wrote:
> 
> "john jakson" <johnjakson@yahoo.com> wrote in message
> news:adb3971c.0409091937.33467e55@posting.google.com...
> 
> snip
> 
> > I'm not too worried about gotos, it's the available thread models that
> > keep me away.
> 
> I agree.  Part of the motivation for what I am thinking about is to get rid
> of a lot of the idea of threads in programming languages.  They are very
> hard to get right (for the programmer) and lead to lots of confusion.  They
> work fine if confined to a few small uses, but when you try to go to many
> threads on many processors, it gets really "icky".  That was the analogy
> with Gotos.
> 
> --
>  - Stephen Fuld
>    e-mail address disguised to prevent spam
0
Reply Joe 9/10/2004 3:04:34 PM


Stephen Fuld wrote:
> 
> "john jakson" <johnjakson@yahoo.com> wrote in message
> news:adb3971c.0409091937.33467e55@posting.google.com...
> 
> snip
> 
> > I'm not too worried about gotos, it's the available thread models that
> > keep me away.
> 
> I agree.  Part of the motivation for what I am thinking about is to get rid
> of a lot of the idea of threads in programming languages.  They are very
> hard to get right (for the programmer) and lead to lots of confusion.  They
> work fine if confined to a few small uses, but when you try to go to many
> threads on many processors, it gets really "icky".  That was the analogy
> with Gotos.

(Sorry about the previous empty reply.  Too many buttons in different places
in the GUI; I clicked the wrong one.)

Normally, I'd disagree but after seeing that the Apache Portable Runtime is screwed
up (in the win32 condition variable implementation and in the atomic operations)
and the Apache programmers being considered pretty competent given the install
base of Apache, I'd have to agree a little bit anyhow.

Joe Seigh
0
Reply Joe 9/10/2004 3:09:05 PM

josmala@cc.hut.fi (Jouni Osmala) writes:
> Lets make some simple claims.
> a) I think LISP is great for parallerization.

Provide some evidence.

-- 
David Gay
dgay@acm.org
0
Reply David 9/10/2004 3:46:19 PM

In article <oea812xpv8.ln2@homer.edgehp.invalid>,  <dale@edgehp.net> wrote:
>>
>I have two main memories of APL, both about 2.5 decades old.
>
>To the APL programmer, every problem looks like a vector/matrix.
>(To the man with a hammer, every problem looks like a nail.)

Yes, quite.  Which is why, when faced with the problem of unscrewing
a fitting, the solution is to smash the unit it is attached to,
thus freeing the fitting.


Regards,
Nick Maclaren.
0
Reply nmm1 9/11/2004 8:59:20 AM

Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<4141C3CE.C20C2F29@xemaps.com>...
> Normally, I'd disagree but after seeing that the Apache Portable Runtime is screwed
> up (in the win32 condition variable implementation and in the atomic operations)
> and the Apache programmers being considered pretty competent given the install
> base of Apache, I'd have to agree a little bit anyhow.

Concurrent programming is indeed difficult, but I'm not sure this is
the proper conclusion. The apr_cond_wait function on win32 is just so
obviously broken that it isn't good code that has a bug despite a lot
of effort. Rather it is just run-of-the-mill bad code. (The unit test
suite here is somewhat anemic. Assertions in the implementation would
help. Checking all error returns would help. Plus there are no
comments in the code.)

I have implemented basically the same pthreads subset APR is trying
for on Win32. I did a much better job in a very tight timeframe. (3 or
4 days. Win32 and importing Mac OS X pthreads implementation to Carbon
environment.) Of course I've read a lot of research literature on the
topic and knew enough to cop the algorithm from the DEC SRC folks.
RedHat's pthreads for win32 has a more difficult problem because they
support cancelation. (The history of that project is interesting and
largely visible on their mailing list, etc. The implementation is
quite high quality by now. If one can accept an LGPL license, it is
the way to go.)

Microsoft's synchronization API should take some blame here because it
is huge and yet still makes pthread_cond_wait difficult to implement.

Have you filed a bug report on apr's cond stuff? I see at least four
bugs in apr_cond_wait for win32 and only one of those seems to be
reported. (And the others are more serious. In particular, a thread
can signal a condition variable and then consume that signal with a
subsequent wait.) I started writing up a bug. Perhaps I'll file it
with some test cases.

-Z-
0
Reply googlenews 9/12/2004 6:55:34 AM


Zalman Stern wrote:
> 
> Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<4141C3CE.C20C2F29@xemaps.com>...
> > Normally, I'd disagree but after seeing that the Apache Portable Runtime is screwed
> > up (in the win32 condition variable implementation and in the atomic operations)
> > and the Apache programmers being considered pretty competent given the install
> > base of Apache, I'd have to agree a little bit anyhow.
> 
> Concurrent programming is indeed difficult, but I'm not sure this is
> the proper conclusion. The apr_cond_wait function on win32 is just so
> obviously broken that it isn't good code that has a bug despite a lot
> of effort. Rather it is just run-of-the-mill bad code. (The unit test
> suite here is somewhat anemic. Assertions in the implementation would
> help. Checking all error returns would help. Plus there are no
> comments in the code.)
> 
> I have implemented basically the same pthreads subset APR is trying
> for on Win32. I did a much better job in a very tight timeframe. (3 or
> 4 days. Win32 and importing Mac OS X pthreads implementation to Carbon
> environment.) Of course I've read a lot of research literature on the
> topic and knew enough to cop the algorithm from the DEC SRC folks.
> RedHat's pthreads for win32 has a more difficult problem because they
> support cancelation. (The history of that project is interesting and
> largely visible on their mailing list, etc. The implementation is
> quite high quality by now. If one can accept an LGPL license, it is
> the way to go.)

Strictly conforming Posix condvars are a little more tricky due to the
way they specified it.  Besides cancelation, you have the fact that
condvars can be destroyed immediately after signaling them. And timeout
semantics for condvars are different than timeout semantics for Posix
semaphores, requiring extra suboptimizing synchronization on the former
with no clear benefit from the different semantics.

Plus I believe there is some form of imprinting by the standard
curriculum for teaching multi-threading that leads to the use of
unnecessary counters in various synchronization implementations.
Contrast the Redhat condvar implementation here
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/pthread/?cvsroot=glibc

with the "pseudocode" version I did here
http://groups.google.com/groups?selm=4125E3D3.3E8A71A2%40xemaps.com
which is lock-free (if you don't count syscalls using spin locks) and
handles the destroy after signaling problem.  No comments but some
commentary in the follow ups.  Additional hints - spurious wakeups
are allowed but too many are unlikely given the timing windows.

> 
> Microsoft's synchronization API should take some blame here because it
> is huge and yet still makes pthread_cond_wait difficult to implement.

And pulse event is broken (at least they document it).  It can easily be
fixed but Microsoft's architects seem to be sealed up in their Fortress
of Doom with no way to communicate with them.  But that's standard for
all architecture groups.  

Douglas Schmidt documented some of the difficulty in doing this here
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
as a result, I guess, of doing the implementation of win32 condvars
in ACE.

> 
> Have you filed a bug report on apr's cond stuff? I see at least four
> bugs in apr_cond_wait for win32 and only one of those seems to be
> reported. (And the others are more serious. In particular, a thread
> can signal a condition variable and then consume that signal with a
> subsequent wait.) I started writing up a bug. Perhaps I'll file it
> with some test cases.
> 
The other bug I noticed was on broadcast the condvar could remain
signaled if enough waiter traffic kept the waiters count from going
to zero, highly likely given that condvars are typically used in
a polling loop on the condition.
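
For reference, the polling-loop idiom looks roughly like this. It is a sketch using Python's `threading.Condition`, not the APR or pthreads code under discussion, and `Mailbox` and `demo` are illustrative names. The state lives in the predicate (here, a list of items), so a signal sent while nobody is waiting is not lost and a spurious wakeup is harmless.

```python
import threading

class Mailbox:
    """A simple queue guarded by a condition variable."""
    def __init__(self):
        self._cond = threading.Condition()
        self._items = []

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get(self, timeout=5.0):
        with self._cond:
            # wait_for re-tests the predicate after every wakeup.
            if not self._cond.wait_for(lambda: bool(self._items), timeout=timeout):
                raise TimeoutError("no item arrived")
            return self._items.pop(0)

def demo():
    box = Mailbox()
    box.put("early")      # signalled while nobody is waiting
    first = box.get()     # still succeeds: the predicate holds the state

    results = []
    consumer = threading.Thread(target=lambda: results.append(box.get()))
    consumer.start()
    box.put("late")
    consumer.join()
    return first, results
```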

No, the only way I know of communicating with them seems to be through their
mailing list and I'm loathe to sign up to mailing lists given the primitive
email client I use (basically because it *is* primitive).  Mailing lists
are ancient.  I don't know why everyone doesn't use blogs with RSS feeds
these days.

Joe Seigh
0
Reply Joe 9/12/2004 12:33:23 PM

Joe Seigh wrote:
> 
> And pulse event is broken (at least they document it).  It can easily be
> fixed but Microsoft's architects seem to be sealed up in their Fortress
> of Doom with no way to communicate with them.  But that's standard for
> all architecture groups.

The only documented problem on PulseEvent that I can see is
where events may be lost during debug due to suspend/resume.
The MS docs explicitly claims this can only happen during debug.
However their explanation states the underlying cause is due to the
debugger using thread Suspend/Resume and it therefore seems that any
application using Suspend/Resume and PulseEvent would be susceptible.

(If I thought it would have any effect on MS, at this point I would
include a rant on allowing such basic design flaws to continue
to fester 12 years after product launch.)

Anyway, as most applications can avoid using Suspend/Resume
directly, I don't see how this can be claimed as a source for
any Posix implementation problems.

Eric

0
Reply Eric 9/12/2004 6:35:20 PM


Eric wrote:
> 
> Joe Seigh wrote:
> >
> > And pulse event is broken (at least they document it).  It can easily be
> > fixed but Microsoft's architects seem to be sealed up in their Fortress
> > of Doom with no way to communicate with them.  But that's standard for
> > all architecture groups.
> 
> The only documented problem on PulseEvent that I can see is
> where events may be lost during debug due to suspend/resume.
> The MS docs explicitly claims this can only happen during debug.
> However their explanation states the underlying cause is due to the
> debugger using thread Suspend/Resume and it therefore seems that any
> application using Suspend/Resume and PulseEvent would be susceptible.
> 


The documentation for PulseEvent says
  A thread waiting on a synchronization object can be momentarily removed from the
  wait state by a kernel-mode APC, and then returned to the wait state after the APC is
  complete. If the call to PulseEvent occurs during the time when the thread has been
  removed from the wait state, the thread will not be released because PulseEvent
  releases only those threads that are waiting at the moment it is called. Therefore,
  PulseEvent is unreliable and should not be used by new applications.

This could probably be fixed by making Events eventcounts internally if internal logic
doesn't preclude that for some other unknown reason.
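
A sketch of the general eventcount idea, assuming Python's `threading.Condition` as the underlying primitive. Nothing here reflects how Windows Events are actually implemented; `EventCount`, `read`, `advance` and `await_change` are illustrative names. The waiter samples a counter first and then waits for it to change, so a signal that fires in between is seen rather than lost.

```python
import threading

class EventCount:
    """A monotonically increasing counter that threads can wait on."""
    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0

    def read(self):
        with self._cond:
            return self._count

    def advance(self):
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def await_change(self, seen, timeout=5.0):
        """Block until the count differs from `seen`; return the new count."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._count != seen, timeout=timeout):
                raise TimeoutError("count did not change")
            return self._count

def demo():
    ec = EventCount()
    seen = ec.read()               # waiter samples the count
    ec.advance()                   # signal fires before the waiter blocks
    return ec.await_change(seen)   # returns at once: the signal was not lost
```

With pulse-style semantics the `advance` in `demo` would release nobody and the later wait would hang; here the changed count records that the signal happened.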


> (If I thought it would have any effect on MS, at this point I would
> include a rant on allowing such basic design flaws to continue
> to fester 12 years after product launch.)

That's why they have their Fortress of Doom.  They're safe from all our
rants.

> 
> Anyway, as most applications can avoid using Suspend/Resume
> directly, I don't see how this can be claimed as a source for
> any Posix implementation problems.

It came up in discussions about implementing win32 condvars.  I don't
remember if there were any working solutions using it but it being broken
made those solutions moot.  You could use it with SignalObjectAndWait
to make signal only or broadcast only condvars depending on whether the
Event was autoresettable or not.

Joe Seigh
0
Reply Joe 9/12/2004 7:32:11 PM

Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<414496E8.538447C0@sympaticoREMOVE.ca>...
> The only documented problem on PulseEvent that I can see is
> where events may be lost during debug due to suspend/resume.
> The MS docs explicitly claims this can only happen during debug.
> However their explanation states the underlying cause is due to the
> debugger using thread Suspend/Resume and it therefore seems that any
> application using Suspend/Resume and PulseEvent would be susceptible.

The knowledge base documents the debugger issue here:
    http://support.microsoft.com/default.aspx?scid=kb;en-us;173260

The documentation for PulseEvent gives the general issue that APC
delivery may cause event pulses to be lost:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/pulseevent.asp

Kernel-mode APC delivery covers a large class of things including
async I/O, etc. It is thus rather hard to reason about when this
failure happens and it is no surprise that PulseEvent is deprecated.

Note that debugging and debug events happen in other contexts, such as
running the application under test harness systems such as Purify or
the application verifier, etc. It is a big burden in an engineering
process to not be able to use these tools. (Plus there's Intel's new
thread profiling tool that Judi Goldstein covered at IDF. I have no
idea if PulseEvent works under such profiling or not.) It is possible
that none of these tools cause debug events to happen to a thread that
is waiting and that they do not suspend/resume threads, but my
experience in this area does not allow me to be so optimistic...

> (If I thought it would have any effect on MS, at this point I would
> include a rant on allowing such basic design flaws to continue
> to fester 12 years after product launch.)

Ditto. Though at least in this case, they have deprecated the call
instead of denying that it is broken :-)

Of course it is not clear that having Microsoft try to fix these APIs
is any better. E.g. they added the function SignalObjectAndWait for
the NT based versions of Windows from NT 4.0 and beyond. (It does not
exist on 95/98/ME .) You'd think this would allow easy implementation
of pthread_cond_wait/pthread_cond_signal/pthread_cond_broadcast , but
it doesn't really. What waitable object do you use for the condition
variable? How do you get both signal and broadcast behavior?

This is not even touching on cancelation type issues. Which were
discussed in another thread recently.

> Anyway, as most applications can avoid using Suspend/Resume
> directly, I don't see how this can be claimed as a source for
> any Posix implementation problems.

Having thread sync primitives change behavior, in particular having
them completely fail to operate, under the debugger is a complete game
over scenario. We have enough problems developing concurrent systems.
Dealing with random unexpected unreliability in the thread sync
primitives is unnecessary and unacceptable. These primitives should just
work. They should just work under the debugger. They should just work
when the moon is full.

In case the point is a little overwrought above, it is just horrible
systems engineering for seemingly unrelated system calls to cause a
wakeup event to be lost. Alternatively, one can say the debug
mechanisms should not be doing thread suspensions and resumes. (And
while we're at it, you'd think Microsoft could make it so debugging a
GUI app wouldn't deadlock the entire Windows user interface solid for
minutes at a time.)

Given these kind of issues and the informality of the specification
Microsoft provides on their synchronization APIs, I chose to use the
simplest ones I could to implement a pthreads subset. Namely critical
sections and semaphores. My overall take on the Win32 synchronization
API is that it is a complex hindrance to getting real work done. This
situation is somewhat improved inside .NET .

-Z-
0
Reply googlenews 9/13/2004 1:11:19 AM

Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<41444257.4F6D1D6@xemaps.com>...
> with the "pseudocode" version I did here
> http://groups.google.com/groups?selm=4125E3D3.3E8A71A2%40xemaps.com
> which is lock-free (if you don't count syscalls using spin locks) and
> handles the destroy after signaling problem.  No comments but some
> commentary in the follow ups.  Additional hints - spurious wakeups
> are allowed but too many are unlikely given the timing windows.

This seems to use an event count based sleep/wakeup primitive provided
by the underlying OS. (Read event count, call a routine that
atomically tests whether the event count is the same and if so puts
the thread asleep otherwise does not sleep. A wakeup atomically
increments the event count and checks for a thread to wake up.) In
terms of implementing pthreads condvars on Win32, that's cheating :-)
I mean eventcount stuff is a great primitive which, like
pthread_cond_wait, makes it a lot easier to avoid sleep/wakeup races.
But I don't think Win32 gives  you anything like that.
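
To make the parenthetical concrete, here is a minimal sketch of that eventcount interface in portable C++. The names (EventCount, read, wait, signal) are mine and not taken from Joe's post or from any OS API, and it is built on an ordinary mutex and condition variable, so it only illustrates the semantics and is in no way lock-free.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical eventcount: names and layout are illustrative only.
class EventCount {
public:
    // Step 1: sample the current count before checking your condition.
    std::uint64_t read() {
        std::lock_guard<std::mutex> g(m_);
        return count_;
    }

    // Step 2: sleep only while the count still equals the sampled value.
    // If a wakeup arrived in between, return at once instead of sleeping.
    void wait(std::uint64_t seen) {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [&] { return count_ != seen; });
    }

    // Wakeup: bump the count and wake any sleepers.
    void signal() {
        std::lock_guard<std::mutex> g(m_);
        ++count_;
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t count_ = 0;
};

// A wakeup that lands between read() and wait() is not lost:
// wait() sees the changed count and returns without blocking.
inline bool wakeup_between_read_and_wait_is_seen() {
    EventCount ec;
    std::uint64_t seen = ec.read();
    ec.signal();      // the "race": wakeup arrives before we go to sleep
    ec.wait(seen);    // returns immediately because count != seen
    return ec.read() == seen + 1;
}
```

The point of the two-step read/wait is that the sleeper never has to hold a lock across its own condition check, which is what closes the sleep/wakeup race.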

Oh, and the code looks like it needs some memory barriers. (This is an
issue with lock free code in general.) One should at least put in a
comment indicating where they might be needed.

-Z-
0
Reply googlenews 9/13/2004 1:18:06 AM

> What handled it in Transputers.  At least according to my understanding, a
> program just did a "send" command to a process and something figured out if
> that process was on the same die or required transversal of a link.  Am I
> wrong about that?

Well, yes and no. In a channel communication, you need to specify the address
of the channel control word - we're talking assembly here. That address could
either be a "normal" memory word, in which case there was a defined protocol
to store a value in it that allowed the two processes to communicate properly,
but without error checking on this level. But it could also be one of the
addresses allocated to a link engine, in which case the required DMA transfer
between chips would be started.

In practice, however, you needed special software sitting on either side of
the hardware links, because you had only a limited number of them, and comms
across them weren't virtualized. The T9000 with its virtual channel concept
and built-in virtual channel processor (VCP), and especially in combination
with the C104 cross-bar routing chip, would have pushed all of this into hard-
or rather firmware. All that remained for the programmer to do was to decide
on the granularity of his program, and where to place what 8-).

	Jan
0
Reply ISO 9/13/2004 9:06:10 AM

> I wonder if the "easy parallelism" of most transaction systems doesn't 
> rest solidly on the slowness of disk drives, the slowness of the network 
> interconnect, and the relatively low expectations that are the result.

Hmmm. Around 1988/89 I saw a prototype DB system for teller machines. It
had a lot of disk nodes, an internal network, and a lot of processing nodes.
A lot of the infrastructure was built from transputers, but I can't remember
whether there weren't also some more "normal" processors involved to do the
brunt work. In any case, during normal operations, there were no disk ops at
all: the database was loaded into the "cache" memory of the disk nodes on 
startup, and written back on shutdown. And it still was a very parallel system.

	Jan
0
Reply ISO 9/13/2004 9:09:46 AM

> The knowledge base documents the debugger issue here:
>     http://support.microsoft.com/default.aspx?scid=kb;en-us;173260
> 
> The documentation for PulseEvent gives the general issue that APC
> delivery may cause event pulses to be lost:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/pulseevent.asp
> 
> Kernel-mode APC delivery covers a large class of things including
> async I/O, etc. It is thus rather hard to reason about when this
> failure happens and it is no surprise that PulseEvent is deprecated.

Sounds a lot like the problems similar code has in VMS - and that was 
designed-in, as it were, from the beginning (c. 1978). They're just not
fixable (in VMS) because the peculiarities of the initial implementation
define the semantics. All you can do is to design and implement other
primitives that have the "right" semantics.

	Jan
0
Reply ISO 9/13/2004 9:30:03 AM

Zalman Stern wrote:
[...]
> API is that it is a complex hindrance to getting real work done. This
> situation is somewhat improved inside .NET .

Really?

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemThreadingMonitorClassPulseTopic.asp

regards,
alexander.
0
Reply Alexander 9/13/2004 9:45:53 AM

>>ftp://download.intel.com/pressroom/kits/events/idffall_
>>2004/otellini_presentation.pdf#page=38
> 
> Most interesting.  Unfortunately, that failed to download.

Worked for me. The line break at the underline is unfortunate, however.

	Jan
0
Reply ISO 9/13/2004 9:50:57 AM

Zalman Stern wrote:

[... win32 condvars ...]

> If one can accept an LGPL license, it is the way to go.)

17 USC 109 aside for a moment, just grab the condvar impl code and
safely ignore the "lesser" silliness. Idea-Expression Merger and 
Scènes à Faire.

regards,
alexander.
0
Reply Alexander 9/13/2004 9:53:29 AM

"Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
news:2ql6esF110p8sU1@uni-berlin.de...
> > I wonder if the "easy parallelism" of most transaction systems doesn't
> > rest solidly on the slowness of disk drives, the slowness of the network
> > interconnect, and the relatively low expectations that are the result.
>
> Hmmm. Around 1988/89 I saw a prototype DB system for teller machines. It
> had a lot of disk nodes, an internal network, and a lot of processing nodes.
> A lot of the infrastructure was built from transputers, but I can't remember
> whether there weren't also some more "normal" processors involved to do the
> brunt work. In any case, during normal operations, there were no disk ops at
> all: the database was loaded into the "cache" memory of the disk nodes on
> startup, and written back on shutdown. And it still was a very parallel system.
>
> Jan

Whitecross maybe?

Peter


0
Reply Peter 9/13/2004 10:28:39 AM

In article <2ql8s1F10sovaU1@uni-berlin.de>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> writes:
|> >>ftp://download.intel.com/pressroom/kits/events/idffall_
|> >>2004/otellini_presentation.pdf#page=38
|> > 
|> > Most interesting.  Unfortunately, that failed to download.
|> 
|> Worked for me. The line break at the underline is unfortunate, however.

I got it eventually.  3 of 5 downloads failed, however, which
indicates some sort of problem with the server.


Regards,
Nick Maclaren.
0
Reply nmm1 9/13/2004 10:43:06 AM

> Whitecross maybe?

Dunno what Whitecross is, but that was in a Citibank lab in LA.

	Jan
0
Reply ISO 9/13/2004 11:48:34 AM

Zalman Stern wrote:
>     http://www28.cplan.com/cbi_export/MA_OSAS002_266814_68-1_v2.pdf

Authorization required?!

> (2 threads means "2 threads per core" in case it is not clear. Slide
> elsewhere indicates SMT.)

Multi-threaded: yes.  SMT: no.  Montecito uses a different version of 
multithreading than SMT.  I know that's been discussed before.  Search 
for it if you want details.

Alex
-- 
My words are my own.  They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright.  (I do not speak for my employer.)

0
Reply Alex 9/13/2004 11:57:12 AM


Zalman Stern wrote:
> 
> Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<41444257.4F6D1D6@xemaps.com>...
> > with the "pseudocode" version I did here
> > http://groups.google.com/groups?selm=4125E3D3.3E8A71A2%40xemaps.com
> > which is lock-free (if you don't count syscalls using spin locks) and
> > handles the destroy after signaling problem.  No comments but some
> > commentary in the follow ups.  Additional hints - spurious wakeups
> > are allowed but too many are unlikely given the timing windows.
> 
> This seems to use an event count based sleep/wakeup primitive provided
> by the underlying OS. (Read event count, call a routine that
> atomically tests whether the event count is the same and if so puts
> the thread asleep otherwise does not sleep. A wakeup atomically
> increments the event count and checks for a thread to wake up.) In
> terms of implementing pthreads condvars on Win32, that's cheating :-)
> I mean eventcount stuff is a great primitive which, like
> pthread_cond_wait, makes it a lot easier to avoid sleep/wakeup races.
> But I don't think Win32 gives  you anything like that.

You can implement a lock-free eventcount in windows using standard windows
synchronization objects.  I've done it using Semaphores but it's not been
posted anywhere.

> 
> Oh, and the code looks like it needs some memory barriers. (This is an
> issue with lock free code in general.) One should at least put in a
> comment indicating where they might be needed.
> 

I could argue it's pseudocode and assumes total order memory access but
in the case of condvars the proper memory visibility is provided by the
mutexes that the condvar is bound to.  You can do a condvar signal without
a lock but it's undefined as to whether the signal is lossy or not in that
case (with certain exceptions).
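
A minimal sketch of what I mean, in C++ with std::mutex and std::condition_variable standing in for the pthreads mutex and condvar (the Mailbox type and its members are made up for illustration). The shared state is only ever touched with the mutex held, so the mutex supplies the memory ordering and the condvar code needs no barriers of its own.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative only: a one-slot mailbox guarded by a mutex/condvar pair.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int payload = 0;

    void publish(int value) {
        std::lock_guard<std::mutex> g(m);
        payload = value;   // written under the mutex
        ready = true;      // the predicate, also under the mutex
        cv.notify_one();   // signal while holding the lock
    }

    int consume() {
        std::unique_lock<std::mutex> g(m);
        while (!ready) cv.wait(g);   // loop tolerates spurious wakeups
        ready = false;
        return payload;    // read under the same mutex, so it is visible
    }
};
```

Because the predicate lives in `ready` and not in the signal itself, a publish() that runs before consume() is still seen.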

In the case of lock-free signaling, yes, you'd probably need memory barriers.
What those would be would be determined by the semantics of lock-free signaling
if the semantics were published.

Since condvars are "bound" to mutexes, it's sort of moot that you can implement
them lock-free or almost so.  It's more to point out that even the people who
you would consider experts in threaded programming, the NPTL library implementors,
aren't all that expert if you go by the rule of unnecessary complexity as shown
by example.  Shared memory multi-threading is complicated even for the experts.

Joe Seigh
0
Reply Joe 9/13/2004 12:04:43 PM

In article <ci41up$6b4$1@news01.intel.com>,
Alex Johnson <compuwiz@jhu.edu> writes:
|> 
|> > (2 threads means "2 threads per core" in case it is not clear. Slide
|> > elsewhere indicates SMT.)
|> 
|> Multi-threaded: yes.  SMT: no.  Montecito uses a different version of 
|> multithreading than SMT.  I know that's been discussed before.  Search 
|> for it if you want details.

Hmm.  I have seen no details worth a damn.  Yes, it is known that
it does something different, but I haven't seen a clear statement
of what.  And there are a lot of possibilities.  Of course, I might
have missed some actual information in the morass of buzzwords and
general waffle.


Regards,
Nick Maclaren.
0
Reply nmm1 9/13/2004 12:21:36 PM

Joe Seigh wrote:
> 
> Eric wrote:
> >
> > Joe Seigh wrote:
> > >
> > > And pulse event is broken (at least they document it).  It can easily be
> > > fixed but Microsoft's architects seem to be sealed up in their Fortress
> > > of Doom with no way to communicate with them.  But that's standard for
> > > all architecture groups.
> >
> > The only documented problem on PulseEvent that I can see is
> > where events may be lost during debug due to suspend/resume.
> > The MS docs explicitly claims this can only happen during debug.
> > However their explanation states the underlying cause is due to the
> > debugger using thread Suspend/Resume and it therefore seems that any
> > application using Suspend/Resume and PulseEvent would be susceptible.
> >
> 
> The documentation for PulseEvent says
>   A thread waiting on a synchronization object can be momentarily removed from the
>   wait state by a kernel-mode APC, and then returned to the wait state after the APC is
>   complete. If the call to PulseEvent occurs during the time when the thread has been
>   removed from the wait state, the thread will not be released because PulseEvent
>   releases only those threads that are waiting at the moment it is called. Therefore,
>   PulseEvent is unreliable and should not be used by new applications.

Ok, that text is a recent addition. It is also in the online MSDN.
On this issue PulseEvent may not really be 'broken' in the sense
that you can use it as a one-shot that releases a single thread.
It is not broken in the sense that it should not lose a pulse
signal, however its design would result in unexpected behavior.
In other situations it might have erroneous behavior. In the Suspend
example I cited, it really is broken - it forgets that it was pulsed.

> This could probably be fixed by making Events eventcounts internally if internal logic
> doesn't preclude that for some other unknown reason.

Unfortunately it is a little bit complicated, but I'll take a crack
at explaining it.

(Some of the following is based on WNT info I have pieced together
over the years. Some is based on having been down exactly this design
road for my own hobby OS. I designed mine to avoid exactly these
problems so I do understand some of the considerations.)

Very early in WNT's design they made the (what I consider erroneous)
core design decision that threads will abandon waits if something 
'important' happens. An APC (essentially a thread interrupt) is one
such thing that needs to break into a wait state, Suspend is another.
There are many consequences to this decision, positive and negative.

When a thread performs a Wait on one or more events, it gets a
data structure called (I think) a Wait Control Block (WCB) and a
single linked list of Wait Items (WITM), one for each event to
wait on. The WCB is filled in with some description of the wait
operation and cross linked to the thread to awaken. The wait items
form a single linked list which contain a pointer to the next wait
item, a pointer back to the parent WCB, and a double link list entry.
Each event contains a state and double linked list head of wait items.

To perform the actual wait operation each wait item is pushed
at the tail of each event's list. The thread is then put
into a wait state and moved to the scheduler's thread wait list.
The result forms a cross linked structure like this:

              EVNT  EVNT  EVNT
                ^     ^     ^
                |     |     |
                v     v     v
Thread<->WCB->WITM->WITM->WITM
          ^_____|_____|_____|

If the event is Set or Pulsed, the signaling routine sets the state
field and goes to the first item in the list. It chases the pointer
back to the WCB and looks at the wait type. If WaitForOne it walks
the list of items and dequeues them. Then the thread is awakened
by moving it from the scheduler's wait list to the ready list.
So far this is a good design.

Now here is the problem (and design error in my opinion).
An APC (Asynchronous Procedure Call) is like an interrupt to a thread;
it is a forced subroutine call. If an APC is queued to the
thread, for example because of IO completion, WNT wakes up the
thread and it *dequeues all the WCB's wait items from their lists*.
It then executes the APC routine and requeues the wait items
*at the tail of their event lists*.

The benefits of this design are simplicity:
- An APC can also perform waitable operations.
  If it did not cancel the wait then the design would have to
  deal with multiple concurrent nested waits.
- If an exception causes a stack unwind then this design does not
  have to take into account whether it is unwinding across a running
  wait operation since waits are canceled upon APC delivery.

The negatives of this design are:
- It causes non-FIFO behavior. This can cause starvation.
- It can cause bizarre or erroneous behavior if an event for a group
  of threads arrives while one of them is off doing other things
  and it misses the signal it should have gotten.
- It is more expensive to continuously queue and dequeue items.
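
The first two negatives are easy to see in a toy model. The sketch below is plain single-threaded C++ that only mimics the queueing I described; "threads" are just names, ToyEvent and its methods are invented for the illustration, and nothing here calls Win32 or reflects actual WNT source.

```cpp
#include <algorithm>
#include <deque>
#include <string>

// Toy model of one event's wait list. Purely illustrative.
struct ToyEvent {
    std::deque<std::string> waiters;   // FIFO list of wait items

    void wait(const std::string& t) { waiters.push_back(t); }

    // APC delivery, part 1: the thread's wait item is pulled off the list.
    void apc_begin(const std::string& t) {
        waiters.erase(std::remove(waiters.begin(), waiters.end(), t),
                      waiters.end());
    }

    // APC delivery, part 2: the wait item is requeued at the tail.
    void apc_end(const std::string& t) { waiters.push_back(t); }

    // Pulse releases the thread at the head of the list, if there is one,
    // and returns its name. Nothing is remembered when the list is empty.
    std::string pulse() {
        if (waiters.empty()) return "";
        std::string t = waiters.front();
        waiters.pop_front();
        return t;
    }
};

// Non-FIFO: A waited first, but an APC moves it behind B.
inline std::string who_wins_after_apc() {
    ToyEvent e;
    e.wait("A");
    e.wait("B");
    e.apc_begin("A");
    e.apc_end("A");
    return e.pulse();
}

// Lost pulse: it arrives while the only waiter is off running its APC.
inline bool pulse_during_apc_is_lost() {
    ToyEvent e;
    e.wait("A");
    e.apc_begin("A");
    bool released = !e.pulse().empty();   // nobody is on the list
    e.apc_end("A");
    return !released && e.waiters.size() == 1;   // A is waiting again
}
```

In the first case B is released ahead of A; in the second the pulse releases nobody and A goes back to waiting as if nothing had happened.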

In my opinion these benefits are minimal because multiple nested
waits and stack unwinds are handlable. The negatives are serious.
The correct design is to not abandon waits and support multiple
nested waits.

> > (If I thought it would have any effect on MS, at this point I would
> > include a rant on allowing such basic design flaws to continue
> > to fester 12 years after product launch.)
> 
> That's why they have their Fortress of Doom.  They're safe from all our
> rants.

Yeah, but who doesn't want one of those? I want turrets on mine.

> > Anyway, as most applications can avoid using Suspend/Resume
> > directly, I don't see how this can be claimed as a source for
> > any Posix implementation problems.
> 
> It came up in discussions about implementing win32 condvars.  I don't
> remember if there were any working solutions using it but it being broken
> made those solutions moot.  You could use it with SignalObjectandWait
> to make signal only or broadcast only condvars depending on whether the
> Event was autoresettable or not.
> 
> Joe Seigh

If you follow the above description you should see that PulseEvent
can reliably release single waiting threads, though not FIFO ordered.
I believe LeaveCriticalSection uses it to pulse an auto-event.
That may not help cond-vars (I haven't thought about them).

Eric

0
Reply Eric 9/13/2004 1:29:37 PM

Jan Vorbrüggen wrote:
> 
> > Kernel-mode APC delivery covers a large class of things including
> > async I/O, etc. It is thus rather hard to reason about when this
> > failure happens and it is no surprise that PulseEvent is deprecated.
> 
> Sounds a lot like the problems similar code has in VMS - and that was
> designed-in, as it were, from the beginning (c. 1978). They're just not
> fixable (in VMS) because the peculiarities of the initial implementation
> define the semantics. All you can do is to design and implement other
> primitives that have the "right" semantics.

Do you happen to recall the VMS problems? Just curious.
I was under the impression that it did deal with these issues,
though it didn't (used to) have a PulseEvent.

Eric

0
Reply Eric 9/13/2004 2:41:07 PM

"Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
news:2qlfoiF118pd9U1@uni-berlin.de...
> > Whitecross maybe?
>
> Dunno what Whitecross is, but that was in a Citibank lab in LA.
>
> Jan

Obviously not. Whitecross Systems(?) was a UK company building in memory
database engines using transputers with customers in the financial area.

Peter


0
Reply Peter 9/13/2004 3:30:02 PM

"Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
news:2ql684Ft75u8U1@uni-berlin.de...
> > What handled it in Transputers.  At least according to my understanding, a
> > program just did a "send" command to a process and something figured out if
> > that process was on the same die or required transversal of a link.  Am I
> > wrong about that?
>
> Well, yes and no. In a channel communication, you need to specify the address
> of the channel control word - we're talking assembly here. That address could
> either be a "normal" memory word, in which case there was a defined protocol
> to store a value in it that allowed the two processes to communicate properly,
> but without error checking on this level. But it could also be one of the
> addresses allocated to a link engine, in which case the required DMA transfer
> between chips would be started.
>
> In practice, however, you needed special software sitting on either side of
> the hardware links, because you had only a limited number of them, and comms
> across them weren't virtualized. The T9000 with its virtual channel concept
> and built-in virtual channel processor (VCP), and especially in combination
> with the C104 cross-bar routing chip, would have pushed all of this into hard-
> or rather firmware. All that remained for the programmer to do was to decide
> on the granularity of his program, and where to place what 8-).

Thanks for the explanation, Jan.  Is all of this sort of stuff, together
with the kind of things that Rupert talked about on line somewhere?  I know
Rupert mentioned a paper, but he didn't know where it was.  Perhaps you do?

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/13/2004 3:57:10 PM

Robert Myers <rmyers1400@comcast.net> writes:

> Maybe by the time Whitefield and Niagara are available, Transmeta will
> have a similar product, too.  A ULV Whitefield is where I'd want to
> start, and I don't think I'd be too bothered by the separate
> controller, which I'd get to amortize over at least four cores.  By
> the time Whitefield is available, Intel should have more complete
> infrastructure like Advanced Switching as an interconnect.

By the way, Niagara has reached silicon and boots Solaris. Lots of
work still to do, I'm sure.

http://blogs.sun.com/roller/page/jonathan/20040910#the_difference_between_humans_and


Chris
-- 
Chris Morgan
   "Post posting of policy changes by the boss will result in 
    real rule revisions that are irreversible"

		- anonymous correspondent
0
Reply Chris 9/13/2004 4:40:06 PM

Stephen Fuld wrote:

> 
> "Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
> news:2ql684Ft75u8U1@uni-berlin.de...
>> > What handled it in Transputers.  At least according to my understanding, a
>> > program just did a "send" command to a process and something figured out if
>> > that process was on the same die or required transversal of a link.  Am I
>> > wrong about that?
>>
>> Well, yes and no. In a channel communication, you need to specify the address
>> of the channel control word - we're talking assembly here. That address could
>> either be a "normal" memory word, in which case there was a defined protocol
>> to store a value in it that allowed the two processes to communicate properly,
>> but without error checking on this level. But it could also be one of the
>> addresses allocated to a link engine, in which case the required DMA transfer
>> between chips would be started.
>>
>> In practice, however, you needed special software sitting on either side of
>> the hardware links, because you had only a limited number of them, and comms
>> across them weren't virtualized. The T9000 with its virtual channel concept
>> and built-in virtual channel processor (VCP), and especially in combination
>> with the C104 cross-bar routing chip, would have pushed all of this into hard-
>> or rather firmware. All that remained for the programmer to do was to decide
>> on the granularity of his program, and where to place what 8-).
> 
> Thanks for the explanation, Jan.  Is all of this sort of stuff, together
> with the kind of things that Rupert talked about on line somewhere?  I know
> Rupert mentioned a paper, but he didn't know where it was.  Perhaps you do?

Jan's description didn't leave much out...  Although he did
not mention the scheduling stuff which is pretty important
(done in HW & ucode). My favourite OCCAM construct, "ALT"
had HW support too, I wish to god other languages had it.

You can probably find OCCAM tutorials online, and there are
the various papers on CSP written by the smart people at
Oxford (UK) in the 80s.

The ~1990 vintage datasheets used to give you all the grubby
details of how the T4/8 did their scheduling. Pretty sure the
preliminary T9000 databook did the same too. I figure these
days you'd probably want to go for a VCP style solution, you
can get an idea of what VCP was about by reading the IEEE1355
spec. 

Alas all I found *online* were the later ST datasheets here :
http://www.classiccmp.org/transputer

The other place to look for info is :
http://www.wotug.org

If you can wait a couple of months I might be able to scan
the blurb about comms and scheduling in the old datasheets
for you.

Transputer info is slowly falling through the holes in the
web, so I suggest you binge-download the stuff you want. :)

-- 
Cheers,
Rupert
0
Reply Rupert 9/13/2004 5:08:03 PM

Alexander Terekhov <terekhov@web.de> wrote in message news:<41456C51.7D77B992@web.de>...
> Zalman Stern wrote:
> [...]
> > API is that it is a complex hindrance to getting real work done. This
> > situation is somewhat improved inside .NET .
> 
> Really?
> 
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemThreadingMonitorClassPulseTopic.asp

Other than that they hardwired the common usage case by bundling the
condition variable and the mutex together into a single object, I
don't see the problem. In particular, I would not expect the
referenced Monitor.Pulse method to use the Win32 PulseEvent call.

My mental model of System::Threading::Monitor is:

     class Monitor
     {
        private:
            pthread_mutex_t lock;
            pthread_cond_t  condition;

        public:
        // Constructors/destructor, etc.

        static void Enter(Monitor *obj)
        {
            // Check for NULL, etc.
            pthread_mutex_lock(&obj->lock);
        }

        static void Exit(Monitor *obj)
        {
            // Check for NULL, etc.
            pthread_mutex_unlock(&obj->lock);
        }

        static bool Wait(Monitor *obj, int timeout)
        {
            // Check for NULL, etc.

            if (timeout == Infinite)
                pthread_cond_wait(&obj->condition, &obj->lock);
            else
                pthread_cond_timedwait(&obj->condition, &obj->lock,
                                       /* convert timeout to an absolute timespec */);

            // Figure out return code, etc.
        }

        static void Pulse(Monitor *obj)
        {
            // Check for NULL, etc.
            pthread_cond_signal(&obj->condition);
        }

        static void PulseAll(Monitor *obj)
        {
            // Check for NULL, etc.
            pthread_cond_broadcast(&obj->condition);
        }
    };

(Note Monitor may give a stronger fairness guarantee than the above
would according to the pthreads spec.)
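
For anyone who wants to poke at it, the same model can be written against the C++11 standard library instead of raw pthreads. This is still only my mental model and says nothing about how the CLR actually implements Monitor. ToyMonitor and its method names are made up, the methods are non-static for brevity, and unlike the real Monitor the lock here is not recursive.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative stand-in for the mutex + condvar pairing described above.
class ToyMonitor {
public:
    void Enter() { lock_.lock(); }
    void Exit()  { lock_.unlock(); }

    // Caller must already hold the lock (via Enter).
    // A negative timeout means wait forever. Returns false on timeout.
    bool Wait(int timeout_ms) {
        // Adopt the lock the caller already holds so the condvar can use it.
        std::unique_lock<std::mutex> held(lock_, std::adopt_lock);
        bool signaled = true;
        if (timeout_ms < 0) {
            condition_.wait(held);
        } else {
            signaled = condition_.wait_for(
                           held, std::chrono::milliseconds(timeout_ms))
                       == std::cv_status::no_timeout;
        }
        held.release();   // hand ownership back; the caller still holds the lock
        return signaled;
    }

    void Pulse()    { condition_.notify_one(); }
    void PulseAll() { condition_.notify_all(); }

private:
    std::mutex lock_;
    std::condition_variable condition_;
};
```

A Pulse issued when nobody is waiting is simply dropped, so a later Wait with a timeout runs out the clock. Real code would wrap Wait in a loop over its own predicate, since wakeups can also be spurious.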

The Pulse/PulseAll naming is perhaps unfortunate, but I expect they
wanted to make it clear that the "signal" is not persistent.

There are some other constraints, such as only being able to call
Pulse/PulseAll while holding the mutex. However these are not that
serious. Especially compared to the constraint that the mutex and
condition are paired.

So I do consider this an improvement in the state of the world
compared to raw win32. Seems straightforward to write reliable code
using the above Monitor abstraction. But I have not programmed
extensively using .NET. Am I completely missing something?

(I suppose another viewpoint is that one can implement pthreads on top
of win32 while it might be less efficient to implement pthreads on top
of the .NET Framework. This is perhaps a real issue now given C++/CLI
for folks who have existing C/C++ code using pthreads they'd like to
run under the CLR.)

-Z-
0
Reply googlenews 9/13/2004 5:16:14 PM

Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote in message news:<2ql7l1F1078bvU1@uni-berlin.de>...
> Sounds a lot like the problems similar code has in VMS - and that was 
> designed-in, as it were, from the beginning (c. 1978). They're just not
> fixable (in VMS) because the peculiarities of the initial implementation
> define the semantics. All you can do is to design and implement other
> primitives that have the "right" semantics.

When did the VMS community decide there were issues with this? When
was an alternative API that works reliably provided? What is the main
threading interface used in C/C++ on OpenVMS? (I'd expect it to be
pthreads today, but also expect that to be a very recent addition
compared to when NT was spec'ed.)

Unless compatibility with VMS was a goal, and I doubt it was, it seems
they should have left the known bad ideas behind when writing NT...

-Z-
0
Reply googlenews 9/13/2004 5:25:37 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> In article <oea812xpv8.ln2@homer.edgehp.invalid>,  <dale@edgehp.net> wrote:

>>To the APL programmer, every problem looks like a vector/matrix.
>>(To the man with a hammer, every problem looks like a nail.)

> Yes, quite.  Which is why when, faced with the problem of unscrewing
> a fitting, the solution is to smash the unit it is attached to,
> thus freeing the fitting.

Unfortunately in software this may be a perfectly viable design :-)

G.
0
Reply gavin 9/13/2004 7:06:47 PM

Zalman Stern wrote:
[...]
> My mental model of System::Threading::Monitor is: [ mutex + condvar ]

Apart from dynamic (1:N) mutex:condvar(s) binding and ability to cv-
signal/broadcast without holding associated lock, POSIX's cv-wait 
wisely doesn't "unroll" recursively locked mutexes and does allow 
spurious wakes.
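
To illustrate the 1:N point, a sketch in C++, with std::mutex and std::condition_variable standing in for the POSIX objects and BoundedQueue a made-up name. Two condvars share one mutex, the signal is sent after dropping the lock, and every wait sits in a predicate loop so spurious wakes are harmless.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Illustrative bounded queue: one mutex, two condition variables.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(int v) {
        std::unique_lock<std::mutex> g(m_);
        while (items_.size() == capacity_) not_full_.wait(g);
        items_.push_back(v);
        g.unlock();
        not_empty_.notify_one();   // signal without holding the lock
    }

    int pop() {
        std::unique_lock<std::mutex> g(m_);
        while (items_.empty()) not_empty_.wait(g);
        int v = items_.front();
        items_.pop_front();
        g.unlock();
        not_full_.notify_one();    // wakes only producers, never consumers
        return v;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_;    // bound to m_
    std::condition_variable not_empty_;   // also bound to m_
    std::deque<int> items_;
    std::size_t capacity_;
};
```

With a single condvar welded to the lock, producers and consumers would have to share one wait set and wake each other needlessly.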

MS monitors are brain-dead and error-prone (even more than Java 
ones; JSR-166 condvars aside for a moment). It's no surprise that 
that MS example is busted. It's quite normal state for MSDN's sorta 
exemplary threading stuff, AFAICS.

regards,
alexander.
0
Reply Alexander 9/13/2004 7:15:32 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> I got it eventually.  3 of 5 downloads failed, however, which
> indicates some sort of problem with the server.

Or the intervening network.  My download came through on the first
try.  Of course all _that_ might mean is I was lucky :)

rick jones
-- 
a wide gulf separates "what if" from "if only"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...
0
Reply Rick 9/13/2004 7:47:17 PM

In article <9Pm1d.10486$uP2.4964@news.cpqcorp.net>,
Rick Jones  <foo@bar.baz.invalid> wrote:
>In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
>> I got it eventually.  3 of 5 downloads failed, however, which
>> indicates some sort of problem with the server.
>
>Or the intervening network.  My download came through on the first
>try.  Of course all _that_ might mean is I was lucky :)

Yes and no.  FTP uses TCP/IP, and I downloaded the file many times
to both IRIX and Linux, and got the same unreliability.  Now, despite
common belief, FTP is a VERY unreliable protocol, but TCP/IP isn't.
I am 90% certain (based on that and previous experience) that the
server was using the FTP protocol in one of its many unreliable ways.


Regards,
Nick Maclaren.
0
Reply nmm1 9/13/2004 8:10:16 PM

In comp.sys.intel Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> In article <9Pm1d.10486$uP2.4964@news.cpqcorp.net>,
> Rick Jones  <foo@bar.baz.invalid> wrote:
>>In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
>>> I got it eventually.  3 of 5 downloads failed, however, which
>>> indicates some sort of problem with the server.
>>
>>Or the intervening network.  My download came through on the first
>>try.  Of course all _that_ might mean is I was lucky :)

> Yes and no.  FTP uses TCP/IP, and I downloaded the file many times
> to both IRIX and Linux, and got the same unreliability.  Now,
> despite common belief, FTP is a VERY unreliable protocol, but TCP/IP
> isn't.  I am 90% certain (based on that and previous experience)
> that the server was using the FTP protocol in one of its many
> unreliable ways.

TCP is "reliable" only in that it will tell you if it believes that
the data has not arrived at the desired destination.  TCP can only
overcome so much in the way of packet loss and the like, so if there
was nasty packet loss between you and the other end, or routing
instability somewhere...

rick jones
-- 
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose." 
                                                    - Rick Jones
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...
0
Reply Rick 9/13/2004 8:38:20 PM

In article <0zn1d.10505$uP2.7089@news.cpqcorp.net>,
Rick Jones  <foo@bar.baz.invalid> wrote:
>
>TCP is "reliable" only in that it will tell you if it believes that
>the data has not arrived at the desired destination.  TCP can only
>overcome so much in the way of packet loss and the like, so if there
>was nasty packet loss between you and the other end, or routing
>instability somewhere...

Yes, but there are SUPPOSED to be checksums and sequence counts.
If those were used properly, the chances of error are low.  Yes,
I know that regrettably many systems don't check them correctly,
or even run with no checking by default, but still ....

In a previous task, I had to investigate FTP, and that is nothing
like as solid.  In particular, it makes it too easy to truncate a
transfer early and think that was EOF.  I think that was what was
happening - the last window wasn't being pushed, and so the last
few KB of the file were never arriving.


Regards,
Nick Maclaren.
0
Reply nmm1 9/13/2004 8:41:03 PM

In comp.arch Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
> In article <0zn1d.10505$uP2.7089@news.cpqcorp.net>,
> Rick Jones  <foo@bar.baz.invalid> wrote:
>>
>>TCP is "reliable" only in that it will tell you if it believes that
>>the data has not arrived at the desired destination.  TCP can only
>>overcome so much in the way of packet loss and the like, so if there
>>was nasty packet loss between you and the other end, or routing
>>instability somewhere...

> Yes, but there are SUPPOSED to be checksums and sequence counts.
> If those were used properly, the chances of error are low.  Yes,
> I know that regrettably many systems don't check them correctly,
> or even run with no checking by default, but still ....

Let's back up a step - were you saying that the transfers were not
completing, or that the transfers were completing, but producing corrupt
files?  I (perhaps mistakenly) thought I read you writing that the
transfers were not completing.

> In a previous task, I had to investigate FTP, and that is nothing
> like as solid.  In particular, it makes it too easy to truncate a
> transfer early and think that was EOF.  I think that was what was
> happening - the last window wasn't being pushed, and so the last
> few KB of the file were never arriving.

So the transfers were not completing?

rick jones
-- 
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...
0
Reply Rick 9/13/2004 8:54:26 PM

Zalman Stern wrote:
> 
> Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<414496E8.538447C0@sympaticoREMOVE.ca>...
> > The only documented problem on PulseEvent that I can see is
> > where events may be lost during debug due to suspend/resume.
> > The MS docs explicitly claims this can only happen during debug.
> > However their explanation states the underlying cause is due to the
> > debugger using thread Suspend/Resume and it therefore seems that any
> > application using Suspend/Resume and PulseEvent would be susceptible.
> 
> The knowledge base documents the debugger issue here:
>     http://support.microsoft.com/default.aspx?scid=kb;en-us;173260
> 
> The documentation for PulseEvent gives the general issue that APC
> delivery may cause event pulses to be lost:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/pulseevent.asp
> 
> Kernel-mode APC delivery covers a large class of things including
> async I/O, etc. It is thus rather hard to reason about when this
> failure happens and it is no surprise that PulseEvent is deprecated.
> 
> Note that debugging and debug events happen in other contexts, such as
> running the application under test harness systems such as Purify or
> the application verifier, etc. It is a big burden in an engineering
> process to not be able to use these tools. (Plus there's Intel's new
> thread profiling tool that Judi Goldstein covered at IDF. I have no
> idea if PulseEvent works under such profiling or not.) It is possible
> that none of these tools cause debug events to happen to a thread that
> is waiting and that they do not suspend/resume threads, but my
> experience in this area does not allow me to be so optimistic...
> 
> > (If I thought it would have any effect on MS, at this point I would
> > include a rant on allowing such basic design flaws to continue
> > to fester 12 years after product launch.)
> 
> Ditto. Though at least in this case, they have deprecated the call
> instead of denying that it is broken :-)

If my understanding and analysis of the situation is correct,
and ignoring the Suspend problem, a PulseEvent to an auto-event
releasing a single thread may, ironically enough, be one of the
things that actually DOES always work as specified.

Furthermore, since a PulseEvent is logically very similar to
a high priority thread performing a SetEvent then a ResetEvent,
if PulseEvent is broken then Set-Reset should be too.

> Of course it is not clear that having Microsoft try to fix these APIs
> is any better. E.g. they added the function SignalObjectAndWait for
> the NT based versions of Windows from NT 4.0 and beyond. (It does not
> exist on 95/98/ME.) You'd think this would allow easy implementation
> of pthread_cond_wait/pthread_cond_signal/pthread_cond_broadcast, but
> it doesn't really. What waitable object do you use for the condition
> variable? How do you get both signal and broadcast behavior?

It might be possible to construct a scenario that logically
"can never fail" using SetEvent and SignalObjectAndWait that can
fail because of the non deterministic WaitFor behavior.

1) ThreadA waits for event A, then uses SignalObjectAndWait
   to set event B and wait for event X.
2) ThreadB waits for event B, then uses SignalObjectAndWait
   to set event C and wait for event X.
3) Set event A to trigger the sequence. It is now "guaranteed" that
   both threads A and B are waiting on event X. 
4) High priority thread C waits for event C then does a SetEvent and
   ResetEvent on event X and it should always release both
   threads A and B.

BUT if thread A was off processing an APC it could miss its
trigger from thread C and block when it requeued for the wait.
Other scenarios may be possible to cause missed triggers.

So I'm not so sure that this is confined to PulseEvent
but without access to the source code I can only speculate.

> This is not even touching on cancelation type issues. Which were
> discussed in another thread recently.
> 
> > Anyway, as most applications can avoid using Suspend/Resume
> > directly, I don't see how this can be claimed as a source for
> > any Posix implementation problems.
> 
> Having thread sync primitives change behavior, in particular having
> them completely fail to operate, under the debugger is a complete game
> over scenario. We have enough problems developing concurrent systems.
> Dealing with random unexpected unreliability in the thread sync
> primitives is unnecessary and unacceptable. These primitives should just
> work. They should just work under the debugger. They should just work
> when the moon is full.
> 
> In case the point is a little overwrought above, it is just horrible
> systems engineering for seemingly unrelated system calls to cause a
> wakeup event to be lost. Alternatively, one can say the debug
> mechanisms should not be doing thread suspensions and resumes. (And
> while we're at it, you'd think Microsoft could make it so debugging a
> GUI app wouldn't deadlock the entire Windows user interface solid for
> minutes at a time.)

Yes. One indication of a good design is that turning on
the radio does NOT make the windshield wipers go.

> Given these kind of issues and the informality of the specification
> Microsoft provides on their synchronization APIs, I chose to use the
> simplest ones I could to implement a pthreads subset. Namely critical
> sections and semaphores. My overall take on the Win32 synchronization
> API is that it is a complex hindrance to getting real work done. This
> situation is somewhat improved inside .NET.
> 
> -Z-

I try to stay with their simplest interfaces and avoid their layered
products which add to the obfuscation. No MFC or .NET. I find this
greatly increases my product quality, minimizes my futzing about
and frustration, and gives improved delivery schedules.

Eric

0
Reply Eric 9/13/2004 9:55:28 PM

Alex Johnson <compuwiz@jhu.edu> wrote in message news:<ci41up$6b4$1@news01.intel.com>...
> Zalman Stern wrote:
> >     http://www28.cplan.com/cbi_export/MA_OSAS002_266814_68-1_v2.pdf

Auth requirements on the presentations seem to be inconsistent. Some
require it, some don't. Some of the ones you can download without
authorization have the username and password on one of the last few
slides. You can get to the list of presentations here:
    http://www.cplan.com/idfafa04/sys/catalog1
(just click the search button without filling in any fields).

> > (2 threads means "2 threads per core" in case it is not clear. Slide
> > elsewhere indicates SMT.)
> 
> Multi-threaded: yes.  SMT: no.  Montecito uses a different version of 
> multithreading than SMT.  I know that's been discussed before.  Search 
> for it if you want details.

Rereading the slides, I conclude that my original comment was just
wrong. The presentation says thread switches happen on "long latency
operations." There isn't really any detailed information on this in
the presentation. My apologies for the error. (They have slides
showing something that looks like SMT, but I think it is just for an
overview, not what Montecito does.)

-Z-
0
Reply googlenews 9/13/2004 10:56:21 PM

Zalman Stern wrote:
> Alex Johnson <compuwiz@jhu.edu> wrote in message news:<ci41up$6b4$1@news01.intel.com>...
> 
>>Zalman Stern wrote:
>>
>>>    http://www28.cplan.com/cbi_export/MA_OSAS002_266814_68-1_v2.pdf
> 
> 
> Auth requirements on the presentations seems to be inconsistent. Some
> require them, some don't. Some of the ones you can download without
> authorization have the username and password on one of the last few
> slides. (You can get to the list of presentations here:
>     http://www.cplan.com/idfafa04/sys/catalog1
> (just click the search button without filling in any fields.)
> 
> 
>>>(2 threads means "2 threads per core" in case it is not clear. Slide
>>>elsewhere indicates SMT.)
>>
>>Multi-threaded: yes.  SMT: no.  Montecito uses a different version of 
>>multithreading than SMT.  I know that's been discussed before.  Search 
>>for it if you want details.
> 
> 
> Rereading the slides, I conclude that my original comment was just
> wrong. The presentation says thread switches happen on "long latency
> operations." There isn't really any detailed information on this in
> the presentation. My apologies for the error. (They have slides
> showing something that looks like SMT, but I think it is just for an
> overview, not what Montecito does.)
> 

The google search

speculative slice "delinquent loads"

yields a cornucopia of what I believe are the relevant links,

http://www.intel.com/research/mrl/library/148_collins_j.pdf

in particular.

RM

0
Reply Robert 9/14/2004 3:09:39 AM

"Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
news:2qlfoiF118pd9U1@uni-berlin.de...
> > Whitecross maybe?
>
> Dunno what Whitecross is, but that was in a Citibank lab in LA.

I remember them, but I can't quite dredge the name up from my brain archive.
Something like TTI???, (Transaction something Institute???), in Santa
Monica.  They had some very sharp people, but just played sandbox games.  I
think many of the people there left to start up Teradata.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/14/2004 4:12:30 AM

"Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
message news:1095095284.222200@teapot.planet.gong...
> Stephen Fuld wrote:

snip

> > Thanks for the explanation, Jan.  Is all of this sort of stuff, together
> > with the kind of things that Rupert talked about on line somewhere?  I know
> > Rupert mentioned a paper, but he didn't know where it was.  Perhaps you do?
>
> Jan's description didn't leave much out...  Although he did
> not mention the scheduling stuff which is pretty important
> (done in HW & ucode). My favourite OCCAM construct, "ALT"
> had HW support too, I wish to god other languages had it.
>
> You can probably find OCCAM tutorials online, and there are
> the various papers on CSP written by the smart people at
> Oxford (UK) in the 80s.
>
> The ~1990 vintage datasheets used to give you all the grubby
> details of how the T4/8 did their scheduling. Pretty sure the
> preliminary T9000 databook did the same too. I figure these
> days you'd probably want to go for a VCP style solution, you
> can get an idea of what VCP was about by reading the IEEE1355
> spec.
>
> Alas all I found *online* were the later ST datasheets here :
> http://www.classiccmp.org/transputer
>
> The other place to look for info is :
> http://www.wotug.org
>
> If you can wait a couple of months I might be able to scan
> the blurb about comms and scheduling in the old datasheets
> for you.
>
> Transputer info is slowly falling through the holes in the
> web, so I suggest you binge-download the stuff you want. :)

Thank you for all the links, etc.  I am slowly trying to absorb what is
there.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


0
Reply Stephen 9/14/2004 4:22:46 AM

Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<41461750.D7A640A3@sympaticoREMOVE.ca>...
> If my understanding and analysis of the situation is correct,
> and ignoring the Suspend problem, a PulseEvent to an auto-event
> releasing a single thread may, ironically enough, be one of the
> things that actually DOES always work as specified.

I've read your posts in this thread and I still do not understand why
you say this. In some sense it works as specified because the
specification says the event may be lost.

With an auto-event, one would normally use SetEvent, not PulseEvent.
This guarantees a sleeping thread will be woken even if it is handed
an APC because the kernel does a ResetEvent equivalent when a thread
is woken and not before. But this does not help with pthread condition
variables because the signaled state cannot be persistent. (Not quite
sure how to put it. Conceptually condvar's only state is a possibly
empty list of waiters.)

> Furthermore, since a PulseEvent is logically very similar to
> a high priority thread performing a SetEvent then a ResetEvent,
> if PulseEvent is broken then Set-Reset should be too.

Microsoft's docs indicate that the same problems that apply to
PulseEvent apply to SetEvent immediately followed by ResetEvent. If
one thread is waiting on an event object and the kernel invokes an APC
(something that happens at unpredictable times as far as most user
code is concerned) and another thread on another processor pulses the
event at exactly that moment, the first thread will go back to sleep
after the APC is finished.
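
The interleaving described above can be walked through with a toy model.
This is only a minimal Python sketch, not how the NT kernel is actually
written: the class PulsedEvent and its methods (wait, begin_apc, end_apc,
pulse) are hypothetical names, and the one assumption built in is the
behaviour stated in the text, namely that a pulse releases just the
threads sitting in the wait queue at that instant.

```python
class PulsedEvent:
    """Toy model (hypothetical): a pulse wakes only threads queued right now."""

    def __init__(self):
        self.queue = set()   # threads currently blocked on the event
        self.woken = set()   # threads that have been released by a pulse

    def wait(self, name):
        self.queue.add(name)

    def begin_apc(self, name):
        # The thread is pulled out of the wait to run the APC.
        self.queue.discard(name)

    def end_apc(self, name):
        # The thread requeues for the wait, unaware of any pulse it missed.
        self.queue.add(name)

    def pulse(self):
        # Set immediately followed by reset: release whoever is queued.
        self.woken |= self.queue
        self.queue.clear()


ev = PulsedEvent()
ev.wait("A")
ev.wait("B")
ev.begin_apc("A")   # A is away servicing an APC
ev.pulse()          # only B is in the queue at this moment
ev.end_apc("A")     # A goes back to sleep; nothing will wake it
```

At the end B has been released and A is back in the queue with no pulse
pending, which is the lost wakeup.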

I suppose another way to look at it is that the program author can't
tell the difference between "the thread was asleep inside WaitObject
in the kernel" and "the thread fell asleep on the syscall instruction,
which is not the kernel's fault." This is a valid response if one is
not using SignalObjectAndWait .

The entire point of a mechanism like pthread_cond_wait or an
eventcount based sleep mechanism is to avoid sleep wakeup races. Since
these are a significant source of problems in concurrent code, it
seems just heinous to specify event mechanisms that cause this class
of bugs rather than making it hard to have them.

When using pthread_cond_wait, there is a simple pattern which avoids
sleep/wakeup races. One has to go out of one's way to write code that
has such races. (Same applies to event counts, though I have not used
them as much.) With win32 event objects, the operating system will
insert the sleep/wakeup races for you. Even if it didn't, it turns out
to be difficult to avoid them using event objects.
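
For anyone who has not seen it, the simple pattern looks roughly like
this. It is a minimal sketch using Python's threading.Condition as a
stand-in for a pthread mutex plus condvar; the Mailbox class and its
put/get methods are made-up names for illustration. The point is that
the state being waited for lives under the lock and the waiter rechecks
it in a loop, so it does not matter whether the signal arrives before
or after the waiter gets there.

```python
import threading


class Mailbox:
    """One-slot handoff guarded by a lock plus condition variable."""

    def __init__(self):
        self._cond = threading.Condition()  # lock and condvar together
        self._ready = False                 # predicate, only touched under the lock
        self._value = None

    def put(self, value):
        with self._cond:
            self._value = value
            self._ready = True              # record the event in shared state...
            self._cond.notify()             # ...then wake a waiter, if there is one

    def get(self):
        with self._cond:
            while not self._ready:          # loop also absorbs spurious wakeups
                self._cond.wait()           # atomically releases the lock and sleeps
            self._ready = False
            return self._value
```

If put() runs first, get() sees the flag already set and never sleeps.
If get() runs first, it sleeps with the lock released and put() wakes it.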

[...]
> It might be possible to construct a scenario that logically
> "can never fail" using SetEvent and SignalObjectAndWait that can
> fail because of the non deterministic WaitFor behavior.
> 
> 1) ThreadA waits for event A, then uses SignalObjectAndWait
>    to set event B and wait for event X.
> 2) ThreadB waits for event B, then uses SignalObjectAndWait
>    to set event C and wait for event X.
> 3) Set event A to trigger the sequence. It is now "guaranteed" that
>    both threads A and B are waiting on event X. 
> 4) High priority thread C waits for event C then does a SetEvent and
>    ResetEvent on event X and it should always release both
>    threads A and B.
>
> BUT if thread A was off processing an APC it could miss its
> trigger from thread C and block when it requeued for the wait.
> Other scenarios may be possible to cause missed triggers.
> 
> So I'm not so sure that this is confined to PulseEvent
> but without access to the source code I can only speculate.

I do not understand why you need anything that complicated to
demonstrate this bug in pulsed events.

[...]
> I try to stay with their simplest interfaces and avoid their layered
> products which add to the obfuscation. No MFC or .NET. I find this
> greatly increases my product quality, minimizes my futzing about
> and frustration, and gives improved delivery schedules.

It is pretty difficult to write competitive GUI apps for Windows
without using higher level stuff... But the issue with synchronization
objects is at the lowest level of the kernel and win32. No fancy
toolkits here...

-Z-
0
Reply googlenews 9/14/2004 4:34:18 AM

Alexander Terekhov <terekhov@web.de> wrote in message news:<4145F1D4.721D2129@web.de>...
> Zalman Stern wrote:
> [...]
> > My mental model of System::Threading::Monitor is: [ mutex + condvar ]
> 
> Apart from dynamic (1:N) mutex:condvar(s) binding and ability to cv-
> signal/broadcast without holding associated lock, POSIX's cv-wait 
> wisely doesn't "unroll" recursively locked mutexes and does allow 
> spurious wakes.
> 
> MS monitors are brain-dead and error-prone (even more than Java 
> ones; JSR-166 condvars aside for a moment). It's no surprise that 
> that MS example is busted. It's quite normal state for MSDN's sorta 
> exemplary threading stuff, AFAICS.

I think I missed your point in the first post as I didn't read the
sample code in the docs for the Pulse method because the formatting
was so messed up on the web page. I just looked it up in the app
version of MSDN that comes with VS.NET 2005 beta. It is obviously
buggy. (For those following along at home, if thread 2 runs past the
Monitor.Pulse call before thread 1 gets to the first Monitor.Wait
call, the example hangs.)
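
The same ordering problem can be reproduced in a few lines. This is a
sketch in Python rather than the MSDN C# sample: threading.Condition
stands in for Monitor, notify() for Monitor.Pulse and wait() for
Monitor.Wait, and the timeouts are only there so the demonstration
returns instead of hanging. The variable names (woke, pulsed, ok) are
made up.

```python
import threading

cond = threading.Condition()

# "Thread 2" pulses first: nobody is waiting, so the notification is dropped.
with cond:
    cond.notify()

# "Thread 1" arrives afterwards and waits with no state to consult.
with cond:
    woke = cond.wait(timeout=0.2)   # times out; the earlier pulse was not remembered

# Same ordering again, but the pulse is recorded in a flag under the lock.
pulsed = False
with cond:
    pulsed = True
    cond.notify()

with cond:
    ok = cond.wait_for(lambda: pulsed, timeout=0.2)   # flag already set, no sleep needed
```

Without the timeout the first wait would block forever, which is the
hang in the documentation sample.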

-Z-
0
Reply googlenews 9/14/2004 5:27:38 AM

> Thanks for the explanation, Jan.  Is all of this sort of stuff, together
> with the kind of things that Rupert talked about on line somewhere?  I know
> Rupert mentioned a paper, but he didn't know where it was.  Perhaps you do?

Somebody with a name from India (Ramesh (?) Me...) was collecting such stuff
some years ago. Dunno whether it is still online. Perhaps look into the 
archives of comp.sys.transputer.

	Jan
0
Reply ISO 9/14/2004 7:11:43 AM

> Unless compatibility with VMS was a goal, and I doubt it was, it seems
> they should have left the known bad ideas behind when writing NT...

Oh yes, I quite agree. It was a known problem at the time WNT was spec'ed,
but then, does it surprise you that people do not learn from other people's
or, in this case, even their own mistakes? (Whether Cutler realised it was
a mistake - I dunno whether RSX also has this misfeature - is another matter.)

	Jan
0
Reply ISO 9/14/2004 7:17:43 AM

> Do you happen to recall the VMS problems? Just curious.
> I was under the impression that it did deal with these issues,
> though it didn't (used to) have a PulseEvent.

Well, the problems aren't identical, but they are similar.

Of course, VMS's AST mechanism is WNT's APC. This mechanism, while
very useful in general, complicates scheduling in that system events
that are not under the control of the process may trigger a kernel-mode
AST/APC and cause the process to leave a wait state it was in. This
leads to a potential loss of information - the AST is not quite
"transparent", as it were, to the scheduler and to the rest of the
code that makes decisions based on scheduling state. In VMS, this is
compounded in that there is only a single bit per process for each of
the indications whether the process is suspended or whether it should
wake-up, and running an AST either messes these up (wake-up) or inhibits
ASTs totally (suspend) - which leads to other problems, which were worked
around in an ad-hoc way. And there is the usual interaction with the
out-of-process debugger already mentioned in this thread...

	Jan
0
Reply ISO 9/14/2004 7:24:35 AM

> I remember them, but I can't quite dredge the name up from my brain archive.
> Something like TTI???, (Transaction something Institute???), in Santa
> Monica.  They had some very sharp people, but just played sandbox games.  I
> think many of the people there left to start up Teradata.

Yep, that's them. The TLA rings a bell. I also had the impression that they
"just played sandbox games" - but with a lot of corporate R&D organisations,
that's not their fault (cue Siemens central corporate research).

	Jan
0
Reply ISO 9/14/2004 7:30:05 AM

In article <6On1d.10507$HZ2.2523@news.cpqcorp.net>,
Rick Jones <foo@bar.baz.invalid> writes:
|> 
|> Let's back-up a step - were you saying that the transfers were not
|> completing, or that the transfers were completing, but have corrupt
|> files?  I (perhaps mistakenly) thought I read you writing that the
|> transfers were not completing.

Acrobat was gagging.  I didn't check the files in detail.

|> So the transfers were not completing?

Let me explain the issue.

TCP/IP (in its general sense) and its 'Internet' interfaces specify
actions on closing a connexion cleanly, but assume that connexions
will be kept open until they are closed cleanly.  There is thus no
architected way of indicating an unsuccessful close.  It isn't quite
like that, but that is the effect.

Most semi-decent systems (i.e. Unix, not Microsoft) do pass that
information back up to the application, but all of them sometimes
get it wrong and the indecent ones USUALLY get it wrong.  In
particular, there is no way of passing that information through a
filter which does not have a suitable out-of-band messaging system.

FTP does not have a post-close checking 'flag', so cannot tell the
difference between a break and a close.  Its assumption is that a
transfer completes or hangs.  And that is the misdesign I was
referring to.

The effect is that it is quite common for transfers to appear to
have completed successfully but actually to have failed.  In this
case, at least almost all of the file transferred (i.e. it got its
size right to the nearest few KB), but it could have been a last
block problem.


Regards,
Nick Maclaren.
0
Reply nmm1 9/14/2004 9:04:49 AM


Zalman Stern wrote:
> 
> With an auto-event, one would normally use SetEvent, not PulseEvent.
> This guarantees a sleeping thread will be woken even if it is handed
> an APC because the kernel does a ResetEvent equivalent when a thread
> is woken and not before. But this does not help with pthread condition
> variables because the signaled state cannot be persistent. (Not quite
> sure how to put it. Conceptually condvar's only state is a possibly
> empty list of waiters.)

Condvars have waitsets whose membership can be inferred from the logic
of the programs using them in conjunction with the locks the condvars
are bound to.   Condvar signaling can be done without a lock but if
you don't use locks in conjunction with the signaling, you can't
determine whether any other thread was reliably signaled (i.e. didn't
miss the signal when it should not have).

....
> 
> The entire point of a mechanism like pthread_cond_wait or an
> eventcount based sleep mechanism is to avoid sleep wakeup races. Since
> these are a significant source of problems in concurrent code, it
> seems just heinous to specify event mechanisms that cause this class
> of bugs rather than making it hard to have them.

The difference between events and eventcounts is the reset mechanism.
The reset for an eventcount is the reading of the current event count
which is kept locally and makes eventcounts shareable.  Events aren't
shareable by more than one thread, and strictly speaking, should only
be reset by that thread, not by the signaling thread which is what
causes most of the problems.

You can write your own event by creating an object with a count field
and an eventcount.  The reset method would (atomically) read the
eventcount's current count and set the object's count field.  Waiting
would just wait on the eventcount with the object's count field value.
I'm not saying that you'd want to do this, but you can create the same
potential for problems by making the count shared.
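
A rough sketch of the eventcount idea, in Python for brevity. The class
and method names (EventCount, read, advance, await_change) are made up
for illustration and it is built on a lock plus condvar rather than
anything lock-free. Each waiter keeps its own copy of the count it last
saw, which is the "reset" being local.

```python
import threading


class EventCount:
    """Minimal eventcount sketch: waiters compare against a locally kept count."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0

    def read(self):
        # The "reset": the caller keeps the returned value privately.
        with self._cond:
            return self._count

    def advance(self):
        # Signal: bump the count and wake everyone so each can recheck.
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def await_change(self, seen, timeout=None):
        # Block until the count differs from `seen`.
        # Returns the new count, or None if the timeout expires first.
        with self._cond:
            if self._cond.wait_for(lambda: self._count != seen, timeout):
                return self._count
            return None
```

Usage is: seen = ec.read(), check whatever condition you care about,
then ec.await_change(seen). An advance() that slips in between the read
and the wait is not lost, because the count no longer matches.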

Joe Seigh
0
Reply Joe 9/14/2004 10:16:40 AM

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ci6c7h$ab3$1@pegasus.csx.cam.ac.uk...
>
> In article <6On1d.10507$HZ2.2523@news.cpqcorp.net>,
> Rick Jones <foo@bar.baz.invalid> writes:
> |>
> |> Let's back-up a step - were you saying that the transfers were not
> |> completing, or that the transfers were completing, but have corrupt
> |> files?  I (perhaps mistakenly) thought I read you writing that the
> |> transfers were not completing.
>
> Acrobat was gagging.  I didn't check the files in detail.
>
> |> So the transfers were not completing?
>
> Let me explain the issue.
>
> TCP/IP (in its general sense) and its 'Internet' interfaces specify
> actions on closing a connexion cleanly, but assume that connexions
> will be kept open until they are closed cleanly.  There is thus no
> architected way of indicating an unsuccessful close.  It isn't quite
> like that, but that is the effect.
>
> Most semi-decent systems (i.e. Unix, not Microsoft) do pass that
> information back up to the application, but all of them sometimes
> get it wrong and the indecent ones USUALLY get it wrong.  In
> particular, there is no way of passing that information through a
> filter which does not have a suitable out-of-band messaging system.
>
> FTP does not have a post-close checking 'flag', so cannot tell the
> difference between a break and a close.  Its assumption is that a
> transfer completes or hangs.  And that is the misdesign I was
> referring to.
>
> The effect is that it is quite common for transfers to appear to
> have completed successfully but actually to have failed.  In this
> case, at least almost all of the file transferred (i.e. it got its
> size right to the nearest few KB), but it could have been a last
> block problem.
>

Most ftp clients and servers support passing information about file sizes to
the client, so that the client knows if it has successfully downloaded the
whole file (although it can't tell if the download has been corrupted in
some way).  One particular client is very poor at this - Internet Explorer.
It will happily tell the user that a download has completed successfully,
when in fact it has stopped halfway.  There are some ftp servers that don't
give proper size information, so that even the best clients can only guess
as to whether the transfer has been completed successfully or has failed.
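
The kind of check a careful client can make is sketched below in Python.
The function name verify_download and its return strings are made up for
illustration. It assumes the expected size was obtained from the server
beforehand (in Python's ftplib that would typically be FTP.size(), where
the server supports it) and that a checksum, if any, was published
separately.

```python
import hashlib


def verify_download(data, expected_size, expected_md5=None):
    """Classify a finished transfer by comparing it with what was advertised."""
    if len(data) < expected_size:
        return "truncated"      # the connection closed before the whole file arrived
    if len(data) > expected_size:
        return "oversized"
    if expected_md5 is not None and hashlib.md5(data).hexdigest() != expected_md5:
        return "corrupt"        # right length, wrong contents
    return "ok"


payload = b"x" * 1000
digest = hashlib.md5(payload).hexdigest()
status_full = verify_download(payload, 1000, digest)
status_short = verify_download(payload[:-10], 1000, digest)
```

The size check alone catches the stopped-halfway case; only the checksum
can catch a file of the right length with wrong contents.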

The reason why ftp does not have proper completion signalling is that it
does not, as you previously stated, run on TCP/IP, but on UDP/IP.  TCP
communications establish a two-way link through a series of handshake
telegrams, and every data telegram must be acknowledged by the receiver.
UDP, on the other hand, is basically a one-way link (or, for ftp, two
one-way links in anti-parallel), and telegrams are not acknowledged.  This
reduces the overhead (which is why it is used for ftp - the assumption is
that a higher level checking mechanism such as md5sum will be used to verify
the transfer), but means that there is no way to distinguish between a lost
packet and no packet.



0
Reply David 9/14/2004 12:36:42 PM

David Brown wrote:

> The reason why ftp does not have proper completion signalling is that it
> does not, as you previously stated, run on TCP/IP, but on UDP/IP.

Best to quit while you're ahead.  FTP does not use UDP.
0
Reply _ 9/14/2004 12:47:21 PM


David Brown wrote:
> The reason why ftp does not have proper completion signalling is that it
> does not, as you previously stated, run on TCP/IP, but on UDP/IP.  TCP
> communications establish a two-way link through a series of handshake
> telegrams, and every data telegram must be acknowledged by the receiver.
> UDP, on the other hand, is basically a one-way link (or, for ftp, two
> one-way links in anti-parallel), and telegrams are not acknowledged.  This
> reduces the overhead (which is why it is used for ftp - the assumption is
> that a higher level checking mechanism such as md5sum will be used to verify
> the transfer), but means that there is no way to distinguish between a lost
> packet and no packet.

When did this happen?  I implemented an FTP client for both active and passive TCP
transfers and it seemed to work okay.  Of course shutting down the TCP connections
was fun as the FTP implementations varied widely.  Some shut down only if you
did a QUIT and others ignored the QUIT and you had to unilaterally close the
connection, in which case the former would complain.  HTTP implementations were
almost as bad and were just as likely to ignore the keepalive attribute no matter
what its proper default value was.  You basically cannot write a conforming
FTP or HTTP implementation because if you did, you wouldn't be able to talk
to most of the other servers/clients out there.

Joe Seigh
0
Reply Joe 9/14/2004 12:51:48 PM

Zalman Stern wrote:
> 
> Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<41461750.D7A640A3@sympaticoREMOVE.ca>...
> > If my understanding and analysis of the situation is correct,
> > and ignoring the Suspend problem, a PulseEvent to an auto-event
> > releasing a single thread may, ironically enough, be one of the
> > things that actually DOES always work as specified.
> 
> I've read your posts in this thread and I still do not understand why
> you say this.

Ok. This stuff really requires a white board.

> In some sense it works as specified because the
> specification says the event may be lost.

There are two issues here. The Suspend problem looks like a real bug
because the system is not behaving as intended. So let's just put that
one aside. The other behaviors are visible artifacts of the algorithm
WNT uses. I understand why *they* (MS) claim it is not a bug: because
it is functioning as designed. I believe the design is flawed,
and "deprecating PulseEvent" is not going to make it better.

The wording of the MS text implies that PulseEvent is unreliable
under all circumstances. I believe this to be both an overstatement
and understatement of the problem. It is an overstatement because
(if I understand correctly) PulseEvent should function reliably
when used in certain ways. It is an understatement because PulseEvent
is not the problem - the underlying design is the problem.

One of the ways that should work reliably is to apply PulseEvent
to an auto-event. That is supposed to release a single thread,
or if none is waiting, remember the pulse and release the next
thread that comes along. This should function reliably because
pulling an item out of the event queue and requeuing it should
not cause a pulse to be lost even with APC's being delivered.
If there is no thread waiting when the pulse occurs then the
event state is set and gets detected when the APC completes.

What should not work reliably is trying to release multiple threads
at once using either SetEvent or PulseEvent because the threads can
be off doing background work when the trigger occurs. PulseEvent
just amplifies the problem by increasing the window of vulnerability.

> With an auto-event, one would normally use SetEvent, not PulseEvent.
> This guarantees a sleeping thread will be woken even if it is handed
> an APC because the kernel does a ResetEvent equivalent when a thread
> is woken and not before. But this does not help with pthread condition
> variables because the signaled state cannot be persistent. (Not quite
> sure how to put it. Conceptually condvar's only state is a possibly
> empty list of waiters.)

I was not considering condvars, just looking at whether native events
can be made to function reliably as MS originally intended.
Condvars are considerations above and beyond that.

> > Furthermore, since a PulseEvent is logically very similar to
> > a high priority thread performing a SetEvent then a ResetEvent,
> > if PulseEvent is broken then Set-Reset should be too.
> 
> Microsoft's docs indicate that the same problems that apply to
> PulseEvent apply to SetEvent immediately followed by ResetEvent. If
> one thread is waiting on an event object and the kernel invokes an APC
> (something that happens at unpredictable times as far as most user
> code is concerned) and another thread on another processor pulses the
> event at exactly that moment, the first thread will go back to sleep
> after the APC is finished.

Ok, that is what I would expect their algorithm to do.
The text that Joe quoted looks like MS believes the problem
is confined to PulseEvent and so not using PulseEvent will
somehow make the problem go away. It won't.
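
A toy model may make the window easier to see. This is a sketch under stated
assumptions, not the real kernel algorithm: the class ManualResetEvent and its
begin_apc/end_apc methods are hypothetical names that only mimic the behaviour
described in this thread (a waiter is taken off the queue while its APC runs and
requeues afterwards).

```python
class ManualResetEvent:
    """Toy model of a manual-reset event as described in the thread (not Win32)."""

    def __init__(self):
        self.signaled = False
        self.wait_queue = []   # threads currently queued on the event
        self.released = []     # threads that have been woken

    def wait(self, thread):
        if self.signaled:
            self.released.append(thread)
        else:
            self.wait_queue.append(thread)

    def begin_apc(self, thread):
        # Assumption: the waiter is pulled off the queue to run the APC.
        self.wait_queue.remove(thread)

    def end_apc(self, thread):
        # Assumption: after the APC the thread starts its wait again.
        self.wait(thread)

    def set(self):
        self.signaled = True
        self.released.extend(self.wait_queue)
        self.wait_queue.clear()

    def reset(self):
        self.signaled = False

    def pulse(self):
        # Release whoever is queued right now, then leave it nonsignaled.
        self.released.extend(self.wait_queue)
        self.wait_queue.clear()
        self.signaled = False


def pulse(event):
    event.pulse()


def set_then_reset(event):
    event.set()
    event.reset()


def run_scenario(trigger):
    """Threads A and B wait; A is running an APC when the trigger fires."""
    event = ManualResetEvent()
    event.wait("A")
    event.wait("B")
    event.begin_apc("A")
    trigger(event)
    event.end_apc("A")
    return event.released, event.wait_queue
```

In this model both run_scenario(pulse) and run_scenario(set_then_reset) wake B
and leave A queued, which is the point: dropping PulseEvent in favour of a quick
Set-Reset does not close the window.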

> I suppose another way to look at it is that the program author can't
> tell the difference between "the thread was asleep inside WaitObject
> in the kernel" and "the thread fell asleep on the syscall instruction,
> which is not the kernel's fault." This is a valid response if one is
> not using SignalObjectAndWait .

BINGO! EXACTLY - you've got it. That appears to have been the
designers' original thinking: Since the event setter cannot know
whether the event receiver is performing a wait, or is 1 ns before
the wait, it should not affect the logic of a "correct program".
And since you cannot know whether something is queued waiting for
an event or not, it cannot hurt to dequeue it and requeue it.

However this non-deterministic, non-FIFO behavior can cause visible
side effects such as starvation or programs appearing to stall.
Some designs require FIFO behavior in order to function correctly
and not suffer live lock (busy wait) lock ups.

When you add SignalObjectAndWait the original assumption is no longer
true. Now you can "prove" that when a receiver sees an event that
the sender is waiting on another event, and this proof can be
necessary for an algorithm to work reliably. Except this "proof" is
wrong because the underlying mechanism does not work that way.
This can result in program hard lock ups.

> The entire point of a mechanism like pthread_cond_wait or an
> eventcount based sleep mechanism is to avoid sleep wakeup races. Since
> these are a significant source of problems in concurrent code, it
> seems just heinous to specify event mechanisms that cause this class
> of bugs rather than making it hard to have them.
> 
> When using pthread_cond_wait, there is a simple pattern which avoids
> sleep/wakeup races. One has to go out of one's way to write code that
> has such races. (Same applies to event counts, though I have not used
> them as much.) With win32 event objects, the operating system will
> insert the sleep/wakeup races for you. Even if it didn't, it turns out
> to be difficult to avoid them using event objects.
> 
> [...]
> > It might be possible to construct a scenario that logically
> > "can never fail" using SetEvent and SignalObjectAndWait that can
> > fail because of the non deterministic WaitFor behavior.
> >
> > 1) ThreadA waits for event A, then uses SignalObjectAndWait
> >    to set event B and wait for event X.
> > 2) ThreadB waits for event B, then uses SignalObjectAndWait
> >    to set event C and wait for event X.
> > 3) Set event A to trigger the sequence. It is now "guaranteed" that
> >    both threads A and B are waiting on event X.
> > 4) High priority thread C waits for event C then does a SetEvent and
> >    ResetEvent on event X and it should always release both
> >    threads A and B.
> >
> > BUT if thread A was off processing an APC it could miss its
> > trigger from thread C and block when it requeued for the wait.
> > Other scenarios may be possible to cause missed triggers.
> >
> > So I'm not so sure that this is confined to PulseEvent
> > but without access to the source code I can only speculate.
> 
> I do not understand why you need anything that complicated to
> demonstrate this bug in pulsed events.

It just gets two (or more) threads waiting on a common event X
(a quorum). Setting the event should *always* release all the members,
but it does not necessarily do so. This could cause a deadlock.

> [...]
> > I try to stay with their simplest interfaces and avoid their layered
> > products which add to the obfuscation. No MFC or .NET. I find this
> > greatly increases my product quality, minimizes my futzing about
> > and frustration, and gives improved delivery schedules.
> 
> It is pretty difficult to write competitive GUI apps for Windows
> without using higher level stuff... But the issue with synchronization
> objects is at the lowest level of the kernel and win32. No fancy
> toolkits here...
> 
> -Z-

Eric

Reply Eric 9/14/2004 2:39:49 PM


Eric wrote:
> 
> One of the ways that should work reliably is to apply PulseEvent
> to an auto-event. That is supposed to release a single thread,
> or if none is waiting, remember the pulse and release the next
> thread that comes along. This should function reliably because
> pulling an item out of the event queue and requeuing it should
> not cause a pulse to be lost even with APC's being delivered.
> If there is no thread waiting when the pulse occurs then the
> event state is set and gets detected when the APC completes.
> 
The documentation specifically says it does not do that.

  If no threads are waiting, or if no thread can be released immediately, PulseEvent
  simply sets the event object's state to nonsignaled and returns.

Joe Seigh
Reply Joe 9/14/2004 3:17:37 PM

Joe Seigh wrote:
> 
> Eric wrote:
> >
> > One of the ways that should work reliably is to apply PulseEvent
> > to an auto-event. That is supposed to release a single thread,
> > or if none is waiting, remember the pulse and release the next
> > thread that comes along. This should function reliably because
> > pulling an item out of the event queue and requeuing it should
> > not cause a pulse to be lost even with APC's being delivered.
> > If there is no thread waiting when the pulse occurs then the
> > event state is set and gets detected when the APC completes.
> >
> The documentation specifically says it does not do that.
> 
>   If no threads are waiting, or if no thread can be released immediately, PulseEvent
>   simply sets the event object's state to nonsignaled and returns.
> 
> Joe Seigh

Oops, you are right. I must have either misread it years ago or
read an incorrect article, assumed it worked "the correct way",
and just never double checked as I do not use it. My apologies.

Yes, if it does not leave it signaled, then it would not work.
How dumb.

Well, in that case there is no situation in which one can rely on a
thread being released. PulseEvent is exactly the same as a quick
Set-Reset sequence (so deprecating it still makes no difference),
and the rest of what I said still applies.

Eric

Reply Eric 9/14/2004 5:11:15 PM

In comp.arch David Brown <david@no.westcontrol.spam.com> wrote:
> The reason why ftp does not have proper completion signalling is
> that it does not, as you previously stated, run on TCP/IP, but on
> UDP/IP.  

I believe you have confused FTP, which does indeed use TCP for its
transport, with TFTP, which uses UDP.  While there is a considerable
substring match on their acronyms, they are _very_ different beasts.
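
To make the difference concrete, here is a small sketch, assuming the well-known
ports (21 for the FTP control connection, 69 for TFTP) and the RFC 1350 packet
layout. The names FTP_CONTROL, TFTP and build_tftp_rrq are just illustrative.

```python
import socket
import struct

# (socket type, well-known port) for each protocol
FTP_CONTROL = (socket.SOCK_STREAM, 21)   # FTP: TCP control connection
TFTP = (socket.SOCK_DGRAM, 69)           # TFTP: UDP datagrams


def build_tftp_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", 1)             # 2-byte opcode in network byte order
        + filename.encode("ascii")
        + b"\x00"
        + mode.encode("ascii")
        + b"\x00"
    )
```

Because UDP itself acknowledges nothing, TFTP has to carry its own ACK packets
at the application level, block by block.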

rick jones
-- 
Process shall set you free from the need for rational thought. 
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...
Reply Rick 9/14/2004 6:41:23 PM

Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<414702B5.82DD1296@sympaticoREMOVE.ca>...
> One of the ways that should work reliably is to apply PulseEvent
> to an auto-event. That is supposed to release a single thread,
> or if none is waiting, remember the pulse and release the next
> thread that comes along.

If this is the case, it is in direct contradiction to Microsoft's
documentation of PulseEvent:
[
For a manual-reset event object, all waiting threads that can be
released immediately are released. The function then resets the event
object's state to nonsignaled and returns.

For an auto-reset event object, the function resets the state to
nonsignaled and returns after releasing a single waiting thread, even
if multiple threads are waiting.

If no threads are waiting, or if no thread can be released
immediately, PulseEvent simply sets the event object's state to
nonsignaled and returns.
]

This discrepancy is the cause of my misunderstanding about what Eric
is saying.

I'll have to write the test case to figure out what PulseEvent really
does. How does anyone program with concurrency primitives that don't
have a formal specification? Put another way, it seems silly to talk
about how hard concurrent programming is when we can't even reliably
figure out what the API primitives actually do...

-Z-
Reply googlenews 9/14/2004 8:09:25 PM

Zalman Stern wrote:
> 
> Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<414702B5.82DD1296@sympaticoREMOVE.ca>...
> > One of the ways that should work reliably is to apply PulseEvent
> > to an auto-event. That is supposed to release a single thread,
> > or if none is waiting, remember the pulse and release the next
> > thread that comes along.
> 
> If this is the case, it is in direct contradiction to Microsoft's
> documentation of PulseEvent:
> [
> For a manual-reset event object, all waiting threads that can be
> released immediately are released. The function then resets the event
> object's state to nonsignaled and returns.
> 
> For an auto-reset event object, the function resets the state to
> nonsignaled and returns after releasing a single waiting thread, even
> if multiple threads are waiting.
> 
> If no threads are waiting, or if no thread can be released
> immediately, PulseEvent simply sets the event object's state to
> nonsignaled and returns.
> ]
> 
> This discrepancy is the cause of my misunderstanding about what Eric
> is saying.
> 
> I'll have to write the test case to figure out what PulseEvent really
> does. How does anyone program with concurrency primitives that don't
> have a formal specification? Put another way, it seems silly to talk
> about how hard concurrent programming is when we can't even reliably
> figure out what the API primitives actually do...

Don't bother, the mistake was mine. I either misread it years ago or
forgot how it worked and assumed it worked the way I thought it should.
Apologies for any confusion I caused.

However I don't believe my error on PulseEvent affects any of the
other things on how it works. It just removes the only mechanism
(I thought) I knew that would always release a single thread.

Eric

Reply Eric 9/14/2004 9:06:22 PM


Zalman Stern wrote:
> 
> I'll have to write the test case to figure out what PulseEvent really
> does. How does anyone program with concurrency primitives that don't
> have a formal specification? Put another way, it seems silly to talk
> about how hard concurrent programming is when we can't even reliably
> figure out what the API primitives actually do...
> 

Like Posix?

From the Single Unix Specification

  Formal definitions of the memory model were rejected as unreadable by the vast
  majority of programmers. In addition, most of the formal work in the literature has
  concentrated on the memory as provided by the hardware as opposed to the application
  programmer through the compiler and runtime system. It was believed that a simple
  statement intuitive to most programmers would be most effective.
  IEEE Std 1003.1-2001 defines functions that can be used to synchronize access to
  memory, but it leaves open exactly how one relates those functions to the semantics of
  each function as specified elsewhere in IEEE Std 1003.1-2001. IEEE Std 1003.1-2001
  also does not make a formal specification of the partial ordering in time that the
  functions can impose, as that is implied in the description of the semantics of each
  function. It simply states that the programmer has to ensure that modifications do not
  occur "simultaneously" with other access to a memory location.

There is no formal specification for Posix threads.  Basically everyone implements pthreads
based on their own personal understanding of the api, whatever that is.  Based on
the number of times I've heard people explain you need to use locks to "flush" cache
so you can see the changes from other threads, I'd say that intuitive definition isn't
very intuitive.

Joe Seigh
Reply Joe 9/14/2004 9:09:26 PM

Joe Seigh wrote:
> 
> Zalman Stern wrote:
> 
>>I'll have to write the test case to figure out what PulseEvent really
>>does. How does anyone program with concurrency primitives that don't
>>have a formal specification? Put another way, it seems silly to talk
>>about how hard concurrent programming is when we can't even reliably
>>figure out what the API primitives actually do...
>>
> 
> 
> Like Posix?
> 
> From the Single Unix Specification
> 
>   Formal definitions of the memory model were rejected as unreadable by the vast
>   majority of programmers. In addition, most of the formal work in the literature has
>   concentrated on the memory as provided by the hardware as opposed to the application
>   programmer through the compiler and runtime system. It was believed that a simple
>   statement intuitive to most programmers would be most effective.
>   IEEE Std 1003.1-2001 defines functions that can be used to synchronize access to
>   memory, but it leaves open exactly how one relates those functions to the semantics of
>   each function as specified elsewhere in IEEE Std 1003.1-2001. IEEE Std 1003.1-2001
>   also does not make a formal specification of the partial ordering in time that the
>   functions can impose, as that is implied in the description of the semantics of each
>   function. It simply states that the programmer has to ensure that modifications do not
>   occur "simultaneously" with other access to a memory location.
> 
> There is no formal specification for Posix threads.  Basically everyone implements pthreads
> based on their own personal understanding of the api, whatever that is.  Based on
> the number of times I've heard people explain you need to use locks to "flush" cache
> so you can see the changes from other threads, I'd say that intuitive definition isn't
> very intuitive.
> 

This post would be one for the archives, except that I think Nick has 
been saying more or less the same thing for at least as long as I've 
been following comp.arch.

It renders not only discussions of the difficulty of concurrent 
programming meaningless, but also any discussion of complexity or of 
what the real limits of scaling might be for very large systems.  The 
limit, apparently, is one thread, unless all the programming is being 
done by one person or by a group of people each of whom can read all the 
  others' minds.

RM

Reply Robert 9/14/2004 10:28:35 PM

Joe Seigh wrote:

> 
> 
> Zalman Stern wrote:
>> 
>> I'll have to write the test case to figure out what PulseEvent really
>> does. How does anyone program with concurrency primitives that don't
>> have a formal specification? Put another way, it seems silly to talk
>> about how hard concurrent programming is when we can't even reliably
>> figure out what the API primitives actually do...

Indeed, Brother Stern. :(

> Like Posix?
> 
> From the Single Unix Specification
> 
>   Formal definitions of the memory model were rejected as unreadable by
>   the vast majority of programmers. In addition, most of the formal work
>   in the literature has concentrated on the memory as provided by the
>   hardware as opposed to the application programmer through the compiler
>   and runtime system. It was believed that a simple statement intuitive to
>   most programmers would be most effective. IEEE Std 1003.1-2001 defines

That raises the question of why the hell they
didn't produce a formal spec anyway. A formal spec
doesn't prevent them from writing commentaries in
plain English for the layman...

They would have made a bit of extra cash on the
side flogging some books doing just that.

Cheers,
Rupert
Reply Rupert 9/14/2004 10:46:26 PM

Joe Seigh wrote:

[...]

> From the Single Unix Specification ...

It was written many moons ago when full stop membar was the state 
of the art. Well,

http://groups.google.com/groups?selm=3EEF38E8.311EEC77%40web.de
http://groups.google.com/groups?selm=40A8B15E.FDD505E8%40web.de

regards,
alexander.
Reply Alexander 9/14/2004 11:13:25 PM

Robert Myers wrote:
> This post would be one for the archives, except that I think Nick has 
> been saying more or less the same thing for at least as long as I've 
> been following comp.arch.

And which others seem to have started to voice as the slogan "SMP 
considered harmful"...

> It renders not only discussions of the difficulty of concurrent 
> programming meaningless, but also any discussion of complexity or of 
> what the real limits of scaling might be for very large systems.  The 
> limit, apparently, is one thread, unless all the programming is being 
> done by one person or by a group of people each of whom can read all the 
>  others' minds.

A model based on communicating (peripherally) strictly sequential 
processes with no shared data doesn't have any of these problems, 
though, and can be used for highly-parallel programs.  And 
compilers can go wild with the reordering and common subexpression 
elimination because they can have a clue about what's happening... 
  Of course, there may very well be other problems...
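
As a rough illustration of that style, here is a minimal sketch using Python's
standard queue and threading modules. The names worker and square_all are
invented for the example, and since Python threads do share an address space,
the "no shared data" rule is kept here by convention only: the worker touches
nothing but its two queues.

```python
import queue
import threading


def worker(inbox, outbox):
    """A strictly sequential process that communicates only through queues."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: no more work
            outbox.put(None)
            return
        outbox.put(item * item)


def square_all(values):
    """Send each value to the worker and collect the replies in order."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for value in values:
        inbox.put(value)
    inbox.put(None)
    results = []
    while True:
        reply = outbox.get()
        if reply is None:
            break
        results.append(reply)
    thread.join()
    return results
```

All synchronization lives inside the queue implementation, so the code on
either side reads as plain sequential code.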

So, are there any other, useful, threading modes with rigorous 
definitions?  How about the intrinsic threading of Mach (which 
starts with threads and builds processes around them, rather than 
dividing processes up into threads, if I remember rightly)?  Any 
other takers?  How are the users of existing really big SMP 
systems by the likes of SGI and Sun getting anything done?  Are 
there subsets of posix threads functionality that are safe to use, 
and with which you can get useful work done, or is useful work 
done by augmenting posix threads with system-specific memory 
barrier functions or compiler extensions?

Cheers,

-- 
Andrew
Reply Andrew 9/15/2004 1:39:33 AM


Andrew Reilly wrote:
> 
> So, are there any other, useful, threading modes with rigorous
> definitions?  How about the intrinsic threading of Mach (which
> starts with threads and builds processes around them, rather than
> dividing processes up into threads, if I remember rightly)?  Any
> other takers?  How are the users of existing really big SMP
> systems by the likes of SGI and Sun getting anything done?  Are
> there subsets of posix threads functionality that are safe to use,
> and with which you can get useful work done, or is useful work
> done by augmenting posix threads with system-specific memory
> barrier functions or compiler extensions?
> 

I took a try at formally defining mutexes at least.  I think I
have a better way that will handle defining other synchronization
mechanisms as well.  There didn't seem to be much interest in it.

What's SGI doing?  Using Linux.  Apart from improvements in (I think)
i/o and numa, they worked on tuning RCU to scale well with that many
processors.

Lock-free stuff like RCU seems to be key in getting scalability since
in a reader/writer situation you can eliminate reader synchronization,
i.e. no reader locks, which cuts down considerably on synchronization
overhead.
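
A very rough sketch of the read-copy-update idea, for readers who have not met
it. This is not how the Linux kernel does it (real RCU tracks grace periods
before reclaiming old versions); the hypothetical SnapshotTable below leans on
two assumptions instead: a reference store is atomic in CPython, and the garbage
collector frees an old snapshot once no reader holds it.

```python
import threading


class SnapshotTable:
    """Toy read-copy-update style table: readers take no lock."""

    def __init__(self, initial=None):
        self._snapshot = dict(initial or {})   # never mutated once published
        self._writer_lock = threading.Lock()   # serializes writers only

    def read(self, key, default=None):
        # One reference load; the snapshot it names never changes afterwards.
        return self._snapshot.get(key, default)

    def update(self, key, value):
        with self._writer_lock:
            fresh = dict(self._snapshot)       # copy the current version
            fresh[key] = value                 # update the private copy
            self._snapshot = fresh             # publish with a single store
```

Readers that grabbed the old snapshot keep a consistent view of it; new readers
see the new one. The cost moves to the writer, which copies on every update.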

I don't know if it's trickier but it's different than normal synchronization
usage patterns.  There's a learning curve anyway.

Joe Seigh
Reply Joe 9/15/2004 2:36:47 AM

"Andrew Reilly" <andrew-newspost@areilly.bpc-users.org> wrote in message
news:41479D55.3080604@areilly.bpc-users.org...

snip

>  How are the users of existing really big SMP
> systems by the likes of SGI and Sun getting anything done?

I think there are two models here.  For Sun, most of their customers are in
the commercial world, so they use the transaction model, where each
transaction is single threaded but the parallelism occurs due to lots of
transactions in process simultaneously.  For SGI, some of their workload is
like that, but much of it is scientific, HPC-type work.  They can use SMP or
message passing but the key is that there is a modest sized cadre of
people who can spend a lot of time tuning very valuable applications to
within an inch of their life.  That works if you can spend the time on a few
applications, but the supply of such people is quite limited.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam


Reply Stephen 9/15/2004 6:33:14 AM

Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<41475E4D.1CB8A451@xemaps.com>...
> Like Posix?
[...]

The underlying synchronization model used by pthreads is that of
Modula-2+ (and later Modula-3) from DEC SRC. (Via lineage from C. A.
R. Hoare and the work at Xerox PARC by Butler Lampson and others.)
There is a formal specification of the SRC synchronization mechanism,
which I mentioned in a previous post -- DEC SRC Research Report 20,
but it does not explicitly mention memory barriers or related issues.

My reading of this report suggests the following:
    1) Visibility of shared variables is entirely controlled by mutex
acquire and release. Shared variable modifications must happen within
a mutual exclusion scope provided by a mutex. The acquire and release
operations on the mutex provide necessary memory barriers.
    2) The operation of signal and broadcast should work without
regard to memory barriers. That is, no use of signal or broadcast
should go from incorrect to correct due to the insertion of a memory
barrier alone. Correct uses of condition variables usually (and
perhaps always) require use with a mutex as well.
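
The usual mutex-plus-condition pattern those two points imply can be sketched
as follows. This uses Python's threading.Condition as a stand-in for a pthread
mutex and condvar pair; the Mailbox class is a made-up example, not anything
from the SRC report.

```python
import threading


class Mailbox:
    """Predicate-loop pattern: shared state changes only under the mutex."""

    def __init__(self):
        self._cond = threading.Condition()   # a mutex plus a condition variable
        self._items = []

    def put(self, item):
        with self._cond:                     # modify the predicate under the mutex
            self._items.append(item)
            self._cond.notify()              # analogous to pthread_cond_signal

    def get(self):
        with self._cond:
            while not self._items:           # re-check the predicate after each wakeup
                self._cond.wait()            # releases the mutex while sleeping
            return self._items.pop(0)
```

The state that matters lives in _items, not in the condition variable, so a put
that happens before the get is not lost; the waiter simply never sleeps.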

Due to this, I'm more or less ok with POSIX not providing a formal
specification. Though it would definitely improve the world if they
were to do so.

Beyond that, there is Andre Birrell's
_An_Introduction_To_Programming_with_Threads_ which I believe applies
to pthreads. (That is, if a pthreads implementation doesn't meet the
spec implied in that document, it is buggy.)

Dr. Birrell has written a similar paper for C#/.NET, which can be
found at:
    http://research.microsoft.com/~birrell/papers/ThreadsCSharp.pdf
While this is not at all a formal specification, it is good enough to
write code by. (And clearly the person who wrote the example code in
the Monitor.Pulse docs did not read that paper.)

(I have never seen anything at the level of those two papers for the
win32 synchronization API primitives.)

Memory barrier issues and Joe's mentioning lock free implementation
and avoiding extra counters strikes me as being on the cutting edge a
bit. This is vital and all, but it is a ways beyond just getting
correctness. All problems are hard when one tries to optimize them to
great extents.

I'd be interested in seeing data on how much performance there is to
be gained using relaxed memory models compared to stricter ones.

-Z-
Reply googlenews 9/15/2004 7:49:24 AM

In article <41477B15.3ED18933@web.de>,
Alexander Terekhov <terekhov@web.de> writes:
|> Joe Seigh wrote:
|> 
|> > From the Single Unix Specification ...
|> 
|> It was written many moons ago when full stop membar was the state 
|> of the art. ...

At least as known to those who were unaware that there was any
IT development before Unix.


Regards,
Nick Maclaren.
Reply nmm1 9/15/2004 8:24:14 AM

In article <41479D55.3080604@areilly.bpc-users.org>, Andrew Reilly <andrew-newspost@areilly.bpc-users.org> writes:
|> Robert Myers wrote:
|> > This post would be one for the archives, except that I think Nick has 
|> > been saying more or less the same thing for at least as long as I've 
|> > been following comp.arch.
|> 
|> And which others seem to have started to voice as the slogan "SMP 
|> considered harmful"...

And which I then say is wrong, and the correct statement should
be more like "bolted-on SMP considered harmful" :-)


Regards,
Nick Maclaren.
Reply nmm1 9/15/2004 8:25:41 AM

In article <2d24c5a8.0409142349.57284419@posting.google.com>,
googlenews@peachfish.com (Zalman Stern) writes:
|> 
|> Beyond that, there is Andre Birrell's
|> _An_Introduction_To_Programming_with_Threads_ which I believe applies
|> to pthreads. (That is, if a pthreads implementation doesn't meet the
|> spec implied in that document, it is buggy.)

Andrew Birrell?  Nice to hear of him again.

But why has he been promoted to Keeper of the One True Interpretation
of POSIX?  He was (and I assume is) extremely competent, but I saw
no halo over his head when he was here.  My assumption would be that,
if a pthreads implementation doesn't meet the specification implied
in that document, it is either very specialised or perverse, but that
is not the same as buggy.




Regards,
Nick Maclaren.
Reply nmm1 9/15/2004 8:29:39 AM

"_" <_@_._> wrote in message
news:4146e859$0$20579$afc38c87@news.optusnet.com.au...
> David Brown wrote:
>
> > The reason why ftp does not have proper completion signalling is that it
> > does not, as you previously stated, run on TCP/IP, but on UDP/IP.
>
> Best to quit while you're ahead.  FTP does not use UDP.

Ok, I quit (even though I'm not ahead on this one).  As another poster said,
I probably mixed it up with TFTP.  Sorry.




Reply David 9/15/2004 10:32:18 AM


Zalman Stern wrote:
> 
> Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<41475E4D.1CB8A451@xemaps.com>...
> > Like Posix?
> [...]
> 
> The underlying synchronization model used by pthreads is that of
> Modula-2+ (and later Modula-3) from DEC SRC. (Via lineage from C. A.
> R. Hoare and the work at Xerox PARC by Butler Lampson and others.)
> There is a formal specification of the SRC synchronization mechanism,
> which I mentioned in a previous post -- DEC SRC Research Report 20,
> but it does not explicitly mention memory barriers or related issues.
> 
> My reading of this report suggests the following:
>     1) Visibility of shared variables is entirely controlled by mutex
> acquire and release. Shared variable modifications must happen within
> a mutual exclusion scope provided by a mutex. The acquire and release
> operations on the mutex provide necessary memory barriers.
>     2) The operation of signal and broadcast should work without
> regard to memory barriers. That is, no use of signal or broadcast
> should go from incorrect to correct due to the insertion of a memory
> barrier alone. Correct uses of condition variables usually (and
> perhaps always) require use with a mutex as well.

I had thought you could get away without memory barriers in signaling
but it's not so.  Whatever condition you are signaling has to be
visible before the signal is received.  The use of mutexes with
condvars obscures this requirement.  When you use signaling with
lock-free data structures it becomes more obvious.
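
The ordering requirement can be shown in miniature. Python does not expose
memory barriers, so this is only a sketch of the ordering, assuming that
threading.Event (which is built on a lock) supplies the needed visibility; with
a genuinely lock-free structure in C you would need an explicit release barrier
before the signal and an acquire barrier after the wait. The Handoff class is a
hypothetical example.

```python
import threading


class Handoff:
    """One-shot handoff: the payload is written before the signal is raised."""

    def __init__(self):
        self._ready = threading.Event()
        self._payload = None

    def publish(self, payload):
        self._payload = payload   # 1. make the condition true
        self._ready.set()         # 2. only then signal it

    def consume(self, timeout=None):
        if not self._ready.wait(timeout):
            raise TimeoutError("no signal received")
        return self._payload      # relies on the write from step 1 being visible
```

Swap steps 1 and 2 in publish and the consumer can wake up and read None, which
is the failure that the mutex normally hides.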

[...]
> 
> Memory barrier issues and Joe's mentioning lock free implementation
> and avoiding extra counters strikes me as being on the cutting edge a
> bit. This is vital and all, but it is a ways beyond just getting
> correctness. All problems are hard when one tries to optimize them to
> great extents.
> 
> I'd be interested in seeing data on how much performance there is to
> be gained using relaxed memory models compared to stricter ones.
> 

That is what the benchmarks purport to show, though in some sense they
are self-fulfilling prophecies.  Hardware is designed for usages that
the hardware designers think are important.  Single threaded and large
granularity multi-threading is considered important and the deeply
pipelined processors reward that kind of program behavior with improved
performance and penalize fine grained multi-threading.  As the latter
is discouraged and becomes less prevalent, it becomes less of a factor
in average program behavior and so on.  The whole setup has feedback
built into it.

There are some aspects of hyperthreading that allow more optimal fine
grained multi-threading.  But since that isn't benchmarked, that kind
of stuff could be dropped without warning.   So there is no point in
even trying to exploit it.

Joe Seigh
Reply Joe 9/15/2004 12:45:13 PM

Eric wrote:
> 
> Joe Seigh wrote:
> >
> > Eric wrote:
> > >
> > > One of the ways that should work reliably is to apply PulseEvent
> > > to an auto-event. That is supposed to release a single thread,
> > > or if none is waiting, remember the pulse and release the next
> > > thread that comes along. This should function reliably because
> > > pulling an item out of the event queue and requeuing it should
> > > not cause a pulse to be lost even with APC's being delivered.
> > > If there is no thread waiting when the pulse occurs then the
> > > event state is set and gets detected when the APC completes.
> > >
> > The documentation specifically says it does not do that.
> >
> >   If no threads are waiting, or if no thread can be released immediately, PulseEvent
> >   simply sets the event object's state to nonsignaled and returns.
> >
> > Joe Seigh
> 
> Oops, you are right. I must have either misread it years ago or
> read an incorrect article, assumed it worked "the correct way",
> and just never double checked as I do not use it. My apologies.
> 
> Eric

It is SetEvent that leaves the event signaled not PulseEvent.
I knew there is a way to always release a single thread.
Just suffered a slight memory brain fart.
To rephrase:

One of the ways that should work reliably is to apply SetEvent
to an auto-event. That is supposed to release a single thread,
or if none is waiting, remember the signal and release the next
thread that comes along. This should function reliably because
pulling an item out of the event queue and requeuing it should
not cause a signal to be lost even with APC's being delivered.
If there is no thread waiting when the signal occurs then the
event state is set and gets detected when the APC completes.
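
To pin down the semantics I mean, here is a toy model. It is only a
sketch in Python, not the Win32 API: the class name AutoResetEvent and
the methods set, pulse and wait are made up for illustration, and APC
delivery is not modeled at all. It assumes set() behaves as described
above and pulse() behaves as in the documentation Joe quoted.

```python
import threading


class AutoResetEvent:
    """Toy model of an auto-reset event (hypothetical, not the kernel object)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._signaled = False  # persistent state, as left by set()
        self._waiters = 0       # threads currently blocked in wait()
        self._releases = 0      # wakeups handed to blocked threads, not yet taken

    def set(self):
        # Modeled on SetEvent: release one blocked thread, or remember the signal.
        with self._cond:
            if self._waiters > self._releases:
                self._releases += 1
                self._cond.notify_all()
            else:
                self._signaled = True

    def pulse(self):
        # Modeled on the documented PulseEvent: release a thread only if one
        # is blocked right now; the state always ends up nonsignaled.
        with self._cond:
            if self._waiters > self._releases:
                self._releases += 1
                self._cond.notify_all()
            self._signaled = False

    def wait(self, timeout=None):
        with self._cond:
            if self._signaled:
                self._signaled = False  # auto-reset: one waiter consumes it
                return True
            self._waiters += 1
            ok = self._cond.wait_for(lambda: self._releases > 0, timeout)
            self._waiters -= 1
            if ok:
                self._releases -= 1
            return ok
```

In this model a set() with nobody waiting is picked up by the next
wait(), while a pulse() with nobody waiting is simply gone. Whether the
real implementation matches this when APCs are in flight is exactly the
open question.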

Eric

0
Reply Eric 9/15/2004 1:27:27 PM

>>I'd be interested in seeing data on how much performance there is to
>>be gained using relaxed memory models compared to stricter ones.
> That is what the benchmarks purport to show, though in some sense they
> are self fulfilling prophesies.  Hardware is designed for usages that
> the hardware designers think is important.  Single threaded and large
> granularity multi-threading is considered important and the deeply
> pipelined processors reward that kind of program behavior with improved
> performance and penalize fine grained multi-threading. 

I don't understand that argument. A more relaxed memory model allows the
code to define those points in time at which a guarantee about externally-
visible memory content is made, compared to a stricter model where the
hardware is continually working to ensure such guarantees. It seems to me
that this optimization will benefit both coarse- and fine-grained threaded
applications, although the latter to a somewhat smaller degree. I cannot
imagine a case, however, where there would be worse performance for the
relaxed memory model compared to the stricter one (except for very patho-
logical programs that continually trigger the memory barrier mechanism -
that is why some SMT systems have a special spinlock-wait instruction).

	Jan
0
Reply ISO 9/15/2004 1:27:48 PM

In article <2qqualF138nblU1@uni-berlin.de>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> writes:
|> >>I'd be interested in seeing data on how much performance there is to
|> >>be gained using relaxed memory models compared to stricter ones.
|> 
|> > That is what the benchmarks purport to show, though in some sense they
|> > are self fulfilling prophesies.  Hardware is designed for usages that
|> > the hardware designers think is important.  Single threaded and large
|> > granularity multi-threading is considered important and the deeply
|> > pipelined processors reward that kind of program behavior with improved
|> > performance and penalize fine grained multi-threading. 
|> 
|> I don't understand that argument. A more relaxed memory model allows the
|> code to define those points in time at which a guarantee about externally-
|> visible memory content is made, compared to a stricter model where the
|> hardware is continually working to ensure such guarantees. It seems to me
|> that this optimization will benefit both coarse- and fine-grained threaded
|> applications, although the latter to a somewhat smaller degree. I cannot
|> imagine a case, however, where there would be worse performance for the
|> relaxed memory model compared to the stricter one (except for very patho-
|> logical programs that continually trigger the memory barrier mechanism -
|> that is why some SMT systems have a special spinlock-wait instruction).

Well, I have fairly often been in the situation of having to write
some spaghetti of locking calls to bypass the problems caused by a
relaxed synchronicity model.  In other cases, I have had to plaster
the code with synchronisation operations.  The cost of those often
dominated the costs of the operation, sometimes by large factors.
[ This has usually been for non-memory uses, but that is the same
problem. ]

Now, in all cases, that was because the designers had wanted to
have the advantages of being constrained only by a relaxed model,
without accepting the responsibility for providing appropriate
synchronisation primitives.  As a general rule, the more relaxed
a model, the more care and effort must go into those - and it is
rarely done.

So I think that it is more an issue about good design than anything
else.


Regards,
Nick Maclaren.
0
Reply nmm1 9/15/2004 1:36:18 PM

> One of the ways that should work reliably is to apply SetEvent
> to an auto-event. That is supposed to release a single thread,
> or if none is waiting, remember the signal and release the next
> thread that comes along. This should function reliably because
> pulling an item out of the event queue and requeuing it should
> not cause a signal to be lost even with APC's being delivered.
> If there is no thread waiting when the signal occurs then the
> event state is set and gets detected when the APC completes.

However, it remains to be verified that the implementation actually
behaves as described in this particular case.

	Jan
0
Reply ISO 9/15/2004 2:38:35 PM

Robert Myers <rmyers1400@comcast.net> writes:

> Joe Seigh wrote:
> > Zalman Stern wrote:
> >
> >>I'll have to write the test case to figure out what PulseEvent really
> >>does. How does anyone program with concurrency primitives that don't
> >>have a formal specification? Put another way, it seems silly to talk
> >>about how hard concurrent programming is when we can't even reliably
> >>figure out what the API primitives actually do...
> >>
> > Like Posix?
> > From the Single Unix Specification
> >   Formal definitions of the memory model were rejected as unreadable by the vast
> >   majority of programmers. 

The real question to me, then, is how to address this problem. Complaining that
the "intuitive" approach doesn't work either doesn't do us much good.

> It renders not only discussions of the difficulty of concurrent programming
> meaningless, but also any discussion of complexity or of what the real
> limits of scaling might be for very large systems.  The limit, apparently,
> is one thread, unless all the programming is being done by one person or by
> a group of people each of whom can read all the others' minds.

-- 
David Gay
dgay@acm.org
0
Reply David 9/15/2004 4:39:35 PM

googlenews@peachfish.com (Zalman Stern) wrote in message news:<2d24c5a8.0409142349.57284419@posting.google.com>...
> Beyond that, there is Andre Birrell's
> _An_Introduction_To_Programming_with_Threads_ which I believe applies
> to pthreads. (That is, if a pthreads implementation doesn't meet the
> spec implied in that document, it is buggy.)

My apologies for the typo in Dr. Andrew Birrell's first name.

-Z-
0
Reply googlenews 9/15/2004 5:28:35 PM

Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<4148433F.8895C1C9@sympaticoREMOVE.ca>...
> It is SetEvent that leaves the event signaled not PulseEvent.
> I knew there is a way to always release a single thread.
> Just suffered a slight memory brain fart.
> To rephrase:
[...]

As I wrote in the original response to Eric (a parent message of this
one):
[
With an auto-event, one would normally use SetEvent, not PulseEvent.
This guarantees a sleeping thread will be woken even if it is handed
an APC because the kernel does a ResetEvent equivalent when a thread
is woken and not before. But this does not help with pthread condition
variables because the signaled state cannot be persistent. (Not quite
sure how to put it. Conceptually condvar's only state is a possibly
empty list of waiters.)
]

Not only did Eric have a brain fart, he apparently isn't reading other
people's stuff very carefully.

To expand on the last point in the quoted text, even with
SignalObjectAndWait, one cannot use an auto-reset event and SetEvent
to get pthread_cond_signal-like behavior.
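
To make the "no persistent state" point concrete, here is a small
sketch. It uses Python's threading.Condition as a stand-in for a
pthread condition variable, which is an assumption of the illustration
rather than a statement about any pthreads implementation; the names
Mailbox, post, take and bare_notify_is_lost are hypothetical.

```python
import threading


def bare_notify_is_lost(timeout=0.05):
    """Signal a condvar that has no waiters, then wait on it.

    Returns True if the later wait timed out, i.e. the signal left no trace.
    """
    cond = threading.Condition()
    with cond:
        cond.notify()  # nobody is waiting, so nothing is recorded
    with cond:
        return not cond.wait(timeout)


class Mailbox:
    """The usual remedy: keep the state in a predicate under the same lock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ready = False

    def post(self):
        with self._cond:
            self._ready = True  # the state lives here, not in the condvar
            self._cond.notify()

    def take(self, timeout=None):
        with self._cond:
            ok = self._cond.wait_for(lambda: self._ready, timeout)
            if ok:
                self._ready = False
            return ok
```

The condvar itself only ever knows who is waiting at the moment of the
signal. Anything that must survive until a later waiter arrives has to
be stored in the predicate, which is the opposite of what an auto-reset
event does.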

-Z-
0
Reply googlenews 9/15/2004 5:34:51 PM

David Gay wrote:

> 
>>Joe Seigh wrote:
>>
>>>Zalman Stern wrote:
>>>
>>>
>>>>I'll have to write the test case to figure out what PulseEvent really
>>>>does. How does anyone program with concurrency primitives that don't
>>>>have a formal specification? Put another way, it seems silly to talk
>>>>about how hard concurrent programming is when we can't even reliably
>>>>figure out what the API primitives actually do...
>>>>
>>>
>>>Like Posix?
>>>From the Single Unix Specification
>>>  Formal definitions of the memory model were rejected as unreadable by the vast
>>>  majority of programmers. 
> 
> 
> The real question to me, then, is how to address this problem. Complaining that
> the "intuitive" approach doesn't work either doesn't do us much good.
> 

I am reminded of my first graduate year mathematical physics text, which 
stated, roughly, that it's okay to use a "rough and ready" approach to 
applied mathematics, as long as you know that someone competent has 
worked through the details at least once.

There are serviceable and presentable programming models for concurrent 
systems that have been subjected to mathematical scrutiny, but, for the 
most part, they don't seem to be used.

Even more troubling is that the memory semantics of the hardware on 
which the programming model is to be implemented never seem even once to 
have been subjected to mathematical scrutiny in a way that verifies that 
anything universally predictable or even comprehensible is being 
implemented.  It would appear that it is left to the programmer to wade 
through the details of the memory semantics of each hardware 
implementation to be used.  If I'm wrong about that, I'm certain I'll be 
informed in the most definite terms.

The semantics of both hardware and software should be reducible to 
formalism in a way that permits checking for correctness, independent of 
the comfort level of some particular poster to comp.arch.  The 
competence to carry through such a program certainly exists.  The will, 
apparently, does not.

RM

0
Reply Robert 9/15/2004 5:59:38 PM

In article <nTm_c.371401$%_6.4568@attbi_s01>,
Robert Myers  <rmyers1400@comcast.net> wrote:
>Stephen Fuld wrote:
>> described with the paraphrase "SMP considered harmful to parallel
>> programming progress".  That is SMP is like the use of the Goto ..
>
>No. No. No. No. No. No.
>Single-processor system images harmful to parallel programming.
>SPSI = Everything has to cross user/kernel space boundaries and 
>communication stack for any kind of nontrivial parallelism.
>
>Better ways to do it than classic SMP?  I'm sure there are, but tens of 
>thousands of instances of the Linux kernel aren't the answer, either.

Likely not the answer except in politically QD (quick and dirty) clusters.

Oh, there are people working on things in labs in quiet ways (some
making the literature).  The people who have to know will know.
The problem is whether some of those ideas will come to a general market.

One good example was the hardware used in genome sequencing.
People paid there.  It was worth it to them.  But there were
several horse races, most of which weren't seen by the general public,
not just the private vs. Fed.


Gotta go.

-- 
0
Reply eugene 9/15/2004 7:01:14 PM

Zalman Stern wrote:
> 
> Eric <eric_pattison@sympaticoREMOVE.ca> wrote in message news:<4148433F.8895C1C9@sympaticoREMOVE.ca>...
> > It is SetEvent that leaves the event signaled not PulseEvent.
> > I knew there is a way to always release a single thread.
> > Just suffered a slight memory brain fart.
> > To rephrase:
> [...]
> 
> As I wrote in the original response to Eric (a parent message of this
> one):
> [
> With an auto-event, one would normally use SetEvent, not PulseEvent.
> This guarantees a sleeping thread will be woken even if it is handed
> an APC because the kernel does a ResetEvent equivalent when a thread
> is woken and not before. But this does not help with pthread condition
> variables because the signaled state cannot be persistent. (Not quite
> sure how to put it. Conceptually condvar's only state is a possibly
> empty list of waiters.)
> ]
> 
> Not only did Eric have a brain fart, he apparently isn't reading other
> people's stuff very carefully.
>
> To expand on the last point in the quoted text, even with
> SignalObjectAndWait, one cannot use an auto-reset event and SetEvent
> to get pthread_cond_signal-like behavior.

My primary concern was to correct the misstatement I had made
(though I'm sure you are gratified that I confirm your analysis :-)

My interest in this topic is not in solving the condvar problem
because I don't use Posix and I don't know enough about its
requirements to comment. So I am not looking from this point of view.

My interest is in understanding the details of WNT's internals,
its design strengths and learning from any weaknesses.
To that extent, any info in this area is of common interest.

Eric

0
Reply Eric 9/15/2004 8:02:21 PM

>> the tarpit that is automagic resource management for parallel
>> workloads). Anyway you slice it : the overhead will limit the
>> granularity, and the granularity pretty much defines what kinds
>> of problems you can tackle.

In article <6d00d.335992$OB3.169224@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
>Yes, agreed.  That is why my thoughts include trying to come up with some
>kind of low overhead message passing system to reduce that overhead.  I
>remember the Elixi (sp?) system was aimed at HPC type applications and had
	Elxsi.
Naw, it wasn't really an HPC machine.  Grab Rich Maine over in c.l.fortran.
Rich had one.  I've got to find the corp. body for the Museum.

>hardware support for interprocess message passing. 

Ah EMBOSS, yet another Unix-like (say what?) operating system....
As opposed to ENIX.

>And I have mentioned the
>transputer, which, I gather had hardware/microcode support for something
>similar.  I think a key is to limit message length to minimize resource
>overcommitment, and handle the common cases without any OS intervention,
>but perhaps have some when the queues got large enough so that you could
>prevent overflows, etc.  But again, I am just musing here.
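
For concreteness, the policy in the quoted paragraph (a fixed maximum
message length, a fast path with no OS involvement, and intervention
only once a queue fills) might be sketched like this. It is a
single-threaded illustration of the policy only; the class
BoundedChannel and the constants MAX_MSG_BYTES and QUEUE_SLOTS are
invented for the example and do not describe the Elxsi, the transputer
or any other real machine.

```python
from collections import deque

MAX_MSG_BYTES = 64  # assumed cap on message length
QUEUE_SLOTS = 4     # assumed queue depth before the slow path takes over


class BoundedChannel:
    """Fixed-size messages in a fixed number of slots (hypothetical sketch)."""

    def __init__(self, slots=QUEUE_SLOTS, max_bytes=MAX_MSG_BYTES):
        self._queue = deque()
        self._slots = slots
        self._max_bytes = max_bytes

    def try_send(self, payload: bytes) -> bool:
        if len(payload) > self._max_bytes:
            raise ValueError("message exceeds the fixed maximum length")
        if len(self._queue) >= self._slots:
            return False  # queue full: the caller falls back to the slow path
        self._queue.append(payload)  # fast path, nothing else involved
        return True

    def try_recv(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None
```

With both limits fixed, the worst-case buffering per channel is known
in advance (slots times maximum length), which is the point of capping
the message length in the first place.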

Well if you want to limit message length reduce the MTU size BUZSIZ()
to as small as possible.  I'm not certain that you can use 1.  It may
have to be at least 2 and watch the system thrash.  Really.
Every adult should have a little bit of experience with B&D/S&M.

Transputer messages were far smaller (you mean in hardware or software
[in Occam?]).


>I also want to make clear that when I am talking "transactions" I am talking
>many thousands to millions of instructions.  I know that does put a lower
>limit on the kinds of granularity you can reasonably have, but it also
>limits the overhead.  Think in terms of a few hundred microseconds to a few
>milliseconds of CPU time per "transaction".

Too long.


>> Personally I think for a lot of typical commerical tasks this stuff
>> will fit nicely (just as RDBs have shown).
>
>And long before RDBs.  Think airline res systems since the 1960s.  It is the
>success with that kind of workload that gives me hope the same ideas can be
>used to help HPC.

The history of message passing in HPC has been dismal.
It goes back to DEIMOS.  It's only done now because of political expedience.
I'd be careful making comparisons to RDBs and systems like SABRE.

The efficiency (and overhead) right now are counted around 1% (at
least single digits, and fortunately Congress isn't miserly enough
to penny pinch on research).

-- 
0
Reply eugene 9/15/2004 9:07:59 PM

Jan Vorbrüggen wrote:
>>> ftp://download.intel.com/pressroom/kits/events/idffall_
>>> 2004/otellini_presentation.pdf#page=38
>>
>>
>> Most interesting.  Unfortunately, that failed to download.
> 
> 
> Worked for me. The line break at the underline is unfortunate, however.

I've tried multiple times and the server says it can't find the file. 
Has it been pulled?

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me
0
Reply Bill 9/15/2004 9:13:51 PM

Zalman Stern wrote:
> Jim Hull <jim.hull@hp.com> wrote in message news:<MPG.1ba8ed33652544ed9896c7@usenet01.boi.hp.com>...
> 
>>This is wrong.  As described by Paul Otellini during his keynote speech at 
>>IDF yesterday, and documented here (watch for URL wrap):
>>
>>ftp://download.intel.com/pressroom/kits/events/idffall_
>>2004/otellini_presentation.pdf#page=38
>>
>>the dual-core, multithreaded, Montecito package actually consumes *less* 
>>power than current Itanium 2 processors.
>>
>> -- Jim Hull
>>    Itanium Processor Architect at HP
> 
> 
> There is also:
>     http://www28.cplan.com/cbi_export/MA_OSAS002_266814_68-1_v2.pdf
> which gives the specific quote:
>     2 cores, 2 threads, 26.5MByte of cache, and 1.72 billion
> transistors at 100W
> (2 threads means "2 threads per core" in case it is not clear. Slide
> elsewhere indicates SMT.)
> 
> (The crypto folks will appreciate y'all adding an extra shifter per
> core too. Its the little extra touches that count :-))

Undoubtedly they will appreciate that the file requires a login to D/L 
it, as well. I'm getting suspicious when server after server can't 
provide data for one reason or another.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me
0
Reply Bill 9/15/2004 9:17:35 PM

dale@edgehp.net wrote:
> In article <chpcn0$d3v$1@pegasus.csx.cam.ac.uk>,
> 	nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
> 
>>In article <r4l412xvf6.ln2@homer.edgehp.invalid>,
>>dale@edgehp.net () writes:
>>|> >
>>|> > Sigh.  You are STILL missing the point.  Spaghetti C++ may be about
>>|> > as bad as it gets, but the SAME applies to the cleanest of Fortran,
>>|> > if it is using the same programming paradigms.  I can't get excited
>>|> > over factors of 5-10 difference in optimisability, when we are
>>|> > talking about improvements over decades.
>>|> >
>>|> Simple...
>>|>
>>|> Let's all dust off our old APL manuals, and then practically ALL of
>>|> our code will be vectorizable/parallel.
>>
>>Hmm.  Do you have a good APL Dirichlet tesselation code handy?
>>
> 
> I have two main memories of APL, both about 2.5 decades old.
> 
> To the APL programmer, every problem looks like a vector/matrix.
> (To the man with a hammer, every problem looks like a nail.)
> 
> You can apply every monadic operator, in the correct sequence, to
> zero, and the result is 42. (HHGTG reference)

So *that's* where the 42 came from. Neat! I always wondered why 42.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me
0
Reply Bill 9/15/2004 9:28:02 PM

In article <ci8u7e$in0$1@pegasus.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
>In article <41477B15.3ED18933@web.de>,
>Alexander Terekhov <terekhov@web.de> writes:
>|> Joe Seigh wrote:
>|> > From the Single Unix Specification ...
>|> It was written many moons ago when full stop membar was the state 
>|> of the art. ...
>
>At least as known to those who were unaware that there was any
>IT development before Unix.

Knowing (remembering or taking the time to learn) history is a mixed blessing.

Considerable progress got made by people who had no concept of "rules."
Those who do "know" get to be an amusing hurdle to those who don't
(sometimes a good talent filter, and likely why there has been little
history of PARC like organizations in the UK [knowing that some in the
UK are trying to change that]).



%A Paul Graham
%T Hackers & Painters
%I O'Reilly & Associates, Inc.
%C Sebastopol, CA 95472
%D 2004
%P 263-264
%K hacker parallelism,
%X Most will be wasted.


4. Good Bad Attitude
        Like Americans, hackers win by breaking the rules.
Hackers & Painters: Big Ideas from the Computer Age
        --Paul Graham

If you can't hire ten Lisp hackers,
then your company is probably based in the wrong city for developing software.
Hackers & Painters: Big Ideas from the Computer Age
        --Paul Graham

-- 
0
Reply eugene 9/15/2004 9:32:00 PM

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:eq%1d.49977$MQ5.16530@attbi_s52...
snip
> Even more troubling is that the memory semantics of the hardware on
> which the programming model is to be implemented never seem even once to
> have been subjected to mathematical scrutiny in a way that verifies that
> anything universally predictable or even comprehensible is being
> implemented.  It would appear that it is left to the programmer to wade
> through the details of the memory semantics of each hardware
> implementation to be used.  If I'm wrong about that, I'm certain I'll be
> informed in the most definite terms.
>
> The semantics of both hardware and software should be reducible to
> formalism in a way that permits checking for correctness, independent of
> the comfort level of some particular poster to comp.arch.  The
> competence to carry through such a program certainly exists.  The will,
> apparently, does not.
>
> RM
>
are you alleging that the folks who design computers don't verify that the
memory system works the way it is supposed to?  Processor ordering or
weak consistency or whatever?   Or are you trying to say something else,
perhaps about formal methods, which isn't obvious from your post?

del cecchi


0
Reply Del 9/15/2004 9:35:34 PM

In article <4147AB03.BDB95A0F@xemaps.com>,
Joe Seigh  <jseigh_01@xemaps.com> wrote:
>What's SGI doing?

I have to go bike past them.  I'll ask.
Might not be able to post what I learn.

They are on the way to Google and past that to my car which is getting
serviced.

-- 
0
Reply eugene 9/15/2004 9:35:48 PM

In article <YUl%c.323752$OB3.13282@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
>> > "SMP considered harmful...
>>
>> SPSI = Everything has to cross user/kernel space boundaries and
>> communication stack for any kind of nontrivial parallelism.

Oh sort of.

>The two seem related.  While you can certainly implement message passing on
>an SMP hardware design, you can't realistically use shared memory semantics
>on things like clusters.  NUMAs are sort of in the middle.
>
>Perhaps I should refine my claim.  How about something like software designs
>that assume shared memory semantics will have to go?  A good piece of
>software would work well and without any source changes on a single or
>multiple (up to some limit) CPUs.

Well if you want to consider history, look at the CMU Cm*.
But it didn't go very far, and it resulted in an amusing OS split
with limited (by today's standards) apps.  The problem today is that
few would want to customize a Kmap.

>> Better ways to do it than classic SMP?  I'm sure there are, but tens of
>> thousands of instances of the Linux kernel aren't the answer, either.
>
>Of course!  But just as it took a while, and a few failed attempts, to come
>up with a successful strategy to get rid of GOTOs (remember "Structured
>Programming"?) , it will probably take the same to come up with a reasonable
>successor for existing parallel programming paradigms.  And that will
>require both hardware and software to make it work well. (In that I agree
>with Nick).  I am thinking it is something with a better interconnect
>architecture than using NICs of some kind with an I/O type interface.

I don't recall what you might term a successful strategy to get rid of gotos.
There was a nice compendium of papers titled Classics in Software Engineering.
But there have been many attempts at parallel programming.  The
functional types are still plugging away hard.  The associative types
are in large part dead and gone (except a small band of cache improvers).
Andy swears by static dataflow.  A few swear by dynamic dataflow coming back.

>I am intrigued by what was done with the transputer in that area and also by
>some of the hypercube designs like NCube.  We need simple primitives to get
>information from one "process" to another, with no code changes no matter
>whether the two processes are on the same CPU or not.  Then the underlying
>physical mechanism can be optimized for particular sizes and technologies.

Yeah, I have to collect a couple of these for the Museum.

You missed a SC Valley local (still visible in the archives, and whom I
see on odd occasions) named Mitch Lobell who used to run a loose group
he called the PPC Parallel Processing Connection.  In the 80s, Mitch
negotiated the use of a single transputer in local Inmos offices.
This was kind of impressive were it not that one had to learn Occam and
the whole transputer environment (this was hard in the shadow of Intel
down the street).  That's three strikes against Mitch at that point.
No one really took him up on it.
He was well meaning (I know a lot of people took strong views of Mitch's
hair oils).  A couple of times Sherryl Tomboulian attended as did others
like Horst, and not as speakers.  I suspect that Mitch was trying to be
a Jobs with the PPC as his Homebrew/Apple; but he was just too weak
technically.

I really need to locate Palmer and some of the others from that era.

If you want no code changes: you'd better seek your kicks elsewhere.
Not likely in the near future.

-- 
0
Reply eugene 9/16/2004 12:24:35 AM

Eugene Miya wrote:

> In article <ci8u7e$in0$1@pegasus.csx.cam.ac.uk>,
> Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
>>In article <41477B15.3ED18933@web.de>,
>>Alexander Terekhov <terekhov@web.de> writes:
>>|> Joe Seigh wrote:
>>|> > From the Single Unix Specification ...
>>|> It was written many moons ago when full stop membar was the state
>>|> of the art. ...
>>
>>At least as known to those who were unaware that there was any
>>IT development before Unix.
> 
> Knowing (remembering or taking the time to learn) history is mixed
> blessing.
> 
> Considerable progress got made by people who had no concept of "rules."
> Those who do "know" get to be an amusing hurtle to those who don't
> (sometimes a good talent filter, and likely why there has been little
> history of PARC like organizations in the UK [knowing that some in the
> UK trying to change that]).

Maybe I misunderstand the role that PARC played, but
isn't it a think-tank ? I think the UK is full of good
ideas, the problem is getting the money together to
develop them. I don't think it's just down to some
kind of transatlantic difference in VCs either.

In the US you have DARPA to help with that. I am not
sure that would work as well in the UK because I think
it would be starved for cash. American defence spend
is so huge that DARPA funding gets lost in the noise,
we don't have that luxury available to us.

ARPANET, Super Computers and ICs were all developed with
a substantial amount of defence money. The genius of the
US system is feeding that stuff back into the civilian
sector so you can bootstrap the next stage. :)


Cheers,
Rupert
0
Reply Rupert 9/16/2004 1:40:57 AM

Del Cecchi wrote:

> "Robert Myers" <rmyers1400@comcast.net> wrote in message
> news:eq%1d.49977$MQ5.16530@attbi_s52...
> snip
> 
>>Even more troubling is that the memory semantics of the hardware on
>>which the programming model is to be implemented never seem even once to
>>have been subjected to mathematical scrutiny in a way that verifies that
>>anything universally predictable or even comprehensible is being
>>implemented.  It would appear that it is left to the programmer to wade
>>through the details of the memory semantics of each hardware
>>implementation to be used.  If I'm wrong about that, I'm certain I'll be
>>informed in the most definite terms.
>>
>>The semantics of both hardware and software should be reducible to
>>formalism in a way that permits checking for correctness, independent of
>>the comfort level of some particular poster to comp.arch.  The
>>competence to carry through such a program certainly exists.  The will,
>>apparently, does not.
>>
>>
> 
> are you alleging that the folks who design computers don't verify that the
> memory system works the way it is supposed to?

No.  The problem is entirely on the software side.

> Processor ordering or
> weak consistency or whatever?   Or are you trying to say something else,
> perhaps about formal methods, which isn't obvious from your post?

My understanding of the situation is that software conforming to the 
Posix standard does not have an unambiguous interpretation independent 
of the actual implementation of the API and of the hardware on which the 
API is implemented.  Software that runs correctly on one architecture 
could, for example, not run correctly on another, and there is, 
apparently, no prescription that can be followed to avoid that possibility.

RM

0
Reply Robert 9/16/2004 3:47:44 AM

Robert Myers <rmyers1400@comcast.net> wrote in message 
> I am reminded of my first graduate year mathematical physics text, which 
> stated, roughly, that it's okay to use a "rough and ready" approach to 
> applied mathematics, as long as you know that someone competent has 
> worked through the details at least once.

Exactly, which is why I point to the DEC SRC work on a formal
specification of mutexes and condition variables.

That POSIX doesn't have a formal spec is somewhat less a sin because
they are relying on the same model that someone else did a formal spec
for. And to my mind, the mapping between the two is close enough that
much relevance carries over. (As one gets into HPC type environments,
some of the assumptions and constraints of this formal specification
break down. But I really do feel that this is an issue for the cutting
edge. I have a hard time worrying much about that when run-of-the-mill
stuff still hasn't caught up to using good practices.)

Compare pthread mutex+condvar to win32 synch. primitives. Win32 is
downright hokey by comparison.

(Note there are other attempts at formal specification of
multiprocessor primitives. Rob Pike did one as part of the Plan9 stuff.
I recall liking the SRC work better, but it's been a while since I read
Pike's paper. I think the Apollo Domain people published a paper on
event counts. I'm sure there are many others.)

> There are serviceable and presentable programming models for concurrent 
> systems that have been subjected to mathematical scrutiny, but, for the 
> most part, they don't seem to be used.

Yep. At the low end, where one is trying to take advantage of 2 to 8
threads of concurrency, we have the technology to get it right. (If
the concurrency exists in the application at all.) But most of the
programming industry either a) doesn't get it or b) thinks you can
hack it out and debug it to being shippable like they do with
everything else. End result: "Concurrency is impossibly hard. We can't
go there." You end up with Intel fighting the tide by providing
engineers to bolt concurrency on to legacy code for app vendors.

Some areas such as database and OS implementation are better. HPC is a
different world of course.

> Even more troubling is that the memory semantics of the hardware on 
> which the programming model is to be implemented never seem even once to 
> have been subjected to mathematical scrutiny in a way that verifies that 
> anything universally predictable or even comprehensible is being 
> implemented.  It would appear that it is left to the programmer to wade 
> through the details of the memory semantics of each hardware 
> implementation to be used.  If I'm wrong about that, I'm certain I'll be 
> informed in the most definite terms.

A lot of formal work has been done on characterizing relaxed memory
models. On the research side the primitives are pretty well defined.
(The notion that the only thing that varies is the order in which
reads and writes are seen, acquire/release, taxonomies of ordering
models, etc.)

The hardware side is also pretty well understood and I believe has
translated into architecture specs and implementations. As a
generality, hardware development uses more formalism than software. It
is significantly more difficult to sell hardware that doesn't work
quite right than to sell software that doesn't work quite right.

The translation of research into practical software technology is
imperfect to say the least. E.g. many people are surprised by double
checked locking not working and the initial version of the JVM spec
made a mess of memory model issues. (I don't know the current state of
the JVM specification.) Even in .NET, memory barriers were added in 1.1.
(Stated that way because when .NET 1.0 came out, the Java experience
had already happened and was public and well known to those in the
area. It is one thing to make a mistake. It is another thing to make
the same mistake your precursor/competitor made just a few years
before. Though in this case it wasn't particularly costly for either
camp so maybe it was simple pragmatics at work.)
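
To make the double-checked locking surprise concrete, here is a minimal sketch of the pattern's shape in Python. The names `Config`, `get_config`, `build_count` and `run_threads` are hypothetical, invented for illustration. In CPython the interpreter lock hides the hazard, so this only shows the structure. Under the original Java memory model, or in C++ without atomics, the unlocked first read could observe a non-null reference whose fields were not yet visible to the reading thread, which is why the pattern needs an acquire-style read and a release-style publish at the marked lines.

```python
import threading

class Config:
    """Stand-in for an object that is expensive to build (hypothetical)."""
    def __init__(self):
        self.settings = {"threads": 4}

_instance = None
_lock = threading.Lock()
_builds = 0  # how many times the constructor actually ran

def get_config():
    """Lazy initialization with a check before and after taking the lock."""
    global _instance, _builds
    # First check, made without the lock. On a relaxed memory model this
    # is the dangerous read and would need acquire semantics.
    if _instance is None:
        with _lock:
            # Second check, under the lock, so only one thread builds.
            if _instance is None:
                obj = Config()   # fully construct first
                _builds += 1
                _instance = obj  # publish last (would be a release store)
    return _instance

def build_count():
    """Return how many times Config was constructed."""
    return _builds

def run_threads(n=8):
    """Call get_config() from n threads and return what each one saw."""
    results = []
    def worker():
        results.append(get_config())
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every thread should get the same object and the constructor should run once; the point of the comments is where a language with a weaker memory model needs explicit ordering.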

> The semantics of both hardware and software should be reducible to 
> formalism in a way that permits checking for correctness, independent of 
> the comfort level of some particular poster to comp.arch.  The 
> competence to carry through such a program certainly exists.  The will, 
> apparently, does not.

Yep. I can understand the lack of will down in the trenches of
software development, but Microsoft and POSIX, etc. should be setting
a better example at the standardization level.

On the whole, commercial programming eschews rigor. Usually people get
away with it, but with concurrency, it is really hard to ship software
unless the design works.

-Z-
0
Reply googlenews 9/16/2004 6:52:03 AM

Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote in message news:<2qr2fbF12f5lhU1@uni-berlin.de>...
> > One of the ways that should work reliably is to apply SetEvent
> > to an auto-event. That is supposed to release a single thread,
> > or if none is waiting, remember the signal and release the next
> > thread that comes along. This should function reliably because
> > pulling an item out of the event queue and requeuing it should
> > not cause a signal to be lost even with APC's being delivered.
> > If there is no thread waiting when the signal occurs then the
> > event state is set and gets detected when the APC completes.
> 
> However, it remains to be verified that the implementation actually
> behaves as described in this particular case.

In a test harness a year ago, trying to see if SignalObjectAndWait was
viable, I was easily able to observe lost event signaling with
PulseEvent while debugging. SetEvent did not exhibit that behavior.

For this limited case of usage, I can't see any reason to use an event
object rather than a semaphore. And it seems reasonable to expect
semaphores to work as God and Dijkstra intended :-)
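
The difference between a signal that is remembered and one that is lost can be mimicked in portable Python. This is an analogy under stated assumptions, not the Win32 API: `threading.Condition.notify()` stands in for a pulse-style signal that only wakes threads already waiting, and `threading.Semaphore` stands in for a counting semaphore. The function names are hypothetical.

```python
import threading

def pulse_then_wait(timeout=0.2):
    """Signal with notify() before anyone waits, then wait.

    notify() only wakes threads that are already waiting, so the signal
    is not remembered and the later wait() runs into its timeout.
    """
    cond = threading.Condition()
    with cond:
        cond.notify()              # nobody is waiting: nothing is stored
    with cond:
        return cond.wait(timeout)  # returns False when the timeout expires

def release_then_acquire(timeout=0.2):
    """Signal a semaphore before anyone waits, then wait.

    release() increments the count, so the signal is remembered and the
    later acquire() succeeds without blocking.
    """
    sem = threading.Semaphore(0)
    sem.release()
    return sem.acquire(timeout=timeout)
```

The semaphore carries state (its count), which is why the ordering of signaller and waiter stops mattering.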

-Z-
0
Reply googlenews 9/16/2004 7:05:24 AM

> Transputer messages were far smaller (you mean in hardware or software
> [in Occam?]).

The software / assembler message length was limited to the length of an INT,
i.e., the native word size of the processor (since there were both 16-bit
and 32-bit processors, and you could run the same code on both in most cases).
The initial hardware did link flow control per byte - 10 bits sent, 2 bits of
ACK returned - while the second-generation hardware did (IIRC) per-link flow
control of 8 bytes and per-connection flow control of 32 bytes.

So there is nothing to stop you doing an OUT of SIZE MAXINT.

	Jan
0
Reply ISO 9/16/2004 8:26:21 AM


Zalman Stern wrote:
> 
> Robert Myers <rmyers1400@comcast.net> wrote in message
> > I am reminded of my first graduate year mathematical physics text, which
> > stated, roughly, that it's okay to use a "rough and ready" approach to
> > applied mathematics, as long as you know that someone competent has
> > worked through the details at least once.
> 
> Exactly, which is why I point to the DEC SRC work on a formal
> specification of mutexes and condition variables.

SRC-20 is a specification of mutexes and condition variables by
meta implementation.  Not really semantics per se and not at all
useful for reasoning about multithreaded programs.  Lamport's stuff
is a little closer although his TLA is maybe too abstract and complex
for normal programmers to use in everyday life.

[...]
> 
> > The semantics of both hardware and software should be reducible to
> > formalism in a way that permits checking for correctness, independent of
> > the comfort level of some particular poster to comp.arch.  The
> > competence to carry through such a program certainly exists.  The will,
> > apparently, does not.
> 
> Yep. I can understand the lack of will down in the trenches of
> software development, but Microsoft and POSIX, etc. should be setting
> a better example at the standardization level.

It's a huge amount of work. It probably helps if you get paid or are doing
it as part of your graduate studies.  Otherwise it is an empty exercise.
At least with programming you can see the program run.

> 
> On the whole, commercial programming eschews rigor. Usually people get
> away with it, but with concurrency, it is really hard to ship software
> unless the design works.

No, they ship non working software all the time.  The bugs are usually
race conditions and the vendor will tell you that they cannot recreate
the bug on their set up.

Joe Seigh
0
Reply Joe 9/16/2004 11:34:05 AM


Zalman Stern wrote:
> 

> 
> For this limited case of usage, I can't see any reason to use an event
> object rather than a semaphore. And it seems reasonable to expect
> semaphores to work as God and Dijkstra intended :-)
> 

Interestingly, semaphores are easy to specify.  A semaphore is an integer
with arithmetically correct increment and decrement operations which are
observably atomic and whose value never goes negative.

To verify this, the semaphore's operations need to return the semaphore's value
or have memory barriers.  You probably want the latter anyway.

Note (and hint):  nothing was said about the duration of some of the semaphore
operations, in particular the decrement.
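
A minimal sketch of that specification in Python, assuming a condition variable as the underlying mechanism. `SpecSemaphore` and `stress` are hypothetical names. Both operations return the resulting value so the invariant can be checked from outside, and the decrement simply does not complete until its result would be non-negative.

```python
import threading

class SpecSemaphore:
    """Integer with atomic increment/decrement that never goes negative."""

    def __init__(self, value=0):
        if value < 0:
            raise ValueError("initial value must be non-negative")
        self._value = value
        self._cond = threading.Condition()

    def increment(self):
        """Add one and return the new value."""
        with self._cond:
            self._value += 1
            self._cond.notify()
            return self._value

    def decrement(self):
        """Subtract one and return the new value.

        Nothing is promised about duration: the call waits for as long
        as the value is zero.
        """
        with self._cond:
            while self._value == 0:
                self._cond.wait()
            self._value -= 1
            return self._value

    def peek(self):
        """Return the current value (for checking only)."""
        with self._cond:
            return self._value

def stress(producers=4, consumers=4, per_thread=500):
    """Run matched increments and decrements; return observed values and final value."""
    sem = SpecSemaphore(0)
    observed = []
    def produce():
        for _ in range(per_thread):
            observed.append(sem.increment())
    def consume():
        for _ in range(per_thread):
            observed.append(sem.decrement())
    threads = [threading.Thread(target=produce) for _ in range(producers)]
    threads += [threading.Thread(target=consume) for _ in range(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed, sem.peek()
```

With equal numbers of increments and decrements, every observed value should be non-negative and the final value should be back at zero.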

Joe Seigh
0
Reply Joe 9/16/2004 11:44:02 AM

>On the whole, commercial programming eschews rigor. Usually people get
>away with it, but with concurrency, it is really hard to ship software
>unless the design works.

Uh-huh. It just so happens that I re-read Nancy Leveson's Therac-25 report
recently. Concurrent real-time program in control of a linear accelerator
for medical purposes, managed to kill four people (IIRC) out of tens of
thousands treated over several years. Two main bugs in the code: one a race
condition on data entry that could result in massive overdoses because the
machine was configured for one mode but run in another; the second a bad
design of a handshake between two processes that could result in a check
being skipped due to integer overflow/wraparound with the same effect.
But it worked correctly almost all of the time...
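
The wraparound failure mode can be sketched with a one-byte flag. This is an illustration under assumptions, not the Therac-25 code: it assumes a "check needed" flag that is incremented on each pass instead of being set to a constant, stored in a fixed-width cell, with zero read as "no check needed". The function names are hypothetical.

```python
def count_skipped_checks(passes, width_bits=8):
    """Count passes on which an incremented flag wraps to zero.

    The flag is meant to be nonzero whenever a check is required.
    Incrementing it in a fixed-width cell lets it overflow back to zero.
    """
    mask = (1 << width_bits) - 1
    flag = 0
    skipped = 0
    for _ in range(passes):
        flag = (flag + 1) & mask  # buggy: increment instead of set
        if flag == 0:             # zero is read as "no check needed"
            skipped += 1
    return skipped

def count_skipped_checks_fixed(passes):
    """Same loop with the flag set to a constant, so it cannot wrap."""
    skipped = 0
    for _ in range(passes):
        flag = 1
        if flag == 0:
            skipped += 1
    return skipped
```

With an 8-bit flag the check is skipped on one pass in every 256, which is exactly the kind of bug that survives testing and "works correctly almost all of the time".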

	Jan

0
Reply ISO 9/16/2004 1:02:17 PM

Zalman Stern wrote:

> Robert Myers <rmyers1400@comcast.net> wrote in message 

<snip>

> 
>>Even more troubling is that the memory semantics of the hardware on 
>>which the programming model is to be implemented never seem even once to 
>>have been subjected to mathematical scrutiny in a way that verifies that 
>>anything universally predictable or even comprehensible is being 
>>implemented.  It would appear that it is left to the programmer to wade 
>>through the details of the memory semantics of each hardware 
>>implementation to be used.  If I'm wrong about that, I'm certain I'll be 
>>informed in the most definite terms.
> 
> 
> A lot of formal work has been done on characterizing relaxed memory
> models. On the research side the primitives are pretty well defined.
> (The notion that the only thing that varies is the order in which
> reads and writes are seen, acquire/release, taxonomies of ordering
> models, etc.)
> 
> The hardware side is also pretty well understood and I believe has
> translated into architecture specs and implementations. As a
> generality, hardware development uses more formalism than software. It
> is significantly more difficult to sell hardware that doesn't work
> quite right than to sell software that doesn't work quite right.
> 

I should be more careful of how I word things when talking about 
subjects around which there is already enough confusion.  In this case, 
I put an unreasonably heavy burden on the word "universally" in the 
phrase "universally predictable or even comprehensible."

I accept that the memory semantics of hardware are rigorously 
specified, unambiguously understood, and implemented with whatever tools 
can be brought to bear by manufacturers (although I will confess that 
the occasional discussion here has left me with the uneasy feeling that 
envelopes and paper napkins may play a role even when millions of copies 
of millions of gates are at stake).

I also accept that there are designers and implementers of software who 
are perfectly capable of understanding all of the implications of the 
hardware specification.  It worries me, though, that nothing about (say) 
the POSIX standard offers an unambiguous recipe for abstracting those 
details away.  Knowing that software is written to the standard is 
apparently inadequate for knowing with certainty exactly what result the 
software will produce.

How serious a problem the apparent imprecision and arbitrariness creates 
in actual practice I wouldn't be competent to judge.  I take your 
position to be that there is enough room for improvement in workaday 
software development and that life would at least be bearable for a 
modest number of threads if developers would consistently apply the 
level of understanding that was the basis of the POSIX specification, 
however imperfectly worked out in detail.  I believe you have correctly 
inferred my concern that, if you intend to scale arbitrarily (as in 
HPC), any kind of imprecision at the foundations is likely eventually to 
lead to chaos (unless you have somehow cleverly designed your system 
specifically to be tolerant of known imprecision).

RM

0
Reply Robert 9/16/2004 1:40:20 PM

In article <2d24c5a8.0409152252.35cf8edd@posting.google.com>,
googlenews@peachfish.com (Zalman Stern) writes:
|> 
|> That POSIX doesn't have a formal spec is somewhat less a sin because
|> they are relying on the same model that someone else did a formal spec
|> for. And to my mind, the mapping between the two is close enough that
|> much relevance carries over. ...

Oh?  Really?  Even if one were to regard the behaviour of the POSIX
primitives as bullet-proof, there is no specification (formal,
informal or hand-waving) as to how they interact with EITHER the C
language's memory model OR other POSIX facilities.

|> A lot of formal work has been done on characterizing relaxed memory
|> models. On the research side the primitives are pretty well defined.
|> (The notion that the only thing that varies is the order in which
|> reads and writes are seen, acquire/release, taxonomies of ordering
|> models, etc.)

That is debatable, but let it pass.

|> The hardware side is also pretty well understood and I believe has
|> translated into architecture specs and implementations. As a
|> generality, hardware development uses more formalism than software. It
|> is significantly more difficult to sell hardware that doesn't work
|> quite right than to sell software that doesn't work quite right.

That is too optimistic.  Most of the systems I have gone into at
all deeply have NOT got the hardware right.  At BEST, they have got
it right enough that defensive, unprivileged parallel code doesn't
hit any undocumented problems.  A stronger statement is probably
false.

|> The translation of research into practical software technology is
|> imperfect to say the least. ...

That is, indeed, true.  Describing it accurately and without being
offensive is itself an unsolved topic.

|> > The semantics of both hardware and software should be reducible to 
|> > formalism in a way that permits checking for correctness, independent of 
|> > the comfort level of some particular poster to comp.arch.  The 
|> > competence to carry through such a program certainly exists.  The will, 
|> > apparently, does not.
|> 
|> Yep. I can understand the lack of will down in the trenches of
|> software development, but Microsoft and POSIX, etc. should be setting
|> a better example at the standardization level.

The traditional standards bodies attempted to.  Unfortunately,
the IEEE's failing POSIX process (and it is regrettably true that
it WAS failing) got taken over by a much sloppier organisation.
Even with the best will and the best experts in the world, there
is no way that the current specification could have been made
that rigorous in a reasonable amount of time.  It is just FAR
too large and complex.


Regards,
Nick Maclaren.
0
Reply nmm1 9/16/2004 2:03:41 PM


Nick Maclaren wrote:
> 
> In article <2d24c5a8.0409152252.35cf8edd@posting.google.com>,
> googlenews@peachfish.com (Zalman Stern) writes:
> |>
> |> That POSIX doesn't have a formal spec is somewhat less a sin because
> |> they are relying on the same model that someone else did a formal spec
> |> for. And to my mind, the mapping between the two is close enough that
> |> much relevance carries over. ...
> 
> Oh?  Really?  Even if one were to regard the behaviour of the POSIX
> primitives as bullet-proof, there is no specification (formal,
> informal or hand-waving) as to how they interact with EITHER the C
> language's memory model OR other POSIX facilities.

And they're having fits over this in c.l.c++.m.  Apparently they've
gotten interested in lock-free programming and realize thread support
in C/C++ is important.  They seem to think that threads have to be
formally defined in order to support threads.  I maintain you don't
have to have that in order to make pretty good guesses about what
language mechanisms are needed to support threads any more than
the lack of a formal Posix definition has prevented hardware architects
from providing the hardware primitives to support Posix thread
implementations.  As to whether you can formally prove that a
particular implementation works, or any implementation can for that matter,
and thus that the hardware architects were right, is an open question.

Note about c.l.c++.m.  They apparently think you need to define threads
in the C++ runtime like Java.  They generally don't understand threads
enough to realize that it can be done as a library API.  IMO decoupling
that from the language standard/spec would be a good thing given how
much doing things thru the standard would slow down development.

Joe Seigh
0
Reply Joe 9/16/2004 2:57:44 PM

"Eugene Miya" <eugene@cse.ucsc.edu> wrote in message 
news:4148dd43$1@darkstar...
> In article <YUl%c.323752$OB3.13282@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:

snip

>>> Better ways to do it than classic SMP?  I'm sure there are, but tens of
>>> thousands of instances of the Linux kernel aren't the answer, either.
>>
>>Of course!  But just as it took a while, and a few failed attempts, to come
>>up with a successful strategy to get rid of GOTOs (remember "Structured
>>Programming"?), it will probably take the same to come up with a reasonable
>>successor for existing parallel programming paradigms.  And that will
>>require both hardware and software to make it work well. (In that I agree
>>with Nick).  I am thinking it is something with a better interconnect
>>architecture than using NICs of some kind with an I/O type interface.
>
> I don't recall what you might term a successful strategy to get rid of gotos.

What I was referring to was the "object oriented" paradigm.  But before 
anyone starts laughing and flaming me, I am defining successful here as 
being popularly used - i.e. successful in the marketing sense.

> There was a nice compendium of papers titled Classics in Software Engineering.
> But there have been many attempts at parallel programming.  The
> functional types are still plugging away hard.  The associative types
> are in large part dead and gone (except a small band of cache improvers).
> Andy swears by static dataflow.  A few swear by dynamic dataflow coming back.

Yup.  And by analogy, there was the "structured programming" phase of the 
goto controversy.

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam 


0
Reply Stephen 9/16/2004 5:55:09 PM

"Eugene Miya" <eugene@cse.ucsc.edu> wrote in message 
news:4148af2f@darkstar...

snip

> The history of message passing in HPC has been dismal.

OK, but please note that I am not referring to "generalized" message passing 
such as MPI.  Think more of an updated transputer network.  I am starting to 
work my way through the CSP papers (the whole original book is on line at 
one of the references Jan gave me).  I want to try to eliminate the mess 
that currently exists with threads, mutexes and unconstrained arbitrary 
message passing such as MPI.  Of course, I don't expect a high probability 
of success.  :-)  But I will then understand more about why my ideas won't 
work.

> It goes back to DEIMOS.  It's only done now because of political expedience.
> I'd be careful making comparisons to RDBs and systems like SABRE.

Well, I didn't bring up RDBs, but the idea in things like SABRE is that the 
individual transactions are each relatively simple to program (no threads, 
etc.) and they communicate by one transaction passing a message to another 
transaction, which then executes its code with the passed message as input.  
That gets coupled with hardware to make that message passing fast and easy.  
If you have any comments on why that is a bad idea, or won't work, or any 
suggestions or pointers to articles about that, etc., I would appreciate 
them.
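
A rough sketch of that transaction-passing style in Python, assuming software queues in place of the fast hardware message path. Everything here is hypothetical (the two stages, the flat price of 10, the `STOP` sentinel); the point is only that each "transaction" is plain sequential code with no locks or shared state of its own.

```python
import queue
import threading

def run_pipeline(inputs):
    """Two single-threaded 'transactions' linked only by message queues."""
    to_price = queue.Queue()
    to_total = queue.Queue()
    result = queue.Queue()
    STOP = object()  # sentinel telling a stage to finish

    def price_transaction():
        # Runs once per incoming message; no locks in user code.
        while True:
            msg = to_price.get()
            if msg is STOP:
                to_total.put(STOP)  # pass the shutdown along
                return
            item, qty = msg
            to_total.put((item, qty * 10))  # hypothetical flat price of 10

    def total_transaction():
        # Accumulates in a local variable owned by this stage alone.
        total = 0
        while True:
            msg = to_total.get()
            if msg is STOP:
                result.put(total)
                return
            total += msg[1]

    workers = [threading.Thread(target=price_transaction),
               threading.Thread(target=total_transaction)]
    for w in workers:
        w.start()
    for m in inputs:
        to_price.put(m)
    to_price.put(STOP)
    for w in workers:
        w.join()
    return result.get()
```

For example, `run_pipeline([("a", 1), ("b", 2)])` sends two messages through both stages and returns the accumulated total. All synchronization lives inside the queues, which is the part the hardware would be expected to make cheap.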

-- 
 - Stephen Fuld
   e-mail address disguised to prevent spam 


0
Reply Stephen 9/16/2004 6:06:15 PM

Jan Vorbrüggen wrote:
>>> ftp://download.intel.com/