
### Divide by constant


```I would like to divide x by 4500 (in mathematical terms: x/4500), but how
can I do this the fastest way?
(Is there another way than using DIV?)

TIA!

//Spike

```

```"Spike" <im_a_user@hotmail.com> wrote in message
news:StqVb.81993\$dP1.214517@newsc.telia.net...
> I would like to divide x by 4500 (in mathematical terms: x/4500), but how
> can I do this the fastest way?
> (Is there another way than using DIV?)

Refer to the K7 Optimization Manual. You can replace a divide-by-constant
with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
by using a % b == a - ((a / b) * b).

-Matt

```

```Matt Taylor wrote:

> "Spike" <im_a_user@hotmail.com> wrote in message
> news:StqVb.81993\$dP1.214517@newsc.telia.net...
>
>>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
>>can I do this the fastest way?
>>(Is there another way than using DIV?)
>
>
> Refer to the K7 Optimization Manual. You can replace a divide-by-constant
> with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
> by using a % b == a - ((a / b) * b).

Matt is right.

Years ago, I rediscovered and proved an algorithm to do this for all
possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).

If you look for Agner Fog's optimization guide (Google!), you'll find
the writeup Agner did of the algorithm.

Terje

PS. The short version is like this:

Calculate 1/4500 with at least 34 bits of precision, then round the
result up to have 33 bits.

If the final bit is zero, you're in luck, because you can then save a
cycle or two.
....
2^44/4500 = 3909374676.53689

So, to divide by 4500 you can use this instead:

; Assume number to be divided in EAX:

mov edx,3909374677
mul edx
shr edx,12

; At this point EDX contains the correctly truncated result of
; dividing EAX by 4500

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen <terje.mathisen@hda.hydro.com> writes:

> Matt Taylor wrote:
>
> > "Spike" <im_a_user@hotmail.com> wrote in message
> > news:StqVb.81993\$dP1.214517@newsc.telia.net...
> >
> >>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
> >>can I do this the fastest way?
> >>(Is there another way than using DIV?)
> >
> > Refer to the K7 Optimization Manual. You can replace a divide-by-constant
> > with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
> > by using a % b == a - ((a / b) * b).
>
> Matt is right.
>
> Years ago, I rediscovered and proved an algorithm to do this for all
> possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
> If you look for Agner Fog's optimization guide (Google!), you'll find
> the writeup Agner did of the algorithm.

There's an in-depth Dr Dobbs on this matter, dating from 1992-1993
I think, which deals with the 16/16->16 case. I don't believe it
covered the theory particularly deeply, but certainly had a very
easy learning curve.

Phil
--
Unpatched IE vulnerability: Extended HTML Form Attack
Description: Cross Site Scripting through non-HTTP ports,
Published: February 6th 2002

```

```Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
>Matt Taylor wrote:
>
>> "Spike" <im_a_user@hotmail.com> wrote in message
>> news:StqVb.81993\$dP1.214517@newsc.telia.net...
>>
>>>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
>>>can I do this the fastest way?
>>>(Is there another way than using DIV?)
>>
>>
>> Refer to the K7 Optimization Manual. You can replace a divide-by-constant
>> with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
>> by using a % b == a - ((a / b) * b).
>
>Matt is right.
>
>Years ago, I rediscovered and proved an algorithm to do this for all
>possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
>If you look for Agner Fog's optimization guide (Google!), you'll find
>the writeup Agner did of the algorithm.
>
>Terje
>
>PS. The short version is like this:
>
>Calculate 1/4500 with at least 34 bits of precision, then round the
>result up to have 33 bits.
>
>If the final bit is zero, you're in luck, because you can then save a
>cycle or two.
>....
>2^44/4500 = 3909374676.53689

Is that 2 to the 44th power.
And does it work with any number value.

Thanks.

>So, to divide by 4500 you can use this instead:
>
>; Assume number to be divided in EAX:
>
>  mov edx,3909374677
>  mul edx
>  shr edx,12
>
>; At this point EDX contains the correctly truncated result of
>; dividing EAX by 4500
>
>--
>- <Terje.Mathisen@hda.hydro.com>
>"almost all programming can be viewed as an exercise in caching"
>

```

```Andrew Kennedy wrote:

> Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
>
>>Matt Taylor wrote:
>>
>>
>>>"Spike" <im_a_user@hotmail.com> wrote in message
>>>news:StqVb.81993\$dP1.214517@newsc.telia.net...
>>>
>>>
>>>>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
>>>>can I do this the fastest way?
>>>>(Is there another way than using DIV?)
>>>
>>>
>>>Refer to the K7 Optimization Manual. You can replace a divide-by-constant
>>>with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
>>>by using a % b == a - ((a / b) * b).
>>
>>Matt is right.
>>
>>Years ago, I rediscovered and proved an algorithm to do this for all
>>possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>>
>>If you look for Agner Fog's optimization guide (Google!), you'll find
>>the writeup Agner did of the algorithm.
>>
>>Terje
>>
>>PS. The short version is like this:
>>
>>Calculate 1/4500 with at least 34 bits of precision, then round the
>>result up to have 33 bits.
>>
>>If the final bit is zero, you're in luck, because you can then save a
>>cycle or two.
>>....
>>2^44/4500 = 3909374676.53689
>
>
> Is that 2 to the 44th power.
> And does it work with any number value.

Yes and maybe: the power needs to be 32 plus the exponent of the largest
power of two below the divisor (2^12 = 4096 in the case of 4500, giving 44).

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```"Andrew Kennedy" <andrewkennedy2@LOGev1.net> wrote:
>>
>>PS. The short version is like this:
>>
>>Calculate 1/4500 with at least 34 bits of precision, then round the
>>result up to have 33 bits.
>>
>>If the final bit is zero, you're in luck, because you can then save a
>>cycle or two.
>>....
>>2^44/4500 = 3909374676.53689
>
>Is that 2 to the 44th power.

Yes, caret (^) is the C exponentiation operator.

>And does it work with any number value.

Any value up to 32 bits.  This might not be obvious from the code:

>>So, to divide by 4500 you can use this instead:
>>
>>; Assume number to be divided in EAX:
>>
>>  mov edx,3909374677
>>  mul edx
>>  shr edx,12
>>
>>; At this point EDX contains the correctly truncated result of
>>; dividing EAX by 4500

The key to understanding this code is to realize that this instruction:
mul edx
multiplies eax by edx and produces a 64-bit result in edx:eax.  That 64-bit
result has your answer times 2 to the 44th power.  We ignore eax, which
throws away the lowest 32 bits, and then shift edx right by 12 more, which
produces the correct result.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

```

```Here is one example:

Yours,

Nuutti

On Mon, 9 Feb 2004 19:01:55 +0000 (UTC), "Spike"
<im_a_user@hotmail.com> wrote:

>I would like to divide x by 4500 (in mathematical terms: x/4500), but how can
>I do this the fastest way?
>(Is there another way than using DIV?)
>
>TIA!
>
>//Spike

```

```How can multiply be faster than divide? In my opinion they should
run at the same speed, since repeated addition would probably take the
same time to complete as repeated subtraction.

How fast is multiply compared to divide?

//Spike

"Terje Mathisen" <terje.mathisen@hda.hydro.com> skrev i meddelandet
news:c08r95\$9vb\$1@osl016lin.hda.hydro.com...
> Matt Taylor wrote:
>
> > "Spike" <im_a_user@hotmail.com> wrote in message
> > news:StqVb.81993\$dP1.214517@newsc.telia.net...
> >
> >>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
> >>can I do this the fastest way?
> >>(Is there another way than using DIV?)
> >
> > Refer to the K7 Optimization Manual. You can replace a divide-by-constant
> > with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
> > by using a % b == a - ((a / b) * b).
>
> Matt is right.
>
> Years ago, I rediscovered and proved an algorithm to do this for all
> possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
> If you look for Agner Fog's optimization guide (Google!), you'll find
> the writeup Agner did of the algorithm.
>
> Terje
>
> PS. The short version is like this:
>
> Calculate 1/4500 with at least 34 bits of precision, then round the
> result up to have 33 bits.
>
> If the final bit is zero, you're in luck, because you can then save a
> cycle or two.
> ...
> 2^44/4500 = 3909374676.53689
>
> So, to divide by 4500 you can use this instead:
>
> ; Assume number to be divided in EAX:
>
>   mov edx,3909374677
>   mul edx
>   shr edx,12
>
> ; At this point EDX contains the correctly truncated result of
> ; dividing EAX by 4500
>
> --
> - <Terje.Mathisen@hda.hydro.com>
> "almost all programming can be viewed as an exercise in caching"
>

```

```Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
>Andrew Kennedy wrote:
>> Is that 2 to the 44th power.
>> And does it work with any number value.
>
>Yes and maybe: the power needs to be 32 + the largest power of two
>that's less than the divisor (4096 in the case of 4500).
>
>
>Terje

I am currently reading it and getting a lot of good optimization
ideas. There is a lot of material, but it's given me an incentive
to go on to using 32 bit regs, alignment, and optimization
based on processor present.

Thanks.

```

```Tim Roberts <timr@probo.com> wrote:
>"Andrew Kennedy" <andrewkennedy2@LOGev1.net> wrote:
>>>
>>>PS. The short version is like this:
>>>
>>>Calculate 1/4500 with at least 34 bits of precision, then round the
>>>result up to have 33 bits.
>>>
>>>If the final bit is zero, you're in luck, because you can then save a

>>>cycle or two.
>>>....
>>>2^44/4500 = 3909374676.53689
>>
>>Is that 2 to the 44th power.
>
>Yes, caret (^) is the C exponentiation operator.
>
>>And does it work with any number value.
>
>Any value up to 32 bits.  This might not be obvious from the code:

Is that the same as any number that will fit into a quadword ?

>>>So, to divide by 4500 you can use this instead:
>>>
>>>; Assume number to be divided in EAX:
>>>
>>>  mov edx,3909374677
>>>  mul edx
>>>  shr edx,12
>>>
>>>; At this point EDX contains the correctly truncated result of
>>>; dividing EAX by 4500
>
>The key to understanding this code is to realize that this instruction:
>    mul edx
>multiplies eax by edx and produces a 64-bit result in edx:eax.  That 64-bit
>result has your answer times 2 to the 44th power.  We ignore eax, which
>throws away the lowest 32 bits, and then shift edx right by 12 more, which
>produces the correct result.

```

```>How can multiply be faster than divide? In my opinion they should
>run at the same speed, since repeated addition would probably take the
>same time to complete as repeated subtraction.

IF repeated add/sub were used, then there would be something to this
argument.  But basic multiply hardware does something like the multi-digit
multiplication taught in grade school (only base-2 instead of base-10),
and similarly basic divide hardware does the equivalent of long division.

Of course, modern processors don't even do the basic forms; there are all
sorts of speed-up tricks that can be, and are, applied.

But for multiply, the sub-parts can be done in parallel and then all added,
while for long division each successive sub-part depends on the one previous,
and while tricks can get around much of that inherent serialization, not all
of it is avoidable.

In the end, hardware-assisted divide takes longer, on average, than
hardware-assisted multiply.  But not in the ratio it used to, nor as much
as comparing long division to multiplication would suggest.

And if you need some extended precision not supported by the hardware, it gets

```

```"Spike" <im_a_user@hotmail.com> wrote in message
news:XQ_Vb.82172\$dP1.216441@newsc.telia.net...
> How can multiply be faster than divide? In my opinion they should
> run at the same speed, since repeated addition would probably take the
> same time to complete as repeated subtraction.
>
> How fast is multiply compared to divide?
<snip>

Check the manuals. Multiply is usually <10 cycles, but it depends on the
width. A 32x32->32 multiply is faster than a 32x32->64 multiply. OTOH,
Opteron can do a 32x32->64 multiply in 3 cycles. Athlon takes 6 cycles.
Pentium-4 takes 15-18 cycles, but that is because it has no integer multiply
unit! Prescott will be much faster in that area. For comparison, the
original Pentium took 10 cycles.

Divide on all these chips takes about 40 cycles.

The difference is that multiplication can be run in parallel. For a*b, you
can set it up as a series of shifts/masks/adds that can be highly
parallelized. For example, 15*5 = 0*(15 << 3) + 1*(15 << 2) + 0*(15 << 1) +
1*(15 << 0). You can obviously compute this in log2(32) time, but it's also
possible to fully expand the expression into boolean functions and compute
it in 1 cycle. It depends on how many gates the chip designer wants to
spend.

Division is not so lucky; it has to be done as a series of sequential
shifts/subtracts. The algorithm I was given uses an iterative
double-precision shift (shld) by 1 place. It performs long division no
differently than how children are taught to do it by hand in grade school.
At the end, the modulo and quotient remain in the two registers used.

-Matt

```

```"Spike" <im_a_user@hotmail.com> wrote in message news:<XQ_Vb.82172\$dP1.216441@newsc.telia.net>...
> How can multiply be faster than divide? In my opinion they should
> run at the same speed, since repeated addition would probably take the
> same time to complete as repeated subtraction.
>
> How fast is multiply compared to divide?

It's much easier to implement fast multiplication than division.  On a
P4, for example, FP multiplication (FMUL) has a latency of 7 clocks,
and FDIV 23-48, depending on precision.  (I)MUL is about 14, (I)DIV
56-70.

The multiplication and division schemes we learned in school both take
O(n**2), but you can do much better.  Even if you stick with the grade
school scheme (which is common in hardware), you can trivially carry
out many steps of a multiplication in parallel and complete the
operation in O(n log n), which you *cannot* do with division, which
remains O(n**2).  In hardware, you can do one "n" in parallel for both
multiplication and division, which reduces the total number of steps
to log(n) for multiplication and n for division.

Further, you can implement multiplication in O(n log n log log n)
with an FFT-based scheme.  The much simpler (to understand and
implement) Karatsuba method gets you about n**1.6.  Usually it's
fastest to divide by multiplying by 1/x, after computing 1/x with an
iterative Newton approximation.  So you can divide in something
like O(log n) times the cost of a multiplication.

```

```On Mon, 9 Feb 2004 20:48:47 +0000 (UTC), Terje Mathisen
<terje.mathisen@hda.hydro.com> wrote:

>Matt Taylor wrote:
>
>> "Spike" <im_a_user@hotmail.com> wrote in message
>> news:StqVb.81993\$dP1.214517@newsc.telia.net...
>>
>>>I would like to divide x by 4500 (in mathematical terms: x/4500), but how
>>>can I do this the fastest way?
>>>(Is there another way than using DIV?)
>>
>>
>> Refer to the K7 Optimization Manual. You can replace a divide-by-constant
>> with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
>> by using a % b == a - ((a / b) * b).
>
>Matt is right.
>
>Years ago, I rediscovered and proved an algorithm to do this for all
>possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
>If you look for Agner Fog's optimization guide (Google!), you'll find
>the writeup Agner did of the algorithm.
>
>Terje
>
>PS. The short version is like this:
>
>Calculate 1/4500 with at least 34 bits of precision, then round the
>result up to have 33 bits.
>
>If the final bit is zero, you're in luck, because you can then save a
>cycle or two.
>...
>2^44/4500 = 3909374676.53689
>
>So, to divide by 4500 you can use this instead:
>
>; Assume number to be divided in EAX:
>
>  mov edx,3909374677
>  mul edx
>  shr edx,12
>
>; At this point EDX contains the correctly truncated result of
>; dividing EAX by 4500
>

Instead of shifting EDX down by 12 bits after the MUL, why not
just shift the original multiplier constant by that amount and
skip the SHR completely?

Bob Masta

D A Q A R T A
Data AcQuisition And Real-Time Analysis
www.daqarta.com

```

```Bob Masta wrote:

> On Mon, 9 Feb 2004 20:48:47 +0000 (UTC), Terje Mathisen
> <terje.mathisen@hda.hydro.com> wrote:
>>So, to divide by 4500 you can use this instead:
>>
>>; Assume number to be divided in EAX:
>>
>> mov edx,3909374677
>> mul edx
>> shr edx,12
>>
>>; At this point EDX contains the correctly truncated result of
>>; dividing EAX by 4500
>>
> Instead of shifting EDX down by 12 bits after the MUL, why not
> just shift the original multiplier constant by that amount and
> skip the SHR completely?

Good question!

Because that loses information, i.e. bits, that are needed to make sure
that the reciprocal multiplication works for _all_ possible inputs.

If you know that your input values are bounded to less than the full
2^32 unsigned range, then you can also get away with a shorter reciprocal.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```"Bob Masta" <NoSpam@daqarta.com> wrote in message
news:402a2441.506170@news.itd.umich.edu...
> On Mon, 9 Feb 2004 20:48:47 +0000 (UTC), Terje Mathisen
> <terje.mathisen@hda.hydro.com> wrote:
>
<snip>
> >So, to divide by 4500 you can use this instead:
> >
> >; Assume number to be divided in EAX:
> >
> >  mov edx,3909374677
> >  mul edx
> >  shr edx,12
> >
> >; At this point EDX contains the correctly truncated result of
> >; dividing EAX by 4500
> >
>
> Instead of shifting EDX down by 12 bits after the MUL, why not
> just shift the original multiplier constant by that amount and
> skip the SHR completely?

Try it with Terje's example and see:

(4500 * 3909374677) / 2^44 = 17592186046500 / 2^44 = 1
4500 * (3909374677 / 2^44) = 4500 * 0 = 0

To make it painfully obvious, try this, too:
(4500 * (3909374677 / 2)) / 2^43 = (4500 * 1954687338) / 2^43 =
8796093021000 / 2^43 = 0

The multiplier is typically odd, so you can't reduce it, or you will clip
precision and get the wrong answer for some numbers. You can normalize the
divisor so that bit 0 is 1, but I don't think it ever reduces further.

-Matt

```

```Matt Taylor wrote:

> Pentium-4 takes 15-18 cycles, but that is because it has no integer
> multiply unit! Prescott will be much faster in that area.

Matt,

Be careful, Prescott *is* a Pentium 4

<g>

```

```Tim Roberts wrote:
> "Andrew Kennedy" <andrewkennedy2@LOGev1.net> wrote:
>>> ....
>>> 2^44/4500 = 3909374676.53689
>>
>> Is that 2 to the 44th power.
>
> Yes, caret (^) is the C exponentiation operator.
>

It is the exponentiation operator in several languages, but not in C.
In C it is the bitwise exclusive-or operator.

Emlyn

```

```Tim Roberts wrote:

> Yes, caret (^) is the C exponentiation operator.

Errr, Tim? There is no exponentiation operator in C.

In C, the circumflex accent (ASCII code 0x5e) is the bitwise exclusive
OR operator.

What you probably meant was:

2^n can be computed as 1 << n in C syntax.

```

```Grumble wrote:

> Matt Taylor wrote:
>
>> Pentium-4 takes 15-18 cycles, but that is because it has no integer
>> multiply unit! Prescott will be much faster in that area.
>
>
> Matt,
>
> Be careful, Prescott *is* a Pentium 4

Only sort of:

50% longer pipeline
New instructions
4 x slower L1 cache
Dynamic SMT cache allocation

Probably much larger reorder/renaming pools.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Emlyn Peter Corrin wrote:

> Tim Roberts wrote:
>
>>"Andrew Kennedy" <andrewkennedy2@LOGev1.net> wrote:
>>
>>>>....
>>>>2^44/4500 = 3909374676.53689
>>>
>>>Is that 2 to the 44th power.
>>
>>Yes, caret (^) is the C exponentiation operator.
>
> It is the exponentiation operator in several languages, but not in C.
> In C it is the bitwise exclusive-or operator.

In pseudo-code used to illustrate an algorithm, 2^44 is indeed shorthand
for the exponentiation operation C doesn't have.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:c0e4ai\$94e\$1@osl016lin.hda.hydro.com...
> Grumble wrote:
>
> > Matt Taylor wrote:
> >
> >> Pentium-4 takes 15-18 cycles, but that is because it has no integer
> >> multiply unit! Prescott will be much faster in that area.
> >
> >
> > Matt,
> >
> > Be careful, Prescott *is* a Pentium 4
>
> Only sort of:
>
> 50% longer pipeline
> New instructions
> 4 x slower L1 cache
> MUL and SHx hw added
> Dynamic SMT cache allocation
>
> Probably much larger reorder/renaming pools.

He was referring to the marketing name. Prescott is being marketed as
Pentium-4E. For some time (a year?) I have expected Prescott to be marketed
as Pentium-5, so my naming system is a little unconventional.

By the way, are you sure the L1 cache is 4x slower? I thought it was only
2x. The old round-trip latency was 2 cycles, and from what I have read the
new round-trip latency is 4 cycles. (Huh? How can it be slower than Athlon's
3-cycle L1 cache?) That still doesn't make any sense to me. The original
purpose of the small 8KB cache was to keep the latency low.

-Matt

```

```Matt Taylor wrote:
> By the way, are you sure the L1 cache is 4x slower? I thought it was only

cycle figure seemed to be the load-to-use latency and not the

If even the P4 is 2 cycles, then 4 isn't _that_ much worse, but it
doesn't seem like a good choice.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Grumble <invalid@kma.eu.org> wrote in message news:<402A0072.1030700@kma.eu.org>...
> Matt Taylor wrote:
>
> > Pentium-4 takes 15-18 cycles, but that is because it has no integer
> > multiply unit! Prescott will be much faster in that area.
>
> Matt,
>
> Be careful, Prescott *is* a Pentium 4
>
> <g>

Well, Intel's marketing department may call it a P4, but it sure
doesn't look much like a Northwood or Willie.  And the more we see of
it the stranger it becomes.  In any event, they did appear to add a
proper integer multiplier to Prescott.

```

```"Emlyn Peter Corrin" <Emlyn.Corrin@cern.ch> wrote:
>Tim Roberts wrote:
>> "Andrew Kennedy" <andrewkennedy2@LOGev1.net> wrote:
>>>> ....
>>>> 2^44/4500 = 3909374676.53689
>>>
>>> Is that 2 to the 44th power.
>>
>> Yes, caret (^) is the C exponentiation operator.
>
>It is the exponentiation operator in several languages, but not in C.
>In C it is the bitwise exclusive-or operator.

Doh!  I knew that.  Thanks for pointing it out.

I suppose I expose my programming roots by saying I write exponentiation in
pseudo-code as 2**43.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

```

```"Matt Taylor" <para@tampabay.rr.com> wrote:

>"Bob Masta" <NoSpam@daqarta.com> wrote in message
>>
>> Instead of shifting EDX down by 12 bits after the MUL, why not
>> just shift the original multiplier constant by that amount and
>> skip the SHR completely?
>
>Try it with Terje's example and see:
>
>(4500 * 3909374677) / 2^44 = 17592186046500 / 2^44 = 1
>4500 * (3909374677 / 2^44) = 4500 * 0 = 0

Actually, his suggestion was just to eliminate the final shift by 12 -- the
result is still wrong:

4500 * (3909374677 / 2^12) = 4500 * 954437 = 4294966500, and
4294966500 / 2^32 = 0
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

```

```[ Message cross-posted to comp.arch ]

Matt Taylor wrote:

> Terje Mathisen wrote:
>
>> Grumble wrote:
>>
>>> Matt Taylor wrote:
>>>
>>>> Pentium-4 takes 15-18 cycles, but that is because it has no integer
>>>> multiply unit! Prescott will be much faster in that area.
>>>
>>> Matt,
>>>
>>> Be careful, Prescott *is* a Pentium 4
>>
>> Only sort of:
>>
>> 50% longer pipeline
>> New instructions
>> 4 x slower L1 cache
>> MUL and SHx hw added
>> Dynamic SMT cache allocation
>>
>> Probably much larger reorder/renaming pools.
>
> He was referring to the marketing name. Prescott is being
> marketed as Pentium-4E.

Indeed. I wrote tongue-in-cheek :-)

> [...] and from what I have read the new round-trip latency is 4 cycles.
> (Huh? How can it be slower than Athlon's 3-cycle L1 cache?) That still
> doesn't make any sense to me. The original purpose of the small 8KB
> cache was to keep the latency low.

Let's see...

K7 L1 data cache:
64 KB
2-way set associative
3-cycle latency @ 2.2 GHz = 1.36 ns

Northwood L1 data cache:
8 KB
4-way set associative
2-cycle latency @ 3.2 GHz = 0.625 ns

Prescott L1 data cache:
16 KB
8-way set associative
4-cycle latency @ 5 (??) GHz = 0.8 ns

In Prescott, size and associativity have been doubled, compared to
Northwood. As far as I understand, both factors make the cache slower
to access. If Intel intends to scale Prescott to 5 GHz, they had to
increase the cache's cycle-latency, and they would still have to use
faster SRAM than AMD (0.8 ns vs 1.36 ns).

How much influence did these changes have on cache latency:

cache size doubled
associativity doubled
process change from 130 nm to 90 nm

```

```Terje Mathisen wrote:

> 50% longer pipeline
> New instructions
> 4 x slower L1 cache
> MUL and SHx hw added
> Dynamic SMT cache allocation
>
> Probably much larger reorder/renaming pools.

Do you believe there are more than 128 physical registers in Prescott?

```

```"Grumble" <invalid@kma.eu.org> wrote in message
news:402B5015.2010501@kma.eu.org...
> Terje Mathisen wrote:
>
> > 50% longer pipeline
> > New instructions
> > 4 x slower L1 cache
> > MUL and SHx hw added
> > Dynamic SMT cache allocation
> >
> > Probably much larger reorder/renaming pools.
>
> Do you believe there are more than 128 physical registers in Prescott?

There are 256 now.

-Matt

```

```Grumble wrote:

> Terje Mathisen wrote:
>
>> 50% longer pipeline
>> New instructions
>> 4 x slower L1 cache
>> MUL and SHx hw added
>> Dynamic SMT cache allocation
>>
>> Probably much larger reorder/renaming pools.
>
>
> Do you believe there are more than 128 physical registers in Prescott?
>

If 128 is the P4 number, then Yes, I'd expect this to be increased for a
cpu with 50% longer pipeline.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```> Northwood L1 data cache:
> 8 KB
> 4-way set associative
> 2-cycle latency @ 3.2 GHz = 0.625 ns
>
> Prescott L1 data cache:
> 16 KB
> 8-way set associative
> 4-cycle latency @ 5 (??) GHz = 0.8 ns

Writing it out in ns is definitely the right way to think about
latency.  You need to specify whether you are subtracting a cycle
for issuing the load (this used to be common when we had scalar
processors, superscalar just makes this practice confusing).  In
any case, I suspect these numbers are not completely right, even
aside from the fact that you're quoting 5 GHz (10 GHz fireball),
which has not been announced or shipped (but probably is what they
were targeting).

I would like to see some definitive numbers here.  When I earlier
stated that Prescott had a 4-cycle latency, I was going off an
article on the web that did not distinguish between the main clock
(currently 3.2 GHz) and the fireball clock (currently 6.4 GHz).
That article showed Northwood with a latency of 1 cycle, and
Prescott with 4.

I don't have access to a P4, Northwood or Prescott.  Can someone
who does run a latency test (like McVoy's lmbench) and post the

```

```Matt Taylor wrote:

> Grumble wrote:
>
>> Do you believe there are more than 128 physical registers in Prescott?
>
> There are 256 now.

Where did you read that? :-)

```

```"Iain McClatchie" <iain-3@truecircuits.com> wrote in message
> > Northwood L1 data cache:
> > 8 KB
> > 4-way set associative
> > 2-cycle latency @ 3.2 GHz = 0.625 ns
> >
> > Prescott L1 data cache:
> > 16 KB
> > 8-way set associative
> > 4-cycle latency @ 5 (??) GHz = 0.8 ns
>
> Writing it out in ns is definitely the right way to think about
> latency.  You need to specify whether you are subtracting a cycle
> for issuing the load (this used to be common when we had scalar
> processors, superscalar just makes this practice confusing).  In
> any case, I suspect these numbers are not completely right, even
> aside from the fact that you're quoting 5 GHz (10 GHz fireball),
> which has not been announced or shipped (but probably is what they
> were targetting).

The only parts running 10 GHz are the double-clocked ALUs. The L/S hardware
and caches are only running 5 GHz. Intel has not shipped a 5 GHz part, but
that is their intention. Prescott will hit 4 GHz before the end of this
year. I have seen that claimed both in print as well as on an unofficial
dual-core Prescott.

> I would like to see some definitive numbers here.  When I earlier
> stated that Prescott had a 4-cycle latency, I was going off an
> article on the web that did not distinguish between the main clock
> (currently 3.2 GHz) and the fireball clock (currently 6.4 GHz).
> That article showed Northwood with a latency of 1 cycle, and
> Prescott with 4.

Yes, I saw those figures as well. Northwood has a 2-cycle access time, not
1-cycle. Anandtech lists both figures
(http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go with the
1-cycle figure rather than doing their homework. 2-cycles is what was quoted
on RWT.
(http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)

> I don't have access to a P4, Northwood or Prescott.  Can someone
> who does run a latency test (like McVoy's lmbench) and post the

I have access to a Northwood, but it's stuck running Windows, and I don't
see any evidence that lmbench will compile. ScienceMark reports 2 cycles
latency for the L1 cache.

Intel is shipping Prescott, but I haven't seen it for sale anywhere.

-Matt

```

```Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
Matt> hardware and caches are only running 5 GHz.

Are you absolutely sure about that?  Is it true of both Northwood and
Prescott?  The implication is that the current 3.2 GHz machines do just
3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
4 billion (maybe 3.5 billion).  I had previously been under the impression
that the P4 had more L1 bandwidth than the K7.

Matt> Yes, I saw those figures as well. Northwood has a 2-cycle access time,
Matt> not 1-cycle. Anandtech lists both figures
Matt> (http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go
Matt> with the 1-cycle figure rather than doing their homework. 2-cycles is
Matt> what was quoted on RWT.
Matt> (http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)

That makes a lot more sense.  And you are speaking of 5 GHz (future) cycles,
so Prescott is just twice the latency of Northwood, rather than 4x which
seemed so ridiculous.

Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
time it takes to do 1 load (and issue 4 independent).  You can really see
why those ALUs are getting no traction.

It also means the current 3.2GHz Prescott has a 1.25ns primary cache
latency, versus 1.5ns in the K7 at 2GHz.  By the end of the year, if they
make their schedule, Intel will be back at 1ns for their primary again, but
will not approach the 625ps they are shipping now for the next two years!

Northwood's REE is amazing.

```
 0
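The latency arithmetic in the post above is easy to check; a quick sketch using the cycle counts and clock speeds as quoted in the thread (4-cycle L1 on Prescott, 3 on K7, 2 on Northwood):

```python
# Wall-clock L1 latency = cycles / frequency, with the thread's figures.

def latency_ns(cycles, ghz):
    """Latency in nanoseconds for a given cycle count and clock in GHz."""
    return cycles / ghz

prescott  = latency_ns(4, 3.2)   # 4-cycle L1 at 3.2 GHz -> 1.25 ns
k7        = latency_ns(3, 2.0)   # 3-cycle L1 at 2.0 GHz -> 1.5 ns
northwood = latency_ns(2, 3.2)   # 2-cycle L1 at 3.2 GHz -> 0.625 ns (625 ps)

print(prescott, k7, northwood)

# "Back at 1 ns" means Prescott's 4-cycle cache running at 4 GHz:
print(latency_ns(4, 4.0))
```

This confirms the 1.25 ns / 1.5 ns / 625 ps figures in the post.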

```"Iain McClatchie" <iain-3@truecircuits.com> wrote in message
> Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
> Matt> hardware and caches are only running 5 GHz.
>
> Are you absolutely sure about that?  Is it true of both Northwood and
> Prescott?  The implication is that the current 3.2 GHz machines do just
> 3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
> 4 billion (maybe 3.5 billion).  I had previously been under the impression
> that the P4 had more L1 bandwidth than the K7.

Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.
K7 can issue any combination of loads or stores up to two per cycle, both 8
bytes. This information for the Pentium-4 is given in the optimization
manual. In my copy (248966-009), it is page 1-22. I don't see a copy updated
for Prescott yet, but I presume the L/S hardware has not changed aside from
buffer increases.

> Matt> Yes, I saw those figures as well. Northwood has a 2-cycle access time,
> Matt> not 1-cycle. Anandtech lists both figures
> Matt> (http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go
> Matt> with the 1-cycle figure rather than doing their homework. 2-cycles is
> Matt> what was quoted on RWT.
> Matt> (http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)
>
> That makes a lot more sense.  And you are speaking of 5 GHz (future) cycles,
> so Prescott is just twice the latency of Northwood, rather than 4x which
> seemed so ridiculous.

Yep.

> Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
> time it takes to do 1 load (and issue 4 independent).  You can really see
> why those ALUs are getting no traction.

This is exactly why I was shocked to hear it. Perhaps they will fix it with
the next iteration of chips much as they seem to have done with the
shifter/multiplier in Prescott. (Yay!)

I had heard that Intel initially planned to increase to a 16K u-op cache
with additional execution bandwidth (now slated for Potomac/Nocona??), but
that apparently got cut. Perhaps the larger, slower dcache was part of that
package which was supposed to benefit Hyperthreading.

I have heard rumors of 64-bit extensions in Prescott to compete with
Opteron. If that is true and x86-64 takes off, then the cache latency will
be mitigated.

> It also means the current 3.2GHz Prescott has a 1.25ns primary cache
> latency, versus 1.5ns in the K7 at 2GHz.  By the end of the year, if they
> make their schedule, Intel will be back at 1ns for their primary again, but
> will not approach the 625ps they are shipping now for the next two years!
>
> Northwood's REE is amazing.

I'm not familiar with that acronym; what's the term?

-Matt

```
 0
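The bandwidth figures quoted above imply peak L1 numbers worth writing out; a sketch using the thread's widths and clocks (peak rates only, real code will see less):

```python
# Peak L1 figures implied by the thread's numbers.
# P4 (Northwood): one 16-byte load + one 16-byte store per cycle at 3.2 GHz.
# K7: any two loads/stores per cycle, each up to 8 bytes, at 2.0 GHz.

p4_loads_per_sec = 3.2e9 * 1      # single load port -> 3.2 G loads/s
k7_loads_per_sec = 2.0e9 * 2      # two ports, all-load mix -> 4.0 G loads/s

p4_load_bw = 3.2e9 * 16           # 51.2 GB/s peak load bandwidth
k7_load_bw = 2.0e9 * 2 * 8        # 32.0 GB/s peak load bandwidth

print(p4_loads_per_sec, k7_loads_per_sec)
print(p4_load_bw / 1e9, k7_load_bw / 1e9)
```

So both posters are right in a sense: the K7 wins on load *issue rate*, while the P4's wider 16-byte accesses give it more peak *bytes* per second.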

```"Grumble" <invalid@kma.eu.org> wrote in message
news:c0i4hm\$ch1\$1@news-rocq.inria.fr...
> Matt Taylor wrote:
>
> > Grumble wrote:
> >
> >> Do you believe there are more than 128 physical registers in Prescott?
> >
> > There are 256 now.
>
> Where did you read that? :-)

http://www.chip-architect.com/news/2003_03_23_Prescott_doubles_inflight_uOps.html

-Matt

```
 0

```Matt> Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.

Meaning???
- Two agen units, one for loads, one for stores, can both calculate
addresses at the same time.  Clearly the store will hold off writing
data into the 16KB SRAM until graduation.  Seems reasonable, but it
means that the primary cache bandwidth isn't necessarily better than K8
- The 16KB SRAM can do a 16B read and a 16B write in the same cycle.
Seems doubtful.
- The 16KB SRAM is banked/replicated, and the load only goes to one
bank/replicate, so that the store data queue eventually writes to
both, but on different cycles.  I personally like this idea, but I
doubt anyone's done it yet.

I'm frankly amazed they can do a read in the 300ps after a write --
the RC on the bit lines in precharge must be less than 50ps, including
the driver == ~200um of wire + zero-resistance driver.  Hmm.. L1 cache
looks like it's 95um x 155um in the chip-architect.com picture, so if
that size is right I suppose 100um bitlines are possible.  Wire pitch
would have to be < 160nm, though, so something is wrong with these numbers.

Matt> I have heard rumors of 64-bit extensions in Prescott to compete with
Matt> Opteron. If that is true and x86-64 takes off, then the cache latency
Matt> will be mitigated.

Huh?  Do you mean the L1 Dcache latency won't matter so much because
with 16 registers there will be a lot fewer memory ops?

It's been too long since I've played with an architectural simulator, but
I seem to remember that the "Alien" folks at SGI circa 1996 were
contemplating an OoO machine with a load latency of 3 cycles, and two
ld/st and at least two ALUs per cycle.  The MIPS architecture has 32 regs,
and they seemed to think that 3 cycle latency was okay, but 4 cycles led
to more execution stalls than it saved in better hit rate.

So, I don't think the extra 8 registers will mitigate the extra dcache
latency very much.

Iain> Northwood's REE is amazing.

Matt> I'm not familiar with that acronym; what's the term?

Intel marketing, sorry.  "Rapid Execution Engine", a.k.a. Fireball --
the thing running at 6.4 GHz right now.

```
 0

```Matt Taylor wrote:

> Grumble wrote:
>
>> Matt Taylor wrote:
>>
>>> Grumble wrote:
>>>
>>>> Do you believe there are more than 128 physical registers
>>>> in Prescott?
>>>
>>> There are 256 now.
>>
>> Where did you read that? :-)
>
> http://www.chip-architect.com/news/2003_03_23_Prescott_doubles_inflight_uOps.html

Hans de Vries wrote that article almost a year ago. His information
didn't come from Intel, he took a guess based on several pictures of
a Prescott die.

Regards,

Nudge

```
 0

```In article <45022fc8.0402131534.281f25da@posting.google.com>,
iain-3@truecircuits.com (Iain McClatchie) wrote:

> Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
> time it takes to do 1 load (and issue 4 independent).  You can really see
> why those ALUs are getting no traction.

Can they actually do that? I thought there was a limit of 3 microops per
cycle.

```
 0

```"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<hffXb.112034\$fH2.24419@twister.tampabay.rr.com>...
> "Iain McClatchie" <iain-3@truecircuits.com> wrote in message
> > Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
> > Matt> hardware and caches are only running 5 GHz.
> >
> > Are you absolutely sure about that?  Is it true of both Northwood and
> > Prescott?  The implication is that the current 3.2 GHz machines do just
> > 3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
> > 4 billion (maybe 3.5 billion).  I had previously been under the impression
> > that the P4 had more L1 bandwidth than the K7.
>
> Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.
> K7 can issue any combination of loads or stores up to two per cycle, both 8
> bytes. This information for the Pentium-4 is given in the optimization
> manual. In my copy (248966-009), it is page 1-22. I don't see a copy updated
> for Prescott yet, but I presume the L/S hardware has not changed aside from
> buffer increases.

Not quite: K7 can issue 3 memory address computations from the schedulers
every cycle; however, only 2 can be processed in the data cache per cycle,
with excess memory references being held in the LS1 until such time as they
can obtain a data cache port.

> -Matt

Mitch

```
 0

```"Christian Bau" <christian.bau@cbau.freeserve.co.uk> wrote in message
news:christian.bau-162908.14283716022004@slb-newsm1.svr.pol.co.uk...
>  iain-3@truecircuits.com (Iain McClatchie) wrote:
>
> > Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
> > time it takes to do 1 load (and issue 4 independent).  You can really see
> > why those ALUs are getting no traction.
>
> Can they actually do that? I thought there was a limit of 3 microops per
> cycle.

His numbers do not take into account the TC bandwidth limitation. Prescott
can issue 12 u-ops in 4 cycles, the latency of an L1 dcache access. If you
look at the raw execution bandwidth, both double-speed ALUs can jointly
execute 16 u-ops in that amount of time. Because they are double-speed, they
can execute 8 back-to-back *dependent* ALU ops. The point is that the
schedulers will not be able to keep the execution units full because the
latency of the L1 cache is too high.

-Matt

```
 0
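The issue-versus-execution arithmetic in the post above can be laid out explicitly; a sketch using the figures quoted (3 u-ops/cycle from the trace cache, two double-speed ALUs, 4-cycle L1 access):

```python
# Prescott, over the 4 cycles of one L1 dcache access (thread's figures):
cycles = 4

tc_issue = 3 * cycles        # trace cache issues 3 u-ops/cycle -> 12
alu_exec = 2 * 2 * cycles    # 2 double-speed ALUs, 2 ops/cycle each -> 16

# A double-speed ALU has 0.5-cycle effective latency, so even a fully
# *dependent* chain can retire 2 ops per cycle:
dependent_chain = 2 * cycles  # 8 back-to-back dependent ALU ops

print(tc_issue, alu_exec, dependent_chain)
```

Execution bandwidth (16) exceeds issue bandwidth (12), which is the point being made: the schedulers cannot keep the double-speed ALUs full.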

```"Iain McClatchie" <iain-3@truecircuits.com> wrote in message
> Matt> Pentium-4 can issue one load and one store per cycle, both up to
> Matt> 16 bytes.
>
> Meaning???
>   - Two agen units, one for loads, one for stores, can both calculate
>     addresses at the same time.  Clearly the store will hold off writing
>     data into the 16KB SRAM until graduation.  Seems reasonable, but it
>     means that the primary cache bandwidth isn't necessarily better than
>     K8 unless you use SSE loads/stores.

That came out of the PDF describing Williamette and Northwood. I checked
today, and the PDF has been updated for Prescott. It still claims one load
and one store per cycle. Some older non-official, non-Intel literature
claims that Prescott now runs the L/S and caches at double-speed, but the
Intel docs don't mention that.

>   - The 16KB SRAM can do a 16B read and a 16B write in the same cycle.
>     Seems doubtful.
<snip>

I spent several days looking for an answer on this point, and I could not
find any information at all. However, point (d) in figure 2-2 on page 2-32
says that a 16-byte STLF must be 16-byte aligned. In AMD's implementation, I
don't think that matters. A movdqa simply issues a pair of 8-byte loads.
Perhaps Intel does the same and the manual is wrong; I don't know.

> Matt> I have heard rumors of 64-bit extensions in Prescott to compete with
> Matt> Opteron. If that is true and x86-64 takes off, then the cache latency
> Matt> will be mitigated.
>
> Huh?  Do you mean the L1 Dcache latency won't matter so much because
> with 16 registers there will be a lot fewer memory ops?

Yes, although that is an overstatement of what I think. On a processor with
7 effective registers, 4-cycles to the L1 cache is an atrocity. Even with
hand-tuned assembly, there is no way to completely hide that latency most of
the time. I find myself frequently lacking only 1 or 2 registers for live
variables. On Prescott, I will pay dearly for that. On "Prescott-64" I more
frequently will be able to keep all live variables in registers and avoid
that penalty. That also gives me more freedom to schedule loads earlier
since there will be less contention for the L/S.

> It's been too long since I've played with an architectural simulator, but
> I seem to remember that the "Alien" folks at SGI circa 1996 were
> contemplating an OoO machine with a load latency of 3 cycles, and two
> ld/st and at least two ALUs per cycle.  The MIPS architecture has 32 regs,
> and they seemed to think that 3 cycle latency was okay, but 4 cycles led
> to more execution stalls than it saved in better hit rate.
>
> So, I don't think the extra 8 registers will mitigate the extra dcache
> latency very much.

I have no doubt that Intel would have been better off with the old 8KB
cache, but I also have no doubt that a significant number of memory accesses
will be eliminated in x86-64. For starters, both Linux and Windows will be
using register calling conventions. I was told that the beta of Win64 on
Opteron is 10% faster in 32-bit graphics/games than its pure 32-bit
counterpart. It would seem to be related to the virtues of the x86-64
architecture more than the Opteron itself. If Opteron gains >10% with
3-cycle cache latency, "Prescott-64" will get even more.

I do hope they fix the cache in a future iteration. They fixed the shifter
in Prescott; the multiplier is still slow.

> Iain> Northwood's REE is amazing.
>
> Matt> I'm not familiar with that acronym; what's the term?
>
> Intel marketing, sorry.  "Rapid Execution Engine", a.k.a. Fireball --
> the thing running at 6.4 GHz right now.

Ah, right. Specifically the double-clocked path with the ALUs, registers,
and schedulers. There is some funny business going on there. On Prescott all
ALU ops now have a latency of 1 clk. Logical ops have a throughput of 0.5
clk while arithmetic ops have a throughput of 1 clk. It would seem the REE
isn't quite so amazing now. :-(

-Matt

```
 0

```In article <y_lYb.219047\$fH2.80917@twister.tampabay.rr.com>,
Matt Taylor <para@tampabay.rr.com> wrote:

> In AMD's implementation, I
> don't think that matters. A movdqa simply issues a pair of 8-byte loads.
> Perhaps Intel does the same and the manual is wrong; I don't know.

In AMD's implementation, the load has to be 128-bit aligned.

-- greg

```
 0

```"Greg Lindahl" <lindahl@pbm.com> wrote in message
news:40326fb9\$1@news.meer.net...
> In article <y_lYb.219047\$fH2.80917@twister.tampabay.rr.com>,
> Matt Taylor <para@tampabay.rr.com> wrote:
>
> > In AMD's implementation, I
> > don't think that matters. A movdqa simply issues a pair of 8-byte loads.
> > Perhaps Intel does the same and the manual is wrong; I don't know.
>
> In AMD's implementation, the load has to be 128-bit aligned.

I assume that refers to STLF with a 16-byte store on a 16-byte boundary and then trying to load from the high half. I
ran a little test on my 1.4 GHz Palomino, and it can STLF just fine even
when the subsequent load is not 16-byte aligned.

I repeated the following sequences of code 12 times each and measured the
number of cycles each took to execute:

; Test 1: 16-byte aligned store, 16-byte aligned load
movaps [esi], xmm0
movq mm0, [esi]

; Test 2: 16-byte aligned store, 8-byte aligned load
movaps [esi], xmm0
movq mm0, [esi+8]

; Test 3: 8-byte aligned store, 8-byte aligned load
movq [esi], mm0
mov eax, [esi]

; Test 4: 8-byte aligned store, 4-byte aligned load
movq [esi], mm0
mov eax, [esi+4]

The last 2 tests were mainly to show that my timing framework can detect
STLF stalls. Here are the results:

Test 1: 53 cycles
Test 2: 57 cycles
Test 3: 27 cycles
Test 4: 127 cycles

This is pretty much what I expected. On K7 there is no 16-byte alignment
restriction on a load to STLF an SSE store.

I ran the same code on a Northwood as well:

Test 1: 44 cycles
Test 2: 468 cycles
Test 3: 44 cycles
Test 4: 60 cycles

The code isn't scheduled for a Northwood, but I can't think of a good way to
do that without hiding the STLF stalls. It seems MMX and SSE loads/stores
all take 6 cycles to complete. This code also demonstrates my original claim
that Northwood can load and store 16-bytes in the cache each cycle. I
presume Prescott can as well, but again I can't find any docs that say
anything explicitly about the L/S width.

-Matt

```
 0
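The cycle counts above are totals for 12 repetitions of each store/load pair; dividing out gives approximate per-iteration costs and makes the STLF penalty visible. A quick sketch (this ignores loop and measurement overhead, so the per-pair numbers are rough):

```python
# Per-iteration cost from the totals quoted above (12 reps of each pair).
reps = 12

k7        = {1: 53, 2: 57, 3: 27, 4: 127}   # Palomino (K7) totals, cycles
northwood = {1: 44, 2: 468, 3: 44, 4: 60}   # Northwood totals, cycles

for name, totals in (("K7", k7), ("Northwood", northwood)):
    for test, cycles in totals.items():
        print(f"{name} test {test}: ~{cycles / reps:.1f} cycles/iteration")
```

On K7, test 4 (~10.6 cycles/iteration) versus test 3 (~2.2) shows the classic STLF-failure penalty, while tests 1 and 2 (~4.4 vs ~4.8) show essentially no 16-byte-alignment restriction. On Northwood, test 2 (~39 cycles/iteration) is the huge stall described in the post.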

```I got back from ISSCC today.  I'll try to write up a little of what I learned
that's relevant to comp.arch later.  But:

After the Prescott presentation (3.4), I asked about the primary cache latency.
Specifically, I pointed out that with the numbers presented (5.5 GHz in two
years), they would be shipping a 700 ps L0 Dcache (that's what they call it)
two years from now, and are currently shipping 625ps.  The speaker (C. Webb?)
told me it didn't matter.

I talked with D. Deleganes separately, who claims to have led the effort on
the Prescott FCLK core.  Currently, MCLK = 3.2 GHz, FCLK = 6.4 GHz.  Mr.
Deleganes gave an enlightening presentation on the circuits used in the FCLK
core -- I assume they're patented out the wazoo.  Mr. Deleganes confirmed
that the L0 Dcache latencies don't matter.  A friend of his, who had worked
on Willamette and Northwood and Centrino, but who no longer works for Intel,
suggested without arousing counterargument that Mr Deleganes had been assigned
to make those circuits cycle at 8-12 GHz, latency be damned.  Mr Deleganes
pointed out several times that the new circuits are much less finicky than
the Willamette and Northwood circuits.  He seemed remarkably unfamiliar with
the L0 cache latencies, commenting several times that the latency depended
on where you started counting from.  (I said, you pick a spot, walk around a
loop, get back to where you started, and tell me the latency.)  I think the
trouble may have come from the fact that the AGEN runs in the FCLK domain,
but can only kick out a load every other cycle.

A (the?) Prescott cache designer was also at this conversation.  I asked if
there was some sort of mid-project change that explained the large cache
latency and modest size increase.  He pointed out that cache design is a fairly
mature field, you wouldn't expect surprises.  So, basically, I was being
stonewalled, but it does sound as if they deliberately chose GHz over latency.
I get the feeling the guys I was talking to weren't very familiar with the
architectural implications.

The Prescott cache does one load and one store every MCLK.  It wasn't
clear if this is done off FCLK, or just alternate phases of MCLK, but that
doesn't much matter from a code execution point of view.  Mr. Deleganes did
say that loads were aligned, stores were unaligned, but perhaps he meant the
reverse (now I'm not so sure I can remember which way he said was which).

The following is what I've assembled from several conversations:

Willamette and Northwood used self-resetting Domino logic.  The Northwood
core was shrunk from Willamette; apparently this caused a great deal of pain.
The issue is that self-resetting domino gates generate pulses of small width.
When these pulses arrive at another self-resetting domino gate, the output
pulse is generally shorter than the input pulses, and gets shorter still if
the input pulses do not arrive simultaneously.  It took a lot of work to tune
these circuits to work.  Apparently probing the silicon to help debug runtime
failures is NOT considered feasible -- silicon debug is limited to screwing
with the clock to try to find the failing path, and paths that don't work at
any speed (and give zero debug information) are likely.

From the presentation:
Prescott uses low swing logic.  They have true and complement inverters,
driving true and complement nmos-pass-gate structures (up to 6 xtrs in series),
going into a sense amp which drives a domino gate.  That appears to be the
entire FCLK cycle.  The inputs to the sense amps typically swing tens of
millivolts.  All interior nodes to the pass gate structure are reset low, and
a cutoff transistor after the initial inverter prevents it from sneaking any
charge in there.  The pass gate nmos gates are driven full-rail.  The load
alignment mux is apparently one big merged nmos pass gate structure with over
5000 transistors.  This sounds amazing, but I've got designs from 1997 with
almost exactly the same thing, and it's not a big deal... shifters just end up
like that.

The big advantage of this logic is that it develops more signal over time, so
a given chip will usually run at some speed, rather than just failing outright
like self-resetting domino.  This should help them port to multiple voltages,
deliberate process variations (mobile parts, etc).

In paper 24.6, a few guys from Intel showed how they were able to build a
100 GHz VCO (voltage controlled oscillator, the thing in the PLL that goes
tick-tock) on their 90 nm process.  This corresponds to a free-space wavelength
of 300 microns -- in copper through silicon dioxide I'm sure it's more like
40-50 microns.  It's just a curiosity, but... yow.

Considering the size of the FCLK region, that means that Prescott is running
at ten percent or more of, literally, the speed of light.  Mark Horowitz gave
a talk saying that we've pretty convincingly smacked into the wall on speed
and power, and how life will be interesting anyway.  My prediction for next
year's keynote:

Photons Won't Help

```
 0

```JRS:  In article <45022fc8.0402182323.7c99aaa7@posting.google.com>, seen
in news:comp.lang.asm.x86, Iain McClatchie <iain-3@truecircuits.com>
posted at Thu, 19 Feb 2004 09:26:20 :-
>
>In paper 24.6, a few guys from Intel showed how they were able to build a
>100 GHz VCO (voltage controlled oscillator, the thing in the PLL that goes
>tick-tock) on their 90 nm process.  This corresponds to a free-space wavelength
>of 300 microns --

300 um * 100 GHz = 3E7 m/s.   But c = 3E8 m/s.   ???

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk / ??.Stockton@physics.org ©
Web  <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Correct <= 4-line sig. separator as above, a line precisely "-- " (SoRFC1036)
Do not Mail News to me.    Before a reply, quote with ">" or "> " (SoRFC1036)

```
 0

```

On Thu, 19 Feb 2004, Iain McClatchie wrote:

> I got back from ISSCC today.  I'll try to write up a little of what I learned
> that's relevant to comp.arch later.  But:

Thanks Iain,
I look forward to your later post!
Peter

```
 0

```Iain McClatchie wrote:

> I got back from ISSCC today.  I'll try to write up a little of what I learned
> that's relevant to comp.arch later.  But:
>
> After the Prescott presentation (3.4), I asked about the primary cache latency.
> Specifically, I pointed out that with the numbers presented (5.5 GHz in two
> years), they would be shipping a 700 ps L0 Dcache (that's what they call it)
> two years from now, and are currently shipping 625ps.  The speaker (C. Webb?)
> told me it didn't matter.
>
> I talked with D. Deleganes separately, who claims to have led the effort on
> the Prescott FCLK core.  Currently, MCLK = 3.2 GHz, FCLK = 6.4 GHz.  Mr.
> Deleganes gave an enlightening presentation on the circuits used in the FCLK
> core -- I assume they're patented out the wazoo.  Mr. Deleganes confirmed
> that the L0 Dcache latencies don't matter.  A friend of his, who had worked
> on Willamette and Northwood and Centrino, but who no longer works for Intel,
> suggested without arousing counterargument that Mr Deleganes had been assigned
> to make those circuits cycle at 8-12 GHz, latency be damned.  Mr Deleganes
> pointed out several times that the new circuits are much less finicky than
> the Willamette and Northwood circuits.  He seemed remarkably unfamiliar with
> the L0 cache latencies, commenting several times that the latency depended
> on where you started counting from.  (I said, you pick a spot, walk around a
> loop, get back to where you started, and tell me the latency.)  I think the
> trouble may have come from the fact that the AGEN runs in the FCLK domain,
> but can only kick out a load every other cycle.

OK, getting more headroom is a valid design criterion, but history has
shown many times that you'd better not leave too much performance on the
floor, even for the first couple of revisions.
>
> A (the?) Prescott cache designer was also at this conversation.  I asked if
> there was some sort of mid-project change that explained the large cache
> latency and modest size increase.  He pointed out that cache design is a fairly
> mature field, you wouldn't expect surprises.  So, basically, I was being
> stonewalled, but it does sound as if they deliberately chose GHz over latency.
> I get the feeling the guys I was talking to weren't very familiar with the
> architectural implications.
....

Interesting stuff, thanks for the report!

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
 0

```John>         300 um * 100 GHz = 3E7 m/s.   But c = 3E8 m/s.   ???

Yep, off by 1 decimal place.  Sorry about that.  I posted a correction
yesterday; google seems to have dumped it.  This is the second time that's
happened to me -- have other people noticed Google dropping posts?

```
 0
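For the record, the corrected arithmetic is easy to run through; a quick sketch using round numbers (c ≈ 3e8 m/s, as in the posts above):

```python
# Free-space wavelength of a 100 GHz oscillator: lambda = c / f.
c = 3.0e8            # speed of light in m/s (approximate)
f = 100e9            # 100 GHz

wavelength = c / f   # 0.003 m = 3 mm = 3000 microns, not 300
print(wavelength)

# John's dimensional check: 300 um * 100 GHz falls short of c by 10x.
assert abs(300e-6 * f - 3.0e7) < 1.0
```

So the original "300 microns" figure was indeed off by one decimal place; the in-dielectric wavelength would then be on the order of 400-500 microns rather than 40-50.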

49 Replies
304 Views

Similar Articles

12/7/2013 2:50:59 PM
page loaded in 7379 ms. (0)

Similar Artilces:

Constant
Hello, how can I know the value of a constant, when I have the name of the constant as a literal string. For example: define ID_CONSTANT := 100 cVar := "ID_CONSTANT" Using the makro operator doesn't work. Stefan, You cannot. Constants are resolved at compile time, there is no way to find out the name at runtime. hth Erik "Stefan" <info@sanosoft.com> schreef in bericht news:e1106ecb-098d-4986-a7f9-df2d380d2435@j5g2000yqm.googlegroups.com... > Hello, > > how can I know the value of a constant, when I have the name of the > constant as a literal string. For example: > > define ID_CONSTANT := 100 > > cVar := "ID_CONSTANT" > > Using the makro operator doesn't work. Dear Stefan: On Jan 18, 7:57=A0am, Stefan <i...@sanosoft.com> wrote: > Hello, > > how can I know the value of a constant, when I > have the name of the constant as a literal > string. For example: > > define ID_CONSTANT :=3D 100 > > cVar :=3D "ID_CONSTANT" > > Using the makro operator doesn't work. You could add this line: xID_CONSTANT

A constant with if-else-if
How do I embed an if-else-if within a constant definition? This doesn't compile in Modelsim: constant Y : integer := A when A>B else B; If I use the math_real library, I can do this: constant Y : integer := integer(realmax(real(A),real(B))); but I'd still like a general solution. Must I write a constant function? -Kevin "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message news:g0fsiu\$80g6@cnn.xsj.xilinx.com... > How do I embed an if-else-if within a constant definition? This doesn't > compile in Modelsim: > > constant Y : integer := A when A>B else B; > > If I use the math_real library, I can do this: > > constant Y : integer := integer(realmax(real(A),real(B))); > > but I'd still like a general solution. Must I write a constant function? > Yes, you have to write a function. Kevin Jennings On May 14, 6:28 pm, Kevin Neilson <kevin_neil...@removethiscomcast.net> wrote: > How do I embed an if-else-if within a constant definition? This doesn't > compile in Modelsim: > > constant Y : integer := A when A>B else B; > > If I use

Is it constant folding at work
Have a look at the following code: #include<iostream> using namespace std; int main() { const int a = 1; const int* const aptr1 = &a; int* const aptr2 = const_cast<int*>aptr1); *aptr2 = 2; cout<<a<<endl; cout<<*(&a)<<endl; cout<<*aptr1<<endl; cout<<*aptr2<<endl; } Output is (for g++ (ver. 3.2) and visual C++ 6 compiler) 1 1 2 2 Is it the correct behaviour? Is it due to constant folding? If yes, how is constant folding possible when we are referencing the variable? Anshul Sawant wrote in news:bd9cdcf.0402230532.562e3292@posting.google.com: > Have a look at the following code: > > #include<iostream> > > using namespace std; > > int main() > { > const int a = 1; > const int* const aptr1 = &a; This cast if fine and legal (ok you missed a '('), but ... > int* const aptr2 = const_cast<int*>aptr1); The use of the result const_cast here is UB (uderfined Behaviour), you should only use const_cast when you *know* the original thing pointed to wasn't a constant. > *aptr2 = 2; All bets are now Officialy (according

uninitialized constant ActionController (NameError)
Hello, My ruby cannot find a class from the actionpack gem: require 'rubygems' require 'actionpack' ActionController::Session::AbstractStore => uninitialized constant ActionController (NameError) Directly adding the code works, though: require 'actionpack' load "/var/lib/gems/1.8/gems/actionpack-2.3.4/lib/action_controller/session/abstract_store.rb" ActionController::Session::AbstractStore => no error. This problem occurs when I try to require dm-core, which needs rails_datamapper, which uses the AbstractStore Class. Thanks for your... if it loads what you want, and if it doesn't, you'll need to load what you want yourself. >> require 'rubygems' => true >> require 'actionpack' => true >> ActionController::Session::AbstractStore NameError: uninitialized constant ActionController from (irb):3 >> require 'action_controller' => true >> ActionController::Session::AbstractStore => ActionController::Session::AbstractStore >> [This is with rails-2.3.2 under Ubuntu Hardy with ruby-1.8.6p111 compiled from source] And you can see why easily

YAML and constant objects
is to look up an already-made constant. Is there a good solution to making the yaml-ing of constants efficient? Thanks! --Brian Buckley foo =3D Foo:FOO1 data =3D foo.to_yaml # the yaml string here is unnecessarily long foo =3D YAML::load(data) # and now after retrieval we have an additional foo object when we really only need the one original constant - Brian Buckley wrote: > Hello all, > > I have an immutable class Foo (once a foo object is created its state > is fixed). Foo objects have a bunch of attributes so their yaml > strings are fairly long. > > There is a small set of standard, frequently used foo objects and I > have defined these as constants right in the class (newed and set > each attribute for Foo::FOO1, Foo::FOO2, ... etc.). > > When I YAML one of these standard foos to the database and back, it > seems inefficient to create and send a big long yaml string describing > each attribute and then on retrieval to create a whole new Foo object > when all I really want is to look up an already-made constant. > > Is there a good solution to making the yaml-ing of constants efficient? > > Thanks! > >

constant, scale, entropy
://en.wikipedia.org/wiki/Entropy_(information_theory) 4) finally the idea that there should "exist" constants at scales that would have Riemann type attractors {0,1,Infinity} and zero being the perfect second and third law of thermodynamics crystal. 5) manufacture of a unitary ( binary) entropy probability based constant. Alexander Povolotsky gave me the original idea: > What do you think of Ray Solomonoff's approach ? which gave the idea of "Algorithmic Probability". All I did was figure out a way to compute probability in terms of scaled BBP/ Infinite sums. Looking at the idea that the fundamental constants involve the metamathamatics of the physical universe made me think that they might have information in an information theory sense, that is entropy. Mathematica: constant from entropy one for scale=5/3; Clear[p, n, s] \$MaxExtraPrecision = 200 s = 2*Sum[1/(5/3)^n, {n, 0, Infinity}] Solve[-x*Log[x]/Log[5/3] - 2/(3*(5/3)^n) == 0, x] p[n_] := -(2/3)*(3/5)^n*Log[5/3]/(ProductLog[-(2/3)*(3/5)^n*Log[5/3]]) a = Sum[p[n + 1]/(5/3)^(n + 1), {n, 0, Infinity}] N[a, 100] 1.286440035403290152912874911399834673243151932466`32.460965191217305

Scalar integer constant expression
) must be a constant Looking at the standard (N1830.pdf), the second argument to the 'int' intrinsic function should be a 'scalar integer constant expression'. Is an array element the same as a scalar? It is my understanding that it is, and this is reinforced by the fact that in the declaration of i3 i2(1) is accepted as kind-selector, for which the standard requires a 'scalar-int-constant-expr'. Erik. Erik Toussaint <user@example.net.invalid> wrote: > Is the following program standard conforming? Well, not until f2008 because of the * in > integer, parameter :: i2(*) = [i1] I think f2008 might allow that; nothing before does. Since I don't yet have any f2003 compilers, I don't yet get too excited about f2008. [most of the rest of the code elided] .... > i3 = int(0, i2(1)) ! This line gives an error when compiling. .... > Error: 'kind' argument of 'int' intrinsic at (1) must be a constant > > > Looking at the standard (N1830.pdf), the second argument to the 'int' > intrinsic function should be a 'scalar integer constant expression'. > Is an array element

Binary constant TMP...
Hi all, I was looking around for a nice snippet to express binary constants in C++ and could only find a C approach that was entirely macro based (and wouldn't prevent numbers other than 0 and 1 being used in the constant). So after a bit of playing around I've put together a little TMP solution that might be of some use to others. #include <iostream> using namespace std; template< int N > struct binary_digit { }; template< > struct binary_digit< 0 > { enum { value = 0 }; }; template< > struct binary_digit< 1 > { enum { value = 1... binary constants in > C++ and could only find a C approach that was entirely macro based (and > wouldn't prevent numbers other than 0 and 1 being used in the > constant). So after a bit of playing around I've put together a little > TMP solution that might be of some use to others. You might look at the first code example in "C++ Template Metaprogramming" ;-) -- Dave Abrahams Boost Consulting www.boost-consulting.com [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] "Ian

Removing constant multiples from polynomials
I am working on an algorithm which manipulates sets of polynomials. Often it produces (after an application of FullSimplify) polynomials like: 1276(x-1)(y-2)/397 I'm only interested in the solutions to the polynomial, so to me this is "equivalent" to: (x-1)(y-2) As a short-term solution I call "GroebnerBasis" on the single polynomial, which seems to perform this simplification. However is there a simpler method? Thank you, Chris Let st = (Random[Integer, {0, 10}]/Random[Integer, {10, 20}])*(x - 1)*(y - 2) (9/14)*(-1 + x)*(-2 + y) Fir

Using constant variables in header file
Hi, I have one constant variable and want to use it in two files. I put it in the header file and then include the header file. The compiler always say "error C2370: 'arraysize' : redefinition; different storage class". What shall I do? Files are listed below. I'm using Visual C++ 6.0 By the way, how to use STL (vector , for example) to create a multi-dimension array? Many thanks. Karen --------------------------------- my header file - consts.h: const int arraysize = 15; my sources file 1 - op1.cpp: #include consts.h char array1[arraysize]; my source file 2 - op2.cpp: #include consts.h char array2[arraysize]; "Karen" <Karen12000@yahoo.com> wrote... > I have one constant variable and want to use it in two files. I put it in > the header file and then include the header file. The compiler always say > "error C2370: 'arraysize' : redefinition; different storage class". What > shall I do? Files are listed below. I'm using Visual C++ 6.0 > > By the way, how to use STL (vector , for example) to create a > multi-dimension array? > > Many thanks. > > Karen >