Here is my example code.
mov edx, 020h
shl edx, 08H
mov ecx, 01h
or edx, ecx
On a Pentium III this takes only 4 cycles.
mov edx, 020h
lea edx, [edx*4]
lea edx, [edx*4]
lea edx, [edx*4]
lea edx, [edx*4]
mov ecx, 01h
or edx, ecx
This version works on both the Pentium III and the Pentium 4, but it takes
7 cycles (the normal 4 cycles + 3 extra cycles).
Each "*4" shifts left by 2 bits, so the four of them together shift by
4 * 2 bits = 8 bits. Please let me know what you think of my example. Do you
think the 3 extra cycles would be okay?
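A quick way to check the arithmetic is a small C model of the two sequences. This is only a sketch of the register values, not of the timing, and the function names are made up for illustration. Each lea edx, [edx*4] multiplies by 4, which is a left shift by 2 bits, so four of them shift by 8 bits in total.

```c
#include <assert.h>
#include <stdint.h>

/* Models: shl edx, 08h / or edx, ecx */
uint32_t with_shl(uint32_t edx, uint32_t ecx)
{
    edx <<= 8;
    edx |= ecx;
    return edx;
}

/* Models: four times "lea edx, [edx*4]", then or edx, ecx */
uint32_t with_lea(uint32_t edx, uint32_t ecx)
{
    for (int i = 0; i < 4; i++)
        edx *= 4u;              /* one lea: multiply by 4 = shift left by 2 */
    edx |= ecx;
    return edx;
}

/* Both sequences turn edx = 020h, ecx = 01h into 2001h. */
void check_shift_by_8(void)
{
    assert(with_shl(0x20u, 0x01u) == 0x2001u);
    assert(with_lea(0x20u, 0x01u) == 0x2001u);
}
```

Since both functions compute the same value for every input, the only open question in the thread is speed and code size.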
--
Bryan Parkoff
Posted by Bryan, 1/31/2004 4:06:35 AM
Bryan Parkoff wrote:
I have no idea why you would want to replace the shifts.
SHL/SHR are at least as fast as the ADD/SUB/LEA substitutions.
> Here is my example code.
>
> mov edx, 020h
> shl edx, 08H
> mov ecx, 01h
> or edx, ecx
I'd have used instead:
mov edx,00002001h
or if the starting edx is variable:
mov edx, ????
shl edx,08
inc edx ;or 'OR edx,+01 (,+n)'
> It works on Pentium III that it only takes 4 cycles.
Three cycles if 'INC' or 'OR +n'
> mov edx, 020h
> lea edx, [edx*4]
> lea edx, [edx*4]
> lea edx, [edx*4]
> lea edx, [edx*4]
> mov ecx, 01h
> or edx, ecx
> Do you think that extra 3 cycles would be okay?
Not at all. Why would you use that,
as a delay or a code-size enlarger (24 bytes)?
You can use
IMUL edx,edx,00000100h
as well, that's also slower than SHL.
Even the shorter (14 instead of 24 bytes)
mov edx, ????
lea edx, [edx*8] ;stall+SIB+TEMP
lea edx, [edx*8] ;stall+SIB+TEMP
or edx, +01h ;stall
;=four full cycles
isn't faster than (11 bytes)
mov edx, ????
shl edx,8 ;stall+TEMP
or edx,+01h ;stall
;=three full cycles
due to the dependency- and SIB-pipe stalls.
PSLL.. may be faster.
__
wolfgang
Posted by wolfgang, 1/31/2004 7:43:23 PM
Bryan Parkoff wrote:
> mov edx, 020h
> shl edx, 08H
mov dh, 0x20 is functionally equivalent.
Posted by Nudge, 1/31/2004 7:43:59 PM
"wolfgang kern" <nowhere@nevernet.at> wrote in message
news:bvgodv$djn$3@newsreader1.utanet.at...
>
> Bryan Parkoff wrote:
>
> I think I have no clue why you like to replace the shifts.
> At least SHL/SHR are as fast as ADD/SUB/LEA substitutions.
Not on a Pentium-4. The shifter runs in 4-6 cycles.
> > Here is my example code.
> >
> > mov edx, 020h
> > shl edx, 08H
> > mov ecx, 01h
> > or edx, ecx
>
> I'd have used instead:
> mov edx,00002001h
>
> or if the starting edx is variable:
> mov edx, ????
> shl edx,08
> inc edx ;or 'OR edx,+01 (,+n)'
Pentium-4 prefers add edx, 1 over inc edx because it sets all the flags.
> > It works on Pentium III that it only takes 4 cycles.
> Three cycles if 'INC' or 'OR +n'
>
> > mov edx, 020h
> > lea edx, [edx*4]
> > lea edx, [edx*4]
> > lea edx, [edx*4]
> > lea edx, [edx*4]
> > mov ecx, 01h
> > or edx, ecx
>
> > Do you think that extra 3 cycles would be okay?
>
> Not at all, why should you use that,
> as delay or code-size enlarger (24 bytes) ?
> You can use
> IMUL edx,edx,00000100h
> as well, that's also slower than SHL.
imul is 14-18 cycles on a Pentium-4. :-(
> Even the shorter (14 instead of 24 bytes)
> mov edx, ????
> lea edx, [edx*8] ;stall+SIB+TEMP
> lea edx, [edx*8] ;stall+SIB+TEMP
> or edx, +01h ;stall
> ;=four full cycles
> isn't faster than (11 bytes)
> mov edx, ????
> shl edx,8 ;stall+TEMP
> or edx,+01h ;stall
> ;=three full cycles
> due to the dependency- and SIB-pipe stalls.
Unfortunately Intel also chopped off the AGU circuit which made previous
generations of chips quite fast. The lea is going to generate a shift
internally.
-Matt
Posted by Matt, 1/31/2004 8:59:27 PM
Nudge <devnull@kma.eu.org> writes:
> Bryan Parkoff wrote:
>
> > mov edx, 020h
> > shl edx, 08H
>
> mov dh, 0x20 is functionally equivalent.
No it's not.
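To spell out why, here is a minimal C model of the values involved (a sketch only: the function names are illustrative, and partial-register stalls are not modelled). mov dh, imm8 writes only bits 8..15 of edx, while mov edx / shl edx writes all 32 bits, so the two agree only if the rest of edx already happens to be zero.

```c
#include <assert.h>
#include <stdint.h>

/* Models "mov dh, imm8": only bits 8..15 of edx change. */
uint32_t mov_dh(uint32_t edx, uint8_t imm8)
{
    return (edx & 0xFFFF00FFu) | ((uint32_t)imm8 << 8);
}

/* Models "mov edx, imm32" followed by "shl edx, 8": all 32 bits are written. */
uint32_t mov_then_shl8(uint32_t imm32)
{
    return imm32 << 8;
}

void check_mov_dh(void)
{
    assert(mov_then_shl8(0x20u) == 0x2000u);
    assert(mov_dh(0x00000000u, 0x20) == 0x2000u);      /* edx was cleared first */
    assert(mov_dh(0x12345678u, 0x20) == 0x12342078u);  /* stale bits survive */
}
```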
Phil
--
Unpatched IE vulnerability: mhtml wecerr CAB flip
Description: Delivery and installation of an executable
Reference: http://msgs.securepoint.com/cgi-bin/get/bugtraq0305/48.html
Posted by Phil, 2/1/2004 12:53:31 AM
On 2004-01-31, Nudge <devnull@kma.eu.org> wrote:
> Bryan Parkoff wrote:
>
>> mov edx, 020h
>> shl edx, 08H
>
> mov dh, 0x20 is functionally equivalent.
Will that not cause a partial register stall? I don't know what the
trade off is actually. Maybe
xor edx, edx
mov dh, 0x20
would be better?
Julian
Posted by Julian, 2/1/2004 12:53:48 AM
"Julian Brown" <brown@cs.bris.ac.uk> wrote in message
news:slrnc1oiku.j6j.brown@panic.cs.bris.ac.uk...
> On 2004-01-31, Nudge <devnull@kma.eu.org> wrote:
> > Bryan Parkoff wrote:
> >
> >> mov edx, 020h
> >> shl edx, 08H
> >
> > mov dh, 0x20 is functionally equivalent.
>
> Will that not cause a partial register stall? I don't know what the
> trade off is actually. Maybe
>
> xor edx, edx
> mov dh, 0x20
>
> would be better?
mov edx, 0x20 is best. It's one byte longer and 1 cycle faster, and on a
Pentium-4 (which, though not stated, is the context here) the actual code
size doesn't matter. The number of u-ops is important because of the trace
cache, so ops that decode into fewer u-ops would be preferred.
-Matt
Posted by Matt, 2/1/2004 2:39:01 AM
Hi Matt,
> > I think I have no clue why you like to replace the shifts.
> > At least SHL/SHR are as fast as ADD/SUB/LEA substitutions.
> Not on a Pentium-4. The shifter runs in 4-6 cycles.
Even on byte-aligned 8/16-bit shifts?
AMD: SHL reg32,imm8 (1 tick, regardless of shift-count)
> Pentium-4 prefers add edx, 1 over inc edx because it sets all the flags.
Destroying the carry-flag is perhaps not always wanted :)
Ok, INC, DEC and LOOP use the internal auxiliary zero-flag,
which needs to be copied.
> imul is 14-18 cycles on a Pentium-4. :-(
AMD: IMUL reg32,imm8 (5 ticks)
> Unfortunately Intel also chopped off the AGU circuit which made previous
> generations of chips quite fast.
> The lea is going to generate a shift internally.
The same slow shifter in LEA?
I see, and I'm quite happy that my P4 1.2 GHz is used in the office
and that I kept my 'slower' AMD K7 500 MHz as programming console.
I knew that the BTxx instructions are quite a bit faster on AMD than on Intel,
but I wouldn't have expected such a large disadvantage with shifts.
Why buy a P4 if you can have a better CPU for half the money?
Seems I need to update my Intel latency lists.
I've got all the P4 manuals; where are the detailed latency lists?
The optimization guide only lists things which should be avoided.
__
wolfgang
Posted by wolfgang, 2/1/2004 3:04:21 AM
You mean
MOV EDX, 0x2000
:)
Jeremy
"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:DoZSb.126030$873.2634434@twister.tampabay.rr.com...
> "Julian Brown" <brown@cs.bris.ac.uk> wrote in message
> news:slrnc1oiku.j6j.brown@panic.cs.bris.ac.uk...
> > On 2004-01-31, Nudge <devnull@kma.eu.org> wrote:
> > > Bryan Parkoff wrote:
> > >
> > >> mov edx, 020h
> > >> shl edx, 08H
> > >
> > > mov dh, 0x20 is functionally equivalent.
> >
> > Will that not cause a partial register stall? I don't know what the
> > trade off is actually. Maybe
> >
> > xor edx, edx
> > mov dh, 0x20
> >
> > would be better?
>
> mov edx, 0x20 is best. It's one byte longer and 1 cycle faster, and on a
> Pentium-4 (which, though not stated, is the context here) the actual code
> size doesn't matter. The number of u-ops is important because of the trace
> cache, so ops that decode into fewer u-ops would be preferred.
>
> -Matt
>
>
Posted by Dragon, 2/1/2004 3:51:09 AM
Wrong...
Please read my post, "Partial Register Stall & Store Forwarding Blocked
Info"
It gives you the answer. You must use a shift to move 20H from AL to AH
(SHL EAX,8H) before using OR to include 01H, which gives 2001H in AX.
Without SHL, it will cause a stall if you place 20H in AH and 01H in AL. XOR
EAX,EAX will not prevent the stall, so AH and AL can't be combined. That
is why you must always use EAX with SHL and OR.
You can't use mov byte ptr [A], al and mov byte ptr [A+1], otherwise
"store forwarding blocked" WILL OCCUR!! You must always use mov dword ptr
[A] instead.
Please reread my post so it gives you the clear answer. Thanks....
--
Bryan Parkoff
"Nudge" <devnull@kma.eu.org> wrote in message
news:401bfac9$0$11348$636a55ce@news.free.fr...
> Bryan Parkoff wrote:
>
> > mov edx, 020h
> > shl edx, 08H
>
> mov dh, 0x20 is functionally equivalent.
>
>
Posted by Bryan, 2/1/2004 5:08:25 AM
"wolfgang kern" <nowhere@nevernet.at> wrote in message
news:bvhot7$v3d$1@newsreader1.utanet.at...
>
> Hi Matt,
>
> > > I think I have no clue why you like to replace the shifts.
> > > At least SHL/SHR are as fast as ADD/SUB/LEA substitutions.
>
> > Not on a Pentium-4. The shifter runs in 4-6 cycles.
>
> Even on byte-aligned 8/16-bit shifts?
> AMD: SHL reg32,imm8 (1 tick, regardless of shift-count)
AFAIK, yes. The manual claims 4-6 cycles for any shift. I have not
personally measured it.
> > Pentium-4 prefers add edx, 1 over inc edx because it sets all the flags.
>
> Destroying the carry-flag is perhaps not always wanted :)
> Ok, INC, DEC and LOOP use the internal auxiliary zero-flag,
> which needs to be copied.
Fast code probably won't preserve the carry flag, though. Even K7 has
trouble with partial flags reads/writes in some cases. I do not recall now
which they were, but doing a rotate and then cmovcc stalled close to 30
cycles for a few of the combinations.
> > imul is 14-18 cycles on a Pentium-4. :-(
>
> AMD: IMUL reg32,imm8 (5 ticks)
>
> > Unfortunately Intel also chopped off the AGU circuit which made previous
> > generations of chips quite fast.
> > The lea is going to generate a shift internally.
>
> The same slow shifter in LEA?
Yep. The P4 has no AGU, so lea eax, [edx*4] translates internally to sll
eax, edx, 2. Doing lea eax, [eax+edx] translates internally to add eax, eax,
edx.
> I see, and I'm quite happy for my P4 1.2 GHZ is used in the office,
> and I kept my 'slower' AMD K7 500MHz as programmming console.
1.2 GHz? The slowest was a 1.4 GHz Willamette, IIRC. I could be wrong...
> I knew about the BTxx are quite faster on AMD than on Intel,
> but I wouldn't have thought of that large disadvantage with shifts.
> Why buy an P4 if you can have a better CPU for half of the money?
The major players (Microsoft, Intel, nVidia, ATi, etc.) see the future of
computing in digital media. On the hardware side of things, Intel's position
has been threatened several times because PC architecture is not fast enough
for digital video. A year or two ago, nVidia's CEO announced his goal of
supplanting the CPU's dominance in this arena.
To compete, the P4 was architected to make media-related programs incredibly
fast. Compiler-produced code sucks, so the P4 has double-clocked ALUs to
execute back-to-back *dependent* ops. Register starvation means code
frequently accesses the stack, so the P4 has a small, 2-cycle L1 data cache
backed by a large L2 cache. Bandwidth is also important, so Intel made ties
with Rambus. The P4's architectural quirks (slow shifter/multiplier, small
cache, slow flags access) come as a tradeoff.
Personally, I find K7/K8 more interesting to write assembly for since I am
free to be tricky.
> Seems I need to upgrade my Intel latency-lists.
> I've got all P4 manuals, where are the detailed latency lists?
> The optimization guide tells only things which should be avoided.
There are none. Intel reports only on instructions they deem to be common.
(Likely the latencies of "uncommon" instructions are completely
discouraging.) The optimization manual (248966) describes this "common" set.
My definition of common/uncommon appears to be substantially different.
-Matt
Posted by Matt, 2/1/2004 5:26:55 AM
> Wrong...
>
> Please read my post, "Partial Register Stall & Store Forwarding Blocked
> Info"
>
> It gives you the answer. You must use shift to move 20H from AL to AH
> (SHL EAX,8H) before use OR to include 01H that equals to 2001H in AX.
> Without SHL, it will cause stall if you place 20H in AH and 01H in 01H. XOR
> EAX,EAX will not work to prevent stall so AH and AL can't be combined. It
> is why you must always use EAX that uses SHL and OR.
Your example is using constants. You can compute the final value in edx and
simply mov edx, final_value. Based on what you said, Nudge was right. I'm
guessing you meant to use variables. Please, for clarity, use variables when
you're talking about non-constant data. So, now we have this:
; Do edx & ecx come from memory somewhere?
movzx edx, [a]
movzx ecx, [b]
shl edx, 8
or edx, ecx
There is really no fast way on a P4 to do what you're trying to do. You
can't move bytes around like that. If edx & ecx are being loaded from memory
as I'm assuming, then you can combine them into a single load. Without
information about how edx and ecx are initialized, all I can say is that
this code is optimal. The lea sequence you described is far worse. On a P4,
lea will translate into shift & add u-ops. Not only is one shift faster than
a series of shifts, but it's smaller too.
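As a rough illustration of the "single load" idea, the following C sketch models both variants at the value level. It assumes, purely for the example, that b is stored at the lower address and a right after it, so that one zero-extended little-endian 16-bit load picks up both bytes; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Models: movzx edx, [a] / movzx ecx, [b] / shl edx, 8 / or edx, ecx */
uint32_t combine_two_loads(uint8_t a, uint8_t b)
{
    uint32_t edx = a;   /* movzx edx, [a] */
    uint32_t ecx = b;   /* movzx ecx, [b] */
    edx <<= 8;
    edx |= ecx;
    return edx;
}

/* Models one zero-extended 16-bit little-endian load from p,
   assuming b lives at p[0] and a at p[1] (the layout is an assumption). */
uint32_t combine_one_load(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

void check_combine(void)
{
    const uint8_t mem[2] = { 0x01, 0x20 };   /* b = 01h, a = 20h */
    assert(combine_two_loads(0x20, 0x01) == 0x2001u);
    assert(combine_one_load(mem) == 0x2001u);
}
```

Whether the two bytes can actually be laid out next to each other depends on the surrounding program, which the thread does not show.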
> You can't use mov byte ptr [A], al and mov byte ptr [A+1] otherwise
> "store forwarding blocked" WILL OCCUR!! You must always use mov dword ptr
> [A] instead.
<snip>
It might occur. Again, you didn't give any context in your example. Store
forwarding is blocked when the store is smaller than a future load, the
store and load are not made to the same address, or when the store is
misaligned. If there is no load, then there is no way store forwarding can
be blocked.
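The three conditions above can be written down as a predicate. This is only a sketch of the rule as stated in this post, not a description of what any particular CPU does; the struct and function names are made up, and "misaligned" is assumed here to mean that the store address is not a multiple of the store size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One memory access: start address and size in bytes (size must be > 0). */
struct access {
    uintptr_t addr;
    size_t size;
};

/* Applies the post's rule to a store and a later load that reads from it. */
bool forwarding_blocked(struct access store, struct access load)
{
    if (store.size < load.size)
        return true;                    /* store smaller than the load */
    if (store.addr != load.addr)
        return true;                    /* not made to the same address */
    if (store.addr % store.size != 0)
        return true;                    /* misaligned store (assumed meaning) */
    return false;
}

void check_forwarding(void)
{
    struct access byte_store  = { 0x1000, 1 };
    struct access dword_store = { 0x1000, 4 };
    struct access dword_load  = { 0x1000, 4 };

    assert(forwarding_blocked(byte_store, dword_load));    /* mov byte ptr [A], al, then a dword load */
    assert(!forwarding_blocked(dword_store, dword_load));  /* mov dword ptr [A], then a dword load */
}
```

The byte-store case is the one Bryan warns about; the dword-store case is the one he recommends. With no later load at all, the predicate simply never gets called.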
Bottom line: you have to give better context in your examples. You asked
about moving bytes around in a register. You didn't say anything about
memory.
-Matt
Posted by Matt, 2/1/2004 9:29:02 AM
Phil Carmody wrote:
> Nudge wrote:
>
>> Bryan Parkoff wrote:
>>
>>> mov edx, 020h
>>> shl edx, 08H
>>
>> mov dh, 0x20 is functionally equivalent.
>
> No it's not.
I stand corrected.
I knew I shouldn't have taken part in this thread...
Posted by Nudge, 2/1/2004 9:47:20 AM
Hi Matt,
> > I see, and I'm quite happy for my P4 1.2 GHZ is used in the office,
> > and I kept my 'slower' AMD K7 500 MHz as programming console.
> 1.2 GHz? The slowest was a 1.4 GHz Williamette, IIRC. I could be wrong...
Not sure yet, as the office is the place where my money grows,
I better keep my hands off the PCs there ... yet. :)
[..]
> > Why buy an P4 if you can have a better CPU for half of the money?
> The major players (Microsoft, Intel, nVidia, ATi, etc.)
> see the future of computing in digital media.
> On the hardware side of things, Intel's position has been
> threatened several times because PC architecture is not fast
> enough for digital video. A year or two ago,
> nVidia's CEO announced his goal of supplanting the CPU's
> dominance in this arena.
> To compete, the P4 was architected to make media-related programs
> incredibly fast. Compiler-produced code sucks, so the P4 has
> double-clocked ALUs to execute back-to-back *dependent* ops.
> Register starvation means code frequently accesses the stack,
> so the P4 has a small, 2-cycle L1 data cache backed by a large
> L2 cache. Bandwidth is also important, so Intel made ties with Rambus.
> The P4's architectural quirks (slow shifter/multiplier, small
> cache, slow flags access) come as a tradeoff.
Interesting facts, I'd have expected more similarity in
Intel- and AMD CPUs.
But I think it's nice to have a choice, even if programmers and
compilers will need to differentiate between target CPUs.
windoze for K8 ? ;)
> Personally, I find K7/K8 more interesting to write assembly for
> since I am free to be tricky.
Me too ...
> > ... where are the detailed latency lists?
> There are none.
> Intel reports only on instructions they deem to be common.
> (Likely the latencies of "uncommon" instructions are completely
> discouraging.)
> The optimization manual (248966) describes this "common" set.
> My definition of common/uncommon appears to be substantially different.
Yes.
While reading through the P4 docs to upgrade my tools to 128-bit XMM,
I found some opcodes which look identical to me:
MOVNTDQ, MOVNTPD, MOVNTPS (three dedicated opcodes)
all seem to just store 128 bits from XMM to memory.
And what's the difference between ANDPD and ANDPS (and similar)?
Don't all the 64/128-bit Booleans work on 'all' bits?
Many other opcodes also seem to differ only to inform
the HLL compilers about the variable format in use,
like MOVSD/MOVQ/MOVD ( xmm <- xmm/Zx[mem64] ).
Or did I miss any conversion or NaN check here?
Perhaps it is just the same story as 8BC1 vs. 89C8.
Poor CPU design caused by backwards compatibility,
which doesn't make much sense today,
as every program needs to check for the CPU type anyway.
A new design from scratch would perform 'somehow' better.
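For readers who don't have the encodings in their head: 8B C1 and 89 C8 are two encodings of the same instruction, mov eax, ecx (with 32-bit operand size). The small C decoder below is a sketch that handles only these register-to-register MOV forms; the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register numbers as used in ModR/M: 0 = eax, 1 = ecx, 2 = edx, 3 = ebx, ... */
struct reg_move {
    int dst;
    int src;
};

/* Decodes 8B /r (MOV r32, r/m32) and 89 /r (MOV r/m32, r32),
   but only the register-direct form, i.e. ModR/M.mod = 11b. */
bool decode_mov_rr(uint8_t opcode, uint8_t modrm, struct reg_move *out)
{
    int mod = modrm >> 6;
    int reg = (modrm >> 3) & 7;
    int rm  = modrm & 7;

    if (mod != 3)
        return false;
    if (opcode == 0x8B) {           /* reg field is the destination */
        out->dst = reg;
        out->src = rm;
        return true;
    }
    if (opcode == 0x89) {           /* reg field is the source */
        out->dst = rm;
        out->src = reg;
        return true;
    }
    return false;
}

void check_mov_encodings(void)
{
    struct reg_move a, b;
    assert(decode_mov_rr(0x8B, 0xC1, &a));
    assert(decode_mov_rr(0x89, 0xC8, &b));
    assert(a.dst == 0 && a.src == 1);   /* mov eax, ecx */
    assert(b.dst == 0 && b.src == 1);   /* mov eax, ecx as well */
}
```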
__
wolfgang
Posted by wolfgang, 2/2/2004 1:53:56 AM
> By reading over the P4-docs to upgrade my tools to 128 bit XMM,
> I just found some codes which look very identical to me:
>
> MOVNTDQ, MOVNTPD, MOVNTPS (three dedicated opcodes)
> all seem to just store 128-bits from XMM to memory.
>
> And what's the difference between ANDPD and ANDPS (and similar),
> doesn't all the 64/128 bit Booleans work on 'all' bits?
>
> Many other opcodes also seem to be different just to inform
> the HLL-compilers about the used variable-format,
> like MOVSD/MOVQ/MOVD ( xmm <- xmm/Zx[mem64] ).
> Or did I miss any conversion or NaN-check here?
after briefly checking AMD x86-64 programmer's manual vols. 1, 4, and 5....
i've only been able to come to the same conclusion.. it has to be a matter
of some sort of "valid data" check... as in, whether or not it's
reading/writing a QNaN, SNaN, or some sort of infinity.... i'm digging in it
further.... i'll probably find something at 7am tomorrow while I wait for my
school instructor to open the door to the classroom.... hopefully...
strangely though, i don't think there was mention of a special exception
raised by any of the instructions you've mentioned in your post.. the only
differences i can find is whether or not the SSE enabled bit vs. SSE2
enabled bit is set...
> Perhaps it is just the same story as 8BC1 vs. 89C8.
> Poor CPU design caused by backwards compatibility,
> which wont make too much sense today,
> as every program need to check for CPU-type anyway.
> A new design from scratch would perform 'somehow' better.
i want to say IA-64 appears to be intel's *attempt* at a scrubbed-to-scratch
design... if IA-64 becomes the starting point of a new family (as opposed to
generation) of processors... then i think i'll pass on assembly... maybe on
computer programming as a whole... and stick with my Network Specialist
courses at LTC/HAC.
Posted by Bx, 2/2/2004 5:54:39 AM
"wolfgang kern" <nowhere@nevernet.at> wrote in message
news:bvk7m2$8rs$2@newsreader1.utanet.at...
<snip>
> > To compete, the P4 was architected to make media-related programs
> > incredibly fast. Compiler-produced code sucks, so the P4 has
> > double-clocked ALUs to execute back-to-back *dependent* ops.
> > Register starvation means code frequently accesses the stack,
> > so the P4 has a small, 2-cycle L1 data cache backed by a large
> > L2 cache. Bandwidth is also important, so Intel made ties with Rambus.
> > The P4's architectural quirks (slow shifter/multiplier, small
> > cache, slow flags access) come as a tradeoff.
>
> Interesting facts, I'd have expected more similarity in
> Intel- and AMD CPUs.
You expect right for Pentium-3 and Athlon. They're two peas in a pod. Athlon
is architected a bit better, but overall they get roughly the same
performance. Interestingly enough, I heard today that Prescott (Pentium-5)
removes the double-clocked ALUs and restores fast shifting. (Disclaimer: I
haven't confirmed this yet.)
<snip>
> By reading over the P4-docs to upgrade my tools to 128 bit XMM,
> I just found some codes which look very identical to me:
>
> MOVNTDQ, MOVNTPD, MOVNTPS (three dedicated opcodes)
> all seem to just store 128-bits from XMM to memory.
>
> And what's the difference between ANDPD and ANDPS (and similar),
> doesn't all the 64/128 bit Booleans work on 'all' bits?
<snip>
Yes and no. At first I thought these were all side-effects from the
encodings, but it appears they have real purpose. On K8 at least, and I
believe on Pentium-4 as well, the xmm registers are tagged with a format:
int, single, or double. The K8 manual talks about preserving the type to
avoid stalls. For example:
movaps xmm0, [mem] ; xmm0 in packed single format
movapd xmm1, [mem] ; xmm1 in packed double format
andps xmm0, xmm1 ; xmm1 converted into single, penalty here
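On the "do the Booleans work on all bits" part of the question: at the level of the result bits, a 128-bit AND gives the same answer whether the register is viewed as four 32-bit lanes or as two 64-bit lanes. The C sketch below models only that architectural result, not the format tag or the penalty described above; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AND two 128-bit values viewed as four 32-bit lanes (the ANDPS view). */
void and_as_4x32(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    uint32_t x[4], y[4];
    memcpy(x, a, 16);
    memcpy(y, b, 16);
    for (int i = 0; i < 4; i++)
        x[i] &= y[i];
    memcpy(out, x, 16);
}

/* AND the same values viewed as two 64-bit lanes (the ANDPD view). */
void and_as_2x64(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    uint64_t x[2], y[2];
    memcpy(x, a, 16);
    memcpy(y, b, 16);
    for (int i = 0; i < 2; i++)
        x[i] &= y[i];
    memcpy(out, x, 16);
}

void check_and_lanes(void)
{
    uint8_t a[16], b[16], r32[16], r64[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 17 + 3);
        b[i] = (uint8_t)(0xF0 ^ i);
    }
    and_as_4x32(a, b, r32);
    and_as_2x64(a, b, r64);
    assert(memcmp(r32, r64, 16) == 0);   /* bit-for-bit identical */
}
```

So any difference between the PS and PD forms has to come from how the hardware tracks the data, not from the computed bits.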
-Matt
Posted by Matt, 2/2/2004 6:20:05 AM
Bryan Parkoff wrote:
> Here is my example code.
>
> mov edx, 020h
> shl edx, 08H
> mov ecx, 01h
> or edx, ecx
According to AnandTech:
In Prescott, a shift/rotate block has been added to one of the fast
ALUs so that simple shifts/rotates may execute quickly.
http://www.anandtech.com/cpu/showdoc.html?i=1956&p=7
Perhaps you could purchase a Prescott and stop obsessing over the
Pentium 4's shift latency?
Posted by Grumble, 2/2/2004 6:51:22 PM
Matt Taylor wrote:
> Interestingly enough, I heard today that Prescott (Pentium-5)
I've read that Prescott will keep the Pentium 4 name.
http://news.com.com/2100-1006_3-5121483.html
> removes the double-clocked ALUs and restore fast shifting.
> (Disclaimer: I haven't confirmed this yet.)
I've seen the Inquirer spread that rumor.
http://www.theinquirer.net/?article=13546
Posted by Grumble, 2/2/2004 6:51:23 PM
Hi Matt,
[..] Ok, let's see what (dis)advantages future CPUs will bring.
> > MOVNTDQ, MOVNTPD, MOVNTPS (three dedicated opcodes)
> > all seem to just store 128-bits from XMM to memory.
> > And what's the difference between ANDPD and ANDPS (and similar),
> > doesn't all the 64/128 bit Booleans work on 'all' bits?
> Yes and no. At first I thought these were all side-effects from the
> encodings, but it appears they have real purpose. On K8 at least, and I
> believe on Pentium-4 as well, the xmm registers are tagged with a format:
> int, single, or double. The K8 manual talks about preserving the type to
> avoid stalls. For example:
> movaps xmm0, [mem] ; xmm0 in packed single format
> movapd xmm1, [mem] ; xmm1 in packed double format
> andps xmm0, xmm1 ; xmm1 converted into single, penalty here
I see, it makes sense for inherent format conversions,
but I'll use only one mnemonic and produce just one opcode
for all three MOVNT [mem128],xmm as the tag doesn't go with it.
Thanks, back to work ...
__
wolfgang
Posted by wolfgang, 2/2/2004 10:11:20 PM
<snip>
> Interestingly enough, I heard today that Prescott (Pentium-5)
> removes the double-clocked ALUs and restore fast shifting. (Disclaimer: I
> haven't confirmed this yet.)
<snip>
Many of you will no doubt have already seen this but it seems to
confirm Matt's point:
"Shift and rotate instructions can now be executed quickly by a new
shifter/rotator logic block included in one of the two fast ALUs."
http://www.extremetech.com/article2/0,3973,1478837,00.asp
Cheers,
Dave
Posted by david, 2/4/2004 5:39:50 PM
Dave Boyle wrote:
> <snip>
>
>>Interestingly enough, I heard today that Prescott (Pentium-5)
>>removes the double-clocked ALUs and restore fast shifting. (Disclaimer: I
>>haven't confirmed this yet.)
>
>
> <snip>
>
> Many of you will no doubt have already seen this but it seems to
> confirm Matt's point:
>
> "Shift and rotate instructions can now be executed quickly by a new
> shifter/rotator logic block included in one of the two fast ALUs."
>
> http://www.extremetech.com/article2/0,3973,1478837,00.asp
They also fixed the integer MUL unit.
Unfortunately, the L1 cache was made 4 X slower! From 1 to 4 cycles. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Posted by Terje, 2/4/2004 11:42:32 PM