I'm puzzled by the following test results...
My program is searching small blocks of ordered pointers to
see if a given pointer is there.
The typical case is a miss, so that's what I tested for.
I assumed a single repne scasd instruction would be faster than
setting up a binary search for blocks smaller than 30 pointers or so.
However, my binary search seems to outperform the rep scasd
search even when there is only a single pointer to be searched.
The binary search goes like this:
xor edx,edx ; prepare for odd value flag
..split:
jecxz .miss
..loop:
shr ecx,1 ; Cut in half
setc dl ; Remember odd length
cmp eax,[edi+ecx*4] ; Compare 1st element after cut
jb short .split ; Before: search elements before cut
je short .hit ; Found: exit the loop
lea edi,[edi+4*ecx+4]; After: search elements after the cut
add ecx,edx ; inclucing a last odd element
loop .loop
..miss:
...
..hit:
...
(The binsearch is crafted so that when there is an odd number of
elemets left, the middle element is tested; when there is an even
number, the fisrt range will be 1 element longer.)
The simple search goes:
std
lea edi,[edi+ecx*4-4]
repne scasd
je .hit
..miss:
...
..hit:
...
I'm testing on a Pentium M 1.7Ghz.
Is this processor particluarly bad with std - rep scasd -
or does it "like" my binsearch loop particularly well -
or does the bad scasd performance relate to the std, or
should I generally expect the binary search to be the fastest
even for small blocks?
|
|
0
|
|
|
|
Reply
|
Ole
|
7/29/2005 4:49:06 PM |
|
Ole Nielsby wrote:
| I'm puzzled by the following test results...
|
| My program is searching small blocks of ordered pointers to
| see if a given pointer is there.
|
| The typical case is a miss, so that's what I tested for.
|
| I assumed a single repne scasd instruction would be faster than
| setting up a binary search for blocks smaller than 30 pointers or so.
|
| However, my binary search seems to outperform the rep scasd
| search even when there is only a single pointer to be searched.
|
| The binary search goes like this:
|
| xor edx,edx ; prepare for odd value flag
| .split:
| jecxz .miss
| .loop:
| shr ecx,1 ; Cut in half
| setc dl ; Remember odd length
| cmp eax,[edi+ecx*4] ; Compare 1st element after cut
| jb short .split ; Before: search elements before cut
| je short .hit ; Found: exit the loop
| lea edi,[edi+4*ecx+4]; After: search elements after the cut
| add ecx,edx ; inclucing a last odd element
| loop .loop
| .miss:
| ...
| .hit:
| ...
| (The binsearch is crafted so that when there is an odd number of
| elemets left, the middle element is tested; when there is an even
| number, the fisrt range will be 1 element longer.)
|
| The simple search goes:
|
| std
| lea edi,[edi+ecx*4-4]
| repne scasd
| je .hit
| .miss:
| ...
| .hit:
| ...
|
| I'm testing on a Pentium M 1.7Ghz.
|
| Is this processor particluarly bad with std - rep scasd -
| or does it "like" my binsearch loop particularly well -
| or does the bad scasd performance relate to the std, or
| should I generally expect the binary search to be the fastest
| even for small blocks?
I think your method of 'successive approximation' will be faster
in most cases and it wont alter edi, while rep scan can be faster
only if the match occures very early in the search, even with the
faster 'cld'-version (cache).
btw: I'd try:
xor edx,edx ; prepare for odd value flag
..split:
shr ecx,1 ; Cut in half
je miss ;jecxz is a slow one
..loop:
setc dl ; Remember odd length
cmp eax,[edi+ecx*4] ; Compare 1st element after cut
jb short .split ; Before: search elements before cut
je short .hit ; Found: exit the loop
lea edi,[edi+4*ecx+4] ; After: search elements after the cut
add ecx,edx ; inclucing a last odd element
jmp short .split ; (save two bytes) or continue with:
;shr ecx,1 ; Cut in half
;jne .loop ;faster than loop
..miss:
...
__
wolfgang
|
|
0
|
|
|
|
Reply
|
wolfgang
|
7/29/2005 7:21:50 PM
|
|
Ole Nielsby wrote:
> I'm puzzled by the following test results...
>
> My program is searching small blocks of ordered pointers to
> see if a given pointer is there.
>
> The typical case is a miss, so that's what I tested for.
>
> I assumed a single repne scasd instruction would be faster than
> setting up a binary search for blocks smaller than 30 pointers or so.
scasd is rather slow. Slower than the discrete instructions that do the
same job. Just something to be aware of. This may be your problem.
>
> However, my binary search seems to outperform the rep scasd
> search even when there is only a single pointer to be searched.
Makes sense. Coding scasd using discrete instructions is often a lot
faster.
>
> Is this processor particluarly bad with std - rep scasd -
Yes.
This has been true on the Pentium processors for quite some time. MOVS
and STOS seem to work well, the other string instructions are far
slower than using (simple) discrete instructions to implement the same
thing.
Cheers,
Randy Hyde
|
|
0
|
|
|
|
Reply
|
randyhyde
|
7/29/2005 7:26:30 PM
|
|
wolfgang kern <nowhere@nevernet.at> suggested:
>
> btw: I'd try:
>
> xor edx,edx ; prepare for odd value flag
> .split:
> shr ecx,1 ; Cut in half
> je miss ;jecxz is a slow one
> .loop:
> setc dl ; Remember odd length
> cmp eax,[edi+ecx*4] ; Compare 1st element after cut
> jb short .split ; Before: search elements before cut
> je short .hit ; Found: exit the loop
> lea edi,[edi+4*ecx+4] ; After: search elements after the cut
> add ecx,edx ; inclucing a last odd element
> jmp short .split ; (save two bytes) or continue with:
> ;shr ecx,1 ; Cut in half
> ;jne .loop ;faster than loop
> .miss:
> ...
Thanks for the clarification - and thanks for trying to improve
my code - though I'm afraid your code is broken in two places:
1. the case where ecx=1 creates a miss
(my code tests ecx before the shr),
2. your code misses the "dec ecx" implicit in the loop instruction,
and will search too far. (The decrement in my code, together
with the +4 in the lea, excludes the already tested middle element
from the last range.)
Anyway, it seems the best I can do is to keep the jecxz while changing
the loop - so I end up with:
xor edx,edx
..split:
jecxz .miss
..loop:
shr ecx,1
setnc dl ; the flag is now reverted
cmp eax,[edi+ecx*4]
jb short .split
je short .hit
lea edi,[edi+4*ecx+4]
sub ecx,edx ; subtract instead of add, to get the decrement
jnz short .loop
..miss:
which is a tiny bit faster than the code I had. There seems to be
little or no gain in replacing the jecxz instruction.
|
|
0
|
|
|
|
Reply
|
Ole
|
7/30/2005 12:37:22 AM
|
|
Ole Nielsby wrote:
[ ...my hasty typed attmept ;) ]
| Thanks for the clarification - and thanks for trying to improve
| my code - though I'm afraid your code is broken in two places:
| 1. the case where ecx=1 creates a miss
| (my code tests ecx before the shr),
Yes, my idea was to jump to the comparision, but I failed to tell.
And unfortunately we haven't got a single 'jncz' on IA-Cpus.
| 2. your code misses the "dec ecx" implicit in the loop instruction,
| and will search too far. (The decrement in my code, together
| with the +4 in the lea, excludes the already tested middle element
| from the last range.)
Yes, I made a classical off by one (4 here).
| Anyway, it seems the best I can do is to keep the jecxz while changing
| the loop - so I end up with:
|
| xor edx,edx
| .split:
| jecxz .miss
| .loop:
| shr ecx,1
| setnc dl ; the flag is now reverted
| cmp eax,[edi+ecx*4]
| jb short .split
| je short .hit
| lea edi,[edi+4*ecx+4]
? lea edi,[edi+ecx*4+] ?
| sub ecx,edx ; subtract instead of add, to get the decrement
| jnz short .loop
| .miss:
| which is a tiny bit faster than the code I had. There seems to be
| little or no gain in replacing the jecxz instruction.
Right, even jecxz is slower on my AMD,
it seems to be faster than a cmp/jcc pair on Intel's.
I still think that you could get rid of the jecxz,
but it may not be faster then.
Looks now good to me anyway :)
__
wolfgang
|
|
0
|
|
|
|
Reply
|
wolfgang
|
7/30/2005 5:48:21 AM
|
|
Ole,
A couple of suggestions, try replacing jecxz with the instruction pair,
tst ecx, ecx
jz .miss
See if you can remove the SHR instruction by changing the scale from 4
to 2 in the CMP and LEA complex addressing as it is probably the major
bottleneck in your loop code.
Regards,
hutch at movsd dot com
|
|
0
|
|
|
|
Reply
|
hutch
|
7/30/2005 6:04:14 PM
|
|
randyhyde@earthlink.net wrote:
> This has been true on the Pentium processors for quite some time. MOVS
> and STOS seem to work well, the other string instructions are far
> slower than using (simple) discrete instructions to implement the same
> thing.
At what processor family did this become the case? I ask because, as
part of a hobby project, I'm writing code optimized to every major x86
architecture from 8088 to Pentium, and I was under the impression that
the string instructions were to be favored above all else.
|
|
0
|
|
|
|
Reply
|
Jim
|
7/30/2005 8:02:20 PM
|
|
Ole Nielsby wrote:
> I'm puzzled by the following test results...
>
> My program is searching small blocks of ordered pointers to
> see if a given pointer is there.
>
> The typical case is a miss, so that's what I tested for.
>
> I assumed a single repne scasd instruction would be faster than
> setting up a binary search for blocks smaller than 30 pointers or so.
>
> However, my binary search seems to outperform the rep scasd
> search even when there is only a single pointer to be searched.
30 pointers = 5 comparisons in the binary search vs. average of 15 from the
scas loop. Intel was nice enough to not bother documenting half the
instructions on the P4 and P-M processors. Referring to a document produced
by Agner Fog, it appears that repe/repne cmps executes at best in 6+4.5*n
cycles, so that's 6+4.5*15 = 74 clocks. Your binary search loop, despite
using the horrible loop instruction, is somewhere under 13 clocks, and 13*5
= 65 clocks. I am not sure of the exact timing since I am not familiar with
all the rules about dispatch ports on the P6 architecture.
> The binary search goes like this:
>
> xor edx,edx ; prepare for odd value flag
> .split:
> jecxz .miss
> .loop:
> shr ecx,1
> setc dl
> cmp eax,[edi+ecx*4]
> jb short .split
> je short .hit
> lea edi,[edi+4*ecx+4]
> add ecx,edx
> loop .loop
>
> .miss:
> ...
> .hit:
> ...
Try rewriting like so:
; Inputs:
; eax = search value
; ecx = array length
; edi = array
;
; Trashes:
; ebx, ecx, edx, edi
xor edx, edx
test ecx, ecx
jz .miss
..top:
shr ecx, 1
setc dl
lea ebx, [edi+ecx*4+4]
cmp eax, [edi+ecx*4]
je .hit
cmova edi, ebx
add ecx, edx
jne .top
..miss:
...
..hit:
...
This still isn't the best code, but it's a bit better. I don't know a whole
lot about optimization for P6 core CPUs, and it isn't clear to me whether
the AGU computation requires an additional u-op. The docs seem to indicate
that it doesn't, but that doesn't make any sense. Anyway, the loop body here
should still execute faster than the loop instruction by itself.
The cmovcc instruction should be cheaper than a second and unpredictable
branch.
[...]
> I'm testing on a Pentium M 1.7Ghz.
>
> Is this processor particluarly bad with std - rep scasd -
> or does it "like" my binsearch loop particularly well -
> or does the bad scasd performance relate to the std, or
> should I generally expect the binary search to be the fastest
> even for small blocks?
All string instructions are bad, though there are occasions to use them.
Generally you only want to use them when you are optimizing for size, i.e. a
code path which will only execute once. As has been pointed out, rep movs*
and rep stos* perform decently well; however, the circumstances when they
should be used are still quite limited. AMD recommends that rep movs* only
be used for blocks between 16 and 512 bytes. I suspect the figures are
similar for Intel CPUs.
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
7/30/2005 11:09:52 PM
|
|
Matt wrote:
> All string instructions are bad, though there are occasions to use them.
> Generally you only want to use them when you are optimizing for size, i.e. a
> code path which will only execute once. As has been pointed out, rep movs*
> and rep stos* perform decently well; however, the circumstances when they
> should be used are still quite limited. AMD recommends that rep movs* only
> be used for blocks between 16 and 512 bytes. I suspect the figures are
> similar for Intel CPUs.
>
> -Matt
Interesting. The lower bound is obvious. What is the reason for the
upper bound?
Cheers,
Randy Hyde
|
|
0
|
|
|
|
Reply
|
randyhyde
|
7/31/2005 5:31:47 PM
|
|
"Jim Leonard" <spamtrap@crayne.org> wrote:
>randyhyde@earthlink.net wrote:
>> This has been true on the Pentium processors for quite some time. MOVS
>> and STOS seem to work well, the other string instructions are far
>> slower than using (simple) discrete instructions to implement the same
>> thing.
>
>At what processor family did this become the case? I ask because, as
>part of a hobby project, I'm writing code optimized to every major x86
>architecture from 8088 to Pentium, and I was under the impression that
>the string instructions were to be favored above all else.
They have NEVER been a "win" in every case. To my knowledge, a standalone
string instruction (without REP) has always been slower than the equivalent
sequence of simple instructions.
The breakeven point varies based on processor. In the Pentiums, for
example, rep lods/movs/stos is not a win until you reach 8 iterations. The
situation is more complicated with MMX or SSE2.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
|
|
0
|
|
|
|
Reply
|
Tim
|
8/1/2005 5:19:13 AM
|
|
Its pretty simple actually, on late enough hardware the SSE(2-3)
intructions are faster than the special case circuitry for STOSD and
MOVSD. Where you can seperate the IO streams between temporal reads and
non-temporal writes, you see a viable speed increase.
If you are forced to work at a byte level with the string data,
incremented pointers are hard to beat but anywhere where you can use
block data transfer, SSE(2-3) is faster.
Regards,
hutch at movsd dot com
|
|
0
|
|
|
|
Reply
|
hutch
|
8/1/2005 8:00:04 AM
|
|
hutch-- wrote:
> Its pretty simple actually, on late enough hardware the SSE(2-3)
> intructions are faster than the special case circuitry for STOSD and
> MOVSD. Where you can seperate the IO streams between temporal reads
> and non-temporal writes, you see a viable speed increase.
>
> If you are forced to work at a byte level with the string data,
> incremented pointers are hard to beat but anywhere where you can use
> block data transfer, SSE(2-3) is faster.
s/SSE/MMX/g
There isn't a huge difference, but MMX generally tends to be a hair faster.
Non-temporal writes don't have much to do with the upper bound, though they
are helpful in general.
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
8/1/2005 8:14:04 PM
|
|
Tim Roberts wrote:
> >At what processor family did this become the case? I ask because, as
> >part of a hobby project, I'm writing code optimized to every major x86
> >architecture from 8088 to Pentium, and I was under the impression that
> >the string instructions were to be favored above all else.
>
> They have NEVER been a "win" in every case. To my knowledge, a standalone
> string instruction (without REP) has always been slower than the equivalent
> sequence of simple instructions.
Sorry, I didn't write clearly and was talking about loops; what I meant
to write was "at what point did REP <string instruction> become slower
than the simple instruction sequence/loop equivalent?" For example,
REP MOVSD vs. "mov reg,mem; mov mem,reg; inc regs; dec cx; jnz"? I'm
*assuming* on 386DX it was break-even and on 486 it was faster, but if
anyone knows for sure I'd like to know...
However, you're wrong about "a standalone string instruction...has
always been slower than the equivalent sequence of simple
instructions". In my 8088 projects, I use single-shot MOVSB's in
places like decompression code where the testing of a single bit
indicates if the input symbol is a literal and needs to be copied to
the output. In those cases, a single MOVSB will copy the byte, then
increment si and di. It does all this in 18 cycles and is one byte in
size, so the total execution time is 22 cycles if the MOVSB hasn't been
prefetched. The closest equivalent I can think of to a one-shot MOVSB
would be something like:
mov ax,[si] ; 8+5c, 2b=8c, total if not prefetched=22c
mov [di],ax ; 8+5=13c (prefetched by now thanks to the above slow MOV)
inc si ; 3c (prefetched)
inc di ; 3c (prefetched)
total: 41c if the initial MOV isn't prefetched
So on 8088 it is clearly 2 times faster (and 3 times smaller) to use a
one-shot MOVSB than the equivalent instruction sequence. Again, I
don't know at what point that situation was reversed (486?) so if
anyone knows, please tell me.
> The breakeven point varies based on processor. In the Pentiums, for
> example, rep lods/movs/stos is not a win until you reach 8 iterations. The
> situation is more complicated with MMX or SSE2.
Damn caching :-) This is why coding for 8088 is more "fun" for me,
since I get to think about the most efficient way to perform a task,
rather than the most efficient in terms of dealing with machine
implementation. The most complicated memory issue I deal with in 8088
is *not touching it at all* since it's so damn slow... but when I do,
there aren't any goofy rules, simply "a single byte takes 4 cycles to
read/write", that's it.
|
|
0
|
|
|
|
Reply
|
Jim
|
8/2/2005 6:45:13 PM
|
|
|
12 Replies
308 Views
(page loaded in 0.22 seconds)
|