Slow string search/fast binary search

  • Follow


I'm puzzled by the following test results...

My program is searching small blocks of ordered pointers to
see if a given pointer is there.

The typical case is a miss, so that's what I tested for.

I assumed a single  repne scasd  instruction would be faster than
setting up a binary search for blocks smaller than 30 pointers or so.

However, my binary search seems to outperform the  rep scasd
search even when there is only a single pointer to be searched.

The binary search goes like this:

            xor edx,edx            ; prepare for odd value flag
..split:
            jecxz .miss
..loop:
            shr ecx,1                    ; Cut in half
            setc dl                        ; Remember odd length
            cmp eax,[edi+ecx*4]    ; Compare 1st element after cut
            jb short .split            ; Before: search elements before cut
            je short .hit                ; Found: exit the loop
            lea edi,[edi+4*ecx+4]; After: search elements after the cut
            add ecx,edx            ; inclucing a last odd element
            loop .loop
..miss:
            ...
..hit:
            ...
(The binsearch is crafted so that when there is an odd number of
elemets left, the middle element is tested; when there is an even
number, the fisrt range will be 1 element longer.)

The simple search goes:

          std
          lea edi,[edi+ecx*4-4]
          repne scasd
          je .hit
..miss:
            ...
..hit:
            ...

I'm testing on a Pentium M 1.7Ghz.

Is this processor particluarly bad with  std - rep scasd  -
or does it "like" my binsearch loop particularly well -
or does the bad scasd performance relate to the  std, or
should I generally expect the binary search to be the fastest
even for small blocks?



0
Reply Ole 7/29/2005 4:49:06 PM

Ole Nielsby wrote:

| I'm puzzled by the following test results...
| 
| My program is searching small blocks of ordered pointers to
| see if a given pointer is there.
| 
| The typical case is a miss, so that's what I tested for.
| 
| I assumed a single  repne scasd  instruction would be faster than
| setting up a binary search for blocks smaller than 30 pointers or so.
| 
| However, my binary search seems to outperform the  rep scasd
| search even when there is only a single pointer to be searched.
| 
| The binary search goes like this:
| 
|             xor edx,edx            ; prepare for odd value flag
| .split:
|             jecxz .miss
| .loop:
|             shr ecx,1                    ; Cut in half
|             setc dl                        ; Remember odd length
|             cmp eax,[edi+ecx*4]    ; Compare 1st element after cut
|             jb short .split            ; Before: search elements before cut
|             je short .hit                ; Found: exit the loop
|             lea edi,[edi+4*ecx+4]; After: search elements after the cut
|             add ecx,edx            ; inclucing a last odd element
|             loop .loop
| .miss:
|             ...
| .hit:
|             ...
| (The binsearch is crafted so that when there is an odd number of
| elemets left, the middle element is tested; when there is an even
| number, the fisrt range will be 1 element longer.)
| 
| The simple search goes:
| 
|           std
|           lea edi,[edi+ecx*4-4]
|           repne scasd
|           je .hit
| .miss:
|             ...
| .hit:
|             ...
| 
| I'm testing on a Pentium M 1.7Ghz.
| 
| Is this processor particluarly bad with  std - rep scasd  -
| or does it "like" my binsearch loop particularly well -
| or does the bad scasd performance relate to the  std, or
| should I generally expect the binary search to be the fastest
| even for small blocks?

I think your method of 'successive approximation' will be faster
in most cases and it wont alter edi, while rep scan can be faster
only if the match occures very early in the search, even with the
faster 'cld'-version (cache).

btw: I'd try:

         xor edx,edx            ; prepare for odd value flag
..split:     
         shr ecx,1              ; Cut in half
         je miss                ;jecxz is a slow one
..loop:
         setc dl                ; Remember odd length
         cmp eax,[edi+ecx*4]    ; Compare 1st element after cut
         jb short .split        ; Before: search elements before cut
         je short .hit          ; Found: exit the loop
         lea edi,[edi+4*ecx+4]  ; After: search elements after the cut
         add ecx,edx            ; inclucing a last odd element
         jmp short .split       ; (save two bytes) or continue with: 
         ;shr ecx,1             ; Cut in half
         ;jne .loop             ;faster than loop 
..miss:
         ...

__
wolfgang


0
Reply wolfgang 7/29/2005 7:21:50 PM


Ole Nielsby wrote:
> I'm puzzled by the following test results...
>
> My program is searching small blocks of ordered pointers to
> see if a given pointer is there.
>
> The typical case is a miss, so that's what I tested for.
>
> I assumed a single  repne scasd  instruction would be faster than
> setting up a binary search for blocks smaller than 30 pointers or so.

scasd is rather slow. Slower than the discrete instructions that do the
same job. Just something to be aware of. This may be your problem.

>
> However, my binary search seems to outperform the  rep scasd
> search even when there is only a single pointer to be searched.

Makes sense. Coding scasd using discrete instructions is often a lot
faster.

>
> Is this processor particluarly bad with  std - rep scasd  -

Yes.
This has been true on the Pentium processors for quite some time. MOVS
and STOS seem to work well, the other string instructions are far
slower than using (simple) discrete instructions to implement the same
thing.

Cheers,
Randy Hyde

0
Reply randyhyde 7/29/2005 7:26:30 PM

wolfgang kern <nowhere@nevernet.at> suggested:
>
> btw: I'd try:
>
>         xor edx,edx            ; prepare for odd value flag
> .split:
>         shr ecx,1              ; Cut in half
>         je miss                ;jecxz is a slow one
> .loop:
>         setc dl                ; Remember odd length
>         cmp eax,[edi+ecx*4]    ; Compare 1st element after cut
>         jb short .split        ; Before: search elements before cut
>         je short .hit          ; Found: exit the loop
>         lea edi,[edi+4*ecx+4]  ; After: search elements after the cut
>         add ecx,edx            ; inclucing a last odd element
>         jmp short .split       ; (save two bytes) or continue with:
>         ;shr ecx,1             ; Cut in half
>         ;jne .loop             ;faster than loop
> .miss:
>         ...

Thanks for the clarification - and thanks for trying to improve
my code -  though I'm afraid your code is broken in two places:

1. the case where ecx=1 creates a miss
    (my code tests ecx before the shr),
2. your code misses the "dec ecx" implicit in the loop instruction,
    and will search too far. (The decrement in my code, together
    with the +4 in the lea, excludes the already tested middle element
    from the last range.)

Anyway, it seems the best I can do is to keep the jecxz while changing
the loop - so I end up with:

    xor    edx,edx
..split:
      jecxz .miss
..loop:
      shr ecx,1
      setnc dl        ; the flag is now reverted
      cmp eax,[edi+ecx*4]
      jb short .split
      je short .hit
      lea edi,[edi+4*ecx+4]
      sub ecx,edx    ; subtract instead of add, to get the decrement
      jnz short .loop
..miss:

which is a tiny bit faster than the code I had. There seems to be
little or no gain in replacing the jecxz instruction. 


0
Reply Ole 7/30/2005 12:37:22 AM

Ole Nielsby wrote:

[ ...my hasty typed attmept ;) ] 

| Thanks for the clarification - and thanks for trying to improve
| my code -  though I'm afraid your code is broken in two places:
 
| 1. the case where ecx=1 creates a miss
|     (my code tests ecx before the shr),

Yes, my idea was to jump to the comparision,  but I failed to tell.
And unfortunately we haven't got a single 'jncz' on IA-Cpus. 

| 2. your code misses the "dec ecx" implicit in the loop instruction,
|     and will search too far. (The decrement in my code, together
|     with the +4 in the lea, excludes the already tested middle element
|     from the last range.)

Yes, I made a classical off by one (4 here).  
 
| Anyway, it seems the best I can do is to keep the jecxz while changing
| the loop - so I end up with:
| 
|     xor    edx,edx
| .split:
|       jecxz .miss
| .loop:
|       shr ecx,1
|       setnc dl        ; the flag is now reverted
|       cmp eax,[edi+ecx*4]
|       jb short .split
|       je short .hit
|       lea edi,[edi+4*ecx+4]

?       lea edi,[edi+ecx*4+]  ?

|       sub ecx,edx    ; subtract instead of add, to get the decrement
|       jnz short .loop
| .miss:
 
| which is a tiny bit faster than the code I had. There seems to be
| little or no gain in replacing the jecxz instruction. 

Right, even jecxz is slower on my AMD,
it seems to be faster than a cmp/jcc pair on Intel's.

I still think that you could get rid of the jecxz,
but it may not be faster then.

Looks now good to me anyway :)

__
wolfgang



    
 

0
Reply wolfgang 7/30/2005 5:48:21 AM

Ole,

A couple of suggestions, try replacing jecxz with the instruction pair,

tst ecx, ecx
jz .miss

See if you can remove the SHR instruction by changing the scale from 4
to 2 in the CMP and LEA complex addressing as it is probably the major
bottleneck in your loop code.

Regards,

hutch at movsd dot com

0
Reply hutch 7/30/2005 6:04:14 PM

randyhyde@earthlink.net wrote:
> This has been true on the Pentium processors for quite some time. MOVS
> and STOS seem to work well, the other string instructions are far
> slower than using (simple) discrete instructions to implement the same
> thing.

At what processor family did this become the case?  I ask because, as
part of a hobby project, I'm writing code optimized to every major x86
architecture from 8088 to Pentium, and I was under the impression that
the string instructions were to be favored above all else.

0
Reply Jim 7/30/2005 8:02:20 PM

Ole Nielsby wrote:
> I'm puzzled by the following test results...
>
> My program is searching small blocks of ordered pointers to
> see if a given pointer is there.
>
> The typical case is a miss, so that's what I tested for.
>
> I assumed a single  repne scasd  instruction would be faster than
> setting up a binary search for blocks smaller than 30 pointers or so.
>
> However, my binary search seems to outperform the  rep scasd
> search even when there is only a single pointer to be searched.

30 pointers = 5 comparisons in the binary search vs. average of 15 from the 
scas loop. Intel was nice enough to not bother documenting half the 
instructions on the P4 and P-M processors. Referring to a document produced 
by Agner Fog, it appears that repe/repne cmps executes at best in 6+4.5*n 
cycles, so that's 6+4.5*15 = 74 clocks. Your binary search loop, despite 
using the horrible loop instruction, is somewhere under 13 clocks, and 13*5 
= 65 clocks. I am not sure of the exact timing since I am not familiar with 
all the rules about dispatch ports on the P6 architecture.

> The binary search goes like this:
>
>            xor edx,edx            ; prepare for odd value flag
> .split:
>            jecxz .miss
> .loop:
>            shr ecx,1
>            setc dl
>            cmp eax,[edi+ecx*4]
>            jb short .split
>            je short .hit
>            lea edi,[edi+4*ecx+4]
>            add ecx,edx
>            loop .loop
>
> .miss:
>            ...
> .hit:
>            ...

Try rewriting like so:

 ; Inputs:
 ; eax = search value
 ; ecx = array length
 ; edi = array
 ;
 ; Trashes:
 ; ebx, ecx, edx, edi

 xor edx, edx
 test ecx, ecx
 jz .miss

..top:
 shr ecx, 1
 setc dl
 lea ebx, [edi+ecx*4+4]
 cmp eax, [edi+ecx*4]
 je .hit
 cmova edi, ebx
 add ecx, edx
 jne .top

..miss:
 ...
..hit:
 ...

This still isn't the best code, but it's a bit better. I don't know a whole 
lot about optimization for P6 core CPUs, and it isn't clear to me whether 
the AGU computation requires an additional u-op. The docs seem to indicate 
that it doesn't, but that doesn't make any sense. Anyway, the loop body here 
should still execute faster than the loop instruction by itself.

The cmovcc instruction should be cheaper than a second and unpredictable 
branch.

[...]
> I'm testing on a Pentium M 1.7Ghz.
>
> Is this processor particluarly bad with  std - rep scasd  -
> or does it "like" my binsearch loop particularly well -
> or does the bad scasd performance relate to the  std, or
> should I generally expect the binary search to be the fastest
> even for small blocks?

All string instructions are bad, though there are occasions to use them. 
Generally you only want to use them when you are optimizing for size, i.e. a 
code path which will only execute once. As has been pointed out, rep movs* 
and rep stos* perform decently well; however, the circumstances when they 
should be used are still quite limited. AMD recommends that rep movs* only 
be used for blocks between 16 and 512 bytes. I suspect the figures are 
similar for Intel CPUs.

-Matt 

0
Reply Matt 7/30/2005 11:09:52 PM

Matt wrote:

> All string instructions are bad, though there are occasions to use them.
> Generally you only want to use them when you are optimizing for size, i.e. a
> code path which will only execute once. As has been pointed out, rep movs*
> and rep stos* perform decently well; however, the circumstances when they
> should be used are still quite limited. AMD recommends that rep movs* only
> be used for blocks between 16 and 512 bytes. I suspect the figures are
> similar for Intel CPUs.
>
> -Matt

Interesting. The lower bound is obvious. What is the reason for the
upper bound?
Cheers,
Randy Hyde

0
Reply randyhyde 7/31/2005 5:31:47 PM

"Jim Leonard"  <spamtrap@crayne.org> wrote:

>randyhyde@earthlink.net wrote:
>> This has been true on the Pentium processors for quite some time. MOVS
>> and STOS seem to work well, the other string instructions are far
>> slower than using (simple) discrete instructions to implement the same
>> thing.
>
>At what processor family did this become the case?  I ask because, as
>part of a hobby project, I'm writing code optimized to every major x86
>architecture from 8088 to Pentium, and I was under the impression that
>the string instructions were to be favored above all else.

They have NEVER been a "win" in every case.  To my knowledge, a standalone
string instruction (without REP) has always been slower than the equivalent
sequence of simple instructions.

The breakeven point varies based on processor.  In the Pentiums, for
example, rep lods/movs/stos is not a win until you reach 8 iterations.  The
situation is more complicated with MMX or SSE2.
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.

0
Reply Tim 8/1/2005 5:19:13 AM

Its pretty simple actually, on late enough hardware the SSE(2-3)
intructions are faster than the special case circuitry for STOSD and
MOVSD. Where you can seperate the IO streams between temporal reads and
non-temporal writes, you see a viable speed increase.

If you are forced to work at a byte level with the string data,
incremented pointers are hard to beat but anywhere where you can use
block data transfer, SSE(2-3) is faster.

Regards,

hutch at movsd dot com

0
Reply hutch 8/1/2005 8:00:04 AM

hutch-- wrote:
> Its pretty simple actually, on late enough hardware the SSE(2-3)
> intructions are faster than the special case circuitry for STOSD and
> MOVSD. Where you can seperate the IO streams between temporal reads
> and non-temporal writes, you see a viable speed increase.
>
> If you are forced to work at a byte level with the string data,
> incremented pointers are hard to beat but anywhere where you can use
> block data transfer, SSE(2-3) is faster.

s/SSE/MMX/g

There isn't a huge difference, but MMX generally tends to be a hair faster. 
Non-temporal writes don't have much to do with the upper bound, though they 
are helpful in general.

-Matt 

0
Reply Matt 8/1/2005 8:14:04 PM

Tim Roberts wrote:
> >At what processor family did this become the case?  I ask because, as
> >part of a hobby project, I'm writing code optimized to every major x86
> >architecture from 8088 to Pentium, and I was under the impression that
> >the string instructions were to be favored above all else.
>
> They have NEVER been a "win" in every case.  To my knowledge, a standalone
> string instruction (without REP) has always been slower than the equivalent
> sequence of simple instructions.

Sorry, I didn't write clearly and was talking about loops; what I meant
to write was "at what point did REP <string instruction> become slower
than the simple instruction sequence/loop equivalent?"  For example,
REP MOVSD vs. "mov reg,mem; mov mem,reg; inc regs; dec cx; jnz"? I'm
*assuming* on 386DX it was break-even and on 486 it was faster, but if
anyone knows for sure I'd like to know...

However, you're wrong about "a standalone string instruction...has
always been slower than the equivalent sequence of simple
instructions".  In my 8088 projects, I use single-shot MOVSB's in
places like decompression code where the testing of a single bit
indicates if the input symbol is a literal and needs to be copied to
the output.  In those cases, a single MOVSB will copy the byte, then
increment si and di.  It does all this in 18 cycles and is one byte in
size, so the total execution time is 22 cycles if the MOVSB hasn't been
prefetched.  The closest equivalent I can think of to a one-shot MOVSB
would be something like:

mov ax,[si] ; 8+5c, 2b=8c, total if not prefetched=22c
mov [di],ax ; 8+5=13c (prefetched by now thanks to the above slow MOV)
inc si      ; 3c (prefetched)
inc di      ; 3c (prefetched)
       total: 41c if the initial MOV isn't prefetched

So on 8088 it is clearly 2 times faster (and 3 times smaller) to use a
one-shot MOVSB than the equivalent instruction sequence.  Again, I
don't know at what point that situation was reversed (486?) so if
anyone knows, please tell me.

> The breakeven point varies based on processor.  In the Pentiums, for
> example, rep lods/movs/stos is not a win until you reach 8 iterations.  The
> situation is more complicated with MMX or SSE2.

Damn caching :-)  This is why coding for 8088 is more "fun" for me,
since I get to think about the most efficient way to perform a task,
rather than the most efficient in terms of dealing with machine
implementation.  The most complicated memory issue I deal with in 8088
is *not touching it at all* since it's so damn slow... but when I do,
there aren't any goofy rules, simply "a single byte takes 4 cycles to
read/write", that's it.

0
Reply Jim 8/2/2005 6:45:13 PM

12 Replies
308 Views

(page loaded in 0.22 seconds)

Similiar Articles:


















7/27/2012 4:00:27 PM


Reply: