Re: Fastest logical not?

"io_x" <a@b.c.invalid> wrote in message news:...
> perhaps this could be ok
> ; input in eax, output in edx, trashes eax and edx
> xor  edx, edx
> sub  eax,   1
> adc  edx, edx
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
> ; assuming I understand correctly when sub sets the carry flag

This one does not change eax :)

; input in eax, output in edx, trashes edx only
xor  edx, edx
cmp  eax,   1
adc  edx, edx
; if eax==0 => CF==1 => edx=1
; if eax!=0 => CF==0 => edx=0

Is it true that "cmp eax, 1" == "sub eax, 1" as far as the flags are concerned?





Reply io_x 6/18/2010 6:20:15 PM

io_x wrote:
> "io_x" <a@b.c.invalid> wrote in message news:...
> 
> this not change eax :)
> 
> ; input in eax output in edx trash edx only
> xor  edx, edx
> cmp  eax,   1
> adc  edx, edx
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
> 
> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

Yes, they have the same effect on the flags; the only difference is that
"cmp eax, 1" has no effect on eax.
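To make the answer above concrete, here is a minimal C sketch of the carry behaviour, assuming 32-bit unsigned operands. The helper names (carry_of_sub, not_via_adc) are invented for illustration and do not come from the thread, and the model tracks only CF, not the other flags.

```c
#include <stdint.h>

/* CF after "sub a, b" or "cmp a, b": set when an unsigned borrow occurs.
   Both instructions compute a - b; cmp just discards the difference. */
static uint32_t carry_of_sub(uint32_t a, uint32_t b)
{
    return a < b ? 1u : 0u;
}

/* Model of: xor edx, edx / cmp eax, 1 / adc edx, edx */
static uint32_t not_via_adc(uint32_t eax)
{
    uint32_t edx = 0;                     /* xor edx, edx */
    uint32_t cf  = carry_of_sub(eax, 1u); /* cmp eax, 1: CF=1 only if eax==0 */
    edx = edx + edx + cf;                 /* adc edx, edx */
    return edx;
}
```

Only an input of 0 is below 1 as an unsigned value, so the carry is set for 0 and clear for everything else.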
Reply Rob 6/19/2010 12:56:23 AM


"io_x" <a@b.c.invalid> wrote in message
news:4c1bb68a$0$31375$4fafbaef@reader1.news.tin.it...
> this not change eax :)
>
> ; input in eax output in edx trash edx only
> xor  edx, edx
> cmp  eax,   1
> adc  edx, edx
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
>
> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

What about this? It uses only 2 instructions:

; input eax output in eax
cmp  eax,   1
sbb  eax, eax
; if eax==0 => CF==1 => eax==-1
; if eax!=0 => CF==0 => eax== 0



Reply io_x 6/19/2010 5:31:27 AM

"io_x" posted:
....
>> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

Sure.

> what about this, use only 2 instructions

> ; input eax output in eax
> cmp  eax,   1
> sbb  eax, eax
> ; if eax==0 => CF==1 => eax==-1
> ; if eax!=0 => CF==0 => eax== 0

Congrats Rosario, you finally got it! :)

And to fit the OP's request for 1 and 0 (instead of true/false),
just add an INC eax to your two-line code.

I haven't checked whether this performs any faster
(in theory it should complete within 3..4 cycles on AMD K8/10):

cmp eax, 1  ; cy if 0 only
mov eax, 1
sbb eax, 0  ; = 0 if it was zero only

__
wolfgang


Reply wolfgang 6/19/2010 7:08:29 AM

io_x wrote:
> "io_x" <a@b.c.invalid> wrote in message
> 
> what about this, use only 2 instructions
> 
> ; input eax output in eax
> cmp  eax,   1
> sbb  eax, eax
> ; if eax==0 => CF==1 => eax==-1
> ; if eax!=0 => CF==0 => eax== 0
> 
> 
> 

I added it in and got the following. It looks like using only the two 
instructions could help on the AMD (and I imagine that in large sections of 
code with lots of tests it would probably be worthwhile).
However, if it has to conform to the OP's exact requirements, changing it to:
> 	cmp eax,1
> 	sbb eax,eax
> 	neg eax
makes it return the same values as the other ones.  However, as Rod noted earlier, 
working with -1 and 0 might be better anyway.
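A small C model of the sequence may help show why neg, rather than the inc suggested earlier in the thread, gives the logical NOT after sbb. This is only a sketch: the function names are made up here, and 32-bit wraparound arithmetic is assumed.

```c
#include <stdint.h>

/* Model of: cmp eax, 1 / sbb eax, eax
   sbb computes eax - eax - CF = 0 - CF: 0xFFFFFFFF if the input was 0, else 0. */
static uint32_t mask_if_zero(uint32_t eax)
{
    uint32_t cf = eax < 1u ? 1u : 0u;   /* cmp eax, 1 */
    return 0u - cf;                     /* sbb eax, eax */
}

/* Followed by neg eax: 0 -> 1, nonzero -> 0 (logical NOT). */
static uint32_t lnot_neg(uint32_t eax)
{
    return 0u - mask_if_zero(eax);      /* neg eax */
}

/* Followed by inc eax instead: 0 -> 0, nonzero -> 1 (normalized truth value). */
static uint32_t norm_inc(uint32_t eax)
{
    return mask_if_zero(eax) + 1u;      /* inc eax */
}
```

In this model the inc variant maps 0 to 0 and any nonzero input to 1, so it normalizes the input instead of inverting it.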

> ==> Running performance tests: (10000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.050255321s  =   12.061 cycles/iteration
> Noop 1                           -->   0.021570294s  =    5.176 cycles/iteration
> Noop 2                           -->   0.019148795s  =    4.595 cycles/iteration
> Noop 3                           -->   0.016664411s  =    3.999 cycles/iteration
> Noop 4                           -->   0.016539342s  =    3.969 cycles/iteration
> Rod Pemberton 1                  -->   0.016643700s  =    3.994 cycles/iteration
> Rod Pemberton 1 (dword)          -->   0.016526647s  =    3.966 cycles/iteration
> Rod Pemberton 2                  -->   0.016718413s  =    4.012 cycles/iteration
> Bernhard Schornak                -->   0.016577269s  =    3.978 cycles/iteration
> Bernhard Schornak 2              -->   0.060770751s  =   14.584 cycles/iteration
> Rob                              -->   0.016664253s  =    3.999 cycles/iteration
> io_x (true=-1)                   -->   0.016574824s  =    3.977 cycles/iteration
> io_x (conforms)                  -->   0.016552388s  =    3.972 cycles/iteration

> ==> Running performance tests: (10000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.034733522s  =    6.022 cycles/iteration
> Noop 1                           -->   0.017316694s  =    3.002 cycles/iteration
> Noop 2                           -->   0.017328916s  =    3.004 cycles/iteration
> Noop 3                           -->   0.005782117s  =    1.002 cycles/iteration
> Noop 4                           -->   0.005760363s  =    0.998 cycles/iteration
> Rod Pemberton 1                  -->   0.005763820s  =    0.999 cycles/iteration
> Rod Pemberton 1 (dword)          -->   0.005762479s  =    0.999 cycles/iteration
> Rod Pemberton 2                  -->   0.005762902s  =    0.999 cycles/iteration
> Bernhard Schornak                -->   0.005764289s  =    0.999 cycles/iteration
> Bernhard Schornak 2              -->   0.011549449s  =    2.002 cycles/iteration
> Rob                              -->   0.005762832s  =    0.999 cycles/iteration
> io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
> io_x (conforms)                  -->   0.005763369s  =    0.999 cycles/iteration

I updated the files if anyone wants to test.
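For reading the tables, the cycles/iteration column appears to be the elapsed time multiplied by the detected clock rate and divided by the iteration count. That relation is inferred from the printed numbers, not taken from the tool's source, and the function name below is hypothetical.

```c
/* cycles per iteration = elapsed seconds * clock rate in Hz / iterations
   (assumed relation, inferred from the benchmark output) */
static double cycles_per_iteration(double seconds, double mhz, double iterations)
{
    return seconds * mhz * 1e6 / iterations;
}
```

For the Pentium 4 reference row, cycles_per_iteration(0.050255321, 2400, 10000000) comes out at roughly 12.06, in line with the 12.061 printed by the tool. The detected MHz value is rounded, so the last digit may differ.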
Reply Rob 6/19/2010 2:02:35 PM

On Sat, 19 Jun 2010 10:02:35 -0400
Rob <junkmail3@nospicedham.lavabit.com> wrote:

> io_x wrote:
> > "io_x" <a@b.c.invalid> wrote in message
> > 
> > what about this, use only 2 instructions
> > 
> > ; input eax output in eax
> > cmp  eax,   1
> > sbb  eax, eax
> > ; if eax==0 => CF==1 => eax==-1
> > ; if eax!=0 => CF==0 => eax== 0
> > 
> > 
> > 
> 
> I added it in to get the following:  it looks like using the only the
> two instructions could help on the AMD (and I imagine if doing large
> sections of code with lots of tests it probably would be worthwhile).
> However, if it has to conform to the OP's exact requirements,
> changing it to:
> > 	cmp eax,1
> > 	sbb eax,eax
> > 	neg eax
> makes it the same as the other ones.  However, as Rod noted earlier,
> working with -1 and 0 might be better anyway.
....
> 
> I updated the files if anyone wants to test.

bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.023489901s  =    7.058 cycles/iteration
Noop 1                           -->   0.018399736s  =    5.529 cycles/iteration
Noop 2                           -->   0.016479211s  =    4.952 cycles/iteration
Noop 3                           -->   0.023974400s  =    7.204 cycles/iteration
Noop 4                           -->   0.009346287s  =    2.808 cycles/iteration
Rod Pemberton 1                  -->   0.020491987s  =    6.157 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.020545233s  =    6.173 cycles/iteration
Rod Pemberton 2                  -->   0.020901664s  =    6.280 cycles/iteration
Bernhard Schornak                -->   0.011462474s  =    3.444 cycles/iteration
Bernhard Schornak 2              -->   0.027416318s  =    8.238 cycles/iteration
Rob                              -->   0.011489957s  =    3.452 cycles/iteration
io_x (true=-1)                   -->   0.009982938s  =    2.999 cycles/iteration
io_x (conforms)                  -->   0.020529672s  =    6.169 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.023554749s  =    7.078 cycles/iteration
Noop 1                           -->   0.006417787s  =    1.928 cycles/iteration
Noop 2                           -->   0.003076953s  =    0.924 cycles/iteration
Noop 3                           -->   0.008103817s  =    2.435 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.005753219s  =    1.728 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.005811873s  =    1.746 cycles/iteration
Rod Pemberton 2                  -->   0.005788437s  =    1.739 cycles/iteration
Bernhard Schornak                -->   0.000000000s  =    0.000 cycles/iteration
Bernhard Schornak 2              -->   0.010484905s  =    3.150 cycles/iteration
Rob                              -->   0.000000000s  =    0.000 cycles/iteration
io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
io_x (conforms)                  -->   0.006127355s  =    1.841 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.026043430s  =    7.826 cycles/iteration
Noop 1                           -->   0.003932810s  =    1.181 cycles/iteration
Noop 2                           -->   0.000602773s  =    0.181 cycles/iteration
Noop 3                           -->   0.005581371s  =    1.677 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.003263918s  =    0.980 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.006819594s  =    2.049 cycles/iteration
Rod Pemberton 2                  -->   0.018136679s  =    5.450 cycles/iteration
Bernhard Schornak                -->   0.008962243s  =    2.693 cycles/iteration
Bernhard Schornak 2              -->   0.025032740s  =    7.522 cycles/iteration
Rob                              -->   0.008919725s  =    2.680 cycles/iteration
io_x (true=-1)                   -->   0.006785378s  =    2.039 cycles/iteration
io_x (conforms)                  -->   0.018474077s  =    5.551 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.026464334s  =    7.952 cycles/iteration
Noop 1                           -->   0.003503553s  =    1.052 cycles/iteration
Noop 2                           -->   0.000175563s  =    0.052 cycles/iteration
Noop 3                           -->   0.005157638s  =    1.549 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.011589178s  =    3.482 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.017831240s  =    5.358 cycles/iteration
Rod Pemberton 2                  -->   0.017632329s  =    5.298 cycles/iteration
Bernhard Schornak                -->   0.008528689s  =    2.562 cycles/iteration
Bernhard Schornak 2              -->   0.024680431s  =    7.416 cycles/iteration
Rob                              -->   0.008536761s  =    2.565 cycles/iteration
io_x (true=-1)                   -->   0.006846476s  =    2.057 cycles/iteration
io_x (conforms)                  -->   0.017597269s  =    5.287 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.025945025s  =    7.796 cycles/iteration
Noop 1                           -->   0.004053231s  =    1.217 cycles/iteration
Noop 2                           -->   0.012710062s  =    3.819 cycles/iteration
Noop 3                           -->   0.021561162s  =    6.479 cycles/iteration
Noop 4                           -->   0.007167236s  =    2.153 cycles/iteration
Rod Pemberton 1                  -->   0.018039244s  =    5.420 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.018007566s  =    5.411 cycles/iteration
Rod Pemberton 2                  -->   0.018331825s  =    5.508 cycles/iteration
Bernhard Schornak                -->   0.009022759s  =    2.711 cycles/iteration
Bernhard Schornak 2              -->   0.024985062s  =    7.508 cycles/iteration
Rob                              -->   0.009049770s  =    2.719 cycles/iteration
io_x (true=-1)                   -->   0.007222247s  =    2.170 cycles/iteration
io_x (conforms)                  -->   0.018110963s  =    5.442 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.025947498s  =    7.797 cycles/iteration
Noop 1                           -->   0.004020106s  =    1.208 cycles/iteration
Noop 2                           -->   0.000731383s  =    0.219 cycles/iteration
Noop 3                           -->   0.005678014s  =    1.706 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.003350944s  =    1.006 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.003460227s  =    1.039 cycles/iteration
Rod Pemberton 2                  -->   0.003352601s  =    1.007 cycles/iteration
Bernhard Schornak                -->   0.000000000s  =    0.000 cycles/iteration
Bernhard Schornak 2              -->   0.008163741s  =    2.453 cycles/iteration
Rob                              -->   0.000000000s  =    0.000 cycles/iteration
io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
io_x (conforms)                  -->   0.003424348s  =    1.029 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ 

Greets, Branimir!

-- 
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-18 12:45 .

Reply Branimir 6/19/2010 3:25:57 PM

"wolfgang kern" <nowhere@never.at> wrote in message
news:hvhrj0$tjs$1@newsreader2.utanet.at...
> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

Following the same line:

cmp  eax, 1
mov  eax, 0
setc al

or

cmp  eax, 1
mov  eax, 0
adc  eax, eax



Reply io_x 6/20/2010 6:06:45 AM

"wolfgang kern" <nowhere@never.at> wrote in message
news:hvhrj0$tjs$1@newsreader2.utanet.at...
>
> "io_x" posted:
> ...
>>> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?
>
> Sure.
>
>> what about this, use only 2 instructions
>
>> ; input eax output in eax
>> cmp  eax,   1
>> sbb  eax, eax
>> ; if eax==0 => CF==1 => eax==-1
>> ; if eax!=0 => CF==0 => eax== 0
>
> Congrats Rosario, you finally got it ! :)

Yes, but someone else wrote it first.
For me it is a game to learn something.

> and to fit OP's request for 1 and 0 (instead of true/false)
> just add an INC eax to your two line code.

I can add a "neg eax" [I hope that neg(0)==0] or an "and  eax,  1" like Rob

> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

if eax==0 => CF==1 => eax=1-1=0
if eax!=0 => CF==0 => eax=1-0=1
so there is something wrong: this returns 0 for 0 and 1 for nonzero,
which is the input normalized, not its logical not.

I think that could be right too, if we consider
eax==0  false
eax==1  true

so the "logical not" of these posts would be just
xor  eax,  1
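As a rough check of that idea, the following C sketch compares flipping bit 0 with C's own logical NOT. The function names are illustrative only; the point is that the two agree only when the input is already 0 or 1.

```c
#include <stdint.h>

/* Model of: xor eax, 1. Only bit 0 is flipped. */
static uint32_t flip_bit0(uint32_t eax)
{
    return eax ^ 1u;
}

/* C's logical NOT for comparison: 0 -> 1, any nonzero value -> 0. */
static uint32_t c_lnot(uint32_t eax)
{
    return !eax;
}
```

For an input such as 2, flip_bit0 gives 3 while c_lnot gives 0, so the single xor is only enough once the value has been normalized.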




Reply io_x 6/20/2010 5:35:03 PM

"wolfgang kern" <nowhere@never.at> wrote in message
news:hvhrj0$tjs$1@newsreader2.utanet.at...
>
> "io_x" posted:
>
> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

Yes, this could be the code that "normalizes" to
true  == 1  [before, true was any value different from 0]
false == 0

So the code that gives the right result here should be
; normalize
 cmp eax, 1  ; cy if 0 only
 mov eax, 1
 sbb eax, 0  ; = 0 if it was zero only
; if eax==0 => CF==1 => eax=1-1=0
; if eax!=0 => CF==0 => eax=1
; invert
 xor eax, 1




Reply io_x 6/20/2010 6:11:30 PM

Well, I set it up so that each piece of code is executed 4 times.
I also cleaned up the code a bit and changed it so that the code is inlined via 
macros instead of called as a procedure. That makes it a little harder to test C 
code, which is what it was set up for in the beginning; it is still possible, but 
the C code would have to be called from the loop. Interestingly, inlining seemed 
to make quite a difference in the timings.

On the Pentium 4 I got (a couple times as it varied a little bit):

> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001999166s  =    1.199 cycles/iteration
> Noop 1                           -->    0.008380969s  =    5.028 cycles/iteration
> Noop 2                           -->    0.008973599s  =    5.384 cycles/iteration
> Noop 3                           -->    0.044951069s  =   26.970 cycles/iteration
> Noop 4                           -->    0.044929676s  =   26.957 cycles/iteration
> Rod Pemberton 1                  -->    0.051656874s  =   30.994 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.051574552s  =   30.944 cycles/iteration
> Rod Pemberton 2                  -->    0.051833085s  =   31.099 cycles/iteration
> Rob                              -->    0.051710325s  =   31.026 cycles/iteration
> Bernhard Schornak 1              -->    0.044852599s  =   26.911 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.045006106s  =   27.003 cycles/iteration
> Bernhard Schornak 2              -->    0.086183280s  =   51.709 cycles/iteration
> Rob 2                            -->    0.045101107s  =   27.060 cycles/iteration
> io_x (true=-1)                   -->    0.045004517s  =   27.002 cycles/iteration
> io_x (conforms)                  -->    0.051889863s  =   31.133 cycles/iteration
> io_x 2                           -->    0.044870523s  =   26.922 cycles/iteration
> io_x 3                           -->    0.044915668s  =   26.949 cycles/iteration


> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001999438s  =    1.199 cycles/iteration
> Noop 1                           -->    0.008288132s  =    4.972 cycles/iteration
> Noop 2                           -->    0.008926663s  =    5.355 cycles/iteration
> Noop 3                           -->    0.045298991s  =   27.179 cycles/iteration
> Noop 4                           -->    0.044975481s  =   26.985 cycles/iteration
> Rod Pemberton 1                  -->    0.051847143s  =   31.108 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.051833966s  =   31.100 cycles/iteration
> Rod Pemberton 2                  -->    0.051825479s  =   31.095 cycles/iteration
> Rob                              -->    0.051602658s  =   30.961 cycles/iteration
> Bernhard Schornak 1              -->    0.044904328s  =   26.942 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.044894118s  =   26.936 cycles/iteration
> Bernhard Schornak 2              -->    0.087300608s  =   52.380 cycles/iteration
> Rob 2                            -->    0.044952695s  =   26.971 cycles/iteration
> io_x (true=-1)                   -->    0.044967082s  =   26.980 cycles/iteration
> io_x (conforms)                  -->    0.051672787s  =   31.003 cycles/iteration
> io_x 2                           -->    0.044765342s  =   26.859 cycles/iteration
> io_x 3                           -->    0.045021009s  =   27.012 cycles/iteration

And the AMD was (it seemed pretty consistent over several runs):

> ==> Running performance tests: (4000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.004679899s  =    2.028 cycles/iteration
> Noop 1                           -->    0.013860430s  =    6.008 cycles/iteration
> Noop 2                           -->    0.013862878s  =    6.009 cycles/iteration
> Noop 3                           -->    0.023104264s  =   10.015 cycles/iteration
> Noop 4                           -->    0.023104864s  =   10.015 cycles/iteration
> Rod Pemberton 1                  -->    0.023106142s  =   10.016 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.023104913s  =   10.015 cycles/iteration
> Rod Pemberton 2                  -->    0.023105420s  =   10.016 cycles/iteration
> Rob                              -->    0.023090743s  =   10.009 cycles/iteration
> Bernhard Schornak 1              -->    0.023199651s  =   10.057 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.020766501s  =    9.002 cycles/iteration
> Bernhard Schornak 2              -->    0.023089921s  =   10.009 cycles/iteration
> Rob 2                            -->    0.013848657s  =    6.003 cycles/iteration
> io_x (true=-1)                   -->    0.013847794s  =    6.003 cycles/iteration
> io_x (conforms)                  -->    0.023092167s  =   10.010 cycles/iteration
> io_x 2                           -->    0.013835934s  =    5.997 cycles/iteration
> io_x 3                           -->    0.013909315s  =    6.029 cycles/iteration

Reply Rob 6/21/2010 2:40:16 AM

Ok, here's the final setup that I'm going to post - let's see what you guys get.

> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001998761s  =    1.199 cycles/iteration
> Noop 1                           -->    0.003056183s  =    1.833 cycles/iteration
> Noop 2                           -->    0.012885864s  =    7.731 cycles/iteration
> Noop 3                           -->    0.009790604s  =    5.874 cycles/iteration
> Noop 4                           -->    0.009731597s  =    5.838 cycles/iteration
> Rod Pemberton 1                  -->    0.011450241s  =    6.870 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.011411202s  =    6.846 cycles/iteration
> Rod Pemberton 2                  -->    0.011408713s  =    6.845 cycles/iteration
> Rob                              -->    0.011823475s  =    7.094 cycles/iteration
> Bernhard Schornak 1              -->    0.009752518s  =    5.851 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.009820718s  =    5.892 cycles/iteration
> Bernhard Schornak 2              -->    0.020539177s  =   12.323 cycles/iteration
> Rob 2                            -->    0.010207311s  =    6.124 cycles/iteration
> io_x (true=-1)                   -->    0.009836334s  =    5.901 cycles/iteration
> io_x (conforms)                  -->    0.011562429s  =    6.937 cycles/iteration
> io_x 2                           -->    0.009838613s  =    5.903 cycles/iteration
> io_x 3                           -->    0.009800646s  =    5.880 cycles/iteration


> ==> Running performance tests: (4000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.004645913s  =    2.014 cycles/iteration
> Noop 1                           -->    0.004628110s  =    2.006 cycles/iteration
> Noop 2                           -->    0.004617581s  =    2.001 cycles/iteration
> Noop 3                           -->    0.002323831s  =    1.007 cycles/iteration
> Noop 4                           -->    0.002323585s  =    1.007 cycles/iteration
> Rod Pemberton 1                  -->    0.002323169s  =    1.007 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.002300807s  =    0.997 cycles/iteration
> Rod Pemberton 2                  -->    0.002310613s  =    1.001 cycles/iteration
> Rob                              -->    0.002323304s  =    1.007 cycles/iteration
> Bernhard Schornak 1              -->    0.002324694s  =    1.007 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.000818792s  =    0.354 cycles/iteration
> Bernhard Schornak 2              -->    0.002324507s  =    1.007 cycles/iteration
> Rob 2                            -->    0.002323687s  =    1.007 cycles/iteration
> io_x (true=-1)                   -->    0.000000000s  =    0.000 cycles/iteration
> io_x (conforms)                  -->    0.002313310s  =    1.002 cycles/iteration
> io_x 2                           -->    0.000007071s  =    0.003 cycles/iteration
> io_x 3                           -->    0.000007169s  =    0.003 cycles/iteration

Code tested was as follows:

> timer_start time1,"Noop 1"
> 	cmp eax,0
> 	je @f
> 	mov eax,1
> @@: xor eax,1
> timer_end
> 
> timer_start time2,"Noop 2"
> 	test eax, eax
> 	jz @f
> 	mov eax, 1
> @@:	xor eax, 1
> timer_end
> 
> timer_start time3,"Noop 3"
> 	or eax,eax
> 	setz al
> 	and eax,1
> timer_end
> 
> timer_start time4,"Noop 4"
> 	or eax,eax
> 	setz al
> 	movzx eax,al
> timer_end
> 
> timer_start time5,"Rod Pemberton 1"
> 	cmp eax,1
> 	sbb eax,eax
> 	and eax,1
> timer_end
> 
> timer_start time5a,"Rod Pemberton 1 (dword)"
> 	cmp eax,dword 1
> 	sbb eax,eax
> 	and eax,dword 1
> timer_end
> 
> timer_start time6,"Rod Pemberton 2"
> 	neg eax
> 	sbb eax,eax
> 	inc eax
> timer_end
> 
> timer_start time7,"Rob"
> 	neg eax
> 	sbb eax,eax
> 	add eax,1
> timer_end
> 
> timer_start time8,"Bernhard Schornak 1"
> 	test eax,eax
> 	sete al
> 	movzx eax,al
> timer_end
> 
> timer_start time8a,"Bernhard Schornak 1 rearranged"
> 	test eax,eax
> 	movzx eax,al
> 	sete al
> timer_end
> 
> timer_start time9,"Bernhard Schornak 2"
> 	test eax,eax
> 	cmovne eax,[.zero]
> 	cmove eax,[.one]
> timer_end
> align 4
> .zero	dd 0
> .one	dd 1
> 
> timer_start time10,"Rob 2"
> 	test eax,eax
> 	mov eax,0
> 	cmove eax,[.one]
> timer_end
> align 4
> .one dd 1
> 
> timer_start time11,"io_x (true=-1)"
> 	cmp eax,1
> 	sbb eax,eax
> timer_end
> 
> timer_start time12,"io_x (conforms)"
> 	cmp eax,1
> 	sbb eax,eax
> 	neg eax
> timer_end
> 
> timer_start time13,"io_x 2"
> 	cmp eax,1
> 	mov  eax,0
> 	setc al
> timer_end
> 
> timer_start time14,"io_x 3"
> 	cmp eax,1
> 	mov eax,0
> 	adc eax, eax
> timer_end


Source is here:  (I put a couple of different types of archives up in case 
someone doesn't have one of them).  I also added a (brief) README to explain it 
a little bit.
http://70.53.58.62/Forums/clax/Fastest%20logical%20not.7z
http://70.53.58.62/Forums/clax/Fastest%20logical%20not.tar.bz2
Reply Rob 6/21/2010 3:17:07 AM

On Sun, 20 Jun 2010 23:17:07 -0400
Rob <junkmail3@nospicedham.lavabit.com> wrote:

> Ok, here's the final setup that I'm going to post - let's see what
> you guys get.
> 
This one looks better:

bmaxa@maxa:~/Desktop$ ./timer

Linux Code Benchmarking Tool version 0.1, 2010/06/06

==> Running Conformance Tests:
Noop 1                           --> 00000001h -            1
Noop 2                           --> 00000001h -            1
Noop 3                           --> 00000001h -            1
Noop 4                           --> 00000001h -            1
Rod Pemberton 1                  --> 00000001h -            1
Rod Pemberton 1 (dword)          --> 00000001h -            1
Rod Pemberton 2                  --> 00000001h -            1
Rob                              --> 00000001h -            1
Bernhard Schornak 1              --> 00000001h -            1
Bernhard Schornak 1 rearranged   --> 00000001h -            1
Bernhard Schornak 2              --> 00000001h -            1
Rob 2                            --> 00000001h -            1
io_x (true=-1)                   --> FFFFFFFFh -   4294967295
io_x (conforms)                  --> 00000001h -            1
io_x 2                           --> 00000001h -            1
io_x 3                           --> 00000001h -            1

==> Calculating CPU clockspeed... (Attempting to boost process priority to level 99)
** No user permissions to boost priority:  the CPU calculation should still work though. **

==> Running performance tests: (4000000 iterations)
    Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.002000860s  =    1.503 cycles/iteration
Noop 1                           -->    0.005968339s  =    4.483 cycles/iteration
Noop 2                           -->    0.003528693s  =    2.650 cycles/iteration
Noop 3                           -->    0.003667782s  =    2.755 cycles/iteration
Noop 4                           -->    0.002459816s  =    1.847 cycles/iteration
Rod Pemberton 1                  -->    0.003746486s  =    2.814 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.003726928s  =    2.799 cycles/iteration
Rod Pemberton 2                  -->    0.003751117s  =    2.818 cycles/iteration
Rob                              -->    0.003477926s  =    2.612 cycles/iteration
Bernhard Schornak 1              -->    0.002486430s  =    1.867 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.003646139s  =    2.739 cycles/iteration
Bernhard Schornak 2              -->    0.004498696s  =    3.379 cycles/iteration
Rob 2                            -->    0.002741681s  =    2.059 cycles/iteration
io_x (true=-1)                   -->    0.002482969s  =    1.865 cycles/iteration
io_x (conforms)                  -->    0.003477933s  =    2.612 cycles/iteration
io_x 2                           -->    0.003670713s  =    2.757 cycles/iteration
io_x 3                           -->    0.002851575s  =    2.142 cycles/iteration

Greets, Branimir

-- 
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

Reply Branimir 6/21/2010 3:43:40 AM

On Mon, 2010-06-21 at 05:43 +0200, Branimir Maksimovic wrote:
> On Sun, 20 Jun 2010 23:17:07 -0400
> Rob <junkmail3@nospicedham.lavabit.com> wrote:
>
> > Ok, here's the final setup that I'm going to post - let's see what
> > you guys get.

There must be something wrong with your processor as mine outperforms
yours by a massive margin!

==> Running performance tests: (4000000 iterations)
    Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz Processor Detected @ 2791 MHz
Reference Procedure timing took 0.004559407s  =    3.181 cycles/iteration
Noop 1                           -->    0.004874103s  =    3.400 cycles/iteration
Noop 2                           -->    0.002916938s  =    2.035 cycles/iteration
Noop 3                           -->    0.020436658s  =   14.259 cycles/iteration
Noop 4                           -->    0.019746686s  =   13.778 cycles/iteration
Rod Pemberton 1                  -->    0.022776560s  =   15.892 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.022845334s  =   15.940 cycles/iteration
Rod Pemberton 2                  -->    0.022944944s  =   16.009 cycles/iteration
Rob                              -->    0.022822393s  =   15.924 cycles/iteration
Bernhard Schornak 1              -->    0.020395952s  =   14.231 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.017738976s  =   12.377 cycles/iteration
Bernhard Schornak 2              -->    0.038830060s  =   27.093 cycles/iteration
Rob 2                            -->    0.018485669s  =   12.898 cycles/iteration
io_x (true=-1)                   -->    0.020139807s  =   14.052 cycles/iteration
io_x (conforms)                  -->    0.022526891s  =   15.718 cycles/iteration
io_x 2                           -->    0.018357516s  =   12.808 cycles/iteration
io_x 3                           -->    0.020623856s  =   14.390 cycles/iteration

-- 
http://www.munted.org.uk

One very high maintenance cat living here.

Reply Mr 6/21/2010 7:54:20 AM
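
The "cycles/iteration" column in these listings appears to be the elapsed time multiplied by the detected clock rate and divided by the iteration count; the printed figures agree with that to within rounding. A small sketch of the conversion follows. The function name is made up here, and how Rob's program rounds or measures the clock rate is not known, so treat it as an assumption rather than his actual code. Note that a larger figure means more cycles spent, so lower is faster.

```python
def cycles_per_iteration(seconds, mhz, iterations):
    """Convert an elapsed time into clock cycles per loop iteration."""
    return seconds * mhz * 1_000_000 / iterations

# Figures taken from the Pentium 4 listing above (2791 MHz, 4000000 iterations).
reference = cycles_per_iteration(0.004559407, 2791, 4_000_000)
schornak2 = cycles_per_iteration(0.038830060, 2791, 4_000_000)
print(f"reference: {reference:.3f}  Bernhard Schornak 2: {schornak2:.3f}")
```

The two results land close to the 3.181 and 27.093 shown in the listing; the small difference presumably comes from the clock rate being printed as a whole number of MHz.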

"io_x" wrote:

>> I haven't checked if this perform any faster:
>> (in theory it should do within 3..4 cycles on AMD K8/10)

>> cmp eax, 1  ; cy if 0 only
>> mov eax, 1
>> sbb eax, 0  ; = 0 if it was zero only

> yes this could be the code that "normalize" to
> true  == 1  [first was true if it is a value different from 0]
> false == 0

Yes, this is 'positive LOGIC' as seen by engineers,
in opposition to HLL-coders :)

> so the code result good here should be
> ; normalize
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only
> ; if eax==0 => CF==1 => eax=1-1=0
> ; if eax!=0 => CF==0 => eax=1
> ; invert
> xor eax, 1

You can save the last line if you want it reversed:

cmp eax, 1
mov eax, 0
adc eax, 0   ;1 if it was zero only (0+WasZero)

but your C-styled true/false isn't bad at all:

[IF !=0]
 cmp eax, 1
 sbb eax,eax  ;-1 (false) only if it was 0

[IF ==0]
 cmp eax, 1
 mov eax,-1
 adc eax, 0   ;0 (true) only if it was 0

shorter in bytes, perhaps slower
[IF NOT !=0]
 cmp eax, 1
 sbb eax,eax
 not eax      ;same as 'xor eax, -1' except no flags altered

Even though I would never need 0/1 results,
my way would be anyway:

or eax,eax            ;get zero/sign flags
cmovnz eax,[one]      ;if it's 0 then let it be 0 :)


and my solution for the fastest 'logical NOT':

NOT eax               ;this seem to be designed for it!
or
XOR eax,-1            ;may be faster on some CPUs
__
wolfgang


Reply wolfgang 6/21/2010 8:25:01 AM
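
For anyone who wants to check the CMP/SBB/ADC variants above without an assembler, the carry-flag arithmetic can be modelled in a few lines of Python. This is only a sketch: the helper names (`cmp_carry`, `sbb`, `adc` and the three wrappers) are invented for illustration, registers are modelled as unsigned 32-bit integers, and only CF is tracked.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def cmp_carry(a, b):
    """CF after 'cmp a, b': set when a < b as unsigned numbers."""
    return 1 if (a & MASK32) < (b & MASK32) else 0

def sbb(dst, src, cf):
    """Result of 'sbb dst, src' (dst - src - CF), wrapped to 32 bits."""
    return (dst - src - cf) & MASK32

def adc(dst, src, cf):
    """Result of 'adc dst, src' (dst + src + CF), wrapped to 32 bits."""
    return (dst + src + cf) & MASK32

def not_minus_one(eax):
    # cmp eax,1 / sbb eax,eax  -> 0xFFFFFFFF (-1) if eax was 0, else 0
    cf = cmp_carry(eax, 1)
    return sbb(eax, eax, cf)

def not_zero_one(eax):
    # cmp eax,1 / mov eax,0 / adc eax,0  -> 1 if eax was 0, else 0
    cf = cmp_carry(eax, 1)
    return adc(0, 0, cf)

def normalize(eax):
    # cmp eax,1 / mov eax,1 / sbb eax,0  -> 0 if eax was 0, else 1
    cf = cmp_carry(eax, 1)
    return sbb(1, 0, cf)

for value in (0, 1, 2, 0x80000000, 0xFFFFFFFF):
    print(f"{value:#010x}: {not_minus_one(value):#010x} "
          f"{not_zero_one(value)} {normalize(value)}")
```

MOV is modelled by simply passing a constant as the destination, which relies on the fact that MOV leaves the flags from the preceding CMP untouched.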

On Mon, 21 Jun 2010 10:25:01 +0200
"wolfgang kern" <nowhere@never.at> wrote:

> 
> "io_x" wrote:
> 
> >> I haven't checked if this perform any faster:
> >> (in theory it should do within 3..4 cycles on AMD K8/10)
> 
.......
> 
> but your C-styled true/false isn't bad at all:

C didn't have a bool type. C-style true/false
is actually zero for false and non-zero (anything)
for true.
The same logic is used for pointers (true/false).

> 
> and my solution for the fastest 'logical NOT':
> 
> NOT eax               ;this seem to be designed for it!
> or
> XOR eax,-1            ;may be faster on some CPUs

Great!

> __
> wolfgang
> 
> 

Greets, Branimir

-- 
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

Reply Branimir 6/21/2010 9:42:03 AM

Branimir Maksimovic posted:
> "wolfgang kern" <nowhere@never.at> wrote:
....
>> but your C-styled true/false isn't bad at all:

> C didn't have a bool type. C-style true/false
> is actually zero for false and non-zero (anything)
> for true.
> The same logic is used for pointers (true/false).

Oh! that's why most code for windoze ends with
 mov esp,ebp
 pop ebp
 xor eax,eax   ;or an error code in eax  :)
 ret 16

__
wolfgang


Reply wolfgang 6/21/2010 10:04:27 AM

On Mon, 21 Jun 2010 08:54:20 +0100
Mr Sensible <alex.buell@nospicedham.munted.org.uk> wrote:

> On Mon, 2010-06-21 at 05:43 +0200, Branimir Maksimovic wrote:
> > On Sun, 20 Jun 2010 23:17:07 -0400
> > Rob <junkmail3@nospicedham.lavabit.com> wrote:
> > 
> > > Ok, here's the final setup that I'm going to post - let's see what
> > > you guys get.
> 
> There must be something wrong with your processor as mine outperforms
> yours by a massive margin!
> 
> ==> Running performance tests: (4000000 iterations)
>     Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz Processor Detected @ 2791 MHz
> Reference Procedure timing took 0.004559407s  =    3.181 cycles/iteration
> Noop 1                           -->    0.004874103s  =    3.400 cycles/iteration
> Noop 2                           -->    0.002916938s  =    2.035 cycles/iteration
> Noop 3                           -->    0.020436658s  =   14.259 cycles/iteration
> Noop 4                           -->    0.019746686s  =   13.778 cycles/iteration
> Rod Pemberton 1                  -->    0.022776560s  =   15.892 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.022845334s  =   15.940 cycles/iteration
> Rod Pemberton 2                  -->    0.022944944s  =   16.009 cycles/iteration
> Rob                              -->    0.022822393s  =   15.924 cycles/iteration
> Bernhard Schornak 1              -->    0.020395952s  =   14.231 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.017738976s  =   12.377 cycles/iteration
> Bernhard Schornak 2              -->    0.038830060s  =   27.093 cycles/iteration
> Rob 2                            -->    0.018485669s  =   12.898 cycles/iteration
> io_x (true=-1)                   -->    0.020139807s  =   14.052 cycles/iteration
> io_x (conforms)                  -->    0.022526891s  =   15.718 cycles/iteration
> io_x 2                           -->    0.018357516s  =   12.808 cycles/iteration
> io_x 3                           -->    0.020623856s  =   14.390 cycles/iteration
>
> 

==> Running performance tests: (4000000 iterations)
    Intel(R) Xeon(TM) CPU 2.80GHz Processor Detected @ 2791 MHz
Reference Procedure timing took 0.002098834s  =    1.464 cycles/iteration
Noop 1                           -->    0.002204584s  =    1.538 cycles/iteration
Noop 2                           -->    0.002208957s  =    1.541 cycles/iteration
Noop 3                           -->    0.007932801s  =    5.535 cycles/iteration
Noop 4                           -->    0.007932259s  =    5.534 cycles/iteration
Rod Pemberton 1                  -->    0.009366177s  =    6.535 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.009364695s  =    6.534 cycles/iteration
Rod Pemberton 2                  -->    0.009368607s  =    6.536 cycles/iteration
Rob                              -->    0.009954535s  =    6.945 cycles/iteration
Bernhard Schornak 1              -->    0.007932956s  =    5.535 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.007936561s  =    5.537 cycles/iteration
Bernhard Schornak 2              -->    0.016799123s  =   11.721 cycles/iteration
Rob 2                            -->    0.008283530s  =    5.779 cycles/iteration
io_x (true=-1)                   -->    0.007934037s  =    5.535 cycles/iteration
io_x (conforms)                  -->    0.009365197s  =    6.534 cycles/iteration
io_x 2                           -->    0.008029410s  =    5.602 cycles/iteration
io_x 3                           -->    0.008050396s  =    5.617 cycles/iteration
[bmaxa@devel ~]$ 

Greets, Branimir

-- 
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

Reply Branimir 6/21/2010 12:19:15 PM

"Mr Sensible" <alex.buell...> answered Rob:

>> Ok, here's the final setup that I'm going to post - let's see what
>> you guys get.

> There must be something wrong with your processor as mine outperforms
> yours by a massive margin!

> ==> Running performance tests: (4000000 iterations)
....
4 million iterations against 1 million used by Rob ?
....
what you both mainly measure here may be just OS background noise
and this will more depend on how often you moved the mouse during
the test rather than on the actual code snip timing :)
perhaps that's why your measurements vary from 0 to 27 cycles ?

I didn't time all the variants in this thread but io_x versions

cmp eax,1
sbb eax,eax
inc eax      ;or NEG eax

tested on:
AMD phenom II 3GHz quad
AMD athlon K8 1.8 GHZ dual
AMD K7 500MHz

always read two (three on K7) cycles more than an empty
(single NOP) test for the second iteration (see below);
first-iteration timing depends on cache status.

It looks like each of these three instructions can at least
execute and complete within one cycle, otherwise
we would see dependency penalties.
__
wolfgang

I posted this in ALA (1.June):
_______
.... RDTSC timing is machine specific (5...else cycles).
I think to have posted this several times already:

MOV esi,result_buf  ;16 bytes needed yet here
MOV byte[loop],01
CLI
L1:
;PUSH esi           ;if desired/required
  CPUID
  RDTSC
    MOV ebx,eax ;or: PUSH eax
    MOV ecx,edx ;or: PUSH edx
    ...             ;code under test (must preserve what's needed)
  RDTSC             ;without serialising yet !!!
    SUB eax,ebx ;or: SUB eax,[esp+4]
    SBB edx,ecx ;or: SBB edx,[esp]
                ;or: ADD esp,8
;POP esi
MOV [esi],eax       ;
MOV [esi+4],edx     ;store 64-bit cycle-count
ADD esi,8
DEC byte [loop]
JNS L1              ;loop the above just one more time
STI
* I have an INT3 here to watch results in the debugger view.

A first run (I intentionally avoid the term 'iterate' here)
would show a cycle count which implies cache-burst-reads.
The second test-run follows immediate still with IRQs disabled.

One million iterations may just measure OS-noise from IRQ-
taskswitching or whatsoever an OS may do behind your back.

I measure 6..7 cycles (almost just the RDTSC itself) on an
empty (single NOP) test on my current machine with the MOV-
version while the PUSH-variant takes 1..2 cycles more.

Usually I ignore the first 64-bit result (caching time)
for code variants compare, but the info there is very
useful for proper code alignment.

Note:
this method shouldn't be used on huge code parts,
too long disabled IRQs can result in stuck hardware.
__________
eof 


Reply wolfgang 6/21/2010 1:06:02 PM
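
The SUB/SBB pair in the listing above subtracts two 64-bit time stamps that are held as 32-bit halves (EDX:EAX minus ECX:EBX). A sketch of that arithmetic, using a hypothetical helper name and plain integers in place of registers:

```python
MASK32 = 0xFFFFFFFF

def tsc_delta(start_lo, start_hi, end_lo, end_hi):
    """Model 'sub eax,ebx / sbb edx,ecx' on two 64-bit counter readings."""
    borrow = 1 if end_lo < start_lo else 0          # CF produced by SUB
    lo = (end_lo - start_lo) & MASK32               # sub eax, ebx
    hi = (end_hi - start_hi - borrow) & MASK32      # sbb edx, ecx
    return (hi << 32) | lo

# The low half wraps between the two readings; the borrow corrects the high half.
print(tsc_delta(0xFFFFFFF0, 0x00000001, 0x00000010, 0x00000002))  # -> 32
```

Without the SBB (that is, with a plain SUB on the high halves) the example would report a difference of more than four billion cycles instead of 32.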

On Mon, 2010-06-21 at 14:19 +0200, Branimir Maksimovic wrote:

> Bernhard Schornak 2              -->    0.016799123s  =   11.721
> cycles/iteration

Mine:
Bernhard Schornak 2              -->    0.038830060s  =   27.093
cycles/iteration

I cannot believe my processor (Mobile P4 2.8GHz w/1MB L2 cache) is
faster than a Xeon (Xeon 2.8GHz). Something is NOT right - if I'm
reading this correctly, I'm taking 27 cycles per iteration whilst yours
takes 11.7 cycles per iteration.

Oh hang on, what *sort* of cycles are we talking about?
-- 
http://www.munted.org.uk

One very high maintenance cat living here.

Reply Mr 6/21/2010 1:36:06 PM

On Mon, 21 Jun 2010 15:06:02 +0200
"wolfgang kern" <nowhere@never.at> wrote:

> MOV esi,result_buf  ;16 bytes needed yet here
> MOV byte[loop],01
> CLI
> L1:
> ;PUSH esi           ;if desired/required
>   CPUID
>   RDTSC
>     MOV ebx,eax ;or: PUSH eax
>     MOV ecx,edx ;or: PUSH edx
>     ...             ;code under test (must preserve what's needed)
>   RDTSC             ;without serialising yet !!!
>     SUB eax,ebx ;or: SUB eax,[esp+4]
>     SBB edx,ecx ;or: SBB edx,[esp]
>                 ;or: ADD esp,8
> ;POP esi
> MOV [esi],eax       ;
> MOV [esi+4],edx     ;store 64-bit cycle-count
> ADD esi,8
> DEC byte [loop]
> JNS L1              ;loop the above just one more time
> STI
.....
> Note:
> this method shouldn't be used on huge code parts,
> too long disabled IRQs can result in stuck hardware.

What happens with this code when there are more than 1
cpus? I guess cli is valid only for current cpu?
So task has to be bound to single cpu?

Greets, Branimir



-- 
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

Reply Branimir 6/21/2010 2:34:42 PM

Branimir Maksimovic wrote:

> On Mon, 21 Jun 2010 15:06:02 +0200
> "wolfgang kern"<nowhere@never.at>  wrote:
>
>> MOV esi,result_buf  ;16 bytes needed yet here
>> MOV byte[loop],01
>> CLI
>> L1:
>> ;PUSH esi           ;if desired/required
>>    CPUID
>>    RDTSC
>>      MOV ebx,eax ;or: PUSH eax
>>      MOV ecx,edx ;or: PUSH edx
>>      ...             ;code under test (must preserve what's needed)
>>    RDTSC             ;without serialising yet !!!
>>      SUB eax,ebx ;or: SUB eax,[esp+4]
>>      SBB edx,ecx ;or: SBB edx,[esp]
>>                  ;or: ADD esp,8
>> ;POP esi
>> MOV [esi],eax       ;
>> MOV [esi+4],edx     ;store 64-bit cycle-count
>> ADD esi,8
>> DEC byte [loop]
>> JNS L1              ;loop the above just one more time
>> STI
> ....
>> Note:
>> this method shouldn't be used on huge code parts,
>> too long disabled IRQs can result in stuck hardware.
>
> What happens with this code when there are more than 1
> cpus? I guess cli is valid only for current cpu?
> So task has to be bound to single cpu?

Why should any OS split a single thread into parts and execute
them on separate cores? Threads generally are not executed on
more than one core, except when a 2nd instance was started before
the 1st one finished. On some, but not all OS', the 2nd instance
may be executed on another (idle) core.

All errors in Rob's test routines could be eliminated if the loop
count was reduced to 1,024 (or less) - actually, 16 tests were
sufficient to get a reliable result in cycles (rather than
micro- or nanoseconds) using RDTSC. Too many runs (millions in
our case) exceed the usual time slice by far, adding times for
unrelated threads + task switching to the results. Maybe a
MUST_COMPLETE or whatever this is called on <insert your OS> plus
setting that thread to the highest possible priority is required
to finish it without being 'preempted'...


Greetings from Augsburg

Bernhard Schornak
Reply Bernhard 6/21/2010 7:31:19 PM
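
The advice to use a handful of runs rather than millions, together with wolfgang's habit of setting aside the first (cache-cold) result, can be imitated from user mode. The sketch below swaps RDTSC for Python's `time.perf_counter_ns`, so it reports nanoseconds rather than cycles and is far coarser than the assembly version; the function names and the default of 16 runs are only illustrative.

```python
import time

def time_runs(func, runs=16):
    """Time func a few times; return (first_ns, best_of_rest_ns)."""
    if runs < 2:
        raise ValueError("need at least two runs")
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    # The first sample includes warm-up effects (cold caches and similar),
    # so it is reported separately instead of being mixed into the result.
    return samples[0], min(samples[1:])

def logical_not(value):
    """Stand-in for the code under test: 1 if value is 0, else 0."""
    return 0 if value else 1

first, best = time_runs(lambda: logical_not(42))
print(f"first run: {first} ns, best of the rest: {best} ns")
```

Taking the minimum of the remaining samples is one common way to reduce the effect of interruptions; it does not remove them the way disabling IRQs would.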

Branimir Maksimovic asked:

>> MOV esi,result_buf  ;16 bytes needed yet here
>> MOV byte[loop],01
>> CLI
>> L1:
>> ;PUSH esi           ;if desired/required
>>   CPUID
>>   RDTSC
>>     MOV ebx,eax ;or: PUSH eax
>>     MOV ecx,edx ;or: PUSH edx
>>     ...             ;code under test (must preserve what's needed)
>>   RDTSC             ;without serialising yet !!!
>>     SUB eax,ebx ;or: SUB eax,[esp+4]
>>     SBB edx,ecx ;or: SBB edx,[esp]
>>                 ;or: ADD esp,8
>> ;POP esi
>> MOV [esi],eax       ;
>> MOV [esi+4],edx     ;store 64-bit cycle-count
>> ADD esi,8
>> DEC byte [loop]
>> JNS L1              ;loop the above just one more time
>> STI
> ....
>> Note:
>> this method shouldn't be used on huge code parts,
>> too long disabled IRQs can result in stuck hardware.

> What happens with this code when there are more than 1
> cpus? I guess cli is valid only for current cpu?
> So task has to be bound to single cpu?

Bernhard already said it.
I cannot see how and when another CPU-core could intercept
this piece of code between the two RDTSCs as long
  the test code is short enough to remain cached and
  it doesn't switch tasks nor context by far calls.
__
wolfgang


Reply wolfgang 6/21/2010 8:10:10 PM
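
On the multi-core question: whether a scheduler moves a running thread to another core is OS-dependent, and pinning the process to one core before timing removes the question. A minimal sketch, assuming Linux, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name is invented here, and on other systems (or if the call is refused) it simply does nothing. This only restricts scheduling; it does not disable interrupts the way CLI does.

```python
import os

def pin_to_one_cpu():
    """Restrict this process to one CPU; return the previous CPU set or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None                      # not available on this platform
    previous = os.sched_getaffinity(0)   # 0 means "the calling process"
    try:
        os.sched_setaffinity(0, {min(previous)})
    except OSError:
        return None                      # the OS refused the request
    return previous

previous = pin_to_one_cpu()
if previous is not None:
    print("now running on:", os.sched_getaffinity(0))
    os.sched_setaffinity(0, previous)    # restore the original CPU set
else:
    print("CPU pinning not available here")
```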

Mr Sensible wrote:
> On Mon, 2010-06-21 at 14:19 +0200, Branimir Maksimovic wrote:
> 
>> Bernhard Schornak 2              -->    0.016799123s  =   11.721
>> cycles/iteration
> 
> Mine:
> Bernhard Schornak 2              -->    0.038830060s  =   27.093
> cycles/iteration
> 
> I cannot believe my processor (Mobile P4 2.8GHz w/1MB L2 cache) is
> faster than a Xeon (Xeon 2.8GHz). Something is NOT right - if I'm
> reading this correctly, I'm taking 27 cycles per iteration whilst yours
> takes 11.7 cycles per iteration. 
> 
> Oh hang on, what *sort* of cycles are we talking about?

Thanks for testing!  It makes me wonder though what could be going on with the 
code since your P4 has quite different results than mine.  Are you running many 
other programs at the same time as testing it?
I run it with all my other programs closed (although I am still in X).  In 
theory, it shouldn't make much of a difference, but I found in practice it did 
(at least in consistent results).

I was meaning them as clock cycles:  IOW, lower is faster (the actual speed of 
execution is dependent on the clock speed though).

Thanks,
Rob
Reply Rob 6/22/2010 4:11:48 AM

"Branimir Maksimovic" <bmaxa@nospicedham.hotmail.com> wrote in message
news:hvnta3$2br$9@solani.org...
> What happens with this code when there are more than 1
> cpus?

Uh...   RDTSC is replaced with RDTSCP?

There are a number of x86 microprocessors that have problems
with their TSC, either drift or artificially locked.

"TSC (cpu TimeStamp Counter)
 - RDTSC instruction on Pentium or later cpu's
   - 64-bit counter
   - one count per cpu clock
   - overhead of 11 clocks
   - Efficeon updates TSC counters at maximum frequency
   - Efficeon doesn't properly update at actual cpu clock speed
   - TSC drifts on AMD K8 and dual-core platforms
   - TSC is not frequency independent on AMD K8 and dual-core
   - used by Linux
   - used by MS Vista
   - used by MS SMP HAL Windows
 - RDTSCP instruction on AMD NPT 0F, e.g., F, AM2, S1g1 cpus
   - can be affected by power management events
   - bit returned by CPUID indicates power invariant
"
http://groups.google.com/group/alt.os.development/msg/63f2d9cbf900b39e

Quotes of issues with RDTSC instruction:
http://groups.google.com/group/comp.lang.asm.x86/msg/30ea02cbe2dc74ee


Rod Pemberton



Reply Rod 6/22/2010 8:58:57 AM
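
On Linux, the kernel lists CPU features as flag names in /proc/cpuinfo; `tsc`, `rdtscp`, `constant_tsc` and `nonstop_tsc` are the ones that bear on the issues quoted above. A rough sketch that parses such text follows. The function name is made up, the sample line is shortened for illustration, and the check says nothing about TSC synchronisation between cores.

```python
TSC_FLAGS = ("tsc", "rdtscp", "constant_tsc", "nonstop_tsc")

def tsc_features(cpuinfo_text):
    """Return which TSC-related flags appear in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in TSC_FLAGS}
    # No "flags" line found: report every feature as absent.
    return {flag: False for flag in TSC_FLAGS}

sample = "flags\t\t: fpu tsc msr rdtscp constant_tsc"
print(tsc_features(sample))

# On a Linux machine the real data could be read like this:
# with open("/proc/cpuinfo") as handle:
#     print(tsc_features(handle.read()))
```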

On Mon, 21 Jun 2010 15:06:02 +0200, "wolfgang
kern" <nowhere@never.at> wrote:

<snip>

>_______
>... RDTSC timing is machine specific (5...else cycles).
>I think to have posted this several times already:
>
>MOV esi,result_buf  ;16 bytes needed yet here
>MOV byte[loop],01
>CLI
>L1:
>;PUSH esi           ;if desired/required
>  CPUID
>  RDTSC
>    MOV ebx,eax ;or: PUSH eax
>    MOV ecx,edx ;or: PUSH edx
>    ...             ;code under test (must preserve what's needed)
>  RDTSC             ;without serialising yet !!!
>    SUB eax,ebx ;or: SUB eax,[esp+4]
>    SBB edx,ecx ;or: SBB edx,[esp]
>                ;or: ADD esp,8
>;POP esi
>MOV [esi],eax       ;
>MOV [esi+4],edx     ;store 64-bit cycle-count
>ADD esi,8
>DEC byte [loop]
>JNS L1              ;loop the above just one more time
>STI
>* I have an INT3 here to watch results in the debugger view.
>
>A first run (I intentionally avoid the term 'iterate' here)
>would show a cycle count which implies cache-burst-reads.
>The second test-run follows immediate still with IRQs disabled.
>
>One million iterations may just measure OS-noise from IRQ-
>taskswitching or whatsoever an OS may do behind your back.
>
>I measure 6..7 cycles (almost just the RDTSC itself) on an
>empty (single NOP) test on my current machine with the MOV-
>version while the PUSH-variant takes 1..2 cycles more.
>
>Usually I ignore the first 64-bit result (caching time)
>for code variants compare, but the info there is very
>useful for proper code alignment.
>
>Note:
>this method shouldn't be used on huge code parts,
>too long disabled IRQs can result in stuck hardware.

I am surprised to see you using CLI to disable
IRQs.  As far as I know, there is no way to do
this in Windows from user mode.  Does *nix allow
this?  (Seems dangerous, and out of keeping with
the usual ideas of protected mode.)  Or do you
have a way to run in Ring 0? ... Or in real mode?

Best regards,


Bob Masta
 
              DAQARTA  v5.10
   Data AcQuisition And Real-Time Analysis
              www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
    Frequency Counter, FREE Signal Generator
           Pitch Track, Pitch-to-MIDI 
         DaqMusic - FREE MUSIC, Forever!
             (Some assembly required)
     Science (and fun!) with your sound card!
Reply N0Spam 6/22/2010 12:42:07 PM

"Bob Masta" asked:
> <snip>

>>_______
>>... RDTSC timing is machine specific (5...else cycles).
>>I think to have posted this several times already:
>>
>>MOV esi,result_buf  ;16 bytes needed yet here
>>MOV byte[loop],01
>>CLI
>>L1:
>>;PUSH esi           ;if desired/required
>>  CPUID
>>  RDTSC
>>    MOV ebx,eax ;or: PUSH eax
>>    MOV ecx,edx ;or: PUSH edx
>>    ...             ;code under test (must preserve what's needed)
>>  RDTSC             ;without serialising yet !!!
>>    SUB eax,ebx ;or: SUB eax,[esp+4]
>>    SBB edx,ecx ;or: SBB edx,[esp]
>>                ;or: ADD esp,8
>>;POP esi
>>MOV [esi],eax       ;
>>MOV [esi+4],edx     ;store 64-bit cycle-count
>>ADD esi,8
>>DEC byte [loop]
>>JNS L1              ;loop the above just one more time
>>STI
>>* I have an INT3 here to watch results in the debugger view.

>>A first run (I intentionally avoid the term 'iterate' here)
>>would show a cycle count which implies cache-burst-reads.
>>The second test-run follows immediate still with IRQs disabled.

>>One million iterations may just measure OS-noise from IRQ-
>>taskswitching or whatsoever an OS may do behind your back.

>>I measure 6..7 cycles (almost just the RDTSC itself) on an
>>empty (single NOP) test on my current machine with the MOV-
>>version while the PUSH-variant takes 1..2 cycles more.

>>Usually I ignore the first 64-bit result (caching time)
>>for code variants compare, but the info there is very
>>useful for proper code alignment.

>>Note:
>>this method shouldn't be used on huge code parts,
>>too long disabled IRQs can result in stuck hardware.

> I am surprised to see you using CLI to disable
> IRQs.  As far as I know, there is no way to do
> this in Windows from user mode.  Does *nix allow
> this?  (Seems dangerous, and out of keeping with
> the usual ideas of protected mode.)  Or do you
> have a way to run in Ring 0? ... Or in real mode?

Can't tell for L'unix ...
my OS and applications always run with PL==0 (ring 0),
for windoze XP there are ways to get 'ADMIN'-IO/PL rights
(win 95/98 allowed CLI/STI without further permission).
I don't know if vista or win7 may grant you IO-PL changes too.

__
wolfgang


Reply wolfgang 6/23/2010 11:04:16 AM
