
### Re: Fastest logical not?


```"io_x" <a@b.c.invalid> wrote in message news:...
> perhaps this could be ok
> ; input in eax output in edx trash eax and edx
> xor  edx, edx
> sub  eax,   1
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
> ; always, if I understand when sub sets the carry flag

this one does not change eax :)

; input in eax output in edx trash edx only
xor  edx, edx
cmp  eax,   1
; if eax==0 => CF==1 => edx=1
; if eax!=0 => CF==0 => edx=0

is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

```


```io_x wrote:
> "io_x" <a@b.c.invalid> wrote in message news:...
>
> this one does not change eax :)
>
> ; input in eax output in edx trash edx only
> xor  edx, edx
> cmp  eax,   1
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
>
> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

Yes, they have the same effect on the flags; the only difference is that
"cmp eax, 1" has no effect on eax.
```
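
The flag equivalence confirmed above can be illustrated with a small C model. This is only a sketch, assuming the usual x86 rule that CF is set when an unsigned subtraction borrows; the function names `carry_after_sub`, `model_sub` and `model_cmp` are made up for the illustration and are not from the thread.

```c
#include <stdint.h>

/* Model of the x86 carry flag after "sub a, b" or "cmp a, b":
   CF is set when the unsigned subtraction borrows, i.e. when a < b. */
int carry_after_sub(uint32_t a, uint32_t b)
{
    return a < b;
}

/* "sub eax, imm": returns the new eax and stores CF through *cf. */
uint32_t model_sub(uint32_t eax, uint32_t imm, int *cf)
{
    *cf = carry_after_sub(eax, imm);
    return eax - imm;            /* the difference is written back to eax */
}

/* "cmp eax, imm": same flag computation, but eax is left untouched. */
uint32_t model_cmp(uint32_t eax, uint32_t imm, int *cf)
{
    *cf = carry_after_sub(eax, imm);
    return eax;                  /* the difference is discarded */
}
```

With an immediate of 1, both models report CF set only for an input of 0, which is the property the rest of the thread relies on.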

```"io_x" <a@b.c.invalid> wrote in message
> this one does not change eax :)
>
> ; input in eax output in edx trash edx only
> xor  edx, edx
> cmp  eax,   1
> ; if eax==0 => CF==1 => edx=1
> ; if eax!=0 => CF==0 => edx=0
>
> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

; input eax output in eax
cmp  eax,   1
sbb  eax, eax
; if eax==0 => CF==1 => eax==-1
; if eax!=0 => CF==0 => eax== 0

```
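
A C model of the two-instruction sequence above may make the carry trick easier to follow. It is a sketch, assuming `sbb eax, eax` computes `eax - eax - CF`; the name `not_as_mask` is hypothetical.

```c
#include <stdint.h>

/* Model of:  cmp eax, 1  /  sbb eax, eax
   cmp sets CF when eax < 1 (unsigned), i.e. only when eax == 0.
   sbb eax, eax computes eax - eax - CF, which is 0 - CF. */
uint32_t not_as_mask(uint32_t eax)
{
    uint32_t cf = (eax < 1u);    /* CF after cmp eax, 1 */
    return eax - eax - cf;       /* all ones if the input was 0, else 0 */
}
```

The result is the 0 / -1 convention for false / true, not the 0 / 1 convention.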

```"io_x" posted:
....
>> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?

Sure.

> ; input eax output in eax
> cmp  eax,   1
> sbb  eax, eax
> ; if eax==0 => CF==1 => eax==-1
> ; if eax!=0 => CF==0 => eax== 0

Congrats Rosario, you finally got it ! :)

and to fit the OP's request for 1 and 0 (instead of true/false)

I haven't checked whether this performs any faster:
(in theory it should complete within 3..4 cycles on AMD K8/10)

cmp eax, 1  ; cy if 0 only
mov eax, 1
sbb eax, 0  ; = 0 if it was zero only

__
wolfgang

```

```io_x wrote:
> "io_x" <a@b.c.invalid> wrote in message
>
>
> ; input eax output in eax
> cmp  eax,   1
> sbb  eax, eax
> ; if eax==0 => CF==1 => eax==-1
> ; if eax!=0 => CF==0 => eax== 0
>
>
>

I added it in and got the following. It looks like using only the two
instructions could help on the AMD (and I imagine that in large sections of
code with lots of tests it would probably be worthwhile).
However, if it has to conform to the OP's exact requirements, changing it to:
> 	cmp eax,1
> 	sbb eax,eax
> 	neg eax
makes it the same as the other ones.  However, as Rod noted earlier, working
with -1 and 0 might be better anyway.

> ==> Running performance tests: (10000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.050255321s  =   12.061 cycles/iteration
> Noop 1                           -->   0.021570294s  =    5.176 cycles/iteration
> Noop 2                           -->   0.019148795s  =    4.595 cycles/iteration
> Noop 3                           -->   0.016664411s  =    3.999 cycles/iteration
> Noop 4                           -->   0.016539342s  =    3.969 cycles/iteration
> Rod Pemberton 1                  -->   0.016643700s  =    3.994 cycles/iteration
> Rod Pemberton 1 (dword)          -->   0.016526647s  =    3.966 cycles/iteration
> Rod Pemberton 2                  -->   0.016718413s  =    4.012 cycles/iteration
> Bernhard Schornak                -->   0.016577269s  =    3.978 cycles/iteration
> Bernhard Schornak 2              -->   0.060770751s  =   14.584 cycles/iteration
> Rob                              -->   0.016664253s  =    3.999 cycles/iteration
> io_x (true=-1)                   -->   0.016574824s  =    3.977 cycles/iteration
> io_x (conforms)                  -->   0.016552388s  =    3.972 cycles/iteration

> ==> Running performance tests: (10000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.034733522s  =    6.022 cycles/iteration
> Noop 1                           -->   0.017316694s  =    3.002 cycles/iteration
> Noop 2                           -->   0.017328916s  =    3.004 cycles/iteration
> Noop 3                           -->   0.005782117s  =    1.002 cycles/iteration
> Noop 4                           -->   0.005760363s  =    0.998 cycles/iteration
> Rod Pemberton 1                  -->   0.005763820s  =    0.999 cycles/iteration
> Rod Pemberton 1 (dword)          -->   0.005762479s  =    0.999 cycles/iteration
> Rod Pemberton 2                  -->   0.005762902s  =    0.999 cycles/iteration
> Bernhard Schornak                -->   0.005764289s  =    0.999 cycles/iteration
> Bernhard Schornak 2              -->   0.011549449s  =    2.002 cycles/iteration
> Rob                              -->   0.005762832s  =    0.999 cycles/iteration
> io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
> io_x (conforms)                  -->   0.005763369s  =    0.999 cycles/iteration

I updated the files if anyone wants to test.
```
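
The conforming variant from the post above (`cmp` / `sbb` / `neg`) and the `and eax, 1` alternative can be compared in a C model. This is a sketch with hypothetical function names; it models only the register value, not timing.

```c
#include <stdint.h>

/* cmp eax,1 / sbb eax,eax : all ones if the input was 0, else 0. */
uint32_t sbb_mask(uint32_t eax)
{
    return (uint32_t)0 - (uint32_t)(eax < 1u);
}

/* ... followed by neg eax : two's-complement negate,
   so -1 becomes 1 and 0 stays 0. */
uint32_t not_via_neg(uint32_t eax)
{
    return (uint32_t)0 - sbb_mask(eax);
}

/* ... followed by and eax,1 : keep only bit 0 of the mask. */
uint32_t not_via_and(uint32_t eax)
{
    return sbb_mask(eax) & 1u;
}
```

Both endings map the mask to 1 for a zero input and 0 otherwise, so in this model they are interchangeable.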

```On Sat, 19 Jun 2010 10:02:35 -0400
Rob <junkmail3@nospicedham.lavabit.com> wrote:

> io_x wrote:
> > "io_x" <a@b.c.invalid> wrote in message
> >
> >
> > ; input eax output in eax
> > cmp  eax,   1
> > sbb  eax, eax
> > ; if eax==0 => CF==1 => eax==-1
> > ; if eax!=0 => CF==0 => eax== 0
> >
> >
> >
>
> I added it in to get the following:  it looks like using the only the
> two instructions could help on the AMD (and I imagine if doing large
> sections of code with lots of tests it probably would be worthwhile).
> However, if it has to conform to the OP's exact requirements,
> changing it to:
> > 	cmp eax,1
> > 	sbb eax,eax
> > 	neg eax
> makes it the same as the other ones.  However, as Rod noted earlier,
> working with -1 and 0 might be better anyway.
....
>
> I updated the files if anyone wants to test.

bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.023489901s  =    7.058 cycles/iteration
Noop 1                           -->   0.018399736s  =    5.529 cycles/iteration
Noop 2                           -->   0.016479211s  =    4.952 cycles/iteration
Noop 3                           -->   0.023974400s  =    7.204 cycles/iteration
Noop 4                           -->   0.009346287s  =    2.808 cycles/iteration
Rod Pemberton 1                  -->   0.020491987s  =    6.157 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.020545233s  =    6.173 cycles/iteration
Rod Pemberton 2                  -->   0.020901664s  =    6.280 cycles/iteration
Bernhard Schornak                -->   0.011462474s  =    3.444 cycles/iteration
Bernhard Schornak 2              -->   0.027416318s  =    8.238 cycles/iteration
Rob                              -->   0.011489957s  =    3.452 cycles/iteration
io_x (true=-1)                   -->   0.009982938s  =    2.999 cycles/iteration
io_x (conforms)                  -->   0.020529672s  =    6.169 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.023554749s  =    7.078 cycles/iteration
Noop 1                           -->   0.006417787s  =    1.928 cycles/iteration
Noop 2                           -->   0.003076953s  =    0.924 cycles/iteration
Noop 3                           -->   0.008103817s  =    2.435 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.005753219s  =    1.728 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.005811873s  =    1.746 cycles/iteration
Rod Pemberton 2                  -->   0.005788437s  =    1.739 cycles/iteration
Bernhard Schornak                -->   0.000000000s  =    0.000 cycles/iteration
Bernhard Schornak 2              -->   0.010484905s  =    3.150 cycles/iteration
Rob                              -->   0.000000000s  =    0.000 cycles/iteration
io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
io_x (conforms)                  -->   0.006127355s  =    1.841 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.026043430s  =    7.826 cycles/iteration
Noop 1                           -->   0.003932810s  =    1.181 cycles/iteration
Noop 2                           -->   0.000602773s  =    0.181 cycles/iteration
Noop 3                           -->   0.005581371s  =    1.677 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.003263918s  =    0.980 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.006819594s  =    2.049 cycles/iteration
Rod Pemberton 2                  -->   0.018136679s  =    5.450 cycles/iteration
Bernhard Schornak                -->   0.008962243s  =    2.693 cycles/iteration
Bernhard Schornak 2              -->   0.025032740s  =    7.522 cycles/iteration
Rob                              -->   0.008919725s  =    2.680 cycles/iteration
io_x (true=-1)                   -->   0.006785378s  =    2.039 cycles/iteration
io_x (conforms)                  -->   0.018474077s  =    5.551 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.026464334s  =    7.952 cycles/iteration
Noop 1                           -->   0.003503553s  =    1.052 cycles/iteration
Noop 2                           -->   0.000175563s  =    0.052 cycles/iteration
Noop 3                           -->   0.005157638s  =    1.549 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.011589178s  =    3.482 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.017831240s  =    5.358 cycles/iteration
Rod Pemberton 2                  -->   0.017632329s  =    5.298 cycles/iteration
Bernhard Schornak                -->   0.008528689s  =    2.562 cycles/iteration
Bernhard Schornak 2              -->   0.024680431s  =    7.416 cycles/iteration
Rob                              -->   0.008536761s  =    2.565 cycles/iteration
io_x (true=-1)                   -->   0.006846476s  =    2.057 cycles/iteration
io_x (conforms)                  -->   0.017597269s  =    5.287 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.025945025s  =    7.796 cycles/iteration
Noop 1                           -->   0.004053231s  =    1.217 cycles/iteration
Noop 2                           -->   0.012710062s  =    3.819 cycles/iteration
Noop 3                           -->   0.021561162s  =    6.479 cycles/iteration
Noop 4                           -->   0.007167236s  =    2.153 cycles/iteration
Rod Pemberton 1                  -->   0.018039244s  =    5.420 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.018007566s  =    5.411 cycles/iteration
Rod Pemberton 2                  -->   0.018331825s  =    5.508 cycles/iteration
Bernhard Schornak                -->   0.009022759s  =    2.711 cycles/iteration
Bernhard Schornak 2              -->   0.024985062s  =    7.508 cycles/iteration
Rob                              -->   0.009049770s  =    2.719 cycles/iteration
io_x (true=-1)                   -->   0.007222247s  =    2.170 cycles/iteration
io_x (conforms)                  -->   0.018110963s  =    5.442 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$ ./timer

Linux Code Benchmarking Tool version 0.1.1, 2010/06/18

## Calculating clockspeed... (Attempting to boost process priority to level 99)

==> Running performance tests: (10000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.025947498s  =    7.797 cycles/iteration
Noop 1                           -->   0.004020106s  =    1.208 cycles/iteration
Noop 2                           -->   0.000731383s  =    0.219 cycles/iteration
Noop 3                           -->   0.005678014s  =    1.706 cycles/iteration
Noop 4                           -->   0.000000000s  =    0.000 cycles/iteration
Rod Pemberton 1                  -->   0.003350944s  =    1.006 cycles/iteration
Rod Pemberton 1 (dword)          -->   0.003460227s  =    1.039 cycles/iteration
Rod Pemberton 2                  -->   0.003352601s  =    1.007 cycles/iteration
Bernhard Schornak                -->   0.000000000s  =    0.000 cycles/iteration
Bernhard Schornak 2              -->   0.008163741s  =    2.453 cycles/iteration
Rob                              -->   0.000000000s  =    0.000 cycles/iteration
io_x (true=-1)                   -->   0.000000000s  =    0.000 cycles/iteration
io_x (conforms)                  -->   0.003424348s  =    1.029 cycles/iteration
bmaxa@maxa:~/fasm/FastestLogNot/source/timer/boolean (clax)$

Greets, Branimir!

--
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-18 12:45 .

```

```"wolfgang kern" <nowhere@never.at> wrote in message
> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

cmp  eax, 1
mov  eax, 0
setc al

or

cmp  eax, 1
mov  eax, 0

```

```"wolfgang kern" <nowhere@never.at> wrote in message
>
> "io_x" posted:
> ...
>>> is it true that "cmp eax, 1" == "sub eax, 1" for the flag only?
>
> Sure.
>
>
>> ; input eax output in eax
>> cmp  eax,   1
>> sbb  eax, eax
>> ; if eax==0 => CF==1 => eax==-1
>> ; if eax!=0 => CF==0 => eax== 0
>
> Congrats Rosario, you finally got it ! :)

yes, but someone else wrote it first
for me it is a game to learn something

> and to fit OP's request for 1 and 0 (instead of true/false)

i can add one "neg eax" [hope that neg(0)==0] or "and  eax,  1" like Rob

> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

if eax==0 => CF==1 => eax=1-1=0
if eax!=0 => CF==0 => eax=1-0=1
so there is something wrong: 0 stays 0, so this is not a "not"

i think it could be right too, if we consider only
eax==0  false
eax==1  true

then the "logical not" of these posts would be just
xor  eax,  1

```

```"wolfgang kern" <nowhere@never.at> wrote in message
>
> "io_x" posted:
>
> I haven't checked if this perform any faster:
> (in theory it should do within 3..4 cycles on AMD K8/10)
>
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only

yes, this could be the code that "normalizes" to
true  == 1  [before, true was any value different from 0]
false == 0

so the correct code here should be
; normalize
cmp eax, 1  ; cy if 0 only
mov eax, 1
sbb eax, 0  ; = 0 if it was zero only
; if eax==0 => CF==1 => eax=1-1=0
; if eax!=0 => CF==0 => eax=1
; invert
xor eax, 1

```
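
The normalize-then-invert sequence from the post above can be sketched as a C model. It assumes `mov` leaves the flags untouched, so the carry from `cmp` is still valid when `sbb` executes; the function names are hypothetical.

```c
#include <stdint.h>

/* cmp eax,1 / mov eax,1 / sbb eax,0 : eax = 1 - 0 - CF, CF = (input == 0).
   mov does not modify flags, so CF from cmp survives until sbb. */
uint32_t normalize_bool(uint32_t eax)
{
    uint32_t cf = (eax < 1u);    /* CF after cmp eax, 1 */
    eax = 1u;                    /* mov eax, 1 */
    return eax - 0u - cf;        /* 0 if the input was 0, else 1 */
}

/* xor eax,1 : flips bit 0; a logical NOT only for inputs already 0 or 1. */
uint32_t invert_bool(uint32_t eax)
{
    return eax ^ 1u;
}

/* Normalize first, then invert. */
uint32_t logical_not(uint32_t eax)
{
    return invert_bool(normalize_bool(eax));
}
```

Calling `invert_bool(7)` directly gives 6, which is still nonzero; that is why the normalizing step is needed when the input is not already 0 or 1.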

```Well, I set it up so that each piece of code is executed 4 times.
I also cleaned up the code a bit and changed it so that the code under test is
inlined via macros instead of being called as a procedure. That makes it a
little harder to test C code, which is what it was set up for in the beginning;
it is still possible, but the C code would have to be called from the loop.
Interestingly, the inlining seemed to make quite a difference in the timings.

On the Pentium 4 I got (two runs, as it varied a little bit):

> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001999166s  =    1.199 cycles/iteration
> Noop 1                           -->    0.008380969s  =    5.028 cycles/iteration
> Noop 2                           -->    0.008973599s  =    5.384 cycles/iteration
> Noop 3                           -->    0.044951069s  =   26.970 cycles/iteration
> Noop 4                           -->    0.044929676s  =   26.957 cycles/iteration
> Rod Pemberton 1                  -->    0.051656874s  =   30.994 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.051574552s  =   30.944 cycles/iteration
> Rod Pemberton 2                  -->    0.051833085s  =   31.099 cycles/iteration
> Rob                              -->    0.051710325s  =   31.026 cycles/iteration
> Bernhard Schornak 1              -->    0.044852599s  =   26.911 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.045006106s  =   27.003 cycles/iteration
> Bernhard Schornak 2              -->    0.086183280s  =   51.709 cycles/iteration
> Rob 2                            -->    0.045101107s  =   27.060 cycles/iteration
> io_x (true=-1)                   -->    0.045004517s  =   27.002 cycles/iteration
> io_x (conforms)                  -->    0.051889863s  =   31.133 cycles/iteration
> io_x 2                           -->    0.044870523s  =   26.922 cycles/iteration
> io_x 3                           -->    0.044915668s  =   26.949 cycles/iteration

> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001999438s  =    1.199 cycles/iteration
> Noop 1                           -->    0.008288132s  =    4.972 cycles/iteration
> Noop 2                           -->    0.008926663s  =    5.355 cycles/iteration
> Noop 3                           -->    0.045298991s  =   27.179 cycles/iteration
> Noop 4                           -->    0.044975481s  =   26.985 cycles/iteration
> Rod Pemberton 1                  -->    0.051847143s  =   31.108 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.051833966s  =   31.100 cycles/iteration
> Rod Pemberton 2                  -->    0.051825479s  =   31.095 cycles/iteration
> Rob                              -->    0.051602658s  =   30.961 cycles/iteration
> Bernhard Schornak 1              -->    0.044904328s  =   26.942 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.044894118s  =   26.936 cycles/iteration
> Bernhard Schornak 2              -->    0.087300608s  =   52.380 cycles/iteration
> Rob 2                            -->    0.044952695s  =   26.971 cycles/iteration
> io_x (true=-1)                   -->    0.044967082s  =   26.980 cycles/iteration
> io_x (conforms)                  -->    0.051672787s  =   31.003 cycles/iteration
> io_x 2                           -->    0.044765342s  =   26.859 cycles/iteration
> io_x 3                           -->    0.045021009s  =   27.012 cycles/iteration

And the AMD was (it seemed pretty consistent over several runs):

> ==> Running performance tests: (4000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.004679899s  =    2.028 cycles/iteration
> Noop 1                           -->    0.013860430s  =    6.008 cycles/iteration
> Noop 2                           -->    0.013862878s  =    6.009 cycles/iteration
> Noop 3                           -->    0.023104264s  =   10.015 cycles/iteration
> Noop 4                           -->    0.023104864s  =   10.015 cycles/iteration
> Rod Pemberton 1                  -->    0.023106142s  =   10.016 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.023104913s  =   10.015 cycles/iteration
> Rod Pemberton 2                  -->    0.023105420s  =   10.016 cycles/iteration
> Rob                              -->    0.023090743s  =   10.009 cycles/iteration
> Bernhard Schornak 1              -->    0.023199651s  =   10.057 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.020766501s  =    9.002 cycles/iteration
> Bernhard Schornak 2              -->    0.023089921s  =   10.009 cycles/iteration
> Rob 2                            -->    0.013848657s  =    6.003 cycles/iteration
> io_x (true=-1)                   -->    0.013847794s  =    6.003 cycles/iteration
> io_x (conforms)                  -->    0.023092167s  =   10.010 cycles/iteration
> io_x 2                           -->    0.013835934s  =    5.997 cycles/iteration
> io_x 3                           -->    0.013909315s  =    6.029 cycles/iteration

```
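
The cycles/iteration column in these listings is consistent with a simple conversion from elapsed time and detected clock speed. The sketch below is an assumption inferred from the printed numbers, not taken from the tool's source; `cycles_per_iteration` is a hypothetical name.

```c
/* Convert an elapsed time into cycles per iteration:
   seconds * (MHz * 1e6 cycles per second) / iterations. */
double cycles_per_iteration(double seconds, double mhz, double iterations)
{
    return seconds * mhz * 1e6 / iterations;
}
```

For example, the Pentium 4 "Noop 1" line (0.008380969 s at 2400 MHz over 4000000 iterations) comes out at about 5.03, matching the 5.028 printed, and the Athlon "Noop 1" line (0.013860430 s at 1734 MHz) comes out at about 6.01, matching 6.008.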

```Ok, here's the final setup that I'm going to post - let's see what you guys get.

> ==> Running performance tests: (4000000 iterations)
>     Intel(R) Pentium(R) 4 CPU 2.40GHz Processor Detected @ 2400 MHz
> Reference Procedure timing took 0.001998761s  =    1.199 cycles/iteration
> Noop 1                           -->    0.003056183s  =    1.833 cycles/iteration
> Noop 2                           -->    0.012885864s  =    7.731 cycles/iteration
> Noop 3                           -->    0.009790604s  =    5.874 cycles/iteration
> Noop 4                           -->    0.009731597s  =    5.838 cycles/iteration
> Rod Pemberton 1                  -->    0.011450241s  =    6.870 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.011411202s  =    6.846 cycles/iteration
> Rod Pemberton 2                  -->    0.011408713s  =    6.845 cycles/iteration
> Rob                              -->    0.011823475s  =    7.094 cycles/iteration
> Bernhard Schornak 1              -->    0.009752518s  =    5.851 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.009820718s  =    5.892 cycles/iteration
> Bernhard Schornak 2              -->    0.020539177s  =   12.323 cycles/iteration
> Rob 2                            -->    0.010207311s  =    6.124 cycles/iteration
> io_x (true=-1)                   -->    0.009836334s  =    5.901 cycles/iteration
> io_x (conforms)                  -->    0.011562429s  =    6.937 cycles/iteration
> io_x 2                           -->    0.009838613s  =    5.903 cycles/iteration
> io_x 3                           -->    0.009800646s  =    5.880 cycles/iteration

> ==> Running performance tests: (4000000 iterations)
>     AMD Athlon(tm) XP 2100+ Processor Detected @ 1734 MHz
> Reference Procedure timing took 0.004645913s  =    2.014 cycles/iteration
> Noop 1                           -->    0.004628110s  =    2.006 cycles/iteration
> Noop 2                           -->    0.004617581s  =    2.001 cycles/iteration
> Noop 3                           -->    0.002323831s  =    1.007 cycles/iteration
> Noop 4                           -->    0.002323585s  =    1.007 cycles/iteration
> Rod Pemberton 1                  -->    0.002323169s  =    1.007 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.002300807s  =    0.997 cycles/iteration
> Rod Pemberton 2                  -->    0.002310613s  =    1.001 cycles/iteration
> Rob                              -->    0.002323304s  =    1.007 cycles/iteration
> Bernhard Schornak 1              -->    0.002324694s  =    1.007 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.000818792s  =    0.354 cycles/iteration
> Bernhard Schornak 2              -->    0.002324507s  =    1.007 cycles/iteration
> Rob 2                            -->    0.002323687s  =    1.007 cycles/iteration
> io_x (true=-1)                   -->    0.000000000s  =    0.000 cycles/iteration
> io_x (conforms)                  -->    0.002313310s  =    1.002 cycles/iteration
> io_x 2                           -->    0.000007071s  =    0.003 cycles/iteration
> io_x 3                           -->    0.000007169s  =    0.003 cycles/iteration

Code tested was as follows:

> timer_start time1,"Noop 1"
> 	cmp eax,0
> 	je @f
> 	mov eax,1
> @@: xor eax,1
> timer_end
>
> timer_start time2,"Noop 2"
> 	test eax, eax
> 	jz @f
> 	mov eax, 1
> @@:	xor eax, 1
> timer_end
>
> timer_start time3,"Noop 3"
> 	or eax,eax
> 	setz al
> 	and eax,1
> timer_end
>
> timer_start time4,"Noop 4"
> 	or eax,eax
> 	setz al
> 	movzx eax,al
> timer_end
>
> timer_start time5,"Rod Pemberton 1"
> 	cmp eax,1
> 	sbb eax,eax
> 	and eax,1
> timer_end
>
> timer_start time5a,"Rod Pemberton 1 (dword)"
> 	cmp eax,dword 1
> 	sbb eax,eax
> 	and eax,dword 1
> timer_end
>
> timer_start time6,"Rod Pemberton 2"
> 	neg eax
> 	sbb eax,eax
> 	inc eax
> timer_end
>
> timer_start time7,"Rob"
> 	neg eax
> 	sbb eax,eax
> timer_end
>
> timer_start time8,"Bernhard Schornak 1"
> 	test eax,eax
> 	sete al
> 	movzx eax,al
> timer_end
>
> timer_start time8a,"Bernhard Schornak 1 rearranged"
> 	test eax,eax
> 	movzx eax,al
> 	sete al
> timer_end
>
> timer_start time9,"Bernhard Schornak 2"
> 	test eax,eax
> 	cmovne eax,[.zero]
> 	cmove eax,[.one]
> timer_end
> align 4
> .zero	dd 0
> .one	dd 1
>
> timer_start time10,"Rob 2"
> 	test eax,eax
> 	mov eax,0
> 	cmove eax,[.one	]
> timer_end
> align 4
> .one dd 1
>
> timer_start time11,"io_x (true=-1)"
> 	cmp eax,1
> 	sbb eax,eax
> timer_end
>
> timer_start time12,"io_x (conforms)"
> 	cmp eax,1
> 	sbb eax,eax
> 	neg eax
> timer_end
>
> timer_start time13,"io_x 2"
> 	cmp eax,1
> 	mov  eax,0
> 	setc al
> timer_end
>
> timer_start time14,"io_x 3"
> 	cmp eax,1
> 	mov eax,0
> timer_end

Source is here:  (I put a couple of different types of archives up in case
someone doesn't have one of them).  I also added a (brief) README to explain it
a little bit.
http://70.53.58.62/Forums/clax/Fastest%20logical%20not.7z
http://70.53.58.62/Forums/clax/Fastest%20logical%20not.tar.bz2
```
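
Several of the tested snippets rely on `setz`/`sete`, which writes only the low byte of the register. The C model below is a sketch of why the final `and eax,1` (or a `movzx`) matters, and why the "rearranged" ordering still works; it assumes `movzx` does not modify flags, and the function names are hypothetical.

```c
#include <stdint.h>

/* Model of:  or eax,eax / setz al / and eax,1   ("Noop 3")
   setz writes only the low byte; bits 8..31 of eax keep their old value. */
uint32_t not_setz_and(uint32_t eax)
{
    uint32_t zf = (eax == 0);            /* ZF after or eax,eax */
    eax = (eax & 0xFFFFFF00u) | zf;      /* setz al */
    return eax & 1u;                     /* and eax,1 clears the stale bits */
}

/* Same sequence without the final instruction, to show why it is needed. */
uint32_t setz_only(uint32_t eax)
{
    uint32_t zf = (eax == 0);
    return (eax & 0xFFFFFF00u) | zf;
}

/* Model of:  test eax,eax / movzx eax,al / sete al   ("rearranged")
   movzx leaves the flags alone, so ZF from test is still valid for sete. */
uint32_t not_movzx_first(uint32_t eax)
{
    uint32_t zf = (eax == 0);            /* ZF after test eax,eax */
    eax = eax & 0xFFu;                   /* movzx eax,al */
    return (eax & 0xFFFFFF00u) | zf;     /* sete al */
}
```

In this model `setz_only(0x1234)` returns `0x1200`, so the upper bits of the input leak into the result unless they are cleared before or after the `setz`.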

```On Sun, 20 Jun 2010 23:17:07 -0400
Rob <junkmail3@nospicedham.lavabit.com> wrote:

> Ok, here's the final setup that I'm going to post - let's see what
> you guys get.
>
This one looks better:

bmaxa@maxa:~/Desktop$ ./timer

Linux Code Benchmarking Tool version 0.1, 2010/06/06

==> Running Conformance Tests:
Noop 1                           --> 00000001h -            1
Noop 2                           --> 00000001h -            1
Noop 3                           --> 00000001h -            1
Noop 4                           --> 00000001h -            1
Rod Pemberton 1                  --> 00000001h -            1
Rod Pemberton 1 (dword)          --> 00000001h -            1
Rod Pemberton 2                  --> 00000001h -            1
Rob                              --> 00000001h -            1
Bernhard Schornak 1              --> 00000001h -            1
Bernhard Schornak 1 rearranged   --> 00000001h -            1
Bernhard Schornak 2              --> 00000001h -            1
Rob 2                            --> 00000001h -            1
io_x (true=-1)                   --> FFFFFFFFh -   4294967295
io_x (conforms)                  --> 00000001h -            1
io_x 2                           --> 00000001h -            1
io_x 3                           --> 00000001h -            1

==> Calculating CPU clockspeed... (Attempting to boost process priority to level 99)
** No user permissions to boost priority:  the CPU calculation should still work though. **

==> Running performance tests: (4000000 iterations)
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz Processor Detected @ 3005 MHz
Reference Procedure timing took 0.002000860s  =    1.503 cycles/iteration
Noop 1                           -->    0.005968339s  =    4.483 cycles/iteration
Noop 2                           -->    0.003528693s  =    2.650 cycles/iteration
Noop 3                           -->    0.003667782s  =    2.755 cycles/iteration
Noop 4                           -->    0.002459816s  =    1.847 cycles/iteration
Rod Pemberton 1                  -->    0.003746486s  =    2.814 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.003726928s  =    2.799 cycles/iteration
Rod Pemberton 2                  -->    0.003751117s  =    2.818 cycles/iteration
Rob                              -->    0.003477926s  =    2.612 cycles/iteration
Bernhard Schornak 1              -->    0.002486430s  =    1.867 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.003646139s  =    2.739 cycles/iteration
Bernhard Schornak 2              -->    0.004498696s  =    3.379 cycles/iteration
Rob 2                            -->    0.002741681s  =    2.059 cycles/iteration
io_x (true=-1)                   -->    0.002482969s  =    1.865 cycles/iteration
io_x (conforms)                  -->    0.003477933s  =    2.612 cycles/iteration
io_x 2                           -->    0.003670713s  =    2.757 cycles/iteration
io_x 3                           -->    0.002851575s  =    2.142 cycles/iteration

Greets, Branimir

--
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

```

```On Mon, 2010-06-21 at 05:43 +0200, Branimir Maksimovic wrote:
> On Sun, 20 Jun 2010 23:17:07 -0400
> Rob <junkmail3@nospicedham.lavabit.com> wrote:
>
> > Ok, here's the final setup that I'm going to post - let's see what
> > you guys get.

There must be something wrong with your processor as mine outperforms
yours by a massive margin!

==> Running performance tests: (4000000 iterations)
Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz Processor Detected @ 2791 MHz
Reference Procedure timing took 0.004559407s  =    3.181 cycles/iteration
Noop 1                           -->    0.004874103s  =    3.400 cycles/iteration
Noop 2                           -->    0.002916938s  =    2.035 cycles/iteration
Noop 3                           -->    0.020436658s  =   14.259 cycles/iteration
Noop 4                           -->    0.019746686s  =   13.778 cycles/iteration
Rod Pemberton 1                  -->    0.022776560s  =   15.892 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.022845334s  =   15.940 cycles/iteration
Rod Pemberton 2                  -->    0.022944944s  =   16.009 cycles/iteration
Rob                              -->    0.022822393s  =   15.924 cycles/iteration
Bernhard Schornak 1              -->    0.020395952s  =   14.231 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.017738976s  =   12.377 cycles/iteration
Bernhard Schornak 2              -->    0.038830060s  =   27.093 cycles/iteration
Rob 2                            -->    0.018485669s  =   12.898 cycles/iteration
io_x (true=-1)                   -->    0.020139807s  =   14.052 cycles/iteration
io_x (conforms)                  -->    0.022526891s  =   15.718 cycles/iteration
io_x 2                           -->    0.018357516s  =   12.808 cycles/iteration
io_x 3                           -->    0.020623856s  =   14.390 cycles/iteration

--
http://www.munted.org.uk

One very high maintenance cat living here.

```

```"io_x" wrote:

>> I haven't checked if this perform any faster:
>> (in theory it should do within 3..4 cycles on AMD K8/10)

>> cmp eax, 1  ; cy if 0 only
>> mov eax, 1
>> sbb eax, 0  ; = 0 if it was zero only

> yes this could be the code that "normalize" to
> true  == 1  [first was true if it is a value different from 0]
> false == 0

Yes, this is 'positive LOGIC' seen by engineers,
in opposition to HLL-coders :)

> so the code result good here should be
> ; normalize
> cmp eax, 1  ; cy if 0 only
> mov eax, 1
> sbb eax, 0  ; = 0 if it was zero only
> ; if eax==0 => CF==1 => eax=1-1=0
> ; if eax!=0 => CF==0 => eax=1
> ; invert
> xor eax, 1

You can save the last line if you want it inverted:

cmp eax, 1
mov eax, 0
adc eax, 0   ;1 if it was zero only (0+WasZero)

[IF !=0]
cmp eax, 1
sbb eax,eax  ;-1 (false) only if it was 0

[IF ==0]
cmp eax, 1
mov eax,-1
adc eax, 0   ;0 (true) only if it was 0

shorter in bytes, perhaps slower
[IF NOT !=0]
cmp eax, 1
sbb eax, eax
not eax      ;same as 'xor eax, -1' except no flags altered

Even though I would never need 0/1 results,
my way would be:

or eax,eax            ;get zero/sign flags
cmovnz eax,[one]      ;if it's 0 then let it be 0 :)

and my solution for the fastest 'logical NOT':

NOT eax               ;this seems to be designed for it!
or
XOR eax,-1            ;may be faster on some CPUs
__
wolfgang

```
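
The carry-flag variants in the post above can be checked without an assembler. The following is a minimal sketch in C that models them, assuming only that `cmp eax, 1` sets CF exactly when eax is zero (the unsigned subtraction borrows only then). The function names are made up for illustration and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CF after "cmp eax, 1": eax - 1 borrows only when eax == 0. */
uint32_t carry_after_cmp1(uint32_t eax) { return eax < 1u; }

/* cmp eax,1 / sbb eax,eax : 0xFFFFFFFF (-1) if the input was 0, else 0. */
uint32_t cmp_sbb(uint32_t eax) {
    uint32_t cf = carry_after_cmp1(eax);
    return eax - eax - cf;          /* sbb computes dest - src - CF */
}

/* cmp eax,1 / mov eax,1 / sbb eax,0 : 0 stays 0, nonzero becomes 1. */
uint32_t normalize(uint32_t eax) {
    uint32_t cf = carry_after_cmp1(eax);
    return 1u - 0u - cf;
}

/* cmp eax,1 / mov eax,0 / adc eax,0 : 1 if the input was 0, else 0. */
uint32_t logical_not(uint32_t eax) {
    uint32_t cf = carry_after_cmp1(eax);
    return 0u + 0u + cf;            /* adc computes dest + src + CF */
}

/* cmp eax,1 / sbb eax,eax / not eax : 0 if the input was 0, else -1. */
uint32_t cmp_sbb_not(uint32_t eax) { return ~cmp_sbb(eax); }

/* Compare every modelled sequence with the plain C expression it stands for. */
void check_all_variants(void) {
    const uint32_t samples[] = {0u, 1u, 2u, 0x80000000u, 0xFFFFFFFFu};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        uint32_t x = samples[i];
        assert(cmp_sbb(x) == (x == 0 ? 0xFFFFFFFFu : 0u));
        assert(normalize(x) == (uint32_t)(x != 0));
        assert(logical_not(x) == (uint32_t)!x);
        assert(cmp_sbb_not(x) == (x == 0 ? 0u : 0xFFFFFFFFu));
    }
}
```

Calling `check_all_variants()` from a test program exercises all four sequences on a handful of edge values. The model says nothing about speed, only about the results.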

```On Mon, 21 Jun 2010 10:25:01 +0200
"wolfgang kern" <nowhere@never.at> wrote:

>
> "io_x" wrote:
>
> >> I haven't checked if this perform any faster:
> >> (in theory it should do within 3..4 cycles on AMD K8/10)
>
.......
>

C didn't have a bool type. C-style true/false
is actually zero for false and nonzero (anything)
for true.
The same logic is used for pointers (null is false, non-null is true).

>
> and my solution for the fastest 'logical NOT':
>
> NOT eax               ;this seem to be designed for it!
> or
> XOR eax,-1            ;may be faster on some CPUs

Great!

> __
> wolfgang
>
>

Greets, Branimir

--
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

```
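
The C convention described above (zero is false, anything else is true, and pointers follow the same rule) can be illustrated with a short sketch. The helper names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Collapse any "truthy" value to exactly 0 or 1. */
int normalize_truth(int x) { return !!x; }

/* C's logical NOT: 1 for zero, 0 for anything else. */
int logical_not_int(int x) { return !x; }

/* Pointers follow the same rule: null is false, non-null is true. */
int is_null(const void *p) { return !p; }
```

For instance, `normalize_truth(-5)` yields 1 and `is_null(NULL)` yields 1, which is the 0/1 result the original question asked the assembly to produce.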

```Branimir Maksimovic posted:
> "wolfgang kern" <nowhere@never.at> wrote:
....

> C didn't have bool type. C style true/false
> is actually , zero for false, not zero (anything)
> for true.
> Same logic is used for pointers (true/false).

Oh! that's why most code for windoze ends with
mov esp,ebp
pop ebp
xor eax,eax   ;or an error code in eax  :)
ret 16

__
wolfgang

```

```On Mon, 21 Jun 2010 08:54:20 +0100
Mr Sensible <alex.buell@nospicedham.munted.org.uk> wrote:

> On Mon, 2010-06-21 at 05:43 +0200, Branimir Maksimovic wrote:
> > On Sun, 20 Jun 2010 23:17:07 -0400
> > Rob <junkmail3@nospicedham.lavabit.com> wrote:
> >
> > > Ok, here's the final setup that I'm going to post - let's see what
> > > you guys get.
>
> There must be something wrong with your processor as mine outperforms
> yours by a massive margin!
>
> ==> Running performance tests: (4000000 iterations)
> Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz Processor Detected @ 2791 MHz
> Reference Procedure timing took 0.004559407s  =    3.181 cycles/iteration
> Noop 1                           -->    0.004874103s  =    3.400 cycles/iteration
> Noop 2                           -->    0.002916938s  =    2.035 cycles/iteration
> Noop 3                           -->    0.020436658s  =   14.259 cycles/iteration
> Noop 4                           -->    0.019746686s  =   13.778 cycles/iteration
> Rod Pemberton 1                  -->    0.022776560s  =   15.892 cycles/iteration
> Rod Pemberton 1 (dword)          -->    0.022845334s  =   15.940 cycles/iteration
> Rod Pemberton 2                  -->    0.022944944s  =   16.009 cycles/iteration
> Rob                              -->    0.022822393s  =   15.924 cycles/iteration
> Bernhard Schornak 1              -->    0.020395952s  =   14.231 cycles/iteration
> Bernhard Schornak 1 rearranged   -->    0.017738976s  =   12.377 cycles/iteration
> Bernhard Schornak 2              -->    0.038830060s  =   27.093 cycles/iteration
> Rob 2                            -->    0.018485669s  =   12.898 cycles/iteration
> io_x (true=-1)                   -->    0.020139807s  =   14.052 cycles/iteration
> io_x (conforms)                  -->    0.022526891s  =   15.718 cycles/iteration
> io_x 2                           -->    0.018357516s  =   12.808 cycles/iteration
> io_x 3                           -->    0.020623856s  =   14.390 cycles/iteration
>

==> Running performance tests: (4000000 iterations)
Intel(R) Xeon(TM) CPU 2.80GHz Processor Detected @ 2791 MHz
Reference Procedure timing took 0.002098834s  =    1.464 cycles/iteration
Noop 1                           -->    0.002204584s  =    1.538 cycles/iteration
Noop 2                           -->    0.002208957s  =    1.541 cycles/iteration
Noop 3                           -->    0.007932801s  =    5.535 cycles/iteration
Noop 4                           -->    0.007932259s  =    5.534 cycles/iteration
Rod Pemberton 1                  -->    0.009366177s  =    6.535 cycles/iteration
Rod Pemberton 1 (dword)          -->    0.009364695s  =    6.534 cycles/iteration
Rod Pemberton 2                  -->    0.009368607s  =    6.536 cycles/iteration
Rob                              -->    0.009954535s  =    6.945 cycles/iteration
Bernhard Schornak 1              -->    0.007932956s  =    5.535 cycles/iteration
Bernhard Schornak 1 rearranged   -->    0.007936561s  =    5.537 cycles/iteration
Bernhard Schornak 2              -->    0.016799123s  =   11.721 cycles/iteration
Rob 2                            -->    0.008283530s  =    5.779 cycles/iteration
io_x (true=-1)                   -->    0.007934037s  =    5.535 cycles/iteration
io_x (conforms)                  -->    0.009365197s  =    6.534 cycles/iteration
io_x 2                           -->    0.008029410s  =    5.602 cycles/iteration
io_x 3                           -->    0.008050396s  =    5.617 cycles/iteration
[bmaxa@devel ~]$

Greets, Branimir

--
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

```

```"Mr Sensible" <alex.buell...> answered Rob:

>> Ok, here's the final setup that I'm going to post - let's see what
>> you guys get.

> There must be something wrong with your processor as mine outperforms
> yours by a massive margin!

> ==> Running performance tests: (4000000 iterations)
....
4 million iterations against 1 million used by Rob ?
....
What you are both mainly measuring here may be just OS background noise,
and that depends more on how often you moved the mouse during
the test than on the actual timing of the code snippet :)
Perhaps that's why your measurements vary from 0 to 27 cycles?

I didn't time all the variants in this thread, only io_x's version:

cmp eax,1
sbb eax,eax
inc eax      ;0 if it was zero, else 1 (NEG eax instead gives 1 if it was zero, else 0)

tested on:
AMD Athlon K8 1.8 GHz dual
AMD K7 500 MHz

It always reads two (three on K7) cycles more than an empty
(single NOP) test for the second iteration (see below);
first-iteration timing depends on cache status.

It looks like each of these three instructions can execute
and complete within one cycle; otherwise
we would see dependency penalties.
__
wolfgang

I posted this in ALA (1.June):
_______
.... RDTSC timing is machine specific (5...else cycles).
I think I have posted this several times already:

MOV esi,result_buf  ;16 bytes needed yet here
MOV byte[loop],01
CLI
L1:
;PUSH esi           ;if desired/required
CPUID
RDTSC
MOV ebx,eax ;or: PUSH eax
MOV ecx,edx ;or: PUSH edx
...             ;code under test (must preserve what's needed)
RDTSC             ;without serialising yet !!!
SUB eax,ebx ;or: SUB eax,[esp+4]
SBB edx,ecx ;or: SBB edx,[esp]
;POP esi
MOV [esi],eax       ;
MOV [esi+4],edx     ;store 64-bit cycle-count
ADD esi,8           ;advance to the second 8-byte slot of result_buf
DEC byte [loop]
JNS L1              ;loop the above just one more time
STI
* I have an INT3 here to watch results in the debugger view.

A first run (I intentionally avoid the term 'iterate' here)
would show a cycle count which includes cache-burst reads.
The second test run follows immediately, still with IRQs disabled.

One million iterations may just measure OS-noise from IRQ-

I measure 6..7 cycles (almost just the RDTSC itself) on an
empty (single NOP) test on my current machine with the MOV-
version, while the PUSH-variant takes 1..2 cycles more.

Usually I ignore the first 64-bit result (caching time)
when comparing code variants, but the info there is very
useful for proper code alignment.

Note:
this method shouldn't be used on huge code parts;
IRQs disabled for too long can result in stuck hardware.
__________
eof

```
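
The measurement idea in the post above (time the code twice back to back, ignore the first, cache-cold result, and compare against an empty test) can be sketched in portable C. This is only an approximation under stated assumptions: `timespec_get` from C11 stands in for RDTSC, so the unit is nanoseconds of wall-clock time rather than cycles, interrupts are not disabled, and all function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*test_fn)(void);

/* Stand-in for RDTSC (an assumption, not the thread's method):
   nanoseconds from the C11 calendar clock. */
uint64_t now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Baseline, playing the role of the "empty (single NOP) test". */
void empty_test(void) {}

/* Example code under test: a logical NOT on a volatile value. */
volatile uint32_t sink;
void sample_test(void) {
    uint32_t x = sink;
    sink = (uint32_t)!x;
}

/* Time fn twice in a row: out[0] is the cold run, out[1] the warm run. */
void time_two_runs(test_fn fn, uint64_t out[2]) {
    for (int run = 0; run < 2; ++run) {
        uint64_t start = now_ns();
        fn();
        uint64_t stop = now_ns();
        out[run] = stop - start;
    }
}

/* Warm-run time of fn minus warm-run time of the empty test, clamped at 0. */
uint64_t net_cost(test_fn fn) {
    uint64_t base[2], timed[2];
    time_two_runs(empty_test, base);
    time_two_runs(fn, timed);
    return timed[1] > base[1] ? timed[1] - base[1] : 0;
}
```

A call such as `net_cost(sample_test)` returns the warm-run difference. For a snippet this small the clock's own overhead dominates, which is the same caveat the post raises for RDTSC itself.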

```On Mon, 2010-06-21 at 14:19 +0200, Branimir Maksimovic wrote:

> Bernhard Schornak 2              -->    0.016799123s  =   11.721
> cycles/iteration

Mine:
Bernhard Schornak 2              -->    0.038830060s  =   27.093
cycles/iteration

I cannot believe my processor (Mobile P4 2.8GHz w/1MB L2 cache) is
faster than a Xeon (Xeon 2.8GHz). Something is NOT right - if I'm
reading this correctly, I'm taking 27 cycles per iteration whilst yours
takes 11.7 cycles per iteration.

Oh hang on, what *sort* of cycles are we talking about?
-- 
http://www.munted.org.uk

One very high maintenance cat living here.

```

```On Mon, 21 Jun 2010 15:06:02 +0200
"wolfgang kern" <nowhere@never.at> wrote:

> MOV esi,result_buf  ;16 bytes needed yet here
> MOV byte[loop],01
> CLI
> L1:
> ;PUSH esi           ;if desired/required
>   CPUID
>   RDTSC
>     MOV ebx,eax ;or: PUSH eax
>     MOV ecx,edx ;or: PUSH edx
>     ...             ;code under test (must preserve what's needed)
>   RDTSC             ;without serialising yet !!!
>     SUB eax,ebx ;or: SUB eax,[esp+4]
>     SBB edx,ecx ;or: SBB edx,[esp]
> ;POP esi
> MOV [esi],eax       ;
> MOV [esi+4],edx     ;store 64-bit cycle-count
> DEC byte [loop]
> JNS L1              ;loop the above just one more time
> STI
.....
> Note:
> this method shouldn't be used on huge code parts,
> too long disabled IRQs can result in stuck hardware.

What happens with this code when there is more than one
CPU? I guess CLI is valid only for the current CPU?
So the task has to be bound to a single CPU?

Greets, Branimir

--
drwxr-xr-x 2 bmaxa bmaxa 4096 2010-06-19 19:26 .

```

```Branimir Maksimovic wrote:

> On Mon, 21 Jun 2010 15:06:02 +0200
> "wolfgang kern"<nowhere@never.at>  wrote:
>
>> MOV esi,result_buf  ;16 bytes needed yet here
>> MOV byte[loop],01
>> CLI
>> L1:
>> ;PUSH esi           ;if desired/required
>>    CPUID
>>    RDTSC
>>      MOV ebx,eax ;or: PUSH eax
>>      MOV ecx,edx ;or: PUSH edx
>>      ...             ;code under test (must preserve what's needed)
>>    RDTSC             ;without serialising yet !!!
>>      SUB eax,ebx ;or: SUB eax,[esp+4]
>>      SBB edx,ecx ;or: SBB edx,[esp]
>> ;POP esi
>> MOV [esi],eax       ;
>> MOV [esi+4],edx     ;store 64-bit cycle-count
>> DEC byte [loop]
>> JNS L1              ;loop the above just one more time
>> STI
> ....
>> Note:
>> this method shouldn't be used on huge code parts,
>> too long disabled IRQs can result in stuck hardware.
>
> What happens with this code when there are more than 1
> cpus? I guess cli is valid only for current cpu?
> So task has to be bound to single cpu?

Why should any OS split a single thread
into parts and execute them on separate
cores? Threads generally are not execu-
ted on more than one core, except a 2nd
instance was started before the 1st one
finished. On some, but not all OS', the
2nd instance may be executed on another
(idle) core.

All errors in Rob's test routines could
be eliminated if the loop count was re-
duced to 1,024 (or less) - actually, 16
tests were sufficient to get a reliable
result in cycles (rather than micro- or
nanoseconds) using RDTSC. Too many runs
(millions in our case) exceed the usual
time slice by far, adding times for un-
results. Maybe a MUST_COMPLETE or what-
ever this is called on <insert your OS>
plus setting that thread to the highest
possible priority is required to finish
it without being 'preempted'...

Greetings from Augsburg

Bernhard Schornak
```
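
One way to use a small number of short runs, as suggested above, is to keep only the smallest sample. Choosing the minimum is an assumption of this sketch (the post does not say how the 16 results should be combined), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest of n samples (n must be at least 1): the run least
   disturbed by interrupts or preemption. */
uint64_t min_sample(const uint64_t *samples, size_t n) {
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; ++i) {
        if (samples[i] < best) {
            best = samples[i];
        }
    }
    return best;
}
```

A run that was interrupted can only take longer than an undisturbed one, so the minimum discards that kind of noise, whereas an average over millions of iterations folds it in.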

```Branimir Maksimovic asked:

>> MOV esi,result_buf  ;16 bytes needed yet here
>> MOV byte[loop],01
>> CLI
>> L1:
>> ;PUSH esi           ;if desired/required
>>   CPUID
>>   RDTSC
>>     MOV ebx,eax ;or: PUSH eax
>>     MOV ecx,edx ;or: PUSH edx
>>     ...             ;code under test (must preserve what's needed)
>>   RDTSC             ;without serialising yet !!!
>>     SUB eax,ebx ;or: SUB eax,[esp+4]
>>     SBB edx,ecx ;or: SBB edx,[esp]
>> ;POP esi
>> MOV [esi],eax       ;
>> MOV [esi+4],edx     ;store 64-bit cycle-count
>> DEC byte [loop]
>> JNS L1              ;loop the above just one more time
>> STI
> ....
>> Note:
>> this method shouldn't be used on huge code parts,
>> too long disabled IRQs can result in stuck hardware.

> What happens with this code when there are more than 1
> cpus? I guess cli is valid only for current cpu?
> So task has to be bound to single cpu?

I cannot see how and when another CPU core could intercept
this piece of code between the two RDTSCs, as long as
the test code is short enough to remain cached and
it doesn't switch tasks or context through far calls.
__
wolfgang

```

```Mr Sensible wrote:
> On Mon, 2010-06-21 at 14:19 +0200, Branimir Maksimovic wrote:
>
>> Bernhard Schornak 2              -->    0.016799123s  =   11.721
>> cycles/iteration
>
> Mine:
> Bernhard Schornak 2              -->    0.038830060s  =   27.093
> cycles/iteration
>
> I cannot believe my processor (Mobile P4 2.8GHz w/1MB L2 cache) is
> faster than a Xeon (Xeon 2.8GHz). Something is NOT right - if I'm
> reading this correctly, I'm taking 27 cycles per iteration whilst yours
> takes 11.7 cycles per iteration.
>
> Oh hang on, what *sort* of cycles are we talking about?

Thanks for testing!  It makes me wonder though what could be going on with the
code since your P4 has quite different results than mine.  Are you running many
other programs at the same time as testing it?
I run it with all my other programs closed (although I am still in X).  In
theory, it shouldn't make much of a difference, but I found in practice it did
(at least in consistent results).

I meant clock cycles:  IOW, lower is faster (the actual speed of
execution depends on the clock speed, though).

Thanks,
Rob
```
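
The cycles/iteration column in the tables posted in this thread appears to be the elapsed time multiplied by the detected clock frequency and divided by the iteration count. That relationship is inferred from the posted numbers and not from the benchmark's source, so treat the sketch below as a plausibility check. The function name is made up.

```c
#include <assert.h>

/* cycles per iteration = seconds * (MHz * 1e6) / iterations */
double cycles_per_iteration(double seconds, double mhz, double iterations) {
    return seconds * mhz * 1e6 / iterations;
}
```

With the Xeon reference row (0.002098834 s, 2791 MHz, 4000000 iterations) this gives about 1.464, and the Mobile P4 "Bernhard Schornak 2" row (0.038830060 s) gives about 27.09. Both agree with the posted figures to within the rounding of the MHz value, so the higher number means more clock cycles and a slower run.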

```"Branimir Maksimovic" <bmaxa@nospicedham.hotmail.com> wrote in message
news:hvnta3$2br$9@solani.org...
> What happens with this code when there are more than 1
> cpus?

Uh...   RDTSC is replaced with RDTSCP?

There are a number of x86 microprocessors that have problems
with their TSC: it either drifts or is artificially locked.

"TSC (cpu TimeStamp Counter)
- RDTSC instruction on Pentium or later cpu's
- 64-bit counter
- one count per cpu clock
- Efficeon updates TSC counters at maximum frequency
- Efficeon doesn't properly update at actual cpu clock speed
- TSC drifts on AMD K8 and dual-core platforms
- TSC is not frequency independent on AMD K8 and dual-core
- used by Linux
- used by MS Vista
- used by MS SMP HAL Windows
- RDTSCP instruction on AMD NPT 0F, e.g., F, AM2, S1g1 cpus
- can be affected by power management events
- bit returned by CPUID indicates power invariant
"

Quotes of issues with RDTSC instruction:

Rod Pemberton

```

```On Mon, 21 Jun 2010 15:06:02 +0200, "wolfgang
kern" <nowhere@never.at> wrote:

<snip>

>_______
>... RDTSC timing is machine specific (5...else cycles).
>I think to have posted this several times already:
>
>MOV esi,result_buf  ;16 bytes needed yet here
>MOV byte[loop],01
>CLI
>L1:
>;PUSH esi           ;if desired/required
>  CPUID
>  RDTSC
>    MOV ebx,eax ;or: PUSH eax
>    MOV ecx,edx ;or: PUSH edx
>    ...             ;code under test (must preserve what's needed)
>  RDTSC             ;without serialising yet !!!
>    SUB eax,ebx ;or: SUB eax,[esp+4]
>    SBB edx,ecx ;or: SBB edx,[esp]
>;POP esi
>MOV [esi],eax       ;
>MOV [esi+4],edx     ;store 64-bit cycle-count
>DEC byte [loop]
>JNS L1              ;loop the above just one more time
>STI
>* I have an INT3 here to watch results in the debugger view.
>
>A first run (I intentionally avoid the term 'iterate' here)
>would show a cycle count which implies cache-burst-reads.
>The second test-run follows immediate still with IRQs disabled.
>
>One million iterations may just measure OS-noise from IRQ-
>
>I measure 6..7 cycles (almost just the RDTSC itself) on an
>empty (single NOP) test one my current machine with the MOV-
>version while the PUSH-variant takes 1..2 cycles more.
>
>Usually I ignore the first 64-bit result (caching time)
>for code variants compare, but the info there is very
>useful for proper code alignment.
>
>Note:
>this method shouldn't be used on huge code parts,
>too long disabled IRQs can result in stuck hardware.

I am surprised to see you using CLI to disable
IRQs.  As far as I know, there is no way to do
this in Windows from user mode.  Does *nix allow
this?  (Seems dangerous, and out of keeping with
the usual ideas of protected mode.)  Or do you
have a way to run in Ring 0? ... Or in real mode?

Best regards,

Bob Masta

DAQARTA  v5.10
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI
DaqMusic - FREE MUSIC, Forever!
(Some assembly required)
Science (and fun!) with your sound card!
```

```"Bob Masta" asked:
> <snip>

>>_______
>>... RDTSC timing is machine specific (5...else cycles).
>>I think to have posted this several times already:
>>
>>MOV esi,result_buf  ;16 bytes needed yet here
>>MOV byte[loop],01
>>CLI
>>L1:
>>;PUSH esi           ;if desired/required
>>  CPUID
>>  RDTSC
>>    MOV ebx,eax ;or: PUSH eax
>>    MOV ecx,edx ;or: PUSH edx
>>    ...             ;code under test (must preserve what's needed)
>>  RDTSC             ;without serialising yet !!!
>>    SUB eax,ebx ;or: SUB eax,[esp+4]
>>    SBB edx,ecx ;or: SBB edx,[esp]
>>;POP esi
>>MOV [esi],eax       ;
>>MOV [esi+4],edx     ;store 64-bit cycle-count
>>DEC byte [loop]
>>JNS L1              ;loop the above just one more time
>>STI
>>* I have an INT3 here to watch results in the debugger view.

>>A first run (I intentionally avoid the term 'iterate' here)
>>would show a cycle count which implies cache-burst-reads.
>>The second test-run follows immediate still with IRQs disabled.

>>One million iterations may just measure OS-noise from IRQ-

>>I measure 6..7 cycles (almost just the RDTSC itself) on an
>>empty (single NOP) test one my current machine with the MOV-
>>version while the PUSH-variant takes 1..2 cycles more.

>>Usually I ignore the first 64-bit result (caching time)
>>for code variants compare, but the info there is very
>>useful for proper code alignment.

>>Note:
>>this method shouldn't be used on huge code parts,
>>too long disabled IRQs can result in stuck hardware.

> I am surprised to see you using CLI to disable
> IRQs.  As far as I know, there is no way to do
> this in Windows from user mode.  Does *nix allow
> this?  (Seems dangerous, and out of keeping with
> the usual ideas of protected mode.)  Or do you
> have a way to run in Ring 0? ... Or in real mode?

Can't tell for L'unix ...
my OS and applications always run with PL==0 (ring 0),
for windoze XP there are ways to get 'ADMIN'-IO/PL rights
(win 95/98 allowed CLI/STI without further permission).
I don't know if vista or win7 may grant you IO-PL changes too.

__
wolfgang

```

25 Replies
203 Views
