A working Win64 test program is available now:
http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
The zip file includes all source files and two
executables (for AMD or iNTEL). TDM-minGW64 is
required to recompile Test Toy. It is not much
more than a toy, one that proves the execution
times of two- or three-line code snippets can-
not be measured with reliable accuracy.
You can scroll through up to 64 test runs with
Rob's 14 tests and two empty loops, executed 8
times in a row (a 16 * 8 matrix displayed in a
dialog box).
Suggestions on how to convince my test function
to emit more reliable results are welcome. It
can be found in tstA.S and/or tstI.S.
I am interested in how both versions differ on
an iNTEL machine (my case is marked with an
'iNTEL outside' logo...).
Greetings from Augsburg
Bernhard Schornak
Bernhard | 6/24/2010 7:11:50 AM
Bernhard Schornak wrote:
>A working Win64 test program is available now:
> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
It's interesting that RDTSC needs more cycles on faster CPUs :)
I can confirm your '71': my Phenom II X4 3GHz shows
an RDTSC latency of 72 cycles, while the 1.8GHz K8 makes it
in 13 and my old 500MHz K7 takes 11 cycles.
My debugger's timing view zero-references itself, so I normally
don't see these values.
> The zip file includes all source files and two
> executables (for AMD or iNTEL). TDM-minGW64 is
> required to recompile Test Toy. It is not much
> more than that, proving the execution times of
> two or three line long code snippets cannot be
> measured with reliable accuracy.
My test results seem reproducible on short parts too.
Yeah, sometimes I see one cycle more or less.
To make sure, I hit the run button a few times (max. 8 and NOT a
million) and ignore/discard the lower readings.
In general I have no problem timing short code snippets, and
if they showed less than a cycle I'd assume one.
> You can scroll through up to 64 test runs with
> Rob's 14 tests and two empty loops, executed 8
> times in a row (a 16 * 8 matrix displayed in a
> dialog box).
I haven't got any 64-bit OS yet.
> Suggestions how to convice my test function to
> emit more reliable results are welcome. It can
> be found in tstA.S and/or tstI.S.
My test field is 64-byte aligned (cache line bounds) and the whole
thing is never larger than 64 bytes. This allows test code of up
to 32 bytes at least, and I can play a bit with the test code
alignment within this field too.
> I am interested how both versions differ on an
> iNTEL machine (my case is marked with an iNTEL
> outside logo...).
Same here :) the sign on my workstation reads 'netname:AMD64'.
__
wolfgang
wolfgang | 6/24/2010 9:42:52 AM
wolfgang kern wrote:
> Bernhard Schornak wrote:
>
>> A working Win64 test program is available now:
>
>> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
>
> It's interesting that RDTSC need more cycles on faster CPUs :)
> I may confirm your '71' because my phenom II x4 3GHz show
> a RDTSC-latency of 72 cycles! while the 1.8GHz K8 make it
> in 13 and my old K7 500MHz take 11 cycles.
> My debuggers timing view zero-references itself, so I normally
> don't see this values.
I did not bother with CLI/STI, because these
probably won't work (illegal in Ring 3), but
a RDTSCP - save EAX/EDX - RDTSCP sequence on
my Phenom II X4 takes 69-90 cycles. Test Toy
generates (up to) 512 runs and you can watch
how the 'calibration' sequences differ, even
if they run one after the other, directly.
AMD's optimisation guide says 45 CPU + 16 NB
cycles. I guess NB cycles are a bit slower
than CPU cycles, hence the difference. RDTSC
reads an MSR, which might explain why the time
needed to access its content varies.
>> The zip file includes all source files and two
>> executables (for AMD or iNTEL). TDM-minGW64 is
>> required to recompile Test Toy. It is not much
>> more than that, proving the execution times of
>> two or three line long code snippets cannot be
>> measured with reliable accuracy.
>
> My test results seem reproducable also on short parts.
> Yeah, sometimes I see one cycle more or less.
> To make sure I hit the run-button a few (max.8 and NOT million)
> times, and ignore/discard the lower readings.
> But in general I have no problem to time short code-snips and
> if they would show less than a cycle then I'd assume one.
Test Toy runs the 16 tests 8 times in a row.
All 512 results are displayed, but: I do not
trust the first two runs, either. It is very
interesting that all functions are on par if
the processor found the fastest way to opti-
mise the single functions. It's amazing, how
performant recent hardware really is... ;)
I use a take it or leave it method - results
are compared against a lower and upper limit
and the test is re-done if the result is out
of range. Lower limit is two, upper limit is
ten cycles. According to AMD's 'optimisation
guide', any testee should execute inside the
chosen range.
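A minimal sketch of the take-it-or-leave-it check described above, written in Python for readability rather than in the assembler the test program uses. The `take_sample` callable and the `max_tries` safety limit are assumptions made for illustration; only the limits of two and ten cycles come from the description.

```python
def measure_in_range(take_sample, low=2, high=10, max_tries=100):
    """Repeat a measurement until it falls inside [low, high] cycles.

    take_sample is a hypothetical callable returning one cycle count;
    max_tries is an added safety limit, not part of the description above.
    """
    for attempt in range(1, max_tries + 1):
        cycles = take_sample()
        if low <= cycles <= high:
            return cycles, attempt
    raise RuntimeError(f"no sample within {low}..{high} cycles after {max_tries} tries")

# Stand-in for a real timer: replays a fixed list of made-up readings.
readings = iter([19, -3, 0, 3, 2])
result, tries = measure_in_range(lambda: next(readings))
print(result, tries)
```

Note that a filter like this can only ever return values inside the chosen range, so it hides outliers rather than explaining them.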
>> You can scroll through up to 64 test runs with
>> Rob's 14 tests and two empty loops, executed 8
>> times in a row (a 16 * 8 matrix displayed in a
>> dialog box).
>
> I haven't got any 64 OS yet.
I have two. One is up and running, the other
wastes my spare time with pointless tries to
install. Debian ... meanwhile a swearword in
my vocabulary.
>> Suggestions how to convice my test function to
>> emit more reliable results are welcome. It can
>> be found in tstA.S and/or tstI.S.
>
> My testfield is 64 byte aligned (cache bounds) and the whole
> thing is never larger than 64 byte. This allow testcode size
> up to 32 byte at least and I can play a bit with testcode
> alignment within this field too.
I store the first RDTSCP result in EDI/ESI -
should not trigger any cache issues with two
GPRs. The references for my and Rob's CMOVxx
with memory access are in L1 when both func-
tions are executed. They are on par/+1 clock
with all other testees.
>> I am interested how both versions differ on an
>> iNTEL machine (my case is marked with an iNTEL
>> outside logo...).
>
> Same here :) the sign on my workstation 'netname:AMD64'.
Oh, well. Actually, it's a 'WiNTEL outside'-
logo. I just have to omit the 'W' as long as
I write code for it, though. Wimbledon would
announce "Advantage: Wolfgang!", now... ;)
Greetings from Augsburg
Bernhard Schornak
Bernhard | 6/24/2010 3:50:05 PM
"Bernhard Schornak" <schornak@nospicedham.web.de> wrote in message
news:hvvurk$42e$1@news.eternal-september.org...
> wolfgang kern wrote:
> > Bernhard Schornak wrote:
> >> A working Win64 test program is available now:
> >
> >> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
> >
> > It's interesting that RDTSC need more cycles on faster CPUs :)
> > I may confirm your '71' because my phenom II x4 3GHz show
> > a RDTSC-latency of 72 cycles! while the 1.8GHz K8 make it
> > in 13 and my old K7 500MHz take 11 cycles.
> > My debuggers timing view zero-references itself, so I normally
> > don't see this values.
>
> I did not bother with CLI/STI, because these
> probably won't work (illegal in Ring 3), but
> a RDTSCP - save EAX/EDX - RDTSCP sequence on
> my Phenom II X4 takes 69-90 cycles. Test Toy
> generates (up to) 512 runs and you can watch
> how the 'calibration' sequences differ, even
> if they run one after the other, directly.
>
Is there another way to confirm or cross-check that the recent timings of
the "fastest logical not" are correct?
E.g., depending on the micro, there is the trap flag, debug registers,
performance counters, up to 7 different timers: DRAM, RTC, PIT, LAPIC, ACPI,
TSC, HPET, etc.
Rod Pemberton
Rod | 6/25/2010 7:32:46 AM
Rod Pemberton wrote:
> "Bernhard Schornak"<schornak@nospicedham.web.de> wrote in message
> news:hvvurk$42e$1@news.eternal-september.org...
>> wolfgang kern wrote:
>>> Bernhard Schornak wrote:
>>>> A working Win64 test program is available now:
>>>
>>>> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
>>>
>>> It's interesting that RDTSC need more cycles on faster CPUs :)
>>> I may confirm your '71' because my phenom II x4 3GHz show
>>> a RDTSC-latency of 72 cycles! while the 1.8GHz K8 make it
>>> in 13 and my old K7 500MHz take 11 cycles.
>>> My debuggers timing view zero-references itself, so I normally
>>> don't see this values.
>>
>> I did not bother with CLI/STI, because these
>> probably won't work (illegal in Ring 3), but
>> a RDTSCP - save EAX/EDX - RDTSCP sequence on
>> my Phenom II X4 takes 69-90 cycles. Test Toy
>> generates (up to) 512 runs and you can watch
>> how the 'calibration' sequences differ, even
>> if they run one after the other, directly.
>>
>
> Is there another way to confirm or cross-check that the recent timings of
> the "fastest logical not" are correct?
>
> E.g., depending on the micro, there is the trap flag, debug registers,
> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC, ACPI,
> TSC, HPET, etc.
This is not a problem of the timing method,
the tested code snippets are just too short
for reliable measurement.
Using another timing facility than RDTSC(P)
will throw the same results, I bet, because
recent processors optimise code after a few
iterations. If one and the same instruction
sequence is executed eight times (or more),
the processor's internal mechanisms analyse
the executed code and develop strategies to
optimise it.
If this is true, and I bet it is (I tried a
lot of things to stop the processor applying
these optimisations - but nothing worked as
expected), the test itself must be improved
to bypass internal optimisation completely.
It might be possible to 'outsource' testees
and run them in single threads, terminating
themselves after the test is done. Another
way is to call computation-intensive, time-
consuming functions between two tests, with
the potential to reset the optimisation for
the testee completely. But such a test pro-
gram surely is much larger than "Test Toy",
which, by the way, works very reliably with
real functions, e.g. qsort() and the like.
(That's what I wrote the original test pro-
gram for.)
The different timings for RDTSC(P) probably
are a Northbridge issue. If the bus is busy
at the moment, reading the MSR is delayed a
few cycles. Everything outside the core die
itself tends to be very slow (seen from the
core's side). Different readings in a range
of 22 (-3 through 19) cycles will influence
results of any test running 10 times faster
than the possible error range. Test results
only are reliable if the error becomes much
smaller than the probe. Most testees in our
case should execute in two or three cycles,
so the possible error of a two-cycle probe
ranges from -150 to +950 percent. For probes
running around (approximately) 1,000 cycles,
the possible error is in a range of -0.3 to
+1.9 percent, -0.03 through +0.19 percent at
10,000 cycles, and so on. While a thousand
percent is unacceptable, errors in a range
of two percent still are 'bad', but as well
sufficient for many applications (e.g. thin
film resistors with -ten- percent allowance
work reliably in many electronic devices).
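The percentages follow from dividing the timer deviation by the length of the probe. A small sketch of that arithmetic, assuming the -3 through +19 cycle spread stated above; the function and parameter names are made up for illustration.

```python
def error_range_percent(probe_cycles, low_dev=-3, high_dev=19):
    """Relative error bounds (in percent) for a probe of the given length.

    low_dev/high_dev are the deviations of the timer overhead in cycles,
    taken from the -3 through +19 spread quoted in the post.
    """
    return (100 * low_dev / probe_cycles, 100 * high_dev / probe_cycles)

for cycles in (2, 3, 1_000, 10_000):
    low, high = error_range_percent(cycles)
    print(f"{cycles:>6} cycles: {low:+.2f}% .. {high:+.2f}%")
```

The bounds shrink in proportion to the probe length, which is why only long-running probes give usable results with this timer.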
The most reliable timing in this case is to
add up the cycle counts from the processor's
manual. Possible parallel execution of more
than one instruction in multiple pipes must
be taken into account, not to speak of 'out
of order execution', register renaming plus
other optimisations I did not mention, done
by the processor itself. Putting it all to-
gether, all testees (except those accessing
memory) run in about two ("io-x"'s 2-liner)
or three (all other) cycles.
Greetings from Augsburg
Bernhard Schornak
Bernhard | 6/25/2010 4:44:16 PM
On Jun 25, 3:32 am, "Rod Pemberton"
<do_not_h...@nospicedham.notreplytome.cmm> wrote:
>
> Is there another way to confirm or cross-check that the recent timings of
> the "fastest logical not" are correct?
>
> E.g., depending on the micro, there is the trap flag, debug registers,
> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC, ACPI,
> TSC, HPET, etc.
>
Maybe one could do some degree of 'statistical analysis' on results
from repeated runs??
Here are variations I get from the earlier 'das/das2/Ultrano' version.
sans sudo:
######## Yodel (sort of) Linux Port version 0.1, 2010/03/20
## Calculating clockspeed...
(Your computer might temporarily appear frozen as process priority is
being boosted to level 99)
** The user permissions to boost priority are unavailable: the final
test results may be slightly less accurate. **
## Test parameters: 10000000 iterations.
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.081773427s = 13.051 cycles/iteration
das test --> 0.082592318s = 13.181 cycles/iteration
das2 test --> 0.083211258s = 13.280 cycles/iteration
Ultrano Test --> 0.215024121s = 34.317 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.080833281s = 12.900 cycles/iteration
das test --> 0.083477488s = 13.323 cycles/iteration
das2 test --> 0.084217729s = 13.441 cycles/iteration
Ultrano Test --> 0.219213575s = 34.986 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.082405860s = 13.151 cycles/iteration
das test --> 0.081782050s = 13.052 cycles/iteration
das2 test --> 0.082199485s = 13.119 cycles/iteration
Ultrano Test --> 0.213978173s = 34.150 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.082217903s = 13.121 cycles/iteration
das test --> 0.088886204s = 14.186 cycles/iteration
das2 test --> 0.082372189s = 13.146 cycles/iteration
Ultrano Test --> 0.214135332s = 34.175 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.086342861s = 13.780 cycles/iteration
das test --> 0.100088738s = 15.974 cycles/iteration
das2 test --> 0.078476199s = 12.524 cycles/iteration
Ultrano Test --> 0.210415225s = 33.582 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.084335369s = 13.459 cycles/iteration
das test --> 0.079954910s = 12.760 cycles/iteration
das2 test --> 0.080113530s = 12.786 cycles/iteration
Ultrano Test --> 0.212403613s = 33.899 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.084323653s = 13.458 cycles/iteration
das test --> 0.079817552s = 12.738 cycles/iteration
das2 test --> 0.080879553s = 12.908 cycles/iteration
Ultrano Test --> 0.212436703s = 33.904 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.082114144s = 13.105 cycles/iteration
das test --> 0.084075330s = 13.418 cycles/iteration
das2 test --> 0.083942376s = 13.397 cycles/iteration
Ultrano Test --> 0.214259397s = 34.195 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.084514220s = 13.488 cycles/iteration
das test --> 0.080000961s = 12.768 cycles/iteration
das2 test --> 0.080269816s = 12.811 cycles/iteration
Ultrano Test --> 0.212160187s = 33.860 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1596 MHz.
Reference Procedure timing took 0.086864584s = 13.863 cycles/iteration
das test --> 0.077403982s = 12.353 cycles/iteration
das2 test --> 0.077464181s = 12.363 cycles/iteration
Ultrano Test --> 0.209466279s = 33.430 cycles/iteration
Using sudo:
######## Yodel (sort of) Linux Port version 0.1, 2010/03/20
## Calculating clockspeed...
(Your computer might temporarily appear frozen as process priority is
being boosted to level 99)
## Test parameters: 10000000 iterations.
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.082299451s = 13.126 cycles/iteration
das test --> 0.082476202s = 13.154 cycles/iteration
das2 test --> 0.082849356s = 13.214 cycles/iteration
Ultrano Test --> 0.214532593s = 34.217 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.089095167s = 14.210 cycles/iteration
das test --> 0.075398747s = 12.026 cycles/iteration
das2 test --> 0.075862700s = 12.100 cycles/iteration
Ultrano Test --> 0.207347874s = 33.071 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.079252770s = 12.640 cycles/iteration
das test --> 0.085339172s = 13.611 cycles/iteration
das2 test --> 0.086168880s = 13.743 cycles/iteration
Ultrano Test --> 0.217180440s = 34.640 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.083508801s = 13.319 cycles/iteration
das test --> 0.080890676s = 12.902 cycles/iteration
das2 test --> 0.081974367s = 13.074 cycles/iteration
Ultrano Test --> 0.213102856s = 33.989 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.129113061s = 20.593 cycles/iteration
das test --> 0.036138283s = 5.764 cycles/iteration
das2 test --> 0.035283868s = 5.627 cycles/iteration
Ultrano Test --> 0.167439102s = 26.706 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.094187012s = 15.022 cycles/iteration
das test --> 0.070220487s = 11.200 cycles/iteration
das2 test --> 0.070306465s = 11.213 cycles/iteration
Ultrano Test --> 0.202331195s = 32.271 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.090354321s = 14.411 cycles/iteration
das test --> 0.073869279s = 11.782 cycles/iteration
das2 test --> 0.073973489s = 11.798 cycles/iteration
Ultrano Test --> 0.206065847s = 32.867 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.084351188s = 13.454 cycles/iteration
das test --> 0.080054496s = 12.768 cycles/iteration
das2 test --> 0.080156139s = 12.784 cycles/iteration
Ultrano Test --> 0.212111491s = 33.831 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.081051347s = 12.927 cycles/iteration
das test --> 0.083272439s = 13.281 cycles/iteration
das2 test --> 0.084582377s = 13.490 cycles/iteration
Ultrano Test --> 0.215958778s = 34.445 cycles/iteration
/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Processor @ 1595 MHz.
Reference Procedure timing took 0.090540797s = 14.441 cycles/iteration
das test --> 0.074023682s = 11.806 cycles/iteration
das2 test --> 0.074024962s = 11.806 cycles/iteration
Ultrano Test --> 0.206322661s = 32.908 cycles/iteration
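One possible form of the 'statistical analysis' suggested at the top of this post, sketched in Python with the standard `statistics` module. The values are the "das test" cycles/iteration from the ten runs without sudo; the `summarize` helper is a made-up name, and which statistic is the right one to report is left open.

```python
import statistics

# "das test" cycles/iteration from the ten runs without sudo, copied from the log.
das_cycles = [13.181, 13.323, 13.052, 14.186, 15.974,
              12.760, 12.738, 13.418, 12.768, 12.353]

def summarize(samples):
    """Return basic statistics for a list of per-run timings."""
    return {
        "runs": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }

summary = summarize(das_cycles)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

The minimum and the median react less to a single disturbed run (such as the 15.974 reading) than the mean does.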
Nathan.
Nathan | 6/26/2010 6:50:47 AM
Rod Pemberton asked:
>> wolfgang kern wrote:
>>> Bernhard Schornak wrote:
>>>> A working Win64 test program is available now:
>>>> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
>>> It's interesting that RDTSC need more cycles on faster CPUs :)
>>> I may confirm your '71' because my phenom II x4 3GHz show
>>> a RDTSC-latency of 72 cycles! while the 1.8GHz K8 make it
>>> in 13 and my old K7 500MHz take 11 cycles.
>>> My debuggers timing view zero-references itself, so I normally
>>> don't see this values.
>> I did not bother with CLI/STI, because these
>> probably won't work (illegal in Ring 3), but
>> a RDTSCP - save EAX/EDX - RDTSCP sequence on
>> my Phenom II X4 takes 69-90 cycles. Test Toy
>> generates (up to) 512 runs and you can watch
>> how the 'calibration' sequences differ, even
>> if they run one after the other, directly.
> Is there another way to confirm or cross-check that the recent timings of
> the "fastest logical not" are correct?
> E.g., depending on the micro, there is the trap flag, debug registers,
> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC,
> ACPI,
> TSC, HPET, etc.
This time the response took a while because I checked it myself...
RDTSC seems to be the only method fine-grained enough for timing
short code parts.
Your suggested clock sources:
DRAM (refresh timer?) if accessible at all, quite variable
     depending on the RAM installed, and much slower than the CPU.
RTC  1/1024 second sounds much too coarse.
PIT  pulling it down to the lowest possible value may reiterate IRQs
     on faster CPUs because the minimum PIT IRQ pulse duration is
     429.04875 ns.
     I use a PIT setting of ~1 ms in my debugger to show the 'real
     time' consumed by more complex (more than two lines :) functions.
(L)APIC seems to count in terms of frames/blocks/accessed...
ACPI is scheduled by something slower than CPU clocks for sure.
TSC  that's what we use.
HPT/HPET 'high precision timers' may exist (1 ns) on some
     chipsets; unfortunately (because not enough cycles are left to
     cover them) they can't produce IRQs.
     But this could be an alternative to reading the TSC, even though
     I think it will also show +-1 results added to the time for
     reading it.
I also checked the USB host frame counters:
not worth mentioning for fine-grained code snippet timing.
Depending on the CPU, we have the so-called performance counters
as MSRs. I think these are designed just for HLLs and Windoze,
because I couldn't find much sense in the info gained there.
I have already posted several times that the first run and the second
iteration are the main timings of concern.
Several (more than 2) iterations will show the already cached
behaviour (including CPU-internal prefetch optimisation).
In general we need both the caching time and the run time after prefetch
and, 'only' if it is a loop, also its timing for a repeated run.
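A small sketch of how per-iteration readings could be split into the three timings named above. The function name and the sample readings are hypothetical and serve only as illustration; they are not measured values.

```python
def split_phases(cycle_counts):
    """Split per-iteration timings into first run, second run, and the rest."""
    if len(cycle_counts) < 2:
        raise ValueError("need at least two iterations")
    return {
        "first": cycle_counts[0],       # includes the time to fill the caches
        "second": cycle_counts[1],      # code is cached, prefetch has kicked in
        "repeated": cycle_counts[2:],   # only meaningful if the code is a loop
    }

# Hypothetical readings for illustration only, not measured values.
phases = split_phases([240, 12, 9, 9, 10, 9])
print(phases)
```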
All these measurements are useless if not seen in a larger context:
* Short code can save on disk load and memory demand.
* Frequently used code should be optimised in terms of size "and" speed.
* Alignment of loop targets often helps, but sometimes a single NOP
  insertion can speed up the whole thing a lot.
The details look quite complex and are hard to predict during
code creation ... so better never trust a compiler on this.
__
wolfgang
wolfgang | 7/1/2010 6:40:34 PM
On 1 July, 19:40, "wolfgang kern" <nowh...@never.at> wrote:
....
> PIT   pull it down to lowest possible may reiterate IRQs on faster
>       CPUs because minimum PIT IRQ-pulse duration is 429.04875 nSec.
>       I use a PIT setting of ~1mS in my debugger to show 'real time'
>       consumed by more complex, more than two lines :) functions.
You mean PIT timer 0? AFAICS the problem with that for timing is that
it is as standard set to mode 3 (square wave) and while it can be read
(probably slowly) at any time it seems to iterate over the same
numbers *twice* before generating an interrupt.
In mode 3 it counts down in twos so you'll get
<<IRQ>>, 65534, 65532, ... 4, 2, 0, 65534, 65532, ... 4, 2, <<IRQ>>,
65534, etc
So the thing counts down twice for each interrupt using the same
values. Therefore it's useless for timing since although a counter can
be incremented in the interrupt a programmer doesn't know on reading
the timer whether it is counting down the first time or the second
time. The latch-timers command can report the state of the counter's
output pin - high on 1st half and low on second half - but that's
another write and read operation.
Maybe it's better setting it to mode 2. That way it counts down in 1s
and generates an IRQ when it reaches zero. It's not the IBM standard
mode 3 but some BIOSes set Timer 0 to mode 2 anyway.
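The ambiguity can be illustrated with a toy model of the counter readings. This is a simplified sketch that follows the description above, not a cycle-exact model of the timer chip; the function names and the small reload value are chosen only for illustration.

```python
from collections import Counter

def mode3_period(reload):
    """Readings over one IRQ period in mode 3 (square wave), simplified.

    The counter steps down by two and is reloaded at half-time, so the
    same readings occur once per half of the period.
    """
    half = list(range(reload - 2, -1, -2))  # reload-2, ..., 2, 0
    return half + half

def mode2_period(reload):
    """Readings over one IRQ period in mode 2 (rate generator), simplified."""
    return list(range(reload - 1, -1, -1))  # reload-1, ..., 1, 0

reload = 8  # small even reload value, chosen only to keep the output short
print("mode 3:", mode3_period(reload))
print("mode 2:", mode2_period(reload))
# In mode 3 every reading occurs twice per period, so a reading alone is ambiguous.
print("repeats in mode 3:", set(Counter(mode3_period(reload)).values()))
print("repeats in mode 2:", set(Counter(mode2_period(reload)).values()))
```

In this model a mode 2 reading identifies the position within the IRQ period uniquely, while a mode 3 reading needs the extra output-pin state to tell the two halves apart.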
cc. alt.os.development
James
James | 7/1/2010 8:39:33 PM
wolfgang kern wrote:
> Rod Pemberton asked:
>>> wolfgang kern wrote:
>>>> Bernhard Schornak wrote:
>>>>> A working Win64 test program is available now:
>
>>>>> http://st-open.eclipselabs.org.codespot.com/files/ttoy.zip
>
>>>> It's interesting that RDTSC need more cycles on faster CPUs :)
>>>> I may confirm your '71' because my phenom II x4 3GHz show
>>>> a RDTSC-latency of 72 cycles! while the 1.8GHz K8 make it
>>>> in 13 and my old K7 500MHz take 11 cycles.
>>>> My debuggers timing view zero-references itself, so I normally
>>>> don't see this values.
>
>>> I did not bother with CLI/STI, because these
>>> probably won't work (illegal in Ring 3), but
>>> a RDTSCP - save EAX/EDX - RDTSCP sequence on
>>> my Phenom II X4 takes 69-90 cycles. Test Toy
>>> generates (up to) 512 runs and you can watch
>>> how the 'calibration' sequences differ, even
>>> if they run one after the other, directly.
>
>> Is there another way to confirm or cross-check that the recent timings of
>> the "fastest logical not" are correct?
>
>> E.g., depending on the micro, there is the trap flag, debug registers,
>> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC,
>> ACPI,
>> TSC, HPET, etc.
>
> This time it took some time for response because I checked it myself...
> RDTSC seem to be the only enough fine granular methode for testing
> speed on short code-parts.
RDTSC(P) consumes 45 CPU clocks plus 16 NB clocks,
required for reading the TSC. My conclusions after
running Test Toy several hundreds of times: The 16
clocks NB traffic are not 16 CPU clocks and should
be treated with care.
RDTSC(P) runs in approximately 71 cycles - 71-45 =
26 ... ten cycles more than the assumed 61 cycles.
The variations observed in some thousand runs were
changing from 68 to 90, suggesting that NB traffic
influences the TSC reading returned in EAX/EDX. If
there's no traffic on the NB, RDTSC executes in 68
cycles. With heavy traffic, it is delayed to up to
90 cycles. NB traffic is not necessarily introdu-
ced by the core running the probe. Due to AMD's MP
design, all cores share the NB interface - even if
our core does not access NB at the moment, another
core might have issued a NB transaction when we're
going to read the TSC - we have to wait until that
transaction finished before we can read the TSC.
Hence, RDTSC(P) is unreliable for probing snippets
running in two or three cycles: The results *only*
depend on NB traffic. The probes must run at least
2 times longer than the 22 cycles variation to get
(somehow) reliable results. With code executing in
1,000 cycles and up, RDTSC(P) is the most reliable
method to get the exact runtime of a function. The
possible error is -.3/+1.9 percent @ 1,000 cycles,
-.03/+.19 percent @ 10,000 cycles, and so on.
Greetings from Augsburg
Bernhard Schornak
Bernhard | 7/1/2010 11:07:31 PM
Bernhard Schornak wrote:
....
> I wrote:
>> Rod asked:
>>> Is there another way to confirm or cross-check that the recent timings
>>> of
>>> the "fastest logical not" are correct?
>>> E.g., depending on the micro, there is the trap flag, debug registers,
>>> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC,
>>> ACPI,
>>> TSC, HPET, etc.
>> This time it took some time for response because I checked it myself...
>> RDTSC seem to be the only enough fine granular methode for testing
>> speed on short code-parts.
> RDTSC(P) consumes 45 CPU clocks plus 16 NB clocks,
> required for reading the TSC. My conclusions after
> running Test Toy several hundreds of times: The 16
> clocks NB traffic are not 16 CPU clocks and should
> be treated with care.
> RDTSC(P) runs in approximately 71 cycles - 71-45 =
> 26 ... ten cycles more than the assumed 61 cycles.
> The variations observed in some thousand runs were
> changing from 68 to 90, suggesting that NB traffic
> influences the TSC reading returned in EAX/EDX. If
> there's no traffic on the NB, RDTSC executes in 68
> cycles. With heavy traffic, it is delayed to up to
> 90 cycles. NB traffic is not neccessarily introdu-
> ced by the core running the probe. Due to AMD's MP
> design, all cores share the NB interface - even if
> our core does not access NB at the moment, another
> core might have issued a NB transaction when we're
> going to read the TSC - we have to wait until that
> transaction finished before we can read the TSC.
> Hence, RDTSC(P) is unreliable for probing snippets
> running in two or three cycles: The results *only*
> depend on NB traffic. The probes must run at least
> 2 times longer than the 22 cycles variation to get
> (somehow) reliable results. With code executing in
> 1,000 cycles and up, RDTSC(P) is the most reliable
> method to get the exact runtime of a function. The
> possible error is -.3/+1.9 percent @ 1,000 cycles,
> -.03/+.19 percent @ 10,000 cycles, and so on.
You said this before ...
I cannot see this huge variation in my environment.
My OS doesn't do anything behind my back. Only if I
don't disable interrupts do I see the timer0 IRQ every
millisecond, which is easy to spot during short
code tests because of the ~800 added cycles.
But right, we can always get a +1 cycle report with
RDTSC and also when reading 1 ns HPT counters.
So short code snippet timing becomes almost a 'guess
which is correct' issue :)
__
wolfgang
your line-size by term-choosing-alignment posts look nice :)
wolfgang | 7/2/2010 3:33:18 PM
wolfgang kern wrote:
> Bernhard Schornak wrote:
> ...
>> I wrote:
>>> Rod asked:
>
>>>> Is there another way to confirm or cross-check that the recent timings
>>>> of
>>>> the "fastest logical not" are correct?
>
>>>> E.g., depending on the micro, there is the trap flag, debug registers,
>>>> performance counters, upto 7 different timers: DRAM, RTC, PIT, LAPIC,
>>>> ACPI,
>>>> TSC, HPET, etc.
>
>>> This time it took some time for response because I checked it myself...
>>> RDTSC seem to be the only enough fine granular methode for testing
>>> speed on short code-parts.
>
>> RDTSC(P) consumes 45 CPU clocks plus 16 NB clocks,
>> required for reading the TSC. My conclusions after
>> running Test Toy several hundreds of times: The 16
>> clocks NB traffic are not 16 CPU clocks and should
>> be treated with care.
>
>> RDTSC(P) runs in approximately 71 cycles - 71-45 =
>> 26 ... ten cycles more than the assumed 61 cycles.
>> The variations observed in some thousand runs were
>> changing from 68 to 90, suggesting that NB traffic
>> influences the TSC reading returned in EAX/EDX. If
>> there's no traffic on the NB, RDTSC executes in 68
>> cycles. With heavy traffic, it is delayed to up to
>> 90 cycles. NB traffic is not neccessarily introdu-
>> ced by the core running the probe. Due to AMD's MP
>> design, all cores share the NB interface - even if
>> our core does not access NB at the moment, another
>> core might have issued a NB transaction when we're
>> going to read the TSC - we have to wait until that
>> transaction finished before we can read the TSC.
>
>> Hence, RDTSC(P) is unreliable for probing snippets
>> running in two or three cycles: The results *only*
>> depend on NB traffic. The probes must run at least
>> 2 times longer than the 22 cycles variation to get
>> (somehow) reliable results. With code executing in
>> 1,000 cycles and up, RDTSC(P) is the most reliable
>> method to get the exact runtime of a function. The
>> possible error is -.3/+1.9 percent @ 1,000 cycles,
>> -.03/+.19 percent @ 10,000 cycles, and so on.
>
> You said this before ...
> I cannot see this huge variation within my environment.
> My OS doesn't do anything behind my back, except that if I
> don't disable interrupts, I see the timer0 IRQ every
> millisecond, which is easy to spot during short
> code tests because of the ~800 added cycles.
>
> But right, we can always get a +1 cycle report with
> RDTSC, and also when reading 1 ns HPT counters.
> So timing short code snippets becomes almost a 'guess
> which is correct' issue :)
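A minimal sketch of how readings hit by the timer IRQ described above could be told apart from clean ones, assuming the cycle counts have already been collected into an array. The helper names and the threshold parameter are hypothetical and appear in neither poster's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest cycle count in a set of readings. */
uint64_t min_cycles(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; ++i) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Hypothetical helper: count readings that exceed the minimum by more
   than `threshold` cycles. With an IRQ adding several hundred cycles,
   such readings were most likely interrupted and can be discarded. */
size_t count_disturbed(const uint64_t *samples, size_t n, uint64_t threshold)
{
    uint64_t best = min_cycles(samples, n);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] - best > threshold)
            ++hits;
    }
    return hits;
}
```

With made-up readings of 74, 73, 875, 73 and 74 cycles and a threshold of 400, only the 875-cycle reading is flagged; the one-cycle wobble between 73 and 74 passes.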
I probably forgot to mention that Test Toy is
a Windoze 64 bit application. All Windoze(r)s
tend to do a lot of things all of the time.
With opened task manager, a look at the loads
of my four CPUs tells me: Only two are really
busy - one for my app, one for Windoze. It is
very likely that system activity causes those
variations. I assume you switch off the other
cores (or send them to sleep) while you run a
test. In my case, it's impossible to do that.
If I tried, Windoze terminated my application
immediately.
BTW: Former AMD processors had the TSC in the
core itself - no NB access required. On newer
processors, TSCs were outsourced to guarantee
proper timing in all power states. As a side-
effect, the NB transaction doesn't only delay
RDTSC(P), it introduces inaccuracy, as well.
About short snippets:
Compare it with having a 20 MHz oscilloscope,
but the input is 80 MHz, a Voltmeter from one
to 20 Volt to measure a 100 Microvolt signal,
and so on - the probes are simply outside the
range of those tools. Same applies to RDTSC.
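The range argument can be put into numbers. The sketch below relates the 22-cycle variation reported earlier in the thread (90 - 68) to the length of the probed code. The function name is hypothetical, and the result is kept in hundredths of a percent so that plain integer arithmetic suffices.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case spread of a reading, in hundredths of a percent of the
   probe length (220 means 2.20 percent).
   jitter_cycles: observed variation of RDTSC itself, e.g. 22
   probe_cycles:  runtime of the code under test, in cycles */
uint64_t spread_centipercent(uint64_t jitter_cycles, uint64_t probe_cycles)
{
    assert(probe_cycles > 0);
    return jitter_cycles * 10000u / probe_cycles;
}
```

For a 3-cycle snippet the jitter is more than 700 percent of the runtime; at 1,000 cycles it shrinks to 2.2 percent and at 10,000 cycles to 0.22 percent.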
> wolfgang
> your posts with lines aligned by word choice look nice :)
File it under "Spleens of Bernhard S."... ;)
Actually, it is my personal "footprint". It's
a good opportunity to enhance the vocabulary,
as well.
Have a nice weekend
Bernhard
|
Bernhard
|
7/2/2010 6:08:12 PM
|
|
Bernhard Schornak wrote:
.... what I read and snipped yet :)
[about TSC variations... +-1 vs. +-22]
CPUID 8000_0007 bit8 edx = 1 on my machine.
copied from 25481.pdf:
TscInvariant: 1 = The TSC rate is ensured to be invariant
across all P-States, C-States, and stop grant transitions
(such as STPCLK Throttling); therefore the TSC is suitable
for use as a source of time. 0 = No such guarantee is made
and software should avoid attempting to use the TSC as a
source of time.
It doesn't say anything about RDTSC(P) latency variation,
but could this be a cause for the difference in our results?
__
wolfgang
|
wolfgang
|
7/5/2010 9:00:14 AM
|
|
wolfgang kern wrote:
> Bernhard Schornak wrote:
>
> ... what I read and snipped yet :)
>
> [about TSC variations... +-1 vs. +-22]
>
> CPUID 8000_0007 bit8 edx = 1 on my machine.
>
> copied from 25481.pdf:
>
> TscInvariant: 1 = The TSC rate is ensured to be invariant
> across all P-States, C-States, and stop grant transitions
> (such as STPCLK Throttling); therefore the TSC is suitable
> for use as a source of time. 0 = No such guarantee is made
> and software should avoid attempting to use the TSC as a
> source of time.
>
> It doesn't say anything about RDTSC(P) latency variation,
> but could this be a cause for the difference in our results?
EDX is 0x01F9 (TSC updated with core clock).
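Read against the TscInvariant definition quoted above, 0x01F9 has bit 8 set. A minimal sketch of that decoding follows; `edx_bit` is a hypothetical helper, and on a real machine the value would come from executing CPUID with EAX = 0x80000007 rather than from a constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return bit `bit` (0..31) of a CPUID result
   register as 0 or 1. */
int edx_bit(uint32_t edx, unsigned bit)
{
    assert(bit < 32u);
    return (int)((edx >> bit) & 1u);
}
```

Calling `edx_bit(0x01F9, 8)` returns 1, so both machines in this exchange report TscInvariant.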
With my (not MP aware) OS/2, running on a dual core
Athlon, I got proper readings down to single clocks
with reproducible precision. In my new environment,
Windows 7 (64 bit), quad core Phenom II, results in
this range are not reliable any longer.
I doubt there's any way to determine execution time
of such tiny snippets in this "everything is multi"
environment. Too many different things interact and
access resources concurrently - no way to determine
what other cores are busy with while my testees are
running.
BTW: AVX is a next generation extension. You cannot
find it on any recent AMD or iNTEL processor. Won't
appear before 2011 => "Bulldozer", "Sandy Bridge".
Greetings from Augsburg
Bernhard
|
Bernhard
|
7/5/2010 12:32:41 PM
|
|