f



P4 non-deterministic timing?

Hi All,

I'm currently trying to improve some algorithms in my theorem prover
E, and was confused by some unexpected behavior (i.e. better
algorithm, worse run time, even for things that definitly just should
have cut out cycles). I then became suspicious, and rerun some tests:

LCL191-1+rm_eq_rstfp.tptp     T 1154.802  success
LCL191-1+rm_eq_rstfp.tptp     T 1167.091  success
LCL191-1+rm_eq_rstfp.tptp     T 1200.641  success
LCL191-1+rm_eq_rstfp.tptp     T 1209.716  success
LCL191-1+rm_eq_rstfp.tptp     T 1158.691  success


LCL030-1+rm_eq_rstfp.tptp     T 1772.909  success
LCL030-1+rm_eq_rstfp.tptp     T 1952.637  success
LCL030-1+rm_eq_rstfp.tptp     T 1829.534  success
LCL030-1+rm_eq_rstfp.tptp     T 1827.304  success
LCL030-1+rm_eq_rstfp.tptp     F 2000.000  maxtime

First column is the name of the proof problem, second status
(provable, not a theorem or failure) , third is CPU time in
100/518.546 seconds (don't ask). After 2000/518.546 seconds the
program was terminated.

All runs have been done with the same binary, on the same computer,
with exactly the same parameters. Notice that for the second case
the difference between fastest and slowest is definitely more than
10%. There was no significant other load on the machine.

uname -a:
Linux condor 2.4.18-686 #1 Sun Apr 14 11:32:47 EST 2002 i686 unknown
IntelR-PentiumR-4-CPU-2.40GHz-2405.526

Does anybody have an explanation? I suspect that even insignificant
programs can e.g. pollute the trace cache. But more than 10%
variation makes it very hard to use that machine for any kind of
measurements. 

Bye,

    Stephan

-- 
-------------------------- It can be done! ---------------------------------
   Please email me as schulz@informatik.tu-muenchen.de (Stephan Schulz)
----------------------------------------------------------------------------
0
schulz
7/4/2003 12:06:33 AM
comp.arch 7611 articles. 0 followers. carchreader (32) is leader. Post Follow

3 Replies
861 Views

Similar Articles

[PageSpeed] 32

Stephan Schulz <schulz@sunbroy2.informatik.tu-muenchen.de> wrote:
> Does anybody have an explanation? I suspect that even insignificant
> programs can e.g. pollute the trace cache. But more than 10%
> variation makes it very hard to use that machine for any kind of
> measurements. 

  If you google back a few years, you'll find some discussion of this 
phenomenon on the Pentium Pro.  The following loop:

loop:	add eax,eax
	add ebx,ebx
	add eax,eax
	add ebx,ebx
	add eax,eax
	add ebx,ebx
	dec ecx
	jnz loop

seemingly randomly enters one of four states, taking 4.0, 5.1, 5.29, or 5.32 
cycles per iteration (or something like that; it's been several years).  
Note that a greedy scheduler always takes 4.0 cycles per iteration.
  The most convincing explanantion I saw was from Andy Glew; he argued that 
this was a result of interactions between several circular buffers with 
different sizes; unfortunately, while convincing, his explanation did 
nothing to help predict how long a piece of code would take.
  Last I saw, Intel's VTune still predicts that the above loop will always 
take 4.0 cycles per iteration, so this is clearly something which is not 
well understood inside Intel either.

Colin Percival
0
Colin
7/4/2003 3:05:34 AM
In article <be2qtu$bqn$1@morgoth.sfu.ca>, Colin Andrew Percival wrote:
>   If you google back a few years, you'll find some discussion of this 
> phenomenon on the Pentium Pro.  The following loop:
> 
> loop:	add eax,eax
> 	add ebx,ebx
> 	add eax,eax
> 	add ebx,ebx
> 	add eax,eax
> 	add ebx,ebx
> 	dec ecx
> 	jnz loop
> 
> seemingly randomly enters one of four states, taking 4.0, 5.1, 5.29, or 5.32 
> cycles per iteration (or something like that; it's been several years).  

In this example I think the behavior is due to the fact that, in
the P6, the scheduler is not truly FIFO.  Andy Glew told me that
it was deemed too much of a critical path to do true FIFO scheduling,
and instead a heuristic was designed that is "mostly" FIFO.

What this means is that, in situations when two uops simultaneously
become ready (i.e. their input registers are available), the P6
scheduler does not *always* choose to issue the oldest of those
uops, although it usually does.  The above example has several
points where 2 uops could become ready simultaneously, and so can
exercise the non-FIFOness of the P6 scheduler.

I don't know how to characterize the exact way in which the P6
scheduler is non-FIFO.
0
Mike
7/4/2003 4:45:12 PM
In article <slrnbg9h85.jgf.schulz@sunbroy2.informatik.tu-muenchen.de>,
Stephan Schulz wrote:
>Hi All,
>
>I'm currently trying to improve some algorithms in my theorem prover
>E, and was confused by some unexpected behavior (i.e. better
>algorithm, worse run time, even for things that definitly just should
>have cut out cycles). I then became suspicious, and rerun some tests:
>
>LCL191-1+rm_eq_rstfp.tptp     T 1154.802  success
>LCL191-1+rm_eq_rstfp.tptp     T 1167.091  success
>LCL191-1+rm_eq_rstfp.tptp     T 1200.641  success
>LCL191-1+rm_eq_rstfp.tptp     T 1209.716  success
>LCL191-1+rm_eq_rstfp.tptp     T 1158.691  success
>
>
>LCL030-1+rm_eq_rstfp.tptp     T 1772.909  success
>LCL030-1+rm_eq_rstfp.tptp     T 1952.637  success
>LCL030-1+rm_eq_rstfp.tptp     T 1829.534  success
>LCL030-1+rm_eq_rstfp.tptp     T 1827.304  success
>LCL030-1+rm_eq_rstfp.tptp     F 2000.000  maxtime
>
>First column is the name of the proof problem, second status
>(provable, not a theorem or failure) , third is CPU time in
>100/518.546 seconds (don't ask). After 2000/518.546 seconds the
>program was terminated.
[Why the variations]

Sorry to follow-up to myself - of course the times I gave out are
wrong by a factor of 100, i.e. the unit is 2000/5.18546 s. We are
talking about approximately 400 seconds of runtime here., with nearly
a minute of variations in the worst case.

Thanks of all the suggestions I have received so far (including those
by email). Most suggest different behaviour of caches depending on
where the physical pages are allocated.

I'm still not convinced - I have not observed such differences on any
SPARC machine or on my Powerbook, although both use various levels of
caches. Variations of perhaps up to 1% are frequent, but 10-15% is
weird. 

Bye,

     Stephan

-- 
-------------------------- It can be done! ---------------------------------
   Please email me as schulz@informatik.tu-muenchen.de (Stephan Schulz)
----------------------------------------------------------------------------
0
schulz
7/5/2003 8:32:55 PM
Reply: