Hi All,

I'm currently trying to improve some algorithms in my theorem prover E,
and was confused by some unexpected behavior (i.e. better algorithm,
worse run time, even for changes that definitely should just have cut
out cycles). I then became suspicious and re-ran some tests:

LCL191-1+rm_eq_rstfp.tptp  T  1154.802  success
LCL191-1+rm_eq_rstfp.tptp  T  1167.091  success
LCL191-1+rm_eq_rstfp.tptp  T  1200.641  success
LCL191-1+rm_eq_rstfp.tptp  T  1209.716  success
LCL191-1+rm_eq_rstfp.tptp  T  1158.691  success

LCL030-1+rm_eq_rstfp.tptp  T  1772.909  success
LCL030-1+rm_eq_rstfp.tptp  T  1952.637  success
LCL030-1+rm_eq_rstfp.tptp  T  1829.534  success
LCL030-1+rm_eq_rstfp.tptp  T  1827.304  success
LCL030-1+rm_eq_rstfp.tptp  F  2000.000  maxtime

The first column is the name of the proof problem, the second the status
(provable, not a theorem, or failure), the third the CPU time in units of
100/518.546 seconds (don't ask). After 2000/518.546 seconds the program
was terminated.

All runs were done with the same binary, on the same computer, with
exactly the same parameters. Notice that for the second case the
difference between the fastest and slowest run is definitely more than
10%. There was no significant other load on the machine.

uname -a:
Linux condor 2.4.18-686 #1 Sun Apr 14 11:32:47 EST 2002 i686 unknown
Intel(R) Pentium(R) 4 CPU 2.40GHz (2405.526)

Does anybody have an explanation? I suspect that even insignificant
programs can e.g. pollute the trace cache. But more than 10% variation
makes it very hard to use that machine for any kind of measurements.

Bye,

Stephan

--
-------------------------- It can be done! ---------------------------------
   Please email me as schulz@informatik.tu-muenchen.de (Stephan Schulz)
----------------------------------------------------------------------------
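[Editorial note: the >10% claim can be sanity-checked directly from the
numbers in the post. A minimal sketch; the values are copied from the
table above, and since only relative spread matters the odd time unit
can be ignored.]

```python
# Timing values copied from the post above. The successful LCL030 runs
# exclude the final maxtime run. Units are the prover's own ticks, but
# the relative spread is unit-free.
lcl191 = [1154.802, 1167.091, 1200.641, 1209.716, 1158.691]
lcl030 = [1772.909, 1952.637, 1829.534, 1827.304]

def spread(ts):
    """Relative spread: (max - min) / min."""
    return (max(ts) - min(ts)) / min(ts)

print(f"LCL191 spread: {spread(lcl191):.1%}")   # about 4.8%
print(f"LCL030 spread: {spread(lcl030):.1%}")   # about 10.1%
```

So the second problem really does vary by just over 10% between its
fastest and slowest successful runs, while the first stays under 5%.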


7/4/2003 12:06:33 AM

Stephan Schulz <schulz@sunbroy2.informatik.tu-muenchen.de> wrote:
> Does anybody have an explanation? I suspect that even insignificant
> programs can e.g. pollute the trace cache. But more than 10%
> variation makes it very hard to use that machine for any kind of
> measurements.

If you google back a few years, you'll find some discussion of this
phenomenon on the Pentium Pro. The following loop:

loop:   add eax,eax
        add ebx,ebx
        add eax,eax
        add ebx,ebx
        add eax,eax
        add ebx,ebx
        dec ecx
        jnz loop

seemingly randomly enters one of four states, taking 4.0, 5.1, 5.29, or
5.32 cycles per iteration (or something like that; it's been several
years). Note that a greedy scheduler always takes 4.0 cycles per
iteration.

The most convincing explanation I saw was from Andy Glew; he argued that
this was a result of interactions between several circular buffers with
different sizes; unfortunately, while convincing, his explanation did
nothing to help predict how long a piece of code would take.

Last I saw, Intel's VTune still predicts that the above loop will always
take 4.0 cycles per iteration, so this is clearly something which is not
well understood inside Intel either.

Colin Percival
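[Editorial note: the "greedy scheduler always takes 4.0 cycles" claim
can be illustrated with a toy simulation. This is a sketch under
simplifying assumptions - every uop has 1-cycle latency, two uops issue
per cycle, the window is unbounded, and issue is strictly oldest-first -
not a model of the real P6 pipeline.]

```python
# Toy out-of-order scheduler simulation of the loop body above:
# six adds alternating between two independent register chains,
# plus dec/jnz. Greedy = always issue the oldest ready uops.

def build_uops(n_iters):
    # (dest, src) register pairs; "add r,r" reads and writes r.
    body = [('eax', 'eax'), ('ebx', 'ebx'),
            ('eax', 'eax'), ('ebx', 'ebx'),
            ('eax', 'eax'), ('ebx', 'ebx'),
            ('ecx', 'ecx'),      # dec ecx
            ('flags', 'ecx')]    # jnz: depends on the dec
    return body * n_iters

def simulate(n_iters, width=2, latency=1):
    uops = build_uops(n_iters)
    # Each uop depends on the most recent earlier writer of its source.
    deps, last_writer = [], {}
    for i, (dest, src) in enumerate(uops):
        deps.append(last_writer.get(src))
        last_writer[dest] = i
    done_at = [None] * len(uops)
    pending = set(range(len(uops)))
    cycle = 0
    while pending:
        ready = sorted(i for i in pending
                       if deps[i] is None
                       or (done_at[deps[i]] is not None
                           and done_at[deps[i]] <= cycle))
        for i in ready[:width]:      # greedy: oldest ready uops first
            done_at[i] = cycle + latency
            pending.discard(i)
        cycle += 1
    return cycle / n_iters           # cycles per iteration

print(simulate(200))                 # converges to ~4.0
```

With 8 uops per iteration and dual issue, 4.0 cycles per iteration is
the throughput bound, and each 3-add dependency chain only needs 3
cycles per iteration, so a strictly oldest-first scheduler achieves the
bound; the mystery in the post is why the real hardware sometimes does
not.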


7/4/2003 3:05:34 AM

In article <be2qtu$bqn$1@morgoth.sfu.ca>, Colin Andrew Percival wrote:
> If you google back a few years, you'll find some discussion of this
> phenomenon on the Pentium Pro. The following loop:
>
> loop:   add eax,eax
>         add ebx,ebx
>         add eax,eax
>         add ebx,ebx
>         add eax,eax
>         add ebx,ebx
>         dec ecx
>         jnz loop
>
> seemingly randomly enters one of four states, taking 4.0, 5.1, 5.29,
> or 5.32 cycles per iteration (or something like that; it's been
> several years).

In this example I think the behavior is due to the fact that, in the P6,
the scheduler is not truly FIFO. Andy Glew told me that it was deemed
too much of a critical path to do true FIFO scheduling, and instead a
heuristic was designed that is "mostly" FIFO. What this means is that,
in situations when two uops simultaneously become ready (i.e. their
input registers are available), the P6 scheduler does not *always*
choose to issue the oldest of those uops, although it usually does.

The above example has several points where two uops can become ready
simultaneously, and so it can exercise the non-FIFOness of the P6
scheduler. I don't know how to characterize the exact way in which the
P6 scheduler is non-FIFO.


7/4/2003 4:45:12 PM

In article <slrnbg9h85.jgf.schulz@sunbroy2.informatik.tu-muenchen.de>,
Stephan Schulz wrote:
>Hi All,
>
>I'm currently trying to improve some algorithms in my theorem prover
>E, and was confused by some unexpected behavior (i.e. better
>algorithm, worse run time, even for changes that definitely should
>just have cut out cycles). I then became suspicious and re-ran some
>tests:
>
>LCL191-1+rm_eq_rstfp.tptp  T  1154.802  success
>LCL191-1+rm_eq_rstfp.tptp  T  1167.091  success
>LCL191-1+rm_eq_rstfp.tptp  T  1200.641  success
>LCL191-1+rm_eq_rstfp.tptp  T  1209.716  success
>LCL191-1+rm_eq_rstfp.tptp  T  1158.691  success
>
>LCL030-1+rm_eq_rstfp.tptp  T  1772.909  success
>LCL030-1+rm_eq_rstfp.tptp  T  1952.637  success
>LCL030-1+rm_eq_rstfp.tptp  T  1829.534  success
>LCL030-1+rm_eq_rstfp.tptp  T  1827.304  success
>LCL030-1+rm_eq_rstfp.tptp  F  2000.000  maxtime
>
>The first column is the name of the proof problem, the second the
>status (provable, not a theorem, or failure), the third the CPU time
>in units of 100/518.546 seconds (don't ask). After 2000/518.546
>seconds the program was terminated.

[Why the variations]

Sorry to follow up to myself - of course the times I gave out are wrong
by a factor of 100, i.e. the unit is 2000/5.18546 s. We are talking
about approximately 400 seconds of runtime here, with nearly a minute of
variation in the worst case.

Thanks for all the suggestions I have received so far (including those
by email). Most suggest different behaviour of the caches depending on
where the physical pages are allocated. I'm still not convinced - I have
not observed such differences on any SPARC machine or on my PowerBook,
although both use various levels of caches. Variations of perhaps up to
1% are frequent, but 10-15% is weird.

Bye,

Stephan

--
-------------------------- It can be done! ---------------------------------
   Please email me as schulz@informatik.tu-muenchen.de (Stephan Schulz)
----------------------------------------------------------------------------
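[Editorial note: the "approximately 400 seconds" and "nearly a minute"
figures can be reproduced if the corrected unit is read as one tick =
1/5.18546 s (an assumption: taking "2000/5.18546 s" as the length of
the 2000-tick timeout, about 386 s).]

```python
# Hypothetical conversion of the posted LCL030 tick counts to seconds,
# assuming one tick = 1/5.18546 s. The tick size itself is an assumption
# inferred from the follow-up, not stated explicitly in the thread.
TICK = 1 / 5.18546                                 # seconds per tick (assumed)

ticks = [1772.909, 1952.637, 1829.534, 1827.304]   # successful LCL030 runs
secs = [t * TICK for t in ticks]

print(f"runtimes: {min(secs):.0f}-{max(secs):.0f} s")  # roughly 342-377 s
print(f"spread:   {max(secs) - min(secs):.0f} s")      # about 35 s
```

Under that reading, the successful runs span roughly 342-377 seconds,
and counting the timed-out run (2000 ticks, about 386 s) the spread
approaches the "nearly a minute" mentioned above.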


7/5/2003 8:32:55 PM