Initially I posted the problem below to a web forum at SandPile.org, and I
was redirected to a post in this group:
I don't think that post answered my questions, but I got a nice hint about
where the right place to ask such things might be (i.e. here).
Any help with the anomaly I discuss below will be highly appreciated!
My intent is to detect the number and type of functional units in a CPU
by running benchmark tests. Just for the sake of clarity I will talk in
"C" below, although, be assured, I looked at the assembly and it is what you
would expect.
If you repeat "r1+=r2" on an X GHz P4, you get 2X BIPS (billion instructions
per second), because the ALU is double-pumped. Now, if you repeat
"r1+=r3;r2+=r3", these are two independent instructions, which can go to the
two different ALUs (and the P4 has 2 double-pumped integer ALUs). So one would
expect a twofold improvement in BIPS (i.e. 4X BIPS). The cruel fact is that
you only get 3X BIPS. The reason, as far as I understand it now, is that the
trace cache can only pass 3 instructions per cycle to the pipe (as opposed
to the 4 we are trying...)
Everything looks fine so far, until you repeat the same experiment on
a PIII. The PIII also has 2 ALUs, but they are not double-pumped. When I try the
first experiment above on an X GHz PIII, I get X BIPS as expected. When I try
the second experiment I get 1.5X BIPS (instead of the expected 2X). I don't
really see where the problem is here! We are trying to sustain 2 integer
instructions per cycle and the PIII does not cope with it!
Please help with any ideas!