Hi all. I have a question about paging on the x86 CPU.
I know that changing the CR3 register doesn't affect global
pages, but what about non-global pages? Does code like this:
    mov eax,cr3    ; read the current page directory base
    mov cr3,eax    ; write the same value straight back
flush the TLB completely on a Pentium 1, 486, or 386 CPU?
What if CR3 is loaded with the same value by a TSS-based
task-switch instead of using the MOV instruction?
I tried writing a small program to figure this out:
http://my.execpc.com/~geezer/temp/tlb-test.asm
I get different cycle counts, but I can't tell if it's
due to a TLB flush or executing different instructions.
spamtrap, 11/23/2004 3:36:16 AM
On Tue, 23 Nov 2004 03:36:16 +0000 (UTC), spamtrap@crayne.org (Chris
Giese) wrote in comp.lang.asm.x86:
> Hi all. I have a question about paging on the x86 CPU.
>
> I know that changing the CR3 register doesn't affect global
> pages, but what about non-global pages? Does code like this:
> mov eax,cr3
> mov cr3,eax
>
> flush the TLB completely on a Pentium 1, 486, or 386 CPU?
>
> What if CR3 is loaded with the same value by a TSS-based
> task-switch instead of using the MOV instruction?
>
> I tried writing a small program to figure this out:
> http://my.execpc.com/~geezer/temp/tlb-test.asm
>
> I get different cycle counts, but I can't tell if it's
> due to a TLB flush or executing different instructions.
Note that CR3 is loaded not only when you deliberately write a value to
it; the CPU also loads it from the TSS automatically during a task
switch.
In the Intel manuals for the 386, it specifically stated that loading
CR3 with the same value it already held would not flush the TLB. This
statement was omitted from the 486 and, as far as I know, all later
Intel x86 processor manuals.
When I was developing a custom 486 board for an embedded system 8 or
so years ago, I asked this question about the 486 in several places, like
comp.arch.embedded, maybe comp.sys.intel, and on the support groups
that Intel used to have on their web site (I think they discontinued
them since).
Someone with a heavy-duty Intel emulator for the 486 checked and
verified that reloading CR3 with the same value does not flush the TLB
on the 486. In the end it made no difference to my project, because
we decided not to enable paging anyway.
As for later processors, I would think it would be the same, but I
don't know. If it is not in the Intel manuals for the specific
processors, you might get a reply through their web site tech support,
if they still do that.
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
Jack, 11/23/2004 4:11:37 AM
spamtrap@crayne.org (Chris Giese) wrote:
> Hi all. I have a question about paging on the x86 CPU.
> I know that changing the CR3 register doesn't affect global
> pages, but what about non-global pages? Does code like this:
> mov eax,cr3
> mov cr3,eax
> flush the TLB completely on a Pentium 1, 486, or 386 CPU?
> What if CR3 is loaded with the same value by a TSS-based
> task-switch instead of using the MOV instruction?
> I tried writing a small program to figure this out:
> http://my.execpc.com/~geezer/temp/tlb-test.asm
> I get different cycle counts, but I can't tell if it's
> due to a TLB flush or executing different instructions.
Since the i386 didn't have an instruction to flush individual TLB entries,
the only way to flush the TLB after a page table update was to reload the
PDBR (CR3).
The i486 and later CPUs have a specific instruction for this (INVLPG), but
flushing page directory entries is still done this way.
Viktor
ps: from the ia32 manual:
These instructions have the following side effect:
When writing to control register CR3, all non-global TLB entries are
flushed (see "Translation Lookaside Buffers (TLBs)" in Chapter 3 of
the IA-32 Intel Architecture Software Developer’s Manual, Volume 3).
spamtrap, 11/23/2004 12:20:21 PM
I made some changes to my test program. Here's the new version:
http://my.execpc.com/~geezer/temp/tlbtest2.zip
This version actually touches each of the 1024 pages in the
bottom 4 MB of memory after changing (or not) CR3. It also uses
the timer chip for timing instead of RDTSC, so it should work
even on a 386 system. Lastly, you can now run this program from
a FAT12 bootsector (included) if you don't have DOS.
Results of three runs on a 120 MHz Pentium-1 CPU:
After loading CR3 with different value:  2598  2588  2602
After loading CR3 with same value:       2220  2226  2214
After doing nothing with CR3:            2208  2222  2220
Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
same value does NOT flush the TLB.
Results from a 25 MHz 486SX system are more baffling:
After loading CR3 with different value:  5330  5330  5330
After loading CR3 with same value:       3438  3436  3440
After doing nothing with CR3:            4010  4012  4016
Why does "doing nothing" take LONGER than reloading CR3?
Maybe a bug in my code? Some weirdness in the 486 TLB?
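One way to read these numbers is to average the three runs and express each case relative to the "doing nothing" baseline. The sketch below is only arithmetic on the figures quoted above; the dictionary layout and the function name `relative_to_baseline` are made up for illustration, and the units are whatever the timer-based test program reports.

```python
from statistics import mean

# Timer counts copied from the results above (three runs per case).
RESULTS = {
    "Pentium 120": {
        "different": [2598, 2588, 2602],
        "same": [2220, 2226, 2214],
        "nothing": [2208, 2222, 2220],
    },
    "486SX 25": {
        "different": [5330, 5330, 5330],
        "same": [3438, 3436, 3440],
        "nothing": [4010, 4012, 4016],
    },
}

def relative_to_baseline(runs):
    """Return each case's mean divided by the mean of the 'nothing' case."""
    baseline = mean(runs["nothing"])
    return {case: mean(values) / baseline for case, values in runs.items()}

if __name__ == "__main__":
    for cpu, runs in RESULTS.items():
        ratios = relative_to_baseline(runs)
        print(cpu, {case: round(r, 3) for case, r in ratios.items()})
```

On the Pentium the "same" case lands within about 0.2% of the baseline, while on the 486SX it comes out roughly 14% below it, which is the anomaly asked about above.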
spamtrap, 11/24/2004 2:39:07 AM
spamtrap@crayne.org (Chris Giese) wrote in message news:<41a3f0bc.1264685@news.voyager.net>...
> I made some changes to my test program. Here's the new version:
> http://my.execpc.com/~geezer/temp/tlbtest2.zip
>
> This version actually touches each of the 1024 pages in the
> bottom 4 MB of memory after changing (or not) CR3. It also uses
> the timer chip for timing instead of RDTSC, so it should work
> even on a 386 system. Lastly, you can now run this program from
> a FAT12 bootsector (included) if you don't have DOS.
>
> Results of three runs on a 120 MHz Pentium-1 CPU:
>
> After loading CR3 with different value: 2598 2588 2602
> After loading CR3 with same value: 2220 2226 2214
> After doing nothing with CR3: 2208 2222 2220
>
> Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
> same value does NOT flush the TLB.
>
> Results from a 25 MHz 486SX system are more baffling:
>
> After loading CR3 with different value: 5330 5330 5330
> After loading CR3 with same value: 3438 3436 3440
> After doing nothing with CR3: 4010 4012 4016
>
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?
For comparison, I ran your code on my system starting from a power off
state with the floppy in A: -->
Results of three runs on a 3 GHz Pentium-4 HT CPU:
After loading CR3 with different value:  1914  1914  1914
After loading CR3 with same value:        668   668   668
After doing nothing with CR3:             668   668   668
Remarkably consistent. Did you do each of your runs from a power-off
state? I suppose you had to.
spamtrap, 11/25/2004 2:42:36 AM
Chris Giese wrote:
>
> I made some changes to my test program. Here's the new version:
> http://my.execpc.com/~geezer/temp/tlbtest2.zip
.....
> Results from a 25 MHz 486SX system are more baffling:
>
> After loading CR3 with different value: 5330 5330 5330
> After loading CR3 with same value: 3438 3436 3440
> After doing nothing with CR3: 4010 4012 4016
>
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?
Dunno. I ran several tests on a K6-300...
different: 2002
same: 1732
nothing: 1730
..... that's a rough "average"... but I hadda throw out a value! On my
third run from dos, the "same" value came out as 4294903490... !!! This
unusual value appears to be 1730 plus FFFF0000h... I suspect a glitch in
the code, but haven't spotted it. Is "not ebx" going to do the right
thing if there's a "rollover" in the 16-bit value??? Can't think!
I couldn't get it to repeat this strange behavior after several runs -
both from dos and boot - values were otherwise always close to those
above...
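The odd value is at least consistent with a 16-bit timer difference computed in a 32-bit register without masking: 4294903490 is exactly FFFF0000h + 1730. The sketch below reproduces that arithmetic in Python. It assumes a down-counting 16-bit counter and uses invented readings (`start`, `end`), since the actual code in tlbtest2.zip is not shown here, so it illustrates the suspected failure mode rather than diagnosing the real bug.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def elapsed_unmasked(start, end):
    """Difference of two 16-bit down-counter readings kept as a 32-bit value."""
    return (start - end) & MASK32

def elapsed_masked(start, end):
    """Same difference truncated to 16 bits, which survives one counter wrap."""
    return (start - end) & MASK16

# Hypothetical readings: the counter started at 0x0100 and wrapped past zero
# during the measurement, so the second reading is numerically larger.
start, end = 0x0100, 0xFA3E

if __name__ == "__main__":
    print(elapsed_unmasked(start, end))  # 4294903490
    print(elapsed_masked(start, end))    # 1730
```

If the program does its subtraction in a 32-bit register, masking the result to 16 bits (or doing the subtraction in a 16-bit register) would hide a single wrap of the counter.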
Happy Huntin',
Frank
Frank, 11/25/2004 6:18:25 AM
"Chris Giese" <spamtrap@crayne.org> wrote in message
news:41a3f0bc.1264685@news.voyager.net...
>I made some changes to my test program. Here's the new version:
> http://my.execpc.com/~geezer/temp/tlbtest2.zip
>
> This version actually touches each of the 1024 pages in the
> bottom 4 MB of memory after changing (or not) CR3. It also uses
> the timer chip for timing instead of RDTSC, so it should work
> even on a 386 system. Lastly, you can now run this program from
> a FAT12 bootsector (included) if you don't have DOS.
>
> Results of three runs on a 120 MHz Pentium-1 CPU:
>
> After loading CR3 with different value: 2598 2588 2602
> After loading CR3 with same value: 2220 2226 2214
> After doing nothing with CR3: 2208 2222 2220
>
> Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
> same value does NOT flush the TLB.
>
> Results from a 25 MHz 486SX system are more baffling:
>
> After loading CR3 with different value: 5330 5330 5330
> After loading CR3 with same value: 3438 3436 3440
> After doing nothing with CR3: 4010 4012 4016
>
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?
I would like to know why you see any difference at all between runs. They
should all miss 99-100% of the time. The TLB caches of modern chips can hold
about 32 entries -- much smaller than 1,024 pages. Since the 1,024 pages do
not fit in the 32-entry TLB, you're going to miss every single time. I don't
know how big the TLB caches on older chips were, but my guess is that they
were smaller than the 32-entry caches on P4 and K7. (As memory serves, K7
actually has 40 entries and Intel's chips have even more; however, the
additional entries are used for 2M/4M pages. My memory of the Pentium's TLB
cache size is somewhat vague, though I recall it being marginally bigger
than K7.)
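The size mismatch can be put in numbers. This is a back-of-the-envelope sketch that takes the 32-entry figure above at face value and assumes 4 KB pages; neither number has been checked against a specific CPU's data sheet.

```python
PAGE_SIZE = 4 * 1024          # bytes per page (assumed 4 KB pages)
TLB_ENTRIES = 32              # figure quoted above, not verified per CPU
PAGES_TOUCHED = 1024          # pages walked by the test program

tlb_reach = TLB_ENTRIES * PAGE_SIZE          # memory covered by a full TLB
working_set = PAGES_TOUCHED * PAGE_SIZE      # memory the test walks

if __name__ == "__main__":
    print(tlb_reach // 1024, "KB covered by the TLB")
    print(working_set // 1024, "KB touched by the test")
    print(working_set // tlb_reach, "times larger than the TLB reach")
```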
The real question to be asking, however, is how would software possibly
update the page tables if the CPU doesn't semantically force a page table
reload every time you modify the register? You would have to copy the page
directory to an entirely new page directory every time to force the update.
The only other options are hardware-based: either the MMU snoops writes to
entries in its cache, or the MMU actually does flush entries.
I'm all but positive that software generally assumes a write to cr3 will
flush regardless of the value being written, so either the MMU is snooping
or it's flushing.
-Matt
Matt, 11/25/2004 7:22:55 PM
Matt's questions are very interesting:
> I would like to know why you see any difference at all between runs. They
> should all miss 99-100% of the time. The TLB caches of modern chips can hold
> about 32 entries -- much smaller than 1,024 pages. Since the 1,024 pages do
> not fit in the 32-entry TLB, you're going to miss every single time.
It means that the provided utility only shows that a write to CR3
behaves very differently when the value is simply read and written back
than when it is read, modified, and written.
The results of the utility suggest that YES, it does something very
different - probably a TLB flush, possibly a whole lot of other
flushes as well.
The manual (24547108) also specifies explicitly that:
When writing to CR3 all non-global TLB entries are flushed.
(You should make sure you're running with CR4.PGE disabled).
You should run your program on a variable number of pages
(8/16/32/etc.), and check whether the time varies linearly with the
page count in each of the three cases (no write, r/w, r/m/w).
In the 'no CR3 write' case you should see a non-linear time curve as
you exceed the size of the TLB.
You can use the same change in slope to check if the r/w and r/m/w
cases flush the TLB or not. If the graph is linear irrespective of the
number of TLB entries, then the TLB was flushed. If the graph is
non-linear (it should be similar to the 'no write' case), then the TLB
was not flushed.
In the case of the r/m/w operation, the graph must be linear.
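The experiment described above can be previewed with a toy model before touching real hardware. The sketch below simulates a fully associative TLB with LRU replacement and counts misses on a second pass over the pages, with and without a flush in between. The 32-entry size, the LRU policy, and the function name `second_pass_misses` are assumptions for illustration; real TLBs are usually set-associative with other replacement policies, so only the shape of the curve carries over.

```python
from collections import OrderedDict

def second_pass_misses(pages, tlb_entries=32, flush=False):
    """Touch `pages` pages twice and count TLB misses on the second pass."""
    tlb = OrderedDict()  # page number -> True, ordered oldest to newest

    def touch(page):
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
            return 0
        if len(tlb) >= tlb_entries:
            tlb.popitem(last=False)      # miss on a full TLB: evict the LRU entry
        tlb[page] = True
        return 1

    for page in range(pages):            # warm-up pass
        touch(page)
    if flush:
        tlb.clear()                      # models a CR3 write that flushes
    return sum(touch(page) for page in range(pages))

if __name__ == "__main__":
    for pages in (8, 16, 32, 64, 128):
        print(pages,
              second_pass_misses(pages, flush=False),
              second_pass_misses(pages, flush=True))
```

In this model the miss count without a flush stays at zero until the page count exceeds the TLB size and then jumps, while with a flush it grows in step with the page count from the start.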
Hope I didn't get too ecstatic here...
e
spamtrap, 11/28/2004 10:15:23 PM