Does reloading CR3 with the same value flush the TLB?


Hi all. I have a question about paging on the x86 CPU.

I know that changing the CR3 register doesn't affect global
pages, but what about non-global pages? Does code like this:
	mov eax,cr3
	mov cr3,eax

flush the TLB completely on a Pentium 1, 486, or 386 CPU?

What if CR3 is loaded with the same value by a TSS-based
task-switch instead of using the MOV instruction?

I tried writing a small program to figure this out:
	http://my.execpc.com/~geezer/temp/tlb-test.asm

I get different cycle counts, but I can't tell if it's
due to a TLB flush or executing different instructions.

Reply spamtrap 11/23/2004 3:36:16 AM

On Tue, 23 Nov 2004 03:36:16 +0000 (UTC), spamtrap@crayne.org (Chris
Giese) wrote in comp.lang.asm.x86:

> Hi all. I have a question about paging on the x86 CPU.
> 
> I know that changing the CR3 register doesn't affect global
> pages, but what about non-global pages? Does code like this:
> 	mov eax,cr3
> 	mov cr3,eax
> 
> flush the TLB completely on a Pentium 1, 486, or 386 CPU?
> 
> What if CR3 is loaded with the same value by a TSS-based
> task-switch instead of using the MOV instruction?
> 
> I tried writing a small program to figure this out:
> 	http://my.execpc.com/~geezer/temp/tlb-test.asm
> 
> I get different cycle counts, but I can't tell if it's
> due to a TLB flush or executing different instructions.

Note that CR3 is loaded not only when you deliberately write a value
to it; it is also loaded automatically from the TSS during a task
switch.

In the Intel manuals for the 386, it specifically stated that loading
CR3 with the same value it already held would not flush the TLB.  This
statement was omitted from the 486 and, as far as I know, all later
Intel x86 processor manuals.

When I was developing a custom 486 board for an embedded system 8 or
so years ago, I asked this question about the 486 in several places:
comp.arch.embedded, maybe comp.sys.intel, and the support groups
that Intel used to have on their web site (I think they have since
discontinued them).

Someone with a heavy-duty Intel emulator for the 486 checked and
verified that reloading CR3 with the same value does not flush the TLB
on the 486.  In the end it made no difference to my project, because
we decided not to enable paging anyway.

As for later processors, I would think it would be the same, but I
don't know.  If it is not in the Intel manuals for the specific
processors, you might get a reply through their web site tech support,
if they still do that.

-- 
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html

Reply Jack 11/23/2004 4:11:37 AM


spamtrap@crayne.org (Chris Giese) wrote:
> Hi all. I have a question about paging on the x86 CPU.
> I know that changing the CR3 register doesn't affect global
> pages, but what about non-global pages? Does code like this:
> 	mov eax,cr3
> 	mov cr3,eax
> flush the TLB completely on a Pentium 1, 486, or 386 CPU?
> What if CR3 is loaded with the same value by a TSS-based
> task-switch instead of using the MOV instruction?
> I tried writing a small program to figure this out:
> 	http://my.execpc.com/~geezer/temp/tlb-test.asm
> I get different cycle counts, but I can't tell if it's
> due to a TLB flush or executing different instructions.

 Since the i386 had no instruction for flushing a single TLB entry, the
 only way to flush the TLB after a page table update was to reload the
 PDBR (CR3).

 The i486 and later CPUs have a specific instruction for this (INVLPG),
 but flushing page directory entries is still done this way.

  Viktor

ps: from the ia32 manual:
 These instructions have the following side effect:
  When writing to control register CR3, all non-global TLB entries are
  flushed (see "Translation Lookaside Buffers (TLBs)" in Chapter 3 of
  the IA-32 Intel Architecture Software Developer’s Manual, Volume 3).

Reply spamtrap 11/23/2004 12:20:21 PM

I made some changes to my test program. Here's the new version:
        http://my.execpc.com/~geezer/temp/tlbtest2.zip

This version actually touches each of the 1024 pages in the
bottom 4 MB of memory after changing (or not) CR3. It also uses
the timer chip for timing instead of RDTSC, so it should work
even on a 386 system. Lastly, you can now run this program from
a FAT12 bootsector (included) if you don't have DOS.

Results of three runs on a 120 MHz Pentium-1 CPU:

After loading CR3 with different value: 2598    2588    2602
After loading CR3 with same value:      2220    2226    2214
After doing nothing with CR3:           2208    2222    2220

Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
same value does NOT flush the TLB.

Results from a 25 MHz 486SX system are more baffling:

After loading CR3 with different value: 5330    5330    5330
After loading CR3 with same value:      3438    3436    3440
After doing nothing with CR3:           4010    4012    4016

Why does "doing nothing" take LONGER than reloading CR3?
Maybe a bug in my code? Some weirdness in the 486 TLB?

Reply spamtrap 11/24/2004 2:39:07 AM

spamtrap@crayne.org (Chris Giese) wrote in message news:<41a3f0bc.1264685@news.voyager.net>...
> I made some changes to my test program. Here's the new version:
>         http://my.execpc.com/~geezer/temp/tlbtest2.zip
> 
> This version actually touches each of the 1024 pages in the
> bottom 4 MB of memory after changing (or not) CR3. It also uses
> the timer chip for timing instead of RDTSC, so it should work
> even on a 386 system. Lastly, you can now run this program from
> a FAT12 bootsector (included) if you don't have DOS.
> 
> Results of three runs on a 120 MHz Pentium-1 CPU:
> 
> After loading CR3 with different value: 2598    2588    2602
> After loading CR3 with same value:      2220    2226    2214
> After doing nothing with CR3:           2208    2222    2220
> 
> Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
> same value does NOT flush the TLB.
> 
> Results from a 25 MHz 486SX system are more baffling:
> 
> After loading CR3 with different value: 5330    5330    5330
> After loading CR3 with same value:      3438    3436    3440
> After doing nothing with CR3:           4010    4012    4016
> 
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?

For comparison, I ran your code on my system starting from a power off
state with the floppy in A: -->

 Results of three runs on a 3 GHz Pentium-4 HT CPU:
 
 After loading CR3 with different value: 1914    1914    1914
 After loading CR3 with same value:       668     668     668
 After doing nothing with CR3:            668     668     668

Remarkably consistent. Did you do each of your runs from a power-off
state? I suppose you had to.

Reply spamtrap 11/25/2004 2:42:36 AM

Chris Giese wrote:
> 
> I made some changes to my test program. Here's the new version:
>         http://my.execpc.com/~geezer/temp/tlbtest2.zip

.....
> Results from a 25 MHz 486SX system are more baffling:
> 
> After loading CR3 with different value: 5330    5330    5330
> After loading CR3 with same value:      3438    3436    3440
> After doing nothing with CR3:           4010    4012    4016
> 
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?

Dunno. I ran several tests on a K6-300...

different: 2002
same:      1732
nothing:   1730

..... that's a rough "average"... but I hadda throw out a value! On my
third run from dos, the "same" value came out as 4294903490... !!! This
unusual value appears to be 1730 plus FFFF0000h... I suspect a glitch in
the code, but haven't spotted it. Is "not ebx" going to do the right
thing if there's a "rollover" in the 16-bit value??? Can't think!

I couldn't get it to repeat this strange behavior after several runs -
both from dos and boot - values were otherwise always close to those
above...

Happy Huntin',
Frank

Reply Frank 11/25/2004 6:18:25 AM
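
On the 4294903490 reading Frank reports above: the arithmetic is consistent
with a 16-bit timer count being subtracted in a 32-bit register without
truncating the result to 16 bits. The source of tlbtest2 is not shown in this
thread, so the following is only a sketch of that one possible cause, not a
diagnosis. It assumes a 16-bit down-counter (as on the PC's 8253/8254 timer
chip); the start value of 1000 and the helper names are made up for
illustration.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def elapsed_unmasked(start, end):
    """start - end as a 32-bit register would hold it, with no 16-bit truncation."""
    return (start - end) & MASK32

def elapsed_masked(start, end):
    """The same subtraction truncated to the 16-bit width of the counter."""
    return (start - end) & MASK16

# Hypothetical readings: the counter counts DOWN, so after 1730 ticks it has
# passed zero and wrapped around to a value larger than the start reading.
start = 1000
end = (start - 1730) & MASK16

print(end)                            # 64806
print(elapsed_unmasked(start, end))   # 4294903490, i.e. 1730 + 0xFFFF0000
print(elapsed_masked(start, end))     # 1730
```

If this is what happened, the glitch would only show up on runs where the
counter happens to wrap between the two readings, which would fit Frank's
observation that it was hard to reproduce.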

"Chris Giese" <spamtrap@crayne.org> wrote in message 
news:41a3f0bc.1264685@news.voyager.net...
>I made some changes to my test program. Here's the new version:
>        http://my.execpc.com/~geezer/temp/tlbtest2.zip
>
> This version actually touches each of the 1024 pages in the
> bottom 4 MB of memory after changing (or not) CR3. It also uses
> the timer chip for timing instead of RDTSC, so it should work
> even on a 386 system. Lastly, you can now run this program from
> a FAT12 bootsector (included) if you don't have DOS.
>
> Results of three runs on a 120 MHz Pentium-1 CPU:
>
> After loading CR3 with different value: 2598    2588    2602
> After loading CR3 with same value:      2220    2226    2214
> After doing nothing with CR3:           2208    2222    2220
>
> Conclusion: for a Pentium 1 CPU, using MOV to load CR3 with the
> same value does NOT flush the TLB.
>
> Results from a 25 MHz 486SX system are more baffling:
>
> After loading CR3 with different value: 5330    5330    5330
> After loading CR3 with same value:      3438    3436    3440
> After doing nothing with CR3:           4010    4012    4016
>
> Why does "doing nothing" take LONGER than reloading CR3?
> Maybe a bug in my code? Some weirdness in the 486 TLB?

I would like to know why you see any difference at all between runs. They 
should all miss 99-100% of the time. The TLB caches of modern chips can hold 
about 32 entries -- much smaller than 1,024 pages. Since the 1,024 pages do 
not fit in the 32-entry TLB, you're going to miss every single time. I don't 
know how big the TLB caches on older chips were, but my guess is that they 
were smaller than the 32-entry caches on P4 and K7. (If memory serves, K7 
actually has 40 entries and Intel's chips have even more; however, the 
additional entries are used for 2M/4M pages. My memory of the Pentium's TLB 
cache size is somewhat vague, though I recall it being marginally bigger 
than K7.)

The real question to be asking, however, is how would software possibly 
update the page tables if the CPU doesn't semantically force a page table 
reload every time you modify the register? You would have to copy the page 
directory to an entirely new page directory every time to force the update. 
The only other options are hardware-based: either the MMU snoops writes to 
entries in its cache, or the MMU actually does flush entries.

I'm all but positive that software generally assumes a write to cr3 will 
flush regardless of the value being written, so either the MMU is snooping 
or it's flushing.

-Matt 

Reply Matt 11/25/2004 7:22:55 PM

Matt's questions are very interesting:

> I would like to know why you see any difference at all between runs. They 
> should all miss 99-100% of the time. The TLB caches of modern chips can hold 
> about 32 entries -- much smaller than 1,024 pages. Since the 1,024 pages do 
> not fit in the 32-entry TLB, you're going to miss every single time.

It means that the provided utility only checks whether a write to CR3
behaves differently for a simple read/write (same value) than for a
read/modify/write (different value).
The results of the utility suggest that YES, the read/modify/write
does something very different: probably a TLB flush, possibly a whole
lot of other flushes as well.

The manual (24547108) also specifies explicitly that:
When writing to CR3 all non-global TLB entries are flushed.
(You should make sure you're running with CR4.PGE disabled).

You should run your program without CR3 writes on a variable number of
pages (8/16/32/etc.), and check if performance varies linearly in each
of the three cases (no write, r/w, r/m/w).

In the 'no CR3 write' case you should see a non-linear time curve as
you exceed the size of the TLB.
You can use the same change in slope to check if the r/w and r/m/w
cases flush the TLB or not. If the graph is linear irrespective of the
number of TLB entries, then the TLB was flushed. If the graph is
non-linear (should be similar to the 'no write' case), then the TLB is
not flushed.
In the case of the r/m/w operation, the graph must be linear.

Hope I didn't get too ecstatic here...

 e
Reply spamtrap 11/28/2004 10:15:23 PM
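
The page-count sweep proposed in the last post, and Matt's capacity argument
before it, can be illustrated with a toy model. This is a sketch under strong
assumptions: a fully associative 32-entry TLB with LRU replacement, which is
not how the real 486, Pentium or P4 TLBs are organized, and it counts misses
only, not time. The function name and parameters are hypothetical.

```python
from collections import OrderedDict

def count_misses(num_pages, tlb_entries=32, flush_before_final=False, passes=2):
    """Count TLB misses during the final sequential sweep over num_pages.

    Earlier passes only warm the TLB. Toy model: fully associative, LRU.
    """
    tlb = OrderedDict()
    misses = 0
    for p in range(passes):
        final = (p == passes - 1)
        if final and flush_before_final:
            tlb.clear()                      # models a CR3 write that really flushes
        for page in range(num_pages):
            if page in tlb:
                tlb.move_to_end(page)        # hit: mark as most recently used
            else:
                if final:
                    misses += 1
                tlb[page] = True
                if len(tlb) > tlb_entries:
                    tlb.popitem(last=False)  # evict the least recently used entry
    return misses

for n in (8, 16, 32, 64, 1024):
    print(n, count_misses(n), count_misses(n, flush_before_final=True))
```

In this model the unflushed case shows no misses up to 32 pages and then
misses on every page, which is the knee the sweep is meant to expose, while
the flushed case misses on every page at every size. At 1024 pages the two
cases are indistinguishable, which is Matt's point: any timing difference
measured at that size would have to come from effects this model leaves out.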
