Bug 56461

Summary: Memory corruption on PAE x86 systems with Google Chrome browser (it's not a culprit but it exposes this bug)
Product: Memory Management Reporter: Artem S. Tashkinov (aros)
Component: OtherAssignee: Andrew Morton (akpm)
Status: RESOLVED CODE_FIX    
Severity: high CC: dave, davej, hpa, mikhail.v.gavrilov, tj, torvalds
Priority: P1    
Hardware: i386   
OS: Linux   
URL: https://bugzilla.redhat.com/show_bug.cgi?id=859188
Kernel Version: 3.8 Subsystem:
Regression: No Bisected commit-id:
Attachments: Untested patch for comments..
My 3.8 .config
New patch to be tested
patch to fix pagetable corruption using mmu_gather state

Description Artem S. Tashkinov 2013-04-10 17:49:26 UTC
This happens regularly on my PC:

chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000
Bad pagetable: 000f [#1] PREEMPT SMP
Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ipv6 nf_conntrack_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp xt_pkttype ipt_ULOG xt_owner xt_multiport iptable_filter ip_tables x_tables w83627ehf adt7475 hwmon_vid binfmt_misc fuse hid_generic uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops snd_usb_audio snd_usbmidi_lib snd_rawmidi usbhid hid fan coretemp kvm_intel kvm aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul microcode pcspkr i2c_i801 sr_mod cdrom sg xhci_hcd snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd snd_page_alloc ehci_pci ehci_hcd e1000e nvidia(PO) evdev
Pid: 6114, comm: chrome Tainted: P           O 3.8.5-ic #5 System manufacturer System Product Name/P8P67 PRO
EIP: 0073:[<b42dcd18>] EFLAGS: 00010202 CPU: 1
EIP is at 0xb42dcd18
EAX: 34a03000 EBX: b78cbff4 ECX: 00001000 EDX: 00000057
ESI: bf894d50 EDI: 34a03000 EBP: bf894d08 ESP: bf894cf0
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Process chrome (pid: 6114, ti=cc136000 task=ccadbb10 task.ti=cc136000)

EIP: [<b42dcd18>] 0xb42dcd18 SS:ESP 007b:bf894cf0
---[ end trace 71406aaad5321207 ]---

Untainted reports can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=859188
Comment 1 Linus Torvalds 2013-04-10 19:11:59 UTC
Ok, that's just insane.

The "Corrupted page table" message happens when the error code has the PF_RSVD bit set. And indeed, the error code seems to be 000f, so it has all of the PROT/WRITE/USER/RSVD bits set (but not PF_INSTR).

And the *pgd value is 0, which should never happen, I think. For PAE, we always fill in the top-level entries, and they aren't allowed to be non-present, iirc. It's been too long since I worked with PAE to remember. That may explain the RSVD bit.

Can you add some printouts to dump_pagetable(), please?

Add a line at the top of the CONFIG_X86_PAE case that says something like

  printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm, address));

to see if the problem is that we've screwed up cr3-vs-active_mm somehow.

Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into "native_pgd_val()", in case the virtualization stuff has screwed up the value.
Comment 2 Linus Torvalds 2013-04-10 19:17:14 UTC
Adding more people. Ingo, Andi, Peter, any ideas?

See

    https://bugzilla.kernel.org/show_bug.cgi?id=56461

for details. I forget the rules for the top-level PAE page table
entries, it may be that I mis-remember about that whole "the four top
entries must exist" thing.

                   Linus

On Wed, Apr 10, 2013 at 12:11 PM, Linus Torvalds
<torvalds@linux-foundation.org>  write:
> Ok, that's just insane.
>
> The "Corrupted page table" message happens when the error code has the
> PF_RSVD
> bit set. And indeed, the error code seems to be 000f, so it has all of the
> PROT/WRITE/USER/RSVD bits set (but not PF_INSTR).
>
> And the *pgd value is 0, which should never happen, I think. For PAE, we
> always
> fill in the top-level entries, and they aren't allowed to be non-present,
> iirc.
> It's been too long since I worked with PAE to remember. That may explain the
> RSVD bit.
>
> Can you add some printouts to dump_pagetable(), please?
>
> Add a line at the top of the CONFIG_X86_PAE case that says something like
>
>   printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm, address));
>
> to see if the problem is that we've screwed up cr3-vs-active_mm somehow.
>
> Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into
> "native_pgd_val()", in case the virtualization stuff has screwed up the
> value.
>
> --
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
Comment 3 Anonymous Emailer 2013-04-10 22:26:18 UTC
Reply-To: andi@firstfloor.org

On Wed, Apr 10, 2013 at 12:17:13PM -0700, Linus Torvalds wrote:
> Adding more people. Ingo, Andi, Peter, any ideas?
> 
> See
> 
>     https://bugzilla.kernel.org/show_bug.cgi?id=56461
> 
> for details. I forget the rules for the top-level PAE page table
> entries, it may be that I mis-remember about that whole "the four top
> entries must exist" thing.

I believe you're right. They must all exist.

So his pgd got corrupted. It would be probably good 
to trace the system calls and faults to see what
chrome is doing before?

-Andi
Comment 4 Linus Torvalds 2013-04-11 01:06:15 UTC
On Wed, Apr 10, 2013 at 3:26 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> So his pgd got corrupted. It would be probably good
> to trace the system calls and faults to see what
> chrome is doing before?

Hmm. Some googling shows that others see it too:

   https://bugzilla.redhat.com/show_bug.cgi?id=859188

but it seems to be new to F18, so it's a recent problem. Either a new
kernel bug, or just a more modern chrome triggering it.

A lot of those have uvcvideo in the module list. Not Mikails, though.
Some of the other google hits seem to indicate it's tied to a
particular chrome version.

Very odd.

              Linus
Comment 5 Dave Jones 2013-04-11 04:35:12 UTC
These bugs sadly aren't new.   We started seeing them regularly circa 3.2

Some people seem to be able to trigger them *really* easily, but I've still not managed to figure out exactly what the pattern is.  A lot of these have been filed by the automated bug reporting tool, and the actual reporters are generally unresponsive, so it's been tough to even figure out what's special about their systems.

https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=CLOSED&classification=Fedora&component=kernel&product=Fedora&query_format=advanced&short_desc=bad%20page&short_desc_type=allwordssubstr&version=15&version=16&version=17&version=18&version=19&version=rawhide&order=bug_id&list_id=1268538

(There's likely multiple bugs in that pile, but of the more recent ones, I'm sure there are some duplicates of this one).

There's also a mix of 64bit bugs in there, which could be another sign of multiple bugs given the different page table code..

One hunch we had a while ago was that it was transparent huge page related.
So we disabled it for 32bit to see if the reports would stop on that arch. It didn't make any difference.

We've also got a whole bunch of reports of things like linked-list corruption that I've got no real explanation for.

The usual suspects (Suspend/Resume and/or DRM) have been ruled out in some cases, but not all.  I don't know, it's a mess.
Comment 6 Anonymous Emailer 2013-04-11 08:51:58 UTC
Reply-To: mingo@kernel.org

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Adding more people. Ingo, Andi, Peter, any ideas?
> 
> See
> 
>     https://bugzilla.kernel.org/show_bug.cgi?id=56461
> 
> for details. I forget the rules for the top-level PAE page table
> entries, it may be that I mis-remember about that whole "the four top
> entries must exist" thing.

All those 4 PAE entries must exist - they get cached/shadowed by the CPU on a CR3 
load.

> > The "Corrupted page table" message happens when the error code has the
> PF_RSVD 
> > bit set. And indeed, the error code seems to be 000f, so it has all of the 
> > PROT/WRITE/USER/RSVD bits set (but not PF_INSTR).
> >
> > And the *pgd value is 0, which should never happen, I think. For PAE, we 
> > always fill in the top-level entries, and they aren't allowed to be 
> > non-present, iirc. It's been too long since I worked with PAE to remember. 
> > That may explain the RSVD bit.

Correct, they must all be present, otherwise IIRC the CPU won't even load them 
(will triple fault IIRC).

So I think either the page tables got corrupted after a cr3 switch and they did 
not get loaded into the CPU [we'd not be able to do a printk if they did] - or the 
mm_struct got corrupted and we are printing out the wrong pagetables.

> > Can you add some printouts to dump_pagetable(), please?
> >
> > Add a line at the top of the CONFIG_X86_PAE case that says something like
> >
> >   printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm,
> address));
> >
> > to see if the problem is that we've screwed up cr3-vs-active_mm somehow.
> >
> > Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into 
> > "native_pgd_val()", in case the virtualization stuff has screwed up the
> value.

While the module list shows VirtualBox drivers (and VB is seriously messy kernel 
code running at ring 0), there's a lot of other reports that are untainted.

By the patterns I'd say this is a vanilla kernel bug. The corruption is too 
specific to pagetables to be generic memory corruption. The crashes seem to be too 
widely distributed to be linked to any particular 'rare' driver.

There also appears to be another pattern as well: all systems seem to have a 
significant amount of RAM.

Thanks,

	Ingo
Comment 7 Anonymous Emailer 2013-04-11 08:53:22 UTC
Reply-To: mingo@kernel.org

Is there anyone building vanilla kernels who can reproduce this reliably enough so 
that the bug can be bisected, or at least narrowed down to a particular -rc or so?

Thanks,

	Ingo
Comment 8 Anonymous Emailer 2013-04-11 09:04:42 UTC
Reply-To: t.artem@lycos.com

Apr 11, 2013 02:53:22 PM, Ingo wrote:

>Is there anyone building vanilla kernels who can reproduce this reliably
>enough so 
>that the bug can be bisected, or at least narrowed down to a particular -rc or
>so?

In my case this bug is *very* difficult if not impossible to trigger at will. Sometimes it
happens just after the boot, sometimes weeks pass and I don't see it even once.

Probably other guys are luckier, let's ask them.

Best regards,

Artem
Comment 9 Linus Torvalds 2013-04-11 17:18:06 UTC
On Thu, Apr 11, 2013 at 1:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> While the module list shows VirtualBox drivers (and VB is seriously messy
> kernel
> code running at ring 0), there's a lot of other reports that are untainted.
>
> By the patterns I'd say this is a vanilla kernel bug. The corruption is too
> specific to pagetables to be generic memory corruption. The crashes seem to
> be too
> widely distributed to be linked to any particular 'rare' driver.

I agree. If it was random memory corruption by some driver, this would
*not* be a common place to hit it. It smells like something very
specific related to the page directory pointer, which is why I already
suggested printing out the actual pgd pointer both as calculated from
%cr3 and by using tsk->active_mm..

Hmm. Do we ever end up clearing the PGD entries when we unmap the last
mapping? We have that whole "free_pgd_range()" logic in
free_pgtables() after we've unmapped the last vma in some range.. I'm
looking at that "pud_clear()" in particular.. The pud is mapped inside
the pgd, no?

Maybe normal programs never unmap a whole gigabyte range, and chrome
just ends up having special virtual memory layout, and happens to at
some point do a munmap() that leaves one of the three user-level PGD
entries without any mapping at all, and then we do the PUD free..

Normally the code lives in the low gig, the data lives at 1G+, and the
stack lives at just below the 3G mark, so all *normal* apps end up
having all three user PGD entries mapped.

I have one 32-bit install, but that one isn't running a PAE kernel.
Let me test...

                 Linus
Comment 10 H. Peter Anvin 2013-04-11 17:23:05 UTC
On 04/11/2013 01:51 AM, Ingo Molnar wrote:
>>
>> for details. I forget the rules for the top-level PAE page table
>> entries, it may be that I mis-remember about that whole "the four top
>> entries must exist" thing.
> 
> All those 4 PAE entries must exist - they get cached/shadowed by the CPU on a
> CR3 
> load.
> 

Well, they can be not-present if the OS wants them to be.

	-hpa
Comment 11 H. Peter Anvin 2013-04-11 17:28:48 UTC
On 04/11/2013 01:53 AM, Ingo Molnar wrote:
> 
> Is there anyone building vanilla kernels who can reproduce this reliably
> enough so 
> that the bug can be bisected, or at least narrowed down to a particular -rc
> or so?
> 

An strace would also be incredibly useful, especially any mmap/munmaps.

	-hpa
Comment 12 Linus Torvalds 2013-04-11 18:02:18 UTC
On Thu, Apr 11, 2013 at 10:18 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I have one 32-bit install, but that one isn't running a PAE kernel.
> Let me test...

No luck. I thought that pud_clear() would trigger a zero pgd entry,
but either it does and everything works fine regardless (random TLB
flushing issue? I think those things end up being magical wrt
flushing) or I'm missing something and we never actually clear these
things (I didn't actually add any printk for it triggering).

Of course, some random delayed TLB flush issue might well explain why
it only happens for some people, so I'd ask people to think about this
a bit. Maybe it's harebrained..

                    Linus
Comment 13 Anonymous Emailer 2013-04-11 18:23:50 UTC
Reply-To: andi@firstfloor.org

On Thu, Apr 11, 2013 at 09:04:40AM +0000, Artem S. Tashkinov wrote:
> Apr 11, 2013 02:53:22 PM, Ingo wrote:
> 
> >Is there anyone building vanilla kernels who can reproduce this reliably
> enough so 
> >that the bug can be bisected, or at least narrowed down to a particular -rc
> or so?
> 
> In my case this bug is *very* difficult if not impossible to trigger at will.
> Sometimes it
> happens just after the boot, sometimes weeks pass and I don't see it even
> once.


Can you apply the attached patch and do the following commands

(make sure debugfs is mounted)
cd /sys/kernel/debug/tracing
echo syscalls:* > set_event
echo mm:page_fault >> set_event
echo 1 > tracing_on
echo 2 > /proc/sys/kernel/ftrace_dump_on_oops

That will dump all recent syscalls and all page faults on the oops.
It should not slowdown your system significantly, so you can just
use it normally until the problem happens.

Then send that log.

If it's not enough can also add remote TLB flushes.

-Andi
Comment 14 H. Peter Anvin 2013-04-11 18:24:11 UTC
Can someone please try this without third-party modules, specifically VirtualBox?
Comment 15 H. Peter Anvin 2013-04-11 18:25:46 UTC
Oh, and we have the Nvidia module.  Two sets of external modules which are known to grot around with page tables.
Comment 16 H. Peter Anvin 2013-04-11 18:42:21 UTC
OK, the Red Hat bugzilla does have some untainted reports, so I guess there is some hope.

I note that all of these seem to be in the first quadrant of the address space, indicating that it is the first PUD that is affected.  This may or may not have any relevance.

Something else that strikes me:

chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000

If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and not printed *pde.

It is possible there is a race condition there, but in every dump?
Comment 17 Linus Torvalds 2013-04-11 18:48:49 UTC
On Thu, Apr 11, 2013 at 11:42 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> I note that all of these seem to be in the first quadrant of the address
> space,
> indicating that it is the first PUD that is affected.  This may or may not
> have
> any relevance.

The EIP is also in an odd location (ie it is up where I'd normally
expect the stack to be), which is one reason why I was thinking that
chrome does some fancy footwork with mapping itself in odd places,
which could be what triggers this, and which is why I was thinking
"what if it ends up unmapping all of one PGD entry".

> Something else that strikes me:
>
> chrome: Corrupted page table at address 34a03000
> *pdpt = 0000000000000000 *pde = 0000000000000000
>
> If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and
> not
> printed *pde.

No, I thought so too initially, but the thing is "pgd_present()" is
hardcoded to 1. Because they can't not be present.

              Linus
Comment 18 H. Peter Anvin 2013-04-11 18:50:52 UTC
On 04/11/2013 11:48 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>> I note that all of these seem to be in the first quadrant of the address
>> space,
>> indicating that it is the first PUD that is affected.  This may or may not
>> have
>> any relevance.
> 
> The EIP is also in an odd location (ie it is up where I'd normally
> expect the stack to be), which is one reason why I was thinking that
> chrome does some fancy footwork with mapping itself in odd places,
> which could be what triggers this, and which is why I was thinking
> "what if it ends up unmapping all of one PGD entry".
> 

Yes, Chrome definitely does all kinds of "special" things with the
address space.

>> Something else that strikes me:
>>
>> chrome: Corrupted page table at address 34a03000
>> *pdpt = 0000000000000000 *pde = 0000000000000000
>>
>> If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and
>> not
>> printed *pde.
> 
> No, I thought so too initially, but the thing is "pgd_present()" is
> hardcoded to 1. Because they can't not be present.
> 

OK, that's a Linux rule if anything.

	-hpa
Comment 19 Linus Torvalds 2013-04-11 19:34:37 UTC
On Thu, Apr 11, 2013 at 11:50 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> OK, that's a Linux rule if anything.

Hmm. There *were* some special rules about them, though. Like getting
read in on cr3 load, and not being invalidated by normal tlb
invalidates.

There's also an intriguing issue about the accesses: the PDPTE is
documented as being accessed "uncached" in some microarchitectures,
which I wonder whether it might mean "non-cache-coherent". Which in
turn might mean that we should do an actual wbinvl on the cacheline
after writing the entries? Does anybody know the exact rules?

Maybe we ended up really clearing the PGD entry, and freeing the page,
and another CPU then loads the old value because of that crazy
uncached access means that it doesn't go through the cache coherency
logic? And because it loads the old value, it then starts using PMD's
in a page that has already been free'd - explaining the reserved bits
error - but when we print out the page table values, we obviously use
normal loads, so we don't *see* this.

This would be insane. But it would certainly explain the symptoms.

PAE is a piece of crap. Always was. The fact that people still use it
on machines that could do better is just sad.

             Linus
Comment 20 H. Peter Anvin 2013-04-11 21:00:58 UTC
On 04/11/2013 12:34 PM, Linus Torvalds wrote:
> Hmm. There *were* some special rules about them, though. Like getting
> read in on cr3 load, and not being invalidated by normal tlb
> invalidates.

Indeed... they also don't have a lot of the normal attribute bits (they
*do* have the Present bit though.)

> There's also an intriguing issue about the accesses: the PDPTE is
> documented as being accessed "uncached" in some microarchitectures,
> which I wonder whether it might mean "non-cache-coherent". Which in
> turn might mean that we should do an actual wbinvl on the cacheline
> after writing the entries? Does anybody know the exact rules?

That sounds extremely odd, but I will try to find out.  I'm wondering if
"uncached" means that it *doesn't* load them all at once, which might
cause some fun things instead.

> Maybe we ended up really clearing the PGD entry, and freeing the page,
> and another CPU then loads the old value because of that crazy
> uncached access means that it doesn't go through the cache coherency
> logic? And because it loads the old value, it then starts using PMD's
> in a page that has already been free'd - explaining the reserved bits
> error - but when we print out the page table values, we obviously use
> normal loads, so we don't *see* this.
> 
> This would be insane. But it would certainly explain the symptoms.

Insane indeed.

> PAE is a piece of crap. Always was. The fact that people still use it
> on machines that could do better is just sad.

Just say no to HIGHMEM.

	-hpa
Comment 21 H. Peter Anvin 2013-04-11 21:10:05 UTC
OK, so the rule that really seems to stand out and shine is:

* INVLPG doesn't invalidate the PDPTR (PGD/PUD) entries.  Only a load to
CR0, CR3, or CR4 that invalidates the whole TLB (except global entries)
invalidate the PDPTEs.

This is probably the cause of all of this.  Since this is so unlikely to
ever happen, I wouldn't be surprised if we used to require that all of
these page tables existed, but that unification has violated that
constraint, and because in a normal process all four quadrants are
occupied, it was never noticed.

If we can free the upper-level pointers, hard-wiring pde_present() to 1
is clearly bogus.

As you said, the stale entry points to a page that is being freed, but
the pointer-chasing code doesn't know that and instead loads from
address zero (although address zero physical shouldn't be zero in the
normal case...)

Digging now...

	-hpa
Comment 22 Linus Torvalds 2013-04-11 21:22:49 UTC
On Thu, Apr 11, 2013 at 2:09 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we can free the upper-level pointers, hard-wiring pde_present() to 1
> is clearly bogus.

Actually, it ends up being correct, because with the four-level page
tables, the PDG ends up containing the PUD, and it's *pud* present
that tests the page present bit.

Which explains why the generic VM code can fill these things
correctly.  It's just that the debug code in dump_pagetable() is
buggy. It never did the proper pud level traversal (it just does a
"pud_offset()" inside the "pmd_offset()" macro withotu ever checking
pud_present() at all.

But none of that explains why we'd get the "reserved bit" fault. It
just explains the bogus extra debug output for the pmd contents that
do not exist..

           Linus
Comment 23 H. Peter Anvin 2013-04-11 21:24:30 UTC
On 04/11/2013 02:22 PM, Linus Torvalds wrote:
> On Thu, Apr 11, 2013 at 2:09 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> If we can free the upper-level pointers, hard-wiring pde_present() to 1
>> is clearly bogus.
> 
> Actually, it ends up being correct, because with the four-level page
> tables, the PDG ends up containing the PUD, and it's *pud* present
> that tests the page present bit.
> 
> Which explains why the generic VM code can fill these things
> correctly.  It's just that the debug code in dump_pagetable() is
> buggy. It never did the proper pud level traversal (it just does a
> "pud_offset()" inside the "pmd_offset()" macro withotu ever checking
> pud_present() at all.
> 
> But none of that explains why we'd get the "reserved bit" fault. It
> just explains the bogus extra debug output for the pmd contents that
> do not exist..
> 

Fair enough.  That's straightforward to deal with.

I still believe we have the "lack of special flushing" problem I
outlined in the previous message.

	-hpa
Comment 24 H. Peter Anvin 2013-04-11 21:42:36 UTC
OK, I just asked Dave Hansen to investigate the current hypothesis...

	-hpa
Comment 25 Linus Torvalds 2013-04-11 21:44:40 UTC
On Thu, Apr 11, 2013 at 2:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I still believe we have the "lack of special flushing" problem I
> outlined in the previous message.

Possible. When we clear the pud (in free_pmd_range()) we end up
calling flush_tlb_mm_range() (through unmap_region -> tlb_finisg_mmu),
and it does try to optimize things a bit based on the range.

So *if* it's a small munmap, but we freed the page tables over a much
larger range (because we had just a single small mapping in one big
1GB region), I could possibly see it happening.

Sounds a bit unlikely. But maybe we should make the page-table freeing
code always free the whole region if it frees a page table (as opposed
to just a few pages).

           Linus
Comment 26 H. Peter Anvin 2013-04-11 21:53:49 UTC
On 04/11/2013 02:44 PM, Linus Torvalds wrote:
> On Thu, Apr 11, 2013 at 2:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I still believe we have the "lack of special flushing" problem I
>> outlined in the previous message.
> 
> Possible. When we clear the pud (in free_pmd_range()) we end up
> calling flush_tlb_mm_range() (through unmap_region -> tlb_finisg_mmu),
> and it does try to optimize things a bit based on the range.
> 
> So *if* it's a small munmap, but we freed the page tables over a much
> larger range (because we had just a single small mapping in one big
> 1GB region), I could possibly see it happening.
> 
> Sounds a bit unlikely. But maybe we should make the page-table freeing
> code always free the whole region if it frees a page table (as opposed
> to just a few pages).
> 

It sounds a bit unlikely, but it also sounds like we're chasing a rare
bug that only one oddball application stomps on.  One thing in
particular which I note: at least one reporter says this tended to
happen on *exiting Chrome* which would be consistent with this happening
as a result of the last mapping being removed.

The combination of events that would have to trigger this:

1. PAE paging
2. A piece of software not using the standard runtime
3. Freeing the last mapping in a quadrant
4. And then causing something in that quadrant to be accessed

Chrome probably has all these characteristics; few other things would.

	-hpa
Comment 27 Linus Torvalds 2013-04-11 21:59:58 UTC
On Thu, Apr 11, 2013 at 2:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> It sounds a bit unlikely, but it also sounds like we're chasing a rare
> bug that only one oddball application stomps on.  One thing in
> particular which I note: at least one reporter says this tended to
> happen on *exiting Chrome* which would be consistent with this happening
> as a result of the last mapping being removed.
>
> The combination of events that would have to trigger this:
>
> 1. PAE paging
> 2. A piece of software not using the standard runtime
> 3. Freeing the last mapping in a quadrant
> 4. And then causing something in that quadrant to be accessed
>
> Chrome probably has all these characteristics; few other things would.


Maybe. Attached is a patch that is TOTALLY UNTESTED and not very well
thought out, but has the potential to fix that particular thing.

Does bugzilla pick up on email attachments automatically? Let's see..

              Linus
Comment 28 Linus Torvalds 2013-04-11 22:02:02 UTC
Created attachment 98301 [details]
Untested patch for comments..

Yeah, so bugzilla didn't pick up on the attachment in the email, so here goes.

Really, I didn't test if this compiles, and I haven't thought it through all that much in general. But this just makes our pmd/pud freeing logic also set the fullmm flag, that in turn hopefully makes the actual TLB flush (done later) do a full TLB flush rather than individual patches.

Hmm?
Comment 29 H. Peter Anvin 2013-04-11 22:16:22 UTC
>  arch/x86/mm/pgtable.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 193350b51f90..8a3f2b66dc52 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -58,6 +58,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page
> *pte)
>  void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  {
>       paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
> +     tlb->fullmm = 1;
>       tlb_remove_page(tlb, virt_to_page(pmd));
>  }
>  
> @@ -65,6 +66,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
>  {
>       paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> +     tlb->fullmm = 1;
>       tlb_remove_page(tlb, virt_to_page(pud));
>  }
>  #endif       /* PAGETABLE_LEVELS > 3 */

OK, I just scrutinized the SDM and the semantics of INVLPG, and I'm
quite certain we *only* need to do this for the PUDs, not the PMDs, and
only for PAE paging, not for 64-bit paging.

INVLPG is documented as invalidating all the levels of the paging cache
hierarchy, however, the PDPTEs are not considered to be part of that
hierarchy and so does not get invalidated by INVLPG.

Specific SDM references (March 2013 edition):

Vol 3A § 4.4.1: PDPTE Registers

	-> PAE PDPTEs not invalidated by INVLPG

Vol 3A § 4.10.3.1: Caches for Paging Structures

	-> The third level (PDPTEs) is a cache in IA-32e (long) mode.

Vol 3A § 4.10.4.1: Operations that Invalidate TLBs and Paging-Structure
Caches

	-> INVLPG invalidate all caches associated with the current PCID

So we should be okay to only do this when (a) changing the PUD and (b)
in 32-bit PAE mode.

	-hpa
Comment 30 Linus Torvalds 2013-04-11 22:18:52 UTC
On Thu, Apr 11, 2013 at 2:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe. Attached is a patch that is TOTALLY UNTESTED and not very well
> thought out, but has the potential to fix that particular thing.

No, thinking about this some more (but still not actually testing)
this is wrong.

We need to add a new flag to the field, because we use 'fullmm' to
also optimize ptep_get_and_clear_full() and
pte_clear_not_present_full(), so we can't just set it because we want
the full tlb flush.

So something similar to that patch, but that uses a separate bitfield
(and tests that in addition to ->fullmm in the TLB flush path) should
work, but that patch as-is is broken.

                Linus
Comment 31 Dave Hansen 2013-04-11 22:24:09 UTC
Artem, could you post your kernel .config?  I just want to make sure there's nothing funky like a non-3:1 VMSPLIT.
Comment 32 Anonymous Emailer 2013-04-12 05:07:04 UTC
Reply-To: mingo@kernel.org

* H. Peter Anvin <hpa@zytor.com> wrote:

> The combination of events that would have to trigger this:
> 
> 1. PAE paging
> 2. A piece of software not using the standard runtime
> 3. Freeing the last mapping in a quadrant
> 4. And then causing something in that quadrant to be accessed
> 
> Chrome probably has all these characteristics; few other things would.

The PAE code, when it was written, assumed that pagetables would never be freed 
during the life-time of an mm. This was true of the VM back then.

Now, freeing a full quadrant is rare enough to match the patterns here.

We also changed the TLB flushing code to be more optimized at around v3.2, which 
might have turned a latent bug into a real bug.

So my guess would be that this got triggered by:

 611ae8e3f520 x86/tlb: enable tlb flush range support for x86

To test this hypothesis the patch below could be tried (UNTESTED) - it reverts 
back to full TLB flushing.

Note, this might still be wrong for single-page invlpg, but I'm not sure Chrome 
triggers that, IIRC single-page invlpg is mostly used for page fault invalidation, 
not munmap() invalidation.

So this patch might be worth testing.

Thanks,

	Ingo

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 282375f..dfdc726 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -196,7 +196,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		goto flush_all;
 	}
 
-	if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
+	if (1 || end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
 					|| vmflag & VM_HUGETLB) {
 		local_flush_tlb();
 		goto flush_all;
Comment 33 H. Peter Anvin 2013-04-12 05:17:02 UTC
Dave Hansen was working on a patch but had to go pick up his small child... I got his patch but it needed a small touchups that he promised tomorrow.

Ingo Molnar <mingo@kernel.org> wrote:

>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> The combination of events that would have to trigger this:
>> 
>> 1. PAE paging
>> 2. A piece of software not using the standard runtime
>> 3. Freeing the last mapping in a quadrant
>> 4. And then causing something in that quadrant to be accessed
>> 
>> Chrome probably has all these characteristics; few other things
>would.
>
>The PAE code, when it was written, assumed that pagetables would never
>be freed 
>during the life-time of an mm. This was true of the VM back then.
>
>Now, freeing a full quadrant is rare enough to match the patterns here.
>
>We also changed the TLB flushing code to be more optimized at around
>v3.2, which 
>might have turned a latent bug into a real bug.
>
>So my guess would be that this got triggered by:
>
> 611ae8e3f520 x86/tlb: enable tlb flush range support for x86
>
>To test this hypothesis the patch below could be tried (UNTESTED) - it
>reverts 
>back to full TLB flushing.
>
>Note, this might still be wrong for single-page invlpg, but I'm not
>sure Chrome 
>triggers that, IIRC single-page invlpg is mostly used for page fault
>invalidation, 
>not munmap() invalidation.
>
>So this patch might be worth testing.
>
>Thanks,
>
>       Ingo
>
>diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>index 282375f..dfdc726 100644
>--- a/arch/x86/mm/tlb.c
>+++ b/arch/x86/mm/tlb.c
>@@ -196,7 +196,7 @@ void flush_tlb_mm_range(struct mm_struct *mm,
>unsigned long start,
>               goto flush_all;
>       }
> 
>-      if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
>+      if (1 || end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
>                                       || vmflag & VM_HUGETLB) {
>               local_flush_tlb();
>               goto flush_all;
Comment 34 Artem S. Tashkinov 2013-04-12 05:17:49 UTC
Created attachment 98331 [details]
My 3.8 .config

(In reply to comment #31)
> Artem, could you post your kernel .config?  I just want to make sure there's
> nothing funky like a non-3:1 VMSPLIT.

Here it is.

I've enabled KVM just recently, I've never actually used it yet.
Comment 35 H. Peter Anvin 2013-04-12 05:18:14 UTC
He is also working on a test program.

Ingo Molnar <mingo@kernel.org> wrote:

>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> The combination of events that would have to trigger this:
>> 
>> 1. PAE paging
>> 2. A piece of software not using the standard runtime
>> 3. Freeing the last mapping in a quadrant
>> 4. And then causing something in that quadrant to be accessed
>> 
>> Chrome probably has all these characteristics; few other things
>would.
>
>The PAE code, when it was written, assumed that pagetables would never
>be freed 
>during the life-time of an mm. This was true of the VM back then.
>
>Now, freeing a full quadrant is rare enough to match the patterns here.
>
>We also changed the TLB flushing code to be more optimized at around
>v3.2, which 
>might have turned a latent bug into a real bug.
>
>So my guess would be that this got triggered by:
>
> 611ae8e3f520 x86/tlb: enable tlb flush range support for x86
>
>To test this hypothesis the patch below could be tried (UNTESTED) - it
>reverts 
>back to full TLB flushing.
>
>Note, this might still be wrong for single-page invlpg, but I'm not
>sure Chrome 
>triggers that, IIRC single-page invlpg is mostly used for page fault
>invalidation, 
>not munmap() invalidation.
>
>So this patch might be worth testing.
>
>Thanks,
>
>       Ingo
>
>diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>index 282375f..dfdc726 100644
>--- a/arch/x86/mm/tlb.c
>+++ b/arch/x86/mm/tlb.c
>@@ -196,7 +196,7 @@ void flush_tlb_mm_range(struct mm_struct *mm,
>unsigned long start,
>               goto flush_all;
>       }
> 
>-      if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
>+      if (1 || end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
>                                       || vmflag & VM_HUGETLB) {
>               local_flush_tlb();
>               goto flush_all;
Comment 36 Dave Hansen 2013-04-12 13:48:59 UTC
Well, I can at least confirm that one of the PGD's can get cleared out in practice:

http://sr71.net/~dave/linux/paefun.c.txt

I'm working on reproducing the crash itself now.
Comment 37 Linus Torvalds 2013-04-12 14:41:42 UTC
On Thu, Apr 11, 2013 at 3:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So something similar to that patch, but that uses a separate bitfield
> (and tests that in addition to ->fullmm in the TLB flush path) should
> work, but that patch as-is is broken.

I think I'm going to go with the attached patch. It just disables the
"let's try to do page-by-page flushing" entirely on x86 when
CONFIG_X86_PAE is enabled.

Looking at that code, I also happen to think that it is very
questionable for non-PAE. In particular, look at this part:

                if (has_large_page(mm, start, end)) {
                        local_flush_tlb();
                        goto flush_all;
                }

and please realize that TLB flushing is done *after* the page tables
are torn down, so looking for a huge-page in the page tables to see if
we are flushing one seems rather extremely stupid.

I suspect that one doesn't end up triggering, because in order to
munmap a hugepage, the size of the munmap has to be big enough that
the TLB size checks trigger, but it makes me feel less than happy
about this whole code-sequence.

                   Linus
Comment 38 Linus Torvalds 2013-04-12 14:43:51 UTC
Created attachment 98401 [details]
New patch to be tested

Disable the whole page-at-a-time TLB flushing optimization for PAE, because it's unsafe due to invlpg not invalidating the page directory pointer table entries.

This one is actually worth testing on PAE.
Comment 39 Dave Hansen 2013-04-12 15:22:46 UTC
That patch looks like it'll work, although it's kinda mean to the poor PAE users.  But, I guess they have it coming.

BTW... Another thing that makes this harder to trigger is the "balance point" logic flush_tlb_mm_range().  If the range we're flushing is over the "balance point", we do a full flush anyway.  So this probably only triggers when we're flushing a relatively small mapping, _and_ it's the last one in the quadrant.

I haven't had any luck triggering this in a 32-bit KVM VM.  Hoping some real hardware will do the trick.
Comment 40 H. Peter Anvin 2013-04-12 16:03:53 UTC
INVLPG does invalidate huge pages, so that is really confusing.

Linus Torvalds <torvalds@linux-foundation.org> wrote:

>On Thu, Apr 11, 2013 at 3:18 PM, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> So something similar to that patch, but that uses a separate bitfield
>> (and tests that in addition to ->fullmm in the TLB flush path) should
>> work, but that patch as-is is broken.
>
>I think I'm going to go with the attached patch. It just disables the
>"let's try to do page-by-page flushing" entirely on x86 when
>CONFIG_X86_PAE is enabled.
>
>Looking at that code, I also happen to think that it is very
>questionable for non-PAE. In particular, look at this part:
>
>                if (has_large_page(mm, start, end)) {
>                        local_flush_tlb();
>                        goto flush_all;
>                }
>
>and please realize that TLB flushing is done *after* the page tables
>are torn down, so looking for a huge-page in the page tables to see if
>we are flushing one seems rather extremely stupid.
>
>I suspect that one doesn't end up triggering, because in order to
>munmap a hugepage, the size of the munmap has to be big enough that
>the TLB size checks trigger, but it makes me feel less than happy
>about this whole code-sequence.
>
>                   Linus
Comment 41 H. Peter Anvin 2013-04-12 16:06:41 UTC
As far as being mean to poor PAE users... it is one thing what we do for 3.9, 3.10 and 3.11.  3.9 is at -rc6 so unless we come up with a better solution by today-ish it makes total sense to kill it now and improve later.

bugzilla-daemon@bugzilla.kernel.org wrote:

>https://bugzilla.kernel.org/show_bug.cgi?id=56461
>
>
>
>
>
>--- Comment #39 from Dave Hansen <dave@sr71.net>  2013-04-12 15:22:46
>---
>That patch looks like it'll work, although it's kinda mean to the poor
>PAE
>users.  But, I guess they have it coming.
>
>BTW... Another thing that makes this harder to trigger is the "balance
>point"
>logic flush_tlb_mm_range().  If the range we're flushing is over the
>"balance
>point", we do a full flush anyway.  So this probably only triggers when
>we're
>flushing a relatively small mapping, _and_ it's the last one in the
>quadrant.
>
>I haven't had any luck triggering this in a 32-bit KVM VM.  Hoping some
>real
>hardware will do the trick.
Comment 42 Dave Hansen 2013-04-12 17:15:17 UTC
Peter suggested doing something along these lines. (attaching in a sec)

Except, earlier, Peter said:
> OK, I just scrutinized the SDM and the semantics of INVLPG, and I'm
> quite certain we *only* need to do this for the PUDs, not the PMDs, and
> only for PAE paging, not for 64-bit paging.

That's right, but I think Peter means that we do it for PUD _entries_, while ___pmd_free_tlb() is freeing a pmd *PAGE*.  So the attached patch only modifies the logic in ___pmd_free_tlb().  The ___pud_free_tlb() code that Linus's first patch touched isn't even used on PAE, so it doesn't need to get touched.
Comment 43 Dave Hansen 2013-04-12 17:16:28 UTC
Created attachment 98451 [details]
patch to fix pagetable corruption using mmu_gather state
Comment 44 Linus Torvalds 2013-04-12 22:31:27 UTC
Comment #43 from Dave Hansen <dave@sr71.net>  2013-04-12 17:16:28 ---
> Created an attachment (id=98451)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=98451)
> patch to fix pagetable corruption using mmu_gather state

Patch looks good to me. Mind sending it as a full email with proper
sign-offs etc.

Peter, Ingo, any other comments?

                Linus