This happens regularly on my PC:

chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000
Bad pagetable: 000f [#1] PREEMPT SMP
Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ipv6 nf_conntrack_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp xt_pkttype ipt_ULOG xt_owner xt_multiport iptable_filter ip_tables x_tables w83627ehf adt7475 hwmon_vid binfmt_misc fuse hid_generic uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops snd_usb_audio snd_usbmidi_lib snd_rawmidi usbhid hid fan coretemp kvm_intel kvm aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul microcode pcspkr i2c_i801 sr_mod cdrom sg xhci_hcd snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd snd_page_alloc ehci_pci ehci_hcd e1000e nvidia(PO) evdev
Pid: 6114, comm: chrome Tainted: P O 3.8.5-ic #5 System manufacturer System Product Name/P8P67 PRO
EIP: 0073:[<b42dcd18>] EFLAGS: 00010202 CPU: 1
EIP is at 0xb42dcd18
EAX: 34a03000 EBX: b78cbff4 ECX: 00001000 EDX: 00000057
ESI: bf894d50 EDI: 34a03000 EBP: bf894d08 ESP: bf894cf0
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Process chrome (pid: 6114, ti=cc136000 task=ccadbb10 task.ti=cc136000)
EIP: [<b42dcd18>] 0xb42dcd18 SS:ESP 007b:bf894cf0
---[ end trace 71406aaad5321207 ]---

Untainted reports can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=859188
Ok, that's just insane. The "Corrupted page table" message happens when the error code has the PF_RSVD bit set. And indeed, the error code seems to be 000f, so it has all of the PROT/WRITE/USER/RSVD bits set (but not PF_INSTR). And the *pgd value is 0, which should never happen, I think. For PAE, we always fill in the top-level entries, and they aren't allowed to be non-present, iirc. It's been too long since I worked with PAE to remember. That may explain the RSVD bit. Can you add some printouts to dump_pagetable(), please? Add a line at the top of the CONFIG_X86_PAE case that says something like printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm, address)); to see if the problem is that we've screwed up cr3-vs-active_mm somehow. Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into "native_pgd_val()", in case the virtualization stuff has screwed up the value.
Adding more people. Ingo, Andi, Peter, any ideas?

See https://bugzilla.kernel.org/show_bug.cgi?id=56461 for details. I forget the rules for the top-level PAE page table entries, it may be that I mis-remember about that whole "the four top entries must exist" thing.

Linus

On Wed, Apr 10, 2013 at 12:11 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Ok, that's just insane.
>
> The "Corrupted page table" message happens when the error code has the
> PF_RSVD bit set. And indeed, the error code seems to be 000f, so it has
> all of the PROT/WRITE/USER/RSVD bits set (but not PF_INSTR).
>
> And the *pgd value is 0, which should never happen, I think. For PAE, we
> always fill in the top-level entries, and they aren't allowed to be
> non-present, iirc. It's been too long since I worked with PAE to
> remember. That may explain the RSVD bit.
>
> Can you add some printouts to dump_pagetable(), please?
>
> Add a line at the top of the CONFIG_X86_PAE case that says something like
>
>     printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm, address));
>
> to see if the problem is that we've screwed up cr3-vs-active_mm somehow.
>
> Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into
> "native_pgd_val()", in case the virtualization stuff has screwed up the
> value.
>
> --
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
Reply-To: andi@firstfloor.org

On Wed, Apr 10, 2013 at 12:17:13PM -0700, Linus Torvalds wrote:
> Adding more people. Ingo, Andi, Peter, any ideas?
>
> See
>
> https://bugzilla.kernel.org/show_bug.cgi?id=56461
>
> for details. I forget the rules for the top-level PAE page table
> entries, it may be that I mis-remember about that whole "the four top
> entries must exist" thing.

I believe you're right. They must all exist.

So his pgd got corrupted. It would probably be good to trace the system calls and faults to see what chrome is doing beforehand.

-Andi
On Wed, Apr 10, 2013 at 3:26 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> So his pgd got corrupted. It would be probably good
> to trace the system calls and faults to see what
> chrome is doing before?

Hmm. Some googling shows that others see it too:

https://bugzilla.redhat.com/show_bug.cgi?id=859188

but it seems to be new to F18, so it's a recent problem. Either a new kernel bug, or just a more modern chrome triggering it.

A lot of those have uvcvideo in the module list. Not Mikail's, though. Some of the other google hits seem to indicate it's tied to a particular chrome version.

Very odd.

Linus
These bugs sadly aren't new. We started seeing them regularly circa 3.2. Some people seem to be able to trigger them *really* easily, but I've still not managed to figure out exactly what the pattern is. A lot of these have been filed by the automated bug reporting tool, and the actual reporters are generally unresponsive, so it's been tough to even figure out what's special about their systems.

https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=CLOSED&classification=Fedora&component=kernel&product=Fedora&query_format=advanced&short_desc=bad%20page&short_desc_type=allwordssubstr&version=15&version=16&version=17&version=18&version=19&version=rawhide&order=bug_id&list_id=1268538

(There are likely multiple bugs in that pile, but of the more recent ones, I'm sure there are some duplicates of this one.) There's also a mix of 64-bit bugs in there, which could be another sign of multiple bugs given the different page table code.

One hunch we had a while ago was that it was transparent huge page related. So we disabled it for 32-bit to see if the reports would stop on that arch. It didn't make any difference.

We've also got a whole bunch of reports of things like linked-list corruption that I've got no real explanation for. The usual suspects (Suspend/Resume and/or DRM) have been ruled out in some cases, but not all.

I don't know, it's a mess.
Reply-To: mingo@kernel.org * Linus Torvalds <torvalds@linux-foundation.org> wrote: > Adding more people. Ingo, Andi, Peter, any ideas? > > See > > https://bugzilla.kernel.org/show_bug.cgi?id=56461 > > for details. I forget the rules for the top-level PAE page table > entries, it may be that I mis-remember about that whole "the four top > entries must exist" thing. All those 4 PAE entries must exist - they get cached/shadowed by the CPU on a CR3 load. > > The "Corrupted page table" message happens when the error code has the > PF_RSVD > > bit set. And indeed, the error code seems to be 000f, so it has all of the > > PROT/WRITE/USER/RSVD bits set (but not PF_INSTR). > > > > And the *pgd value is 0, which should never happen, I think. For PAE, we > > always fill in the top-level entries, and they aren't allowed to be > > non-present, iirc. It's been too long since I worked with PAE to remember. > > That may explain the RSVD bit. Correct, they must all be present, otherwise IIRC the CPU won't even load them (will triple fault IIRC). So I think either the page tables got corrupted after a cr3 switch and they did not get loaded into the CPU [we'd not be able to do a printk if they did] - or the mm_struct got corrupted and we are printing out the wrong pagetables. > > Can you add some printouts to dump_pagetable(), please? > > > > Add a line at the top of the CONFIG_X86_PAE case that says something like > > > > printk("pgd=%p pgd_offset=%p\n", pgd, pgd_offset(tsk->active_mm, > address)); > > > > to see if the problem is that we've screwed up cr3-vs-active_mm somehow. > > > > Then, change the "pgd_val()" in the printk("*pdpt = ..) thing into > > "native_pgd_val()", in case the virtualization stuff has screwed up the > value. While the module list shows VirtualBox drivers (and VB is seriously messy kernel code running at ring 0), there's a lot of other reports that are untainted. By the patterns I'd say this is a vanilla kernel bug. 
The corruption is too specific to pagetables to be generic memory corruption. The crashes seem to be too widely distributed to be linked to any particular 'rare' driver.

There appears to be another pattern as well: all systems seem to have a significant amount of RAM.

Thanks,

	Ingo
Reply-To: mingo@kernel.org Is there anyone building vanilla kernels who can reproduce this reliably enough so that the bug can be bisected, or at least narrowed down to a particular -rc or so? Thanks, Ingo
Reply-To: t.artem@lycos.com Apr 11, 2013 02:53:22 PM, Ingo wrote: >Is there anyone building vanilla kernels who can reproduce this reliably >enough so >that the bug can be bisected, or at least narrowed down to a particular -rc or >so? In my case this bug is *very* difficult if not impossible to trigger at will. Sometimes it happens just after the boot, sometimes weeks pass and I don't see it even once. Probably other guys are luckier, let's ask them. Best regards, Artem
On Thu, Apr 11, 2013 at 1:51 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > While the module list shows VirtualBox drivers (and VB is seriously messy > kernel > code running at ring 0), there's a lot of other reports that are untainted. > > By the patterns I'd say this is a vanilla kernel bug. The corruption is too > specific to pagetables to be generic memory corruption. The crashes seem to > be too > widely distributed to be linked to any particular 'rare' driver. I agree. If it was random memory corruption by some driver, this would *not* be a common place to hit it. It smells like something very specific related to the page directory pointer, which is why I already suggested printing out the actual pgd pointer both as calculated from %cr3 and by using tsk->active_mm.. Hmm. Do we ever end up clearing the PGD entries when we unmap the last mapping? We have that whole "free_pgd_range()" logic in free_pgtables() after we've unmapped the last vma in some range.. I'm looking at that "pud_clear()" in particular.. The pud is mapped inside the pgd, no? Maybe normal programs never unmap a whole gigabyte range, and chrome just ends up having special virtual memory layout, and happens to at some point do a munmap() that leaves one of the three user-level PGD entries without any mapping at all, and then we do the PUD free.. Normally the code lives in the low gig, the data lives at 1G+, and the stack lives at just below the 3G mark, so all *normal* apps end up having all three user PGD entries mapped. I have one 32-bit install, but that one isn't running a PAE kernel. Let me test... Linus
On 04/11/2013 01:51 AM, Ingo Molnar wrote: >> >> for details. I forget the rules for the top-level PAE page table >> entries, it may be that I mis-remember about that whole "the four top >> entries must exist" thing. > > All those 4 PAE entries must exist - they get cached/shadowed by the CPU on a > CR3 > load. > Well, they can be not-present if the OS wants them to be. -hpa
On 04/11/2013 01:53 AM, Ingo Molnar wrote: > > Is there anyone building vanilla kernels who can reproduce this reliably > enough so > that the bug can be bisected, or at least narrowed down to a particular -rc > or so? > An strace would also be incredibly useful, especially any mmap/munmaps. -hpa
On Thu, Apr 11, 2013 at 10:18 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I have one 32-bit install, but that one isn't running a PAE kernel. > Let me test... No luck. I thought that pud_clear() would trigger a zero pgd entry, but either it does and everything works fine regardless (random TLB flushing issue? I think those things end up being magical wrt flushing) or I'm missing something and we never actually clear these things (I didn't actually add any printk for it triggering). Of course, some random delayed TLB flush issue might well explain why it only happens for some people, so I'd ask people to think about this a bit. Maybe it's harebrained.. Linus
Reply-To: andi@firstfloor.org

On Thu, Apr 11, 2013 at 09:04:40AM +0000, Artem S. Tashkinov wrote:
> Apr 11, 2013 02:53:22 PM, Ingo wrote:
>
> >Is there anyone building vanilla kernels who can reproduce this reliably
> >enough so that the bug can be bisected, or at least narrowed down to a
> >particular -rc or so?
>
> In my case this bug is *very* difficult if not impossible to trigger at
> will. Sometimes it happens just after the boot, sometimes weeks pass and
> I don't see it even once.

Can you apply the attached patch and do the following commands (make sure debugfs is mounted):

cd /sys/kernel/debug/tracing
echo syscalls:* > set_event
echo mm:page_fault >> set_event
echo 1 > tracing_on
echo 2 > /proc/sys/kernel/ftrace_dump_on_oops

That will dump all recent syscalls and all page faults on the oops. It should not slow down your system significantly, so you can just use it normally until the problem happens. Then send that log.

If it's not enough, we can also add remote TLB flushes.

-Andi
Can someone please try this without third-party modules, specifically VirtualBox?
Oh, and we have the Nvidia module. Two sets of external modules which are known to grot around with page tables.
OK, the Red Hat bugzilla does have some untainted reports, so I guess there is some hope.

I note that all of these seem to be in the first quadrant of the address space, indicating that it is the first PUD that is affected. This may or may not have any relevance.

Something else that strikes me:

chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000

If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and not printed *pde. It is possible there is a race condition there, but in every dump?
On Thu, Apr 11, 2013 at 11:42 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > > I note that all of these seem to be in the first quadrant of the address > space, > indicating that it is the first PUD that is affected. This may or may not > have > any relevance. The EIP is also in an odd location (ie it is up where I'd normally expect the stack to be), which is one reason why I was thinking that chrome does some fancy footwork with mapping itself in odd places, which could be what triggers this, and which is why I was thinking "what if it ends up unmapping all of one PGD entry". > Something else that strikes me: > > chrome: Corrupted page table at address 34a03000 > *pdpt = 0000000000000000 *pde = 0000000000000000 > > If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and > not > printed *pde. No, I thought so too initially, but the thing is "pgd_present()" is hardcoded to 1. Because they can't not be present. Linus
On 04/11/2013 11:48 AM, bugzilla-daemon@bugzilla.kernel.org wrote: >> >> I note that all of these seem to be in the first quadrant of the address >> space, >> indicating that it is the first PUD that is affected. This may or may not >> have >> any relevance. > > The EIP is also in an odd location (ie it is up where I'd normally > expect the stack to be), which is one reason why I was thinking that > chrome does some fancy footwork with mapping itself in odd places, > which could be what triggers this, and which is why I was thinking > "what if it ends up unmapping all of one PGD entry". > Yes, Chrome definitely does all kinds of "special" things with the address space. >> Something else that strikes me: >> >> chrome: Corrupted page table at address 34a03000 >> *pdpt = 0000000000000000 *pde = 0000000000000000 >> >> If *pgd is zero, we should have hit the "goto out" in dump_pagetable() and >> not >> printed *pde. > > No, I thought so too initially, but the thing is "pgd_present()" is > hardcoded to 1. Because they can't not be present. > OK, that's a Linux rule if anything. -hpa
On Thu, Apr 11, 2013 at 11:50 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> OK, that's a Linux rule if anything.

Hmm. There *were* some special rules about them, though. Like getting read in on cr3 load, and not being invalidated by normal tlb invalidates.

There's also an intriguing issue about the accesses: the PDPTE is documented as being accessed "uncached" in some microarchitectures, which I wonder whether it might mean "non-cache-coherent". Which in turn might mean that we should do an actual wbinvd on the cacheline after writing the entries? Does anybody know the exact rules?

Maybe we ended up really clearing the PGD entry, and freeing the page, and another CPU then loads the old value because that crazy uncached access means that it doesn't go through the cache coherency logic? And because it loads the old value, it then starts using PMD's in a page that has already been free'd - explaining the reserved bits error - but when we print out the page table values, we obviously use normal loads, so we don't *see* this.

This would be insane. But it would certainly explain the symptoms.

PAE is a piece of crap. Always was. The fact that people still use it on machines that could do better is just sad.

Linus
On 04/11/2013 12:34 PM, Linus Torvalds wrote:
> Hmm. There *were* some special rules about them, though. Like getting
> read in on cr3 load, and not being invalidated by normal tlb
> invalidates.

Indeed... they also don't have a lot of the normal attribute bits (they *do* have the Present bit though.)

> There's also an intriguing issue about the accesses: the PDPTE is
> documented as being accessed "uncached" in some microarchitectures,
> which I wonder whether it might mean "non-cache-coherent". Which in
> turn might mean that we should do an actual wbinvd on the cacheline
> after writing the entries? Does anybody know the exact rules?

That sounds extremely odd, but I will try to find out. I'm wondering if "uncached" means that it *doesn't* load them all at once, which might cause some fun things instead.

> Maybe we ended up really clearing the PGD entry, and freeing the page,
> and another CPU then loads the old value because that crazy uncached
> access means that it doesn't go through the cache coherency logic? And
> because it loads the old value, it then starts using PMD's in a page
> that has already been free'd - explaining the reserved bits error - but
> when we print out the page table values, we obviously use normal loads,
> so we don't *see* this.
>
> This would be insane. But it would certainly explain the symptoms.

Insane indeed.

> PAE is a piece of crap. Always was. The fact that people still use it
> on machines that could do better is just sad.

Just say no to HIGHMEM.

-hpa
OK, so the rule that really seems to stand out and shine is:

* INVLPG doesn't invalidate the PDPTR (PGD/PUD) entries. Only a load to CR0, CR3, or CR4 that invalidates the whole TLB (except global entries) invalidates the PDPTEs.

This is probably the cause of all of this. Since this is so unlikely to ever happen, I wouldn't be surprised if we used to require that all of these page tables existed, but that unification has violated that constraint, and because in a normal process all four quadrants are occupied, it was never noticed.

If we can free the upper-level pointers, hard-wiring pgd_present() to 1 is clearly bogus.

As you said, the stale entry points to a page that is being freed, but the pointer-chasing code doesn't know that and instead loads from address zero (although address zero physical shouldn't be zero in the normal case...)

Digging now...

-hpa
On Thu, Apr 11, 2013 at 2:09 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we can free the upper-level pointers, hard-wiring pgd_present() to 1
> is clearly bogus.

Actually, it ends up being correct, because with the four-level page tables, the PGD ends up containing the PUD, and it's *pud* present that tests the page present bit.

Which explains why the generic VM code can fill these things correctly. It's just that the debug code in dump_pagetable() is buggy. It never did the proper pud level traversal (it just does a "pud_offset()" inside the "pmd_offset()" macro without ever checking pud_present() at all).

But none of that explains why we'd get the "reserved bit" fault. It just explains the bogus extra debug output for the pmd contents that do not exist..

Linus
On 04/11/2013 02:22 PM, Linus Torvalds wrote:
> On Thu, Apr 11, 2013 at 2:09 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> If we can free the upper-level pointers, hard-wiring pgd_present() to 1
>> is clearly bogus.
>
> Actually, it ends up being correct, because with the four-level page
> tables, the PGD ends up containing the PUD, and it's *pud* present
> that tests the page present bit.
>
> Which explains why the generic VM code can fill these things
> correctly. It's just that the debug code in dump_pagetable() is
> buggy. It never did the proper pud level traversal (it just does a
> "pud_offset()" inside the "pmd_offset()" macro without ever checking
> pud_present() at all).
>
> But none of that explains why we'd get the "reserved bit" fault. It
> just explains the bogus extra debug output for the pmd contents that
> do not exist..
>

Fair enough. That's straightforward to deal with.

I still believe we have the "lack of special flushing" problem I outlined in the previous message.

-hpa
OK, I just asked Dave Hansen to investigate the current hypothesis... -hpa
On Thu, Apr 11, 2013 at 2:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I still believe we have the "lack of special flushing" problem I
> outlined in the previous message.

Possible. When we clear the pud (in free_pmd_range()) we end up calling flush_tlb_mm_range() (through unmap_region -> tlb_finish_mmu), and it does try to optimize things a bit based on the range.

So *if* it's a small munmap, but we freed the page tables over a much larger range (because we had just a single small mapping in one big 1GB region), I could possibly see it happening.

Sounds a bit unlikely. But maybe we should make the page-table freeing code always flush the whole region if it frees a page table (as opposed to just a few pages).

Linus
On 04/11/2013 02:44 PM, Linus Torvalds wrote:
> On Thu, Apr 11, 2013 at 2:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I still believe we have the "lack of special flushing" problem I
>> outlined in the previous message.
>
> Possible. When we clear the pud (in free_pmd_range()) we end up
> calling flush_tlb_mm_range() (through unmap_region -> tlb_finish_mmu),
> and it does try to optimize things a bit based on the range.
>
> So *if* it's a small munmap, but we freed the page tables over a much
> larger range (because we had just a single small mapping in one big
> 1GB region), I could possibly see it happening.
>
> Sounds a bit unlikely. But maybe we should make the page-table freeing
> code always flush the whole region if it frees a page table (as opposed
> to just a few pages).
>

It sounds a bit unlikely, but it also sounds like we're chasing a rare bug that only one oddball application stomps on. One thing in particular which I note: at least one reporter says this tended to happen on *exiting Chrome*, which would be consistent with this happening as a result of the last mapping being removed.

The combination of events that would have to trigger this:

1. PAE paging
2. A piece of software not using the standard runtime
3. Freeing the last mapping in a quadrant
4. And then causing something in that quadrant to be accessed

Chrome probably has all these characteristics; few other things would.

-hpa
On Thu, Apr 11, 2013 at 2:53 PM, H. Peter Anvin <hpa@zytor.com> wrote: > > It sounds a bit unlikely, but it also sounds like we're chasing a rare > bug that only one oddball application stomps on. One thing in > particular which I note: at least one reporter says this tended to > happen on *exiting Chrome* which would be consistent with this happening > as a result of the last mapping being removed. > > The combination of events that would have to trigger this: > > 1. PAE paging > 2. A piece of software not using the standard runtime > 3. Freeing the last mapping in a quadrant > 4. And then causing something in that quadrant to be accessed > > Chrome probably has all these characteristics; few other things would. Maybe. Attached is a patch that is TOTALLY UNTESTED and not very well thought out, but has the potential to fix that particular thing. Does bugzilla pick up on email attachments automatically? Let's see.. Linus
Created attachment 98301 [details] Untested patch for comments..

Yeah, so bugzilla didn't pick up on the attachment in the email, so here goes.

Really, I didn't test if this compiles, and I haven't thought it through all that much in general. But this just makes our pmd/pud freeing logic also set the fullmm flag, which in turn hopefully makes the actual TLB flush (done later) do a full TLB flush rather than individual pages.

Hmm?
> arch/x86/mm/pgtable.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 193350b51f90..8a3f2b66dc52 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -58,6 +58,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>  void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  {
>  	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
> +	tlb->fullmm = 1;
>  	tlb_remove_page(tlb, virt_to_page(pmd));
>  }
>
> @@ -65,6 +66,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
>  {
>  	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> +	tlb->fullmm = 1;
>  	tlb_remove_page(tlb, virt_to_page(pud));
>  }
>  #endif /* PAGETABLE_LEVELS > 3 */

OK, I just scrutinized the SDM and the semantics of INVLPG, and I'm quite certain we *only* need to do this for the PUDs, not the PMDs, and only for PAE paging, not for 64-bit paging.

INVLPG is documented as invalidating all the levels of the paging cache hierarchy; however, the PDPTEs are not considered to be part of that hierarchy and so do not get invalidated by INVLPG.

Specific SDM references (March 2013 edition):

Vol 3A § 4.4.1: PDPTE Registers -> PAE PDPTEs not invalidated by INVLPG
Vol 3A § 4.10.3.1: Caches for Paging Structures -> The third level (PDPTEs) is a cache in IA-32e (long) mode.
Vol 3A § 4.10.4.1: Operations that Invalidate TLBs and Paging-Structure Caches -> INVLPG invalidates all caches associated with the current PCID

So we should be okay to only do this when (a) changing the PUD and (b) in 32-bit PAE mode.

-hpa
On Thu, Apr 11, 2013 at 2:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Maybe. Attached is a patch that is TOTALLY UNTESTED and not very well > thought out, but has the potential to fix that particular thing. No, thinking about this some more (but still not actually testing) this is wrong. We need to add a new flag to the field, because we use 'fullmm' to also optimize ptep_get_and_clear_full() and pte_clear_not_present_full(), so we can't just set it because we want the full tlb flush. So something similar to that patch, but that uses a separate bitfield (and tests that in addition to ->fullmm in the TLB flush path) should work, but that patch as-is is broken. Linus
Artem, could you post your kernel .config? I just want to make sure there's nothing funky like a non-3:1 VMSPLIT.
Reply-To: mingo@kernel.org * H. Peter Anvin <hpa@zytor.com> wrote: > The combination of events that would have to trigger this: > > 1. PAE paging > 2. A piece of software not using the standard runtime > 3. Freeing the last mapping in a quadrant > 4. And then causing something in that quadrant to be accessed > > Chrome probably has all these characteristics; few other things would. The PAE code, when it was written, assumed that pagetables would never be freed during the life-time of an mm. This was true of the VM back then. Now, freeing a full quadrant is rare enough to match the patterns here. We also changed the TLB flushing code to be more optimized at around v3.2, which might have turned a latent bug into a real bug. So my guess would be that this got triggered by: 611ae8e3f520 x86/tlb: enable tlb flush range support for x86 To test this hypothesis the patch below could be tried (UNTESTED) - it reverts back to full TLB flushing. Note, this might still be wrong for single-page invlpg, but I'm not sure Chrome triggers that, IIRC single-page invlpg is mostly used for page fault invalidation, not munmap() invalidation. So this patch might be worth testing. Thanks, Ingo diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 282375f..dfdc726 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -196,7 +196,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, goto flush_all; } - if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 + if (1 || end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 || vmflag & VM_HUGETLB) { local_flush_tlb(); goto flush_all;
Dave Hansen was working on a patch but had to go pick up his small child... I got his patch but it needed some small touchups that he promised for tomorrow.

Ingo Molnar <mingo@kernel.org> wrote:
>So this patch might be worth testing.
Created attachment 98331 [details] My 3.8 .config (In reply to comment #31) > Artem, could you post your kernel .config? I just want to make sure there's > nothing funky like a non-3:1 VMSPLIT. Here it is. I've enabled KVM just recently, I've never actually used it yet.
He is also working on a test program.

Ingo Molnar <mingo@kernel.org> wrote:
>So this patch might be worth testing.
Well, I can at least confirm that one of the PGDs can get cleared out in practice:

	http://sr71.net/~dave/linux/paefun.c.txt

I'm working on reproducing the crash itself now.
On Thu, Apr 11, 2013 at 3:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So something similar to that patch, but that uses a separate bitfield
> (and tests that in addition to ->fullmm in the TLB flush path) should
> work, but that patch as-is is broken.

I think I'm going to go with the attached patch. It just disables the
"let's try to do page-by-page flushing" entirely on x86 when
CONFIG_X86_PAE is enabled.

Looking at that code, I also happen to think that it is very
questionable for non-PAE. In particular, look at this part:

	if (has_large_page(mm, start, end)) {
		local_flush_tlb();
		goto flush_all;
	}

and please realize that TLB flushing is done *after* the page tables
are torn down, so looking for a huge page in the page tables to see if
we are flushing one seems rather extremely stupid.

I suspect that one doesn't end up triggering, because in order to
munmap a hugepage, the size of the munmap has to be big enough that
the TLB size checks trigger, but it makes me feel less than happy
about this whole code-sequence.

	Linus
Created attachment 98401 [details]
New patch to be tested

Disable the whole page-at-a-time TLB flushing optimization for PAE, because
it's unsafe due to invlpg not invalidating the page directory pointer table
entries.

This one is actually worth testing on PAE.
That patch looks like it'll work, although it's kinda mean to the poor PAE
users. But, I guess they have it coming.

BTW... Another thing that makes this harder to trigger is the "balance
point" logic in flush_tlb_mm_range(). If the range we're flushing is over
the "balance point", we do a full flush anyway. So this probably only
triggers when we're flushing a relatively small mapping, _and_ it's the
last one in the quadrant.

I haven't had any luck triggering this in a 32-bit KVM VM. Hoping some
real hardware will do the trick.
INVLPG does invalidate huge pages, so that is really confusing.
As far as being mean to poor PAE users goes: what we do for 3.9 is one
thing, and what we do for 3.10 and 3.11 is another. 3.9 is at -rc6, so
unless we come up with a better solution by today-ish it makes total sense
to kill it now and improve it later.
Peter suggested doing something along these lines (attaching in a sec).

Except, earlier, Peter said:
> OK, I just scrutinized the SDM and the semantics of INVLPG, and I'm
> quite certain we *only* need to do this for the PUDs, not the PMDs, and
> only for PAE paging, not for 64-bit paging.

That's right, but I think Peter means that we do it for PUD _entries_,
while ___pmd_free_tlb() is freeing a pmd *PAGE*. So the attached patch
only modifies the logic in ___pmd_free_tlb(). The ___pud_free_tlb() code
that Linus's first patch touched isn't even used on PAE, so it doesn't
need to get touched.
Created attachment 98451 [details]
patch to fix pagetable corruption using mmu_gather state
> Comment #43 from Dave Hansen <dave@sr71.net> 2013-04-12 17:16:28 ---
> Created an attachment (id=98451)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=98451)
> patch to fix pagetable corruption using mmu_gather state

Patch looks good to me. Mind sending it as a full email with proper
sign-offs etc.?

Peter, Ingo, any other comments?

	Linus