Bug 198517

Summary: Double fault in load_new_mm_cr3 with KPTI enabled
Product: Memory Management
Reporter: Neil Berrington (neil.berrington)
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: high
CC: dave, luto, rob
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.13.0-26-generic, 4.14.14, 4.15-rc8, 4.15-rc9
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Screen shot of double fault from hyperv vm
             Another screen shot of double fault from hyperv vm
             kernel module reproducer
             Crash under hyper-v when running a 4.15.rc9 kernel
             Disassembly of load_new_mm_cr3 in 4.15.rc9

Description Neil Berrington 2018-01-19 11:06:06 UTC
Created attachment 273711 [details]
Screen shot of double fault from hyperv vm

We have a repeatable reproducer, but only on large systems - we're using a dual Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz with 768GB RAM. We've reproduced the problem on several hardware systems, both bare metal and running as a Hyper-V VM. 

The panic only happens when KPTI is enabled. The test does not reproduce on smaller systems (for example, same dual Xeon, 256GB). 

We cannot reproduce on the 4.4.112 kernel or on CentOS 3.10.0-693.11.6.el7.x86_64.

Steps to reproduce:

1) Allocate half of the system's RAM using vmalloc().
2) Attempt to spawn a large number of threads (user mode or kernel mode).

I'll attach a kernel module that demonstrates the issue, and some screen shots of the double fault when running in a Hyper-V VM. We were able to cause a hang on bare metal, but our remote IMM console session didn't capture the panic for some reason.
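(The attached module is not reproduced here; below is a hypothetical minimal sketch of that approach. Parameter names such as vmalloc_gib and nr_threads are made up for illustration and are not taken from the attachment.)

#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static unsigned long vmalloc_gib = 384;		/* roughly half of 768GB */
module_param(vmalloc_gib, ulong, 0444);

static unsigned int nr_threads = 1000;
module_param(nr_threads, uint, 0444);

static void *big_buf;

static int repro_thread_fn(void *data)
{
	/* Just park here; each kthread gets its own vmalloc()'d stack. */
	while (!kthread_should_stop())
		msleep(100);
	return 0;
}

static int __init repro_init(void)
{
	unsigned int i;

	/* Step 1: eat roughly half of RAM via vmalloc(). */
	big_buf = vmalloc(vmalloc_gib << 30);
	if (!big_buf)
		return -ENOMEM;

	/* Step 2: spawn many threads so their stacks land high in vmalloc space. */
	for (i = 0; i < nr_threads; i++) {
		struct task_struct *t = kthread_run(repro_thread_fn, NULL,
						    "repro-%u", i);
		if (IS_ERR(t))
			break;
	}

	/* The threads are deliberately leaked; this module exists only to trigger the crash. */
	return 0;
}

static void __exit repro_exit(void)
{
	vfree(big_buf);
}

module_init(repro_init);
module_exit(repro_exit);
MODULE_LICENSE("GPL");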
Comment 1 Neil Berrington 2018-01-19 11:06:43 UTC
Created attachment 273713 [details]
Another screen shot of double fault from hyperv vm
Comment 2 Neil Berrington 2018-01-19 11:07:20 UTC
Created attachment 273715 [details]
kernel module reproducer
Comment 3 Dave Hansen 2018-01-19 21:13:51 UTC
Isn't this an Ubuntu kernel?

I suspect this is one of the switch_mm*() bits wrongly using PCIDs when they are unsupported.  We have fixes in mainline, and it probably doesn't occur with 4.4.112 (or 3.10.*) because the switch_mm*() code there is so different.

Does booting with 'nopcid' on the kernel command-line help?
Comment 4 Andy Lutomirski 2018-01-19 21:20:38 UTC
Can you attach disassembly of load_new_mm_cr3() on the kernel that you got the screen shot from?  I'm curious which instruction is double faulting.  Your RDI makes it look like the value written to CR3 had the noflush bit set and PCID=6.  But this is hyperv and I didn't think hyperv supported PCID.

But there's another possibility that would explain the only-happens-on-enormous-system issue: something could be wrong with the "if (unlikely(pgd_none(*pgd))) set_pgd(pgd, init_mm.pgd[index]);" thingy.
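(For context, a simplified sketch of the lazy PGD-sync pattern being referred to; this is illustrative and not the actual 4.15 code.)

/* Illustrative only: copy a kernel PGD entry from init_mm into the next mm
 * if it is missing there, e.g. before switching to a vmalloc()'d stack that
 * sits in a 512GB slot the next mm has never faulted in. */
#include <linux/mm.h>
#include <asm/pgtable.h>

static void sync_kernel_pgd_sketch(struct mm_struct *next, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(next, addr);	/* entry in the next mm */
	pgd_t *pgd_ref = pgd_offset_k(addr);	/* entry in init_mm     */

	/*
	 * Note: with 4-level paging the p4d level is folded into the pgd,
	 * which is exactly the sort of subtlety that can make a check like
	 * this behave differently than expected.
	 */
	if (unlikely(pgd_none(*pgd)) && !pgd_none(*pgd_ref))
		set_pgd(pgd, *pgd_ref);
}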

What kernel are you seeing the problem on?  This is a "KAISER" kernel, not "PTI", right?
Comment 5 Dave Hansen 2018-01-20 01:34:40 UTC
Yeah, since a single PGD entry covers 512GB of address space, a system this large needs more than one PGD entry for the kernel direct map.  If that somehow leaves some vmalloc()'d stack pages unreadable, the task double-faults, but the double fault handler gets its own stack, so *it* runs and gets the oops out.  That's consistent with being able to get an oops out, but the RSP in the dump is not particularly close to a PGD boundary, which somewhat argues against the stack being the culprit.
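(A quick back-of-the-envelope check of that, as a throwaway userspace snippet; numbers only, nothing taken from the thread.)

/* Illustrative only: one PGD entry maps 1ULL << 39 = 512GB on 4-level x86-64. */
#include <stdio.h>

int main(void)
{
	unsigned long long pgd_span = 1ULL << 39;	/* 512GB */
	unsigned long long sizes[] = { 768ULL << 30, 256ULL << 30 };
	int i;

	/* Prints: 768GB -> 2 PGD entries, 256GB -> 1 PGD entry. */
	for (i = 0; i < 2; i++)
		printf("%lluGB direct map -> %llu PGD entries\n",
		       sizes[i] >> 30,
		       (sizes[i] + pgd_span - 1) / pgd_span);
	return 0;
}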

It would be really great if you could reproduce this on a current 4.14-stable or current 4.15 kernel.  Otherwise, please report it to the great Ubuntu folks.
Comment 6 Neil Berrington 2018-01-22 13:57:32 UTC
We did test with vanilla kernels in both Ubuntu and CentOS environments.
 
Today I've tested with 4.15.rc9 and am still able to reproduce the issue on a Hyper-V VM and bare metal. I'll attach a screen shot of the Hyper-V failure and disassembly of load_new_mm_cr3 for this failure.
 
We can reproduce the crash when booting with nopcid on both Hyper-V and bare metal.

As mentioned previously, when booting with nopti we're unable to reproduce the crash.
Comment 7 Neil Berrington 2018-01-22 13:58:18 UTC
Created attachment 273787 [details]
Crash under hyper-v when running a 4.15.rc9 kernel
Comment 8 Neil Berrington 2018-01-22 14:00:57 UTC
Created attachment 273789 [details]
Disassembly of load_new_mm_cr3 in 4.15.rc9
Comment 9 Dave Hansen 2018-01-22 15:24:10 UTC
Neil, thanks for testing 4.15-rc9.  I'll see if I can reproduce this on a machine with a comparable amount of RAM and something similar to your reproducer.
Comment 10 Andy Lutomirski 2018-01-22 17:55:11 UTC
Can you give a longer disassembly that includes the actual load_new_mm_cr3() label (or, if you're disassembling from a stripped vmlinux, more disassembly + the System.map entry for load_new_mm_cr3 works, too)?  I don't think the actual faulting instruction is in the listing you gave.
Comment 11 Rob Whitton 2018-01-22 18:58:11 UTC
Hi,

I'm working with Neil on this and generated the disassembly this morning.

The disassembly is from the symbol as identified by the appropriate System.map file. The function is 0xf0 bytes long (i.e. you should have the whole thing) and the offset of the faulting instruction in this case is 0x51 (which is a pop %rbp).

We have also generated the disassembly for one of the other kernel versions (as reported earlier). In that case the function was 0xe0 bytes long and the offset of the faulting instruction was 0x3e; in both cases the instruction pointed to is a pop %rbp. For the earlier case, the instruction sequence leading up to the faulting instruction is:


ffffffff81075630:	e8 ab c8 98 00       	callq  0xffffffff81a01ee0
ffffffff81075635:	55                   	push   %rbp
ffffffff81075636:	84 d2                	test   %dl,%dl
ffffffff81075638:	48 89 e5             	mov    %rsp,%rbp
ffffffff8107563b:	75 43                	jne    0xffffffff81075680
ffffffff8107563d:	b9 00 00 00 80       	mov    $0x80000000,%ecx
ffffffff81075642:	48 8b 05 c7 c9 19 01 	mov    0x119c9c7(%rip),%rax        # 0xffffffff82212010
ffffffff81075649:	48 01 f9             	add    %rdi,%rcx
ffffffff8107564c:	73 22                	jae    0xffffffff81075670
ffffffff8107564e:	8d 7e 01             	lea    0x1(%rsi),%edi
ffffffff81075651:	48 ba 00 00 00 00 00 	movabs $0x8000000000000000,%rdx
ffffffff81075658:	00 00 80 
ffffffff8107565b:	48 01 c8             	add    %rcx,%rax
ffffffff8107565e:	0f b7 ff             	movzwl %di,%edi
ffffffff81075661:	48 09 d7             	or     %rdx,%rdi
ffffffff81075664:	48 09 c7             	or     %rax,%rdi
ffffffff81075667:	ff 14 25 f8 0a 18 82 	callq  *0xffffffff82180af8
ffffffff8107566e:	5d                   	pop    %rbp  <--- apparently faulting here


If we look at how rax, rdi and rdx are manipulated just before the call, and compare that with the register contents in the attached crash dump "Another screen shot of double fault from hyperv vm", it does look like we really are at the correct location.
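(For reference, a rough C rendering of the instruction sequence above, with names loosely borrowed from the 4.15-era TLB code; this is illustrative, not verbatim kernel source.)

/* Rough, illustrative rendering of the listing above (not verbatim source). */
#include <asm/page.h>		/* __pa()  */
#include <asm/pgtable_types.h>	/* pgd_t   */

#define CR3_NOFLUSH_SKETCH	(1ULL << 63)	/* the movabs $0x8000000000000000 */

static unsigned long build_cr3_noflush_sketch(pgd_t *pgd, unsigned short asid)
{
	unsigned long pa = __pa(pgd);			 /* add of the page-offset constant */
	unsigned long pcid = (unsigned short)(asid + 1); /* lea 0x1(%rsi),%edi ; movzwl     */

	return pa | pcid | CR3_NOFLUSH_SKETCH;		 /* or %rdx,%rdi ; or %rax,%rdi     */
}

/* The callq *0xffffffff82180af8 is then the indirect (paravirt) write of this
 * value to CR3; the pop %rbp reported as faulting is the first instruction
 * executed after that write. */

One plausible reading of the faulting pop %rbp: once the new CR3 is loaded, if the new page tables are missing the entry covering the current vmalloc()'d stack, the very next stack access faults, the page fault handler cannot push onto that stack either, and the CPU double-faults, which would line up with the double fault handler getting its own stack as described above.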
Comment 12 Dave Hansen 2018-01-22 19:58:09 UTC
Neil, thanks for testing 4.15-rc9.  I've managed to reproduce _something_ funky on a smaller system.  Namely, I see a valid tsk->stack but a NULL tsk->stack_vm_area in dup_task_struct().  This leads to us doing virt_to_page(tsk->stack), which gives us a fault on something off the end of the vmemmap[], like ffffeb040003911c.

That matches the signature from the reported oopses, I think.

I reproduced this by taking your reproducer module and doing this instead of vmalloc():

    s_mem = alloc_vmap_area((unsigned long long)s_vmalloc_gib * GIB,
                    PAGE_SIZE, VMALLOC_START, VMALLOC_END, 0, GFP_KERNEL);

and then never freeing it.  That lets us move swiftly through the vmalloc() space on a small system by leaking a few of those vmap_areas.

Now we've just got to figure out how we got a NULL tsk->stack_vm_area when we have a valid tsk->stack.
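(For reference, a paraphrased sketch of the 4.15-era fork path involved, from memory and not verbatim kernel source, showing where a NULL find_vm_area() result would push us onto the virt_to_page() fallback.)

/* Paraphrased from memory of 4.15-era kernel/fork.c; not verbatim source. */
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

static unsigned long *alloc_thread_stack_sketch(struct task_struct *tsk, int node)
{
	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
					   VMALLOC_START, VMALLOC_END,
					   THREADINFO_GFP, PAGE_KERNEL,
					   0, node, __builtin_return_address(0));

	/* The lookup that apparently came back NULL in the run above. */
	tsk->stack_vm_area = stack ? find_vm_area(stack) : NULL;
	tsk->stack = stack;
	return stack;
}

static void account_kernel_stack_sketch(struct task_struct *tsk)
{
	struct vm_struct *vm = tsk->stack_vm_area;

	if (vm) {
		/* Normal CONFIG_VMAP_STACK case: account against vm->pages[]. */
	} else {
		/*
		 * Fallback assumes a directly-mapped stack; called with a
		 * vmalloc address, this virt_to_page() indexes off the end
		 * of vmemmap[], matching the ffffeb04... address above.
		 */
		struct page *page = virt_to_page(tsk->stack);
		(void)page;
	}
}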
Comment 13 Andy Lutomirski 2018-01-22 20:50:33 UTC
Dave, I suspect you've found an unrelated bug.  Maybe you're breaking the vmalloc subsystem in such a way that find_vm_area() incorrectly returns NULL?  You could try BUG_ON(!tsk->stack_vm_area) after find_vm_area().  FWIW, someone posted a patch to allocate the stack and find its vm_area all at once, and I'm not sure what happened to it.

Neil, can you test the following two patches:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=e427f8713ac026aafac9f37285765a06d1ccc820
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=d9b92ca301271eb62acca09429914a87ff80e951

I'm hoping you can test two different things:

1. Test with PTI on with both patches applied.  Your bug should be fixed.

2. Test with PTI off and only the first patch applied.  This will help me validate that I didn't screw up the first patch too badly.  I don't have access to a large enough system to do this usefully myself.

Thanks!
Comment 14 Neil Berrington 2018-01-23 16:57:37 UTC
Hi Andy,

I applied both patches and tested with PTI on - everything was fine, no panics. Testing with just the first patch applied with PTI off also went fine.

Looks good to me!