Bug 42980

Summary: BUG in gfn_to_pfn_prot
Product: Virtualization
Component: kvm
Assignee: virtualization_kvm
Reporter: Luke-Jr (luke-jr+linuxbugs)
Status: REOPENED
Severity: blocking
Priority: P1
CC: alan, arequipeno, avi, florian, szg00000, xerofoify
Hardware: All
OS: Linux
Kernel Version: 3.4
Regression: No
Attachments: Fix

Description Luke-Jr 2012-03-22 21:28:37 UTC
BUG: unable to handle kernel paging request at ffff87ffffffffff
IP: [<ffffffffa03311b7>] __direct_map.clone.86+0xa7/0x240 [kvm]
PGD 0 
Oops: 0000 [#1] PREEMPT SMP 
CPU 0 
Modules linked in: tun cdc_ether usbnet cdc_acm fuse usbmon pci_stub kvm_intel kvm netconsole configfs cfq_iosched blk_cgroup snd_seq_oss snd_seq_midi_event snd_seq bridge snd_seq_device ipv6 snd_pcm_oss snd_mixer_oss stp llc coretemp hwmon usblp snd_hda_codec_hdmi snd_hda_codec_realtek usb_storage ftdi_sio usbserial usbhid hid snd_hda_intel i915 snd_hda_codec drm_kms_helper snd_hwdep drm snd_pcm firewire_ohci tpm_tis 8139too tpm firewire_core xhci_hcd i2c_algo_bit snd_timer 8250_pci 8250_pnp ehci_hcd usbcore snd e1000e 8250 tpm_bios crc_itu_t serial_core snd_page_alloc sg rtc_cmos psmouse i2c_i801 mii usb_common video evdev ata_generic pata_acpi button

Pid: 9995, comm: qemu-system-x86 Not tainted 3.2.2-gentoo #1                  /DQ67SW
RIP: 0010:[<ffffffffa03311b7>]  [<ffffffffa03311b7>] __direct_map.clone.86+0xa7/0x240 [kvm]
RSP: 0018:ffff88010bc39b08  EFLAGS: 00010293
RAX: ffff87ffffffffff RBX: 000ffffffffff000 RCX: 0000000000000027
RDX: 0000000029b55000 RSI: 0000000000000004 RDI: 0000000000000003
RBP: ffff88010bc39bb8 R08: ffff87ffffffffff R09: 0000000000113661
R10: 00000000c174f000 R11: 080000000000d974 R12: ffff880000000000
R13: ffff8803b7e6c240 R14: 0000000000000001 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88043e200000(0063) knlGS:00000000f5ffab70
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: ffff87ffffffffff CR3: 00000001027f1000 CR4: 00000000000426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 9995, threadinfo ffff88010bc38000, task ffff88000bc154f0)
Stack:
 ffff8803b7e6c240 ffff88010bc39bf0 0000000000000000 0000000000029b55
 ffff88010bc39b38 ffffffffa031ae14 00ff88010bc39bb8 0000000000000000
 0000000000113661 0000000000029b55 0000000029b55000 ffffffffffffffff
Call Trace:
 [<ffffffffa031ae14>] ? gfn_to_pfn_prot+0x14/0x20 [kvm]
 [<ffffffffa03316c0>] tdp_page_fault+0x1a0/0x1e0 [kvm]
 [<ffffffffa032d2e2>] kvm_mmu_page_fault+0x32/0xb0 [kvm]
 [<ffffffffa0362bec>] handle_ept_violation+0x4c/0xd0 [kvm_intel]
 [<ffffffffa0368ff4>] vmx_handle_exit+0xb4/0x6f0 [kvm_intel]
 [<ffffffff8103afad>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffffa0329e23>] kvm_arch_vcpu_ioctl_run+0x473/0xf40 [kvm]
 [<ffffffff8103afad>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffffa03197c2>] kvm_vcpu_ioctl+0x392/0x5e0 [kvm]
 [<ffffffffa031a3ed>] ? kvm_vm_ioctl+0x9d/0x410 [kvm]
 [<ffffffff81315529>] ? sys_sendto+0x119/0x140
 [<ffffffffa0319a65>] kvm_vcpu_compat_ioctl+0x55/0x100 [kvm]
 [<ffffffff810f81df>] ? fget_light+0x8f/0xf0
 [<ffffffff8113ee2e>] compat_sys_ioctl+0x8e/0xff0
 [<ffffffff8105df3c>] ? posix_ktime_get_ts+0xc/0x10
 [<ffffffff8105f190>] ? sys_clock_gettime+0x90/0xb0
 [<ffffffff810860db>] ? compat_sys_clock_gettime+0x7b/0x90
 [<ffffffff813c34c9>] sysenter_dispatch+0x7/0x27
Code: 89 d0 8d 4c ff 0c 4d 89 e0 48 d3 e8 4c 03 45 a8 25 ff 01 00 00 41 39 f6 89 45 bc 89 c0 49 8d 04 c0 48 89 45 b0 0f 84 e1 00 00 00 <4c> 8b 00 41 f6 c0 01 74 40 4c 8b 0d 89 80 01 00 4d 89 c2 4d 21 
RIP  [<ffffffffa03311b7>] __direct_map.clone.86+0xa7/0x240 [kvm]
 RSP <ffff88010bc39b08>
CR2: ffff87ffffffffff
---[ end trace 4db76b33c09285f5 ]---
note: qemu-system-x86[9995] exited with preempt_count 1
usb 2-1.2: USB disconnect, device number 77
INFO: rcu_preempt detected stall on CPU 3 (t=60000 jiffies)
Pid: 3610, comm: kwin Tainted: G      D      3.2.2-gentoo #1
Call Trace:
 <IRQ>  [<ffffffff810a2949>] __rcu_pending+0x1d9/0x420
 [<ffffffff8106f920>] ? tick_nohz_handler+0xe0/0xe0
 [<ffffffff810a2f62>] rcu_check_callbacks+0x122/0x1a0
 [<ffffffff810504c3>] update_process_times+0x43/0x80
 [<ffffffff8106f97b>] tick_sched_timer+0x5b/0xa0
 [<ffffffff81063873>] __run_hrtimer.clone.30+0x63/0x140
 [<ffffffff810641af>] hrtimer_interrupt+0xdf/0x210
 [<ffffffff8101d643>] smp_apic_timer_interrupt+0x63/0xa0
 [<ffffffff813c2b8b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff810b69a2>] ? __pagevec_free+0x22/0x30
 [<ffffffff813c1862>] ? _raw_spin_lock+0x32/0x40
 [<ffffffff813c1846>] ? _raw_spin_lock+0x16/0x40
 [<ffffffffa0319c3c>] kvm_mmu_notifier_invalidate_page+0x3c/0x90 [kvm]
 [<ffffffff810e31c8>] __mmu_notifier_invalidate_page+0x48/0x60
 [<ffffffff810d6ce5>] try_to_unmap_one+0x3c5/0x3f0
 [<ffffffff810d762d>] try_to_unmap_anon+0x9d/0xe0
 [<ffffffff810d7715>] try_to_unmap+0x55/0x70
 [<ffffffff810e8d21>] migrate_pages+0x2f1/0x4d0
 [<ffffffff810e1ec0>] ? suitable_migration_target+0x50/0x50
 [<ffffffff810e271f>] compact_zone+0x44f/0x7a0
 [<ffffffff810e2c07>] try_to_compact_pages+0x197/0x1f0
 [<ffffffff810b7026>] __alloc_pages_direct_compact+0xc6/0x1c0
 [<ffffffff810b74f9>] __alloc_pages_nodemask+0x3d9/0x7a0
 [<ffffffff813c14b0>] ? _raw_spin_unlock+0x10/0x40
 [<ffffffff810cd2fb>] ? handle_pte_fault+0x3bb/0x9f0
 [<ffffffff810ec831>] do_huge_pmd_anonymous_page+0x131/0x350
 [<ffffffff810cdcae>] handle_mm_fault+0x21e/0x300
 [<ffffffff81027dad>] do_page_fault+0x12d/0x430
 [<ffffffff810d3854>] ? do_mmap_pgoff+0x344/0x380
 [<ffffffff813c1cef>] page_fault+0x1f/0x30
Comment 1 Avi Kivity 2012-03-28 13:03:25 UTC
   0:	89 d0                	mov    %edx,%eax
   2:	8d 4c ff 0c          	lea    0xc(%rdi,%rdi,8),%ecx
   6:	4d 89 e0             	mov    %r12,%r8
   9:	48 d3 e8             	shr    %cl,%rax
   c:	4c 03 45 a8          	add    -0x58(%rbp),%r8
  10:	25 ff 01 00 00       	and    $0x1ff,%eax
  15:	41 39 f6             	cmp    %esi,%r14d
  18:	89 45 bc             	mov    %eax,-0x44(%rbp)
  1b:	89 c0                	mov    %eax,%eax
  1d:	49 8d 04 c0          	lea    (%r8,%rax,8),%rax
  21:	48 89 45 b0          	mov    %rax,-0x50(%rbp)
  25:	0f 84 e1 00 00 00    	je     0x10c
  2b:	4c 8b 00             	mov    (%rax),%r8
  2e:	41 f6 c0 01          	test   $0x1,%r8b
  32:	74 40                	je     0x74
  34:	4c 8b 0d 89 80 01 00 	mov    0x18089(%rip),%r9        # 0x180c4
  3b:	4d 89 c2             	mov    %r8,%r10

Appears to be __direct_map()'s

		if (!is_shadow_present_pte(*iterator.sptep)) {
			u64 base_addr = iterator.addr;

%rax is 0xffff87ffffffffff. That is one less than the base of the direct map of all physical memory.  So it looks like the code

static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
{
	if (iterator->level < PT_PAGE_TABLE_LEVEL)
		return false;

	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
	iterator->sptep	= ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
	return true;
}

saw iterator->shadow_addr == -1ULL.

That might be INVALID_PAGE assigned to pae_root (but that is masked out in shadow_walk_init()) or a stray -1 due to a completely unrelated bug.
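
A quick sanity check of the -1 theory (a standalone sketch; PAGE_OFFSET here is the x86_64 direct-map base of this era, and INVALID_PAGE is KVM's ~0 sentinel):

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET  0xffff880000000000ULL  /* base of the direct map */
#define INVALID_PAGE (~(uint64_t)0)         /* -1: a never-valid root */

int main(void)
{
	/* shadow_walk_okay() computes __va(shadow_addr) + index, i.e.
	 * shadow_addr + PAGE_OFFSET; with shadow_addr == -1ULL and
	 * index == 0 this wraps to one byte below the direct map: */
	uint64_t sptep = INVALID_PAGE + PAGE_OFFSET;

	printf("%#llx\n", (unsigned long long)sptep);
	/* prints 0xffff87ffffffffff - the faulting %rax/CR2 above */
	return 0;
}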

Anything interesting about how this was triggered?
Comment 2 Luke-Jr 2012-03-28 13:37:53 UTC
IIRC, it was pretty out of the blue. I might have had one or both of two KVMs running in the background at the time:
- 64-bit Gentoo with a Radeon 5850 passthrough'd (VT-d)
- 32-bit Ubuntu with a nested 32-bit KVM
Comment 3 Avi Kivity 2012-03-28 13:45:25 UTC
You're a brave one.

It wasn't the nested one (at least, it wasn't running in the guest's guest at the moment of the crash), but it might be related.
Comment 4 Luke-Jr 2012-03-28 13:49:26 UTC
I suppose I should mention I'd been running both of these stably for at least a month (and the GPU passthrough for nearly a full year). One factor that might (or might not) be related: the GPU fan recently died. When this crash took me down, I removed the GPU, so I won't be able to do any further testing with that setup (unless I find another similar GPU at a good price).
Comment 5 Avi Kivity 2012-03-28 15:07:25 UTC
vcpu_enter_guest()
  kvm_mmu_reload() // now root_hpa is valid
  inject_pending_event()
    vmx_interrupt_allowed()
      nested_vmx_vmexit()
        load_vmcs12_host_state()
          kvm_mmu_reset_context() // root_hpa now invalid
  kvm_guest_enter()
  ... page fault because root_hpa is invalid, oops
Comment 6 Avi Kivity 2012-05-10 10:53:48 UTC
Created attachment 73244 [details]
Fix

Please test the attached patch.
Comment 7 Luke-Jr 2012-05-10 13:17:17 UTC
Is there anything I can do to reproduce the problem condition for the test? It seems to only occur about once every 6 months normally.
Comment 8 Avi Kivity 2012-05-10 13:30:36 UTC
Try running 

  while :; do :; done

in the nested (L2) guest, and ping -f the L1 guest from the host.
Comment 9 Luke-Jr 2012-05-17 20:58:50 UTC
The while/ping thing doesn't reproduce it even before the patch. :(
Comment 10 Luke-Jr 2012-06-16 03:16:44 UTC
For what it's worth, no crashes in over a month. But it wasn't common enough to rule out coincidence either...
Comment 11 Florian Mickler 2012-07-01 09:46:47 UTC
A patch referencing this bug report has been merged in Linux v3.5-rc1:

commit d8368af8b46b904def42a0f341d2f4f29001fa77
Author: Avi Kivity <avi@redhat.com>
Date:   Mon May 14 18:07:56 2012 +0300

    KVM: Fix mmu_reload() clash with nested vmx event injection
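
For reference, the gist of that change (a paraphrase of the scenario in comment 5 and the commit subject, not the literal diff): vcpu_enter_guest() had called kvm_mmu_reload() before inject_pending_event(), so an injection that bounced through a nested vmexit could invalidate root_hpa after the reload; the fix reorders the two calls:

	/* Sketch of the vcpu_enter_guest() reordering (paraphrased) */

	/* Before the fix: */
	kvm_mmu_reload(vcpu);          /* root_hpa made valid ...          */
	inject_pending_event(vcpu);    /* ... then a nested vmexit may call
	                                  kvm_mmu_reset_context() and
	                                  invalidate root_hpa again        */

	/* After the fix: */
	inject_pending_event(vcpu);    /* any mmu reset happens first ...  */
	kvm_mmu_reload(vcpu);          /* ... and is repaired before entry */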
Comment 12 Luke-Jr 2012-08-15 22:24:36 UTC
Sorry I didn't report it sooner, but I have had the same crash since June, with this patch. :(
Comment 13 Alan 2012-08-15 22:34:02 UTC
Which kernel?
Comment 14 Luke-Jr 2012-08-15 22:38:45 UTC
I'm not sure if it was 3.4.0, 3.4.3, or 3.4.4. Since May 17, I have been building all my kernels (including those) with this patch applied.
Comment 15 Luke-Jr 2012-08-15 22:47:39 UTC
3.4.0: http://luke.dashjr.org/tmp/code/20120624_002.jpg
Comment 16 Alan 2012-08-16 09:32:17 UTC
Thanks
Comment 17 Ian Pilcher 2012-11-17 22:00:39 UTC
I just hit this.

Host:  Intel DQ67SW, Core i7 2600, 24GB RAM
BIOS:  SWQ6710H.86A.0065.2012.0917.1519

Host OS:  Fedora 17
          kernel-3.6.6-1.fc17.x86_64
          qemu-kvm-1.2.0-20.fc17.x86_64

L1 Guest OS:  RHEL 6.3
              kernel-2.6.32-279.14.1.el6.x86_64
              qemu-kvm-rhev-0.12.1.2-2.295.el6_3.5.x86_64

L2 Guest OS:  RHEL 6.3
              kernel-2.6.32-279.14.1.el6.x86_64

I was running a Pulp sync between a couple of L2 guests when this occurred, which presumably generated quite a bit of traffic across the virtual bridges.  I am using Open vSwitch for all of the bridges on the host OS.  The virtualized RHEV hypervisors use standard Linux bridges.

Please let me know if I can provide any additional information to help track this down.
Comment 18 Ian Pilcher 2012-11-17 22:10:45 UTC
(In reply to comment #11)
> A patch referencing this bug report has been merged in Linux v3.5-rc1:
> 
> commit d8368af8b46b904def42a0f341d2f4f29001fa77
> Author: Avi Kivity <avi@redhat.com>
> Date:   Mon May 14 18:07:56 2012 +0300
> 
>     KVM: Fix mmu_reload() clash with nested vmx event injection

Silly question.  Is this patch applicable to the physical host, the L1 guest (virtualized hypervisor), or both?
Comment 19 Avi Kivity 2012-11-18 14:15:41 UTC
(In reply to comment #18)
> Silly question.  Is this patch applicable to the physical host, the L1 guest
> (virtualized hypervisor), or both?

The physical host.  If you want to run a hypervisor in L2, you need to
apply it to L1 as well.
Comment 20 Ian Pilcher 2012-11-18 17:06:36 UTC
(In reply to comment #19)
> The physical host.  If you want to run a hypervisor in L2, you need to
> apply it to L1 as well.

OK.  If I'm parsing that correctly, it sounds like backporting the patch to the RHEL 6 kernel so that I could run it in the L1 hypervisors wouldn't help anything.

Bummer.

Any ideas on how I can make this environment stable?

I see that Luke-Jr is also on a DQ67SW, and he's doing PCI passthrough.  I do have VT-d enabled, although I'm not actually doing any PCI passthrough.  Is that something that could be related to this?
Comment 21 Ian Pilcher 2012-12-08 20:50:25 UTC
I just hit this again (I think).  Pretty much out of the blue, with a bunch of VMs running, including at least 2 nested guests.

I have been trying to get a kdump of this, and I believe that I was at least somewhat successful.  The system didn't dump automatically, but I was able to get it to do so by hitting alt-sysrq-c.  The vmcore file is 3.7G, so suggestions as to a place to post it publicly would be appreciated.
Comment 22 xerofoify 2014-06-25 02:11:39 UTC
Please retest against a newer kernel. This bug seems obsolete to me as of the kernel versions released in the 2014 time frame.
Cheers, Nick