Bug 37732

Summary: intel-kvm/ksmd gp fault after some page scan changes in /sys/kernel/mm/ksm/
Product: Virtualization Reporter: Konstantin (kon)
Component: kvm Assignee: virtualization_kvm
Status: RESOLVED INSUFFICIENT_DATA    
Severity: normal CC: alan, avi
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 2.6.38.5 Subsystem:
Regression: No Bisected commit-id:

Description Konstantin 2011-06-17 06:12:16 UTC
Hi,

The following trace occurred on our physical machine:

Jun 16 21:34:01 dezem kernel: [3317439.691539] general protection fault: 0000 [#1] SMP
Jun 16 21:34:01 dezem kernel: [3317439.691575] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.4/0000:06:00.0/irq
Jun 16 21:34:01 dezem kernel: [3317439.691623] CPU 2
Jun 16 21:34:01 dezem kernel: [3317439.691630] Modules linked in: iptable_filter ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs tpm tpm_bios timeriomem_rng xt_tcpudp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables bridge stp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi fbcon snd_seq_midi_event snd_seq tileblit font snd_timer bitblit snd_seq_device softcursor snd radeon ttm drm_kms_helper drm psmouse i2c_algo_bit i7core_edac soundcore edac_core snd_page_alloc serio_raw multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 r8169 ahci libahci mii sata_nv sata_sil sata_via [last unloaded: virtio_rng]
Jun 16 21:34:01 dezem kernel: [3317439.692139]
Jun 16 21:34:01 dezem kernel: [3317439.692161] Pid: 50, comm: ksmd Not tainted 2.6.38.5 #4 MSI MS-7522/MSI X58 Pro-E (MS-7522)
Jun 16 21:34:01 dezem kernel: [3317439.692222] RIP: 0010:[<ffffffffa0441e96>]  [<ffffffffa0441e96>] kvm_set_pte_rmapp+0x56/0x140 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.692287] RSP: 0018:ffff88060c7f9c30  EFLAGS: 00010202
Jun 16 21:34:01 dezem kernel: [3317439.692316] RAX: 00008804530057f8 RBX: 00008804530057f8 RCX: ffff880569a61050
Jun 16 21:34:01 dezem kernel: [3317439.692363] RDX: 0000000000000000 RSI: ffffc90016c1eff8 RDI: ffff88060d30c000
Jun 16 21:34:01 dezem kernel: [3317439.692410] RBP: ffff88060c7f9c70 R08: ffff88060c4f43e0 R09: 0000000000000000
Jun 16 21:34:01 dezem kernel: [3317439.692456] R10: ffffea001507f2d0 R11: 00000000001a7188 R12: ffff88060d30c000
Jun 16 21:34:01 dezem kernel: [3317439.692503] R13: ffffc90016c1eff8 R14: ffff88060c7f9d00 R15: 00000000005b94bd
Jun 16 21:34:01 dezem kernel: [3317439.692551] FS:  0000000000000000(0000) GS:ffff8800bf440000(0000) knlGS:0000000000000000
Jun 16 21:34:01 dezem kernel: [3317439.692600] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 16 21:34:01 dezem kernel: [3317439.692630] CR2: 00007fda2accc538 CR3: 000000000171f000 CR4: 00000000000026e0
Jun 16 21:34:01 dezem kernel: [3317439.692677] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 16 21:34:01 dezem kernel: [3317439.692724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 16 21:34:01 dezem kernel: [3317439.692771] Process ksmd (pid: 50, threadinfo ffff88060c7f8000, task ffff88060c7f7710)
Jun 16 21:34:01 dezem kernel: [3317439.692819] Stack:
Jun 16 21:34:01 dezem kernel: [3317439.692840]  ffff88060c7f7748 0000000602456000 ffff88060c7f7748 ffff880569a61000
Jun 16 21:34:01 dezem kernel: [3317439.692895]  0000000000000001 ffff880569a61060 00007f99fe13a000 0000000000000060
Jun 16 21:34:01 dezem kernel: [3317439.692950]  ffff88060c7f9cf0 ffffffffa043f57c 0000000000012140 ffff880569a61050
Jun 16 21:34:01 dezem kernel: [3317439.693005] Call Trace:
Jun 16 21:34:01 dezem kernel: [3317439.693036]  [<ffffffffa043f57c>] kvm_handle_hva+0xbc/0x1a0 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.693074]  [<ffffffffa0441e40>] ? kvm_set_pte_rmapp+0x0/0x140 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.693112]  [<ffffffffa043f6c1>] kvm_set_spte_hva+0x21/0x30 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.693145]  [<ffffffff8155b57e>] ? _raw_spin_lock+0xe/0x20
Jun 16 21:34:01 dezem kernel: [3317439.693180]  [<ffffffffa042668a>] kvm_mmu_notifier_change_pte+0x5a/0x90 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.693229]  [<ffffffff81123b5e>] __mmu_notifier_change_pte+0x3e/0x90
Jun 16 21:34:01 dezem kernel: [3317439.693262]  [<ffffffff81124e88>] try_to_merge_with_ksm_page+0x318/0x5f0
Jun 16 21:34:01 dezem kernel: [3317439.693295]  [<ffffffff8112d33d>] ? follow_trans_huge_pmd+0x3d/0x130
Jun 16 21:34:01 dezem kernel: [3317439.693327]  [<ffffffff81125e0d>] ksm_scan_thread+0x69d/0xdd0
Jun 16 21:34:01 dezem kernel: [3317439.693360]  [<ffffffff810726d0>] ? autoremove_wake_function+0x0/0x40
Jun 16 21:34:01 dezem kernel: [3317439.693392]  [<ffffffff81125770>] ? ksm_scan_thread+0x0/0xdd0
Jun 16 21:34:01 dezem kernel: [3317439.693423]  [<ffffffff81072176>] kthread+0x96/0xa0
Jun 16 21:34:01 dezem kernel: [3317439.693453]  [<ffffffff81003da4>] kernel_thread_helper+0x4/0x10
Jun 16 21:34:01 dezem kernel: [3317439.693485]  [<ffffffff810720e0>] ? kthread+0x0/0xa0
Jun 16 21:34:01 dezem kernel: [3317439.693514]  [<ffffffff81003da0>] ? kernel_thread_helper+0x0/0x10
Jun 16 21:34:01 dezem kernel: [3317439.693544] Code: 48 89 f8 0f 1f 40 00 31 d2 49 89 c7 4c 89 ee 49 c1 e7 12 4c 89 e7 49 c1 ef 1e e8 86 d0 ff ff 48 89 c3 48 85 c0 0f 84 82 00 00 00 <48> 8b 00 48 8b 15 50 5a 02 00 48 39 d0 74 61 48 3b 05 4c 5a 02
Jun 16 21:34:01 dezem kernel: [3317439.693746] RIP  [<ffffffffa0441e96>] kvm_set_pte_rmapp+0x56/0x140 [kvm]
Jun 16 21:34:01 dezem kernel: [3317439.693786]  RSP <ffff88060c7f9c30>
Jun 16 21:34:01 dezem kernel: [3317439.694036] ---[ end trace 2a7d82397ef2ccf5 ]---

About one hour later we rebooted our physical server.

We are running one Intel i7 950 with 24 GB RAM. Our VMs don't use the HyperThreading feature at all, because threads per core is set to one.

I believe the problem could be HyperThreading, because I have never seen this on AMD machines :-)
Comment 1 Avi Kivity 2011-06-19 11:55:51 UTC
 2b:	48 8b 00             	mov    (%rax),%rax
RAX: 00008804530057f8

The high 16 bits of RAX are clear.
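
To spell out why that matters: with 48-bit virtual addressing (assumed here, as with 4-level paging on this class of CPU), bits 63..47 of an x86-64 address must all be equal. Dereferencing a non-canonical address raises a general protection fault rather than a page fault. A minimal sketch of the check follows; the helper name is_canonical is made up for illustration.

```python
def is_canonical(addr, va_bits=48):
    """Return True if addr is a canonical x86-64 virtual address.

    Canonical means bits 63..(va_bits - 1) are all copies of bit
    (va_bits - 1). va_bits=48 is an assumption (4-level paging).
    """
    upper = addr >> (va_bits - 1)             # bits 63..47 when va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits=48
    return upper == 0 or upper == all_ones


rax = 0x00008804530057F8  # RAX from the oops
rcx = 0xFFFF880569A61050  # RCX from the oops, for comparison

print(is_canonical(rax))  # False: bit 47 is set but bits 63..48 are clear
print(is_canonical(rcx))  # True
```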

Please re-run with the kernel parameter slub_debug=ZFPU and report (assuming CONFIG_SLUB=y)
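
For reference, each letter in slub_debug=ZFPU selects one SLUB debug option. The sketch below decodes the option from a kernel command line string. The flag meanings are as I understand them from the kernel's SLUB documentation, and the function and table names are hypothetical.

```python
# Flag letters as described in the kernel's SLUB documentation (assumed
# unchanged for this kernel version).
SLUB_DEBUG_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking (alloc/free)",
    "T": "trace",
}


def decode_slub_debug(cmdline):
    """Return descriptions of the slub_debug=<flags> letters in cmdline."""
    for token in cmdline.split():
        if token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            flags = value.split(",", 1)[0]  # drop an optional ",<slab name>"
            return [SLUB_DEBUG_FLAGS.get(ch, "unknown flag " + ch) for ch in flags]
    return []


for description in decode_slub_debug("root=/dev/sda1 ro slub_debug=ZFPU"):
    print(description)
```

On a running system the same function can be fed the contents of /proc/cmdline to confirm the parameter was picked up after the reboot.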

Are you using netfilter?

Likely a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=27052 and of http://www.spinics.net/lists/kvm/msg55556.html.
Comment 2 Konstantin 2011-06-19 13:44:05 UTC
RAX should contain ffff8804530057f8, right?

If so, then I understand the GP fault here :-) ... a bit
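
A quick arithmetic check of that guess: setting the upper 16 bits of the faulting RAX value gives an address inside the kernel's direct mapping of physical memory, the same region as the neighbouring pointers in RCX and RDI. This is consistent with a pointer whose top bits were cleared, but it does not prove it. The sketch assumes the direct-map range ffff880000000000 to ffffc7ffffffffff used by x86-64 kernels of this era; the constants and helper name are illustrative.

```python
DIRECT_MAP_START = 0xFFFF880000000000  # assumed direct-map base (2.6.38, x86-64)
DIRECT_MAP_END = 0xFFFFC7FFFFFFFFFF    # assumed end of the 64 TB direct map


def in_direct_map(addr):
    """Return True if addr falls in the assumed direct-map range."""
    return DIRECT_MAP_START <= addr <= DIRECT_MAP_END


rax = 0x00008804530057F8                # faulting RAX from the oops
candidate = rax | 0xFFFF000000000000    # restore the upper 16 bits

print(hex(candidate))                   # 0xffff8804530057f8
print(hex(rax ^ candidate))             # 0xffff000000000000
print(in_direct_map(candidate))         # True
print(in_direct_map(0xFFFF880569A61050))  # True (RCX from the oops)
```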

Yes, I use netfilter to forward traffic to and from the VMs.

Is it possible that the RCU subsystem is involved, because netfilter uses it?

I will try again with slub_debug=ZFPU and report as soon as possible.

But first I have to set up a dev box to do this.

Now this server is running with 2.6.38.8.
Comment 3 Konstantin 2011-06-19 14:11:46 UTC
(In reply to comment #2)

This machine is using SLAB, not SLUB.
Comment 4 Avi Kivity 2011-06-19 14:37:24 UTC
Please reconfigure using SLUB and run the tests.  We don't suspect SLUB; 
rather we want to use its debug capabilities.