Bug 209079 - CPU 0/KVM: page allocation failure on 5.8 kernel
Summary: CPU 0/KVM: page allocation failure on 5.8 kernel
Status: RESOLVED OBSOLETE
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: All Linux
Importance: P1 normal
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-30 15:22 UTC by Martin Schrodt
Modified: 2020-09-20 09:17 UTC
CC List: 3 users

See Also:
Kernel Version: 5.8.5-arch1-1
Subsystem:
Regression: No
Bisected commit-id:



Description Martin Schrodt 2020-08-30 15:22:06 UTC
When starting my KVM VM on the current 5.8 kernel, it fails to start, complaining:

> internal error: qemu unexpectedly closed the monitor:
> 2020-08-30T15:16:10.389012Z qemu-system-x86_64: kvm_init_vcpu failed: Cannot
> allocate memory

The same VM works fine on a 5.7 kernel. I also tried an earlier 5.8 kernel, with the same outcome.

dmesg shows the following:

[Sun Aug 30 17:16:09 2020] CPU 0/KVM: page allocation failure: order:0, mode:0x400cc4(GFP_KERNEL_ACCOUNT|GFP_DMA32), nodemask=(null),cpuset=emulator,mems_allowed=1
[Sun Aug 30 17:16:09 2020] CPU: 11 PID: 16473 Comm: CPU 0/KVM Tainted: P           OE     5.8.5-arch1-1 #1
[Sun Aug 30 17:16:09 2020] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Phantom Gaming 6, BIOS P1.10 11/15/2018
[Sun Aug 30 17:16:09 2020] Call Trace:
[Sun Aug 30 17:16:09 2020]  dump_stack+0x6b/0x88
[Sun Aug 30 17:16:09 2020]  warn_alloc.cold+0x78/0xdc
[Sun Aug 30 17:16:09 2020]  __alloc_pages_slowpath.constprop.0+0xd14/0xd50
[Sun Aug 30 17:16:09 2020]  __alloc_pages_nodemask+0x2e4/0x310
[Sun Aug 30 17:16:09 2020]  alloc_mmu_pages+0x27/0x90 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_mmu_create+0x100/0x140 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_arch_vcpu_create+0x48/0x360 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_vm_ioctl+0xa2d/0xe60 [kvm]
[Sun Aug 30 17:16:09 2020]  ksys_ioctl+0x82/0xc0
[Sun Aug 30 17:16:09 2020]  __x64_sys_ioctl+0x16/0x20
[Sun Aug 30 17:16:09 2020]  do_syscall_64+0x44/0x70
[Sun Aug 30 17:16:09 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sun Aug 30 17:16:09 2020] RIP: 0033:0x7f4e8ba7cf6b
[Sun Aug 30 17:16:09 2020] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 ae 0c 00 f7 d8 64 89 01 48
[Sun Aug 30 17:16:09 2020] RSP: 002b:00007f4e6bffe6b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Sun Aug 30 17:16:09 2020] RAX: ffffffffffffffda RBX: 000000000000ae41 RCX: 00007f4e8ba7cf6b
[Sun Aug 30 17:16:09 2020] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 0000000000000019
[Sun Aug 30 17:16:09 2020] RBP: 0000563079828020 R08: 0000000000000000 R09: 0000563079844010
[Sun Aug 30 17:16:09 2020] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[Sun Aug 30 17:16:09 2020] R13: 00007fff52f4494f R14: 0000000000000000 R15: 00007f4e6bfff640
[Sun Aug 30 17:16:09 2020] Mem-Info:
[Sun Aug 30 17:16:09 2020] active_anon:414866 inactive_anon:28099 isolated_anon:0
                            active_file:31776 inactive_file:88136 isolated_file:0
                            unevictable:32 dirty:521 writeback:0
                            slab_reclaimable:19827 slab_unreclaimable:137048
                            mapped:142120 shmem:28302 pagetables:6905 bounce:0
                            free:6992691 free_pcp:4628 free_cma:0

The system is a Threadripper 1920X on an ASRock Phantom Gaming 6 X399 board with 32 GB RAM. It is a NUMA architecture with two nodes:

➜  numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 15966 MB
node 0 free: 12860 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 16112 MB
node 1 free: 14559 MB
node distances:
node   0   1 
  0:  10  16 
  1:  16  10 

The VM is configured to only allocate memory on node 1.
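
(A strict numatune policy in the libvirt domain XML is one way to express this; the snippet below is a minimal sketch with a placeholder domain name, assuming the VM is libvirt-managed, which the "qemu unexpectedly closed the monitor" message suggests:)

➜  virsh dumpxml myvm | grep -A3 numatune
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>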

Happy to provide more information!
Comment 1 Wanpeng Li 2020-09-09 06:00:37 UTC
It would be appreciated if you could bisect.
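
(For reference, a minimal bisect sketch, assuming a mainline git tree and that the failure reproduces on vanilla kernels rather than only the Arch builds:)

git bisect start
git bisect bad v5.8          # first known-bad release
git bisect good v5.7         # last known-good release
# build and boot the commit git checks out, try to start the VM, then mark it:
git bisect good              # or: git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset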
Comment 2 Sean Christopherson 2020-09-09 06:41:20 UTC
Are you disabling NPT (via KVM module param)?  You're obviously running a 64-bit kernel, and presumably that CPU supports NPT, so the only way KVM should reach the failing allocation is if NPT is being explicitly disabled.  There's nothing wrong with using shadow paging, it's just uncommon these days.

NPT aside, the interesting part of the failing allocation is that it uses GFP_DMA32.  I did a quick test to force that allocation on my system and nothing exploded.  Odds are good the bug is outside of KVM, which means a bisection is probably necessary.
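
(To check, the current value of the parameter is visible in sysfs once kvm_amd is loaded; depending on kernel version it prints 1/0 or Y/N:)

cat /sys/module/kvm_amd/parameters/npt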
Comment 3 Martin Schrodt 2020-09-10 15:33:27 UTC
Damn. 

I made some changes to the VM in the last few days to add AVIC support, which required changing the kvm module parameters, and I don't remember what they were before. They are now:

> options kvm ignore_msrs=1 report_ignored_msrs=0
> options kvm_amd nested=0 avic=1 npt=1

Given Sean's post mentioning that NPT has to be disabled for the bug to occur, I updated the kernel again (to 5.8.7), and voilà, the VM works.

So I have to concur that NPT really was disabled before, but I can't remember why I did that; maybe because of some bug that only existed when I set up the VM back in 2018.

Regarding GFP_DMA32, I don't know what it really means. It might be related to me passing through a GPU, an NVMe drive, and a USB controller to the VM.

So I guess I'll leave learning how to bisect until my next incident...

Thank you guys for all the work you do - Linux forever!
Comment 4 Sean Christopherson 2020-09-10 16:21:19 UTC
GFP_DMA32 is a flag that forces a memory allocation to use physical memory that is 32-bit addressable, i.e. below the 4g boundary.  Using GFP_DMA32 is relatively uncommon, e.g. KVM uses that flag if and only if KVM is using or shadowing 32-bit PAE paging.  The latter case (shadowing) is what is triggered if NPT is disabled.

Can you try running with "kvm_amd nested=0 avic=1 npt=0" and/or "kvm_amd nested=0 npt=0" on v5.8.7?  I'd like to at least confirm that whatever was breaking your setup was fixed between v5.8.0 and v5.8.7, even if we don't bisect to identify exactly which patch fixed the bug.
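
(A sketch for each combination; this assumes no VMs are running while the module is reloaded:)

modprobe -r kvm_amd
modprobe kvm_amd nested=0 avic=1 npt=0
cat /sys/module/kvm_amd/parameters/npt   # confirm the parameter took effect
# start the VM, then repeat the reload with: nested=0 npt=0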
Comment 5 Martin Schrodt 2020-09-10 21:03:42 UTC
Strange things happen sometimes...

What I did (note that I only unloaded/reloaded the module after config changes, hoping this would suffice):

- running with "kvm_amd nested=0 avic=1 npt=0" and "kvm_amd nested=0 npt=0" on 5.8.7, all working fine.

- rolling back to the 5.8.5 kernel I had the bug with, and trying the above combinations -> working fine

- rolling the VM back to a state before changing it to AVIC (reasonably sure it's the same) -> working fine, on both 5.8.7 and 5.8.5.

Heisenbugs, here they come.

Trying to come up with things I changed since then but have not rolled back yet:

I have a qemu hook, which did the following:

1) drop caches
2) compact memory
3) create a cpuset for the host and move all tasks there, to free the cores assigned to the VM (including a flag for memory migration, so those processes would have their memory moved to the non-VM node)
4) then let qemu allocate memory

Since then, I changed the hook to run the compaction step after the move step (my thought was that *after* moving memory from node 1 to node 0 there is more free space on node 1, so compaction should yield better results). The current order is sketched below.
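
(Roughly, this is what the hook now does; a sketch from memory, assuming cgroup-v1 cpusets mounted at /sys/fs/cgroup/cpuset, with an illustrative cpuset name:)

sync
echo 3 > /proc/sys/vm/drop_caches               # 1) drop caches
# 2) create a host cpuset pinned to node 0 and its cores, and migrate
#    all movable tasks (and their memory) into it, freeing node 1 for the VM
mkdir -p /sys/fs/cgroup/cpuset/host
echo 0-5,12-17 > /sys/fs/cgroup/cpuset/host/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/host/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/host/cpuset.memory_migrate
for pid in $(cat /sys/fs/cgroup/cpuset/tasks); do
    echo $pid > /sys/fs/cgroup/cpuset/host/tasks 2>/dev/null  # kernel threads refuse to move
done
echo 1 > /proc/sys/vm/compact_memory            # 3) compact, now after the move
# 4) libvirt/qemu starts next and allocates its memory from node 1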

Does the error I initially got say anything about *why* the allocation failed?
Comment 6 Sean Christopherson 2020-09-11 16:19:09 UTC
Nope, the failure path is common, so we can't even glean anything from the offsets in the stack trace.

In your data dump, both nodes show 10+ GB of free memory, so there's plenty of space for the measly 4 KB that KVM is trying to allocate.  My best guess is that the combination of nodemask/cpuset settings resulted in a set of constraints that were impossible to satisfy.
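
(If you do want to dig, /proc/buddyinfo breaks free pages out per node and per zone. On a typical layout only node 0 has a DMA32 zone, since all of node 1's memory sits above the 4g boundary; in that case a cpuset that restricts the allocation to node 1 can never satisfy GFP_DMA32. Illustrative output, not from your machine:)

cat /proc/buddyinfo
Node 0, zone    DMA32   <per-order free block counts>
Node 1, zone   Normal   <per-order free block counts>   (no DMA32 zone on node 1)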

At this point, I'd say just chalk it up to a bad configuration unless you want to pursue this further.  If there's a kernel bug lurking, then odds are someone will run into it again.
Comment 7 Martin Schrodt 2020-09-20 09:17:40 UTC
Fully agree. Thanks for your assistance!
