Bug 209733

Summary: Starting new KVM virtual machines on PPC64 starts to hang after box is up for a while
Product: Virtualization
Reporter: Cameron Berkenpas (cam)
Component: kvm
Assignee: platform_ppc-64
Status: CLOSED CODE_FIX
Severity: high
CC: gustavo.romero, michael
Priority: P1
Hardware: PPC-64
OS: Linux
Kernel Version: >=5.8
Subsystem:
Regression: Yes
Bisected commit-id:

Description Cameron Berkenpas 2020-10-18 23:09:57 UTC
Issue occurs with 5.8.14, 5.8.16, and 5.9.1.  Does NOT occur with 5.7.x. I suspect it occurs with all of 5.8, but I haven't confirmed this yet.

After the box has been up for a "while", starting new VMs fails. Completely shutting down existing VMs and then starting them back up also fails in the same way.

What is a while? Could be 2 days, might be 9. I'll update as the pattern becomes more clear.

libvirt is generally used, but when running kvm manually with strace, kvm always gets stuck here:
ioctl(11, KVM_PPC_ALLOCATE_HTAB, 0x7fffea0bade4

Maybe the kernel is trying to find the memory needed to allocate the Hashed Page Table but is unable to do so? Maybe there's a memory leak?
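
One rough way to check the fragmentation theory would be to look at how many large free blocks the page allocator still has. The sketch below parses the text of /proc/buddyinfo and sums the free blocks at or above a given order. The function names and the choice of order are illustrative assumptions, not anything taken from the kernel or from this report.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_blocks_at_or_above(zones, min_order):
    """Sum the free blocks of order >= min_order for each (node, zone)."""
    # min_order is a hypothetical threshold; pick it to match the
    # allocation size being investigated.
    return {key: sum(counts[min_order:]) for key, counts in zones.items()}
```

On the affected host this could be fed with the contents of /proc/buddyinfo, once shortly after boot and again once the hang appears, to see whether the high-order counts have collapsed.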

Before this issue starts occurring, I have confirmed I am able to run the exact same kvm command manually:
sudo -u libvirt-qemu qemu-system-ppc64 -enable-kvm -m 8192 -nographic -vga none -drive file=/var/lib/libvirt/images/test.qcow2,format=qcow2 -mem-prealloc -smp 4

Nothing in dmesg, nothing useful in the logs.

This box's configuration:
Debian 10 stable
2x 18 core POWER9 (144 threads)
512g physical memory
Raptor Talos II motherboard
radix MMU disabled

Unfortunately, I cannot test the affected box with the Radix MMU enabled because I have some important VMs that won't run unless it is disabled.
Comment 1 Cameron Berkenpas 2020-10-30 17:46:46 UTC
Still happens with 5.9.2.
Comment 2 Cameron Berkenpas 2020-11-07 16:36:26 UTC
Verified this happens with 5.9.6 and with the Debian vendor kernel linux-image-5.9.0-1-powerpc64le.

Might also be worth mentioning this is occurring with qemu-system-ppc package version 1:3.1+dfsg-8+deb10u8.
Comment 3 Cameron Berkenpas 2020-11-08 16:33:12 UTC
Same issue now that I'm running with qemu-system-ppc version 1:5.0-14~bpo10+1 from Debian backports.
Comment 4 Cameron Berkenpas 2020-11-26 17:26:22 UTC
After enough testing, I feel confident that this issue was fixed in 5.9.9. However, I encountered issues with XFS on 5.9.9 and 5.9.10 (mainly on POWER, though they also seemed to occur to a lesser extent on amd64). 5.9.11 has the weird hang fixed and has shown no other issues (XFS or otherwise) in over 2 days!

I feel confident in closing this issue.
Comment 5 Michael Ellerman 2020-11-26 23:16:48 UTC
Thanks for persisting with the testing.

I wonder if it was fixed by:

c4629e4e7e09 ("mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate")
or
38935861d85a ("mm/compaction: count pages and stop correctly during page isolation")

They fix a potential infinite loop in a path that's used by the HTAB allocation.

Those landed in v5.9.9, and fix a commit that was introduced in v5.7 (which doesn't match your observation that v5.7.x was OK).
Comment 6 Michael Ellerman 2020-11-27 02:26:26 UTC
Nick pointed out that it was actually:
  2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit")
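
For anyone who suspects they are hitting the same problem on an unfixed kernel, a minimal sketch of a check follows. It assumes the corrupted counter is visible as nr_isolated_file in /proc/vmstat and that a corrupted value would exceed the total number of pages in the machine. Both the helper names and that threshold are assumptions made for illustration.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def suspicious_isolated(counters, total_pages):
    """Return the isolated-page counters that exceed total_pages.

    Assumption: no more pages can be isolated than exist in the system,
    so a larger value suggests the counter has been corrupted.
    """
    names = ("nr_isolated_anon", "nr_isolated_file")
    return [name for name in names if counters.get(name, 0) > total_pages]
```

Here total_pages would be the machine's memory size divided by the page size. A non-empty result on a box where KVM_PPC_ALLOCATE_HTAB hangs would be consistent with this commit being the fix, though it would not prove it.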