Issue occurs with 5.8.14, 5.8.16, and 5.9.1. Does NOT occur with 5.7.x. I suspect it occurs with all of 5.8, but I haven't confirmed this yet.

After the box has been up for a "while", starting new VMs fails. Completely shutting down existing VMs and then starting them back up also fails in the same way. What is a while? Could be 2 days, might be 9. I'll update as the pattern becomes clearer.

libvirt is generally used, but when running kvm manually with strace, kvm always gets stuck here:

    ioctl(11, KVM_PPC_ALLOCATE_HTAB, 0x7fffea0bade4

Maybe the kernel is trying to find the memory needed to allocate the Hashed Page Table but is unable to do so? Maybe there's a memory leak?

Before this issue starts occurring, I have confirmed I am able to run the exact same kvm command manually:

    sudo -u libvirt-qemu qemu-system-ppc64 -enable-kvm -m 8192 -nographic -vga none -drive file=/var/lib/libvirt/images/test.qcow2,format=qcow2 -mem-prealloc -smp 4

Nothing in dmesg, nothing useful in the logs.

This box's configuration:

    Debian 10 stable
    2x 18 core POWER9 (144 threads)
    512g physical memory
    Raptor Talos II motherboard
    radix MMU disabled

Unfortunately, I cannot test the affected box with the Radix MMU enabled because I have some important VMs that won't run unless it is disabled.
Still happens with 5.9.2.
Verified this happens with 5.9.6 and with the Debian vendor kernel linux-image-5.9.0-1-powerpc64le. Might also be worth mentioning this is occurring with qemu-system-ppc package version 1:3.1+dfsg-8+deb10u8.
Same issue now that I'm running with qemu-system-ppc version 1:5.0-14~bpo10+1 from Debian backports.
After enough testing, I feel confident that this issue was fixed in 5.9.9. However, I encountered issues with XFS on 5.9.9 and 5.9.10 (mainly on POWER, but to a lesser extent they also seemed to happen for me on amd64). 5.9.11 has the weird hang fixed and has shown no other issues (XFS or otherwise) in over 2 days! I feel confident in closing this issue.
Thanks for persisting with the testing. I wonder if it was fixed by:

    c4629e4e7e09 ("mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate")

or

    38935861d85a ("mm/compaction: count pages and stop correctly during page isolation")

They fix a potential infinite loop in a path that's used by the HTAB allocation. Those landed in v5.9.9, and fix a commit that was introduced in v5.7 (which doesn't match your observation that v5.7.x was OK).
Nick pointed out that it was actually: 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit")
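For anyone who wants to check whether a box still on an affected kernel is drifting toward this state, here is a rough sketch that reads the isolated-page counters from /proc/vmstat and flags any that exceed the number of pages the machine could plausibly have. The function names and the total_pages threshold are my own assumptions for illustration, not anything taken from the fix, and I have not verified that a bad counter always shows up this way.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip anything that is not a simple 'name value' pair
        try:
            counters[parts[0]] = int(parts[1])
        except ValueError:
            continue  # skip non-numeric values
    return counters


def isolated_counters(text):
    """Return only the nr_isolated_* counters that are present."""
    stats = parse_vmstat(text)
    wanted = ("nr_isolated_anon", "nr_isolated_file")
    return {name: stats[name] for name in wanted if name in stats}


def suspicious(text, total_pages):
    """List isolated counters larger than total_pages.

    total_pages is a hypothetical sanity bound supplied by the caller:
    more isolated pages than pages of RAM cannot be a real count.
    """
    return [name for name, value in isolated_counters(text).items()
            if value > total_pages]


if __name__ == "__main__":
    import os

    # Total physical pages as reported by the C library (Linux).
    pages = os.sysconf("SC_PHYS_PAGES")
    with open("/proc/vmstat") as f:
        data = f.read()
    print(isolated_counters(data))
    print("suspicious:", suspicious(data, pages))
```

Running it periodically (for example from cron) and logging the output would show whether nr_isolated_file grows over the days leading up to a hang.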