Hello,

I've encountered repeated crashes or freezes when a KVM VM receives large amounts of data over the network while the system is under memory load and performing I/O operations. The crashes sometimes occur in the filesystem code (ext4 and btrfs, at least), but they also happen in other locations.

This issue occurs on my custom builds of kernel versions v6.10 to v6.11-rc2, with virtio network and disk drivers, and either Ubuntu 22.04 or Debian 12 user space. The same kernel build did not crash on an Azure VM, which does not use the virtio network driver. Since the issue only appears when receiving data, I suspect a problem in the virtio interface or in receive buffer handling. The issue did not occur on the Debian backport kernel 6.9.7-1~bpo12+1 amd64.

Steps to Reproduce:

1. Set up a small VM on a KVM host.
   I tested this on an x86_64 KVM VM with 1 CPU, 512 MB RAM, 2 GB SWAP (the smallest configuration from Vultr), using a Debian 12 user space, virtio disk, and virtio net.

2. Induce high memory and I/O load.
   Run the following command:
     stress --vm 2 --hdd 1
   (Adjust --vm to occupy all the RAM.)
   This slows down the system but does not cause a crash.

3. Send large data to the VM.
   I used `iperf3 -s` on the VM and sent data using `iperf3 -c` from another host. The system crashes within a few seconds to a few minutes. (The reverse direction, `iperf3 -c -R`, did not cause a crash.)
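For step 2, one rough way to pick the --vm count: stress allocates 256 MB per --vm worker unless --vm-bytes is given (its documented default), so the number of workers needed to cover RAM is roughly MemTotal divided by 256 MB, rounded up. The sketch below works through that arithmetic. The function names are hypothetical helpers, not part of stress, and the estimate ignores swap and kernel overhead.

```python
import math

# Assumption: stress's documented default for --vm-bytes (256 MB per worker).
STRESS_VM_BYTES_DEFAULT = 256 * 1024 * 1024


def parse_mem_total_kib(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def vm_workers_for(mem_total_kib: int,
                   vm_bytes: int = STRESS_VM_BYTES_DEFAULT) -> int:
    """Smallest --vm count whose combined allocation covers total RAM."""
    return max(1, math.ceil(mem_total_kib * 1024 / vm_bytes))


# Usage on the guest (not executed here):
#   with open("/proc/meminfo") as f:
#       n = vm_workers_for(parse_mem_total_kib(f.read()))
#   print(f"stress --vm {n} --hdd 1")
```

For a 512 MB VM this gives 2, which matches the --vm 2 used in step 2.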
The Oops messages are mostly general protection faults, but sometimes I see "Bad pagetable" or other errors, such as:

  Oops: general protection fault, probably for non-canonical address 0x2f9b7fa5e2bde696: 0000 [#1] PREEMPT SMP PTI
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  Oops: Bad pagetable: 000d [#1] PREEMPT SMP PTI

In some cases, dmesg contains something like:

  UBSAN: shift-out-of-bounds in lib/xarray.c:158:34

When the system froze without crashing, I sometimes found BUG messages such as:

  get_swap_device: Bad swap file entry 3403b0f5b2584992
  BUG: Bad page map in process stress pte:c42f93fac0299e1d pmd:0d9b2047
  BUG: Bad rss-counter-state mm:000000004df3dd9a type:MM_ANONPAGES val:2
  BUG: Bad rss-counter-state mm:000000004df3dd9a type:MM_SWAPENTS val:-1

Thanks.
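A note on the "non-canonical address" in the first Oops line: on x86-64, a virtual address is canonical only if all bits above the implemented address width are copies of the top implemented bit. The sketch below checks the reported address under the assumption of 48-bit virtual addresses (4-level paging), with 57 bits (5-level paging) as an alternative. The function name is hypothetical. An address that fails this check cannot be a valid pointer, which fits a pointer having been overwritten with unrelated data.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumption: va_bits is 48 (4-level paging) or 57 (5-level paging).
    Bits 63 down to va_bits-1 must be all zeros or all ones.
    """
    top = addr >> (va_bits - 1)                 # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1    # same width, every bit set
    return top == 0 or top == all_ones


reported = 0x2F9B7FA5E2BDE696  # address from the Oops line above
print(hex(reported), "canonical:", is_canonical(reported))
```

The reported address fails the check for both widths, since its upper bits (0x2f9b...) are neither all zeros nor all ones.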
Created attachment 306714
dmesg from a crash
Created attachment 306715
custom build config
Wonder if it's the same problem discussed here:

https://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540c0a@oracle.com/
https://lore.kernel.org/all/20240814065914.bpnFIoTXhqGpEiCvOuj0e9Kmx0tngb1NFUPxs378JDU@z/raw
https://lore.kernel.org/all/7774ac707743ad8ce3afeacbd4bee63ac96dd927.1723617902.git.mst@redhat.com/

I brought that up there as well.
I bisected the issue to commit f9dac92ba908 (virtio_ring: enable premapped mode regardless of use_dma_api). It appears to be the same issue as discussed in the first thread:
https://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540c0a@oracle.com/

f9dac92ba9081062a6477ee015bd3b8c5914efc4: BAD
6e62702feb6d474e969b52f0379de93e9729e457: OK

However, reverting f9dac92ba908 on v6.11-rc2 did not boot up.
(In reply to Takero Funaki from comment #4)
> I bisected the issue

Great!

> However, reverting f9dac92ba908 on v6.11-rc2 did not boot up.

Then you likely need to apply two more reverts, e.g. everything from this thread:
https://lore.kernel.org/all/7774ac707743ad8ce3afeacbd4bee63ac96dd927.1723617902.git.mst@redhat.com/

Can I CC you in a reply to https://lore.kernel.org/all/m2r0aqrsq6.fsf@oracle.com/ once you have tried that and posted the results here? This would expose your name and email address to the public.
Thanks for the suggestion. I couldn’t initially determine which commits needed to be reverted.

I applied all three reverts from the thread on v6.11-rc3 (with some modifications to resolve conflicts) and confirmed that the issue no longer reproduces. So far, I’ve only tested this on one VM, but I plan to check it in other environments as well. I will attach the modified patches I applied on tag:v6.11-rc3.

In my case, receiving data with the Debian default setting net.core.high_order_alloc_disable=0, combined with memory and I/O load, triggered the crash. I hope this information is helpful for further investigation.

For the mailing list, please feel free to CC me:
Cc: Takero Funaki <flintglass@gmail.com>

Thanks.
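For anyone comparing environments, the sysctl mentioned above can be read from procfs. As far as I understand the kernel's sysctl documentation, a value of 0 lets the networking page-fragment allocator try high-order pages, while 1 restricts it to single pages. The sketch below maps a dotted sysctl name to its /proc/sys path and reads it. The helper names are hypothetical, and the simple dot-to-slash mapping assumes no component of the name contains a dot.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the sysctl value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


value = read_sysctl("net.core.high_order_alloc_disable")
print("net.core.high_order_alloc_disable =", value)
```

On a system without this knob, or without procfs, the helper returns None rather than raising.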
Created attachment 306744
patch for v6.11-rc3
Created attachment 306745
patch for v6.11-rc3
Created attachment 306746
patch for v6.11-rc3
Created attachment 306747
patch1 for v6.11-rc3
Thanks to everyone involved. Xuan's revert commits have been merged and released in the v6.11 kernel. Although Xuan's proposed fix to reimplement the disabled feature is still in progress, I am closing this issue as the visible problem has been resolved.