Bug 209253

Summary: Loss of connectivity on guest after important host <-> guest traffic
Product: Virtualization Reporter: Clément Aubin (aubincleme)
Component: kvmAssignee: virtualization_kvm
Status: RESOLVED CODE_FIX    
Severity: normal CC: alex.williamson, arequipeno, justin.gatzen, kdev, maironire
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.8.0-1-amd64 Subsystem:
Regression: No Bisected commit-id:
Attachments: Packet capture
Guest XML
Test fix for ioeventfd_write traces

Description Clément Aubin 2020-09-13 18:39:04 UTC
Created attachment 292501 [details]
Packet capture

I have an hypervisor running one guest VM. This guest VM has one virtual network card, configured to use a NATed network with the host.

Upon guest startup, the guest can ping both the host and a server on the internet. However, when starting heavy traffic between the guest and the host, the host kernel issues the following trace, and the VM looses its network connectivity unpon its next restart. I'm also adding to this issue a pcap file showing the traffic on the host virtual nic, which slowly degrades, starting form frames 21 to 23.

Information about the host :
Linux 5.8.0-1-amd64 #1 SMP Debian 5.8.7-1 (2020-09-05) x86_64 GNU/Linux
QEMU emulator version 5.1.0 (Debian 1:5.1+dfsg-4)

Information about the guest VM :
Linux 4.19.0-10-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64 GNU/Linux

Network configuration for the guest :
<interface type="network">
  <mac address="52:54:00:f2:29:56"/>
  <source network="bridge-vm"/>
  <model type="virtio"/>
  <link state="up"/>
  <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
</interface>
<network>
  <name>bridge-vm</name>
  <uuid>159f7f26-391c-44f7-8e6e-dc1b213837a6</uuid>
  <forward mode="nat">
    <nat>
      <port start="1024" end="65535"/>
    </nat>
  </forward>
  <bridge name="virbr0" stp="on" delay="0"/>
  <mac address="52:54:00:45:68:56"/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254"/>
    </dhcp>
  </ip>
</network>

Kernel trace on the host :
[ 1492.533631] ------------[ cut here ]------------
[ 1492.533637] WARNING: CPU: 2 PID: 3835 at fs/eventfd.c:74 eventfd_signal+0x88/0xa0
[ 1492.533638] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth jitterentropy_rng drbg ansi_cprng ecdh_generic ecc cfg80211 macvtap macvlan vhost_net vhost tap vhost_iotlb nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_tables nfnetlink bridge stp llc tun overlay binfmt_misc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp joydev kvm_intel snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg rapl snd_hda_codec eeepc_wmi intel_cstate snd_hda_core intel_uncore snd_hwdep asus_wmi snd_pcm pcspkr battery snd_timer iTCO_wdt sparse_keymap intel_pmc_bxt rfkill xpad snd serio_raw iTCO_vendor_support ff_memless mei_me intel_wmi_thunderbolt wmi_bmof soundcore sg watchdog mei acpi_pad evdev parport_pc ppdev lp
[ 1492.533657]  parport nfsd vfio_pci vfio_virqfd vfio_iommu_type1 vfio auth_rpcgss irqbypass nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_crypt dm_mod sd_mod t10_pi crc_t10dif crct10dif_generic hid_generic usbhid hid i915 crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel i2c_algo_bit drm_kms_helper ahci libahci e1000e aesni_intel mxm_wmi psmouse libaes crypto_simd cryptd glue_helper libata cec xhci_pci ptp xhci_hcd pps_core drm i2c_i801 i2c_smbus usbcore scsi_mod usb_common wmi video button
[ 1492.533674] CPU: 2 PID: 3835 Comm: vhost-3826 Not tainted 5.8.0-1-amd64 #1 Debian 5.8.7-1
[ 1492.533674] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 3802 03/15/2018
[ 1492.533676] RIP: 0010:eventfd_signal+0x88/0xa0
[ 1492.533677] Code: 03 00 00 00 4c 89 f7 e8 e6 a8 dc ff 65 ff 0d 0f 20 92 6a 4c 89 ee 4c 89 f7 e8 34 6a 52 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 0f 1f 80 00 00
[ 1492.533678] RSP: 0018:ffffb27ac06afd40 EFLAGS: 00010202
[ 1492.533679] RAX: 0000000000000001 RBX: ffff94c12c87f000 RCX: 0000000000000000
[ 1492.533680] RDX: 000000000000ad5c RSI: 0000000000000001 RDI: ffff94c0dd9e81c0
[ 1492.533680] RBP: 0000000000000000 R08: 0000006e000000a8 R09: ffff94c12c87f348
[ 1492.533681] R10: 0000000000000000 R11: 0000000000000020 R12: ffff94c10d3f01d0
[ 1492.533681] R13: ffff94c10d3f0000 R14: ffff94c10d3f00c8 R15: ffffb27ac06afe18
[ 1492.533682] FS:  0000000000000000(0000) GS:ffff94c14ed00000(0000) knlGS:0000000000000000
[ 1492.533683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1492.533683] CR2: 0000000000000000 CR3: 0000000bc2300003 CR4: 00000000003626e0
[ 1492.533684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1492.533685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1492.533685] Call Trace:
[ 1492.533691]  handle_rx+0xbe/0x9f0 [vhost_net]
[ 1492.533694]  vhost_worker+0x88/0xd0 [vhost]
[ 1492.533696]  ? vhost_exceeds_weight+0x50/0x50 [vhost]
[ 1492.533698]  kthread+0x119/0x140
[ 1492.533699]  ? __kthread_bind_mask+0x60/0x60
[ 1492.533701]  ret_from_fork+0x22/0x30
[ 1492.533703] ---[ end trace a62bb924e0497bb1 ]---

Thanks in advance for your time!
Comment 1 Ian Pilcher 2020-10-12 00:07:42 UTC
I am seeing a very similar crash, but the device in my case is an NVIDIA GPU, passed through to a Windows guest for video processing.

[ 6094.567434] WARNING: CPU: 7 PID: 2524 at fs/eventfd.c:74 eventfd_signal+0x88/0xa0
[ 6094.567464] Modules linked in: vhost_net vhost tap vhost_iotlb tun nft_chain_nat 8021q garp mrp stp llc sch_ingress bonding openvswitch nsh nf_conncount nf_nat nft_counter ipt_REJECT ip6t_REJECT nf_reject_ipv4 nf_reject_ipv6 xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal sunrpc intel_powerclamp coretemp raid10 kvm_intel iTCO_wdt intel_pmc_bxt kvm iTCO_vendor_support gpio_ich ipmi_ssif rapl ixgbe intel_cstate joydev i2c_i801 igb intel_uncore ioatdma mei_me acpi_ipmi i2c_smbus mdio intel_pch_thermal mei dca lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_pad vfat fat ip_tables xfs ast drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper cec ttm mxm_wmi drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel wmi vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
[ 6094.567794] CPU: 7 PID: 2524 Comm: CPU 3/KVM Not tainted 5.8.13-200.fc32.x86_64 #1
[ 6094.567834] Hardware name: Supermicro SYS-5028D-TN4T/X10SDV-TLN4F, BIOS 2.1 11/22/2019
[ 6094.567868] RIP: 0010:eventfd_signal+0x88/0xa0
[ 6094.567889] Code: 03 00 00 00 4c 89 f7 e8 26 16 db ff 65 ff 0d 3f f3 ca 43 4c 89 ee 4c 89 f7 e8 34 8e 7f 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 0f 1f 80 00 00
[ 6094.567974] RSP: 0018:ffffac8780b97bb0 EFLAGS: 00010286
[ 6094.567996] RAX: 00000000ffffffff RBX: ffff9b01c8ed0000 RCX: 0000000000000004
[ 6094.568021] RDX: 00000000c8088704 RSI: 0000000000000001 RDI: ffff9b11ef445240
[ 6094.568050] RBP: ffffac8780b97c18 R08: ffff9b11ef4cdf40 R09: 00000000c8088708
[ 6094.568080] R10: 0000000000000000 R11: 0000000000000190 R12: 0000000000000002
[ 6094.568105] R13: ffff9b11ef4bbb00 R14: ffff9b11ef4cdf40 R15: ffff9b11ef4bbb00
[ 6094.568145] FS:  00007f30b8b78700(0000) GS:ffff9b123fbc0000(0000) knlGS:000000ad76006000
[ 6094.568178] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6094.568202] CR2: 000001d9b5100000 CR3: 0000001fb0ed2003 CR4: 00000000003626e0
[ 6094.568232] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6094.568261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6094.568290] Call Trace:
[ 6094.568336]  ioeventfd_write+0x51/0x80 [kvm]
[ 6094.568385]  __kvm_io_bus_write+0x88/0xb0 [kvm]
[ 6094.568417]  kvm_io_bus_write+0x43/0x60 [kvm]
[ 6094.568454]  write_mmio+0x70/0xf0 [kvm]
[ 6094.568488]  emulator_read_write_onepage+0x11e/0x330 [kvm]
[ 6094.568527]  emulator_read_write+0xca/0x180 [kvm]
[ 6094.568564]  segmented_write.isra.0+0x4a/0x60 [kvm]
[ 6094.568601]  x86_emulate_insn+0x850/0xe60 [kvm]
[ 6094.568636]  x86_emulate_instruction+0x2c7/0x780 [kvm]
[ 6094.568680]  ? kvm_io_bus_write+0x43/0x60 [kvm]
[ 6094.569821]  kvm_arch_vcpu_ioctl_run+0xeb9/0x1770 [kvm]
[ 6094.570963]  kvm_vcpu_ioctl+0x209/0x590 [kvm]
[ 6094.572099]  ksys_ioctl+0x82/0xc0
[ 6094.573208]  __x64_sys_ioctl+0x16/0x20
[ 6094.574294]  do_syscall_64+0x4d/0x90
[ 6094.575348]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6094.576388] RIP: 0033:0x7f30c2bac3bb
[ 6094.577422] Code: 0f 1e fa 48 8b 05 dd aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad aa 0c 00 f7 d8 64 89 01 48
[ 6094.579603] RSP: 002b:00007f30b8b77668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 6094.580713] RAX: ffffffffffffffda RBX: 000056277c8e4ae0 RCX: 00007f30c2bac3bb
[ 6094.581835] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001e
[ 6094.582951] RBP: 00007f30c010a000 R08: 000056277aebebf0 R09: 0000000000000000
[ 6094.584028] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
[ 6094.585078] R13: 00007f30c010b001 R14: 0000000000000000 R15: 000056277b358a00
[ 6094.586103] ---[ end trace dab8395baf5baf8c ]---
Comment 2 Martin 2020-10-17 15:59:34 UTC
I am having problems possibly related to Clement's, and likely related to Ian's. 
I am running KVM on a dual Nvidia GPU machine, passing one GPU through to the KVM guest (Ubuntu 20.04.1).
The setup ran stable for quite a while on Fedora 31 (5.7.15-100). After upgrading to Fedora 32 (5.8.13), the system tends to run well for a few hours after rebooting, and then produces oopses (below). After the oops, VirtIO drives, VirtIO network cards, and PCIe passthrough tend to hang indefinitely within minutes of rebooting the guest, usually making the guest inoperable (unless only non-VirtIO devices and no GPU passthrough are used). Rebooting the host makes things work again for a few hours until the next oops happens. I am on 5.8.14 now with the same problem.

Oops 1 (I saw this twice, once on 5.8.13-200 and once on 5.8.14-200):

WARNING: CPU: 28 PID: 17651 at fs/eventfd.c:74 eventfd_signal+0x88/0xa0
Modules linked in: vhost_net vhost tap vhost_iotlb v4l2loopback(OE) xt_nat xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc evdi(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc ucsi_ccg typec_ucsi nvidia_drm(POE) typec nvidia_modeset(POE) snd_hda_codec_realtek nvidia_uvm(OE) snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb edac_mce_amd btrtl btbcm snd_hda_intel uvcvideo iwlmvm snd_intel_dspcfg kvm_amd btintel snd_usb_audio snd_hda_codec videobuf2_vmalloc videobuf2_memops snd_usbmidi_lib mac80211
 nvidia(POE) kvm bluetooth snd_hda_core videobuf2_v4l2 snd_rawmidi snd_hwdep videobuf2_common libarc4 snd_seq iwlwifi videodev joydev rapl snd_seq_device ecdh_generic wmi_bmof pcspkr cfg80211 mc ecc snd_pcm drm_kms_helper snd_timer sp5100_tco k10temp snd i2c_piix4 rfkill soundcore cec i2c_nvidia_gpu gpio_amdpt gpio_generic acpi_cpufreq drm ip_tables dm_crypt hid_lenovo mxm_wmi crct10dif_pclmul crc32_pclmul crc32c_intel nvme ghash_clmulni_intel nvme_core igb wacom ccp uas dca usb_storage i2c_algo_bit wmi pinctrl_amd vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio fuse
CPU: 28 PID: 17651 Comm: CPU 5/KVM Tainted: P           OE     5.8.13-200.fc32.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. X399 DESIGNARE EX/X399 DESIGNARE EX-CF, BIOS F12 12/11/2019
RIP: 0010:eventfd_signal+0x88/0xa0
Code: 03 00 00 00 4c 89 f7 e8 26 16 db ff 65 ff 0d 3f f3 ca 4b 4c 89 ee 4c 89 f7 e8 34 8e 7f 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 0f 1f 80 00 00
RSP: 0018:ffffab10c8db7bb0 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff9a71e16b8000 RCX: 0000000000000004
RDX: 00000000c8088704 RSI: 0000000000000001 RDI: ffff9a8335656580
RBP: ffffab10c8db7c18 R08: ffff9a72f7d120a0 R09: 00000000c8088708
R10: 0000000000000000 R11: 0000000000000014 R12: 0000000000000001
R13: ffff9a72a3153448 R14: ffff9a72f7d120a0 R15: ffff9a72a3153448
FS:  0000000000000000(0000) GS:ffff9a7e7f280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f64e403f024 CR3: 000000041b5f4000 CR4: 00000000003406e0
Call Trace:
 ioeventfd_write+0x51/0x80 [kvm]
 __kvm_io_bus_write+0x88/0xb0 [kvm]
 kvm_io_bus_write+0x43/0x60 [kvm]
 write_mmio+0x70/0xf0 [kvm]
 emulator_read_write_onepage+0x11e/0x330 [kvm]
 emulator_read_write+0xca/0x180 [kvm]
 segmented_write.isra.0+0x4a/0x60 [kvm]
 x86_emulate_insn+0x850/0xe60 [kvm]
 x86_emulate_instruction+0x2c7/0x780 [kvm]
 ? kvm_set_cr8+0x1e/0x40 [kvm]
 kvm_arch_vcpu_ioctl_run+0xeb9/0x1770 [kvm]
 ? x86_pmu_enable+0x106/0x2f0
 ? __switch_to_xtra+0x495/0x500
 kvm_vcpu_ioctl+0x209/0x590 [kvm]
 ksys_ioctl+0x82/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4d/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5f6a84f3bb
Code: 0f 1e fa 48 8b 05 dd aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad aa 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f5f527fb668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000055c459b6f1f0 RCX: 00007f5f6a84f3bb
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000022
RBP: 00007f5f6bcca000 R08: 000055c45750abf0 R09: 000000003b9aca00
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007f5f6bccb004 R14: 0000000000000000 R15: 000055c4579a4a00

Oops 2 (saw this once on 5.8.14-200):

WARNING: CPU: 24 PID: 0 at fs/eventfd.c:74 eventfd_signal+0x88/0xa0
Modules linked in: v4l2loopback(OE) nfnetlink_queue nfnetlink_log vhost_net vhost tap vhost_iotlb xt_nat xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc nvidia_drm(POE) nvidia_modeset(POE) iwlmvm nvidia_uvm(OE) snd_hda_codec_realtek ucsi_ccg typec_ucsi mac80211 typec edac_mce_amd snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi uvcvideo btusb btrtl btbcm nvidia(POE) snd_hda_intel videobuf2_vmalloc libarc4 kvm_amd btintel videobuf2_memops snd_intel_dspcfg
 snd_hda_codec videobuf2_v4l2 kvm bluetooth videobuf2_common snd_usb_audio iwlwifi snd_hda_core videodev snd_usbmidi_lib snd_hwdep snd_seq snd_rawmidi joydev rapl snd_seq_device ecdh_generic mc pcspkr wmi_bmof ecc cfg80211 snd_pcm drm_kms_helper snd_timer snd sp5100_tco i2c_piix4 k10temp rfkill soundcore cec i2c_nvidia_gpu gpio_amdpt gpio_generic acpi_cpufreq drm ip_tables dm_crypt mxm_wmi crct10dif_pclmul crc32_pclmul crc32c_intel nvme ghash_clmulni_intel igb nvme_core wacom uas dca hid_lenovo ccp usb_storage i2c_algo_bit wmi pinctrl_amd vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio fuse
CPU: 24 PID: 0 Comm: swapper/24 Tainted: P           OE     5.8.14-200.fc32.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. X399 DESIGNARE EX/X399 DESIGNARE EX-CF, BIOS F12 12/11/2019
RIP: 0010:eventfd_signal+0x88/0xa0
Code: 03 00 00 00 4c 89 f7 e8 a6 14 db ff 65 ff 0d bf f1 ca 78 4c 89 ee 4c 89 f7 e8 b4 9c 7f 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 0f 1f 80 00 00
RSP: 0018:ffffb5e2c6d2cf38 EFLAGS: 00010002
RAX: 0000000000000001 RBX: ffff894a2a1f1480 RCX: 000000000000001f
RDX: ffff89423920ce00 RSI: 0000000000000001 RDI: ffff894929afc580
RBP: ffff89423920cea4 R08: ffffb5e2c6d2cff8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000a1
R13: 0000000000000000 R14: ffffb5e2c6d2cfb4 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff89423f180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ae44aa8990 CR3: 0000000797204000 CR4: 00000000003406e0
Call Trace:
 <IRQ>
 vfio_msihandler+0x12/0x20 [vfio_pci]
 __handle_irq_event_percpu+0x42/0x180
 handle_irq_event+0x47/0x8a
 handle_edge_irq+0x87/0x220
 asm_call_irq_on_stack+0x12/0x20
 </IRQ>
 common_interrupt+0xb2/0x140
 asm_common_interrupt+0x1e/0x40
RIP: 0010:cpuidle_enter_state+0xb6/0x3f0
Code: 90 a5 6b 78 e8 5b be 7b ff 49 89 c7 0f 1f 44 00 00 31 ff e8 2c d7 7b ff 80 7c 24 0f 00 0f 85 d4 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e0 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
RSP: 0018:ffffb5e2c0337e88 EFLAGS: 00000246
RAX: ffff89423f1aa2c0 RBX: ffff89423366e400 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000002abf3055 RDI: 0000000000000000
RBP: ffffffff88b78940 R08: 00000a86556fd237 R09: 0000000000000018
R10: 0000000000002358 R11: 0000000000000781 R12: 0000000000000002
R13: ffff89423366e400 R14: 0000000000000002 R15: 00000a86556fd237
 ? cpuidle_enter_state+0xa4/0x3f0
 cpuidle_enter+0x29/0x40
 do_idle+0x1d5/0x2a0
 cpu_startup_entry+0x19/0x20
 start_secondary+0x144/0x170
 secondary_startup_64+0xb6/0xc0
Comment 3 Ian Pilcher 2020-10-20 14:28:20 UTC
Thanks for the note about 5.7.15.  I've built it for F32, and my GPU *seems* stable so far (been running a video upscaling job for almost 24 hours).

(I did have an issue where the hypervisor (libvirtd?) seems to lock up if I try to start the guest wit virt-manager over SSH; had to start it with 'virsh start' locally.)

This indicates that it *may* be possible to bisect this issue, although the lack of a time-bound reproducer means that it will take a long time.

@Clément - Can you test 5.7.15?
Comment 4 Justin Gatzen 2020-10-22 16:14:21 UTC
Seeing the same issue here on 5.9.1 and 5.8.14. I did not notice any trouble on 5.8.2 for about two weeks of usage, but that is just an anecdote. I have not attempted to bisect this because the bug takes quite a while to trigger.


Stack trace from a VFIO enabled VM with SRIOV nic, NVME passthrough, GPU passthrough, Windows guest:

[83372.203651] ------------[ cut here ]------------
[83372.203659] WARNING: CPU: 16 PID: 0 at fs/eventfd.c:74 eventfd_signal+0x89/0xa0
[83372.203661] Modules linked in: vhost_net vhost tap vhost_iotlb tun vfio_pci vfio_virqfd vfio_iommu_type1 vfio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter uinput nft_meta_bridge nf_tables nfnetlink xpad hid_microsoft ff_memless mlx5_ib ib_uverbs nouveau ib_core edac_mce_amd mlx5_core iwlmvm kvm_amd mac80211 kvm snd_hda_codec_realtek irqbypass sp5100_tco snd_hda_codec_generic pcspkr rapl k10temp wmi_bmof i2c_piix4 mlxfw ledtrig_audio snd_hda_codec_hdmi joydev pci_hyperv_intf snd_hda_intel bridge snd_intel_dspcfg libarc4 snd_hda_codec btusb btrtl iwlwifi snd_hda_core btbcm mxm_wmi stp btintel llc video snd_hwdep bluetooth snd_pcm ttm snd_timer cfg80211 drm_kms_helper veth igb snd ecdh_generic ecc cec i2c_algo_bit soundcore r8169 rfkill dca acpi_cpufreq vfat fat drm essiv authenc dm_crypt xfs crct10dif_pclmul ccp crc32_pclmul crc32c_intel nvme
[83372.203687]  ghash_clmulni_intel nvme_core wmi pinctrl_amd
[83372.203702] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.9.1-dirty #49
[83372.203704] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS MASTER/X570 AORUS MASTER, BIOS F30 09/07/2020
[83372.203707] RIP: 0010:eventfd_signal+0x89/0xa0
[83372.203709] Code: 03 00 00 00 4c 89 f7 e8 b5 3c db ff 65 ff 0d 3e 26 cb 52 4c 89 ee 4c 89 f7 e8 b3 65 7e 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 66 0f 1f 44 00
[83372.203712] RSP: 0018:ffffa2cdc0554f28 EFLAGS: 00010002
[83372.203714] RAX: 0000000000000001 RBX: ffff8ceaec0d4880 RCX: 000000000000001f
[83372.203716] RDX: ffff8ceaed3b7200 RSI: 0000000000000001 RDI: ffff8ce6eaa58340
[83372.203717] RBP: ffff8ceaed3b7200 R08: 00004bd39a951b00 R09: 0000000000000000
[83372.203719] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000d9
[83372.203720] R13: 0000000000000000 R14: ffffa2cdc0554fa4 R15: 0000000000000000
[83372.203722] FS:  0000000000000000(0000) GS:ffff8cf5bee00000(0000) knlGS:0000000000000000
[83372.203724] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83372.203726] CR2: 00005562c8109c68 CR3: 00000001694fc000 CR4: 0000000000350ee0
[83372.203727] Call Trace:
[83372.203729]  <IRQ>
[83372.203734]  vfio_msihandler+0x12/0x20 [vfio_pci]
[83372.203738]  __handle_irq_event_percpu+0x42/0x180
[83372.203740]  handle_irq_event_percpu+0x21/0x60
[83372.203742]  handle_irq_event+0x36/0x53
[83372.203744]  handle_edge_irq+0x83/0x190
[83372.203747]  asm_call_irq_on_stack+0x12/0x20
[83372.203749]  </IRQ>
[83372.203751]  common_interrupt+0xb5/0x130
[83372.203753]  asm_common_interrupt+0x1e/0x40
[83372.203756] RIP: 0010:cpuidle_enter_state+0xdf/0x390
[83372.203758] Code: e8 76 49 7d ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 8f 02 00 00 31 ff e8 28 b3 83 ff fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 20 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d
[83372.203761] RSP: 0018:ffffa2cdc01c7e88 EFLAGS: 00000246
[83372.203762] RAX: ffff8cf5bee2a300 RBX: ffff8cf3b21d9400 RCX: 000000000000001f
[83372.203764] RDX: 0000000000000000 RSI: 000000002491bed3 RDI: 0000000000000000
[83372.203765] RBP: ffffffffaeb7f180 R08: 00004bd39a95195c R09: 0000000000001206
[83372.203767] R10: 0000000000004852 R11: ffff8cf5bee29124 R12: 00004bd39a95195c
[83372.203768] R13: ffffffffaeb7f268 R14: 0000000000000002 R15: ffff8cf3b21d9400
[83372.203772]  ? cpuidle_enter_state+0xba/0x390
[83372.203774]  cpuidle_enter+0x29/0x40
[83372.203776]  do_idle+0x1b8/0x240
[83372.203778]  cpu_startup_entry+0x19/0x20
[83372.203780]  start_secondary+0x103/0x130
[83372.203783]  secondary_startup_64+0xb6/0xc0
[83372.203785] ---[ end trace a1dd3ff1b3977640 ]---


And from a non-VFIO run of the mill linux guest:

[37342.189129] ------------[ cut here ]------------
[37342.189137] WARNING: CPU: 21 PID: 1137 at fs/eventfd.c:74 eventfd_signal+0x89/0xa0
[37342.189139] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio vhost_net vhost tap vhost_iotlb tun nouveau ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib nft_counter nft_meta_bridge nf_tables ib_uverbs uinput nfnetlink ib_core iwlmvm xpad hid_microsoft ff_memless mac80211 mlx5_core snd_hda_codec_realtek joydev libarc4 snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio iwlwifi edac_mce_amd snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd bridge btusb snd_hda_core btrtl kvm btbcm video btintel snd_hwdep bluetooth ttm cfg80211 snd_pcm drm_kms_helper igb stp llc snd_timer irqbypass veth rapl snd mlxfw pcspkr wmi_bmof mxm_wmi sp5100_tco k10temp i2c_piix4 i2c_algo_bit pci_hyperv_intf ecdh_generic cec r8169 dca soundcore ecc rfkill drm acpi_cpufreq vfat fat essiv authenc dm_crypt xfs nvme nvme_core crct10dif_pclmul ccp crc32_pclmul
[37342.189162]  crc32c_intel ghash_clmulni_intel wmi pinctrl_amd
[37342.189174] CPU: 21 PID: 1137 Comm: vhost-1125 Not tainted 5.9.1-dirty #50
[37342.189175] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS MASTER/X570 AORUS MASTER, BIOS F30 09/07/2020
[37342.189178] RIP: 0010:eventfd_signal+0x89/0xa0
[37342.189180] Code: 03 00 00 00 4c 89 f7 e8 b5 3c db ff 65 ff 0d 3e 26 cb 79 4c 89 ee 4c 89 f7 e8 b3 65 7e 00 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e c3 66 0f 1f 44 00
[37342.189182] RSP: 0018:ffffadbcc1233d60 EFLAGS: 00010202
[37342.189183] RAX: 0000000000000001 RBX: ffff986bb47746b8 RCX: 00000000c3160001
[37342.189185] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff986948f6c9c0
[37342.189186] RBP: ffff986bb4770000 R08: 0000000000000001 R09: 0000000000000101
[37342.189187] R10: 000000002b30008a R11: 0000000022308000 R12: 0000000000000001
[37342.189189] R13: ffffadbcc1233e18 R14: ffff98630de8f300 R15: ffff986bb47746b8
[37342.189190] FS:  0000000000000000(0000) GS:ffff986bbef40000(0000) knlGS:0000000000000000
[37342.189192] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37342.189193] CR2: 00007f77c69ee890 CR3: 0000001029214000 CR4: 0000000000350ee0
[37342.189194] Call Trace:
[37342.189199]  vhost_tx_batch.isra.0+0x7d/0xc0 [vhost_net]
[37342.189202]  handle_tx_copy+0x15c/0x550 [vhost_net]
[37342.189204]  handle_tx+0xa5/0xe0 [vhost_net]
[37342.189207]  vhost_worker+0x8d/0xd0 [vhost]
[37342.189209]  ? vhost_vring_call_reset+0x40/0x40 [vhost]
[37342.189212]  kthread+0xfe/0x140
[37342.189214]  ? kthread_park+0x90/0x90
[37342.189216]  ret_from_fork+0x22/0x30
[37342.189218] ---[ end trace ae48714189db9592 ]---
Comment 5 Ian Pilcher 2020-10-22 22:26:32 UTC
Based on my git bisect, it looks like this commit is triggering the WARNING.

commit c49fa6397b6d29ce10c0ae5b2528bb004a14691f
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Aug 17 11:08:18 2020 -0600

    vfio-pci: Avoid recursive read-lock usage
    
    [ Upstream commit bc93b9ae0151ae5ad5b8504cdc598428ea99570b ]
    
    A down_read on memory_lock is held when performing read/write accesses
    to MMIO BAR space, including across the copy_to/from_user() callouts
    which may fault.  If the user buffer for these copies resides in an
    mmap of device MMIO space, the mmap fault handler will acquire a
    recursive read-lock on memory_lock.  Avoid this by reducing the lock
    granularity.  Sequential accesses requiring multiple ioread/iowrite
    cycles are expected to be rare, therefore typical accesses should not
    see additional overhead.
    
    VGA MMIO accesses are expected to be non-fatal regardless of the PCI
    memory enable bit to allow legacy probing, this behavior remains with
    a comment added.  ioeventfds are now included in memory access testing,
    with writes dropped while memory space is disabled.
    
    Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
    Reported-by: Zhiyi Guo <zhguo@redhat.com>
    Tested-by: Zhiyi Guo <zhguo@redhat.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 drivers/vfio/pci/vfio_pci_private.h |   2 +
 drivers/vfio/pci/vfio_pci_rdwr.c    | 120 ++++++++++++++++++++++++++++--------
 2 files changed, 98 insertions(+), 24 deletions(-)

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bc93b9ae0151ae5ad5b8504cdc598428ea99570b
Comment 6 Justin Gatzen 2020-10-23 08:40:51 UTC
Thanks, I will revert that commit and report back. This commit was backported to kernel 5.8.4, which would jive with my (unproven) suspicion that 5.8.2 was unaffected.

Also, to clarify, the second stack trace on my previous post (Comment #4) could have infact been a VFIO VM. There were several VMs running with vhost/virtio net communicating with each other and it was not clear which VM triggered it.
Comment 7 Alex Williamson 2020-10-23 18:23:51 UTC
Color me suspicious, but there are backtraces from two configurations in the comments here that have no vfio devices, the original post and Justin's second trace.  The identified commit can only affect vfio configurations.

All of the backtraces seem to be from triggering this warning:

__u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
{
        unsigned long flags;

        /*
         * Deadlock or stack overflow issues can happen if we recurse here
         * through waitqueue wakeup handlers. If the caller users potentially
         * nested waitqueues with custom wakeup handlers, then it should
         * check eventfd_signal_count() before calling this function. If
         * it returns true, the eventfd_signal() call should be deferred to a
         * safe context.
         */
        if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
                return 0;

This cpu-local counter is only incremented while holding a spinlock with IRQs disabled while handling the wait queue.

It's not obvious to me how the backtraces shown can lead to recursive eventfd signals.  I've setup a configuration for stress testing, but any detailed description of a reliable reproducer would be appreciated.
Comment 8 Ian Pilcher 2020-10-23 19:05:35 UTC
Created attachment 293159 [details]
Guest XML

Attached is the XML of my VFIO guest.  This is a Windows 10 Pro VM, which I use to run Topaz Labs Video Enhance AI.

  https://topazlabs.com/video-enhance-ai/

Topaz Labs offer a free 30-day trial.  I have been using this to upscale DVD-quality video (720x480) by 225% to 1620x1080, using the Artemis-LQ model.  This typically triggers the warning within a few minutes.  (During the bisect I considered a 1 hour run without hitting the warning to be "success.")

Here is the information about my GPU (and associated audio device):

05:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P1000] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company Device 11bc
        Physical Slot: 2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 89
        NUMA node: 0
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 383fe0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 383ff0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at e000 [size=128]
        Expansion ROM at fb000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00a78  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Via message, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau

05:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
        Subsystem: Hewlett-Packard Company Device 11bc
        Physical Slot: 2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin B routed to IRQ 17
        NUMA node: 0
        Region 0: Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 26.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Via message, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
Comment 9 Ian Pilcher 2020-10-23 22:28:13 UTC
Testing 5.9.1 with commit c49fa6397b6d29ce10c0ae5b2528bb004a14691f reverted, and it seems stable (before reverting, I hit the warning withing a few minutes).
Comment 10 Ian Pilcher 2020-10-27 12:51:50 UTC
(In reply to Alex Williamson from comment #7)
> Color me suspicious, but there are backtraces from two configurations in the
> comments here that have no vfio devices, the original post and Justin's
> second trace.  The identified commit can only affect vfio configurations.
> 
> All of the backtraces seem to be from triggering this warning:
> 
> __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
> {
>         unsigned long flags;
> 
>         /*
>          * Deadlock or stack overflow issues can happen if we recurse here
>          * through waitqueue wakeup handlers. If the caller users potentially
>          * nested waitqueues with custom wakeup handlers, then it should
>          * check eventfd_signal_count() before calling this function. If
>          * it returns true, the eventfd_signal() call should be deferred to a
>          * safe context.
>          */
>         if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
>                 return 0;

It's quite possible that some of the backtraces in this bug have different root cause(s).  That doesn't change the fact that commit c49fa6397b6d29ce10c0ae5b2528bb004a14691f does reliably trigger the WARNING for some of us.

> It's not obvious to me how the backtraces shown can lead to recursive
> eventfd signals.  I've setup a configuration for stress testing, but any
> detailed description of a reliable reproducer would be appreciated.

Is the VM XML and other information sufficient for you to reproduce?  (If not, I can set up access to my hypervisor.)
Comment 11 Alex Williamson 2020-10-28 17:12:06 UTC
I have reproduced, I don't know the solution yet, but I think I have a workaround, at least for occurrences like shown in comment 1.  These seem to be generated by the GeForce MSI quirk, which uses an ioeventfd for the re-arming of the MSI.  Perhaps we haven't unwound out of the previous MMIO write before another is triggered, but we should be running with IRQs disabled... I'm not sure.  We can disable this ioeventfd in QEMU though at a slight interrupt latency cost while further debugging.  libvirt xml changes:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='vfio-pci.x-no-vfio-ioeventfd=on'/>
  </qemu:commandline>
</domain>

IOW, replace the first line to include the schema, otherwise libvirt will ignore and remove the addition, then add the <qemu:commandline> stanza at the end of the <domain> definition.  If using QEMU command line directly you can similarly just add "-global vfio-pci.x-no-vfio-ioeventfd=on".
Comment 12 Alex Williamson 2020-10-28 20:35:22 UTC
Created attachment 293281 [details]
Test fix for ioeventfd_write traces

For those experiencing the issue described by Ian in comment 1 and the first issue from Martin in comment 2, please try this patch if you're able.  vfio_pci_ioeventfd_handler() is called in a spinlock context with interrupts disabled and tries to acquire a read lock on the memory semaphore to verify the device memory is enabled.  The down_read() call can sleep and therefore should instead be called from a thread context if there is contention.  TBH, I've never seen it contended, so in practice the thread is never really used, but this seems to solve the specific case Ian has identified and it would have been introduced by the commit noted in comment 5.

This change should be specific to configurations with NVIDIA GPUs assigned.  I'm curious about the traces with vfio_msihandler as well, if anyone can provide a reproducer of that it would be appreciated.
Comment 13 Alex Williamson 2020-10-28 20:37:07 UTC
PS - Don't forget to undo the workaround in comment 11 if applied before testing the above patch.
Comment 14 Justin Gatzen 2020-10-28 20:52:05 UTC
I have been running 5.9.1 with bc93b9ae0151ae5ad5b8504cdc598428ea99570b reverted for 5 days now and have not had any recurrences. I can also confirm two of my VMs have Geforce GPUs (GTX1060, GT710) passed through along with virtio-net in the same VM.

Going to give the new patch from comment 12 a shot now. Thanks for looking into this Alex!
Comment 15 Ian Pilcher 2020-10-28 23:41:39 UTC
(In reply to Alex Williamson from comment #12)
> Created attachment 293281 [details]
> Test fix for ioeventfd_write traces

Stable for 75 minutes so far with this patch applied.
Comment 16 Ian Pilcher 2020-10-29 14:24:14 UTC
(In reply to Ian Pilcher from comment #15)
> Stable for 75 minutes so far with this patch applied.

Stable for 16 hours.  From my POV, the patch does fix the issue.

Thanks Alex!
Comment 17 Justin Gatzen 2020-10-30 21:52:39 UTC
Calling it fixed on my end as well using the patch from comment #12.
Comment 18 Alex Williamson 2020-10-30 22:13:12 UTC
Patch posted earlier today:

https://lore.kernel.org/lkml/160407008986.9986.83949368176304529.stgit@gimli.home/

I'll intend to get this in for v5.10-rc3 next week and it should get backported to affected kernels from there.

Justin, I'll add a Tested-by for you as well.  Interesting that you reported different backtraces and this still appears to resolve it.
Comment 19 Justin Gatzen 2020-10-30 22:53:36 UTC
Agreed its less clear for my stack traces. I also didn't have a good reproducer other than my regular daily usage and waiting. If it recurs and this was a false positive I'll let you know.
Comment 20 Ian Pilcher 2020-11-10 18:35:41 UTC
(In reply to Alex Williamson from comment #18)
> I'll intend to get this in for v5.10-rc3 next week and it should get
> backported to affected kernels from there.

I see that this has gone into Linus's tree, but it looks like you didn't cc stable@vger.kernel.org.  (I'm not an expert on the process, though, so maybe that isn't needed anymore?)
Comment 21 Alex Williamson 2020-11-10 18:52:33 UTC
(In reply to Ian Pilcher from comment #20)
> I see that this has gone into Linus's tree, but it looks like you didn't cc
> stable@vger.kernel.org.  (I'm not an expert on the process, though, so maybe
> that isn't needed anymore?)

The Fixes: tag serves the same purpose and is more explicit, it's already in the 5.9-stable queue.
Comment 22 Ian Pilcher 2020-11-10 22:43:46 UTC
(In reply to Alex Williamson from comment #21)
> The Fixes: tag serves the same purpose and is more explicit, it's already in
> the 5.9-stable queue.

Gotcha.  Sorry about the noise.

BTW, any reason not to close this?
Comment 23 Mair O'Nire 2020-11-16 11:21:32 UTC
Hi all.
I'm new here, registered exactly to track this bug.
I've got a new laptop recently, HP Omen 2020, with AMD Ryzen 4800H and Nvidia 1660Ti, and whilst was trying to setup kvm virtualized system I ran into the same issue. During investigation I've found Ian's report, and then this one, too.
Backtraces were very similar to other posted above, all at fs/eventfd.c:74, but sometimes with different origin. So I decided to give Alex's patch a try, and built mainline kernel with this patch applied. Mine host kernel was 5.8.[6,17,18], Fedora 33, and they all seems to be affected. Guest is Windows 10.
So, I do confirm - patch works, system became stable, and there was no crashes for about a week, nor under regular work load, neither while running benchmarks. Before the fix guest uptime varied from 2 to 10 minutes and then crash. Moreover, I had to reboot host to get Nvidia back to work in vm again.
Thanks Alex, Ian, and others, you guys doin' a great work if you don't know this already :)
Comment 24 Clément Aubin 2020-12-04 22:49:12 UTC
Hi everyone, 

Recently tested with the kernel 5.9.12 and the issue is indeed fixed on my side. Huge thanks to everyone participating to this issue, especially @Alex and @Ian for the diagnosis and the fix !

I'm closing the issue, don't hesitate to re-open it if needed.