Created attachment 295203 [details] lspci -vv when the card is correctly detected The device is at address 03:00.0. A virtual machine with a VFIO device cannot be migrated. To migrate a VM with a VFIO device we unplug the card and then replug the card after migration. To avoid networking interruption, the VFIO card is set in a failover set with a virtio-net device: when the migration begins, the VFIO card is automatically unplugged and the network switches to the virtio-net device and on destination the VFIO card is automatically plugged back and the network switches back to the VFIO device. On the VM boot, the VFIO card is only plugged in the VM if the virtio-net driver negociates VIRTIO_NET_F_STANDBY features. This means the VFIO card is hotplugged by the hypervisor (QEMU) while the kernel is executing the virtnet_probe(). But since commit 80696f991424 "PCI: pciehp: Tolerate Presence Detect hardwired to zero" it doesn't work anymore. Normally, during the boot sequence, we should have something like: [ 4.528949] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed [ 4.530470] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to buttons [ 4.532148] pcieport 0000:00:02.2: pciehp: Slot(0-2): Card present [ 4.533380] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up [ 4.551226] virtio_net virtio1 eth0: failover master:eth0 registered [ 4.556881] virtio_net virtio1 eth0: failover standby slave:eth1 registered [ 4.906101] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered But now we have: [ 5.256937] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed [ 5.258389] pcieport 0000:00:02.2: pciehp: Slot(0-2): Powering off due to button press [ 5.414381] pcieport 0000:00:02.6: pciehp: Slot(0-6): No device found [ 5.415870] pcieport 0000:00:02.4: pciehp: Slot(0-4): No device found [ 5.477205] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered [ 10.456811] virtio_net virtio1 enp2s0: failover primary slave:enp3s0 unregistered QEMU sends an "Attention Button Pressed" event with a "Presence Detected Changed" flag when the card is hotplugged, it seems the kernel doesn't detect correctly the power state. I will attach the result of lspci -vv to the bug. The QEMU command to reproduce the bug is (DEVICE is the VFIO device with a Virtual Function, IMAGE the VM image): -----8<--------------------------------------------------------------- IMAGE=rhel84.qcow2 MACADDR="22:2b:62:bb:a9:82" DEVICE="0000:06:00.0" modprobe vfio_iommu_type1 modprobe vfio-pci DEVPATH="/sys/bus/pci/devices/$DEVICE" NET=$(ls $DEVPATH/net) VF=$(basename $(readlink $DEVPATH/virtfn0)) PCIIDS=$(lspci -ns $VF|cut -d' ' -f3|awk -F':' '{ print $1" "$2 }') # disable VFS echo 0 > $DEVPATH/sriov_numvfs #enable 1 echo 1 > $DEVPATH/sriov_numvfs echo "$VF" > $DEVPATH/virtfn0/driver/unbind echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/new_id echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/remove_id ip link set $NET vf 0 mac "$MACADDR" qemu-system-x86_64 -name rhel84 \ -M q35 \ -enable-kvm \ -nodefaults \ -m 4G \ -smp 2 \ -cpu host \ -nographic \ -device pcie-root-port,id=root.1,chassis=1,addr=0x2.0,multifunction=on \ -device pcie-root-port,id=root.2,chassis=2,addr=0x2.1 \ -device pcie-root-port,id=root.3,chassis=3,addr=0x2.2 \ -device pcie-root-port,id=root.4,chassis=4,addr=0x2.3 \ -device pcie-root-port,id=root.5,chassis=5,addr=0x2.4 \ -device pcie-root-port,id=root.6,chassis=6,addr=0x2.5 \ -device pcie-root-port,id=root.7,chassis=7,addr=0x2.6 \ -device pcie-root-port,id=root.8,chassis=8,addr=0x2.7 \ -blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=$IMAGE,aio=threads \ -blockdev node-name=drive-virtio-disk0,driver=qcow2,cache.direct=on,cache.no-flush=off,file=back_image \ -device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root.1 \ -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=$MACADDR,bus=root.2,failover=on \ -device vfio-pci,host=$VF,id=hostdev0,bus=root.3,failover_pair_id=net0 \ -monitor stdio \ -chardev socket,id=console0,server=on,telnet=on,host=0.0.0.0,port=1234 \ -serial chardev:console0 -----8<--------------------------------------------------------------- I think there is a race condition in the kernel PCI code because if I delay the card hotplug by 2 seconds after the virtio-net features negociation it works fine.
Created attachment 295205 [details] dmesg when the card is missing generated with pciehp.pciehp_debug=1 "dyndbg=file pciehp* +p" HEAD is e0756cfc7d7c Merge tag 'trace-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Created attachment 295207 [details] dmesg when the card is availabe generated with pciehp.pciehp_debug=1 "dyndbg=file pciehp* +p" HEAD is revert of 80696f991424d05a784c0cf9c314ac09ac280406 on top of e0756cfc7d7c Merge tag 'trace-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Created attachment 295209 [details] Patch to revert 80696f991424d05a784c0cf9c314ac09ac280406
QEMU version is qemu-system-x86-core-5.2.0-4.fc33.x86_64
The Data Link Layer Link Active bit in the Link Status Register contains bogus values on this hotplug port (which I understand is a software simulation of a hotplug port inside qemu): In boot-missing-failover.txt, the link is up when the machine boots, even though the slot is not occupied: Note that Link Status is 0x2204, i.e. bit 0x2000 = Data Link Layer Link Active is set. But Slot Status is 0x0010, so bit 0x40 = Presence Detect State is not set. [ 1.661105] pcieport 0000:00:02.2: pciehp: Slot Capabilities : 0x0002007b [ 1.661115] pcieport 0000:00:02.2: pciehp: Slot Status : 0x0010 [ 1.661123] pcieport 0000:00:02.2: pciehp: Slot Control : 0x07c0 [ 1.661138] pcieport 0000:00:02.2: pciehp: Slot #0 AttnBtn+ PwrCtrl+ MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+ Interlock+ NoCompl- IbPresDis- LLActRep+ [ 1.662581] pcieport 0000:00:02.2: pciehp: pciehp_get_power_status: SLOTCTRL 6c value read 7c0 [ 1.662597] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204 [ 1.662703] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status [ 1.662706] pcieport 0000:00:02.2: pciehp: pcie_enable_notification: SLOTCTRL 6c write cmd 1031 [ 1.662730] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204 [ 1.662748] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204 [ 1.662750] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up So pciehp notices, oh the link is up, there's a card in the slot! It brings up the slot but fails because the vendor ID of the card can't be read, so it powers the slot off: [ 2.896132] pcieport 0000:00:02.2: pciehp: pciehp_check_link_status: lnk_status = 2204 [ 2.896135] pcieport 0000:00:02.2: pciehp: Slot(0-2): No device found [ 2.896900] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status [ 2.896903] pcieport 0000:00:02.2: pciehp: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400 [ 3.656901] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0009 from Slot Status Then the network card is hotplugged, which results in a Card present event. So pciehp tries to bring up the slot again: [ 3.907028] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0018 from Slot Status [ 3.907040] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 340 [ 3.907162] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status [ 3.907183] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 300 [ 3.907190] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed [ 3.909108] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to button press [ 3.911244] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status [ 3.911249] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 2c0 [ 3.911286] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 204 For reasons that pass understanding, the Data Link Layer Link Active bit is now no longer set: Link Status is 0x0204, so bit 0x2000 is cleared. Hence bringup of the slot fails. That doesn't make any sense. I suspect there's a bug in qemu's hotplug port implementation which causes the Data Link Layer Link Active bit to contain an inverted value or something. pciehp's behavior looks fine to me, it can't bring up the slot if the link is down. Reverting 80696f991424 is not a proper solution, it would just paper over a bug in qemu. Rather, qemu needs to report a proper value for the Data Link Layer Link Active bit. I'm surprised this issue is reported only now, more than 2 years after 80696f991424 was merged. Perhaps the bug in qemu was introduced only recently? You're writing that a 2 second delay causes hotplug in the guest to succeed. Quite likely the Data Link Layer Link Active bit is then set for some reason.
(In reply to Lukas Wunner from comment #5) > The Data Link Layer Link Active bit in the Link Status Register contains > bogus values on this hotplug port (which I understand is a software > simulation of a hotplug port inside qemu): ... > That doesn't make any sense. I suspect there's a bug in qemu's hotplug port > implementation which causes the Data Link Layer Link Active bit to contain > an inverted value or something. pciehp's behavior looks fine to me, it can't > bring up the slot if the link is down. Thank you very much for the analysis. Based on that I'm going to see what can be wrong in QEMU. > Reverting 80696f991424 is not a proper solution, it would just paper over a > bug in qemu. Rather, qemu needs to report a proper value for the Data Link > Layer Link Active bit. I'm surprised this issue is reported only now, more > than 2 years after 80696f991424 was merged. Perhaps the bug in qemu was > introduced only recently? In fact, the problem is hard to detect because it depends on what we have on the PCI bus (I think timing is a factor to trigger the bug) For a given kernel revision it can work in a given configuration and not in an another configuration. Moreover, it's very difficult to bisect but reverting it on the HEAD of master fixes the problem. I've tested with the first version of QEMU with the virtio-net failover (QEMU v4.2.0) and the bug is the same. I don't think it's a regression in QEMU, perhaps more of a latent bug in QEMU...
Something else to be aware of: If the link comes up significantly after the Presence Detect Changed event, then this link change should result in a Data Link Layer State Changed event (bit 0x0100 in the Slot Status register). In other words, the hotplug port should send another interrupt when the link comes up. This will trigger another slot bringup attempt by pciehp. However such a Data Link Layer State Changed event is not received (according to the dmesg output). Perhaps qemu neglects to send this interrupt. That would be another bug.
Fixed in QEMU by: df72184ec158 ("pcie: don't set link state active if the slot is empty") https://github.com/qemu/qemu/commit/df72184ec15829053b3bb5a0d5801773b6d9ec25
Thanks for the update Laurent. Good to hear that you were able to root-cause and fix the issue.
(In reply to Lukas Wunner from comment #9) > Thanks for the update Laurent. Good to hear that you were able to root-cause > and fix the issue. Thank you to you, Lukas, without your analysis of the problem it would not have been possible.