Bug 211691 - virtio-net failover primary device is unavailable at boot
Summary: virtio-net failover primary device is unavailable at boot
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-10 22:32 UTC by Laurent Vivier
Modified: 2021-02-25 21:46 UTC
CC List: 1 user

See Also:
Kernel Version: v5.11-rc7
Tree: Mainline
Regression: Yes


Attachments
lspci -vv when the card is correctly detected. The device is at address 03:00.0. (39.91 KB, text/plain)
2021-02-10 22:32 UTC, Laurent Vivier
dmesg when the card is missing (61.99 KB, text/plain)
2021-02-10 22:39 UTC, Laurent Vivier
dmesg when the card is available (54.49 KB, text/plain)
2021-02-10 22:51 UTC, Laurent Vivier
Patch to revert 80696f991424d05a784c0cf9c314ac09ac280406 (5.65 KB, patch)
2021-02-10 22:53 UTC, Laurent Vivier

Description Laurent Vivier 2021-02-10 22:32:52 UTC
Created attachment 295203 [details]
lspci -vv when the card is correctly detected. The device is at address 03:00.0.

A virtual machine with a VFIO device cannot be migrated.
To migrate a VM with a VFIO device, we unplug the card before migration and replug it after the migration completes.

To avoid a networking interruption, the VFIO card is put in a failover pair with a virtio-net device: when the migration begins, the VFIO card is automatically unplugged and the network switches to the virtio-net device; on the destination, the VFIO card is automatically plugged back in and the network switches back to the VFIO device.

At VM boot, the VFIO card is only plugged into the VM if the virtio-net driver negotiates the VIRTIO_NET_F_STANDBY feature. This means the VFIO card is hotplugged by the hypervisor (QEMU) while the kernel is executing virtnet_probe().

But since commit 80696f991424 ("PCI: pciehp: Tolerate Presence Detect hardwired to zero"), this no longer works.

Normally, during the boot sequence, we should have something like:

[    4.528949] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[    4.530470] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to button press
[    4.532148] pcieport 0000:00:02.2: pciehp: Slot(0-2): Card present
[    4.533380] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up
[    4.551226] virtio_net virtio1 eth0: failover master:eth0 registered
[    4.556881] virtio_net virtio1 eth0: failover standby slave:eth1 registered
[    4.906101] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered

But now we have:

[    5.256937] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[    5.258389] pcieport 0000:00:02.2: pciehp: Slot(0-2): Powering off due to button press
[    5.414381] pcieport 0000:00:02.6: pciehp: Slot(0-6): No device found
[    5.415870] pcieport 0000:00:02.4: pciehp: Slot(0-4): No device found
[    5.477205] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered
[   10.456811] virtio_net virtio1 enp2s0: failover primary slave:enp3s0 unregistered

QEMU sends an "Attention Button Pressed" event with a "Presence Detect Changed" flag when the card is hotplugged; it seems the kernel doesn't correctly detect the power state.

I will attach the result of lspci -vv to the bug.

The QEMU command to reproduce the bug is (DEVICE is the VFIO device with a Virtual Function, IMAGE the VM image):

-----8<---------------------------------------------------------------
IMAGE=rhel84.qcow2
MACADDR="22:2b:62:bb:a9:82"
DEVICE="0000:06:00.0"

modprobe vfio_iommu_type1
modprobe vfio-pci

DEVPATH="/sys/bus/pci/devices/$DEVICE"
NET=$(ls $DEVPATH/net)
VF=$(basename $(readlink $DEVPATH/virtfn0))
PCIIDS=$(lspci -ns $VF|cut -d' ' -f3|awk -F':' '{ print $1" "$2 }')

# disable VFs
echo 0 > $DEVPATH/sriov_numvfs
# enable 1 VF
echo 1 > $DEVPATH/sriov_numvfs
echo "$VF" > $DEVPATH/virtfn0/driver/unbind
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/remove_id

ip link set $NET vf 0 mac "$MACADDR"

qemu-system-x86_64 -name rhel84 \
-M q35 \
-enable-kvm \
-nodefaults \
-m 4G \
-smp 2 \
-cpu host \
-nographic \
-device pcie-root-port,id=root.1,chassis=1,addr=0x2.0,multifunction=on \
-device pcie-root-port,id=root.2,chassis=2,addr=0x2.1 \
-device pcie-root-port,id=root.3,chassis=3,addr=0x2.2 \
-device pcie-root-port,id=root.4,chassis=4,addr=0x2.3 \
-device pcie-root-port,id=root.5,chassis=5,addr=0x2.4 \
-device pcie-root-port,id=root.6,chassis=6,addr=0x2.5 \
-device pcie-root-port,id=root.7,chassis=7,addr=0x2.6 \
-device pcie-root-port,id=root.8,chassis=8,addr=0x2.7 \
-blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=$IMAGE,aio=threads \
-blockdev node-name=drive-virtio-disk0,driver=qcow2,cache.direct=on,cache.no-flush=off,file=back_image \
-device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root.1 \
-netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=$MACADDR,bus=root.2,failover=on \
-device vfio-pci,host=$VF,id=hostdev0,bus=root.3,failover_pair_id=net0 \
-monitor stdio \
-chardev socket,id=console0,server=on,telnet=on,host=0.0.0.0,port=1234 \
-serial chardev:console0
-----8<---------------------------------------------------------------

I think there is a race condition in the kernel PCI code, because if I delay the card hotplug until 2 seconds after the virtio-net feature negotiation, it works fine.
Comment 1 Laurent Vivier 2021-02-10 22:39:09 UTC
Created attachment 295205 [details]
dmesg when the card is missing

generated with pciehp.pciehp_debug=1 "dyndbg=file pciehp* +p"

HEAD is e0756cfc7d7c Merge tag 'trace-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Comment 2 Laurent Vivier 2021-02-10 22:51:53 UTC
Created attachment 295207 [details]
dmesg when the card is available

generated with pciehp.pciehp_debug=1 "dyndbg=file pciehp* +p"

HEAD is

revert of 80696f991424d05a784c0cf9c314ac09ac280406

on top of

e0756cfc7d7c Merge tag 'trace-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Comment 3 Laurent Vivier 2021-02-10 22:53:43 UTC
Created attachment 295209 [details]
Patch to revert 80696f991424d05a784c0cf9c314ac09ac280406
Comment 4 Laurent Vivier 2021-02-10 22:58:25 UTC
QEMU version is qemu-system-x86-core-5.2.0-4.fc33.x86_64
Comment 5 Lukas Wunner 2021-02-11 06:09:10 UTC
The Data Link Layer Link Active bit in the Link Status Register contains bogus values on this hotplug port (which I understand is a software simulation of a hotplug port inside qemu):

In boot-missing-failover.txt, the link is up when the machine boots, even though the slot is not occupied:
Note that Link Status is 0x2204, i.e. bit 0x2000 = Data Link Layer Link Active is set.
But Slot Status is 0x0010, so bit 0x40 = Presence Detect State is not set.

[    1.661105] pcieport 0000:00:02.2: pciehp: Slot Capabilities      : 0x0002007b
[    1.661115] pcieport 0000:00:02.2: pciehp: Slot Status            : 0x0010
[    1.661123] pcieport 0000:00:02.2: pciehp: Slot Control           : 0x07c0
[    1.661138] pcieport 0000:00:02.2: pciehp: Slot #0 AttnBtn+ PwrCtrl+ MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+ Interlock+ NoCompl- IbPresDis- LLActRep+
[    1.662581] pcieport 0000:00:02.2: pciehp: pciehp_get_power_status: SLOTCTRL 6c value read 7c0
[    1.662597] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204
[    1.662703] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status
[    1.662706] pcieport 0000:00:02.2: pciehp: pcie_enable_notification: SLOTCTRL 6c write cmd 1031
[    1.662730] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204
[    1.662748] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 2204
[    1.662750] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up
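
The contradiction described above can be made concrete with a minimal Python sketch; the bit masks follow the PCIe specification (the same values as PCI_EXP_LNKSTA_DLLLA and PCI_EXP_SLTSTA_PDS in the kernel's include/uapi/linux/pci_regs.h), and the register values are the ones quoted from this log:

```python
# Minimal sketch: decode the two status registers pciehp consults when
# deciding whether a slot is occupied. Bit positions follow the PCIe spec
# (PCI_EXP_LNKSTA_DLLLA / PCI_EXP_SLTSTA_PDS in include/uapi/linux/pci_regs.h).
PCI_EXP_LNKSTA_DLLLA = 0x2000  # Link Status: Data Link Layer Link Active
PCI_EXP_SLTSTA_PDS = 0x0040    # Slot Status: Presence Detect State

def link_active(lnk_status: int) -> bool:
    """True if the Data Link Layer Link Active bit is set."""
    return bool(lnk_status & PCI_EXP_LNKSTA_DLLLA)

def card_present(slot_status: int) -> bool:
    """True if the Presence Detect State bit is set."""
    return bool(slot_status & PCI_EXP_SLTSTA_PDS)

# Values quoted from boot-missing-failover.txt: the link is reported
# active even though presence detect says the slot is empty, and the
# link then reads as down (0x0204) once the card is actually hotplugged.
print(link_active(0x2204))   # True  - link up at boot
print(card_present(0x0010))  # False - but no card present
print(link_active(0x0204))   # False - link down after the hotplug
```

It is exactly this inconsistent pair (link active, card absent) that makes pciehp attempt a slot bringup at boot, and the cleared bit in 0x0204 that makes the later, legitimate bringup fail.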

So pciehp notices, oh the link is up, there's a card in the slot! It brings up the slot but fails because the vendor ID of the card can't be read, so it powers the slot off:

[    2.896132] pcieport 0000:00:02.2: pciehp: pciehp_check_link_status: lnk_status = 2204
[    2.896135] pcieport 0000:00:02.2: pciehp: Slot(0-2): No device found
[    2.896900] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status
[    2.896903] pcieport 0000:00:02.2: pciehp: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400
[    3.656901] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0009 from Slot Status

Then the network card is hotplugged, which results in a Card present event. So pciehp tries to bring up the slot again:

[    3.907028] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0018 from Slot Status
[    3.907040] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 340
[    3.907162] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status
[    3.907183] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 300
[    3.907190] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[    3.909108] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to button press
[    3.911244] pcieport 0000:00:02.2: pciehp: pending interrupts 0x0010 from Slot Status
[    3.911249] pcieport 0000:00:02.2: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 2c0
[    3.911286] pcieport 0000:00:02.2: pciehp: pciehp_check_link_active: lnk_status = 204

For reasons that pass understanding, the Data Link Layer Link Active bit is now no longer set:
Link Status is 0x0204, so bit 0x2000 is cleared. Hence bringup of the slot fails.

That doesn't make any sense. I suspect there's a bug in qemu's hotplug port implementation which causes the Data Link Layer Link Active bit to contain an inverted value or something. pciehp's behavior looks fine to me, it can't bring up the slot if the link is down.

Reverting 80696f991424 is not a proper solution, it would just paper over a bug in qemu. Rather, qemu needs to report a proper value for the Data Link Layer Link Active bit. I'm surprised this issue is reported only now, more than 2 years after 80696f991424 was merged. Perhaps the bug in qemu was introduced only recently?

You're writing that a 2 second delay causes hotplug in the guest to succeed. Quite likely the Data Link Layer Link Active bit is then set for some reason.
Comment 6 Laurent Vivier 2021-02-11 09:31:18 UTC
(In reply to Lukas Wunner from comment #5)
> The Data Link Layer Link Active bit in the Link Status Register contains
> bogus values on this hotplug port (which I understand is a software
> simulation of a hotplug port inside qemu):
...
> That doesn't make any sense. I suspect there's a bug in qemu's hotplug port
> implementation which causes the Data Link Layer Link Active bit to contain
> an inverted value or something. pciehp's behavior looks fine to me, it can't
> bring up the slot if the link is down.

Thank you very much for the analysis.

Based on that I'm going to see what can be wrong in QEMU.

> Reverting 80696f991424 is not a proper solution, it would just paper over a
> bug in qemu. Rather, qemu needs to report a proper value for the Data Link
> Layer Link Active bit. I'm surprised this issue is reported only now, more
> than 2 years after 80696f991424 was merged. Perhaps the bug in qemu was
> introduced only recently?

In fact, the problem is hard to detect because it depends on what is on the PCI bus (I think timing is a factor in triggering the bug).

For a given kernel revision it can work in one configuration and not in another. Moreover, it's very difficult to bisect, but reverting the commit on the HEAD of master fixes the problem.

I've tested with the first version of QEMU that had virtio-net failover (QEMU v4.2.0) and the bug is the same. I don't think it's a regression in QEMU; it's perhaps more of a latent bug in QEMU...
Comment 7 Lukas Wunner 2021-02-11 10:50:49 UTC
Something else to be aware of:

If the link comes up significantly after the Presence Detect Changed event, then this link change should result in a Data Link Layer State Changed event (bit 0x0100 in the Slot Status register). In other words, the hotplug port should send another interrupt when the link comes up. This will trigger another slot bringup attempt by pciehp.

However such a Data Link Layer State Changed event is not received (according to the dmesg output). Perhaps qemu neglects to send this interrupt. That would be another bug.
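
The missing event can be checked against the quoted logs with a short sketch, assuming the standard mask for the Data Link Layer State Changed bit (PCI_EXP_SLTSTA_DLLSC in the kernel headers):

```python
# Minimal sketch: test whether a Slot Status "pending interrupts" value
# carries a Data Link Layer State Changed event. The mask matches
# PCI_EXP_SLTSTA_DLLSC in include/uapi/linux/pci_regs.h.
PCI_EXP_SLTSTA_DLLSC = 0x0100

def has_link_state_event(slot_status: int) -> bool:
    """True if the Data Link Layer State Changed bit is set."""
    return bool(slot_status & PCI_EXP_SLTSTA_DLLSC)

# None of the "pending interrupts" values quoted in the dmesg excerpts
# (0x0010, 0x0018, 0x0009) include the event described here:
print([has_link_state_event(v) for v in (0x0010, 0x0018, 0x0009)])
# [False, False, False]
```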
Comment 8 Laurent Vivier 2021-02-25 20:10:33 UTC
Fixed in QEMU by:

df72184ec158 ("pcie: don't set link state active if the slot is empty")
https://github.com/qemu/qemu/commit/df72184ec15829053b3bb5a0d5801773b6d9ec25
Comment 9 Lukas Wunner 2021-02-25 21:40:33 UTC
Thanks for the update Laurent. Good to hear that you were able to root-cause and fix the issue.
Comment 10 Laurent Vivier 2021-02-25 21:46:15 UTC
(In reply to Lukas Wunner from comment #9)
> Thanks for the update Laurent. Good to hear that you were able to root-cause
> and fix the issue.

Thank you as well, Lukas; without your analysis of the problem, it would not have been possible.
