Bug 200839

Summary: QCA9005 AR9462 disappears from lspci after system suspend
Product: Drivers
Reporter: mmyangfl
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: bjorn, lukas, lvivier
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 4.17.8
Subsystem:
Regression: No
Bisected commit-id:
Attachments: lspci, dmesg, and /sys/bus/pci/devices/
lspci vv
dmesg new
[PATCH 1/2] PCI: pciehp: Differentiate between surprise and safe removal
[PATCH 2/2] PCI: pciehp: Tolerate Presence Detect hardwired to zero
lspci, dmesg after patch
lspci -vv when the card is correctly detected

Description mmyangfl 2018-08-17 08:50:06 UTC
Created attachment 277909 [details]
lspci, dmesg, and /sys/bus/pci/devices/

I got a new QCA9005 wifi card. It contains two parts: a Wil6200 (for 802.11ad) and an AR9462 (for 802.11a/b/g/n). Every time I perform a system suspend, the AR9462 goes missing from `lspci' and I cannot see the wifi device (the Wil6200 is unaffected). Someone suggested adding `acpiphp.disable=1' according to https://bbs.archlinux.org/viewtopic.php?id=216520 , but that doesn't help either.

It seems like AR9462 is attached to Wil6200 (see attachment), and this problem might be connected with it.

Tested on Debian stable/testing live iso.
Comment 1 Bjorn Helgaas 2018-08-17 13:58:26 UTC
Can you attach the output of "sudo lspci -vv"?  The PCI database for the Wilocity devices seems wrong.  I doubt this is what's causing this problem, but it'd be nice to correct the database.
Comment 2 mmyangfl 2018-08-17 14:45:07 UTC
Created attachment 277913 [details]
lspci vv
Comment 3 Bjorn Helgaas 2018-08-17 15:27:47 UTC
Thanks.  The Wil6200 devices are a PCIe switch containing:

  03:00.0 [1ae9:0101] PCIe Switch Upstream Port (PCI bridge to [bus 04-07])
  04:00.0 [1ae9:0200] PCIe Switch Downstream Port (PCI bridge to [bus 05])
  04:02.0 [1ae9:0201] PCIe Switch Downstream Port (PCI bridge to [bus 06])
  04:03.0 [1ae9:0201] PCIe Switch Downstream Port (PCI bridge to [bus 07])

The PCI database (https://pci-ids.ucw.cz/read/PC/1ae9) claims 04:02.0 and 04:03.0 are related to wireless, but I don't understand how.  They look like plain vanilla PCIe switch ports.  They're not PCIe Endpoints, they have no BARs, and I don't see how they can themselves be NICs.

They could *lead* to a wifi NIC, although your dmesg and lspci output don't show any devices on buses 06 and 07.
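
One way to check whether a function is a bridge rather than an endpoint is its class code, which sysfs exposes in /sys/bus/pci/devices/<address>/class. The following is a minimal sketch, assuming a POSIX shell; the helper name is_pci_bridge_class is made up for illustration. Base class 06 with subclass 04 is a PCI-to-PCI bridge, which is how PCIe switch ports present themselves.

```shell
# Hypothetical helper: succeed if a sysfs class value (e.g. 0x060400)
# denotes a PCI-to-PCI bridge (base class 06, subclass 04).
is_pci_bridge_class() {
    case "$1" in
        0x0604*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage on a live system (the address is one of the ports listed above):
#   is_pci_bridge_class "$(cat /sys/bus/pci/devices/0000:04:02.0/class)" && echo bridge
```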
Comment 4 Bjorn Helgaas 2018-08-17 15:30:10 UTC
The problem with the AR9462 disappearing after a suspend/resume may be a pciehp issue.  Lukas Wunner did a ton of updates in that area.  Is there any chance you could try a recent upstream kernel, e.g., 4e31843f681c ("Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci"), or v4.19-rc1 when that comes out?
Comment 5 mmyangfl 2018-08-17 15:40:37 UTC
It seems to be a single Half Mini PCIe card with two chipsets (Wil6110? + AR9462) combined by a PCIe hub. You can find more info at https://wikidevi.com/wiki/Qualcomm_Atheros_QCA9005 and pictures of it on Google.

BTW, I can't see any 802.11ad devices in my current kernel.

Anyway, I'll try new kernel later if possible.
Comment 6 mmyangfl 2018-08-18 03:31:47 UTC
Luckily, with the latest kernel 5c60a7389d795e001c8748b458eb76e3a5b6008c, AR9462 doesn't disappear any more after a suspend, but now it won't appear after system boot. I have to manually run `echo 1 > /sys/devices/pci0000\:00/0000\:00\:1c.3/0000\:03\:00.0/0000\:04\:00.0/rescan' to wake up AR9462.
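
The manual workaround can also be written against the /sys/bus/pci/devices/ symlinks, which avoids spelling out the full topology path. This is only a sketch, assuming a POSIX shell run as root; the helper name pci_rescan and the SYSFS override (useful for dry runs against a scratch directory) are hypothetical.

```shell
# Hypothetical helper: ask the kernel to rescan below one PCI device.
# SYSFS defaults to /sys; overriding it is only meant for dry runs.
pci_rescan() {
    attr="${SYSFS:-/sys}/bus/pci/devices/$1/rescan"
    # Refuse early if the attribute is missing or not writable.
    [ -w "$attr" ] || { echo "cannot write $attr" >&2; return 1; }
    echo 1 > "$attr"
}

# Usage (as root), for the Downstream Port from the command above:
#   pci_rescan 0000:04:00.0
```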

dmesg attached.
Comment 7 mmyangfl 2018-08-18 03:32:06 UTC
Created attachment 277923 [details]
dmesg new
Comment 8 Lukas Wunner 2018-08-23 13:45:31 UTC
Created attachment 278041 [details]
[PATCH 1/2] PCI: pciehp: Differentiate between surprise and safe removal

Preparatory patch removing one call to pciehp_get_adapter_status().
It was already submitted to the list on July 31:
https://patchwork.ozlabs.org/patch/951386/
Comment 9 Lukas Wunner 2018-08-23 13:47:15 UTC
Created attachment 278043 [details]
[PATCH 2/2] PCI: pciehp: Tolerate Presence Detect hardwired to zero
Comment 10 Lukas Wunner 2018-08-23 13:49:35 UTC
@David Yang: Please apply the two patches I've just attached on top of Linus' current tree (or alternatively 5c60a7389d79, which you used for testing previously) and report back if they fix the issue. Thanks.
Comment 11 mmyangfl 2018-08-24 06:24:13 UTC
Created attachment 278053 [details]
lspci, dmesg after patch

I tried 815f0ddb3 with the above 2 patches applied. It works great: the card no longer disappears after suspend, and it's correctly enumerated at boot time. But I noticed that the rev number of the Wil6200 changed after resume; I don't know if this is normal.

Log attached.
Comment 12 Lukas Wunner 2018-08-24 06:48:45 UTC
Right, the revision of the two wireless Downstream Ports changed from 04 to 14. Could be another hardware or BIOS bug, I wouldn't worry too much about it if the card is otherwise working. Let me submit the patch to the list then.
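
A change like this can be spotted by comparing `lspci -n` snapshots taken before suspend and after resume. The following is a rough sketch, assuming a POSIX shell; the helper name lspci_changed and the snapshot file names are invented for illustration.

```shell
# Hypothetical helper: print the bus address of every device whose
# `lspci -n` line in the second snapshot has no identical line in the first.
lspci_changed() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -F -x -v -f "$1" "$2" | cut -d' ' -f1
}

# Usage:
#   lspci -n > before.txt
#   (suspend and resume)
#   lspci -n > after.txt
#   lspci_changed before.txt after.txt
```

Swapping the two arguments lists devices that were present before suspend but changed or vanished afterwards.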
Comment 13 Laurent Vivier 2021-02-10 13:16:53 UTC
(In reply to Lukas Wunner from comment #9)
> Created attachment 278043 [details]
> [PATCH 2/2] PCI: pciehp: Tolerate Presence Detect hardwired to zero

This patch has introduced a regression with virtio-net failover for VFIO devices.

In the failover case, virtio-net triggers the hotplug of the VFIO card in the VM, and as this happens during the PCI bus scan, it seems not to be handled correctly.

In some cases (depending on the PCI cards on the bus), the hotplugged card is simply ignored; in other cases it is unplugged because the "Presence Detect Changed" event is treated as a Power Off rather than a Power On.

If I revert this patch on top of 5.11.0-rc7 it works fine.

See https://bugzilla.redhat.com/show_bug.cgi?id=1917654

Any idea?
Comment 14 Lukas Wunner 2021-02-10 17:25:02 UTC
(In reply to Laurent Vivier from comment #13)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1917654

I can't access that webpage, it says "You are not authorized to access bug #1917654. To see this bug, you must first log in to an account with the appropriate permissions."

Please either make that bug accessible to everyone or open a new bug on bugzilla.kernel.org and attach full dmesg and lspci -vv output. Thanks.
Comment 15 Laurent Vivier 2021-02-10 18:46:39 UTC
(In reply to Lukas Wunner from comment #14)
> (In reply to Laurent Vivier from comment #13)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1917654
> 
> I can't access that webpage, it says "You are not authorized to access bug
> #1917654. To see this bug, you must first log in to an account with the
> appropriate permissions."

Sorry, I didn't check whether it was public.

> Please either make that bug accessible to everyone or open a new bug on
> bugzilla.kernel.org and attach full dmesg and lspci -vv output. Thanks.

I'm putting the information here because I bisected the regression to the commit that fixed this bug.
I will open a new bug after your first comments on it.

The context is:

A virtual machine with a VFIO device cannot be migrated.
To migrate a VM with a VFIO device we unplug the card and then replug the card after migration.

To avoid a network interruption, the VFIO card is paired with a virtio-net device in a failover set: when the migration begins, the VFIO card is automatically unplugged and the network switches to the virtio-net device; on the destination, the VFIO card is automatically plugged back in and the network switches back to the VFIO device.

At VM boot, the VFIO card is only plugged into the VM if the virtio-net driver negotiates the VIRTIO_NET_F_STANDBY feature. This means the VFIO card is hotplugged by the hypervisor (QEMU) while the kernel is executing virtnet_probe().

But since commit 80696f991424 "PCI: pciehp: Tolerate Presence Detect hardwired to zero" it doesn't work anymore.

Normally, during the boot sequence, we should have something like:

[    4.528949] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[    4.530470] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to buttons
[    4.532148] pcieport 0000:00:02.2: pciehp: Slot(0-2): Card present
[    4.533380] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up
[    4.551226] virtio_net virtio1 eth0: failover master:eth0 registered
[    4.556881] virtio_net virtio1 eth0: failover standby slave:eth1 registered
[    4.906101] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered

But now we have:

[    5.256937] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[    5.258389] pcieport 0000:00:02.2: pciehp: Slot(0-2): Powering off due to button press
[    5.414381] pcieport 0000:00:02.6: pciehp: Slot(0-6): No device found
[    5.415870] pcieport 0000:00:02.4: pciehp: Slot(0-4): No device found
[    5.477205] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered
[   10.456811] virtio_net virtio1 enp2s0: failover primary slave:enp3s0 unregistered

QEMU sends an "Attention Button Pressed" event with the "Presence Detect Changed" flag set when the card is hotplugged; it seems the kernel doesn't detect the power state correctly.
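
To see at a glance which direction pciehp chose for each slot, the "Powering on"/"Powering off" lines can be filtered out of the kernel log. This is a minimal sketch, assuming a POSIX shell and log lines shaped like the two excerpts above; the helper name pciehp_power_events is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print, for each
# pciehp power transition, the slot and the direction the kernel chose.
pciehp_power_events() {
    sed -n \
        -e 's/.*pciehp: \(Slot([^)]*)\):* Powering on .*/\1 on/p' \
        -e 's/.*pciehp: \(Slot([^)]*)\):* Powering off .*/\1 off/p'
}

# Usage:
#   dmesg | pciehp_power_events
```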

I will attach the result of lspci -vv to the bug.

The QEMU command to reproduce the bug is (DEVICE is the VFIO device with a Virtual Function, IMAGE the VM image):

-----8<---------------------------------------------------------------
IMAGE=rhel84.qcow2
MACADDR="22:2b:62:bb:a9:82"
DEVICE="0000:06:00.0"

modprobe vfio_iommu_type1
modprobe vfio-pci

DEVPATH="/sys/bus/pci/devices/$DEVICE"
NET=$(ls $DEVPATH/net)
VF=$(basename $(readlink $DEVPATH/virtfn0))
PCIIDS=$(lspci -ns $VF|cut -d' ' -f3|awk -F':' '{ print $1" "$2 }')

# disable all VFs
echo 0 > $DEVPATH/sriov_numvfs
# enable 1 VF
echo 1 > $DEVPATH/sriov_numvfs
echo "$VF" > $DEVPATH/virtfn0/driver/unbind
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/remove_id

ip link set $NET vf 0 mac "$MACADDR"

qemu-system-x86_64 -name rhel84 \
-M q35 \
-enable-kvm \
-nodefaults \
-m 4G \
-smp 2 \
-cpu host \
-nographic \
-device pcie-root-port,id=root.1,chassis=1,addr=0x2.0,multifunction=on \
-device pcie-root-port,id=root.2,chassis=2,addr=0x2.1 \
-device pcie-root-port,id=root.3,chassis=3,addr=0x2.2 \
-device pcie-root-port,id=root.4,chassis=4,addr=0x2.3 \
-device pcie-root-port,id=root.5,chassis=5,addr=0x2.4 \
-device pcie-root-port,id=root.6,chassis=6,addr=0x2.5 \
-device pcie-root-port,id=root.7,chassis=7,addr=0x2.6 \
-device pcie-root-port,id=root.8,chassis=8,addr=0x2.7 \
-blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=$IMAGE,aio=threads \
-blockdev node-name=drive-virtio-disk0,driver=qcow2,cache.direct=on,cache.no-flush=off,file=back_image \
-device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root.1 \
-netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=$MACADDR,bus=root.2,failover=on \
-device vfio-pci,host=$VF,id=hostdev0,bus=root.3,failover_pair_id=net0 \
-monitor stdio \
-chardev socket,id=console0,server=on,telnet=on,host=0.0.0.0,port=1234 \
-serial chardev:console0
-----8<---------------------------------------------------------------
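
The PCIIDS line in the script turns the vendor:device field of `lspci -n` into the space-separated "vendor device" pair that is written to new_id and remove_id. A small illustration of just that step, assuming a POSIX shell; the helper name ids_for_new_id and the sample input line are made up.

```shell
# Hypothetical helper mirroring the PCIIDS pipeline in the script: read one
# `lspci -n` line on stdin and print "vendor device" for new_id/remove_id.
ids_for_new_id() {
    # Field 3 of `lspci -n` output is vendor:device; split it on the colon.
    cut -d' ' -f3 | awk -F':' '{ print $1" "$2 }'
}

# Example with an illustrative (made-up) input line:
#   echo "06:10.0 0200: 8086:10ed (rev 01)" | ids_for_new_id
#   prints: 8086 10ed
```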

I think there is a race condition in the kernel PCI code, because it works fine if I delay the card hotplug by 2 seconds after the virtio-net feature negotiation.
Comment 16 Laurent Vivier 2021-02-10 18:49:11 UTC
Created attachment 295195 [details]
lspci -vv when the card is correctly detected

The device is at address 03:00.0.
Comment 17 Lukas Wunner 2021-02-10 22:09:10 UTC
Please create a new bugzilla entry and add full dmesg output with and without 80696f991424.

Please add the following to the command line:
pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
Comment 18 Laurent Vivier 2021-02-10 22:42:17 UTC
(In reply to Lukas Wunner from comment #17)
> Please create a new bugzilla entry and add full dmesg output with and
> without 80696f991424.
> 
> Please add the following to the command line:
> pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"

New bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211691