Created attachment 277909 [details]
lspci, dmesg, and /sys/bus/pci/devices/

I got a new wifi card, a QCA9005. It contains two parts: a Wil6200 (for 802.11ad) and an AR9462 (for a/b/g/n). Every time I perform a system suspend, the AR9462 goes missing from `lspci' and I cannot see the wifi device (the Wil6200 is unaffected). Someone suggested adding `acpiphp.disable=1' according to https://bbs.archlinux.org/viewtopic.php?id=216520 , but that doesn't help either. It seems the AR9462 is attached to the Wil6200 (see attachment), and the problem might be related to that. Tested on Debian stable/testing live ISO.
Can you attach the output of "sudo lspci -vv"? The PCI database for the Wilocity devices seems wrong. I doubt this is what's causing this problem, but it'd be nice to correct the database.
Created attachment 277913 [details] lspci vv
Thanks. The Wil6200 devices are a PCIe switch containing:

  03:00.0 [1ae9:0101] PCIe Switch Upstream Port (PCI bridge to [bus 04-07])
  04:00.0 [1ae9:0200] PCIe Switch Downstream Port (PCI bridge to [bus 05])
  04:02.0 [1ae9:0201] PCIe Switch Downstream Port (PCI bridge to [bus 06])
  04:03.0 [1ae9:0201] PCIe Switch Downstream Port (PCI bridge to [bus 07])

The PCI database (https://pci-ids.ucw.cz/read/PC/1ae9) claims 04:02.0 and 04:03.0 are related to wireless, but I don't understand how. They look like plain vanilla PCIe switch ports. They're not PCIe Endpoints, they have no BARs, and I don't see how they can themselves be NICs. They could *lead* to a wifi NIC, although your dmesg and lspci output don't show any devices on buses 06 and 07.
The problem with the AR9462 disappearing after a suspend/resume may be a pciehp issue. Lukas Wunner did a ton of updates in that area. Is there any chance you could try a recent upstream kernel, e.g., 4e31843f681c ("Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci"), or v4.19-rc1 when that comes out?
Seems like it's a single Half Mini PCIe card with two chipsets (Wil6110? + AR9462) combined by a PCIe hub. You can find more info at https://wikidevi.com/wiki/Qualcomm_Atheros_QCA9005 and pictures of it on Google. BTW, I can't see any 802.11ad devices with my current kernel. Anyway, I'll try a new kernel later if possible.
Luckily, with the latest kernel 5c60a7389d795e001c8748b458eb76e3a5b6008c, AR9462 doesn't disappear any more after a suspend, but now it won't appear after system boot. I have to manually run `echo 1 > /sys/devices/pci0000\:00/0000\:00\:1c.3/0000\:03\:00.0/0000\:04\:00.0/rescan' to wake up AR9462. dmesg attached.
Created attachment 277923 [details] dmesg new
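For anyone else needing the manual rescan workaround described above, here is a minimal sketch of it as a shell function. It assumes the per-device `rescan` attribute is reachable through the `/sys/bus/pci/devices/<BDF>` symlink, which should point at the same device directory as the full `/sys/devices/...` path. The function name `rescan_below` and its optional second argument are hypothetical and only there for illustration and testing.

```shell
#!/bin/sh
# Sketch only: trigger a PCI rescan via a device's sysfs "rescan" attribute.
# Usage: rescan_below BDF [SYSFS_DIR]
# SYSFS_DIR defaults to /sys/bus/pci/devices; the parameter is an assumption
# added so the function can be tried against a scratch directory.
rescan_below() {
    bdf=$1
    sysfs_dir=${2:-/sys/bus/pci/devices}
    node="$sysfs_dir/$bdf/rescan"

    if [ ! -e "$node" ]; then
        echo "no rescan attribute for $bdf under $sysfs_dir" >&2
        return 1
    fi

    # Writing a non-zero value asks the kernel to rescan the bus this
    # device sits on, including the buses below it.
    echo 1 > "$node"
}

# Example (as root), using the Downstream Port from this report:
#   rescan_below 0000:04:00.0
```

This is only a workaround for the missing enumeration at boot, not a fix.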
Created attachment 278041 [details] [PATCH 1/2] PCI: pciehp: Differentiate between surprise and safe removal Preparatory patch removing one call to pciehp_get_adapter_status(). It was already submitted to the list on July 31: https://patchwork.ozlabs.org/patch/951386/
Created attachment 278043 [details] [PATCH 2/2] PCI: pciehp: Tolerate Presence Detect hardwired to zero
@David Yang: Please apply the two patches I've just attached on top of Linus' current tree (or alternatively 5c60a7389d79, which you used for testing previously) and report back if they fix the issue. Thanks.
Created attachment 278053 [details]
lspci, dmesg after patch

I tried 815f0ddb3 with the above two patches applied. It works great: the device no longer disappears after suspend, and it is correctly enumerated at boot time. But I noticed that the rev number of the Wil6200 changed after resume; I don't know if this is normal. Log attached.
Right, the revision of the two wireless Downstream Ports changed from 04 to 14. Could be another hardware or BIOS bug, I wouldn't worry too much about it if the card is otherwise working. Let me submit the patch to the list then.
(In reply to Lukas Wunner from comment #9)
> Created attachment 278043 [details]
> [PATCH 2/2] PCI: pciehp: Tolerate Presence Detect hardwired to zero

This patch has introduced a regression with virtio-net failover for a VFIO device.

In the failover case, virtio-net triggers the hotplug of the VFIO card in the VM, and because this happens during the PCI bus scan, it seems it is not handled correctly. In some cases (depending on the PCI cards on the bus) the hotplugged card is simply ignored; in other cases it is unplugged, because the "Presence Detect Changed" event is seen as a Power Off and not a Power On.

If I revert this patch on top of 5.11.0-rc7, it works fine.

See https://bugzilla.redhat.com/show_bug.cgi?id=1917654

Any idea?
(In reply to Laurent Vivier from comment #13)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1917654

I can't access that webpage, it says "You are not authorized to access bug #1917654. To see this bug, you must first log in to an account with the appropriate permissions."

Please either make that bug accessible to everyone or open a new bug on bugzilla.kernel.org and attach full dmesg and lspci -vv output. Thanks.
(In reply to Lukas Wunner from comment #14)
> (In reply to Laurent Vivier from comment #13)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1917654
>
> I can't access that webpage, it says "You are not authorized to access bug
> #1917654. To see this bug, you must first log in to an account with the
> appropriate permissions."

Sorry, I didn't check that it was not public.

> Please either make that bug accessible to everyone or open a new bug on
> bugzilla.kernel.org and attach full dmesg and lspci -vv output. Thanks.

I put the information here because I bisected the regression to the patch that fixed this bug. I will open a new bug after your first comments on that.

The context is:

A virtual machine with a VFIO device cannot be migrated. To migrate a VM with a VFIO device, we unplug the card and then replug it after migration. To avoid a networking interruption, the VFIO card is put in a failover set with a virtio-net device: when the migration begins, the VFIO card is automatically unplugged and the network switches to the virtio-net device; on the destination, the VFIO card is automatically plugged back in and the network switches back to the VFIO device.

On VM boot, the VFIO card is only plugged into the VM if the virtio-net driver negotiates the VIRTIO_NET_F_STANDBY feature. This means the VFIO card is hotplugged by the hypervisor (QEMU) while the kernel is executing virtnet_probe().

But since commit 80696f991424 "PCI: pciehp: Tolerate Presence Detect hardwired to zero" it doesn't work anymore.
Normally, during the boot sequence, we should have something like:

[ 4.528949] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[ 4.530470] pcieport 0000:00:02.2: pciehp: Slot(0-2) Powering on due to buttons
[ 4.532148] pcieport 0000:00:02.2: pciehp: Slot(0-2): Card present
[ 4.533380] pcieport 0000:00:02.2: pciehp: Slot(0-2): Link Up
[ 4.551226] virtio_net virtio1 eth0: failover master:eth0 registered
[ 4.556881] virtio_net virtio1 eth0: failover standby slave:eth1 registered
[ 4.906101] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered

But now we have:

[ 5.256937] pcieport 0000:00:02.2: pciehp: Slot(0-2): Attention button pressed
[ 5.258389] pcieport 0000:00:02.2: pciehp: Slot(0-2): Powering off due to button press
[ 5.414381] pcieport 0000:00:02.6: pciehp: Slot(0-6): No device found
[ 5.415870] pcieport 0000:00:02.4: pciehp: Slot(0-4): No device found
[ 5.477205] virtio_net virtio1 enp2s0: failover primary slave:eth0 registered
[ 10.456811] virtio_net virtio1 enp2s0: failover primary slave:enp3s0 unregistered

QEMU sends an "Attention Button Pressed" event with the "Presence Detect Changed" flag set when the card is hotplugged; it seems the kernel doesn't correctly detect the power state.

I will attach the result of lspci -vv to the bug.
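To compare the two boots quickly, the power decision pciehp took for each slot can be pulled out of the log. This is a rough sketch, not a robust parser: it only assumes the "Powering on/off due to ..." wording seen in the two excerpts above, and the function name `pciehp_power_decisions` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: read kernel log text on stdin and print "<slot> on" or
# "<slot> off" for every pciehp "Powering on/off due to ..." line.
# The colon after "Slot(...)" is treated as optional because the two log
# excerpts in this report differ on that point.
pciehp_power_decisions() {
    sed -nE 's/.*pciehp: Slot\(([^)]*)\):? Powering (on|off) due to.*/\1 \2/p'
}

# Example:
#   dmesg | pciehp_power_decisions
```

With the good log this reports slot 0-2 as powered on, and with the bad log as powered off, which is the difference described above.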
The QEMU command to reproduce the bug is (DEVICE is the VFIO device with a Virtual Function, IMAGE the VM image):

-----8<---------------------------------------------------------------
IMAGE=rhel84.qcow2
MACADDR="22:2b:62:bb:a9:82"
DEVICE="0000:06:00.0"

modprobe vfio_iommu_type1
modprobe vfio-pci

DEVPATH="/sys/bus/pci/devices/$DEVICE"
NET=$(ls $DEVPATH/net)
VF=$(basename $(readlink $DEVPATH/virtfn0))
PCIIDS=$(lspci -ns $VF|cut -d' ' -f3|awk -F':' '{ print $1" "$2 }')

# disable VFS
echo 0 > $DEVPATH/sriov_numvfs
#enable 1
echo 1 > $DEVPATH/sriov_numvfs

echo "$VF" > $DEVPATH/virtfn0/driver/unbind
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "$PCIIDS" > /sys/bus/pci/drivers/vfio-pci/remove_id

ip link set $NET vf 0 mac "$MACADDR"

qemu-system-x86_64 -name rhel84 \
  -M q35 \
  -enable-kvm \
  -nodefaults \
  -m 4G \
  -smp 2 \
  -cpu host \
  -nographic \
  -device pcie-root-port,id=root.1,chassis=1,addr=0x2.0,multifunction=on \
  -device pcie-root-port,id=root.2,chassis=2,addr=0x2.1 \
  -device pcie-root-port,id=root.3,chassis=3,addr=0x2.2 \
  -device pcie-root-port,id=root.4,chassis=4,addr=0x2.3 \
  -device pcie-root-port,id=root.5,chassis=5,addr=0x2.4 \
  -device pcie-root-port,id=root.6,chassis=6,addr=0x2.5 \
  -device pcie-root-port,id=root.7,chassis=7,addr=0x2.6 \
  -device pcie-root-port,id=root.8,chassis=8,addr=0x2.7 \
  -blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=$IMAGE,aio=threads \
  -blockdev node-name=drive-virtio-disk0,driver=qcow2,cache.direct=on,cache.no-flush=off,file=back_image \
  -device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root.1 \
  -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=$MACADDR,bus=root.2,failover=on \
  -device vfio-pci,host=$VF,id=hostdev0,bus=root.3,failover_pair_id=net0 \
  -monitor stdio \
  -chardev socket,id=console0,server=on,telnet=on,host=0.0.0.0,port=1234 \
  -serial chardev:console0
-----8<---------------------------------------------------------------

I think there is a race condition in the kernel PCI code, because if I delay the card hotplug by 2 seconds after the virtio-net feature negotiation, it works fine.
Created attachment 295195 [details]
lspci -vv when the card is correctly detected

The device is at address 03:00.0.
Please create a new bugzilla entry and add full dmesg output with and without 80696f991424. Please add the following to the command line: pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
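Before collecting dmesg, it can help to confirm that the running kernel was actually booted with the two parameters requested above. A minimal sketch follows, assuming the parameters show up in `/proc/cmdline`; the exact quoting around the `dyndbg` value may depend on the boot loader, so the quote is treated as optional. The function name `has_pciehp_debug_params` and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed if the given kernel command line (default:
# /proc/cmdline) contains both debug parameters requested above.
# The optional file argument is an assumption added for testing.
has_pciehp_debug_params() {
    cmdline_file=${1:-/proc/cmdline}

    # The double quote after "dyndbg=" is optional in the pattern because
    # boot loaders may pass the quoting through differently.
    grep -q 'pciehp\.pciehp_debug=1' "$cmdline_file" &&
        grep -q 'dyndbg="\{0,1\}file pciehp\* +p' "$cmdline_file"
}

# Example:
#   has_pciehp_debug_params && echo "pciehp debug parameters present"
```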
(In reply to Lukas Wunner from comment #17)
> Please create a new bugzilla entry and add full dmesg output with and
> without 80696f991424.
>
> Please add the following to the command line:
> pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"

New bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211691