Bug 217056

Summary: ath11k: QCN9074: iommu problems with Thunderbolt
Product: Drivers Reporter: Andrej Podzimek (andrej)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: NEEDINFO ---    
Severity: high CC: andrej, kvalo
Priority: P4    
Hardware: All   
OS: Linux   
Kernel Version: 6.1.11 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel log (dmesg) during device association
hostapd log (timeout) during device association
lspci -kvvv entry
lspci -tv
boltctl
hostapd.conf (uncommented lines only)
iw list (parts pertaining to this particular device)
dmesg when ath11k_pci is loaded

Description Andrej Podzimek 2023-02-18 18:37:09 UTC
Created attachment 303752 [details]
kernel log (dmesg) during device association

I have a “Qualcomm Technologies, Inc QCN6024/9024/9074 Wireless Network Adapter (rev 01)” connected using a Thunderbolt adapter. Running hostapd with it appears to work fine, but all client authentication requests fail. This correlates with failures and page faults reported in dmesg and hostapd transmission retries / timeouts.
Comment 1 Andrej Podzimek 2023-02-18 18:38:31 UTC
Created attachment 303753 [details]
hostapd log (timeout) during device association
Comment 2 Andrej Podzimek 2023-02-18 18:42:31 UTC
Created attachment 303754 [details]
lspci -kvvv entry
Comment 3 Andrej Podzimek 2023-02-18 18:42:59 UTC
Created attachment 303755 [details]
lspci -tv
Comment 4 Andrej Podzimek 2023-02-18 18:43:30 UTC
Created attachment 303756 [details]
boltctl
Comment 5 Andrej Podzimek 2023-02-18 18:44:35 UTC
Created attachment 303757 [details]
hostapd.conf (uncommented lines only)
Comment 6 Andrej Podzimek 2023-02-18 18:47:12 UTC
Further notes:
The same problem occurs with crypto_mode=0 and also crypto_mode=1 (hardware vs software).
Tried around 100 configurations (removing various 802.11ac and 802.11ax options, trying different channels, minimizing hostapd.conf etc.), but the problem stayed exactly the same.
The symptom is that no client can connect, although the network is visible and can be found in scans.
Comment 7 Andrej Podzimek 2023-02-18 18:48:39 UTC
Created attachment 303758 [details]
iw list (parts pertaining to this particular device)
Comment 8 Andrej Podzimek 2023-02-18 18:53:35 UTC
Created attachment 303759 [details]
dmesg when ath11k_pci is loaded

The device firmware comes from the package 20230210.bf4115c-1 on ArchLinux.
It probably consists of the files in /lib/firmware/ath11k/QCN9074/hw1.0.
Comment 9 Andrej Podzimek 2023-02-18 18:58:49 UTC
Posting the kernel and hostapd log also outside attachments (for search engines to find). This is (what appears to be) the ath11k_pci driver bug exposed on a hostapd-based server during WiFi client association + authentication:

Feb 18 19:25:15 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0xb2a000080 flags=0x0020]
Feb 18 19:25:15 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x483040300 flags=0x0020]
Feb 18 19:25:15 kernel: ath11k_pci 0000:3d:00.0: frame rx with invalid buf_id 0
Feb 18 19:25:18 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000a0 flags=0x0020]
Feb 18 19:25:18 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x483040340 flags=0x0020]
Feb 18 19:25:18 kernel: ath11k_pci 0000:3d:00.0: frame rx with invalid buf_id 0
Feb 18 19:25:24 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000c0 flags=0x0020]
Feb 18 19:25:24 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x483040380 flags=0x0020]
Feb 18 19:25:24 kernel: ath11k_pci 0000:3d:00.0: frame rx with invalid buf_id 0
Feb 18 19:25:36 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000e0 flags=0x0020]
Feb 18 19:25:36 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x4830403c0 flags=0x0020]
Feb 18 19:25:36 kernel: ath11k_pci 0000:3d:00.0: frame rx with invalid buf_id 0
Feb 18 19:25:56 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e00100 flags=0x0020]
Feb 18 19:26:16 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e00120 flags=0x0020]

The same event as seen by hostapd, which at some point is unable to communicate with (or get a response from) the device:

Feb 18 19:25:15 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: authenticated
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: authenticated
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: STA-OPMODE-N_SS-CHANGED d4:3a:2c:b7:37:13 2
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: associated (aid 1)
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: associated (aid 1)
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-STARTED d4:3a:2c:b7:37:13
Feb 18 19:25:15 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-PROPOSED-METHOD vendor=0 method=1
Feb 18 19:25:18 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:24 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:56 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:16 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-TIMEOUT-FAILURE d4:3a:2c:b7:37:13
Feb 18 19:26:41 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: deauthenticated due to local deauth request
Feb 18 19:26:41 hostapd[3702212]: charonwifi1: STA d4:3a:2c:b7:37:13 IEEE 802.11: deauthenticated due to local deauth request
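
A rough way to check that the page faults and the EAP retransmits line up is to compare the syslog timestamps of the two logs above. The following Python sketch does that for a shortened excerpt (one fault line per timestamp, copied verbatim from above). The names `TIMESTAMP` and `count_by_timestamp` are made up for this illustration, and matching at one-second granularity is an assumption, not a proof of causality.

```python
import re
from collections import Counter

# One IO_PAGE_FAULT line per timestamp, copied from the kernel log above.
KERNEL_LOG = """\
Feb 18 19:25:15 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0xb2a000080 flags=0x0020]
Feb 18 19:25:18 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000a0 flags=0x0020]
Feb 18 19:25:24 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000c0 flags=0x0020]
Feb 18 19:25:36 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e000e0 flags=0x0020]
Feb 18 19:25:56 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e00100 flags=0x0020]
Feb 18 19:26:16 kernel: ath11k_pci 0000:3d:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x481e00120 flags=0x0020]
"""

# EAP retransmit and timeout lines, copied from the hostapd log above.
HOSTAPD_LOG = """\
Feb 18 19:25:18 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:24 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:25:56 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:16 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-RETRANSMIT d4:3a:2c:b7:37:13
Feb 18 19:26:36 hostapd[3702212]: charonwifi1: CTRL-EVENT-EAP-TIMEOUT-FAILURE d4:3a:2c:b7:37:13
"""

# Leading syslog timestamp, e.g. "Feb 18 19:25:15".
TIMESTAMP = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) ")


def count_by_timestamp(log, needle):
    """Count the lines containing `needle`, keyed by their syslog timestamp."""
    counts = Counter()
    for line in log.splitlines():
        match = TIMESTAMP.match(line)
        if match and needle in line:
            counts[match.group(1)] += 1
    return counts


faults = count_by_timestamp(KERNEL_LOG, "IO_PAGE_FAULT")
retransmits = count_by_timestamp(HOSTAPD_LOG, "CTRL-EVENT-EAP-RETRANSMIT")

# Retransmits that happen in the same second as a page fault, and those that do not.
shared = sorted(set(retransmits) & set(faults))
unmatched = sorted(set(retransmits) - set(faults))

print("retransmits coinciding with a page fault:", shared)
print("retransmits without a page fault:", unmatched)
```

For this excerpt, five of the six retransmits share a second with an IO_PAGE_FAULT; only the last one at 19:26:36 does not.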

Devices can see hostapd’s SSID and (attempt to) associate with it. So the transmitters on the device do have their required extra 5V power, the device itself works in general and is identified as WiFi6 etc. If it weren’t for the page fault, it would most likely just work.
Comment 10 Andrej Podzimek 2023-02-18 19:16:43 UTC
Details about the device: https://compex.com.sg/shop/wifi-module/802-11ax-wifi-module/pn02-1-wifi6-11ax-qcn6024-qcn9024-qcn9074/ (However, this is the later version of the Compex PN02.1 with a heatsink included and with a slightly smaller PCB (not sure if that matters).)

The device has its additional 5V power supply from the host’s PSU, as required by the specs.

The connection goes via a Thunderbolt —> NGFF —> M-key —> E-key “adapter chain”.

The Thunderbolt host is a desktop described (e.g.) here:
https://bbs.archlinux.org/viewtopic.php?id=261303
https://bbs.archlinux.org/viewtopic.php?id=283471

Considering this↑↑↑ context, it might be the case that a Thunderbolt-related issue is (also) to blame here and/or that this bug only occurs when the device is connected over Thunderbolt.
Comment 11 Andrej Podzimek 2023-02-19 01:12:23 UTC
Tried some “no IOMMU” workaround “wisdom” from the web:

iommu=soft               — no effect; AMD IOMMU works as usual
amd_iommu=off            — won’t boot
amd_iommu=off iommu=soft — won’t boot
iommu=pt                 — won’t boot

So I’m guessing that disabling the IOMMU is not an option on this system.
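
To double-check which of these options the running kernel actually received, and whether the device still sits behind the IOMMU, something like the following Python sketch could be used. It assumes the usual sysfs layout, where `/sys/bus/pci/devices/<BDF>/iommu_group` is a symlink to the device's IOMMU group. The helper names `iommu_params` and `iommu_group` are invented for this example, and the sample command line is illustrative, not taken from this system.

```python
from pathlib import Path


def iommu_params(cmdline):
    """Extract IOMMU-related key=value options from a kernel command line."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("iommu", "amd_iommu", "intel_iommu"):
            params[key] = value
    return params


def iommu_group(bdf, sysfs=Path("/sys")):
    """Return the IOMMU group name of a PCI device, or None if it has none."""
    link = sysfs / "bus" / "pci" / "devices" / bdf / "iommu_group"
    if not link.exists():
        return None
    # The symlink points at /sys/kernel/iommu_groups/<group>.
    return link.resolve().name


# Illustrative command line (an assumption, not copied from the affected host).
sample = "root=/dev/sda1 rw iommu=soft quiet"
print(iommu_params(sample))
```

On the affected host, `iommu_params(Path("/proc/cmdline").read_text())` would show what the kernel was booted with, and `iommu_group("0000:3d:00.0")` would return a group name as long as the device is still managed by the IOMMU.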
Comment 12 Kalle Valo 2023-03-08 09:02:04 UTC
What is the exact kernel version you are using?

The firmware info from the attachments is this:

Feb 18 19:16:29 kernel: ath11k_pci 0000:3d:00.0: BAR 0: assigned [mem 0xe0a00000-0xe0bfffff 64bit]
Feb 18 19:16:29 kernel: ath11k_pci 0000:3d:00.0: MSI vectors: 16
Feb 18 19:16:29 kernel: ath11k_pci 0000:3d:00.0: qcn9074 hw1.0
Feb 18 19:16:30 kernel: ath11k_pci 0000:3d:00.0: chip_id 0x0 chip_family 0x0 board_id 0xff soc_id 0xffffffff
Feb 18 19:16:30 kernel: ath11k_pci 0000:3d:00.0: fw_version 0x2506844c fw_build_timestamp 2021-07-13 10:24 fw_build_id 

Unfortunately the firmware does not provide the version string but the date looks old. Please try the latest firmware from here:

https://github.com/kvalo/ath11k-firmware/tree/master/QCN9074/hw1.0

Also, can you try on another system without Thunderbolt? This would help to rule out a problem with the setup. To me this looks like an iommu problem.
Comment 13 Kalle Valo 2024-01-10 09:54:24 UTC
Hopefully this patch fixes the issue:

https://patchwork.kernel.org/project/linux-wireless/patch/20231212031914.47339-1-imguzh@gmail.com/

Please let us know if you are able to test it.