Bug 215777

Summary: ath11k: 5.17.x causes cpu soft-lock on xps13 9310
Product: Drivers Reporter: D.F. (dominik.foerderer)
Component: network-wireless Assignee: Kalle Valo (kvalo)
Status: RESOLVED CODE_FIX    
Severity: high CC: dominik.foerderer, kvalo, mail
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.17.1 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Patch to revert commit 9dcf6808b253a72b2c90eed179863bf5fab7d68c ath11k: add 11d scan offload support on kernel 5.17

Description D.F. 2022-03-30 11:06:24 UTC
Since kernel 5.17.x my XPS 13 9310 freezes shortly after boot with a CPU soft-lock. After about one minute the load is above 25 and the system is unusable. It's not possible to get any output from dmesg because the system freezes too fast. There are no problems at all with kernel 5.16.x.

Journalctl shows the following message:

    journalctl -b -1 | grep ath11k
Journal file /var/log/journal/fe844f53b6d94c76aaf15372673affbd/system@0005da31c958de81-7f691367269c7061.journal~ is truncated, ignoring file.
Mär 30 12:29:32 foer-manjaro kernel: ath11k_pci 0000:72:00.0: BAR 0: assigned [mem 0xa2500000-0xa25fffff 64bit]
Mär 30 12:29:32 foer-manjaro kernel: ath11k_pci 0000:72:00.0: enabling device (0000 -> 0002)
Mär 30 12:29:32 foer-manjaro kernel: ath11k_pci 0000:72:00.0: MSI vectors: 32
Mär 30 12:29:32 foer-manjaro kernel: ath11k_pci 0000:72:00.0: qca6390 hw2.0
Mär 30 12:29:33 foer-manjaro kernel: ath11k_pci 0000:72:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
Mär 30 12:29:33 foer-manjaro kernel: ath11k_pci 0000:72:00.0: fw_version 0x10121492 fw_build_timestamp 2021-11-04 11:23 fw_build_id 
Mär 30 12:29:33 foer-manjaro kernel: ath11k_pci 0000:72:00.0 wlp114s0: renamed from wlan0
Mär 30 12:29:35 foer-manjaro NetworkManager[1542]: <info>  [1648636175.0959] rfkill1: found Wi-Fi radio killswitch (at /sys/devices/pci0000:00/0000:00:1c.0/0000:72:00.0/ieee80211/phy0/rfkill1) (driver ath11k_pci)

I found out that blacklisting the ath11k_pci module resolves this behavior: the system boots normally and a login is possible. Loading ath11k_pci manually afterwards works in about 75% of cases. In the other cases the system also freezes with a CPU soft-lock.
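
For reference, the blacklist workaround described above can be sketched as a small helper. This is only an illustration: the function name write_ath11k_blacklist and the file name ath11k-blacklist.conf are made up for this sketch, and the target directory is a parameter so it can be tried without touching /etc/modprobe.d.

```shell
# Hypothetical helper (not from the report): writes a modprobe blacklist
# entry for ath11k_pci into the given directory.
write_ath11k_blacklist() {
    dir=${1:?usage: write_ath11k_blacklist DIR}
    mkdir -p "$dir" || return 1
    # "blacklist <module>" only stops automatic loading by alias; an explicit
    # "modprobe ath11k_pci" still loads the module.
    printf 'blacklist ath11k_pci\n' > "$dir/ath11k-blacklist.conf"
}

# Try it in a scratch directory. On a real system the directory would be
# /etc/modprobe.d (root required), followed by a reboot.
scratch=$(mktemp -d)
write_ath11k_blacklist "$scratch"
cat "$scratch/ath11k-blacklist.conf"
```

After login, the driver can then be loaded by hand with "sudo modprobe ath11k_pci".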

Here is some additional information:

- Distribution: Manjaro Gnome
- Kernel: 5.17.1 (Manjaro) / Mainline Kernel with Manjaro config
- Machine:
  Type: Laptop System: Dell product: XPS 13 9310 
  Mobo: Dell model: 07CVRK v: A00 
  UEFI: Dell v: 3.5.1 date: 02/25/2022
  CPU: quad core 11th Gen Intel Core i7-1165G7 (-MT MCP-)
- lspci -mnn | grep QCA
   72:00.0 "Unassigned class [ff00]" "Qualcomm [17cb]" "QCA6390 Wireless Network Adapter [AX500-DBS (2x2)] [1101]" "Rivet Networks [1a56]" "Device [a501]"
Comment 1 Kalle Valo 2022-03-30 12:43:32 UTC
I have Dell XPS 13 9310 as my daily driver and I use my ath.git tree's master branch on it (which basically means latest -rc release from Linux plus latest wireless patches, including ath11k). I don't have any issues on that laptop. Although I use Debian 11 and connman, not network-manager. I also quickly tried v5.17 release (commit f443e374ae13) and no issues so far, but I will keep using that kernel.

Can you provide more information? Do you compile the kernels on your own? When you say 5.17.x do you mean the v5.17 release from Linus? A git bisect would be the best way to understand what commit is causing this, but it's a lengthy process. Another option is to enable kernel debug facilities, most likely they would give some clues. The last message is about rfkill, maybe that has something to do with this?
Comment 2 D.F. 2022-03-30 17:23:08 UTC
I first saw this issue with kernel 5.17-rc1. I also tried kernel 5.17-rc6, which shows the same weird behavior. I had hoped that this would be fixed in stable 5.17, but sadly it is not.

I use the standard Manjaro kernel (https://gitlab.manjaro.org/packages/core/linux517), which is mostly mainline.

Sadly I don't really have time in the next few days to debug this problem and find the problematic commit. Besides, I'm afraid I don't have the necessary skills to do that.

There is a commit (ec038c6127fa772d2c5604e329f22371830d5fa6 ath11k: add support for hardware rfkill for QCA6390) which has something to do with rfkill, so I will give it a try and build a kernel with this patch reverted. I will report back tomorrow.

At the moment I use a self-built kernel with the Manjaro configuration (currently 5.16.18) plus the ath11k powersave-in-station-mode commit, and my system is absolutely stable.
Comment 3 D.F. 2022-03-30 18:46:57 UTC
Update: 
With my skills it was not possible to revert commit ec038c6127fa772d2c5604e329f22371830d5fa6 (ath11k: add support for hardware rfkill for QCA6390) and successfully build a kernel.

What else can I do to help debug this issue?
Comment 4 D.F. 2022-03-31 07:01:29 UTC
Update2:
I tried your ath.git tree's master branch and that's also not working. It's the same behavior as described above. Booting with the ath11k_pci module blacklisted works without any problems. Manually loading ath11k_pci afterwards works in 10 out of 10 tries and Wi-Fi is usable without any issue.
Comment 5 D.F. 2022-03-31 07:23:03 UTC
Update3: 
Because you mentioned that you use connman, I disabled NetworkManager and installed connman. With this setup the problem is gone and the system is fully usable with kernel 5.17. So there must be an issue in the combination of kernel 5.17, ath11k, and NetworkManager.
Comment 6 Sven 2022-04-03 12:20:51 UTC
Hi Dominik,
it seems that I had the same problem. After upgrading to kernel 5.17, my system froze after boot. However, I can't reproduce it since I upgraded my firmware to version 3.5.1 with fwupdmgr.

Which *System Firmware* version do you run on your XPS?

You can check it with: fwupdmgr get-devices.
Comment 7 Sven 2022-04-03 12:31:35 UTC
Oh, I've just seen that you're already running 3.5.1. Then I can not really say why the problem no longer occurs on my system.
Comment 8 D.F. 2022-04-03 18:04:09 UTC
Hi Sven. Hi Kalle.
I spent several hours this weekend debugging this problem and it is really frustrating. Nothing I tried was successful. I compared every log message between the unproblematic boot with kernel 5.16.x and the crashing kernel 5.17.x, but found nothing that points to the cause of the problem.
The only two things that are obvious are:

1. blacklisting ath11k_pci on boot and manually enabling after login works
2. deactivating NetworkManager.service on boot and enabling it after login also works

@Sven: Which Linux distribution and desktop are you using? Do you use NetworkManager?

What is the output of dmesg | grep ath11k while booting kernel 5.17?
Comment 9 Kalle Valo 2022-04-04 07:55:39 UTC
(In reply to dominik.foerderer from comment #8)
> I have spend several hours this weekend debugging this problem and it is
> really frustrating. Nothing I have tried was successful. I compared every
> log-message between the unproblematic boot with kernel 5.16.x and the
> crashing kernel 5.17.x...nothing to find what points me to the background of
> the problem. 

It's difficult to know what you have tested as you only say "5.16.x" and "5.17.x". Please be specific with the releases and don't use letter x at all. The best is to use commit ids, that way there's no confusion. And also always mention if you are using Linus' releases, stable releases or releases built by your distro. For tracking regressions like this it's best to use Linus' tree and forget other releases altogether.
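
As an illustration of reporting an exact version, the details could be collected with a small helper like the one below. This is a sketch only: the function name describe_kernel is hypothetical, and the optional argument is assumed to be the path of a kernel git checkout.

```shell
# Hypothetical helper: prints the details worth quoting in a bug report.
describe_kernel() {
    # Release string of the running kernel.
    printf 'running kernel: %s\n' "$(uname -r)"
    # If a git checkout is given, also print the exact commit and nearest tag.
    if [ -n "${1:-}" ] && git -C "$1" rev-parse --git-dir >/dev/null 2>&1; then
        printf 'commit: %s\n' "$(git -C "$1" rev-parse --short=12 HEAD)"
        printf 'describe: %s\n' "$(git -C "$1" describe --tags --always)"
    fi
}

describe_kernel
```

Calling it with the path of the source tree the kernel was built from adds the commit id and tag to the output.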

I have been using the v5.17 release from Linus (commit f443e374ae13) all weekend and I have not seen any issues. The best course of action is the time-consuming git bisect:

https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

If you want to speed up the bisect you could limit yourself to testing only ath11k commits:

git bisect start -- drivers/net/wireless/ath/ath11k

But limiting the bisect is not without risk: if the regression is outside of ath11k you will not notice it.
Comment 10 D.F. 2022-04-04 19:12:19 UTC
Thank you for your patience with me.

I have done what you suggested. I took Linus' tree (commit f443e374ae13) and started a bisect with v5.16 as good and v5.17-rc1 as bad. That way I found the problematic commit. It's commit 9dcf6808b253a72b2c90eed179863bf5fab7d68c (ath11k: add 11d scan offload support). With this commit reverted the problem is gone.

Here is my bisect-log:

git bisect log
git bisect start '--' 'drivers/net/wireless/ath/ath11k'
# good: [df0cc57e057f18e44dac8e6c18aba47ab53202f9] Linux 5.16
git bisect good df0cc57e057f18e44dac8e6c18aba47ab53202f9
# bad: [e783362eb54cd99b2cac8b3a9aeac942e6f6ac07] Linux 5.17-rc1
git bisect bad e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
# good: [e94b07493da31705c3fdd0b2854f0cffe1dacb3c] ath11k: Set IRQ affinity to CPU0 in case of one MSI vector
git bisect good e94b07493da31705c3fdd0b2854f0cffe1dacb3c
# bad: [d3d358efc553de4f9d803c889a2e91523ea90f19] ath11k: add spectral/CFR buffer validation support
git bisect bad d3d358efc553de4f9d803c889a2e91523ea90f19
# good: [d1147a316b53df9cb0152e415ec41dcb6ea62c1c] ath11k: add support for WCN6855 hw2.1
git bisect good d1147a316b53df9cb0152e415ec41dcb6ea62c1c
# bad: [dddaa64d0af37275314a656bd8f8e941799e2d61] ath11k: add wait operation for tx management packets for flush from mac80211
git bisect bad dddaa64d0af37275314a656bd8f8e941799e2d61
# good: [ed05c7cf1286d7e31e7623bce55ff135723591bf] ath11k: avoid deadlock by change ieee80211_queue_work for regd_update_work
git bisect good ed05c7cf1286d7e31e7623bce55ff135723591bf
# bad: [9dcf6808b253a72b2c90eed179863bf5fab7d68c] ath11k: add 11d scan offload support
git bisect bad 9dcf6808b253a72b2c90eed179863bf5fab7d68c
# good: [0b05ddad8e4bd56bda42b9dc491c1b127720f063] ath11k: add configure country code for QCA6390 and WCN6855
git bisect good 0b05ddad8e4bd56bda42b9dc491c1b127720f063
# first bad commit: [9dcf6808b253a72b2c90eed179863bf5fab7d68c] ath11k: add 11d scan offload support
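
For anyone who wants to pull the result out of a log like the one above, a minimal sketch follows. The helper name first_bad_commit is hypothetical; it just reads a bisect log on standard input and prints the hash from the "first bad commit" line. The two sample lines are copied from the log above.

```shell
# Hypothetical helper: print the hash from the "# first bad commit" line
# of a "git bisect log" read from standard input.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{1,\}\)\].*/\1/p'
}

printf '%s\n' \
    '# good: [0b05ddad8e4bd56bda42b9dc491c1b127720f063] ath11k: add configure country code for QCA6390 and WCN6855' \
    '# first bad commit: [9dcf6808b253a72b2c90eed179863bf5fab7d68c] ath11k: add 11d scan offload support' \
    | first_bad_commit
```

A saved log can also be fed to "git bisect replay <logfile>" to reproduce the same bisect state in another checkout.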
Comment 11 D.F. 2022-04-05 06:16:31 UTC
To be a little more precise: the problem is absent at the commit just before the bad commit. I was not able to revert this single commit (9dcf6808b253a72b2c90eed179863bf5fab7d68c) in Linus' 5.17 tree, as the kernel won't build after that.
Comment 12 D.F. 2022-04-05 17:13:38 UTC
Created attachment 300703 [details]
Patch to revert commit 9dcf6808b253a72b2c90eed179863bf5fab7d68c ath11k: add 11d scan offload support on kernel 5.17
Comment 13 D.F. 2022-04-05 17:39:26 UTC
In the meantime I was able to create a patch which reverts commit
"9dcf6808b253 ath11k: add 11d scan offload support"
on kernel 5.17. I have attached the patch in comment 12.

I have tested kernel 5.17 from Linus' tree (commit f443e374ae13) and 5.17.1 from the stable tree (commit 59db887d13b3) with the revert patch applied, and the problems described above are gone.
Comment 14 Kalle Valo 2022-04-11 09:23:54 UTC
(In reply to dominik.foerderer from comment #4)
> Update2:
> I tried your ath.git tree's master branch and that's also not working. It's
> the same behavior as described above. Booting with blacklisted ath11k_pci
> module works without any problems. Manually loading ath11k_pci afterwards
> works in 10 from 10 tries and wifi is usable without any issue.

What commit id from ath.git branch did you test?

I talked with Wen and he said that this commit should fix it:

ath11k: reduce the wait time of 11d scan and hw scan while add interface

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/drivers/net/wireless/ath/ath11k?id=1f682dc9fb3790aa7ec27d3d122ff32b1eda1365

Could you test that commit, please? Latest ath.git master branch (commit id 607c3dc27503 and tag ath-202204060834) contains that commit.
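
Whether a given tree already contains a fix can be checked with git itself. The following is a sketch under assumptions: the helper name tree_contains_commit is made up, and the demonstration runs on a throwaway repository with two empty commits rather than on ath.git.

```shell
# Hypothetical helper: succeeds if commit $2 is reachable from $3 in repo $1.
tree_contains_commit() {
    git -C "$1" merge-base --is-ancestor "$2" "$3"
}

# Toy repository for demonstration only.
demo=$(mktemp -d)
git init -q "$demo" 2>/dev/null
demo_commit() {
    git -C "$demo" -c user.name=demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
demo_commit "fix"
fix=$(git -C "$demo" rev-parse HEAD)
demo_commit "later work"

if tree_contains_commit "$demo" "$fix" HEAD; then
    echo "HEAD contains the fix"
else
    echo "HEAD is missing the fix"
fi
```

In an ath.git checkout, the equivalent check would be "tree_contains_commit . 1f682dc9fb37 607c3dc27503".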
Comment 15 D.F. 2022-04-11 18:40:05 UTC
(In reply to Kalle Valo from comment #14)

> What commit id from ath.git branch did you test?

It was commit e3fd86d89535 Tag ath-202203281150, which was the most recent one that day. 

> Could you test that commit, please? Latest ath.git master branch (commit id
> 607c3dc27503 and tag ath-202204060834) contains that commit.

I have now tested the ath.git master branch (commit id 607c3dc27503 and tag ath-202204060834) and tried about 20 boots with different power settings. I also used the notebook for 2 hours with a normal workload and it seems the problem is gone.

I also applied the commit "ath11k: reduce the wait time of 11d scan and hw scan while add interface" (id 1f682dc9fb37) to the stable tree (5.17.2, commit id 70a10e90d47f) and tested it for a while; the problem also seems to be solved there.

I haven't tried the commit on Linus' tree, but it can be assumed that it works there too.

To sum up, this commit probably needs to go to the stable tree (branch 5.17.y) as well as to Linus' tree before 5.18 is finished.
Comment 16 Kalle Valo 2022-04-13 15:47:54 UTC
(In reply to dominik.foerderer from comment #15)

Thank you for the very thorough test report! This makes my job so much easier.

I'll submit the patch to v5.18 and the stable team should then pick it up for their v5.17 releases. But I'll do that next week, going for vacation now :)
Comment 19 D.F. 2022-04-29 06:04:29 UTC
Thank you very much!