Bug 201713

Summary: iwlwifi: 9260: ASSERT 28AA leading to RCU CPU Stalls
Product: Drivers Reporter: mcline.psp
Component: network-wireless Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX    
Severity: blocking CC: mcline.psp
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.19.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: Full dmesg log that shows the frequency of firmware crashes.
My Gentoo kernel config for Dell XPS 15 9570
Dmesg log showing unhandled alg iwlwifi error.

Description mcline.psp 2018-11-17 08:09:47 UTC
Created attachment 279483 [details]
Full dmesg log that shows the frequency of firmware crashes.

Hello,

#### 1.0 - Brief description of the problem
On my Dell XPS 15 9570 with an Intel Wireless-AC 9260 card, running Gentoo with the latest version of the Pf-kernel, the iwlwifi (iwlmvm) firmware continuously crashes when loaded by the kernel, leading to an RCU CPU stall that renders my machine unresponsive until it is restarted.

#### 2.0 - Information about my system
- Computer Model: Dell XPS 15 9570
- GNU/Linux Distro: Gentoo
- **Not** dual-booting windows
- Processor Information: Intel(R) Core(TM) i7-8750H
- Kernel Information: [4.19.0-pf4 #23 SMP PREEMPT](https://gitlab.com/post-factum/pf-kernel/wikis/README)
- Wireless Card: [Intel Corporation Wireless-AC 9260 (rev 29)](https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/dual-band-wireless-ac-9260-brief.pdf)
- Iwlwifi Kernel Module Information:
 - Full Ucode Name: iwlwifi-9260-th-b0-jf-b0-41.ucode
 - Firmware Version: 38.755cfdd8.0
- Wifi Software: Connman and WPA Supplicant

#### 3.0 - Detailed description of the problem
I have a Dell XPS 15 9570 with an Intel Wireless-AC 9260 card, running Gentoo with the latest version of the Pf-kernel. Whenever the kernel attempts to load the firmware for the wireless card, the firmware crashes producing the following error in the dmesg log:

> Microcode SW error detected. Restarting 0x0.
> iwlwifi 0000:3b:00.0: Start IWL Error Log Dump:
> iwlwifi 0000:3b:00.0: Status: 0x00000100, count: 6
> iwlwifi 0000:3b:00.0: Loaded firmware version: 38.755cfdd8.0
> iwlwifi 0000:3b:00.0: 0x000028AA | ADVANCED_SYSASSERT
>
> [snip]
>
> ieee80211 phy0: Hardware restart was requested

From the dmesg log and the documentation [here](https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging), we see that this message indicates that the firmware for the wireless card is crashing. When the firmware crashes, the kernel appears to restart the wireless card. Once the card is restarted, the kernel loads the firmware again and the firmware crashes again. The kernel therefore gets stuck in a crash-restart-crash loop that never terminates. After one iteration of this loop (i.e., the firmware crashes, the device is restarted, the firmware crashes again), an RCU CPU stall occurs, indicated by the following error in dmesg:
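To quantify how often the loop repeats, a saved dmesg log can be scanned for the markers quoted above. The following is only a rough sketch: the function name `summarize_crashes` and the embedded sample are illustrative, and the marker strings are copied from the excerpt in this report rather than from any driver documentation.

```python
import re

# Marker strings copied from the dmesg excerpt in this report.
CRASH_MARKER = "Microcode SW error detected"
RESTART_MARKER = "Hardware restart was requested"
ASSERT_RE = re.compile(r"0x([0-9A-Fa-f]{8}) \| ADVANCED_SYSASSERT")


def summarize_crashes(dmesg_text):
    """Count firmware crashes and restarts, and collect the assert codes."""
    crashes = 0
    restarts = 0
    asserts = []
    for line in dmesg_text.splitlines():
        if CRASH_MARKER in line:
            crashes += 1
        if RESTART_MARKER in line:
            restarts += 1
        match = ASSERT_RE.search(line)
        if match:
            asserts.append(match.group(1).upper())
    return {"crashes": crashes, "restarts": restarts, "asserts": asserts}


# Illustrative sample assembled from the lines quoted in this report.
sample = """\
iwlwifi 0000:3b:00.0: Microcode SW error detected. Restarting 0x0.
iwlwifi 0000:3b:00.0: Start IWL Error Log Dump:
iwlwifi 0000:3b:00.0: 0x000028AA | ADVANCED_SYSASSERT
ieee80211 phy0: Hardware restart was requested
"""
print(summarize_crashes(sample))
```

On a real system the input would come from a file such as the output of `dmesg > dmesg.log`; a crash count that keeps growing alongside the restart count is consistent with the loop described above.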

> INFO: rcu_preempt self-detected stall on CPU
> rcu:         3-...!: (7 ticks this GP) idle=53e/1/0x4000000000000002 softirq=460035/460036 fqs=59
> rcu:          (t=3288 jiffies g=1168373 q=4332)
> rcu: rcu_preempt kthread starved for 2973 jiffies! g1168373 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=11
> rcu: RCU grace-period kthread stack dump:
> rcu_preempt     I    0    10      2 0x80000000
> Call Trace:
>  ? 0xffffffff8188819c
>  ? 0xffffffff810638a2
>  ? 0xffffffff8188d102
>  ? 0xffffffff810638a2
>  0xffffffff81888c1d
>  0xffffffff8188c185
>  ? 0xffffffff81087d00
>  0xffffffff81081fd8
>  0xffffffff8105f712
>  ? 0xffffffff81081af0
>  ? 0xffffffff8105f610
>  0xffffffff81a001bf

This RCU CPU stall warning **always** follows a crash of the wireless card's firmware, and the call stack contains the same values every time the warning is printed in dmesg.

The wireless card's firmware crash-restart-crash infinite loop that produces these RCU CPU stall warnings leads to latency spikes in my system's operation (i.e., the system freezes). Depending on the system's load, we see variances in the duration of these freezes:

- from 10-30 seconds (best case; light system load; this case is uncommon),
- to 3-30 minutes (average case; moderate system load; this case is extremely common),
- to indefinitely, requiring the system to be rebooted (worst case; high system load; this case occurs more often than the best case, but less often than the average case).

Naturally, when these freezes occur we see latency in all userspace activities, but they also cause latency in kernel operations. Indeed, this issue causes **all** aspects of the system to freeze, and many times this can only be resolved by restarting the system altogether. Although the system generally recovers from these freezes in the best and average cases, the freezes get progressively worse as uptime increases, going from the best/average case to the worst case, which requires a restart. Again, the time this progression takes depends on the system's load. Under

- low system load: the progression can take a few hours to days;
- moderate system load: the progression takes approximately 1 to 4 hours;
- high system load: less than an hour.

Additionally, in many cases after my system recovers from the freezes induced by these RCU stalls (which are themselves induced by the wireless card's firmware continually crashing):
- touchpad functionality is completely gone and is only restored when the computer is rebooted;
- mouse functionality is gone and the computer must be rebooted to restore it;
- userspace applications (e.g., Chrome, commands executed in the terminal, typing in GNU Emacs, etc.) become extremely slow (fixed by rebooting the computer);
- wifi connection functionality is completely gone, and the wireless card interface does not appear when a user runs commands such as "iw, iwlist, ip, connman".
Finally, as the

#### 4.0 - What we know from experimentation
For the past 2 months, I have Googled, read forums and documentation, and talked with other individuals. Based on experimentation (with other individuals), the problem was replicated on the following:
     
- (16) Computers:
 1. My Dell XPS 15 9570
 2. Dell XPS 15 9575
 3. Dell XPS 15 9560
 4. Dell XPS 15 9550
 5. Dell Precision 5520
 6. Dell Precision 5510
 7. Lenovo ThinkPad X1 Extreme
 8. Lenovo X1 Carbon 6th Gen
 9. Lenovo X1 Carbon 5th Gen
 10. Lenovo X1 ThinkPad Yoga 3rd Gen
 11. Lenovo ThinkPad T580
 12. Lenovo ThinkPad P72
 13. A desktop with an ASUS ROG Maximus X Hero motherboard
 14. Asus ROG Zephyrus S (GX531)
 15. ASUS ZenBook Flip 14 UX461UN
 16. ASUS ZenBook 3 Deluxe UX490UA

- (4) GNU/Linux Distros:
 1. Gentoo
 2. Arch Linux
 3. Ubuntu
 4. Fedora

- Kernel Versions: **All** kernel versions from 4.15 to 4.19.1
 
Experimentation used the git-vcs version of the linux-firmware package (i.e., on Gentoo, sys-kernel/linux-firmware-99999999::gentoo), which provides the following firmware blobs for the Intel Wireless-AC 9260 card:

- /lib/firmware/iwlwifi-9260-th-b0-jf-b0-33.ucode
- /lib/firmware/iwlwifi-9260-th-b0-jf-b0-34.ucode
- /lib/firmware/iwlwifi-9260-th-b0-jf-b0-38.ucode
- /lib/firmware/iwlwifi-9260-th-b0-jf-b0-41.ucode
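
When comparing machines, it can help to list which of these blobs are actually installed. The snippet below is a minimal sketch that extracts the numeric suffix (the number this report refers to as the firmware version) from file names of the form shown above; the helper name `ucode_versions` is hypothetical.

```python
import re

# Pattern for the 9260 blob names listed in this report.
UCODE_RE = re.compile(r"^iwlwifi-9260-th-b0-jf-b0-(\d+)\.ucode$")


def ucode_versions(filenames):
    """Return the sorted numeric suffixes of matching 9260 ucode file names."""
    versions = []
    for name in filenames:
        base = name.rsplit("/", 1)[-1]  # drop any directory component
        match = UCODE_RE.match(base)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


blobs = [
    "/lib/firmware/iwlwifi-9260-th-b0-jf-b0-41.ucode",
    "/lib/firmware/iwlwifi-9260-th-b0-jf-b0-33.ucode",
    "/lib/firmware/iwlwifi-9260-th-b0-jf-b0-38.ucode",
    "/lib/firmware/iwlwifi-9260-th-b0-jf-b0-34.ucode",
]
print(ucode_versions(blobs))
```

On a live system the list could be built with `os.listdir("/lib/firmware")` instead of the hard-coded sample.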

From experimentation we know that:
- The firmware crashes -> the wireless card is restarted -> the firmware crashes again; this cycle continues indefinitely (i.e., what we refer to as the crash-restart-crash infinite loop); the corresponding error output is displayed in dmesg.
- The crash-restart-crash infinite loop leads to elevated CPU usage and eventually results in RCU CPU stalls (and in several cases RCU thread starvation), which generally renders the system unusable until it is rebooted.
- This issue can **not** be replicated on Windows on any of the tested computers (the Intel Wireless-AC 9260 card works perfectly).
- This issue can only be replicated under GNU/Linux for all tested computers.
- This problem has been replicated across numerous different computers, using different GNU/Linux distros and kernel versions.
- This is **not** a software, hardware, kernel config, distribution-specific, hardware-specific, or BIOS-related issue. It is **entirely** an issue with the firmware crashing and/or the way the kernel loads the firmware.
- This problem occurs regardless of the firmware version loaded by the kernel.

- **The severity of the RCU CPU stalls this issue causes can be mitigated by doing the following in the kernel config:**
 - Modify the values of CONFIG_RCU_FANOUT and CONFIG_RCU_FANOUT_LEAF
 - Set CONFIG_RCU_BOOST=y
 - Set CONFIG_RCU_BOOST_DELAY to a **prime** number **less** than the default of 500
 - Set the processor timer frequency to either 1000Hz or 300Hz
*Note:* these mitigations lessen the severity of the RCU CPU stalls caused by the crash-restart-crash infinite loop. They do **not** resolve the problem.
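
As a rough illustration, the mitigation settings listed above can be checked against a kernel `.config` with a few lines of Python. This is a sketch under assumptions: the helper names are made up for this example, the sample values are illustrative rather than taken from the attached config, and the timer frequency is assumed to be exposed as `CONFIG_HZ`.

```python
def parse_kconfig(text):
    """Parse 'CONFIG_X=value' lines from a kernel .config into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def is_prime(n):
    """Trial-division primality test; sufficient for small config values."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def check_rcu_mitigations(options):
    """Report which of the mitigations described in this report are set."""
    delay = options.get("CONFIG_RCU_BOOST_DELAY", "")
    delay_ok = delay.isdigit() and int(delay) < 500 and is_prime(int(delay))
    return {
        "rcu_boost": options.get("CONFIG_RCU_BOOST") == "y",
        "boost_delay_prime_below_500": delay_ok,
        # Assumption: the timer frequency appears as CONFIG_HZ.
        "hz_1000_or_300": options.get("CONFIG_HZ") in ("1000", "300"),
    }


# Illustrative values only; not copied from the attached kernel config.
sample = """\
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_DELAY=331
CONFIG_HZ=1000
"""
print(check_rcu_mitigations(parse_kconfig(sample)))
```

The check only confirms that the options are set as described; it says nothing about whether the stalls are actually reduced on a given machine.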
 
#### 5.0 - How to replicate this problem
There is no specific sequence of commands that trigger this problem. Rather, the problem occurs **regardless** of what the user is doing. However, one will most likely encounter this issue if the following conditions are satisfied:
1. Using Gentoo, Arch Linux, Ubuntu, and/or Fedora.
2. Using an Intel Wireless-AC 9260 card.
3. Running kernel versions >=4.15.
4. Kernel configured to:
 a. Enable firmware loading.
 b. Load versions 33, 34, 38, or 41 of the iwlwifi 9260 firmware from the /lib/firmware directory.
 c. Enable iwlwifi and iwlmvm firmware support.

#### 6.0 - Other notes
1. I will attach the following:
 - A full dmesg log.
 - My Gentoo Pf-kernel config.
2. I am not using any bluetooth devices with my computer so I do not know what residual effect this issue has on bluetooth functionality.

Thank you for any and all help.
Comment 1 mcline.psp 2018-11-17 08:15:31 UTC
Created attachment 279485 [details]
My Gentoo kernel config for Dell XPS 15 9570
Comment 2 mcline.psp 2018-11-17 08:20:26 UTC
It appears that I am unable to edit some minor markdown formatting errors in my original comment. My apologies.
Comment 3 Emmanuel Grumbach 2018-11-17 17:26:01 UTC
You used -38.ucode. Can you please update your kernel or use our backport driver to test -41.ucode?
Comment 4 mcline.psp 2018-11-20 00:53:18 UTC
(In reply to Emmanuel Grumbach from comment #3)
> You used -38.ucode. Can you please update your kernel or use our backport
> driver to test -41.ucode?

Per your suggestion, I (1) updated my kernel to the latest kernel version and configured the kernel to load firmware -41.ucode and (2) used the backport driver to test version -41.ucode on all kernel versions from 4.15 to 4.19.1.

I ran both cases on the 16 machines I mentioned in my original comment and the issue still persists.

Additionally, sometimes version -41.ucode would cause the machine to hang when the Xserver was started from tty with the "startx" command. In this case, there was no wifi functionality at all (which persisted across reboots).
Comment 5 mcline.psp 2018-11-20 07:59:00 UTC
I also forgot to mention that after a "sufficiently large" time (approximately 2-3 hours on average) the crash-restart-crash infinite loop which causes RCU CPU Stalls and the RCU kthread starvations eventually leads to the system running out of memory and a kernel panic.
Comment 6 Emmanuel Grumbach 2018-11-20 08:13:27 UTC
I have involved the firmware team. They want to collect debug data from you.

First thing is to disable fw_restart to prevent the load on the machine.

options iwlwifi fw_restart=0

Then, I'll send a debug firmware later.
In the meantime, you can refer to [1] to see how to collect the debug information we'll need you to provide.
Please take the time to read the privacy notice [2].

Thanks.

[1] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#firmware_debugging

[2] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#privacy_aspects
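
A small sketch for confirming that the option above took effect after reloading the driver. It assumes the parameter is readable through sysfs at the path shown and that boolean parameters read back as `Y` or `N`; both should be verified on the system in question. For a driver that is built into the kernel rather than loaded as a module, the equivalent setting is normally passed on the kernel command line as `iwlwifi.fw_restart=0` instead of via a modprobe options line.

```python
from pathlib import Path

# Assumed sysfs location of the module parameter; verify it on your system.
FW_RESTART_PARAM = "/sys/module/iwlwifi/parameters/fw_restart"


def fw_restart_enabled(param_path=FW_RESTART_PARAM):
    """Return True or False for the parameter, or None if it cannot be read."""
    try:
        raw = Path(param_path).read_text().strip()
    except OSError:
        return None  # driver not loaded, or parameter not exposed
    return raw in ("Y", "y", "1")


print(fw_restart_enabled())
```

A result of `False` would indicate that firmware restarts are disabled, which is the state requested here for collecting debug data.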
Comment 7 Emmanuel Grumbach 2018-11-20 10:50:17 UTC
I have talked with FW people. They claim it is fixed in -41.ucode.
You can't use -41.ucode with kernel 4.19.

Please install our backport driver to test:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release

If you still see the failure with -41.ucode, please send a log of this.


Thanks.
Comment 8 mcline.psp 2018-11-20 11:03:08 UTC
(In reply to Emmanuel Grumbach from comment #7)
> I have talked with FW people. They claim it is fixed in -41.ucode.
> You can't use -41.ucode with kernel 4.19.
> 
> Please install our backport driver to test:
> https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release
> 
> If you still see the failure with -41.ucode, please send a log of this.
> 
> 
> Thanks.

Hmmm, that's weird, because I ran the backport driver with my current kernel on my Dell XPS 15 9570 (i.e., Pf-kernel 4.19.1) and I lost all wifi functionality, and my system hangs when I try to launch an Xsession. It is possible that it is not installing correctly. I followed all the directions detailed here:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release
and then I made sure to regenerate my initramfs.

Do I need to make any changes to my kernel config (i.e., changing iwlwifi from being built into the kernel to being built as a loadable module)? Currently, iwlwifi and iwlmvm firmware support are both built into my kernel (as per my kernel.config file I attached) so I don't know if that is an issue. Additionally, I include my initramfs into my kernel as an integrated-initramfs.
Comment 9 Emmanuel Grumbach 2018-11-20 11:12:46 UTC
No need to do anything. Just check the version of the firmware you run with ethtool -i wlp2s0

For any failure you see, please attach the dmesg output.
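
For scripted checks across several machines, the `firmware-version` field can be pulled out of the `ethtool -i` output. This is a minimal sketch: the sample text is illustrative, assembled from values mentioned in this thread, and the helper name is hypothetical.

```python
def parse_ethtool_info(output):
    """Parse 'key: value' lines from `ethtool -i <iface>` output into a dict."""
    info = {}
    for line in output.splitlines():
        if ":" in line:
            # Split on the first colon only; values such as bus-info contain colons.
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    return info


# Illustrative sample; a real run would capture the output of `ethtool -i wlp2s0`.
sample = """\
driver: iwlwifi
firmware-version: 38.755cfdd8.0
bus-info: 0000:3b:00.0
"""
info = parse_ethtool_info(sample)
print(info["firmware-version"])
```

The real output could be captured with `subprocess.run(["ethtool", "-i", "wlp2s0"], capture_output=True, text=True).stdout`, substituting the interface name used on the machine being tested.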
Comment 10 Emmanuel Grumbach 2018-11-20 21:00:52 UTC
Sorry I hadn't read your message carefully enough. You must not compile iwlwifi into the kernel. It must be a loadable module. Then you can use backport.
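
Whether the driver is built in or modular can be read from the kernel config. The sketch below assumes the relevant symbols are `CONFIG_IWLWIFI` and `CONFIG_IWLMVM`; the function name and the sample text are illustrative only.

```python
def symbol_state(config_text, symbol):
    """Return 'built-in', 'module', or 'disabled' for a kernel config symbol."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"{symbol}=y":
            return "built-in"
        if line == f"{symbol}=m":
            return "module"
    return "disabled"  # unset or commented out ("# CONFIG_X is not set")


# Illustrative sample resembling the built-in setup described by the reporter.
sample = """\
CONFIG_IWLWIFI=y
CONFIG_IWLMVM=y
"""
for sym in ("CONFIG_IWLWIFI", "CONFIG_IWLMVM"):
    print(sym, symbol_state(sample, sym))
```

For the backport driver to be usable as described, both symbols would need to report `module` rather than `built-in`.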
Comment 11 mcline.psp 2018-11-29 03:59:34 UTC
(In reply to Emmanuel Grumbach from comment #10)
> Sorry I hadn't read your message carefully enough. You must not compile
> iwlwifi into the kernel. It must be a loadable module. Then you can use
> backport.

My apologies for the delay. As you mentioned this issue is fixed in kernel version 4.20 using the -41.ucode. Specifically for Gentoo, the issue is fixed in all kernels that have the same version(s) as >=sys-kernel/git-sources-4.20_rc3.

However, I have noticed that my dmesg now shows numerous iwlwifi error messages, all of which say:
> iwlwifi 0000:3b:00.0: Unhandled alg: 0x707

I have attached my new dmesg.log for reference.
Comment 12 mcline.psp 2018-11-29 04:00:20 UTC
Created attachment 279731 [details]
Dmesg log showing unhandled alg iwlwifi error.
Comment 13 Emmanuel Grumbach 2018-11-29 08:14:31 UTC
This is a different issue which should be harmless. Please open a different bug if you really feel annoyance from this.
Comment 14 mcline.psp 2018-11-30 19:06:54 UTC
(In reply to Emmanuel Grumbach from comment #13)
> This is a different issue which should be harmless. Please open a different
> bug if you really feel annoyance from this.

Will do, thanks for all the help.