Bug 204151

Summary: iwlwifi: firmware crash on Intel Wireless AC 3168 when sending GEO_TX_POWER_LIMIT
Product: Drivers
Reporter: sharvil.nanavati
Component: network-wireless
Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX
Severity: normal
CC: bugzilla, hahnjo, hi-angel, ilya.lipnitskiy, ivan.galben, jethro.rose, jian-hong, johannes.hirte, linux, liubomirwm, luca, me, nicolopiazzalunga, noodles, noonetinone, pbrobinson, pedretti.fabio, rootkit85, vicamo, WyRe12
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.15.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Relevant dmesg output
systemd journal output, same issue, fedora 30 on asrock taichi x470
dmesg output from iwlwifi 3168
FW crash messages extracted from kernel log.
Crash on 5.3.0
dmesg showing Microcode SW error

Description sharvil.nanavati 2019-07-12 21:36:20 UTC
Created attachment 283657 [details]
Relevant dmesg output

Running Ubuntu 18.04 LTS with stock kernel (4.15.0-54-generic). Firmware failure with stock firmware from linux-firmware package (version 29.1044073957.0) and firmware supplied by Emmanuel (29.4009927039.0).

Takes about 1 day of uptime for the first firmware crash. After that, firmware restarts are fairly common (2-3 per day). Throughput drops by 3-4x after first firmware restart and never recovers. No specific repro steps; simply remain connected to a 5GHz WPA2 Personal network with occasional network activity.
Comment 1 Jethro Rose 2019-07-22 06:05:23 UTC
I have also been seeing this sort of behaviour since about April/May with Fedora 30 and various kernel versions.  It seems to be load-dependent: it will be fine for a while, but do anything "strenuous" network-wise (say, download distribution updates) and it seems more likely to happen, sometimes after less than an hour.  ifconfig down/up gets the network back.

I thought it was kernel related, but as far as I can determine it is wireless firmware related, as rolling back to a previous Fedora package (dated prior to May 14th on the Fedora side) seems to revert the issue.  But my testing has been limited, as every time I update the box it seems to break again.

If anyone has some suggestions for diagnostics I'm willing to assist.

Similar systemd journal output from my box is also attached.
Comment 2 Jethro Rose 2019-07-22 06:06:21 UTC
Created attachment 283893 [details]
systemd journal output, same issue, fedora 30 on asrock taichi x470
Comment 3 Jethro Rose 2019-07-23 02:21:28 UTC
To confirm, this started happening for me in May, from kernel version 5.0.17 onwards (as far as I can determine).  So it isn't just with kernel 4.x and, as above, I do not believe it to be kernel specific but rather wireless firmware specific, since a firmware rollback resolves the issue.
Comment 4 Luca Coelho 2019-08-20 11:42:14 UTC
There seems to be an issue with the new FW where it doesn't flag support for the new command version, but uses it anyway.  The crash is caused by the driver sending a command in the wrong (i.e. old) format.

I'll fix this and send a patch/new FW version in a bit.
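
To illustrate the kind of mismatch described above, here is a toy model, not the real driver or firmware code. The struct layouts, the field names and the idea that the firmware rejects a payload of unexpected size are all assumptions made for illustration; the thread does not say how the real firmware detects the wrong format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command layouts, NOT the real iwlwifi structures. */
struct geo_cmd_old { uint32_t op; uint32_t table[6]; };
struct geo_cmd_new { uint32_t op; uint32_t table[6]; uint32_t extra; };

/* Driver side: pick the payload size from what the firmware advertises. */
size_t driver_payload_len(bool fw_advertises_new)
{
    return fw_advertises_new ? sizeof(struct geo_cmd_new)
                             : sizeof(struct geo_cmd_old);
}

/* Firmware side (assumed behaviour): accept only the size of the format
 * it actually implements, whatever it advertised. */
bool firmware_accepts(size_t payload_len, bool fw_implements_new)
{
    size_t expected = fw_implements_new ? sizeof(struct geo_cmd_new)
                                        : sizeof(struct geo_cmd_old);
    return payload_len == expected;
}
```

In this model, a firmware that implements the new format without advertising it gets `driver_payload_len(false)` bytes and `firmware_accepts(..., true)` returns false, which is the situation the comment describes.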
Comment 5 Luca Coelho 2019-08-20 12:17:24 UTC
*** Bug 204213 has been marked as a duplicate of this bug. ***
Comment 6 Luca Coelho 2019-08-22 05:35:42 UTC
*** Bug 204153 has been marked as a duplicate of this bug. ***
Comment 7 Anthony Jagers 2019-08-30 19:42:41 UTC
Created attachment 284709 [details]
dmesg output from iwlwifi 3168

I'm getting strange kernel output with Intel(R) Dual Band Wireless AC 3168, REV=0x220.  It amounts to just garbage backtrace. For me, it only happens with the 5.3 release candidates. Everything is good with linux stable tree.

I've updated the firmware but no dice. Am I missing a kernel config option?
A BIOS setting perhaps?  I've got an ASRock X370 Taichi motherboard.

Attached is some output from dmesg and lspci. The first 10 lines were pruned.
Around line 180 is the output from lspci.
Comment 8 Chris Clayton 2019-09-07 07:43:17 UTC
Created attachment 284873 [details]
FW crash messages extracted from kernel log.
Comment 9 Chris Clayton 2019-09-07 07:44:12 UTC
Is the patch/new FW mentioned in comment 4 above available yet, please? I'm still seeing this FW crash in 5.3.0-rc7+. I'm using the latest firmware in the linux-firmware git tree (iwlwifi 0000:02:00.0: loaded firmware version 46.6bf1df06.0 op_mode iwlmvm) in the hope that this is the FW referred to in the comment. Relevant messages from the kernel log are attached.
Comment 10 Chris Murphy 2019-09-07 18:17:43 UTC
I'm still seeing it every boot and wake from suspend with 5.3.0-rc7 as well, on an 8260 controller.

[    9.866194] flap.local kernel: iwlwifi 0000:6c:00.0: Microcode SW error detected.  Restarting 0x2000000.
[    9.866330] flap.local kernel: iwlwifi 0000:6c:00.0: Start IWL Error Log Dump:
[    9.866331] flap.local kernel: iwlwifi 0000:6c:00.0: Status: 0x00000080, count: 6
[    9.866332] flap.local kernel: iwlwifi 0000:6c:00.0: Loaded firmware version: 36.77d01142.0
[    9.866334] flap.local kernel: iwlwifi 0000:6c:00.0: 0x00000038 | BAD_COMMAND 
...
[    9.867069] flap.local kernel: iwlwifi 0000:6c:00.0: FW error in SYNC CMD GEO_TX_POWER_LIMIT
...
[    9.867192] flap.local kernel: WARNING: CPU: 2 PID: 685 at drivers/net/wireless/intel/iwlwifi/mvm/scan.c:1874 iwl_mvm_rx_umac_scan_complete_notif.cold+0xc>
[    9.867237] flap.local kernel: CPU: 2 PID: 685 Comm: kworker/2:5 Not tainted 5.3.0-0.rc7.git0.1.fc31.x86_64 #1


And kernel is then tainted 512.
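
For reference, 512 is bit 9 of the taint mask, which the kernel's tainted-kernels documentation lists as 'W' (the kernel issued a warning), consistent with the WARNING line above. A minimal sketch of decoding a taint value into its flag letters follows; `decode_taint` is a hypothetical helper written for this note, and the letter table is copied from that documentation for bits 0 to 17 only.

```c
#include <stddef.h>

/* Letters for kernel taint bits 0..17, in bit order, as listed in the
 * kernel's tainted-kernels documentation. Bit 9 (value 512) is 'W'. */
static const char taint_letters[] = "PFSRMBUDAWCIOELKXT";

/* Write the letter of every set bit of `mask` into `out` and terminate it
 * with NUL. Returns the number of letters written.
 * `out` must have room for at least 19 bytes. */
size_t decode_taint(unsigned long mask, char *out)
{
    size_t n = 0;

    for (size_t bit = 0; bit < sizeof(taint_letters) - 1; bit++) {
        if (mask & (1UL << bit))
            out[n++] = taint_letters[bit];
    }
    out[n] = '\0';
    return n;
}
```

Calling `decode_taint(512, buf)` leaves "W" in `buf`. The value can be read from /proc/sys/kernel/tainted on a running system.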
Comment 11 Ilya Lipnitskiy 2019-09-20 04:32:14 UTC
Still happens on 5.3.0 on an 8260 controller, any updates?

kernel: iwlwifi 0000:02:00.0: loaded firmware version 36.77d01142.0 op_mode iwlmvm
kernel: iwlwifi 0000:02:00.0: Detected Intel(R) Dual Band Wireless AC 8260, REV=0x208
kernel: iwlwifi 0000:02:00.0: Microcode SW error detected.  Restarting 0x82000000.
kernel: iwlwifi 0000:02:00.0: Loaded firmware version: 36.77d01142.0
kernel: iwlwifi 0000:02:00.0: FW error in SYNC CMD GEO_TX_POWER_LIMIT
kernel: CPU: 2 PID: 71 Comm: kworker/2:1 Tainted: P           OE     5.3.0-arch1-1-ARCH #1
kernel: Hardware name: HP HP ZBook 15 G3/80D5, BIOS N81 Ver. 01.41 07/16/2019
kernel: Workqueue: events iwl_mvm_async_handlers_wk [iwlmvm]
kernel: Call Trace:
kernel:  dump_stack+0x5c/0x80
kernel:  iwl_trans_pcie_send_hcmd+0x547/0x560 [iwlwifi]
kernel:  ? wait_woken+0x70/0x70
kernel:  iwl_trans_send_cmd+0x59/0xb0 [iwlwifi]
kernel:  iwl_mvm_send_cmd+0x2e/0x80 [iwlmvm]
kernel:  iwl_mvm_get_sar_geo_profile+0x102/0x180 [iwlmvm]
kernel:  iwl_mvm_rx_chub_update_mcc+0x10b/0x1a0 [iwlmvm]
kernel:  iwl_mvm_async_handlers_wk+0xaa/0x140 [iwlmvm]
kernel:  process_one_work+0x1d1/0x3a0
kernel:  worker_thread+0x4a/0x3d0
kernel:  kthread+0xfb/0x130
kernel:  ? process_one_work+0x3a0/0x3a0
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x35/0x40
kernel: iwlwifi 0000:02:00.0: Failed to get geographic profile info -5
Comment 12 Ilya Lipnitskiy 2019-09-20 04:46:31 UTC
Created attachment 285073 [details]
Crash on 5.3.0
Comment 13 Luca Coelho 2019-09-23 12:34:16 UTC
*** Bug 204917 has been marked as a duplicate of this bug. ***
Comment 14 Luca Coelho 2019-09-24 09:44:03 UTC
*** Bug 204943 has been marked as a duplicate of this bug. ***
Comment 15 Luca Coelho 2019-09-24 10:34:29 UTC
I have just sent a fix for this upstream:

https://patchwork.kernel.org/patch/11158395/

It will hopefully be taken to v5.4 soon and from there to stable v5.3, v5.2 and v4.19.
Comment 16 sharvil.nanavati 2019-09-24 17:03:24 UTC
Looking at the original log I attached to this bug report, I see that the uCode major version for my device is 0x1D (29):

[79300.452830] iwlwifi 0000:05:00.0: 0x0000001D | uCode version major

The patch enables GEO_TX_POWER_LIMIT for uCode major version 36 which wouldn't affect my device.

Maybe I misunderstood something about the nature of the issue / patch, but it seems that the original bug report I reported is different than what is being fixed here. If so, should I file another bug or should we reopen this bug?
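
For anyone comparing numbers across these logs: 0x1D in the error dump is 29 in decimal, which matches the leading field of the firmware version string in the description (29.1044073957.0). Below is a minimal sketch of pulling that leading field out of a version string. It assumes the leading field is the number this thread refers to as -29 or -36, and `fw_leading_field` is a hypothetical helper, not a driver function.

```c
#include <stdlib.h>

/* Return the leading decimal field of a dotted firmware version string
 * such as "29.1044073957.0" or "36.77d01142.0".
 * Returns -1 if the string does not start with a number followed by '.'. */
long fw_leading_field(const char *version)
{
    char *end;
    long value = strtol(version, &end, 10);

    if (end == version || *end != '.')
        return -1;
    return value;
}
```

With the strings quoted in this report, "29.1044073957.0" gives 29, "36.77d01142.0" gives 36 and "46.6bf1df06.0" gives 46.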
Comment 17 Chris Murphy 2019-09-24 17:29:35 UTC
Awesome, thanks Luca!
Comment 18 Chris Clayton 2019-09-24 17:30:54 UTC
On 24/09/2019 18:03, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=204151
> 
> --- Comment #16 from sharvil.nanavati@gmail.com ---
> Looking at the original log I attached to this bug report, I see that the
> uCode
> major version for my device is 0x1D (29):
> 
> [79300.452830] iwlwifi 0000:05:00.0: 0x0000001D | uCode version major
> 
> The patch enables GEO_TX_POWER_LIMIT for uCode major version 36 which
> wouldn't
> affect my device.
> 

Even with that patch applied, I'm still getting the crash too. Hardware and uCode version are shown in:

[    3.447509] iwlwifi 0000:02:00.0: loaded firmware version 46.6bf1df06.0 op_mode iwlmvm
[    3.493018] iwlwifi 0000:02:00.0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x324

> Maybe I misunderstood something about the nature of the issue / patch, but it
> seems that the original bug report I reported is different than what is being
> fixed here. If so, should I file another bug or should we reopen this bug?
>
Comment 19 Luca Coelho 2019-09-24 17:52:50 UTC
(In reply to sharvil.nanavati from comment #16)
> Looking at the original log I attached to this bug report, I see that the
> uCode major version for my device is 0x1D (29):
> 
> [79300.452830] iwlwifi 0000:05:00.0: 0x0000001D | uCode version major
> 
> The patch enables GEO_TX_POWER_LIMIT for uCode major version 36 which
> wouldn't affect my device.
> 
> Maybe I misunderstood something about the nature of the issue / patch, but
> it seems that the original bug report I reported is different than what is
> being fixed here. If so, should I file another bug or should we reopen this
> bug?

The original bug you reported, the one that has this line in the logs:

[79300.452799] iwlwifi 0000:05:00.0: 0x00000034 | NMI_INTERRUPT_WDG

...has been fixed by commit 0c3d7282233c ("iwlwifi: Add support for SAR South Korea limitation") upstream (it's in v5.3).  This fix has also been backported to v5.2.10+.  The new patch I posted here today is not relevant for the "NMI_INTERRUPT_WDG" case.  Please make sure you are using v5.2.10 or above.
Comment 20 Luca Coelho 2019-09-24 17:55:25 UTC
(In reply to Chris Clayton from comment #18)
> Even with that patch applied, I'm still getting the crash too. Hardware and
> uCode version are shown in:
> 
> [    3.447509] iwlwifi 0000:02:00.0: loaded firmware version 46.6bf1df06.0
> op_mode iwlmvm
> [    3.493018] iwlwifi 0000:02:00.0: Detected Intel(R) Wireless-AC 9260
> 160MHz, REV=0x324

With this device you are probably getting the NMI_INTERRUPT_WDG case.  I can't see the kernel version you are using, but it probably doesn't have the fix.  Can you please check and report?

For the record, this is the patch upstream for the NMI_INTERRUPT_WDG case:

https://patchwork.kernel.org/patch/11021735/
Comment 21 sharvil.nanavati 2019-09-24 18:04:27 UTC
Thanks for closing the loop on this, Luca. Much appreciated.
Comment 22 Chris Clayton 2019-09-25 06:34:54 UTC
(In reply to Luca Coelho from comment #20)
> (In reply to Chris Clayton from comment #18)
> > Even with that patch applied, I'm still getting the crash too. Hardware and
> > uCode version are shown in:
> > 
> > [    3.447509] iwlwifi 0000:02:00.0: loaded firmware version 46.6bf1df06.0
> > op_mode iwlmvm
> > [    3.493018] iwlwifi 0000:02:00.0: Detected Intel(R) Wireless-AC 9260
> > 160MHz, REV=0x324
> 
> With this device you are probably getting the NMI_INTERRUPT_WDG case.  I
> can't see the kernel version you are using, but it probably doesn't have the
> fix.  Can you please check and report?
> 
> For the record, this is the patch upstream for the NMI_INTERRUPT_WDG case:
> 
> https://patchwork.kernel.org/patch/11021735/

I'm running 5.3.1 which already includes the patch you've pointed to.

I can generate the Microcode SW error more or less at will by downloading a large file (e.g. the Fedora 30 .iso). I've just done that and I'll attach the output from dmesg in a moment.
Comment 23 Chris Clayton 2019-09-25 06:35:58 UTC
Created attachment 285165 [details]
dmesg showing Microcode SW error
Comment 24 Luca Coelho 2019-09-25 06:48:55 UTC
*** Bug 204983 has been marked as a duplicate of this bug. ***
Comment 25 Luca Coelho 2019-09-25 06:53:31 UTC
(In reply to Chris Clayton from comment #22)
> I'm running 5.3.1 which already includes the patch you've pointed to.
> 
> I can generate the Microcode SW error more or less at will by downloading a
> large file (e.g. the Fedora 30 .iso). I've just done that and I'll attach
> the output from dmesg in a moment.

The error you are getting is a different one:

[  554.990533] iwlwifi 0000:02:00.0: 0x000022CE | ADVANCED_SYSASSERT         

In this case, the signature is SYSASSERT and 0x000022CE.  Can you please file a separate report for it?
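
Since three different signatures have come up in this report (NMI_INTERRUPT_WDG, BAD_COMMAND and ADVANCED_SYSASSERT), it helps to pull the code and name out of the error-log line before deciding which report a crash belongs to. A minimal sketch, assuming the "0x... | NAME" layout seen in the logs quoted here; `parse_fw_signature` is a hypothetical helper, not part of any tool.

```c
#include <stdio.h>
#include <string.h>

/* Extract the code and name from a log line of the form
 * "... 0x00000034 | NMI_INTERRUPT_WDG".
 * Returns 1 on success, 0 if the line has no "0x<hex> | <name>" field.
 * `name` must have room for at least 64 bytes. On failure the contents
 * of `code` and `name` are unspecified. */
int parse_fw_signature(const char *line, unsigned long *code, char *name)
{
    /* The first "0x" on these lines starts the code; the PCI address
     * (e.g. 0000:02:00.0) contains no "0x". */
    const char *p = strstr(line, "0x");

    if (p == NULL)
        return 0;
    /* %lx accepts the "0x" prefix; %63s stops at the first whitespace. */
    return sscanf(p, "%lx | %63s", code, name) == 2;
}
```

Run on the line quoted in this comment, it yields code 0x22CE and name ADVANCED_SYSASSERT. Lines such as "Restarting 0x2000000." are rejected because no "|" follows the number.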
Comment 26 Matteo Croce 2019-09-30 12:14:40 UTC
*** Bug 205035 has been marked as a duplicate of this bug. ***
Comment 27 You-Sheng Yang 2019-10-08 03:43:30 UTC
Hi, the patch proposed in comment 15 doesn't seem to fix the error found with an AC 3168 card. The latest firmware revision for AC 3168 is -29, and the patch only removes -36 from the conditions. I had a build that removes both -29 and -36 together, which fixes this fw error on AC 3168 as expected.
Comment 28 Luca Coelho 2019-10-08 07:20:12 UTC
Thanks for the patch you sent You-Sheng!

As I replied in the list, the 7265D devices do implement this command, so a better fix would be this:

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
index 0d2229319261..38d89ee9bd28 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
@@ -906,8 +906,10 @@ static bool iwl_mvm_sar_geo_support(struct iwl_mvm *mvm)
         * entirely.
         */
        return IWL_UCODE_SERIAL(mvm->fw->ucode_ver) >= 38 ||
-              IWL_UCODE_SERIAL(mvm->fw->ucode_ver) == 29 ||
-              IWL_UCODE_SERIAL(mvm->fw->ucode_ver) == 17;
+              IWL_UCODE_SERIAL(mvm->fw->ucode_ver) == 17 ||
+              (IWL_UCODE_SERIAL(mvm->fw->ucode_ver) == 29 &&
+               (mvm->trans->hw_rev &
+                CSR_HW_REV_TYPE_MSK) == CSR_HW_REV_TYPE_7265D);
 }
 
 int iwl_mvm_get_sar_geo_profile(struct iwl_mvm *mvm)

Can someone try it?
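
For readers who want to check what the new condition does for a given card without building a kernel, here is a standalone restatement of it. This is a sketch, not driver code: `serial` stands in for IWL_UCODE_SERIAL(mvm->fw->ucode_ver), `hw_rev` for mvm->trans->hw_rev, and the two constant values are assumptions for illustration that should be checked against iwl-csr.h in your tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only; verify against iwl-csr.h. */
#define HW_REV_TYPE_MSK_ASSUMED   0x000FFF0u
#define HW_REV_TYPE_7265D_ASSUMED 0x0000210u

/* Same shape as the condition in the diff above: supported for serial
 * >= 38, for 17, and for 29 only when the hardware type is 7265D. */
bool sar_geo_supported(unsigned int serial, uint32_t hw_rev)
{
    return serial >= 38 ||
           serial == 17 ||
           (serial == 29 &&
            (hw_rev & HW_REV_TYPE_MSK_ASSUMED) == HW_REV_TYPE_7265D_ASSUMED);
}
```

Under these assumed values, the 3168 from comment 7 (firmware -29, REV=0x220) evaluates to false, so the driver would no longer send GEO_TX_POWER_LIMIT to it, and an 8260 with firmware -36 also evaluates to false.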
Comment 29 Luca Coelho 2019-10-08 10:20:23 UTC
Okay, I think this patch is good and I have applied it internally.  I'll push it out as soon as I get confirmation that the patch I posted in the previous comment works.
Comment 30 You-Sheng Yang 2019-10-08 11:15:43 UTC
I can confirm that the patches in comment 15 and comment 28 turn off the firmware error messages on an HP EliteDesk 800 G3 DM equipped with an Intel AC 3168.
Comment 31 Luca Coelho 2019-10-08 11:18:45 UTC
Thanks for confirming, You-Sheng!
Comment 32 Luca Coelho 2019-10-11 04:27:50 UTC
Here's the patch that was sent upstream:

https://lore.kernel.org/linux-wireless/20191009094853.PfIm3J8o7DN_Femup-OXkJdtmKE7rftk1ODkm7cx-vk@z/
Comment 33 Luca Coelho 2019-10-11 12:58:29 UTC
*** Bug 205163 has been marked as a duplicate of this bug. ***
Comment 34 Thorsten Leemhuis 2019-10-23 08:39:00 UTC
(In reply to Luca Coelho from comment #32)
> Here's the patch that was sent upstream:
> https://lore.kernel.org/linux-wireless/20191009094853.PfIm3J8o7DN_Femup-OXkJdtmKE7rftk1ODkm7cx-vk@z/

That patch (https://git.kernel.org/torvalds/c/12e36d98d3e5acf5fc57774e0a15906d55f30cb9) is not marked for stable afaics. I wonder if it should be, as it contains this line:
```
Fixes: f5a47fae6aa3 ("iwlwifi: mvm: fix version check for GEO_TX_POWER_LIMIT support")
```
And that patch was recently backported to 5.3:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.3.y&id=3a0b7157d6a9833cefd40b5f0fa1cc90285bb4b5

Thus I suspect that 12e36d98d3e5 should be backported, too. I guess Bug 205163 is an indicator for that, as it talks about Linux 5.3.5.
Comment 35 Thorsten Leemhuis 2019-10-23 08:47:00 UTC
(In reply to Thorsten Leemhuis from comment #34)
> ```
> Fixes: f5a47fae6aa3 ("iwlwifi: mvm: fix version check for GEO_TX_POWER_LIMIT support")
> ```
> And that patch was recently backported to 5.3:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.3.y&id=3a0b7157d6a9833cefd40b5f0fa1cc90285bb4b5

Scratch that, I mixed up the commits, sorry for the noise.

But I still wonder if https://git.kernel.org/torvalds/c/12e36d98d3e5acf5fc57774e0a15906d55f30cb9 should be backported, as I see the warnings mentioned in Bug 205163 here as well, on a system with a 3168.
Comment 36 Lyubomir 2019-10-23 17:00:37 UTC
I also still have these issues on my Aspire A515-51G. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1846016
Comment 37 You-Sheng Yang 2019-10-24 02:37:44 UTC
Hi Lyubomir, please help ACK https://lists.ubuntu.com/archives/kernel-team/2019-October/104807.html if that works for you. It has landed in the Ubuntu linux-oem/linux-oem-osp1 trees, but not yet in the -generic trees due to the lack of sufficient ACKs.
Comment 38 You-Sheng Yang 2019-10-24 02:39:14 UTC
Lyubomir, you may use my PPA https://launchpad.net/~vicamo/+archive/ubuntu/ppa-1846016 for testing.
Comment 39 Luca Coelho 2019-12-16 19:55:11 UTC
*** Bug 204153 has been marked as a duplicate of this bug. ***
Comment 40 Jonathan McDowell 2020-03-10 10:48:27 UTC
FWIW this has regressed in 5.5 - tracked in Bug 206395