Bug 203891

Summary: iwlwifi: 9560: Failed to load firmware chunk! -- Failed to run INIT ucode: -110
Product: Drivers Reporter: Tormen (tormen)
Component: network-wirelessAssignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX    
Severity: normal CC: leho, linuxwifi, luca, tormen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.1.8 Subsystem:
Regression: No Bisected commit-id:
Attachments: 2019-06-13=0839_journalctl_direct-after-boot with more debug-output
2019-06-13 test-it.output.txt
2019-06-13 test-it.journalctl.log
2019-06-13 test-it.txt
2019-06-15_journalctl_-b.log working after reboot now (but not after rmmod / modprobe)
2019-06-15_test-it.output.txt
2019-06-15_test-it.journalctl.log -- of new test-it run
fix candidate
2019-06-21 shell log (what I did inclusive output of test-it.sh and it's command)
2019-06-21 test-it.journalctl.log -- of new test-it run
2019-06-21 journalctl_between_rfkill-keyboard-shortcut_to-turn-OFF___until___to-turn-ON.log
2019-06-21 journalctl_-b COMPLETE (boot until this moment)
2019-06-21 strace_iwctl_device_list.txt
2019-06-21 journalctl from test-it.sh + subsequent strace iwctl device list

Description Tormen 2019-06-13 22:25:18 UTC
Created attachment 283261 [details]
2019-06-13=0839_journalctl_direct-after-boot with more debug-output

Linux huit 5.1.8-arch1-1-ARCH #1 SMP PREEMPT Sun Jun 9 20:28:28 UTC 2019 x86_64 GNU/Linux
linux-firmware 20190514.711d329


Dear Intel Wifi Team,

I lost wifi functionality since June 10th at 11h11. Trying to load the wifi driver (either during boot or when reloading the iwlwifi kernel module) now looks like this:

Jun 11 10:23:52 huit kernel: iwlwifi 0000:00:14.3: enabling device (0000 -> 0002)
Jun 11 10:23:52 huit kernel: iwlwifi 0000:00:14.3: loaded firmware version 46.3cfab8da.0 op_mode iwlmvm
Jun 11 10:23:53 huit kernel: iwlwifi 0000:00:14.3: Detected Intel(R) Wireless-AC 9560 160MHz, REV=0x318
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Failed to load firmware chunk!
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: iwlwifi transaction failed, dumping registers
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: iwlwifi device config registers:
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: 00000000: a3708086 00100406 02800010 00800010 9133c004 00000000 00000000 00000000
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: 00000020: 00000000 00000000 00000000 40308086 00000000 000000c8 00000000 0000010b
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: iwlwifi device memory mapped registers:
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: 00000000: 18489008 00000040 00000080 00000000 00000000 00000000 00000000 00000000
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: 00000020: 00000010 04000005 00000312 d55555d5 d55555d5 d55555d5 80008040 001f0040
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Could not load the [0] uCode section
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: reporting RF_KILL (radio disabled)
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Failed to start INIT ucode: -110
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Collecting data: trigger 16 fired.
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Not valid error log pointer 0x00000000 for Init uCode
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Firmware not running - cannot dump error
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Master Disable Timed Out, 100 usec
Jun 11 10:23:57 huit kernel: iwlwifi 0000:00:14.3: Failed to run INIT ucode: -110

For the full `journalctl -b` output, please see here: http://sprunge.us/NxTxyK

--------------------------

After initial investigation thanks to Emmanuel Grumbach I got more debug output and re-ran the test I did the last time, just now it's a script:

        2019-06-13=0839_journalctl_direct-after-boot with more debug-output

	/bin/sh -ex test-it.txt
	-->  
	test-it.output.txt == stdout + stderr of the run

	test-it.journalctl.log == journalctl produced during run of
	test-it
Comment 1 Tormen 2019-06-13 22:27:08 UTC
Created attachment 283263 [details]
2019-06-13 test-it.output.txt
Comment 2 Tormen 2019-06-13 22:27:39 UTC
Created attachment 283265 [details]
2019-06-13 test-it.journalctl.log
Comment 3 Tormen 2019-06-13 22:27:57 UTC
Created attachment 283267 [details]
2019-06-13 test-it.txt
Comment 4 Tormen 2019-06-13 22:28:49 UTC
From: "Grumbach, Emmanuel" <emmanuel.grumbach@intel.com>
Subject: Re: [linuxwifi] regression with Wireless-AC 9560: Failed to load  firmware chunk! -- Failed to run INIT ucode: -110
Date: Thu, 13 Jun 2019 20:30:58 +0000


Hey,


0000:00:14.3:     CSR_MSIX_HW_INT_MASK_AD: 0X6e0001c7
0000:00:14.3:   CSR_MSIX_HW_INT_CAUSES_AD: 0X00000080

What it means is that the only interrupt we have here is the RFKILL
interrupt. It's not that the HW sent an interrupt and that it was
masked or something...

Very strange.

Since it worked, we'll have to ask you bisect.

Thanks.
Comment 5 Tormen 2019-06-15 16:41:58 UTC
Created attachment 283285 [details]
2019-06-15_journalctl_-b.log    working after reboot now (but not after rmmod / modprobe)


The laptop went low battery and shutdown overnight.

Directly after rebooting, now the failstate that I had seems gone and everything /seems/ to work again - see 2019-06-15_working-journalctl.log

I think, during the moment the problem occured and all my tests I might not have shutdown the machine, but actually just put it in stand-by or rebooted it!
Comment 6 Tormen 2019-06-15 16:55:09 UTC
Created attachment 283289 [details]
2019-06-15_test-it.output.txt

Now that it works after reboot (don't know why, because I did not change /anything/), rmmod and modprobe still don't work.

This is the output of test-it.sh  

changes to test-it.sh:
  * added 3 echo lines: START, NEXT, DONE
  * added a sleep 3 after the `modprobe iwlwifi` allowing for things to settle first)
  * added 3 `ip link` at the beginning, after rmmod and at the end because `iwctl device list` is empty , but there is a device wlo1 appearing !!
Comment 7 Tormen 2019-06-15 16:56:02 UTC
Created attachment 283291 [details]
2019-06-15_test-it.journalctl.log -- of new test-it run
Comment 8 Tormen 2019-06-15 16:57:27 UTC
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: loaded firmware version 46.3cfab8da.0 op_mode iwlmvm
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: Detected Intel(R) Wireless-AC 9560 160MHz, REV=0x318
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: base HW address: 18:56:80:77:0d:58
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3 wlo1: renamed from wlan0
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: RF_KILL bit toggled to disable radio.
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: reporting RF_KILL (radio disabled)
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: U iwl_run_init_mvm_ucode RFKILL while calibrating.
Jun 15 18:50:42 huit kernel: iwlwifi 0000:00:14.3: Failed to start RT ucode: -132


So the modprobe after rmmod still results in a "Failed to start RT ucode" but this time with -132 ;)
Comment 9 Emmanuel Grumbach 2019-06-16 12:57:07 UTC
Created attachment 283293 [details]
fix candidate

Please try this patch.
Comment 10 Emmanuel Grumbach 2019-06-19 06:21:35 UTC
ping?
Comment 11 Tormen 2019-06-19 11:59:40 UTC
Pong. Really sorry. Was on a conference the last 2 days.
Will try to test it today, latest tomorrow.
Comment 12 Tormen 2019-06-21 08:26:00 UTC
Hi Emmanuel,

I added your patch to 5.1.8 kernel (your debug patch is also still there :).

Summary: Things changed, but I can't get it to work.

In short:

* Booting: All good.
* RF-KILL keyboard-shortcut (on DELL laptop): turns off wifi (rfkill blocked)
* RF-KILL keyboard-shortcut (on DELL laptop): pressing it again unblocks wlan0.... BUT I always have to manually bring the wireless interface up. Question: Who would be to blame for this?? once I manually `ip link set wlan0 up` things work again
* BUT test-it.sh does not yet function. So rmmod .. modprobe won't get me a working wireless connection: wlan0 gets renamed to wlo1 by the kernel ... why?! / what does that mean??
* This was "DOWN" as well. When I tried to `ip link set wlo1 up` it, I saw: "iwlwifi 0000:00:14.3: BIOS contains WGDS but no WRDS" -- What does this mean ? / Do I need to worry about this?

Tormen
Comment 13 Tormen 2019-06-21 08:27:56 UTC
Created attachment 283355 [details]
2019-06-21 shell log (what I did inclusive output of test-it.sh and it's command)
Comment 14 Tormen 2019-06-21 08:28:44 UTC
Created attachment 283357 [details]
2019-06-21 test-it.journalctl.log -- of new test-it run
Comment 15 Tormen 2019-06-21 08:30:17 UTC
Created attachment 283359 [details]
2019-06-21 journalctl_between_rfkill-keyboard-shortcut_to-turn-OFF___until___to-turn-ON.log
Comment 16 Tormen 2019-06-21 08:30:54 UTC
Created attachment 283361 [details]
2019-06-21 journalctl_-b COMPLETE (boot until this moment)
Comment 17 Emmanuel Grumbach 2019-06-21 10:57:28 UTC
The renaming is because systemd, not the kernel.

The kernel won't UP the interface after rfkill, wpa_supplicant is in charge of this. Bottom line, the bug is fixed.

Patches are queued for 5.2
Comment 18 Tormen 2019-06-21 17:36:27 UTC
Hi Emmanuel,

many thanks for clarifying!

I was using iwd which somehow did not want to work with the device `wlo1` and also trying to manually bringing it up resulted in it being set back to down again ...

... but with wpa_supplicant I can confirm it works.

Only one strangeness: When I reran test-it.sh now the test did not finish as it got stuck during `iwctl device list` !?

I ran an strace and will attach it. This is really strange, no?

`ip link` works for instance.
And this morning `iwctl device list` also worked.
I'll also attach the journalctl but I did not see anything in there.

Tormen.
Comment 19 Tormen 2019-06-21 17:37:23 UTC
Created attachment 283375 [details]
2019-06-21 strace_iwctl_device_list.txt
Comment 20 Tormen 2019-06-21 17:38:29 UTC
Created attachment 283377 [details]
2019-06-21 journalctl from test-it.sh + subsequent strace iwctl device list
Comment 21 Tormen 2019-06-22 17:09:33 UTC
Hi Emmanuel,

just noticed that also after a fresh reboot with your patch the `iwctl device list` command hangs forever every time :((

This new problem seems to coincide with me using your patch.

Any idea?

Tormen
Comment 22 Emmanuel Grumbach 2019-06-22 18:17:22 UTC
dmesg please...
Comment 23 Tormen 2019-06-24 07:41:00 UTC
Boah, this is embarassing:

I confused iwctl with iw :(

iwctl is from iwd. I switched back to wpa_supplicant. Hence iwctl hangs ;)


@Emmanuel:
Thaaaaaaaaank you very much for the quick fix !!!!
Comment 24 Leho Kraav 2019-08-26 11:54:19 UTC
I ran into this problem on Dell Latitude 7400 2-in-1 and kernel 5.3.0-rc6 (last night's `master`).

It seems I was running sys-kernel/linux-firmware-20190717.

Upon investigation, I discovered sys-kernel/linux-firmware-20190815 contains iwlwifi-9000-pu-b0-jf-b0-46.ucode with a different filesize / checksum.

After updating to 20190815, the system seems to behave more stable, at least the wifi adapter isn't crashing.

Emmanuel, can you confirm that even though ucode filename may be exactly the same, it may have behavior impacting differences across linux-firmware releases?
Comment 25 Luca Coelho 2019-08-27 06:06:17 UTC
Hi Leho,

Yes, we update the firmware and keep the same file name.  The number in the filename is the FW API number and is matched against the kernel version (i.e. each kernel version supports a different range of FW API versions).  We still make bugfixes to these firmwares and release them under the same FW API number.

You can check the SHA1 of the firmware build in the WHENCE file and in the logs where you see something like this:

iwlwifi 0000:00:14.3: loaded firmware version 46.3cfab8da.0 op_mode iwlmvm

The first number in the firmware version is the API number, the second number is the SHA1 of the build (this changes when we release bugfixes) and the last number is for internal use and should always be 0 on public firmwares.

I hope this clarifies.