Bug 86231 - iwlwifi: 7260: 0x00000034 | NMI_INTERRUPT_WDG - INIT image - MWG100220470
Summary: iwlwifi: 7260: 0x00000034 | NMI_INTERRUPT_WDG - INIT image - MWG100220470
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-14 10:11 UTC by Ralf
Modified: 2014-11-03 13:31 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.17-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output of the problem with 3.16 Debian kernel (66.26 KB, text/plain)
2014-10-17 09:13 UTC, Ralf
Details
dmesg output with 3.16 kernel, firmware 25.222 (75.45 KB, text/plain)
2014-10-18 08:13 UTC, Ralf
Details
fix - not sure (1.16 KB, patch)
2014-10-23 12:25 UTC, Emmanuel Grumbach
Details | Diff
dmesg log with kernel a812cba9 (76.43 KB, application/octet-stream)
2014-10-24 13:29 UTC, Ralf
Details
dmesg output with 3.16 kernel, firmware 25.228 (65.51 KB, text/plain)
2014-10-26 09:26 UTC, Ralf
Details
dmesg output with vanilla 3.17, debug enabled (98.26 KB, text/plain)
2014-10-26 10:23 UTC, Ralf
Details
tentative fix (2.54 KB, patch)
2014-10-26 15:06 UTC, Emmanuel Grumbach
Details | Diff
tentative fix (2.50 KB, patch)
2014-10-26 15:29 UTC, Emmanuel Grumbach
Details | Diff
dmesg of v3.17 kernel with the path applied (98.28 KB, text/plain)
2014-10-27 19:08 UTC, Ralf
Details
tentative fix - take 2 (3.05 KB, patch)
2014-10-27 19:55 UTC, Emmanuel Grumbach
Details | Diff
dmesg of kernel 3.17, patch v2 (97.26 KB, text/plain)
2014-10-27 21:15 UTC, Ralf
Details
tentative fix - take 3 (3.41 KB, patch)
2014-10-28 08:47 UTC, Emmanuel Grumbach
Details | Diff
dmesg of kernel 3.17, patch v3 (98.13 KB, text/plain)
2014-10-30 18:33 UTC, Ralf
Details
dmesg of kernel 3.17, patch v3, try 2 (98.44 KB, text/plain)
2014-10-31 09:37 UTC, Ralf
Details
dmesg of kernel 3.17, patch v3, try 3 (100.80 KB, text/plain)
2014-11-02 10:30 UTC, Ralf
Details
tentative fix + debug info - take4 (4.54 KB, patch)
2014-11-02 11:08 UTC, Emmanuel Grumbach
Details | Diff
dmesg of kernel 3.17, patch 4 (100.25 KB, text/plain)
2014-11-02 11:21 UTC, Ralf
Details
tentative fix + debug info - take5 (4.76 KB, patch)
2014-11-02 11:33 UTC, Emmanuel Grumbach
Details | Diff
dmesg of kernel 3.17, patch 5 (102.79 KB, text/plain)
2014-11-02 11:48 UTC, Ralf
Details
tentative fix + debug info - take6 (4.76 KB, patch)
2014-11-02 12:00 UTC, Emmanuel Grumbach
Details | Diff
dmesg of kernel 3.17, patch 6 (no failure!) (99.37 KB, text/plain)
2014-11-02 12:44 UTC, Ralf
Details
final fix - before review (4.67 KB, patch)
2014-11-02 13:57 UTC, Emmanuel Grumbach
Details | Diff

Description Ralf 2014-10-14 10:11:17 UTC
Every once in a while, the wireless driver (iwlwifi) prints "Microcode SW error detected" during boot (I will attached a full dmesg). In this case, the wireless card is not detected properly, not shown by "ip addr" nor by NetworkManager. I never saw this appear with kernels up to 3.14, I see it fairly often with the 3.16 Debian kernel, and it still happens occasionally with 3.17-rc7 vanilla. Sometimes, it seems to be "stuck" in that state, so even booting into an older kernel, the error will still appear. No amount of cold-rebooting or even pulling out the battery will fix this.

In any case, I can run the command

  sudo modprobe -r iwlmvm && sudo modprobe -r iwlwifi && sudo modprobe iwlwifi

to make it load properly. This also seems to "reset" the "stuck" state. In other words, now doing a reboot to a 3.14 or older reliably brings the card up on boot. So the only time I saw this failure with <=3.14 on boot was if the same failure was already shown on the previous boot, and I did *not* re-load the module to fix that. With 3.16+, the bug can happen even if the previous boot was fine. That's why I consider this a regression.

I reported this bug against Debian as well, and another user seems to have the same issue: <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763172>

Here's what lspci has to say about the card:

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)
        Subsystem: Intel Corporation Dual Band Wireless-AC 7260
        Flags: bus master, fast devsel, latency 0, IRQ 34
        Memory at ddc00000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c8] Power Management version 3
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number fc-f8-ae-ff-ff-51-88-9e
        Capabilities: [14c] Latency Tolerance Reporting
        Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014 <?>
        Kernel driver in use: iwlwifi

The Debian package firmware-iwlwifi is at version 0.43 (whatever that means). Curiously, the dmesg says

[   12.055632] iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-7260-10.ucode failed with error -2
[   12.058021] iwlwifi 0000:03:00.0: loaded firmware version 23.214.9.0 op_mode iwlmvm

I assume this means that it didn't find the -10 ucode (which, according to <http://wireless.kernel.org/en/users/Drivers/iwlwifi>, is still experimental), and then falls back to the -9? The following ucodes for the 7260 exist on my disk:

$ ls /lib/firmware/iwlwifi-7260-* -l
-rw-r--r-- 1 root root 683236 Jun 16 01:51 /lib/firmware/iwlwifi-7260-7.ucode
-rw-r--r-- 1 root root 679780 Jun 16 01:51 /lib/firmware/iwlwifi-7260-8.ucode
-rw-r--r-- 1 root root 679380 Jun 16 01:51 /lib/firmware/iwlwifi-7260-9.ucode
Comment 1 Emmanuel Grumbach 2014-10-16 16:32:50 UTC
Please try latest firmware.
Your -9.ucode can and should be upgraded.
Comment 2 Emmanuel Grumbach 2014-10-16 19:51:41 UTC
BTW - I find it very hard to believe that this is a regression.

I don't see how this bug would not have happened with an earlier kernel.
Maybe you are using a newer kernel that loads a newer that has a regression, but in this case, it is a regression in the firmware.
In any case, please use the latest firmware from here:

https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462


and if you still have issues - please paste / attach the dmesg output into this bug.
Comment 3 Ralf 2014-10-17 08:55:56 UTC
(In reply to Emmanuel Grumbach from comment #2)
> BTW - I find it very hard to believe that this is a regression.
> 
> I don't see how this bug would not have happened with an earlier kernel.
> Maybe you are using a newer kernel that loads a newer that has a regression,
> but in this case, it is a regression in the firmware.

The occurrence of the bug correlates with the 3.16 kernel landing in Debian testing. The last firmware update landed end of June. I did not see this happen for three months, after the firmware upgrade.
The bug not being reproducible, it's of course hard to tell when exactly it appeared.

> In any case, please use the latest firmware from here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/
> plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462

Okay, will do.
Comment 4 Ralf 2014-10-17 09:13:29 UTC
Created attachment 154081 [details]
dmesg output of the problem with 3.16 Debian kernel

With the 3.16 Debian kernel, the bug happens really often (I am under the impression that it's way more frequent than with 3.17-rc7 vanilla), so I can already attach a new dmesg with the error message.

However, the firmware version printed in there did not even change, weird enough.

$ dmesg | grep iwlwifi | head -n 5
[   13.825995] iwlwifi 0000:03:00.0: irq 50 for MSI/MSI-X
[   13.829736] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-9.ucode
[   13.829941] iwlwifi 0000:03:00.0: loaded firmware version 23.214.9.0 op_mode iwlmvm
[   13.874385] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[   13.874740] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
$ sha1sum /lib/firmware/iwlwifi-7260-9.ucode 
21ab34f5f5d71a15d56c34551734d9414059893e  /lib/firmware/iwlwifi-7260-9.ucode
$ curl -s 'https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462' | sha1sum
21ab34f5f5d71a15d56c34551734d9414059893e  -
Comment 5 Emmanuel Grumbach 2014-10-17 10:29:58 UTC
Stupid me.... I pointed to the wrong commit. Can you please take the firmware from the master branch?

Thanks
Comment 6 Ralf 2014-10-18 08:13:17 UTC
Created attachment 154121 [details]
dmesg output with 3.16 kernel, firmware 25.222

It just did it again, the dmesg is attached.

Interestingly, the first attempt to re-load the module also failed (at 135 in the log) with a "Timeout waiting for hardware access". That never happened before.
The second re-load worked (at 173 in the log), and then the module was functional.
Comment 7 Emmanuel Grumbach 2014-10-18 17:29:13 UTC
I can see that you have the RFKILL switch set to forbid WiFi activity.
Can you please change this and check again?
I'll try to reproduce with the RFKILL switch set to forbid WiFi as well.
Comment 8 Ralf 2014-10-19 08:55:40 UTC
(In reply to Emmanuel Grumbach from comment #7)
> I can see that you have the RFKILL switch set to forbid WiFi activity.
> Can you please change this and check again?

Just to be sure: That RFKILL, is that both the software button to disable WiFi I have in NetworkManager, and the (not always reliably working...) button on my keyboard?

Currently, "dmesg | grep -i kill" has no output. And I can see nearby networks. I assume this means that I have the kill bits disabled, so I won't touch either button anymore.
Comment 9 Emmanuel Grumbach 2014-10-19 12:37:27 UTC
RFKILL is the button on your keyboard or a real switch.

Should I understand that you don't experience any issue anymore?
Comment 10 Ralf 2014-10-19 12:40:51 UTC
(In reply to Emmanuel Grumbach from comment #9)
> Should I understand that you don't experience any issue anymore?

I had a single successful boot this morning. But since the issue is not reproducible, that doesn't really say much. I've had successful boots before. So I cannot yet say whether the issue is solved (or better, worked around) by not using the kill switch.
Now that I know that the kill state could have an impact, I will take this into account when looking for patterns.
Comment 11 Emmanuel Grumbach 2014-10-19 13:56:02 UTC
Ok - I see.
FYI - this issue is very tricky. The firmware gets stuck in a flow it really shouldn't... I don't really see what can be happening there but I'll try to ask people...
Comment 12 Ralf 2014-10-20 12:02:04 UTC
I did some more testing (5 boots in total), and there's definitely a strong correlation between having WiFi disabled in NetworkManager, and the card crashing on next boot: In these tests, the boot always succeeded when WiFi was enabled on last shutdown (3 times), and always failed when it was disabled on last shutdown (2 times). I also checked the older dmesg logs I submitted, and the kill bit was enabled for all of them.

So, it seems, the issue is much more reproducible than I thought.
Comment 13 Ralf 2014-10-20 12:49:31 UTC
I did the same test with the 3.14 Debian kernel, and the pattern seems to be (pending confirmation, but I need tog et back to work now ;-) that initialisation on boot will fail *if the last shutdown was with the 3.16 kernel and WiFi disabled*. It doesn't matter which kernel is used for booting - if I re-boot from the 3.16 to the 3.14 kernel, I get the error on boot if WiFi was disabled. If I then re-load the module, but keep WiFi disabled, the next boot will succeed.
Note sure if this make any sense, and maybe I am "overfitting" here...
Comment 14 Emmanuel Grumbach 2014-10-20 12:55:17 UTC
Ok - thanks for the information.
I am checking with the firmware people.
Comment 15 Emmanuel Grumbach 2014-10-23 12:25:04 UTC
Created attachment 154641 [details]
fix - not sure

Hi,


can you please test this patch?

Thanks
Comment 16 Ralf 2014-10-23 13:24:52 UTC
Which kernel should I apply this to? I tried 3.17 and 3.17.1, but it failed on both.
Comment 17 Emmanuel Grumbach 2014-10-23 13:26:50 UTC
It didn't apply or it doesn't fix the bug?
Comment 18 Ralf 2014-10-23 13:27:58 UTC
It didn't apply.

$ git reset --hard v3.17
HEAD is now at bfe01a5 Linux 3.17
$ patch -p1 <  ~/Desktop/iwlwifi.patch 
patching file drivers/net/wireless/iwlwifi/mvm/ops.c
Hunk #1 FAILED at 455.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/iwlwifi/mvm/ops.c.rej
Comment 19 Emmanuel Grumbach 2014-10-23 13:32:35 UTC
This should apply on 3.17*


diff --git a/drivers/net/wireless/iwlwifi/mvm/ops.c b/drivers/net/wireless/iwlwifi/mvm/ops.c
index 610dbcb..79bdff9 100644
--- a/drivers/net/wireless/iwlwifi/mvm/ops.c
+++ b/drivers/net/wireless/iwlwifi/mvm/ops.c
@@ -415,6 +415,7 @@ iwl_op_mode_mvm_start(struct iwl_trans *trans, const struct iwl_cfg *cfg,
                mvm->first_agg_queue = 12;
        }
        mvm->sf_state = SF_UNINIT;
+       mvm->cur_ucode = IWL_UCODE_INIT;

        mutex_init(&mvm->mutex);
        mutex_init(&mvm->d0i3_suspend_mutex);
Comment 20 Ralf 2014-10-23 13:38:24 UTC
It does, after converting spaces back to tabs. I'll reboot later today, back to work for now.
Comment 21 Ralf 2014-10-23 18:26:34 UTC
Unfortunately, this does not fix the issue: I booted into the patched 3.17 kernel, disabled wireless in NM, and re-booted - and the error appeared as usual. Do you want another logfile?
Comment 22 Emmanuel Grumbach 2014-10-23 18:58:02 UTC
Same awful NMI thing with microcode SW error?
If yes, I don't need the log.

Still waiting for an answer from the firmware team
Comment 23 Ralf 2014-10-23 19:13:07 UTC
Yes, the error looks pretty much the same, except for some changed hex numbers.

I did some more tests rebooting from "old, good" to "new, bad" kernels and back, and refined my theory for reproducing the crash in these cases.

When I booted into 3.14 (having WiFi disabled), I got the error on boot as expected. If I just re-boot into the same kernel again at this point, the error remains, as mentioned in the initial report. If I enable WiFi (after a module reload), then disable it again and boot to a bad kernel, there's no error. If I now immediately reboot into the same kernel, there's again no error. if I enable and disable WiFi, the error comes back on the next reboot.

So my current theory is that it's the kernel version that was used when *disabling WiFi*, that's deciding whether the next boot will fail to load the module or not.
Comment 24 Emmanuel Grumbach 2014-10-23 20:08:56 UTC
Ok - so you are saying that if you disable WiFi from 3.14, it works in the next boot and if you disable WiFi in 3.17 it won't - right?

I'll try to diff the two kernels.
Comment 25 Ralf 2014-10-24 08:25:21 UTC
(In reply to Emmanuel Grumbach from comment #24)
> Ok - so you are saying that if you disable WiFi from 3.14, it works in the
> next boot and if you disable WiFi in 3.17 it won't - right?

Yes. If I boot 3.14, enable and disable WiFi, I can boot into any kernel and re-boot as often as I want, there's no error. Once I enable and disable WiFi on 3.16 or 3.17 (not sure about 3.15), the next reboot gives the error.

> I'll try to diff the two kernels.

I started doing a bisect, restricted to the iwlwifi folder. Since testing requires 2 reboots, which means I can not really do it while working at the same time, it will take a while till that's completed. The bisect log so far is

git bisect start 'drivers/net/wireless/iwlwifi/'
# good: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect good 455c6fdbd219161bd09b1165f11699d6d73de11c
# bad: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect bad 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [198890258fc0f9e3270ed1c1794b7610dad92ada] iwlwifi: mvm: Handle power management constraints for additional use-cases
git bisect bad 198890258fc0f9e3270ed1c1794b7610dad92ada

(I didn't actually test the 3.16 and 3.14 upstream tags, just the Debian kernels, and hope that works out...)
Comment 26 Ralf 2014-10-24 08:46:38 UTC
(In reply to Ralf Jung from comment #25)
> (In reply to Emmanuel Grumbach from comment #24)
> > Ok - so you are saying that if you disable WiFi from 3.14, it works in the
> > next boot and if you disable WiFi in 3.17 it won't - right?
> 
> Yes. If I boot 3.14, enable and disable WiFi, I can boot into any kernel and
> re-boot as often as I want, there's no error. Once I enable and disable WiFi
> on 3.16 or 3.17 (not sure about 3.15), the next reboot gives the error.

... unfortunately, cross-kernel reboots do not seem totally consistent. I just tested 01a9ca51, and while rebooting within this kernel (after disabling WiFi from within this kernel) is fine, re-booting into 3.16 is not (so, disabling WiFi with 01a9ca51 and the re-booting into 3.16, there's the error again, which I didn't expect to happen). As usual, things are never as simple as they seem, but at least the "reboot to same kernel" remains consistent so far, so I have a pattern I can use for bisecting.
Comment 27 Emmanuel Grumbach 2014-10-24 10:39:44 UTC
I am extremely grateful for the work you are doing.
I'll look carefully at the diff between 3.14 and 3.16 and try to see if I see something.

Unfortunately, I am not sure the breaking commit is in iwlwifi. The bug is weird enough to be caused by a lot of reasons...

I'll also try to reproduce.

All this, on Sunday.
Comment 28 Ralf 2014-10-24 10:43:30 UTC
(In reply to Emmanuel Grumbach from comment #27)
> I am extremely grateful for the work you are doing.
> I'll look carefully at the diff between 3.14 and 3.16 and try to see if I
> see something.

I'd like the issue to get fixed, and I don't have the knowledge to do that myself, so that's the least I can do ;-)
Also, thanks for working on this - by far not every bug I report gets this amount of attention.

> Unfortunately, I am not sure the breaking commit is in iwlwifi. The bug is
> weird enough to be caused by a lot of reasons...

Well, yes, that's true. But bisecting *all* commits would take ages, and even if this does not find *the* offending commit, it should reduce the search space. Or so I think. If you say it's probably of no help, I'll step the bisect.
Comment 29 Emmanuel Grumbach 2014-10-24 13:05:11 UTC
One thing came through my mind.
Are you sure you are using the same firmware with both kernels?
3.14 supports -8.ucode, but 3.14.6+ supports -9.ucode (I think).
You can check the version being loaded by checking the:

 iwlwifi 0000:01:00.0: loaded firmware version XXX

message in the log.
Comment 30 Ralf 2014-10-24 13:14:57 UTC
This is with the 3.14 Debian kernel, which says it's based on 3.14.15:

[   12.027237] iwlwifi 0000:03:00.0: irq 51 for MSI/MSI-X
[   12.033021] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-9.ucode
[   12.033210] iwlwifi 0000:03:00.0: loaded firmware version 25.222.9.0 op_mode iwlmvm
[   12.049205] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[   12.049647] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[   12.049939] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S

So it seems this one's using the new firmware. I will remember to check this during my bisecting.
Comment 31 Ralf 2014-10-24 13:29:42 UTC
Created attachment 154831 [details]
dmesg log with kernel a812cba9

If I'm not mistaken, this is a log of the same problem happening with a ucode 8 firmware (22.24.8.0) with the a812cba9 kernel (git describe calls this "v3.14-rc2-571-ga812cba").

I also had a spurious working re-boot with this kernel, so maybe the issue is only "almost reproducible". Seems I have to re-test 01a9ca51.
Comment 32 Emmanuel Grumbach 2014-10-24 13:31:52 UTC
BTW - While at it you may want to upgrade the firmware to

https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=1f9f9df353b11c9ea0130dfe68053aaaee376df3
Comment 33 Ralf 2014-10-25 10:26:24 UTC
Unfortunately, bisecting just showed that the issue is less reproducible than I thought. I did the following (with the old firmware, to have less stuff changing) for various kernels:
* Boot into kernel
* Make WiFi work (it may be necessary to reload the module)
* Enable WiFi in NM if it was disabled
* Disable Wifi in NM
* Reboot into same kernel

Ultimately, I got the error with every kernel I tested. Including upstream vanilla v3.14, and Debian 3.14. It seems to me that the probability of this happening is way lower with the older kernels (it was like 1 or 2 boots out of 6 failing, whereas it's more like 4 or 5 out of 6 for newer kernels), but that may be just my impression. I also don't understand anymore why I never saw this error until around 3 weeks or a month ago.

I'm now updating the firmware, and will keep you posted if that has any effect.
Comment 34 Emmanuel Grumbach 2014-10-25 16:11:49 UTC
Note that from my analysis, the error (NMI thing) is related to PHY code in the firmware - so it is highly unpredictable.

Last time I had a user with such a problem, he replaced the card and everything got solved...
Comment 35 Ralf 2014-10-25 16:16:01 UTC
(In reply to Emmanuel Grumbach from comment #34)
> Note that from my analysis, the error (NMI thing) is related to PHY code in
> the firmware - so it is highly unpredictable.

That describes it ;-)

> Last time I had a user with such a problem, he replaced the card and
> everything got solved...

This is already the 2nd card in that laptop (the original one didn't do 5GHz WiFi), and currently it's working fine if I just keep it always enabled...
Also, according to <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763172>, I'm not alone.
Comment 36 Emmanuel Grumbach 2014-10-25 19:10:36 UTC
can you please reproduce with:

debug=0xf00ff passed as module parameter to iwlwifi?

thanks.
Comment 37 Ralf 2014-10-26 09:26:47 UTC
Created attachment 155111 [details]
dmesg output with 3.16 kernel, firmware 25.228

There was an update to the Debian kernel, so I'm now at something based on 3.16.5. I also upgraded to the firmware you mentioned.

Finally, I added the following to /etc/modprobe.d/iwlwifi.conf:

  options iwlwifi debug=0xf00ff

However, I cannot see any additional debug output. Do I need to compile the kernel with a particular option for this to have any effect?
Comment 38 Emmanuel Grumbach 2014-10-26 10:05:39 UTC
[   11.489669] iwlwifi: unknown parameter 'debug' ignored

CONFIG_IWLWIFI_DEBUG wasn't selected.
Comment 39 Ralf 2014-10-26 10:23:08 UTC
Created attachment 155121 [details]
dmesg output with vanilla 3.17, debug enabled

I configured a vanilla 3.17 kernel appropriately and reproduced the issue, the log is attached.
After booting, I first re-loaded the module, then enabled WiFi, then disabled it again. I marked those steps in the log with lines containing "GREPME".
Comment 40 Emmanuel Grumbach 2014-10-26 11:29:46 UTC
Ok - I see what is happening... Hell...

I am not sure I will be able to solve the NMI - it is a bad race that can happen. I am just trying to find a way not to force you to reload the module.

I guess you are not switching the RFKILL button in the middle of the boot right?

So basically, the platform is pulling the RFKILL line in a critical flow - when we run the INIT firmware - this can take a 100ms or so - but getting an RFKILL interrupt during this time is ... bad...

WIP
Comment 41 Ralf 2014-10-26 11:45:28 UTC
Oh joy, a race condition.

(In reply to Emmanuel Grumbach from comment #40)
> I guess you are not switching the RFKILL button in the middle of the boot
> right?
> 
> So basically, the platform is pulling the RFKILL line in a critical flow -
> when we run the INIT firmware - this can take a 100ms or so - but getting an
> RFKILL interrupt during this time is ... bad...

I'm not switching that button, no. However, systemd is restoring the rfkill state somewhere during boot, and usually fairly early. It may well be that this races with initialising the module.

It sounds to me like the right thing to do would be not to forward that request (it's coming from software, after all) while the hardware is in the "critical flow". But that's probably the naive view of someone not knowing anything about hardware drivers ;-)
Comment 42 Emmanuel Grumbach 2014-10-26 12:13:41 UTC
I have no clue about systemd - what but if the SW RFKill state is propagated through the HW Rfkill line, this is really bad...

We need to know about RFKill much earlier. If we load the firmware and the driver isn't aware about the state, bad things will happen, since the firmware *will* know - it tries to access the radio registers ... and fails.
Comment 43 Ralf 2014-10-26 12:31:05 UTC
(In reply to Emmanuel Grumbach from comment #42)
> I have no clue about systemd - what but if the SW RFKill state is propagated
> through the HW Rfkill line, this is really bad...

I'm not sure what this means, but this is the line in systemd that restores the rfkill state (in src/rfkill/rfkill.c):

  udev_device_set_sysattr_value(device, "soft", value);

so it seems to be concerned only with the "soft" rfkill state, not the "hard" one. That function lives in libudev, it is documented to

  Update the contents of the sys attribute and the cached value of the device.

My impression is that it just writes to a file in /sys, and indeed the folders /sys/class/rfkill/rfkill* contain a file called "soft".

> We need to know about RFKill much earlier. If we load the firmware and the
> driver isn't aware about the state, bad things will happen, since the
> firmware *will* know - it tries to access the radio registers ... and fails.

I'm pretty sure that before I used systemd, nothing told the driver about this state. Or maybe NetworkManager did, but that happened way later during boot. I cannot find any initscript dealing with the rfkill stuff.
Honestly I'd expect the driver to be able to initialize the hardware without (possibly unreliable) help from userspace.

systemd has an option to disable this rfkill state restoration, I will try whether that changes the behaviour. However, I cannot reboot anymore today (just started a 20h copy job), so that'll have to wait until tomorrow.
Comment 44 Emmanuel Grumbach 2014-10-26 12:39:03 UTC
Right - soft and hard are confusing.

The "soft" of the platform becomes a line pullup which is ... HW as it is seen by the device.

(I have 2 NICs on my laptop)

$ rfkill list 
0: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
1: dell-wifi: Wireless LAN
	Soft blocked: no
	Hard blocked: no
4: phy0: Wireless LAN
	Soft blocked: no
	Hard blocked: no
5: phy1: Wireless LAN
	Soft blocked: no
	Hard blocked: no


$ sudo rfkill block 1
$ rfkill list 
0: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
1: dell-wifi: Wireless LAN
	Soft blocked: yes
	Hard blocked: no
4: phy0: Wireless LAN
	Soft blocked: no
	Hard blocked: no
5: phy1: Wireless LAN
	Soft blocked: no
	Hard blocked: yes


See?

or maybe systemd. This is mostly an OEM integration thing. It is always messy the way this kind of things are done.

I do agree that our driver should be able to work no-matter-what. And the possibility for this race isn't new to me. But back then it seemed a safe assumption to make :)
Comment 45 Ralf 2014-10-26 13:45:23 UTC
Ah, I see - it's "soft", but only for some platform driver (asus-wlan in my case). It's "hard" for iwlwifi, so it has no easy way to control this.
Things are always more complicated than they seem ;-)

$ sudo rfkill list
0: asus-wlan: Wireless LAN
        Soft blocked: yes
        Hard blocked: no
1: asus-bluetooth: Bluetooth
        Soft blocked: no
        Hard blocked: no
2: hci0: Bluetooth
        Soft blocked: no
        Hard blocked: no
3: phy0: Wireless LAN
        Soft blocked: yes
        Hard blocked: yes

Thanks for the explanation.
Comment 46 Emmanuel Grumbach 2014-10-26 13:48:16 UTC
exactly...

How do you disable WiFi?

Through the nice GUI of the NetworkManager?
Comment 47 Ralf 2014-10-26 13:50:49 UTC
(In reply to Emmanuel Grumbach from comment #46)
> exactly...
> 
> How do you disable WiFi?
> 
> Through the nice GUI of the NetworkManager?

Yes. In the past, the button on my keyboard sometimes failed to disable/enable WiFi, so I usually use NM. Also, I'm doing these tests at home or in my office, with an external screen attached and the lid closed, so I can't even access the key. (And it's just Fn+F2, so I believe it's "soft" as well - I can disable via that key, and then enabled through NM.)
Comment 48 Emmanuel Grumbach 2014-10-26 13:59:45 UTC
I just checked on my system...
The "Enabled WiFi" checkbox works just like on your system.

So I can see only one solution.

1) Make our FW more robust (it won't be tomorrow)
2) ask the systemd folks to play with RFKILL earlier (or maybe later)?

I know that option 2 is a bad hack...

One option for you for now is to blacklist iwlwifi and to load it later in your rc.local maybe?
Comment 49 Ralf 2014-10-26 14:02:49 UTC
(In reply to Emmanuel Grumbach from comment #48)
> One option for you for now is to blacklist iwlwifi and to load it later in
> your rc.local maybe?

I can play with that. I'm not sure if rc.local is executed, but there's /etc/modules-load.d/ for this purpose. It's probably executed way after the rfkill restoration. Though the latter may be confused by seeing less rfkill switches to restore than it saved the status of on last shutdown.
Comment 50 Emmanuel Grumbach 2014-10-26 14:31:52 UTC
I have thought a bit more. I'll try to come up with a patch.
It should help.

Stay tuned.
Comment 51 Emmanuel Grumbach 2014-10-26 15:06:43 UTC
Created attachment 155191 [details]
tentative fix

can you please try this?

It should help.
I am sure it won't solve all the possible races, but it will at least handle the most common one.

Let me know.
Comment 52 Emmanuel Grumbach 2014-10-26 15:07:02 UTC
Of course - I can't really test it..
Comment 53 Ralf 2014-10-26 15:22:07 UTC
The patch doesn't apply to v3.17:

$ patch -p1 < ../iwlwifi-rfkill.patch
patching file drivers/net/wireless/iwlwifi/mvm/fw.c
Hunk #1 succeeded at 282 (offset -78 lines).
Hunk #2 FAILED at 417.
Hunk #3 succeeded at 361 (offset -85 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/net/wireless/iwlwifi/mvm/fw.c.rej
patching file drivers/net/wireless/iwlwifi/mvm/mvm.h
Hunk #1 succeeded at 541 (offset -47 lines).
patching file drivers/net/wireless/iwlwifi/mvm/ops.c
Hunk #1 succeeded at 744 (offset -99 lines).
Hunk #2 succeeded at 753 (offset -99 lines).
Comment 54 Emmanuel Grumbach 2014-10-26 15:29:26 UTC
Created attachment 155201 [details]
tentative fix

sorry....
Comment 55 Ralf 2014-10-26 15:38:09 UTC
That applies :)

I'll let you know after the next reboot, which however as previously mentioned will only happen tomorrow.
Comment 56 Emmanuel Grumbach 2014-10-26 15:41:13 UTC
Ok - thanks.

I can't do much testing here. I'll try to look at the code again later to see if I missed something.

I basically try to kill the firmware if I get the RFKill interrupt. But if the interrupt is long to come and the firmware dies before, that's another story which I have to check.

I might use you to hack on the code (I'll provide a patch) and provoke this situation to see how it is handled.
Comment 57 Ralf 2014-10-27 19:08:53 UTC
Created attachment 155471 [details]
dmesg of v3.17 kernel with the path applied

It's now failing with a different symptom on boot: The kernel module crashes/timeouts, or so, and I get a CPU-side kernel backtrace. Doing the reload dance as usual fixes the problem.
Comment 58 Ralf 2014-10-27 19:17:40 UTC
I can also confirm that setting "systemd.restore_state=0" on the kernel cmdline as documented by systemd-rfkill(8) works around the issue. I tried 3 boots, and didn't get the error. But after all, it's a race condition - still I assume based on your analysis that this is a work-around, and I'll report here if I ever see the failure with that cmdline.
Something is still restoring/remembering the previous state, so WiFi is disabled after boot.
Comment 59 Emmanuel Grumbach 2014-10-27 19:55:30 UTC
Created attachment 155481 [details]
tentative fix - take 2

Please try this updated version.
Comment 60 Emmanuel Grumbach 2014-10-27 19:56:34 UTC
(In reply to Ralf Jung from comment #58)
> I can also confirm that setting "systemd.restore_state=0" on the kernel
> cmdline as documented by systemd-rfkill(8) works around the issue. I tried 3
> boots, and didn't get the error. But after all, it's a race condition -
> still I assume based on your analysis that this is a work-around, and I'll
> report here if I ever see the failure with that cmdline.
> Something is still restoring/remembering the previous state, so WiFi is
> disabled after boot.

I will fix the bug - don't worry :)

I am full steam on it :)
Comment 61 Ralf 2014-10-27 21:15:10 UTC
Created attachment 155511 [details]
dmesg of kernel 3.17, patch v2

Here you go. I'm getting an error printed on boot ("Failed to run INIT ucode: -5"). The card is initially gone. After re-loading the module, it appears. I can then enable WiFi, but it takes exceptionally long.

> I will fix the bug - don't worry :)
> 
> I am full steam on it :)

:)
I thought the information may be useful for you.
Comment 62 Emmanuel Grumbach 2014-10-28 08:47:00 UTC
Created attachment 155701 [details]
tentative fix - take 3

Let's try this.

I am sorry - I wish I could reproduce the issue...
Comment 63 Ralf 2014-10-29 18:41:03 UTC
I'm sorry, there is an imminent deadline so I'm currently either working or sleeping. It will all be over tomorrow, then I will test patch v3.
Comment 64 Ralf 2014-10-30 18:33:58 UTC
Created attachment 155881 [details]
dmesg of kernel 3.17, patch v3

Same behaviour as with v2. dmesg attached.
Comment 65 Emmanuel Grumbach 2014-10-30 19:32:26 UTC
Hm... This is really strange...
May I ask you to double check?
Also, the stack is not very trustworthy.

the test_and_clear thing should have avoided the WARNING.

I'll re-thing about it
Comment 66 Ralf 2014-10-31 09:37:11 UTC
Created attachment 156011 [details]
dmesg of kernel 3.17, patch v3, try 2

Well, here you go:

$ wget 'https://bugzilla.kernel.org/attachment.cgi?id=155701' -O ../iwlwifi-rfkill.patch
$ git reset --hard v3.17
$ patch -p1 < ../iwlwifi-rfkill.patch
$ nice make deb-pkg -j5
# install packages
# reboot
# disable WiFi
# reboot
$ sudo modprobe -r iwlmvm && sudo modprobe -r iwlwifi && sudo modprobe iwlwifi
# enable WiF
$ dmesg > dmesg-3.17-patch3
Comment 67 Emmanuel Grumbach 2014-10-31 09:44:29 UTC
Ok, thanks. I'll provide another patch on sunday. I won't try to fix, but I'll try to understand better what is happening.
Comment 68 Emmanuel Grumbach 2014-11-02 09:22:16 UTC
I still need to work on a patch that can give data, but please, recompile your kernel with CONFIG_FRAME_POINTER. That will give me a reliable stack.

Thanks.
Comment 69 Ralf 2014-11-02 10:30:48 UTC
Created attachment 156151 [details]
dmesg of kernel 3.17, patch v3, try 3

I had to reboot three times to get a dmesg with a stack trace again - the other two times, the module failed to load, but without a stacktrace.
Comment 70 Emmanuel Grumbach 2014-11-02 11:08:16 UTC
Created attachment 156171 [details]
tentative fix + debug info - take4

Let's try this.

I am now dumping the stack in various places + adding debug data. Don't be afraid by the nasty logs you'll get.

Thanks.
Comment 71 Ralf 2014-11-02 11:13:00 UTC
Should I try to catch a dmesg like the one I just posted, where the module crashes (or whatever triggers the stackdump) and I get various timeout messages on boot - or should I just go for any dmesg where module loading fails? (If there are stacktraces in the dmesg in any case, I am not sure I can differentiate the two cases.)
Comment 72 Emmanuel Grumbach 2014-11-02 11:18:04 UTC
as long as you don't have wifi when you try to re-enable WiFi after boot, it is a valuable log for me.

Thanks.
Comment 73 Ralf 2014-11-02 11:21:19 UTC
Created attachment 156181 [details]
dmesg of kernel 3.17, patch 4

As you said, it contains lots of scary traces ;-)

> as long as you don't have wifi when you try to re-enable WiFi after boot, it
> is
> a valuable log for me.

I did not have WiFi immediately after boot. As usual, after re-loading the module, WiFi works fine. Is that what you mean?
Comment 74 Emmanuel Grumbach 2014-11-02 11:33:35 UTC
Created attachment 156191 [details]
tentative fix + debug info - take5

what about this?

Thanks.
Comment 75 Ralf 2014-11-02 11:48:08 UTC
Created attachment 156201 [details]
dmesg of kernel 3.17, patch 5

Here you go.
Comment 76 Emmanuel Grumbach 2014-11-02 12:00:23 UTC
Created attachment 156221 [details]
tentative fix + debug info - take6

Sorry - I had a stupid bug in my previous patch...
Comment 77 Ralf 2014-11-02 12:44:41 UTC
Created attachment 156241 [details]
dmesg of kernel 3.17, patch 6 (no failure!)

I did three reboots with patch 6, and each time the card came up as expected (i.e. the device was present, but disabled). The only glitch is a message shown on the console during boot:

  iwlwifi 0000:03:00.0: We were cut short by RFKILL - all is good

Judging from the message, that's probably an expected glitch ;-) . I still attached a dmesg, maybe it tells you something.

Unless you send me another version, I will keep running my system with this patch applied and tell you if anything weird happens.
Comment 78 Emmanuel Grumbach 2014-11-02 12:54:04 UTC
So I guess this bug is fixed?

I will now send this patch for review.
I'll provide the final patch after the review cycle to make sure it still works for you.
Comment 79 Emmanuel Grumbach 2014-11-02 12:57:29 UTC
actually, I was expecting the "All is good" message :)
Comment 80 Emmanuel Grumbach 2014-11-02 13:07:19 UTC
I still need to provide a clean patch - after the internal review cycle.
Comment 81 Ralf 2014-11-02 13:09:46 UTC
(In reply to Emmanuel Grumbach from comment #79)
> actually, I was expecting the "All is good" message :)

And you didn't tell me to avoid spoiling your control group... :D

Since the bug was not 100% reproducible, it's hard to tell whether it is fixed. It certainly looks like it is, though - but I only dare say so after some more days without an error ;-)
Comment 82 Emmanuel Grumbach 2014-11-02 13:57:55 UTC
Created attachment 156251 [details]
final fix - before review

This is clean version of the patch for this issue.
I may still request to test the final version after review, but I don't expect it'll change a lot.

Thanks a lot for your patience testing all my patches!
Comment 83 Ralf 2014-11-02 14:24:10 UTC
Two more successful boots, now with the final patch :)

Tested-by: Ralf Jung <post@ralfj.de>

> Thanks a lot for your patience testing all my patches!

Thanks a lot for fixing the issue :D
Comment 84 Emmanuel Grumbach 2014-11-03 13:31:30 UTC
Fix is now in my tree - will be sent in the next pull request.
Closing the bug.


https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/iwlwifi-fixes.git/commit/?id=31b8b343e019e0a0c57ca9c13520a87f9cab884b

Note You need to log in before you can comment on or make changes to this bug.