Bug 86231
Description
Ralf
2014-10-14 10:11:17 UTC
Please try latest firmware. Your -9.ucode can and should be upgraded.

BTW - I find it very hard to believe that this is a regression.
I don't see how this bug would not have happened with an earlier kernel. Maybe you are using a newer kernel that loads a newer firmware that has a regression, but in this case, it is a regression in the firmware.
In any case, please use the latest firmware from here:
https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462
and if you still have issues - please paste / attach the dmesg output into this bug.

(In reply to Emmanuel Grumbach from comment #2)
> BTW - I find it very hard to believe that this is a regression.
>
> I don't see how this bug would not have happened with an earlier kernel.
> Maybe you are using a newer kernel that loads a newer firmware that has a regression,
> but in this case, it is a regression in the firmware.

The occurrence of the bug correlates with the 3.16 kernel landing in Debian testing. The last firmware update landed end of June. I did not see this happen for three months after the firmware upgrade. Since the bug is not reliably reproducible, it's of course hard to tell when exactly it appeared.

> In any case, please use the latest firmware from here:
>
> https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462

Okay, will do.

Created attachment 154081 [details]
dmesg output of the problem with 3.16 Debian kernel

With the 3.16 Debian kernel, the bug happens really often (I am under the impression that it's way more frequent than with 3.17-rc7 vanilla), so I can already attach a new dmesg with the error message. However, the firmware version printed in there did not even change, weirdly enough.

$ dmesg | grep iwlwifi | head -n 5
[ 13.825995] iwlwifi 0000:03:00.0: irq 50 for MSI/MSI-X
[ 13.829736] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-9.ucode
[ 13.829941] iwlwifi 0000:03:00.0: loaded firmware version 23.214.9.0 op_mode iwlmvm
[ 13.874385] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[ 13.874740] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
$ sha1sum /lib/firmware/iwlwifi-7260-9.ucode
21ab34f5f5d71a15d56c34551734d9414059893e /lib/firmware/iwlwifi-7260-9.ucode
$ curl -s 'https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=e2844339cb779c00f856c554958e16beca99b462' | sha1sum
21ab34f5f5d71a15d56c34551734d9414059893e -

Stupid me.... I pointed to the wrong commit.
Can you please take the firmware from the master branch?
Thanks

Created attachment 154121 [details]
dmesg output with 3.16 kernel, firmware 25.222
It just did it again, the dmesg is attached.
Interestingly, the first attempt to re-load the module also failed (at 135 in the log) with a "Timeout waiting for hardware access". That never happened before.
The second re-load worked (at 173 in the log), and then the module was functional.
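The checksum comparison earlier in the thread (local `.ucode` file versus the downloaded one) can be sketched as follows. This is only an illustration: the helper names `sha1_of_file` and `same_firmware` are made up here, and the reference digest is whatever you obtained from the upstream file.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_firmware(local_path, reference_sha1):
    """True if the local file hashes to the given reference digest."""
    return sha1_of_file(local_path) == reference_sha1.strip().lower()
```

If both digests are identical, as they were in the transcript above, the "new" download is the file that was already installed, so no change in the reported firmware version is to be expected.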
I can see that you have the RFKILL switch set to forbid WiFi activity.
Can you please change this and check again?
I'll try to reproduce with the RFKILL switch set to forbid WiFi as well.

(In reply to Emmanuel Grumbach from comment #7)
> I can see that you have the RFKILL switch set to forbid WiFi activity.
> Can you please change this and check again?

Just to be sure: That RFKILL, is that both the software button to disable WiFi I have in NetworkManager, and the (not always reliably working...) button on my keyboard?
Currently, "dmesg | grep -i kill" has no output. And I can see nearby networks. I assume this means that I have the kill bits disabled, so I won't touch either button anymore.

RFKILL is the button on your keyboard or a real switch.
Should I understand that you don't experience any issue anymore?

(In reply to Emmanuel Grumbach from comment #9)
> Should I understand that you don't experience any issue anymore?

I had a single successful boot this morning. But since the issue is not reproducible, that doesn't really say much. I've had successful boots before. So I cannot yet say whether the issue is solved (or better, worked around) by not using the kill switch. Now that I know that the kill state could have an impact, I will take this into account when looking for patterns.

Ok - I see. FYI - this issue is very tricky. The firmware gets stuck in a flow it really shouldn't... I don't really see what can be happening there but I'll try to ask people...

I did some more testing (5 boots in total), and there's definitely a strong correlation between having WiFi disabled in NetworkManager, and the card crashing on next boot: In these tests, the boot always succeeded when WiFi was enabled on last shutdown (3 times), and always failed when it was disabled on last shutdown (2 times). I also checked the older dmesg logs I submitted, and the kill bit was enabled for all of them. So, it seems, the issue is much more reproducible than I thought.

I did the same test with the 3.14 Debian kernel, and the pattern seems to be (pending confirmation, but I need to get back to work now ;-) that initialisation on boot will fail *if the last shutdown was with the 3.16 kernel and WiFi disabled*. It doesn't matter which kernel is used for booting - if I re-boot from the 3.16 to the 3.14 kernel, I get the error on boot if WiFi was disabled. If I then re-load the module, but keep WiFi disabled, the next boot will succeed. Not sure if this makes any sense, and maybe I am "overfitting" here...

Ok - thanks for the information. I am checking with the firmware people.

Created attachment 154641 [details]
fix - not sure
Hi,
can you please test this patch?
Thanks
Which kernel should I apply this to? I tried 3.17 and 3.17.1, but it failed on both.

It didn't apply or it doesn't fix the bug?

It didn't apply.

$ git reset --hard v3.17
HEAD is now at bfe01a5 Linux 3.17
$ patch -p1 < ~/Desktop/iwlwifi.patch
patching file drivers/net/wireless/iwlwifi/mvm/ops.c
Hunk #1 FAILED at 455.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/iwlwifi/mvm/ops.c.rej

This should apply on 3.17*

diff --git a/drivers/net/wireless/iwlwifi/mvm/ops.c b/drivers/net/wireless/iwlwifi/mvm/ops.c
index 610dbcb..79bdff9 100644
--- a/drivers/net/wireless/iwlwifi/mvm/ops.c
+++ b/drivers/net/wireless/iwlwifi/mvm/ops.c
@@ -415,6 +415,7 @@ iwl_op_mode_mvm_start(struct iwl_trans *trans, const struct iwl_cfg *cfg,
 		mvm->first_agg_queue = 12;
 	}
 	mvm->sf_state = SF_UNINIT;
+	mvm->cur_ucode = IWL_UCODE_INIT;
 	mutex_init(&mvm->mutex);
 	mutex_init(&mvm->d0i3_suspend_mutex);

It does, after converting spaces back to tabs. I'll reboot later today, back to work for now.

Unfortunately, this does not fix the issue: I booted into the patched 3.17 kernel, disabled wireless in NM, and re-booted - and the error appeared as usual. Do you want another logfile?

Same awful NMI thing with microcode SW error?
If yes, I don't need the log.
Still waiting for an answer from the firmware team

Yes, the error looks pretty much the same, except for some changed hex numbers.

I did some more tests rebooting from "old, good" to "new, bad" kernels and back, and refined my theory for reproducing the crash in these cases. When I booted into 3.14 (having WiFi disabled), I got the error on boot as expected. If I just re-boot into the same kernel again at this point, the error remains, as mentioned in the initial report. If I enable WiFi (after a module reload), then disable it again and boot to a bad kernel, there's no error. If I now immediately reboot into the same kernel, there's again no error. If I enable and disable WiFi, the error comes back on the next reboot.
So my current theory is that it's the kernel version that was used when *disabling WiFi* that decides whether the next boot will fail to load the module or not.

Ok - so you are saying that if you disable WiFi from 3.14, it works in the next boot and if you disable WiFi in 3.17 it won't - right?
I'll try to diff the two kernels.

(In reply to Emmanuel Grumbach from comment #24)
> Ok - so you are saying that if you disable WiFi from 3.14, it works in the
> next boot and if you disable WiFi in 3.17 it won't - right?

Yes. If I boot 3.14, enable and disable WiFi, I can boot into any kernel and re-boot as often as I want, there's no error. Once I enable and disable WiFi on 3.16 or 3.17 (not sure about 3.15), the next reboot gives the error.

> I'll try to diff the two kernels.

I started doing a bisect, restricted to the iwlwifi folder. Since testing requires 2 reboots, which means I can not really do it while working at the same time, it will take a while till that's completed. The bisect log so far is

git bisect start 'drivers/net/wireless/iwlwifi/'
# good: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect good 455c6fdbd219161bd09b1165f11699d6d73de11c
# bad: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect bad 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [198890258fc0f9e3270ed1c1794b7610dad92ada] iwlwifi: mvm: Handle power management constraints for additional use-cases
git bisect bad 198890258fc0f9e3270ed1c1794b7610dad92ada

(I didn't actually test the 3.16 and 3.14 upstream tags, just the Debian kernels, and hope that works out...)

(In reply to Ralf Jung from comment #25)
> (In reply to Emmanuel Grumbach from comment #24)
> > Ok - so you are saying that if you disable WiFi from 3.14, it works in the
> > next boot and if you disable WiFi in 3.17 it won't - right?
>
> Yes. If I boot 3.14, enable and disable WiFi, I can boot into any kernel and
> re-boot as often as I want, there's no error. Once I enable and disable WiFi
> on 3.16 or 3.17 (not sure about 3.15), the next reboot gives the error.

... unfortunately, cross-kernel reboots do not seem totally consistent. I just tested 01a9ca51, and while rebooting within this kernel (after disabling WiFi from within this kernel) is fine, re-booting into 3.16 is not (so, disabling WiFi with 01a9ca51 and then re-booting into 3.16, there's the error again, which I didn't expect to happen). As usual, things are never as simple as they seem, but at least the "reboot to same kernel" remains consistent so far, so I have a pattern I can use for bisecting.

I am extremely grateful for the work you are doing.
I'll look carefully at the diff between 3.14 and 3.16 and try to see if I see something.
Unfortunately, I am not sure the breaking commit is in iwlwifi. The bug is weird enough to be caused by a lot of reasons...
I'll also try to reproduce. All this, on Sunday.

(In reply to Emmanuel Grumbach from comment #27)
> I am extremely grateful for the work you are doing.
> I'll look carefully at the diff between 3.14 and 3.16 and try to see if I
> see something.

I'd like the issue to get fixed, and I don't have the knowledge to do that myself, so that's the least I can do ;-)
Also, thanks for working on this - by far not every bug I report gets this amount of attention.

> Unfortunately, I am not sure the breaking commit is in iwlwifi. The bug is
> weird enough to be caused by a lot of reasons...

Well, yes, that's true. But bisecting *all* commits would take ages, and even if this does not find *the* offending commit, it should reduce the search space. Or so I think. If you say it's probably of no help, I'll stop the bisect.

One thing came to my mind. Are you sure you are using the same firmware with both kernels?
3.14 supports -8.ucode, but 3.14.6+ supports -9.ucode (I think).
You can check the version being loaded by checking the:
iwlwifi 0000:01:00.0: loaded firmware version XXX
message in the log.
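A small sketch of the check suggested above: pull the "loaded firmware version" out of a dmesg dump so that logs from different kernels can be compared. The function names are hypothetical, and reading the third dotted field as the ucode API number (the 9 in -9.ucode) is an assumption based only on the version strings quoted in this thread.

```python
import re

# Matches e.g. "iwlwifi 0000:03:00.0: loaded firmware version 25.222.9.0 op_mode iwlmvm"
FW_LINE = re.compile(r"iwlwifi \S+: loaded firmware version (\S+)")


def loaded_firmware_versions(dmesg_text):
    """Return every firmware version string reported by iwlwifi in the log."""
    return FW_LINE.findall(dmesg_text)


def ucode_api(version):
    """Third dotted field of the version (assumed to be the ucode API number)."""
    return int(version.split(".")[2])
```

For example, feeding it the output of `dmesg` from two boots makes it easy to see whether both kernels really loaded the same firmware.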
This is with the 3.14 Debian kernel, which says it's based on 3.14.15:

[ 12.027237] iwlwifi 0000:03:00.0: irq 51 for MSI/MSI-X
[ 12.033021] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-9.ucode
[ 12.033210] iwlwifi 0000:03:00.0: loaded firmware version 25.222.9.0 op_mode iwlmvm
[ 12.049205] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[ 12.049647] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[ 12.049939] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S

So it seems this one's using the new firmware. I will remember to check this during my bisecting.

Created attachment 154831 [details]
dmesg log with kernel a812cba9
If I'm not mistaken, this is a log of the same problem happening with a ucode 8 firmware (22.24.8.0) with the a812cba9 kernel (git describe calls this "v3.14-rc2-571-ga812cba").
I also had a spurious working re-boot with this kernel, so maybe the issue is only "almost reproducible". Seems I have to re-test 01a9ca51.
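Why a spurious good boot hurts a bisect can be made concrete with a little arithmetic. Assume (purely for illustration) that a bad kernel fails each boot independently with probability p; it then survives n boots in a row with probability (1 - p)^n. The hypothetical helper below computes how many clean boots are needed before calling a commit "good" with a given confidence.

```python
import math


def boots_needed(p_fail, confidence=0.95):
    """Smallest n such that a bad kernel passes n boots with probability <= 1 - confidence.

    p_fail is the assumed per-boot failure probability on a bad kernel.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

With a failure rate of 3 in 4 boots, three clean boots are enough for 95% confidence; at 1 in 6 it takes seventeen, which is a lot when every test needs two reboots.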
BTW - While at it you may want to upgrade the firmware to
https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=1f9f9df353b11c9ea0130dfe68053aaaee376df3

Unfortunately, bisecting just showed that the issue is less reproducible than I thought. I did the following (with the old firmware, to have less stuff changing) for various kernels:
* Boot into kernel
* Make WiFi work (it may be necessary to reload the module)
* Enable WiFi in NM if it was disabled
* Disable WiFi in NM
* Reboot into same kernel

Ultimately, I got the error with every kernel I tested. Including upstream vanilla v3.14, and Debian 3.14. It seems to me that the probability of this happening is way lower with the older kernels (it was like 1 or 2 boots out of 6 failing, whereas it's more like 4 or 5 out of 6 for newer kernels), but that may be just my impression. I also don't understand anymore why I never saw this error until around 3 weeks or a month ago.

I'm now updating the firmware, and will keep you posted if that has any effect.

Note that from my analysis, the error (NMI thing) is related to PHY code in the firmware - so it is highly unpredictable.
Last time I had a user with such a problem, he replaced the card and everything got solved...

(In reply to Emmanuel Grumbach from comment #34)
> Note that from my analysis, the error (NMI thing) is related to PHY code in
> the firmware - so it is highly unpredictable.

That describes it ;-)

> Last time I had a user with such a problem, he replaced the card and
> everything got solved...

This is already the 2nd card in that laptop (the original one didn't do 5GHz WiFi), and currently it's working fine if I just keep it always enabled...
Also, according to <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763172>, I'm not alone.

can you please reproduce with:
debug=0xf00ff
passed as module parameter to iwlwifi?
thanks.

Created attachment 155111 [details]
dmesg output with 3.16 kernel, firmware 25.228
There was an update to the Debian kernel, so I'm now at something based on 3.16.5. I also upgraded to the firmware you mentioned.
Finally, I added the following to /etc/modprobe.d/iwlwifi.conf:
options iwlwifi debug=0xf00ff
However, I cannot see any additional debug output. Do I need to compile the kernel with a particular option for this to have any effect?
[ 11.489669] iwlwifi: unknown parameter 'debug' ignored

CONFIG_IWLWIFI_DEBUG wasn't selected.

Created attachment 155121 [details]
dmesg output with vanilla 3.17, debug enabled
I configured a vanilla 3.17 kernel appropriately and reproduced the issue, the log is attached.
After booting, I first re-loaded the module, then enabled WiFi, then disabled it again. I marked those steps in the log with lines containing "GREPME".
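Before rebooting with `debug=0xf00ff`, it can save a round trip to check whether the running kernel was built with CONFIG_IWLWIFI_DEBUG at all. A minimal sketch, assuming the kernel configuration is available as text (on Debian it is usually installed as /boot/config-<release>; that path is an assumption, not something stated in this thread). The helper name is hypothetical.

```python
def config_option_state(config_text, option):
    """Return the value of a CONFIG_* option ('y', 'm', ...) or None if unset/absent."""
    prefix = option + "="
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
        if line == "# " + option + " is not set":
            return None
    return None
```

If this returns None for CONFIG_IWLWIFI_DEBUG, the module will ignore the `debug` parameter, which matches the "unknown parameter 'debug' ignored" message seen above.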
Ok - I see what is happening... Hell...
I am not sure I will be able to solve the NMI - it is a bad race that can happen.
I am just trying to find a way not to force you to reload the module.

I guess you are not switching the RFKILL button in the middle of the boot right?
So basically, the platform is pulling the RFKILL line in a critical flow - when we run the INIT firmware - this can take 100ms or so - but getting an RFKILL interrupt during this time is ... bad...
WIP

Oh joy, a race condition.

(In reply to Emmanuel Grumbach from comment #40)
> I guess you are not switching the RFKILL button in the middle of the boot
> right?
>
> So basically, the platform is pulling the RFKILL line in a critical flow -
> when we run the INIT firmware - this can take 100ms or so - but getting an
> RFKILL interrupt during this time is ... bad...

I'm not switching that button, no. However, systemd is restoring the rfkill state somewhere during boot, and usually fairly early. It may well be that this races with initialising the module.
It sounds to me like the right thing to do would be not to forward that request (it's coming from software, after all) while the hardware is in the "critical flow". But that's probably the naive view of someone not knowing anything about hardware drivers ;-)

I have no clue about systemd - but if the SW RFKill state is propagated through the HW Rfkill line, this is really bad...
We need to know about RFKill much earlier. If we load the firmware and the driver isn't aware about the state, bad things will happen, since the firmware *will* know - it tries to access the radio registers ... and fails.

(In reply to Emmanuel Grumbach from comment #42)
> I have no clue about systemd - but if the SW RFKill state is propagated
> through the HW Rfkill line, this is really bad...

I'm not sure what this means, but this is the line in systemd that restores the rfkill state (in src/rfkill/rfkill.c):

udev_device_set_sysattr_value(device, "soft", value);

so it seems to be concerned only with the "soft" rfkill state, not the "hard" one. That function lives in libudev, it is documented to

Update the contents of the sys attribute and the cached value of the device.

My impression is that it just writes to a file in /sys, and indeed the folders /sys/class/rfkill/rfkill* contain a file called "soft".

> We need to know about RFKill much earlier. If we load the firmware and the
> driver isn't aware about the state, bad things will happen, since the
> firmware *will* know - it tries to access the radio registers ... and fails.

I'm pretty sure that before I used systemd, nothing told the driver about this state. Or maybe NetworkManager did, but that happened way later during boot. I cannot find any initscript dealing with the rfkill stuff.
Honestly I'd expect the driver to be able to initialize the hardware without (possibly unreliable) help from userspace.

systemd has an option to disable this rfkill state restoration, I will try whether that changes the behaviour. However, I cannot reboot anymore today (just started a 20h copy job), so that'll have to wait until tomorrow.

Right - soft and hard are confusing.
The "soft" of the platform becomes a line pullup which is ... HW as it is seen by the device.

(I have 2 NICs on my laptop)

$ rfkill list
0: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
1: dell-wifi: Wireless LAN
	Soft blocked: no
	Hard blocked: no
4: phy0: Wireless LAN
	Soft blocked: no
	Hard blocked: no
5: phy1: Wireless LAN
	Soft blocked: no
	Hard blocked: no
$ sudo rfkill block 1
$ rfkill list
0: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
1: dell-wifi: Wireless LAN
	Soft blocked: yes
	Hard blocked: no
4: phy0: Wireless LAN
	Soft blocked: no
	Hard blocked: no
5: phy1: Wireless LAN
	Soft blocked: no
	Hard blocked: yes

See?

or maybe systemd.
This is mostly an OEM integration thing. The way this kind of thing is done is always messy.
I do agree that our driver should be able to work no-matter-what. And the possibility for this race isn't new to me. But back then it seemed a safe assumption to make :)

Ah, I see - it's "soft", but only for some platform driver (asus-wlan in my case). It's "hard" for iwlwifi, so it has no easy way to control this. Things are always more complicated than they seem ;-)

$ sudo rfkill list
0: asus-wlan: Wireless LAN
	Soft blocked: yes
	Hard blocked: no
1: asus-bluetooth: Bluetooth
	Soft blocked: no
	Hard blocked: no
2: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
3: phy0: Wireless LAN
	Soft blocked: yes
	Hard blocked: yes

Thanks for the explanation.

exactly...

How do you disable WiFi?

Through the nice GUI of the NetworkManager?

(In reply to Emmanuel Grumbach from comment #46)
> exactly...
>
> How do you disable WiFi?
>
> Through the nice GUI of the NetworkManager?

Yes. In the past, the button on my keyboard sometimes failed to disable/enable WiFi, so I usually use NM. Also, I'm doing these tests at home or in my office, with an external screen attached and the lid closed, so I can't even access the key.
(And it's just Fn+F2, so I believe it's "soft" as well - I can disable via that key, and then enable through NM.)

I just checked on my system... The "Enabled WiFi" checkbox works just like on your system.
So I can see only two options:
1) Make our FW more robust (it won't be tomorrow)
2) ask the systemd folks to play with RFKILL earlier (or maybe later)?

I know that option 2 is a bad hack...
One option for you for now is to blacklist iwlwifi and to load it later in your rc.local maybe?

(In reply to Emmanuel Grumbach from comment #48)
> One option for you for now is to blacklist iwlwifi and to load it later in
> your rc.local maybe?

I can play with that. I'm not sure if rc.local is executed, but there's /etc/modules-load.d/ for this purpose. It's probably executed way after the rfkill restoration. Though the latter may be confused by seeing fewer rfkill switches to restore than it saved the status of on last shutdown.

I have thought a bit more. I'll try to come up with a patch. It should help. Stay tuned.

Created attachment 155191 [details]
tentative fix
can you please try this?
It should help.
I am sure it won't solve all the possible races, but it will at least handle the most common one.
Let me know.
Of course - I can't really test it..

The patch doesn't apply to v3.17:

$ patch -p1 < ../iwlwifi-rfkill.patch
patching file drivers/net/wireless/iwlwifi/mvm/fw.c
Hunk #1 succeeded at 282 (offset -78 lines).
Hunk #2 FAILED at 417.
Hunk #3 succeeded at 361 (offset -85 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/net/wireless/iwlwifi/mvm/fw.c.rej
patching file drivers/net/wireless/iwlwifi/mvm/mvm.h
Hunk #1 succeeded at 541 (offset -47 lines).
patching file drivers/net/wireless/iwlwifi/mvm/ops.c
Hunk #1 succeeded at 744 (offset -99 lines).
Hunk #2 succeeded at 753 (offset -99 lines).

Created attachment 155201 [details]
tentative fix
sorry....
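The soft/hard distinction discussed above (a "soft" block on the platform device showing up as a "hard" block on phy0) can be spotted mechanically in `rfkill list` output. The sketch below parses the text format shown in this thread; the function name and the dictionary keys are hypothetical.

```python
import re

# Header lines look like "3: phy0: Wireless LAN"
HEADER = re.compile(r"^(\d+): ([^:]+): (.+)$")


def parse_rfkill_list(text):
    """Parse `rfkill list` output into a list of per-device dictionaries."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "index": int(match.group(1)),
                "name": match.group(2),
                "type": match.group(3),
                "soft": None,
                "hard": None,
            }
            devices.append(current)
        elif current is not None and line.startswith("Soft blocked:"):
            current["soft"] = line.split(":", 1)[1].strip() == "yes"
        elif current is not None and line.startswith("Hard blocked:"):
            current["hard"] = line.split(":", 1)[1].strip() == "yes"
    return devices
```

Listing the devices with `hard` set while a platform device such as asus-wlan is only soft-blocked reproduces the situation described above: the WiFi driver sees a hardware kill line it cannot control.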
That applies :)
I'll let you know after the next reboot, which however as previously mentioned will only happen tomorrow.

Ok - thanks. I can't do much testing here. I'll try to look at the code again later to see if I missed something.
I basically try to kill the firmware if I get the RFKill interrupt. But if the interrupt is long to come and the firmware dies before, that's another story which I have to check. I might ask you to hack on the code (I'll provide a patch) and provoke this situation to see how it is handled.

Created attachment 155471 [details]
dmesg of v3.17 kernel with the patch applied
It's now failing with a different symptom on boot: The kernel module crashes/timeouts, or so, and I get a CPU-side kernel backtrace. Doing the reload dance as usual fixes the problem.
I can also confirm that setting "systemd.restore_state=0" on the kernel cmdline as documented by systemd-rfkill(8) works around the issue. I tried 3 boots, and didn't get the error. But after all, it's a race condition - still I assume based on your analysis that this is a work-around, and I'll report here if I ever see the failure with that cmdline.
Something is still restoring/remembering the previous state, so WiFi is disabled after boot.

Created attachment 155481 [details]
tentative fix - take 2
Please try this updated version.
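For anyone relying on the "systemd.restore_state=0" work-around mentioned in the previous comment, a quick way to confirm the option actually reached the kernel is to look at the command line the kernel booted with (normally readable from /proc/cmdline). A minimal sketch with a hypothetical helper name:

```python
def kernel_param(cmdline, name):
    """Return the value of a kernel command-line parameter, '' for a bare flag, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            # Keep scanning so the last occurrence is the one reported.
            value = val if sep else ""
    return value
```

Calling it with the contents of /proc/cmdline and "systemd.restore_state" should give "0" when the work-around is active.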
(In reply to Ralf Jung from comment #58)
> I can also confirm that setting "systemd.restore_state=0" on the kernel
> cmdline as documented by systemd-rfkill(8) works around the issue. I tried 3
> boots, and didn't get the error. But after all, it's a race condition -
> still I assume based on your analysis that this is a work-around, and I'll
> report here if I ever see the failure with that cmdline.
> Something is still restoring/remembering the previous state, so WiFi is
> disabled after boot.

I will fix the bug - don't worry :)

I am full steam on it :)

Created attachment 155511 [details]
dmesg of kernel 3.17, patch v2

Here you go. I'm getting an error printed on boot ("Failed to run INIT ucode: -5"). The card is initially gone. After re-loading the module, it appears. I can then enable WiFi, but it takes exceptionally long.

> I will fix the bug - don't worry :)
>
> I am full steam on it :)

:)
I thought the information may be useful for you.

Created attachment 155701 [details]
tentative fix - take 3
Let's try this.
I am sorry - I wish I could reproduce the issue...
I'm sorry, there is an imminent deadline so I'm currently either working or sleeping. It will all be over tomorrow, then I will test patch v3.

Created attachment 155881 [details]
dmesg of kernel 3.17, patch v3
Same behaviour as with v2. dmesg attached.
Hm... This is really strange... May I ask you to double check?
Also, the stack is not very trustworthy. The test_and_clear thing should have avoided the WARNING.
I'll re-think about it

Created attachment 156011 [details]
dmesg of kernel 3.17, patch v3, try 2

Well, here you go:

$ wget 'https://bugzilla.kernel.org/attachment.cgi?id=155701' -O ../iwlwifi-rfkill.patch
$ git reset --hard v3.17
$ patch -p1 < ../iwlwifi-rfkill.patch
$ nice make deb-pkg -j5
# install packages
# reboot
# disable WiFi
# reboot
$ sudo modprobe -r iwlmvm && sudo modprobe -r iwlwifi && sudo modprobe iwlwifi
# enable WiFi
$ dmesg > dmesg-3.17-patch3

Ok, thanks. I'll provide another patch on Sunday. I won't try to fix, but I'll try to understand better what is happening.

I still need to work on a patch that can give data, but please, recompile your kernel with CONFIG_FRAME_POINTER.
That will give me a reliable stack.
Thanks.

Created attachment 156151 [details]
dmesg of kernel 3.17, patch v3, try 3
I had to reboot three times to get a dmesg with a stack trace again - the other two times, the module failed to load, but without a stacktrace.
Created attachment 156171 [details]
tentative fix + debug info - take4
Let's try this.
I am now dumping the stack in various places + adding debug data. Don't be afraid by the nasty logs you'll get.
Thanks.
Should I try to catch a dmesg like the one I just posted, where the module crashes (or whatever triggers the stackdump) and I get various timeout messages on boot - or should I just go for any dmesg where module loading fails? (If there are stacktraces in the dmesg in any case, I am not sure I can differentiate the two cases.)

as long as you don't have wifi when you try to re-enable WiFi after boot, it is a valuable log for me.
Thanks.

Created attachment 156181 [details]
dmesg of kernel 3.17, patch 4

As you said, it contains lots of scary traces ;-)

> as long as you don't have wifi when you try to re-enable WiFi after boot, it
> is a valuable log for me.

I did not have WiFi immediately after boot. As usual, after re-loading the module, WiFi works fine. Is that what you mean?

Created attachment 156191 [details]
tentative fix + debug info - take5
what about this?
Thanks.
Created attachment 156201 [details]
dmesg of kernel 3.17, patch 5
Here you go.
Created attachment 156221 [details]
tentative fix + debug info - take6
Sorry - I had a stupid bug in my previous patch...
Created attachment 156241 [details]
dmesg of kernel 3.17, patch 6 (no failure!)
I did three reboots with patch 6, and each time the card came up as expected (i.e. the device was present, but disabled). The only glitch is a message shown on the console during boot:
iwlwifi 0000:03:00.0: We were cut short by RFKILL - all is good
Judging from the message, that's probably an expected glitch ;-) . I still attached a dmesg, maybe it tells you something.
Unless you send me another version, I will keep running my system with this patch applied and tell you if anything weird happens.
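The approach described earlier in the thread, killing the firmware when the RFKILL interrupt arrives during INIT and guarding the stop path with a "test_and_clear", can be illustrated with a toy model. This is not the driver code and not the attached patch; `ToyTransport` and all of its members are invented for illustration. The point is only that when two paths (the interrupt handler and the normal start/stop flow) may both try to stop the device, an atomic test-and-clear of an "enabled" flag lets exactly one of them do the work.

```python
import threading


class ToyTransport:
    """Toy stand-in for a device whose stop path can be entered from two places."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = False
        self.stop_count = 0
        self.stopped_by = None

    def start(self):
        with self._lock:
            self._enabled = True

    def _test_and_clear_enabled(self):
        # Atomically read the old flag value and clear it.
        with self._lock:
            was_enabled = self._enabled
            self._enabled = False
            return was_enabled

    def stop(self, reason):
        """Stop the device once; later callers see it already stopped."""
        if not self._test_and_clear_enabled():
            return False
        self.stop_count += 1
        self.stopped_by = reason
        return True
```

In this model, an "rfkill interrupt" calling `stop("rfkill")` followed by the init flow calling `stop("init failed")` tears the device down only once, which is the benign outcome that a message like "We were cut short by RFKILL - all is good" suggests.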
So I guess this bug is fixed?
I will now send this patch for review. I'll provide the final patch after the review cycle to make sure it still works for you.

actually, I was expecting the "All is good" message :)
I still need to provide a clean patch - after the internal review cycle.

(In reply to Emmanuel Grumbach from comment #79)
> actually, I was expecting the "All is good" message :)

And you didn't tell me to avoid spoiling your control group... :D

Since the bug was not 100% reproducible, it's hard to tell whether it is fixed. It certainly looks like it is, though - but I only dare say so after some more days without an error ;-)

Created attachment 156251 [details]
final fix - before review
This is a clean version of the patch for this issue.
I may still ask you to test the final version after review, but I don't expect it'll change a lot.
Thanks a lot for your patience testing all my patches!
Two more successful boots, now with the final patch :)

Tested-by: Ralf Jung <post@ralfj.de>

> Thanks a lot for your patience testing all my patches!

Thanks a lot for fixing the issue :D

Fix is now in my tree - will be sent in the next pull request.
Closing the bug.

https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/iwlwifi-fixes.git/commit/?id=31b8b343e019e0a0c57ca9c13520a87f9cab884b