Created attachment 177221 [details] lspci output for the wireless card
The iwlwifi kernel module crashes reproducibly whenever I suspend to RAM and then resume the system. My wireless card is an Intel AC 7260 that provides both WLAN and Bluetooth; I attached the lspci output for further details (intel-ac7260.txt). I'm on Arch Linux with kernel version 4.0.3, and I have the latest wifi microcode for my hardware and kernel version (/usr/lib/firmware/iwlwifi-7260-12.ucode).
Created attachment 177231 [details] dmesg log including the stack trace for the crash
Is this a regression? Did it work on an earlier kernel?
I don't know, I just got the laptop with this wireless card a few days ago. The only thing I can say is that I had the same problem with 4.0.2, because I got that update between encountering this bug the first time and reproducing and now reporting it.
*the update 4.0.2 -> 4.0.3
Ok - it is the first time I see a report like this, so I am a bit surprised it didn't come up earlier.
Jonas, could you please try to reproduce the bug with this patch applied and provide the crash output (dmesg) so we can get more information about what is going on?

diff --git a/drivers/net/wireless/iwlwifi/pcie/trans.c b/drivers/net/wireless/iwlwifi/pcie/trans.c
index 9de632f..993ddb9 100644
--- a/drivers/net/wireless/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/iwlwifi/pcie/trans.c
@@ -378,12 +378,13 @@ static void iwl_pcie_apm_lp_xtal_enable(struct iwl_trans *trans)
 	ret = iwl_poll_bit(trans, CSR_GP_CNTRL,
 			   CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY,
 			   CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY,
-			   25000);
+			   35000);
 	if (WARN_ON(ret < 0)) {
 		IWL_ERR(trans,
 			"Access time out - failed to enable LP XTAL\n");
 		/* Release XTAL ON request */
 		__iwl_trans_pcie_clear_bit(trans, CSR_GP_CNTRL,
 					   CSR_GP_CNTRL_REG_FLAG_XTAL_ON);
+		iwl_pcie_dump_csr(trans);
 		return;
 	}

Meanwhile I'll discuss this with our system people to try to understand better what is going on.
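To illustrate what the timeout change in the patch above affects, here is a minimal user-space sketch of a "poll a register bit until it is set or a timeout expires" loop. It is not the driver's code: `poll_bit`, `fake_read`, `struct fake_dev`, `CLOCK_READY` and the 10 µs polling step are all hypothetical names and values chosen for illustration, and time is only counted rather than actually waited for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register reader: stands in for a real MMIO read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

#define POLL_INTERVAL_US 10 /* assumed step between two reads */
#define CLOCK_READY 0x1u    /* hypothetical "clock ready" flag */

/*
 * Poll until (reg & mask) == (bits & mask) or until timeout_us has
 * elapsed. Returns the elapsed time in microseconds on success and
 * -1 on timeout. Elapsed time is counted, not slept.
 */
static int poll_bit(read_reg_fn read_reg, void *ctx,
                    uint32_t bits, uint32_t mask, int timeout_us)
{
    int elapsed = 0;

    assert(read_reg != NULL);
    do {
        if ((read_reg(ctx) & mask) == (bits & mask))
            return elapsed;
        elapsed += POLL_INTERVAL_US;
    } while (elapsed < timeout_us);

    return -1;
}

/* Fake device: the ready bit shows up after a fixed number of reads. */
struct fake_dev {
    int reads_until_ready;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_dev *dev = ctx;

    if (dev->reads_until_ready > 0) {
        dev->reads_until_ready--;
        return 0;
    }
    return CLOCK_READY;
}
```

With a fake device that becomes ready after 3000 reads (30000 µs of counted time), a 25000 µs budget times out while a 35000 µs budget succeeds. On real hardware the longer timeout only helps if the clock is merely slow to stabilize; if the bit never comes up, the added register dump is the useful part of the patch.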
Created attachment 177931 [details] Second dmesg log
Okay... I patched my kernel (I had to apply the patch manually because the function to patch was about 50 lines higher in my kernel source, but that wasn't much of a problem), rebooted into it and let the driver crash again. Skimming through the new dmesg log, it doesn't look different, but here it is...
Thanks, Jonas! We now have some more registers dumped, so we will try to figure out what is going on.
I found some suspicious things happening, but I'm not sure yet what is going on. What I can see is that when we resume, we try to check if the NIC is rfkilled (by reading a register) and it tells us that it is *not*. This causes us to start the flow of setting things up and then it fails. I have been able to reproduce the part that we get !RFKILL and then almost immediately RFKILL, on my Dell E6430. But I don't get the other problems that you get on your machine. I'll continue investigating... BTW, what is the model of the laptop you're using? And what is the distro?
Created attachment 178161 [details] third dmesg log (suspend with soft-blocked wlan, manually unblocked after resuming)
You mean rfkilled by the hardware? I just installed the rfkill userspace program and soft-blocked the wlan before suspending, then unblocked it the same way after resuming; the driver still crashes. It might also be worth noting that I have a hardware-blocking Fn key combination, but it doesn't work. xev doesn't recognize any keyboard input when I press it, so to my understanding it should actually be handled by the hardware.

This is the official store page of my laptop (that is where I bought it): http://www.tuxedocomputers.com/Linux-Hardware/Linux-Notebooks/15-6-Zoll/TUXEDO-Book-BU1504-15-6-matt-Full-HD-Slim-Book-bis-12h-Akkulaufzeit-Ultrabook-CPUs-bis-Intel-Core-i7-drei-HDD-SSD-bis-16GB-RAM-DVD-Blu-Ray-Brenner.geek

I'm running Arch Linux, as mentioned in the bug description. For further hardware details, I'll attach the output of dmidecode.
Created attachment 178171 [details] dmidecode output
Thanks Jonas, and sorry for asking a question that was already answered (the distro). So you're using SW RF-kill, that's good information, I'll take that into account now too.
This is 100% reproducible on my Lenovo X220 if I resume with the HW switch set to turn off the radio. Let me know if there's anything useful for me to test.
We have recently started publishing our development backports tree here: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git I'm not sure using the latest master from there would solve your problem, but you could give it a try if you want. You should also take the firmware from the linux-firmware tree that Emmanuel maintains, with newer firmware versions: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git BUT PLEASE NOTE that installing our backported driver will replace the entire wireless subsystem, so if you use other wireless devices as well, you won't be able to use them simultaneously with the iwlwifi driver. Meanwhile, we are still trying to figure out the reason for the rfkill state toggle during resume...
Created attachment 179261 [details] Experimental fix
Hi Jonas,
Would it be possible to give this patch a shot and see if it fixes this issue?
Cheers, Ido.
I've tested your fix. It does stop the crash from happening, but in a very similar way to soft-blocking the wifi card with rfkill before suspending. When I resume the system, my nm-applet told me wifi was disabled, so I looked at what rfkill had to say about that and it said "Hard blocked: yes". My Fn key combination still does nothing and I've still not seen the plane mode LED being on a single time. So you found a fix for the crash, but not the one I wanted :D
Hi Jonas, Thanks for testing this patch. I suspect there's more than one issue here. For some reason, your platform enables HW rfkill on suspend automatically, which triggered an issue in the wifi driver. This was fixed by the patch I posted earlier, but even with it, HW rfkill is still kept asserted following a resume. The fact that you are unable to toggle the HW rfkill using the function keys may suggest that there are other issues (perhaps ACPI related) with this laptop, which are unrelated to the wifi driver. By the way, have you tried unloading and then reloading the drivers (iwlmvm & iwlwifi)? Could you please share the kernel logs as well? Andy, would you be able to test the patch posted earlier and see if it fixes the issue you've experienced with your Lenovo? Thanks, Ido.
Hi Jonas, I just noticed that the laptop link you posted refers to a driver that handles the flight mode button: https://www.linux-onlineshop.de/forum/index.php?page=Thread&threadID=26 If you haven't already, perhaps it would be a good idea to give it a shot. Cheers, Ido.
Created attachment 179331 [details] dmesg log #4 (of system boot without some sort of hardware reset?)
I hadn't looked for additional drivers yet. I got a Linux driver CD with the laptop but thought it was stupid (I might have looked there if I had actually cared about the flight mode button before). Anyway, I installed that driver through the AUR [1], but the button still doesn't do anything. I can modprobe the driver and it shows up in lsmod afterwards, but unfortunately that doesn't change anything.

I think the problem isn't the kernel driver but the X11 key binding. The AUR package installed a script into the global xinitrc.d to add a key binding for a keycode, which doesn't work because X doesn't even recognize a key press when I press the flight mode hotkey (tested using xev).

Resuming from STR still does the exact same thing as before (the flight mode LED doesn't activate or anything like that), and the supplied script clevo-airplane-mode-led-control doesn't work either. It does show how to enable the LED, and doing it manually works, but for some reason it expects /sys/class/rfkill/rfkill${WIFI_RF_INDEX}/state to be 0 when flight mode is enabled, and that's actually the state it's in when soft-blocked. So maybe this driver is only meant to soft-block the wifi when the hotkey is pressed?

Another thing I found while typing in the details here: I can still reproduce the crash of the driver. All I need to do is put the system into STR, resume it, then reboot. This worked before as well, by the way; sorry for not mentioning it earlier. After a reboot, the wifi would still not work. Only powering it off and manually starting it again got it back to normal. I attached a dmesg log of the new driver crash. My current kernel only has the experimental fix, though, not the previous patch that added some register values to the dmesg log.

[1] https://aur.archlinux.org/packages/clevo-airplane-mode/
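For readers following the rfkill discussion: the kernel's rfkill sysfs interface documents the `state` file as 0 for soft-blocked, 1 for unblocked and 2 for hard-blocked, which matches the observation above that 0 is the soft-blocked state. The following is a small sketch of reading and decoding that file. The function and enum names are made up for this example, and the path passed to `read_rfkill_state` (for instance /sys/class/rfkill/rfkill0/state) is an assumption, since the index differs between machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Values of the rfkill sysfs "state" file; names are local to this sketch. */
enum rfkill_state {
    RFKILL_UNKNOWN = -1,
    RFKILL_SOFT_BLOCKED = 0,
    RFKILL_UNBLOCKED = 1,
    RFKILL_HARD_BLOCKED = 2
};

/* Decode the text content of a "state" file, e.g. "2\n". */
static enum rfkill_state parse_rfkill_state(const char *text)
{
    char *end;
    long value;

    assert(text != NULL);
    value = strtol(text, &end, 10);
    if (end == text)
        return RFKILL_UNKNOWN; /* no digits found */

    switch (value) {
    case 0:
        return RFKILL_SOFT_BLOCKED;
    case 1:
        return RFKILL_UNBLOCKED;
    case 2:
        return RFKILL_HARD_BLOCKED;
    default:
        return RFKILL_UNKNOWN;
    }
}

/* Read and decode a state file; returns RFKILL_UNKNOWN if it cannot be read. */
static enum rfkill_state read_rfkill_state(const char *path)
{
    char buf[16];
    enum rfkill_state state = RFKILL_UNKNOWN;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return RFKILL_UNKNOWN;
    if (fgets(buf, sizeof(buf), f) != NULL)
        state = parse_rfkill_state(buf);
    fclose(f);
    return state;
}
```

A script that treats 0 as "flight mode on" would therefore only detect a soft block; a hard block, like the one reported after resume, shows up as 2.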
> By the way, have you tried unloading and then reloading the drivers (iwlmvm & > iwlwifi)? Just tried, results in the same thing as rebooting: The iwlwifi module crashes as soon as it's loaded. > Could you please share the kernel logs as well? I don't have anything starting with 'k' in /var/log... What exactly should I be looking for?
FWIW the PCI config space is botched: L1 is disabled, yet LTR is enabled? Not possible....
Well... This isn't getting any better the more I look into it :D I just noticed that restarting doesn't only hard-block the wifi, it also makes bluetooth disappear in the 'rfkill list' output. Previously I used 'rfkill list wifi' so I didn't notice that until now. Maybe it's of significance? I'm really starting to wonder if my hardware was delivered broken though.
Hi Jonas, This certainly feels like a platform issue, not a driver one. The fact that the state is persistent across reboots and the PCI config issue Emmanuel noted seems to suggest something is controlling the module's power and HW rfkill. Please note that tuxedo-wmi driver does a bit more than just creating another input device, so you might need it even if you don't plan on using the flight mode key. For instance, it registers a callback that is being called every time the system resumes, and evaluates some WMI method. However, unless I'm missing something, it doesn't seem like the driver really matches your platform (judging by the dmidecode output and lack of "Model XXXXX found" in dmesg), which might explain some of these issues. Considering that your laptop is fairly new and that this code is almost a couple of years old, this is hardly surprising. Cheers, Ido.
Alright, thank you for your help so far! I have now uninstalled the driver. So one small update on my part: I tried to find something useful on that driver CD I was talking about. Turns out the one that doesn't have the word 'Windows' on it doesn't have the word 'Linux' on it either and just has an older version of the same contents :D I have now contacted customer support; maybe I just had bad luck and got a broken Wifi+BT card.
Created attachment 179681 [details] lock transport path
Can you please try this? It won't fix your platform issues, but I'd like to know if it works as well as the experimental fix. Thank you.
Sorry for being slow here -- I can't get 4.1.0-rc7 to wake up from suspend with or without iwlwifi. I'll keep you posted.
HAH Drop a mail to Linus :) He loves regression at -rc7 :P
Sorry, but I can't apply that patch. It didn't auto-apply and iwl_trans_pcie_start_hw looks quite different in my kernel source. I'm on 4.0.5 now.
Created attachment 179701 [details] lock transport path With the correct context
Ah wait. I made another mistake... Sorry for the noise.
Created attachment 179711 [details] lock transport path Finally....
Nope, it still didn't apply automatically, and every hunk that did apply had a 20-line offset :D But I managed to apply the patch manually this time, and it built fine. I'll test it now.
Created attachment 179721 [details] dmesg log #5, now crashing even sooner!
There you go... Another dmesg log of a crash. This time I didn't have to suspend or restart :D
Weird... I had applied it on 4.0.5 but it was late at night... I'll check again on Sunday.
So... After a few emails between me and the customer support, I have a 3.18 kernel installed and it isn't affected by this bug! I tested 3.14 at first, but that didn't like my Intel HD Graphics 5500. I did manage to verify that the bug doesn't exist there using the console though. Should I test 3.19 next?
No need to test 3.19. Can you please test the master branch of our backport tree? https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git/ Note that your report pointed to a real bug in iwlwifi that we'll fix. I would just like to know whether the regression is in iwlwifi or somewhere else. Using the backport tree will help us determine that. If the master branch of our backport tree reproduces the bug, you may want to bisect the backport tree. That is fairly easy and quick. Thank you.
FWIW - I applied my patch from comment #31 on v4.0.5. It applied cleanly. I am now testing it on 4.0.5, but I am pretty sure you made a mistake when applying it or your kernel isn't exactly v4.0.5. I also noticed I made a typo in my previous comment. My feeling is that you had a regression in a platform driver and not in iwlwifi. Can you send the dmesg output of the boot using 3.18? Bisection of iwlwifi will help *only* if you can reproduce your bug on the master branch of the backport tree with 3.18. As I mentioned, my feeling is this configuration will work just fine. Worth trying though.
Created attachment 179861 [details] dmesg log #6, clean start on linux 3.18.14
Here's the log of the clean start; I forgot to attach it last time. I'll also try the iwlwifi backport tree in the next few days.
I can still see the prints about the link power states which makes no sense: L1 Disabled LTR enabled This is a bug in the BIOS.
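The combination flagged here can be checked mechanically from a raw dump of the device's PCI Express capability. The sketch below assumes the standard register layout (Link Control at offset 0x10 with the ASPM L1 enable in bit 1, Device Control 2 at offset 0x28 with the LTR enable in bit 10); the function names and the idea of working on a byte buffer are illustrative only, and whether "LTR on, L1 off" is truly invalid is the claim made in this thread, not something the code proves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCIE_CAP_LNKCTL  0x10    /* Link Control register offset */
#define PCIE_CAP_DEVCTL2 0x28    /* Device Control 2 register offset */
#define LNKCTL_ASPM_L1   0x0002u /* ASPM control: L1 entry enabled */
#define DEVCTL2_LTR_EN   0x0400u /* LTR mechanism enable */

/* Read a little-endian 16-bit register from a raw capability dump. */
static uint16_t cap_read16(const uint8_t *cap, size_t offset)
{
    assert(cap != NULL);
    return (uint16_t)(cap[offset] | (cap[offset + 1] << 8));
}

/* True if LTR is enabled while ASPM L1 is disabled. */
static bool ltr_without_l1(const uint8_t *cap)
{
    bool l1_on = (cap_read16(cap, PCIE_CAP_LNKCTL) & LNKCTL_ASPM_L1) != 0;
    bool ltr_on = (cap_read16(cap, PCIE_CAP_DEVCTL2) & DEVCTL2_LTR_EN) != 0;

    return ltr_on && !l1_on;
}
```

The `cap` buffer is expected to start at the beginning of the PCI Express capability structure and to be at least 0x2a bytes long.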
Created attachment 179871 [details] lock transport path
I just tested 4.0.5 + my patch. It worked for me. I am attaching a newer version with a few fixes.
Jonas? Andy? :-) Note that the patch that adds a mutex has been merged.
I now tested your latest patch, Emmanuel. I also found out why it didn't apply last time: I accidentally commented out the 4.0.5 patch in the PKGBUILD I used to create the package, so I was applying the patch to 4.0.0. Anyway.. The patch fixed my problem! So now I have a working 4.0.5 kernel :) Should I get another dmesg log or will this simply be closed now?
I would like to get your dmesg please. Unless it is clean:-)
Created attachment 180231 [details] dmesg log #7, suspend+resume without crash on 4.0.5
Well, it is clean in that it doesn't contain a crash log, if that's what you mean. But here it is, in case you meant something different or it might be interesting without something crashing.
Thanks. Andy, I am closing the bug, but I am still here in case you still have issues.
A last question: Will this fix be included with linux 4.0.6 upwards (if there will be a 4.0.6) or will it only be there in 4.1+ / 4.2+ or something like that?
4.2. This patch isn't stable material.
So, I'm using 4.2 now and still experiencing this problem, so I guess the patch hasn't been applied yet? How do I track this patch's status?
Sorry, but this only went into 4.3, not 4.2 as Emmanuel originally said. Linus opened the merge window for 4.2 on Jun 21, so this patch was definitely too late to make it. If you want to track the patch, you can check your git log for this:

commit fa9f3281cbb1075545d4528c84059a3f4e117b44
Author: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Date:   Thu Jun 11 20:45:49 2015 +0300

    iwlwifi: pcie: lock start_hw / start_fw / stop_device

    This allows to ensure that we don't have races between them.
    A user reported that stop_device was called twice upon rfkill
    interrupt after suspend. When the interrupts are enabled, and
    right after when we directly check the rfkill state.

    Reviewed-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
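As a rough illustration of the idea in that commit message (serializing start and stop so that a second stop request cannot run a second teardown), here is a user-space sketch using a pthread mutex. It is not the driver's code: `struct fake_trans`, `fake_start_hw`, `fake_stop_device` and the `hw_up` flag are invented for this example, and the "skip teardown if already stopped" check is an assumption about how a double stop could be made harmless, not a description of the actual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transport object. */
struct fake_trans {
    pthread_mutex_t lock; /* serializes start and stop */
    bool hw_up;           /* true between start and stop */
    int teardown_count;   /* how often the teardown actually ran */
};

#define FAKE_TRANS_INIT { PTHREAD_MUTEX_INITIALIZER, false, 0 }

static void fake_start_hw(struct fake_trans *t)
{
    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    t->hw_up = true;
    pthread_mutex_unlock(&t->lock);
}

/*
 * May be requested twice in a row, e.g. once from an interrupt path
 * and once from a direct state check. Under the lock, only the first
 * request performs the teardown.
 */
static void fake_stop_device(struct fake_trans *t)
{
    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    if (t->hw_up) {
        t->hw_up = false;
        t->teardown_count++;
    }
    pthread_mutex_unlock(&t->lock);
}
```

Calling `fake_start_hw` once and `fake_stop_device` twice leaves `teardown_count` at 1. Without the lock, two concurrent callers could both see `hw_up` as true and both run the teardown.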