Bug 98591 - iwlwifi: 7260: failed to enable LP XTAL upon resume with RFKILL - MWG100236201
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-18 15:12 UTC by Jonas Platte
Modified: 2015-10-05 15:08 UTC
CC List: 5 users

See Also:
Kernel Version: 4.0.3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci output for the wireless card (2.80 KB, text/plain)
2015-05-18 15:12 UTC, Jonas Platte
dmesg log including the stack trace for the crash (88.79 KB, text/x-log)
2015-05-18 15:14 UTC, Jonas Platte
Second dmesg log (88.77 KB, text/x-log)
2015-05-26 13:35 UTC, Jonas Platte
third dmesg log (suspend with soft-blocked wlan, manually unblocked after resuming) (88.14 KB, text/x-log)
2015-05-28 12:20 UTC, Jonas Platte
dmidecode output (12.76 KB, text/plain)
2015-05-28 12:21 UTC, Jonas Platte
Experimental fix (1.26 KB, patch)
2015-06-09 15:30 UTC, Ido Yariv
dmesg log #4 (of system boot without some sort of hardware reset?) (77.88 KB, text/x-log)
2015-06-09 23:07 UTC, Jonas Platte
lock transport path (7.59 KB, patch)
2015-06-11 19:27 UTC, Emmanuel Grumbach
lock transport path (7.50 KB, patch)
2015-06-11 20:20 UTC, Emmanuel Grumbach
lock transport path (7.30 KB, patch)
2015-06-11 20:32 UTC, Emmanuel Grumbach
dmesg log #5, now crashing even sooner! (69.89 KB, text/x-log)
2015-06-11 20:52 UTC, Jonas Platte
dmesg log #6, clean start on linux 3.18.14 (72.61 KB, text/x-log)
2015-06-13 20:17 UTC, Jonas Platte
lock transport path (8.16 KB, patch)
2015-06-13 20:36 UTC, Emmanuel Grumbach
dmesg log #7, suspend+resume without crash on 4.0.5 (72.04 KB, text/x-log)
2015-06-18 00:46 UTC, Jonas Platte

Description Jonas Platte 2015-05-18 15:12:26 UTC
Created attachment 177221 [details]
lspci output for the wireless card

The iwlwifi kernel module crashes reproducibly whenever I suspend to RAM and then resume the system. My wireless card is an Intel AC 7260, which provides both WLAN and Bluetooth; I have attached the lspci output for further details (intel-ac7260.txt).

I'm on Arch Linux with kernel version 4.0.3, and I have the latest Wi-Fi microcode for my hardware and kernel version (/usr/lib/firmware/iwlwifi-7260-12.ucode).
Comment 1 Jonas Platte 2015-05-18 15:14:11 UTC
Created attachment 177231 [details]
dmesg log including the stack trace for the crash
Comment 2 Emmanuel Grumbach 2015-05-18 17:20:33 UTC
Is this a regression?
Did it work on an earlier kernel?
Comment 3 Jonas Platte 2015-05-18 17:55:36 UTC
I don't know; I only got the laptop with this wireless card a few days ago. All I can say is that I had the same problem with 4.0.2, because I received a kernel update between first encountering the bug and reproducing and reporting it.
Comment 4 Jonas Platte 2015-05-18 17:56:22 UTC
*the update 4.0.2 -> 4.0.3
Comment 5 Emmanuel Grumbach 2015-05-18 18:52:46 UTC
Ok - it is the first time I see a report like this, so I am a bit surprised it didn't come up earlier.
Comment 6 Luca Coelho 2015-05-26 07:44:24 UTC
Jonas, could you please try to reproduce the bug with this patch applied and provide the crash output (dmesg) so we can get more information about what is going on?

diff --git a/drivers/net/wireless/iwlwifi/pcie/trans.c b/drivers/net/wireless/iwlwifi/pcie/trans.c
index 9de632f..993ddb9 100644
--- a/drivers/net/wireless/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/iwlwifi/pcie/trans.c
@@ -378,12 +378,13 @@ static void iwl_pcie_apm_lp_xtal_enable(struct iwl_trans *trans)
        ret = iwl_poll_bit(trans, CSR_GP_CNTRL,
                           CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY,
                           CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY,
-                          25000);
+                          35000);
        if (WARN_ON(ret < 0)) {
                IWL_ERR(trans, "Access time out - failed to enable LP XTAL\n");
                /* Release XTAL ON request */
                __iwl_trans_pcie_clear_bit(trans, CSR_GP_CNTRL,
                                           CSR_GP_CNTRL_REG_FLAG_XTAL_ON);
+               iwl_pcie_dump_csr(trans);
                return;
        }


Meanwhile I'll discuss this with our system people to try to understand better what is going on.
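
For readers unfamiliar with the call whose timeout is being raised above, here is a minimal user-space sketch of the general poll-with-timeout pattern. It is a simulation under stated assumptions, not the driver's code: `poll_bit`, `FakeRegister`, the 10 µs polling step, and the `MAC_CLOCK_READY` value are all hypothetical stand-ins chosen for illustration.

```python
import errno

POLL_INTERVAL_US = 10  # assumed polling step; the real driver's value may differ


def poll_bit(read_reg, bits, mask, timeout_us, delay=lambda us: None):
    """Poll until (read_reg() & mask) == (bits & mask) or the budget runs out.

    Returns the elapsed microseconds on success, or -errno.ETIMEDOUT.
    `delay` is a no-op by default so the simulation runs instantly.
    """
    elapsed = 0
    while True:
        if (read_reg() & mask) == (bits & mask):
            return elapsed
        delay(POLL_INTERVAL_US)
        elapsed += POLL_INTERVAL_US
        if elapsed >= timeout_us:
            return -errno.ETIMEDOUT


class FakeRegister:
    """Hypothetical register whose ready bit only sets after `ready_after` reads."""

    def __init__(self, ready_bit, ready_after):
        self.ready_bit = ready_bit
        self.ready_after = ready_after
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.ready_bit if self.reads > self.ready_after else 0


MAC_CLOCK_READY = 0x1  # placeholder bit value, not the real CSR flag
```

With a simulated device that only reports ready after 3000 polls, a 25000 µs budget returns the timeout error while a 35000 µs budget succeeds. If the crash persists with the larger value, as it did here, the clock is probably not merely slow to come up.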
Comment 7 Jonas Platte 2015-05-26 13:35:25 UTC
Created attachment 177931 [details]
Second dmesg log

Okay... I patched my kernel (I had to apply the patch manually, as the function to patch was about 50 lines higher in my kernel source, but that wasn't much of a problem), rebooted into it, and let the driver crash again. Skimming the new dmesg log, it doesn't seem different, but here it is...
Comment 8 Luca Coelho 2015-05-26 14:28:36 UTC
Thanks, Jonas!

We now have some more registers dumped, so we will try to figure out what is going on.
Comment 9 Luca Coelho 2015-05-28 10:36:04 UTC
I found some suspicious things happening, but I'm not sure yet what is going on.  What I can see is that when we resume, we try to check if the NIC is rfkilled (by reading a register) and it tells us that it is *not*.  This causes us to start the flow of setting things up and then it fails.

I have been able to reproduce the part that we get !RFKILL and then almost immediately RFKILL, on my Dell E6430.  But I don't get the other problems that you get on your machine.  I'll continue investigating...

BTW, what is the model of the laptop you're using? And what is the distro?
Comment 10 Jonas Platte 2015-05-28 12:20:38 UTC
Created attachment 178161 [details]
third dmesg log (suspend with soft-blocked wlan, manually unblocked after resuming)

You mean rfkilled by the hardware? I just installed the rfkill userspace program and soft-blocked the WLAN before suspending, then unblocked it the same way after resuming; the driver still crashes.

It may also be worth noting that I have a hardware-blocking Fn key combination, but it doesn't work. xev doesn't register any keyboard input when I press it, so to my understanding it should be handled by the hardware.

This is the official store page of my laptop (that is where I bought it): http://www.tuxedocomputers.com/Linux-Hardware/Linux-Notebooks/15-6-Zoll/TUXEDO-Book-BU1504-15-6-matt-Full-HD-Slim-Book-bis-12h-Akkulaufzeit-Ultrabook-CPUs-bis-Intel-Core-i7-drei-HDD-SSD-bis-16GB-RAM-DVD-Blu-Ray-Brenner.geek

I'm running Arch Linux, as mentioned in the bug description. For further hardware details, I'll attach the output of dmidecode.
Comment 11 Jonas Platte 2015-05-28 12:21:35 UTC
Created attachment 178171 [details]
dmidecode output
Comment 12 Luca Coelho 2015-05-29 08:57:29 UTC
Thanks Jonas, and sorry for asking a question that was already answered (the distro).

So you're using SW RF-kill, that's good information, I'll take that into account now too.
Comment 13 Andy Lutomirski 2015-06-01 23:44:41 UTC
This is 100% reproducible on my Lenovo X220 if I resume with the HW switch set to turn off the radio.  Let me know if there's anything useful for me to test.
Comment 14 Luca Coelho 2015-06-05 10:44:20 UTC
We have recently started publishing our development backports tree here:

https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git

I'm not sure using the latest master from there would solve your problem, but you could give it a try if you want.  You should also take the firmware from the linux-firmware tree that Emmanuel maintains, with newer firmware versions:

https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git

BUT PLEASE NOTE that installing our backported driver will replace the entire wireless subsystem, so if you use other wireless devices as well, you won't be able to use them simultaneously with the iwlwifi driver.

Meanwhile, we are still trying to figure out the reason for the rfkill state toggle during resume...
Comment 15 Ido Yariv 2015-06-09 15:30:36 UTC
Created attachment 179261 [details]
Experimental fix

Hi Jonas,

Would it be possible to give this patch a shot and see if it fixes this issue?

Cheers,
Ido.
Comment 16 Jonas Platte 2015-06-09 17:51:12 UTC
I've tested your fix. It does stop the crash from happening, but in a very similar way to soft-blocking the wifi card with rfkill before suspending.

When I resumed the system, nm-applet told me Wi-Fi was disabled, so I checked what rfkill had to say about that, and it said "Hard blocked: yes". My Fn key combination still does nothing, and I still haven't seen the flight mode LED light up a single time. So you found a fix for the crash, but not the one I wanted :D
Comment 17 Ido Yariv 2015-06-09 20:23:08 UTC
Hi Jonas,

Thanks for testing this patch.

I suspect there's more than one issue here. For some reason, your platform enables HW rfkill on suspend automatically, which triggered an issue in the wifi driver. This was fixed by the patch I posted earlier, but even with it, HW rfkill is still kept asserted following a resume.

The fact that you are unable to toggle the HW rfkill using the function keys may suggest that there are other issues (perhaps ACPI related) with this laptop, which are unrelated to the wifi driver.

By the way, have you tried unloading and then reloading the drivers (iwlmvm & iwlwifi)? Could you please share the kernel logs as well?

Andy, would you be able to test the patch posted earlier and see if it fixes the issue you've experienced with your Lenovo?

Thanks,
Ido.
Comment 18 Ido Yariv 2015-06-09 20:59:38 UTC
Hi Jonas,

I just noticed that the laptop link you posted refers to a driver that handles the flight mode button:
https://www.linux-onlineshop.de/forum/index.php?page=Thread&threadID=26

If you haven't already, perhaps it would be a good idea to give it a shot.

Cheers,
Ido.
Comment 19 Jonas Platte 2015-06-09 23:07:25 UTC
Created attachment 179331 [details]
dmesg log #4 (of system boot without some sort of hardware reset?)

I hadn't yet looked for additional drivers. I got a Linux driver CD with the laptop but thought it was stupid (I might have looked there if I had actually cared about the flight mode button before).

Anyway, I installed that driver through the AUR [1], but the button still doesn't do anything. I can modprobe the driver and it shows up in lsmod afterwards, but unfortunately that doesn't change anything. I think the problem isn't the kernel driver but the X11 key binding. The AUR package installed a script into the global xinitrc.d to add a key binding for a keycode, which doesn't work because X doesn't even recognize a key press when I press the flight mode hotkey (tested using xev).

Resuming from STR still does the exact same thing as before (flight mode LED doesn't activate or anything like that), and the supplied script clevo-airplane-mode-led-control doesn't work either: It does show how to enable the LED and doing it manually works, but for some reason it expects /sys/class/rfkill/rfkill${WIFI_RF_INDEX}/state to be 0 when flight mode is enabled, but that's actually the state it's in when soft blocked. So maybe this driver is only meant to soft-block the wifi when the hotkey is pressed?
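
The `state` value discussed above can be read directly. The sketch below assumes the legacy rfkill sysfs `state` attribute uses 0 for soft blocked, 1 for unblocked, and 2 for hard blocked, which matches the observation that 0 corresponds to a soft block; `read_rfkill` and `STATE_NAMES` are hypothetical helper names, and the sysfs root is a parameter so the function can be pointed at a test directory.

```python
from pathlib import Path

# Assumed meaning of the legacy `state` attribute in /sys/class/rfkill/rfkillN/.
STATE_NAMES = {0: "soft blocked", 1: "unblocked", 2: "hard blocked"}


def read_rfkill(sysfs_root="/sys/class/rfkill"):
    """Return {device name: (type, state description)} for each rfkill entry."""
    result = {}
    for dev in sorted(Path(sysfs_root).glob("rfkill*")):
        name = (dev / "name").read_text().strip()
        kind = (dev / "type").read_text().strip()
        state = int((dev / "state").read_text().strip())
        result[name] = (kind, STATE_NAMES.get(state, f"unknown ({state})"))
    return result
```

Calling `read_rfkill()` before suspend and again after resume would show whether the WLAN entry moved from "unblocked" to "hard blocked", which a script that only tests for 0 would miss.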

Another thing I found while typing up the details here: I can still reproduce the driver crash. All I need to do is put the system into STR, resume it, and then reboot. This reproduction also worked before, by the way (sorry for not mentioning it earlier): after a reboot, the Wi-Fi would still not work. Only powering it off and manually starting it again got it back to normal. I attached a dmesg log of the new driver crash. My current kernel only has the experimental fix, though, not the previous patch that added some register values to the dmesg log.

[1] https://aur.archlinux.org/packages/clevo-airplane-mode/
Comment 20 Jonas Platte 2015-06-09 23:15:24 UTC
> By the way, have you tried unloading and then reloading the drivers (iwlmvm &
> iwlwifi)?

Just tried, results in the same thing as rebooting: The iwlwifi module crashes as soon as it's loaded.

> Could you please share the kernel logs as well?

I don't have anything starting with 'k' in /var/log... What exactly should I be looking for?
Comment 21 Emmanuel Grumbach 2015-06-09 23:21:07 UTC
FWIW the PCI config space is botched: L1 is disabled, yet LTR is enabled?
Not possible....
Comment 22 Jonas Platte 2015-06-09 23:43:19 UTC
Well... This isn't getting any better the more I look into it :D

I just noticed that restarting doesn't only hard-block the wifi, it also makes bluetooth disappear in the 'rfkill list' output. Previously I used 'rfkill list wifi' so I didn't notice that until now. Maybe it's of significance?
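
A small sketch of how the `rfkill list` comparison could be automated, so that a missing device type is noticed regardless of which filter was passed on the command line. The sample text in the usage note is illustrative only and is not taken from the attached logs; `parse_rfkill_list` and `missing_types` are hypothetical helper names.

```python
import re


def parse_rfkill_list(text):
    """Parse `rfkill list` style output into {index: {name, type, soft, hard}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(\d+):\s*(.+?):\s*(.+)$", line)
        if header:
            current = {"name": header.group(2), "type": header.group(3).strip()}
            devices[int(header.group(1))] = current
            continue
        flag = re.match(r"^\s+(Soft|Hard) blocked:\s*(yes|no)\s*$", line)
        if flag and current is not None:
            current[flag.group(1).lower()] = flag.group(2) == "yes"
    return devices


def missing_types(devices, expected):
    """Return the expected device types that do not appear in the parsed output."""
    present = {dev["type"] for dev in devices.values()}
    return set(expected) - present
```

For example, feeding it `"0: phy0: Wireless LAN\n\tSoft blocked: no\n\tHard blocked: yes\n"` and expecting both `"Wireless LAN"` and `"Bluetooth"` reports `"Bluetooth"` as missing.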

I'm really starting to wonder if my hardware was delivered broken though.
Comment 23 Ido Yariv 2015-06-10 00:02:28 UTC
Hi Jonas,

This certainly feels like a platform issue, not a driver one. The fact that the state is persistent across reboots and the PCI config issue Emmanuel noted seem to suggest that something is controlling the module's power and HW rfkill.

Please note that tuxedo-wmi driver does a bit more than just creating another input device, so you might need it even if you don't plan on using the flight mode key. For instance, it registers a callback that is being called every time the system resumes, and evaluates some WMI method.

However, unless I'm missing something, it doesn't seem like the driver really matches your platform (judging by the dmidecode output and lack of "Model XXXXX found" in dmesg), which might explain some of these issues. Considering that your laptop is fairly new and that this code is almost a couple of years old, this is hardly surprising.

Cheers,
Ido.
Comment 24 Jonas Platte 2015-06-10 00:38:22 UTC
Alright, thank you for your help so far! I have now uninstalled the driver.

So one small update on my part: I tried to find something useful on that driver CD I was talking about. Turns out the one that doesn't have the word 'Windows' on it doesn't have the word 'Linux' on it either and just has an older version of the same contents :D

I have now contacted customer support; maybe I just had bad luck and got a broken Wi-Fi+BT card.
Comment 25 Emmanuel Grumbach 2015-06-11 19:27:50 UTC
Created attachment 179681 [details]
lock transport path

Can you please try this?

It won't fix your platform issues, but I'd like to know if it works as well as the experimental fix.

thank you.
Comment 26 Andy Lutomirski 2015-06-11 19:55:40 UTC
Sorry for being slow here -- I can't get 4.1.0-rc7 to wake up from suspend with or without iwlwifi.  I'll keep you posted.
Comment 27 Emmanuel Grumbach 2015-06-11 19:57:33 UTC
HAH
Drop a mail to Linus :)

He loves regressions at -rc7 :P
Comment 28 Jonas Platte 2015-06-11 20:04:32 UTC
Sorry, but I can't apply that patch. It didn't auto-apply and iwl_trans_pcie_start_hw looks quite different in my kernel source. I'm on 4.0.5 now.
Comment 29 Emmanuel Grumbach 2015-06-11 20:20:06 UTC
Created attachment 179701 [details]
lock transport path

With the correct context
Comment 30 Emmanuel Grumbach 2015-06-11 20:25:57 UTC
Ah wait. I made another mistake...
Sorry for the noise.
Comment 31 Emmanuel Grumbach 2015-06-11 20:32:52 UTC
Created attachment 179711 [details]
lock transport path

Finally....
Comment 32 Jonas Platte 2015-06-11 20:45:41 UTC
Nope, it still didn't apply automatically, and everything that did apply had a 20-line offset :D

But I managed to apply the patch manually this time, and it built fine. I'll test it now.
Comment 33 Jonas Platte 2015-06-11 20:52:46 UTC
Created attachment 179721 [details]
dmesg log #5, now crashing even sooner!

There you go... Another dmesg log of a crash. This time I didn't have to suspend or restart :D
Comment 34 Emmanuel Grumbach 2015-06-12 03:12:01 UTC
Weird... I had applied it on 4.0.5 but it was late at night...
I'll check again on Sunday.
Comment 35 Jonas Platte 2015-06-12 14:53:15 UTC
So... After a few emails between me and the customer support, I have a 3.18 kernel installed and it isn't affected by this bug!

I tested 3.14 at first, but that didn't like my Intel HD Graphics 5500. I did manage to verify that the bug doesn't exist there using the console though.

Should I test 3.19 next?
Comment 36 Emmanuel Grumbach 2015-06-12 15:00:41 UTC
No need to test 3.19.
Can you please test the master branch of our backport tree?
https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

Note that your report pointed to a real bug in iwlwifi that we'll fix.
I would just like to know whether the regression is in iwlwifi or somewhere else. Using the backport tree will help us determine that.

If the master branch of our backport tree reproduces the bug, you may want to bisect the backport tree. That is fairly easy and quick.

Thank you.
Comment 37 Emmanuel Grumbach 2015-06-13 20:04:33 UTC
FWIW - I applied my patch from comment #31 on v4.0.5. It applied cleanly.

I am now testing it on 4.0.5, but I am pretty sure you made a mistake when applying it or your kernel isn't exactly v4.0.5.

I also noticed I made a typo in my previous comment. My feeling is that you had a regression in a platform driver and not in iwlwifi.
Can you send the dmesg output of the boot using 3.18?
Bisection of iwlwifi will help *only* if you can reproduce your bug on the master branch of the backport tree with 3.18. As I mentioned, my feeling is this configuration will work just fine.
Worth trying though.
Comment 38 Jonas Platte 2015-06-13 20:17:46 UTC
Created attachment 179861 [details]
dmesg log #6, clean start on linux 3.18.14

Here's the log of the clean start, forgot to attach it last time.

I'll also try the iwlwifi backport tree in the next few days.
Comment 39 Emmanuel Grumbach 2015-06-13 20:35:41 UTC
I can still see the prints about the link power states which makes no sense:
L1 Disabled LTR enabled
This is a bug in the BIOS.
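
A minimal sketch for spotting this combination in a dmesg log. The exact wording of the driver's message is an assumption; the pattern accepts both the form quoted above and a variant with a dash, and the PCI address in the usage note is made up. `link_power_state` and `is_suspicious` are hypothetical helper names.

```python
import re

# Tolerates "L1 Disabled LTR enabled" as well as "L1 Disabled - LTR Enabled".
_LINK_STATE = re.compile(r"L1 (En|Dis)abled\s*-?\s*LTR (En|Dis)abled", re.IGNORECASE)


def link_power_state(log_line):
    """Return (l1_enabled, ltr_enabled) parsed from a log line, or None."""
    match = _LINK_STATE.search(log_line)
    if not match:
        return None
    return (match.group(1).lower() == "en", match.group(2).lower() == "en")


def is_suspicious(l1_enabled, ltr_enabled):
    """LTR enabled while L1 is disabled is the combination flagged in this thread."""
    return ltr_enabled and not l1_enabled
```

For instance, `link_power_state("iwlwifi 0000:03:00.0: L1 Disabled - LTR Enabled")` yields `(False, True)`, which `is_suspicious` reports as the inconsistent case.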
Comment 40 Emmanuel Grumbach 2015-06-13 20:36:24 UTC
Created attachment 179871 [details]
lock transport path

I just tested 4.0.5 + my patch. It worked for me.
I am attaching a newer version with a few fixes.
Comment 41 Emmanuel Grumbach 2015-06-17 04:24:57 UTC
Jonas?

Andy?

:-)

Note that the patch that adds a mutex has been merged.
Comment 42 Jonas Platte 2015-06-17 21:48:48 UTC
I now tested your latest patch, Emmanuel. I also found out why it didn't apply last time: I accidentally commented out the 4.0.5 patch in the PKGBUILD I used to create the package, so I was applying the patch to 4.0.0. 

Anyway.. The patch fixed my problem! So now I have a working 4.0.5 kernel :)
Should I get another dmesg log or will this simply be closed now?
Comment 43 Emmanuel Grumbach 2015-06-18 00:39:13 UTC
I would like to get your dmesg please. Unless it is clean:-)
Comment 44 Jonas Platte 2015-06-18 00:46:40 UTC
Created attachment 180231 [details]
dmesg log #7, suspend+resume without crash on 4.0.5

Well, it is clean in that it doesn't contain a crash log if that's what you mean. But here it is, in case you meant something different or it might be interesting without something crashing.
Comment 45 Emmanuel Grumbach 2015-06-18 01:49:06 UTC
Thanks.

Andy, I am closing the bug, but I am still here in case you still have issues.
Comment 46 Jonas Platte 2015-06-18 01:56:24 UTC
A last question: Will this fix be included with linux 4.0.6 upwards (if there will be a 4.0.6) or will it only be there in 4.1+ / 4.2+ or something like that?
Comment 47 Emmanuel Grumbach 2015-06-18 04:25:34 UTC
4.2.

This patch isn't stable material.
Comment 48 Jonas Platte 2015-10-02 01:16:03 UTC
So, I'm using 4.2 now and am still experiencing this problem, so I guess the patch hasn't been applied yet? How do I track this patch's status?
Comment 49 Luca Coelho 2015-10-05 15:08:34 UTC
Sorry, but this only went into 4.3, not 4.2 as Emmanuel originally said.  Linus opened the merge window for 4.2 on Jun 21, so this patch was definitely too late to make it.
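
As a rough check, the running kernel's release string can be compared against the first release expected to carry the fix. This sketch assumes mainline version numbering and ignores distribution backports; `kernel_version`, `should_contain_fix`, and `FIX_FIRST_RELEASE` are hypothetical names.

```python
import re

FIX_FIRST_RELEASE = (4, 3)  # per this comment; backported kernels are not considered


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style string such as '4.2.1-1-ARCH'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def should_contain_fix(release):
    """True if a mainline kernel with this release string should include the fix."""
    return kernel_version(release) >= FIX_FIRST_RELEASE
```

On a live system, `should_contain_fix(platform.release())` (after `import platform`) applies the check to the running kernel.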

If you want to track the patch, you can check your git log for this:

commit fa9f3281cbb1075545d4528c84059a3f4e117b44
Author: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Date:   Thu Jun 11 20:45:49 2015 +0300

    iwlwifi: pcie: lock start_hw / start_fw / stop_device
    
    This allows to ensure that we don't have races between them.
    A user reported that stop_device was called twice upon
    rfkill interrupt after suspend. When the interrupts are
    enabled, and right after when we directly check the rfkill
    state.
    
    Reviewed-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
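
The race described in the commit message can be illustrated with a toy user-space model. This is not the iwlwifi code: `Transport`, its `running` flag, and the `teardowns` counter are hypothetical, and whether the real patch pairs its mutex with a state check like this one is an assumption.

```python
import threading


class Transport:
    """Toy stand-in for a device transport; not the iwlwifi data structure."""

    def __init__(self):
        self._lock = threading.Lock()  # serialises start_hw and stop_device
        self.running = False
        self.teardowns = 0

    def start_hw(self):
        with self._lock:
            self.running = True

    def stop_device(self):
        with self._lock:
            if not self.running:
                return  # a second caller finds nothing left to tear down
            self.running = False
            self.teardowns += 1


def stop_from_two_paths(trans):
    """Call stop_device from two threads, e.g. an interrupt path and a direct check."""
    threads = [threading.Thread(target=trans.stop_device) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

After `start_hw()` followed by `stop_from_two_paths()`, the teardown runs exactly once, because the second caller only enters the critical section after the first has cleared `running`.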
