Bug 64541

Summary: Intel Wireless 7260 hardware timed out (lspci -xxx returns 0xff) randomly
Product: Networking
Reporter: Zhuoyun Wei (wzyboy)
Component: Wireless
Assignee: networking_wireless (networking_wireless)
Status: CLOSED CODE_FIX
Severity: normal
CC: bjorn, ilw, joakimkoed
Priority: P1
Hardware: x86-64
OS: Linux
URL: http://lkml.kernel.org/r/527A8166.6000701@gmail.com
Kernel Version: 3.12.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel messages when the bug occurs
full dmesg output
lspci after the bug occurs
When the interface is "state DOWN" in "ip link"
When the interface is "state UP" in "ip link" after I ran "ip link set wlan0 up".
When the interface is connected to the Wi-Fi of my dormitory and got an address
patched kernel 3.12 dmesg
patched kernel 3.12 lspci -vvxxx output
dmesg with pci=earlydump
00:1c.1 Root Port config space dump (Windows 8.1)
03:00.0 7260 NIC config space dump (Windows 8.1)
osc_clk.patch
iwlwifi SW restarting dmesg in 3.14.5

Description Zhuoyun Wei 2013-11-07 04:44:41 UTC
Created attachment 113711 [details]
kernel messages when the bug occurs

This bug was originally posted in Linux Wireless mailing lists. (Link: http://thread.gmane.org/gmane.linux.kernel.wireless.general/115259 )  Here is a brief summary of the whole story: (quoted from the author of iwlwifi.ko)

* I have a ThinkPad X240s laptop (Haswell) with _OSC control *not* granted
* L1 Active is enabled
* kernel: 3.12.0
* Nic is PCIe (Gen2 but not sure...)

At some random point, the driver loses access to the NIC: all readl operations return 0xff.
Even lspci returns 0xff:

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

here is the output of lspci *before* the issue hits:

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)
00: 86 80 b2 08 06 04 10 00 6b 00 80 02 10 00 00 00
10: 04 00 40 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 62 c2
30: 00 00 00 00 c8 00 00 00 00 00 00 00 09 01 00 00
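
A minimal sketch, assuming Python 3 and the "lspci -x" row format shown above, of how a script could tell the two dumps apart: a device that has dropped off the bus reads back 0xff for every config space byte, while a healthy device reports its vendor and device IDs in the first four bytes. The function names are hypothetical and are not part of lspci or iwlwifi.

```python
HEALTHY = """
00: 86 80 b2 08 06 04 10 00 6b 00 80 02 10 00 00 00
10: 04 00 40 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 62 c2
30: 00 00 00 00 c8 00 00 00 00 00 00 00 09 01 00 00
"""

DEAD = """
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
"""


def parse_config_dump(text):
    """Turn 'offset: b0 b1 ...' rows into one bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _offset, _, hex_bytes = line.partition(":")
        data.extend(int(b, 16) for b in hex_bytes.split())
    return bytes(data)


def device_unreachable(config):
    """True when every byte reads 0xff, as in the failing dump."""
    return len(config) > 0 and all(b == 0xFF for b in config)


def vendor_device(config):
    """Vendor ID (offset 0x00) and device ID (offset 0x02), little-endian."""
    vendor = int.from_bytes(config[0:2], "little")
    device = int.from_bytes(config[2:4], "little")
    return vendor, device


for name, dump in (("before", HEALTHY), ("after", DEAD)):
    cfg = parse_config_dump(dump)
    vendor, device = vendor_device(cfg)
    print(f"{name}: [{vendor:04x}:{device:04x}] unreachable={device_unreachable(cfg)}")
```

For the healthy dump this yields the 8086:08b2 ID pair that the kernel log also reports for 03:00.0.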

Each time this bug occurs, there are some (to me) strange trace messages in the kernel logs, as attached.

For more logs such as the complete dmesg log, please refer to the mailing lists archive link above.
Comment 1 Emmanuel Grumbach 2013-11-07 06:00:55 UTC
Created attachment 113731 [details]
full dmesg output
Comment 2 Emmanuel Grumbach 2013-11-07 14:29:47 UTC
Created attachment 113751 [details]
lspci after the bug occurs
Comment 3 Emmanuel Grumbach 2013-11-10 08:04:23 UTC
Created attachment 114041 [details]
When the interface is "state DOWN" in "ip link"
Comment 4 Emmanuel Grumbach 2013-11-10 08:04:49 UTC
Created attachment 114051 [details]
When the interface is "state UP" in "ip link" after I ran "ip link set wlan0 up".
Comment 5 Emmanuel Grumbach 2013-11-10 08:05:10 UTC
Created attachment 114061 [details]
When the interface is connected to the Wi-Fi of my dormitory and got an address
Comment 6 Zhuoyun Wei 2013-11-11 11:41:21 UTC
Created attachment 114211 [details]
patched kernel 3.12 dmesg
Comment 7 Zhuoyun Wei 2013-11-11 11:41:54 UTC
Created attachment 114221 [details]
patched kernel 3.12 lspci -vvxxx output
Comment 8 Emmanuel Grumbach 2013-11-13 08:48:39 UTC
User reported that disabling L1 manually with setpci prevents the bug from triggering.

# for the 7260 device
  setpci -s03:00.0 0x50.W=0x140
# for the bridge
  setpci -s00:1c.1 0x50.W=0x040


Note that this is tailored to the user's system and won't work on other systems.
Note also that this is a workaround (W/A)
(just in case someone else is reading this...)
Comment 9 Bjorn Helgaas 2013-11-19 19:01:14 UTC
Created attachment 115151 [details]
dmesg with pci=earlydump

This pci=earlydump info shows that BIOS enabled ASPM L1 on both the 00:1c.1 and 03:00.0 devices.  The 16-bit Link Control register is at 0x50 for both devices:

  pci 0000:00:1c.1 config space:
    50: 42 00 11 70 00 b2 14 00 00 00 40 01 00 00 00 00
  pci 0000:03:00.0 config space:
    50: 42 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00

00:1c.1 Link Control = 0x0042
03:00.0 Link Control = 0x0142
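
A minimal sketch, assuming Python 3, of how these values are read from the dump rows: the register is little-endian, so the first two bytes at 0x50 are swapped to form the 16-bit value, and the ASPM Control field is bits 1:0 of Link Control (0 = disabled, 1 = L0s, 2 = L1, 3 = L0s and L1). The helper names are hypothetical.

```python
# ASPM Control occupies bits 1:0 of the PCIe Link Control register.
ASPM_MASK = 0x0003
ASPM_STATES = {0b00: "disabled", 0b01: "L0s", 0b10: "L1", 0b11: "L0s and L1"}


def link_control(row):
    """Build the 16-bit register from the first two bytes of a dump row."""
    low, high = row.split()[:2]
    return int(high + low, 16)  # little-endian: second byte is the high byte


def aspm_state(lnkctl):
    """Decode the ASPM Control field."""
    return ASPM_STATES[lnkctl & ASPM_MASK]


def with_aspm_disabled(lnkctl):
    """Same register value with the ASPM Control field cleared."""
    return lnkctl & ~ASPM_MASK & 0xFFFF


# Rows at offset 0x50 from the pci=earlydump output above.
ROWS = {
    "00:1c.1": "42 00 11 70 00 b2 14 00 00 00 40 01 00 00 00 00",
    "03:00.0": "42 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00",
}

for dev, row in ROWS.items():
    reg = link_control(row)
    print(f"{dev}: LnkCtl={reg:#06x} ASPM={aspm_state(reg)} "
          f"cleared={with_aspm_disabled(reg):#06x}")
```

Clearing the ASPM field gives 0x0040 for the bridge and 0x0140 for the NIC, which are the values written by the setpci workaround in comment #8.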

Both show ASPM L1 enabled.  From attachment 113731 [details]:

  DMI: LENOVO 20AKCTO1WW/20AKCTO1WW, BIOS GIET62WW (2.12 ) 09/25/2013
  ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
  acpi PNP0A08:00: Requesting ACPI _OSC control (0x1d)
  acpi PNP0A08:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x0d
  acpi PNP0A08:00: ACPI _OSC control for PCIe not granted, disabling ASPM
  pci 0000:03:00.0: [8086:08b2] type 00 class 0x028000
  iwlwifi 0000:03:00.0: loaded firmware version 22.0.7.0 op_mode iwlmvm
  iwlwifi 0000:03:00.0: Detected Intel(R) Wireless N 7260, REV=0x144
  iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S

Because the BIOS declined to grant OS control of ASPM, Linux doesn't touch the ASPM configuration at all, and it remains as BIOS configured it.  From the lspci output in attachment 114061 [details]:

  00:1c.1 PCI bridge to [bus 03]
    LnkCap: Port #3, Speed 5GT/s, Width x1, ASPM L0s L1
    LnkCtl: ASPM L1 Enabled; ...
  03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)
    LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1
    LnkCtl: ASPM L1 Enabled; ...

Based on the fact that the workaround in comment #8 prevents the bug, there must be some problem when ASPM L1 is enabled.

But there is no spec-compliant way for Linux to disable ASPM L1 when we don't have permission to control it.  It's clearly possible for us to ignore the BIOS and sledgehammer it.  I actually proposed this at http://lkml.kernel.org/r/20130510225257.GA10847@google.com, but we decided it was too dangerous.  I investigated (see bug 57331) and found that when Windows doesn't have permission to control ASPM, it also apparently ignores driver requests to disable ASPM.

It's possible that the iwlwifi driver could do something to fix this, and of course a BIOS change could fix this by leaving L1 disabled.  But those are both out of our control.
Comment 10 Bjorn Helgaas 2014-01-08 20:56:26 UTC
Created attachment 121341 [details]
00:1c.1 Root Port config space dump (Windows 8.1)

wzyboy collected this with RWEverything: http://lkml.kernel.org/r/CALkVjQbjnAk0mFZ-3zjx_15xXEaaiMaVxMMLD1HEUJhVtYXg4g@mail.gmail.com

The NIC works correctly.

I reformatted and annotated this to be more like "lspci -xxxx" output with these vim commands:

:set ff=unix
1GVGu
:%s/ 0...=/ /g
:%s/^00//
:%s/^0//
:%s/=/: / 
:%s/^8: //
:%s/^.8: //
:%s/^..8: //
# join 16 times (0x100 bytes of config space); position at first line to join
JjJjJjJjJjJjJjJjJjJjJjJjJjJjJjJj
Comment 11 Bjorn Helgaas 2014-01-08 21:03:47 UTC
Created attachment 121351 [details]
03:00.0 7260 NIC config space dump (Windows 8.1)

wzyboy collected this with RWEverything: http://lkml.kernel.org/r/CALkVjQaWBq6--sekifyGfJt9FDigsk56eXMUv9YR3sDVnVcX=Q@mail.gmail.com

I compared these dumps from Windows with Linux lspci output, but the lspci output only had the first 0x100 bytes, which doesn't include the L1 PM Substates and a few other things, so it might be useful to collect the "lspci -xxxx" output and compare again.

The differences I found looked innocuous:

  - Linux set the 00:1c.1 PCI_COMMAND_IO bit
  - Linux has the 00:1c.1 Secondary Status "Received Master-Abort" bit set
  - the 00:1c.1 and 03:00.0 Interrupt Line values are different
  - Linux shows the 00:1c.1 PCIe "Link Training" bit set
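
A minimal sketch, assuming Python 3 and dumps already converted to "lspci -x" style rows, of the kind of byte-by-byte comparison described above. The two sample rows are invented for illustration (they differ only in bit 0 of the Command register at offset 0x04, i.e. PCI_COMMAND_IO) and are not taken from the attachments; the function names are hypothetical.

```python
def parse_config_dump(text):
    """Turn 'offset: b0 b1 ...' rows into one bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _offset, _, hex_bytes = line.partition(":")
        data.extend(int(b, 16) for b in hex_bytes.split())
    return bytes(data)


def diff_config(a, b):
    """List (offset, byte_a, byte_b) for every offset where the dumps differ.

    Only the range covered by both dumps is compared.
    """
    return [(off, x, y) for off, (x, y) in enumerate(zip(a, b)) if x != y]


# Invented sample rows, not real attachment data.
WINDOWS_SAMPLE = "00: 86 80 b2 08 06 04 10 00"
LINUX_SAMPLE = "00: 86 80 b2 08 07 04 10 00"

differences = diff_config(parse_config_dump(WINDOWS_SAMPLE),
                          parse_config_dump(LINUX_SAMPLE))
for offset, win_byte, linux_byte in differences:
    print(f"offset {offset:#04x}: windows={win_byte:02x} linux={linux_byte:02x}")
```

Because zip() stops at the shorter input, a 0x100-byte lspci dump compared against a longer Windows dump would silently skip the extended config space, which is the limitation noted above.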
Comment 12 Emmanuel Grumbach 2014-01-09 08:40:33 UTC
I have done the full comparison - based on another lspci output from wzyboy.
When Linux enables L1 PM Substates it breaks.
I can see that Windows does enable it.

So obviously there is something fishy here...
I have spent tons of time to track the differences. I only found a small thing.

I'll try to send the patch. Unfortunately, I have very little time...
Comment 13 Emmanuel Grumbach 2014-01-12 06:18:21 UTC
Created attachment 121671 [details]
osc_clk.patch

can you please try this?

If it locks up everything, I am afraid you'll have to shut the laptop down completely, including removing the battery...

This is probably the last thing I can think of. The next step would be to take the machine to a PCIe analyzer...
Comment 14 Zhuoyun Wei 2014-01-13 02:28:59 UTC
(In reply to Emmanuel Grumbach from comment #13)
>
> can you please try this?
> 

This patch works great! With the patched kernel and without the setpci trick, my laptop has been up and running for 11.5 hours, during which more than 30 GiB of data has been downloaded (as a test), and I did not get a single connection loss!
Comment 15 Emmanuel Grumbach 2014-01-13 06:02:27 UTC
ok - good news.
I'll make a real patch.

Thanks to all who helped!
Comment 16 Zhuoyun Wei 2014-01-13 06:06:49 UTC
Great! Thanks to all!

I'll hold my kernel packages in pacman.conf until Linus Torvalds merges your patch. :-)
Comment 17 Emmanuel Grumbach 2014-01-13 06:10:03 UTC
I guess that the bug can be closed now.
Comment 18 Zhuoyun Wei 2014-01-13 06:17:53 UTC
Done.
Comment 19 Joakim Koed 2014-02-23 22:48:35 UTC
Hello.

Has this patch made it into the kernel yet? I have tried up to kernel 3.13.5 and still experience the same problem (same wifi card).

Just wanted to know if I should wait or if I need to patch it myself (I'm fairly new to Linux, so I would be glad if I could fix it without patching a kernel).

Thank you.
Comment 20 Zhuoyun Wei 2014-02-24 02:20:07 UTC
(In reply to Joakim Koed from comment #19)
> 
> Have this patch made it into the kernel yet? I have tried up to kernel
> 3.13.5 and still experience the same problem (same wifi card)

I am the original reporter and just upgraded to 3.14.5 the day before yesterday. I've been benchmarking my NIC without any of the workarounds mentioned above, even without the "power_scheme=1" parameter.

Two days later, 20 GiB of data has been transferred and not a single "connection lost" has occurred. The only "glitch" I found is that there are some errors relating to "Microcode SW error detected.  Restarting 0x2000000." each time it (re)connects to an AP. It does not affect usage, however.
Comment 21 Zhuoyun Wei 2014-02-24 02:21:24 UTC
Created attachment 127211 [details]
iwlwifi SW restarting dmesg in 3.14.5

These "error messages" occur from time to time but do not affect normal usage.
Comment 22 Emmanuel Grumbach 2014-02-24 04:02:07 UTC
(In reply to wzyboy from comment #21)
> Created attachment 127211 [details]
> iwlwifi SW restarting dmesg in 3.14.5
> 
> These "error messages" occur from time to time but does not affect normal
> usage.

Please don't mix 2 bugs here.
But this bug has been fixed in 3.13.5 (you messed up kernel versions).
You also need to upgrade your firmware.
Comment 23 Emmanuel Grumbach 2014-02-24 05:39:03 UTC
(In reply to Joakim Koed from comment #19)
> Hello.
> 
> Have this patch made it into the kernel yet? I have tried up to kernel
> 3.13.5 and still experience the same problem (same wifi card)
> 
> Just wanted to know if I should wait or If I need to patch it myself (i'm r
> fairly new to Linux, so I would be glad if I could it it without patching a
> kernel.)
> 
> Thank you.
3.13.5 has all the fixes.
It is another issue. Please, open a new bug.
Comment 24 Zhuoyun Wei 2014-02-24 06:39:59 UTC
(In reply to Emmanuel Grumbach from comment #22)
> (In reply to wzyboy from comment #21)
> > Created attachment 127211 [details]
> > iwlwifi SW restarting dmesg in 3.14.5
> > 
> > These "error messages" occur from time to time but does not affect normal
> > usage.
> 
> Please don't mix 2 bugs here.
> But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> You also need to upgrade your firmware.

I am so sorry that I mixed these bugs together. I thought they were the same.

And sorry again that I misremembered the kernel version. It was 3.14.4 and now I am in 3.14.5. Those error messages are gone. I can see no error messages from "iwlwifi" in dmesg now.

Thanks for your efforts!
Comment 25 Joakim Koed 2014-02-24 07:11:17 UTC
(In reply to Emmanuel Grumbach from comment #22)
> (In reply to wzyboy from comment #21)
> > Created attachment 127211 [details]
> > iwlwifi SW restarting dmesg in 3.14.5
> > 
> > These "error messages" occur from time to time but does not affect normal
> > usage.
> 
> Please don't mix 2 bugs here.
> But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> You also need to upgrade your firmware.

Which firmware are you using? I've tried both from this site: http://www.intel.com/support/wireless/wlan/sb/CS-034398.htm

When I use the one for 3.13+ kernel I only get like 1mbit download. When I use one for 3.11+ with 3.13 kernel I get fullspeed (80-100mbit+)

What is your experience? :) -- Sorry for the offtopic, but I'm really frustrated about this issue.
Comment 26 Emmanuel Grumbach 2014-02-24 07:13:38 UTC
(In reply to Joakim Koed from comment #25)
> (In reply to Emmanuel Grumbach from comment #22)
> > (In reply to wzyboy from comment #21)
> > > Created attachment 127211 [details]
> > > iwlwifi SW restarting dmesg in 3.14.5
> > > 
> > > These "error messages" occur from time to time but does not affect normal
> > > usage.
> > 
> > Please don't mix 2 bugs here.
> > But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> > You also need to upgrade your firmware.
> 
> Which firmware are you using? I've tried both from this site:
> http://www.intel.com/support/wireless/wlan/sb/CS-034398.htm
> 
> When I use the one for 3.13+ kernel I only get like 1mbit download. When I
> use one for 3.11+ with 3.13 kernel I get fullspeed (80-100mbit+)
> 
> What is your experience? :) -- Sorry for the offtopic, but I'm really
> frustrated about this issue.

please open a new bug.
Comment 27 Bjorn Helgaas 2014-06-03 22:48:27 UTC
For reference, this was fixed by

    2d93aee152b1 iwlwifi: pcie: enable oscillator for L1 exit

which appeared in v3.14-rc1 and was marked for stable (v3.10+).