Created attachment 113711 [details]
kernel messages when the bug occurs

This bug was originally posted to the Linux Wireless mailing list.
(Link: http://thread.gmane.org/gmane.linux.kernel.wireless.general/115259 )

Here is a brief summary of the whole story (quoted from the author of iwlwifi.ko):

* I have a ThinkPad X240s laptop (Haswell) with _OSC control *not* granted
* L1 Active is enabled
* kernel: 3.12.0
* NIC is PCIe (Gen2 but not sure...)

At some random point, the driver loses access to the NIC: all readl operations return 0xff. Even lspci returns 0xff:

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Here is the output of lspci *before* the issue hits:

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)
00: 86 80 b2 08 06 04 10 00 6b 00 80 02 10 00 00 00
10: 04 00 40 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 62 c2
30: 00 00 00 00 c8 00 00 00 00 00 00 00 09 01 00 00

Each time this bug occurs, there are some (to me) strange trace messages in the kernel logs, as attached. For more logs, such as the complete dmesg log, please refer to the mailing list archive link above.
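The all-0xff symptom above can be checked mechanically. The following is a minimal sketch, not part of the original report: it parses "lspci -x" style rows and flags a dump in which every byte reads 0xff. The helper names parse_hexdump and looks_dead are made up for illustration, and the sample rows are the first two rows of each dump quoted above.

```python
def parse_hexdump(text):
    """Parse 'lspci -x' style rows ('00: 86 80 ...') into a bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _offset, _, rest = line.partition(":")
        data.extend(int(byte_text, 16) for byte_text in rest.split())
    return bytes(data)


def looks_dead(config):
    """Return True if every byte reads 0xff, i.e. reads no longer reach the device."""
    return len(config) > 0 and all(b == 0xFF for b in config)


# First two rows of the dumps quoted in this report.
BEFORE = """
00: 86 80 b2 08 06 04 10 00 6b 00 80 02 10 00 00 00
10: 04 00 40 f0 00 00 00 00 00 00 00 00 00 00 00 00
"""
AFTER = """
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
"""

before = parse_hexdump(BEFORE)
after = parse_hexdump(AFTER)

# Vendor ID is the little-endian 16-bit value at offset 0.
print(hex(int.from_bytes(before[0:2], "little")))
print(looks_dead(before), looks_dead(after))
```

A healthy device never reports a vendor ID of 0xffff, so checking the first two bytes alone would already be a reasonable test; the sketch checks the whole dump only to mirror the output shown above.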
Created attachment 113731 [details] full dmesg output
Created attachment 113751 [details] lspci after the bug occurs
Created attachment 114041 [details] When the interface is "state DOWN" in "ip link"
Created attachment 114051 [details] When the interface is "state UP" in "ip link" after I ran "ip link set wlan0 up".
Created attachment 114061 [details] When the interface is connected to the Wi-Fi of my dormitory and got an address
Created attachment 114211 [details] patched kernel 3.12 dmesg
Created attachment 114221 [details] patched kernel 3.12 lspci -vvxxx output
User reported that disabling L1 manually with setpci prevents the bug from triggering.

# for the 7260 device
setpci -s03:00.0 0x50.W=0x140
# for the bridge
setpci -s00:1c.1 0x50.W=0x040

Note that this is tailored to the user's system and won't work on other systems. Note also that this is a W/A (just in case someone else is reading this...)
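The values written by the workaround are consistent with clearing the ASPM Control field (bits 1:0) of each device's Link Control register while leaving the other bits alone. Here is a small sketch of that arithmetic. The function names are hypothetical, and the starting values 0x0142 and 0x0042 are the Link Control values reported elsewhere in this bug from the pci=earlydump output; on any other system both the values and the 0x50 offset would have to be read from that system first.

```python
ASPM_CONTROL_MASK = 0x0003  # Link Control bits 1:0 select L0s/L1 entry


def disable_aspm(link_control):
    """Return the 16-bit Link Control value with the ASPM Control field cleared."""
    return link_control & ~ASPM_CONTROL_MASK & 0xFFFF


def setpci_command(bdf, offset, link_control):
    """Build the setpci invocation that writes the cleared value back (string only)."""
    return "setpci -s {} {:#x}.W={:#x}".format(bdf, offset, disable_aspm(link_control))


# Values specific to the reporter's machine, taken from this bug report.
print(setpci_command("03:00.0", 0x50, 0x0142))
print(setpci_command("00:1c.1", 0x50, 0x0042))
```

The sketch only builds the command strings; it does not touch any hardware.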
Created attachment 115151 [details]
dmesg with pci=earlydump

This pci=earlydump info shows that BIOS enabled ASPM L1 on both the 00:1c.1 and 03:00.0 devices. The 16-bit Link Control register is at 0x50 for both devices:

pci 0000:00:1c.1 config space:
50: 42 00 11 70 00 b2 14 00 00 00 40 01 00 00 00 00
pci 0000:03:00.0 config space:
50: 42 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00

00:1c.1 Link Control = 0x0042
03:00.0 Link Control = 0x0142

Both show ASPM L1 enabled.

From attachment 113731 [details]:

DMI: LENOVO 20AKCTO1WW/20AKCTO1WW, BIOS GIET62WW (2.12 ) 09/25/2013
ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
acpi PNP0A08:00: Requesting ACPI _OSC control (0x1d)
acpi PNP0A08:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x0d
acpi PNP0A08:00: ACPI _OSC control for PCIe not granted, disabling ASPM
pci 0000:03:00.0: [8086:08b2] type 00 class 0x028000
iwlwifi 0000:03:00.0: loaded firmware version 22.0.7.0 op_mode iwlmvm
iwlwifi 0000:03:00.0: Detected Intel(R) Wireless N 7260, REV=0x144
iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S

Because the BIOS declined to grant OS control of ASPM, Linux doesn't touch the ASPM configuration at all, and it remains as BIOS configured it.

From the lspci output in attachment 114061 [details]:

00:1c.1 PCI bridge to [bus 03]
  LnkCap: Port #3, Speed 5GT/s, Width x1, ASPM L0s L1
  LnkCtl: ASPM L1 Enabled; ...
03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)
  LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1
  LnkCtl: ASPM L1 Enabled; ...

Based on the fact that the workaround in comment #8 prevents the bug, there must be some problem when ASPM L1 is enabled. But there is no spec-compliant way for Linux to disable ASPM L1 when we don't have permission to control it.

It's clearly possible for us to ignore the BIOS and sledgehammer it. I actually proposed this at http://lkml.kernel.org/r/20130510225257.GA10847@google.com, but we decided it was too dangerous.
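For readers who want to reproduce the Link Control decoding above, here is a minimal sketch. It assumes, as stated in this comment, that the Link Control register sits at offset 0x50 on both devices, so it is the first little-endian 16-bit word of the "50:" row; in general the offset has to be found from the PCIe capability on each device. The function names are made up for illustration.

```python
# ASPM Control field encoding (Link Control bits 1:0).
ASPM_STATES = {
    0b00: "Disabled",
    0b01: "L0s Enabled",
    0b10: "L1 Enabled",
    0b11: "L0s L1 Enabled",
}


def link_control_from_row(row):
    """Read the little-endian 16-bit word at the start of a '50: 42 01 ...' row."""
    _offset, _, rest = row.partition(":")
    first_bytes = [int(byte_text, 16) for byte_text in rest.split()[:2]]
    return first_bytes[0] | (first_bytes[1] << 8)


def aspm_state(link_control):
    """Decode the ASPM Control field of a Link Control value."""
    return ASPM_STATES[link_control & 0x3]


# The two '50:' rows quoted from the pci=earlydump output.
ROOT_PORT_ROW = "50: 42 00 11 70 00 b2 14 00 00 00 40 01 00 00 00 00"
NIC_ROW = "50: 42 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00"

for name, row in (("00:1c.1", ROOT_PORT_ROW), ("03:00.0", NIC_ROW)):
    value = link_control_from_row(row)
    print("{} Link Control = {:#06x}: ASPM {}".format(name, value, aspm_state(value)))
```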
I investigated (see bug 57331) and found that when Windows doesn't have permission to control ASPM, it also apparently ignores driver requests to disable ASPM. It's possible that the iwlwifi driver could do something to fix this, and of course a BIOS change could fix this by leaving L1 disabled. But those are both out of our control.
Created attachment 121341 [details]
00:1c.1 Root Port config space dump (Windows 8.1)

wzyboy collected this with RWEverything:
http://lkml.kernel.org/r/CALkVjQbjnAk0mFZ-3zjx_15xXEaaiMaVxMMLD1HEUJhVtYXg4g@mail.gmail.com

The NIC works correctly.

I reformatted and annotated this to be more like "lspci -xxxx" output with these vim commands:

:set ff=unix
1GVGu
:%s/ 0...=/ /g
:%s/^00//
:%s/^0//
:%s/=/: /
:%s/^8: //
:%s/^.8: //
:%s/^..8: //
# join 16 times (0x100 bytes of config space); position at first line to join
JjJjJjJjJjJjJjJjJjJjJjJjJjJjJjJj
Created attachment 121351 [details]
03:00.0 7260 NIC config space dump (Windows 8.1)

wzyboy collected this with RWEverything:
http://lkml.kernel.org/r/CALkVjQaWBq6--sekifyGfJt9FDigsk56eXMUv9YR3sDVnVcX=Q@mail.gmail.com

I compared these dumps from Windows with Linux lspci output, but the lspci output only had the first 0x100 bytes, which doesn't include the L1 PM Substates and a few other things, so it might be useful to collect the "lspci -xxxx" output and compare again.

The differences I found looked innocuous:

- Linux set the 00:1c.1 PCI_COMMAND_IO bit
- Linux has the 00:1c.1 Secondary Status "Received Master-Abort" bit set
- the 00:1c.1 and 03:00.0 Interrupt Line values are different
- Linux shows the 00:1c.1 PCIe "Link Training" bit set
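A comparison like the one described above can be done byte by byte once both dumps are in "lspci -x" form. The sketch below is one possible approach, not the method actually used here. The function names are made up, and the two sample dumps are invented for illustration (the second differs from the first only in the low byte of the Command register and lacks the second row); they are not the attached Windows or Linux dumps.

```python
def parse_dump(text):
    """Map config space offset -> byte value from 'lspci -x' style rows."""
    config = {}
    for line in text.strip().splitlines():
        offset_text, _, rest = line.partition(":")
        base = int(offset_text, 16)
        for i, byte_text in enumerate(rest.split()):
            config[base + i] = int(byte_text, 16)
    return config


def diff_dumps(a, b):
    """List (offset, byte_in_a, byte_in_b) where the dumps disagree.

    An offset present in only one dump is reported with None on the other
    side, which makes a truncated dump (e.g. only the first 0x100 bytes)
    visible instead of silently ignored.
    """
    offsets = sorted(set(a) | set(b))
    return [(off, a.get(off), b.get(off)) for off in offsets if a.get(off) != b.get(off)]


def fmt(byte):
    return "--" if byte is None else "{:02x}".format(byte)


# Invented sample data, for illustration only.
DUMP_A = """
00: 86 80 b2 08 06 04 10 00
10: 04 00
"""
DUMP_B = """
00: 86 80 b2 08 07 04 10 00
"""

for offset, byte_a, byte_b in diff_dumps(parse_dump(DUMP_A), parse_dump(DUMP_B)):
    print("{:#05x}: {} -> {}".format(offset, fmt(byte_a), fmt(byte_b)))
```

The output is a raw list of differing offsets; mapping each offset back to a named register or bit still has to be done by hand or with lspci's decoded output.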
I have done the full comparison - based on another lspci output from wzyboy. When Linux enables L1 PM Substates it breaks. I can see that Windows does enable it. So obviously there is something fishy here... I have spent tons of time to track the differences. I only found a small thing. I'll try to send the patch. Unfortunately, I have very little time...
Created attachment 121671 [details]
osc_clk.patch

can you please try this? if it locks up everything, I am afraid you'll have to shut the laptop down completely, including removing the battery... This is probably the last thing I can think of; the next step would be to take the machine to a PCIe analyzer....
(In reply to Emmanuel Grumbach from comment #13)
>
> can you please try this?
>

This patch works great! With the patched kernel and without the setpci trick, my laptop has been up and running for 11.5 hours, during which more than 30 GiB of data has been downloaded (as a test), and I did not get a single connection loss!
ok - good news. I'll make a real patch. Thanks to all who helped!
Great! Thanks to all! I'll hold my kernel packages in pacman.conf until Linus Torvalds merges your patch. :-)
I guess that the bug can be closed now.
Done.
Hello.

Has this patch made it into the kernel yet? I have tried up to kernel 3.13.5 and still experience the same problem (same wifi card).

Just wanted to know if I should wait or if I need to patch it myself (I'm fairly new to Linux, so I would be glad if I could fix it without patching a kernel.)

Thank you.
(In reply to Joakim Koed from comment #19)
>
> Has this patch made it into the kernel yet? I have tried up to kernel
> 3.13.5 and still experience the same problem (same wifi card)

I am the original reporter and just upgraded to 3.14.5 the day before yesterday. I've been benchmarking my NIC without any of the w/a mentioned above, even without the "power_scheme=1" parameter. Two days later, 20 GiB of data has been transferred and not a single "connection lost" has occurred.

The only "glitch" I found is that there are some errors relating to "Microcode SW error detected. Restarting 0x2000000." each time it (re)connects to an AP. It does not affect usage, however.
Created attachment 127211 [details]
iwlwifi SW restarting dmesg in 3.14.5

These "error messages" occur from time to time but do not affect normal usage.
(In reply to wzyboy from comment #21)
> Created attachment 127211 [details]
> iwlwifi SW restarting dmesg in 3.14.5
>
> These "error messages" occur from time to time but do not affect normal
> usage.

Please don't mix 2 bugs here.
But this bug has been fixed in 3.13.5 (you messed up kernel versions).
You also need to upgrade your firmware.
(In reply to Joakim Koed from comment #19)
> Hello.
>
> Has this patch made it into the kernel yet? I have tried up to kernel
> 3.13.5 and still experience the same problem (same wifi card)
>
> Just wanted to know if I should wait or if I need to patch it myself (I'm
> fairly new to Linux, so I would be glad if I could fix it without patching a
> kernel.)
>
> Thank you.

3.13.5 has all the fixes. It is another issue. Please open a new bug.
(In reply to Emmanuel Grumbach from comment #22)
> (In reply to wzyboy from comment #21)
> > Created attachment 127211 [details]
> > iwlwifi SW restarting dmesg in 3.14.5
> >
> > These "error messages" occur from time to time but do not affect normal
> > usage.
>
> Please don't mix 2 bugs here.
> But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> You also need to upgrade your firmware.

I am so sorry that I mixed these bugs together. I thought they were the same.

And sorry again that I misremembered the kernel version. It was 3.14.4 and now I am on 3.14.5. Those error messages are gone. I can see no error messages from "iwlwifi" in dmesg now.

Thanks for your efforts!
(In reply to Emmanuel Grumbach from comment #22)
> (In reply to wzyboy from comment #21)
> > Created attachment 127211 [details]
> > iwlwifi SW restarting dmesg in 3.14.5
> >
> > These "error messages" occur from time to time but do not affect normal
> > usage.
>
> Please don't mix 2 bugs here.
> But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> You also need to upgrade your firmware.

Which firmware are you using? I've tried both from this site:
http://www.intel.com/support/wireless/wlan/sb/CS-034398.htm

When I use the one for 3.13+ kernels I only get about 1 Mbit/s download. When I use the one for 3.11+ with a 3.13 kernel I get full speed (80-100 Mbit/s+).

What is your experience? :) -- Sorry for the off-topic, but I'm really frustrated about this issue.
(In reply to Joakim Koed from comment #25)
> (In reply to Emmanuel Grumbach from comment #22)
> > (In reply to wzyboy from comment #21)
> > > Created attachment 127211 [details]
> > > iwlwifi SW restarting dmesg in 3.14.5
> > >
> > > These "error messages" occur from time to time but do not affect normal
> > > usage.
> >
> > Please don't mix 2 bugs here.
> > But this bug has been fixed in 3.13.5 (you messed up kernel versions).
> > You also need to upgrade your firmware.
>
> Which firmware are you using? I've tried both from this site:
> http://www.intel.com/support/wireless/wlan/sb/CS-034398.htm
>
> When I use the one for 3.13+ kernels I only get about 1 Mbit/s download. When I
> use the one for 3.11+ with a 3.13 kernel I get full speed (80-100 Mbit/s+).
>
> What is your experience? :) -- Sorry for the off-topic, but I'm really
> frustrated about this issue.

Please open a new bug.
For reference, this was fixed by 2d93aee152b1 ("iwlwifi: pcie: enable oscillator for L1 exit"), which appeared in v3.14-rc1 and was marked for stable (v3.10+).