Bug 107431

Summary: iwlwifi: 3165: L1 off is causing PCIe root complex to kick the NIC out - MWG100250316
Product: Drivers Reporter: utibe (utibe_ng)
Component: network-wireless Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED WILL_NOT_FIX    
Severity: high CC: aepstein607, linuxwifi, linville, utibe_ng
Priority: P1    
Hardware: Intel   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=110621
Kernel Version: 4.1.12 Subsystem:
Regression: No Bisected commit-id:
Attachments: Journactl iwlwifi before and after error
Trace output
full journalctl
dmesg logs
Output of lspci -xxxx -vvvv
IO Space
Device Space 3165
IO Space
Lspci_before_setpci
CPU Info
dmesg output
journactl_after_error
backport driver with tentative fix
dmesg_after_patch
backport driver with tentative fix
journalctl after first patch
backport driver with tentative fix
journactl_after_patch2
dmesg_after_patch2
journactl_after_patch3
dmesg_after_patch3
iwlwifi with L1 Disabled from driver

Description utibe 2015-11-07 16:51:51 UTC
Created attachment 192341 [details]
Journactl iwlwifi before and after error

I am using a partitioned system with Debian Jessie and Windows. Windows works perfectly with my wireless hardware. However, on Debian Jessie, at random times after start-up, my wireless disconnects and I have to restart the system.

I've researched this error and tried the following without success:
1)  options iwlwifi bt_coex_active=N  (edited /etc/modprobe.d/iwlwifi.conf)
2)          swcrypto=1  (edited /etc/modprobe.d/iwlwifi.conf)
3)          11n_disable=1 (edited /etc/modprobe.d/iwlwifi.conf)
4) Downloaded the latest firmware from both Intel and github
5) Tried linux kernels 4.2, 4.3 and back to 4.1 and each time using the correct ucode firmware.

All of the above yield the same error. The log output from start-up, before and after the error, is attached.

Is there any workaround please?
Comment 1 utibe 2015-11-07 16:54:27 UTC
A few corrections;

Works perfectly with Windows
I have tried all bt_coex_active options with no success
Comment 2 Emmanuel Grumbach 2015-11-08 07:27:56 UTC
Can you please attach the full output of the kernel log? (No grep iwl please).
Comment 3 Emmanuel Grumbach 2015-11-08 07:32:59 UTC
Can you also record tracing?

You can take a look at https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging for instructions.
Comment 4 utibe 2015-11-08 13:16:55 UTC
firmware-version: 25.32.13.0
I have attached the outputs of the following:
journactl
gzip -c trace.dat
dmesg.

let me know if you want anything else.
Comment 5 utibe 2015-11-08 13:17:51 UTC
Created attachment 192411 [details]
Trace output
Comment 6 utibe 2015-11-08 13:18:35 UTC
Created attachment 192421 [details]
full journalctl
Comment 7 utibe 2015-11-08 13:19:16 UTC
Created attachment 192431 [details]
dmesg logs
Comment 8 Emmanuel Grumbach 2015-11-08 14:18:00 UTC
Ok - bad news. This is a PCI related issue.
I am not saying that the PCI bus driver is to blame. It can be the Intel device as well. But it is something more platform / electrical than a pure WiFi driver issue.

Can you please try:

setpci -s 0000:02:00.0 0x50.B=0x40

Then send the output of sudo lspci -xxxx -vvvv

And let me know if it helped.

In any case, the above is really a hack.
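
For readers wondering what that write changes, here is a minimal sketch under a stated assumption: on this NIC the PCI Express capability is taken to start at config offset 0x40, which would make 0x50 the Link Control register (the lspci -vvvv output should confirm or refute this for a given card). The bit positions below follow the PCI Express Link Control layout; decode_link_control is a hypothetical helper name, not part of any tool.

```python
# Sketch: decode the low byte of the PCIe Link Control register.
# Assumption: offset 0x50 is Link Control on this device (capability at 0x40).

ASPM_STATES = {
    0b00: "disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}


def decode_link_control(low_byte: int) -> dict:
    """Return the fields of interest from the low byte of Link Control."""
    if not 0 <= low_byte <= 0xFF:
        raise ValueError("expected a single byte")
    return {
        "aspm": ASPM_STATES[low_byte & 0b11],       # bits 1:0, ASPM control
        "link_disable": bool(low_byte & 0x10),       # bit 4
        "common_clock": bool(low_byte & 0x40),       # bit 6
        "extended_synch": bool(low_byte & 0x80),     # bit 7
    }


if __name__ == "__main__":
    # The value written by the setpci command above.
    print(decode_link_control(0x40))
```

Under that assumption, writing 0x40 clears both ASPM bits (so L0s and L1 are off) and leaves only Common Clock Configuration set.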
Comment 9 Emmanuel Grumbach 2015-11-08 14:18:30 UTC
And yes, it is very strange that these issues appear only on Linux.
Comment 10 utibe 2015-11-08 15:04:11 UTC
Created attachment 192461 [details]
Output of lspci -xxxx -vvvv
Comment 11 Emmanuel Grumbach 2015-11-08 15:07:25 UTC
Good - the setpci command did what I thought it'd do.
Question now is if it helps.

Can you please also try to use Read Write Everything on Windows to dump the config space of the device?

Thanks.
Comment 12 utibe 2015-11-08 15:18:06 UTC
Apologies, I don't understand the previous instructions. Is that for Windows 10? and what commands?
At the moment it seems okay, But sometime yesterday it lasted 40 mins. So I will give it 3 hours.
Comment 13 Emmanuel Grumbach 2015-11-08 15:38:29 UTC
http://rweverything.com/


This will allow us to see the PCI configuration in Windows.
Since you said that it works fine on Windows, I'd like to compare.
Comment 14 utibe 2015-11-08 15:59:33 UTC
Ok I will switch to Windows later and dump the config with rweverything. At the moment, I'm still watching the Wifi on Linux and it seems okay. Thanks a lot for your great effort. I will let you know how it goes.
Comment 15 utibe 2015-11-08 21:58:11 UTC
Bus 00, Device 00, Function 00 - Advanced Micro Devices Host Bridge
 ID=15661022, SID=80B2103C, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 01, Function 00 - ATI Technologies Inc. VGA Controller (PCIE)
 ID=98511002, SID=80B2103C, Int Pin=INTA, IRQ=None
 MEM=C000000C D000000C FEB00000  IO=F000 

Bus 00, Device 01, Function 01 - ATI Technologies Inc. HD Audio Device (PCIE)
 ID=98401002, SID=80B2103C, Int Pin=INTB, IRQ=2D
 MEM=FEB64004  IO=None

Bus 00, Device 02, Function 00 - Advanced Micro Devices Host Bridge
 ID=156B1022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 02, Function 01 - Advanced Micro Devices PCI-to-PCI Bridge (PCIE)
 ID=14391022, SID=80B2103C, Int Pin=INTA, IRQ=05, PriBus=00, SecBus=01, SubBus=01
 MEM=None IO=None

Bus 00, Device 02, Function 02 - Advanced Micro Devices PCI-to-PCI Bridge (PCIE)
 ID=14391022, SID=80B2103C, Int Pin=INTB, IRQ=04, PriBus=00, SecBus=02, SubBus=02
 MEM=FEA00000-FEAFFFFF  IO=None

Bus 00, Device 02, Function 03 - Advanced Micro Devices PCI-to-PCI Bridge (PCIE)
 ID=14391022, SID=80B2103C, Int Pin=INTC, IRQ=0B, PriBus=00, SecBus=03, SubBus=03
 MEM=FE900000-FE9FFFFF  IO=0000E000-0000EFFF 

Bus 00, Device 02, Function 05 - Advanced Micro Devices PCI-to-PCI Bridge (PCIE)
 ID=14391022, SID=80B2103C, Int Pin=INTA, IRQ=05, PriBus=00, SecBus=04, SubBus=04
 MEM=FE800000-FE8FFFFF  IO=None

Bus 00, Device 08, Function 00 - Advanced Micro Devices En/Decryption Controller
 ID=15371022, SID=80B2103C, Int Pin=None, IRQ=None
 MEM=D080000C FE700000 FEB6F000 FEB6A000  IO=None

Bus 00, Device 10, Function 00 - Advanced Micro Devices XHCI USB Controller (PCIE)
 ID=78141022, SID=80B2103C, Int Pin=INTA, IRQ=None
 MEM=FEB68004  IO=None

Bus 00, Device 11, Function 00 - Advanced Micro Devices AHCI Controller
 ID=78041022, SID=80B2103C, Int Pin=INTA, IRQ=13
 MEM=FEB6E000  IO=F140 F130 F120 F110 F100 

Bus 00, Device 12, Function 00 - Advanced Micro Devices EHCI USB Controller
 ID=78081022, SID=80B2103C, Int Pin=INTA, IRQ=12
 MEM=FEB6D000  IO=None

Bus 00, Device 13, Function 00 - Advanced Micro Devices EHCI USB Controller
 ID=78081022, SID=80B2103C, Int Pin=INTA, IRQ=12
 MEM=FEB6C000  IO=None

Bus 00, Device 14, Function 00 - Advanced Micro Devices SMBus Controller
 ID=780B1022, SID=80B2103C, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 14, Function 02 - Advanced Micro Devices HD Audio Device
 ID=780D1022, SID=80B2103C, Int Pin=INTA, IRQ=10
 MEM=FEB60004  IO=None

Bus 00, Device 14, Function 03 - Advanced Micro Devices ISA Bridge
 ID=780E1022, SID=80B2103C, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 00 - Advanced Micro Devices Host Bridge
 ID=15801022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 01 - Advanced Micro Devices Host Bridge
 ID=15811022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 02 - Advanced Micro Devices Host Bridge
 ID=15821022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 03 - Advanced Micro Devices Host Bridge
 ID=15831022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 04 - Advanced Micro Devices Host Bridge
 ID=15841022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 00, Device 18, Function 05 - Advanced Micro Devices Host Bridge
 ID=15851022, SID=00000000, Int Pin=None, IRQ=None
 MEM=None IO=None

Bus 02, Device 00, Function 00 - Intel Corporation Network Controller (PCIE)
 ID=31658086, SID=40108086, Int Pin=INTA, IRQ=None
 MEM=FEAFE004  IO=None

Bus 03, Device 00, Function 00 - Realtek Semiconductor Ethernet Controller (PCIE)
 ID=813610EC, SID=80B2103C, Int Pin=INTA, IRQ=None
 MEM=FE91400C FE91000C  IO=E000 

Bus 04, Device 00, Function 00 - Realtek Semiconductor  Controller (PCIE)
 ID=522910EC, SID=80B2103C, Int Pin=INTA, IRQ=None
 MEM=FE8FF000  IO=None

LowestMemory=C0000000, LowestIO=E000
Comment 16 utibe 2015-11-08 22:02:55 UTC
Is that what you wanted from RW Everything? Linux has seemed okay for hours now, but I will let you know if the issue resurfaces.
Please clarify;

1) Will the problem return if I update my kernel or other debian packages?
2) Is it okay for me to use the setpci command any time I have the same issue again?
3) Do I need to setpci again when I reboot?

Cheers.
Comment 17 utibe 2015-11-09 06:27:58 UTC
hi Emmanuel,
It has got worse this morning. I left the computer on standby and when I tried to browse this morning it lasted barely 4 minutes and then disconnected. I tried a restart but I couldn't even log on. It has certainly become terrible.
I even tried the setpci command, but that didn't change anything.
Comment 18 Emmanuel Grumbach 2015-11-09 08:16:38 UTC
Hi,

The RWeverything output is incomplete. I need the full dump of the config space of the 7260 device.
Let me know if you have issues getting that for me.

The problem has not been "solved", but rather worked around by a bad hack that I asked you to try just to see what could be the culprit.
If you want to keep using this hack, you need to apply it every time you boot, and I am not surprised that a suspend / resume cycle "deletes" the effect of the workaround. You'd need to apply it after each resume as well.
Comment 19 utibe 2015-11-09 17:17:48 UTC
Please how do I do a full dump of the config space? Are there a set of instructions I could follow?
I just used the setpci command but didn't help this time, wifi disconnected in less than 5 minutes.
Comment 20 Emmanuel Grumbach 2015-11-09 17:33:42 UTC
I recommend unloading and reloading the driver after the setpci command.

I'll check for instructions for read write everything for windows.
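
The sequence described so far (setpci first, then a driver reload, repeated after every boot and resume) can be sketched as follows. This is only an illustration of the hack, not a recommended fix: the device address is the one from the setpci command earlier in this report, the function names are hypothetical, and it assumes the op-mode module in use is iwlmvm. It defaults to a dry run that prints the commands; actually running them needs root.

```python
import subprocess


def workaround_commands(device="0000:02:00.0"):
    """Build the command sequence: setpci write, then unload and reload the driver."""
    return [
        ["setpci", "-s", device, "0x50.B=0x40"],   # the hack from this report
        ["modprobe", "-r", "iwlmvm", "iwlwifi"],    # unload op-mode module and driver
        ["modprobe", "iwlwifi"],                    # load the driver again
    ]


def apply_workaround(device="0000:02:00.0", dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    for cmd in workaround_commands(device):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_workaround(dry_run=True)
```

Because the effect is lost on reboot and on suspend / resume, something like this would have to be hooked into both paths to keep the workaround in place.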
Comment 21 utibe 2015-11-09 19:24:31 UTC
Unloading and loading the iwlwifi driver (together with iwlmvm) seems to make it much better. I will monitor and let you know how it goes.
Thanks a lot
Comment 22 utibe 2015-11-10 19:52:23 UTC
Created attachment 192691 [details]
IO Space
Comment 23 utibe 2015-11-10 19:54:17 UTC
Emmanuel, please is that the I/O space you wanted?
Comment 24 Emmanuel Grumbach 2015-11-10 20:12:17 UTC
no :(
Sorry, I haven't  taken the time to play with it...

Vendor ID: 8086
The device ID is 3165

You should also see 4010 somewhere

This is the device you want.

Thanks for trying so hard!
Comment 25 utibe 2015-11-10 20:36:44 UTC
Created attachment 192711 [details]
Device Space 3165
Comment 26 utibe 2015-11-10 20:37:11 UTC
Is that better?
Comment 27 utibe 2015-11-10 20:45:53 UTC
Created attachment 192721 [details]
IO Space

Seems 4010 is subsystem Id?
Comment 28 Emmanuel Grumbach 2015-11-10 20:49:39 UTC
Yes - this is the one I need.

Thanks...
Lots of differences... :(
Comment 29 utibe 2015-11-10 21:40:52 UTC
Ok good we got the data. So will I need to re-configure (using instructions and/or a patch from yourself) the Linux PCI IO space? Also, my knowledge on this is very limited but I would have thought the difference shouldn't matter as they are different Operating Systems. Does that suggest some of the PCI registers may be defective? Does this affect bluetooth as well as they share the same card?
Cheers
Comment 30 Emmanuel Grumbach 2015-11-12 06:55:58 UTC
Can you please send the output of sudo lspci  -xxxx before you run setpci command. This will be a better baseline for comparison.

thanks.
Comment 31 utibe 2015-11-12 07:12:33 UTC
Created attachment 192861 [details]
Lspci_before_setpci
Comment 32 Emmanuel Grumbach 2015-11-12 08:51:54 UTC
Thank you for being so responsive.
I'll get back to you.
It might take a while since I have lots of other things to do as well.
Comment 33 Emmanuel Grumbach 2015-11-19 07:50:46 UTC
can you please paste / attach the output of cat /proc/cpuinfo?

thanks.
Comment 34 utibe 2015-11-19 17:31:23 UTC
Created attachment 195001 [details]
CPU Info
Comment 35 utibe 2015-11-23 18:02:27 UTC
Emmanuel,
Is there any update and is the attachment above what you expected?
Cheers
Utibe
Comment 36 Emmanuel Grumbach 2015-11-23 18:28:16 UTC
No update unfortunately. I got everything I need. Thank you very much.

I am trying to get help from the relevant people.
Comment 37 Emmanuel Grumbach 2015-11-25 08:04:04 UTC
I am in contact with the System team about that. It may take time.
In the meantime, please try to add pcie_aspm=off to your kernel command line and let me know what happens (w/o the setpci command).

thanks.
Comment 38 utibe 2015-11-26 18:05:36 UTC
How do I do that? 
Shall I edit /etc/modprobe.d/iwlwifi.conf and add pcie_aspm=off?
OR
should I run iwlwifi.pcie_aspm=off
OR 
should I add pcie_aspm=off to /proc/cmdline

Apologies for the delayed response
Comment 39 Emmanuel Grumbach 2015-11-26 18:09:14 UTC
Just add it to GRUB_CMDLINE_LINUX="" in /etc/default/grub

and then run update-grub.
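
To confirm after the reboot that the parameter actually reached the kernel, /proc/cmdline can be read back (it is read-only and reflects what the bootloader passed, so it cannot be edited directly). The snippet below is a minimal sketch; kernel_param and read_cmdline are hypothetical helper names, and the parsing is deliberately simple (it does not handle quoted values containing spaces).

```python
def kernel_param(cmdline: str, name: str):
    """Return the value of `name` on a kernel command line.

    Returns "" for a bare flag without "=", and None if the parameter is absent.
    """
    value = None
    for token in cmdline.split():
        key, _sep, val = token.partition("=")
        if key == name:
            value = val  # a later occurrence overrides an earlier one
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On the running system, kernel_param(read_cmdline(), "pcie_aspm") should return "off" if the GRUB change took effect. As far as I understand the kernel parameter documentation, pcie_aspm=off tells the kernel to leave the ASPM configuration untouched, so a link state already enabled by firmware may remain enabled.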
Comment 40 utibe 2015-11-26 18:58:41 UTC
Changed that line to GRUB_CMDLINE_LINUX="pcie_aspm=off" in /etc/default/grub and then ran update-grub.

After I restarted the system WiFi connection failed again as before.
Comment 41 Emmanuel Grumbach 2015-11-26 19:15:15 UTC
Can you attach the dmesg output?
I am interested in the iwlwifi lines about L1.

Thanks
Comment 42 utibe 2015-11-26 20:49:48 UTC
Created attachment 195561 [details]
dmesg output
Comment 43 utibe 2015-11-26 20:50:40 UTC
Created attachment 195571 [details]
journactl_after_error
Comment 44 utibe 2015-11-26 20:52:30 UTC
Are those attachments okay? dmesg | grep iwlwifi didn't yield any logs but journalctl did. Let me know if you need something further
Comment 45 Emmanuel Grumbach 2015-11-26 21:03:00 UTC
L1 is still enabled. Very strange.
OK - so setpci is still the only work around I can suggest for now. We analyzed the data you sent and couldn't find any smoking gun. We will continue digging, but we can't promise anything at that stage unfortunately.
Comment 46 utibe 2015-11-26 21:18:42 UTC
In the interim, what's the downside of using setpci? I was thinking we could make up a script that runs the setpci command after every system re-boot and also when a wifi connection is made?
Comment 47 Emmanuel Grumbach 2015-11-26 21:58:02 UTC
it is not very safe. You might race with the HW.

can you please try this patch?

diff --git a/drivers/net/wireless/iwlwifi/iwl-7000.c b/drivers/net/wireless/iwlwifi/iwl-7000.c
index 2d4fe1b..26eb554 100644
--- a/drivers/net/wireless/iwlwifi/iwl-7000.c
+++ b/drivers/net/wireless/iwlwifi/iwl-7000.c
@@ -163,6 +163,7 @@ static const struct iwl_ht_params iwl7000_ht_params = {
        .nvm_hw_section_num = NVM_HW_SECTION_NUM_FAMILY_7000,   \
        .non_shared_ant = ANT_A,                                \
        .max_ht_ampdu_exponent = IEEE80211_HT_MAX_AMPDU_64K,    \
+       .host_interrupt_operation_mode = true,                  \
        .dccm_offset = IWL7000_DCCM_OFFSET
 
 const struct iwl_cfg iwl7260_2ac_cfg = {
Comment 48 utibe 2015-11-26 22:18:40 UTC
ok, I will try this and let you know hopefully tomorrow.
Cheers
Comment 49 utibe 2015-11-27 23:28:45 UTC
Please what is the best way of applying this patch? do I just run the commands on the terminal?
Comment 50 Emmanuel Grumbach 2015-11-28 18:41:55 UTC
Created attachment 195671 [details]
backport driver with tentative fix

Please open the zip file and go into the iwlwifi/ directory. Then:

make defconfig-iwlwifi-public
sed -i 's/CPTCFG_IWLMVM_VENDOR_CMDS=y/# CPTCFG_IWLMVM_VENDOR_CMDS is not set/' .config
make -j4

sudo make install

reboot

thanks!
Comment 51 Emmanuel Grumbach 2015-11-28 18:48:31 UTC
I dug into that issue and found a few HW tweaks that we are missing in our driver.
The driver I attached includes one of them. If that still fails, I have another tweak to introduce. I still would like to know what tweak will solve the issue (assuming that the missing tweaks I found will suffice to fix the bug).
Comment 52 utibe 2015-11-28 19:16:37 UTC
I haven't applied the patch you asked me to on the 26th November. Shall I forget it and just use your latest attachment?
Comment 53 Emmanuel Grumbach 2015-11-28 19:31:34 UTC
Yes. This includes the patch.
Comment 54 utibe 2015-11-28 19:59:42 UTC
Created attachment 195681 [details]
dmesg_after_patch
Comment 55 utibe 2015-11-28 20:01:16 UTC
Still didn't solve the problem and I've attached dmesg. Do you want the product of journalctl?
Comment 56 Emmanuel Grumbach 2015-11-28 20:14:23 UTC
yes please.

I am sending a new driver right now.
Comment 57 Emmanuel Grumbach 2015-11-28 20:20:50 UTC
Created attachment 195691 [details]
backport driver with tentative fix

Here you go.

Thanks.
Comment 58 utibe 2015-11-28 20:44:17 UTC
Created attachment 195701 [details]
journalctl after first patch
Comment 59 Emmanuel Grumbach 2015-11-28 20:55:15 UTC
Created attachment 195711 [details]
backport driver with tentative fix

Here is third fix in case the first two didn't work.
Comment 60 utibe 2015-11-28 21:13:34 UTC
Created attachment 195721 [details]
journactl_after_patch2
Comment 61 utibe 2015-11-28 21:14:10 UTC
Created attachment 195731 [details]
dmesg_after_patch2
Comment 62 utibe 2015-11-28 21:16:33 UTC
2nd patch didn't work either. I'll try the third asap. As a sanity check that the patches are installed as they should be, please could you check the journalctl and dmesg logs and confirm they are what you would expect after the patching?
Comment 63 Emmanuel Grumbach 2015-11-28 21:17:41 UTC
Yes you are running the proper code.
Comment 64 utibe 2015-11-28 21:50:54 UTC
I've tried the third patch and it's working at the minute. If it goes wrong, I will let you know. Do you need any logs now that it is working okay or do I wait in case I see the fault again?
Comment 65 Emmanuel Grumbach 2015-11-28 21:58:37 UTC
No log needed if it is working. Thanks.
How long does it usually take to stop working?
Comment 66 utibe 2015-11-28 22:01:37 UTC
It has stopped working again. The time varies but the maximum time was about 45 mins. Typically, fails in less than 10 minutes, but in any case it has failed.
Comment 67 utibe 2015-11-28 22:02:43 UTC
Created attachment 195741 [details]
journactl_after_patch3
Comment 68 utibe 2015-11-28 22:03:15 UTC
Created attachment 195751 [details]
dmesg_after_patch3
Comment 69 Emmanuel Grumbach 2015-11-29 10:39:52 UTC
Created attachment 195821 [details]
iwlwifi with L1 Disabled from driver

Here is a new version. 

I have to admit that I am running out of ideas. I spent a fair amount of time reviewing the HW flows to make sure we are not missing anything.

This version is just to clear some nits out. It disables L1 completely which I can't commit because it works... but not on your platform.

Since you said that Windows works, I reviewed all the flows there as well.
I am afraid I'll have to close this bug as will not fix.

Let me know anyway what happens with that version.
Comment 70 utibe 2015-12-01 21:19:23 UTC
Sadly that didn't work :( . So if you close this now, what really are my options? Is there a chance it could be re-opened in the future? Could this problem be fixed in a future iwlwifi version?
Also, what sort of hardware issues do you think I might encounter using the setpci command?
Comment 71 Emmanuel Grumbach 2015-12-02 07:36:20 UTC
setpci will not burn your hardware, but it might cause races that may cause the problem to occur from time to time. I don't have any other options to offer you for now. When we know that we are working with an Intel platform (not your case), we can take the actual system on which the bug occurs and plug it into a PCIe analyzer to root cause the problem.
Since the platform here is not Intel, I am not sure our HW / System teams will agree to go to that level of debugging, and it would require having your system in our labs.

I can't say anything about the likelihood to have this bug fixed in the future. If it reproduces on an Intel platform and we can have the system in our labs, we may be able to make progress.
Comment 72 Emmanuel Grumbach 2016-01-11 17:34:23 UTC
*** Bug 110621 has been marked as a duplicate of this bug. ***