Bug 198645 - iwlwifi: 7260: Failed to wake NIC for hcmd. Sending ADD_STA: enqueue_hcmd failed: -5
Status: CLOSED WILL_NOT_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: Intel Linux
Importance: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
URL:
Keywords:
Duplicates: 198935
Depends on:
Blocks:
 
Reported: 2018-02-02 12:08 UTC by Antonín Dach
Modified: 2020-07-10 18:57 UTC

See Also:
Kernel Version: 4.15
Subsystem:
Regression: No
Bisected commit-id:


Attachments
journal of the incident. (27.37 KB, text/plain)
2018-02-02 12:08 UTC, Antonín Dach
System information (3.72 KB, text/plain)
2018-02-02 12:20 UTC, Antonín Dach
Kernel message log on kernel 4.14 (29.81 KB, text/plain)
2018-02-07 14:34 UTC, Dennis M. Pöpperl
P177SM-A / 4.14.20 / dmesg Crash (7.60 KB, text/plain)
2018-02-25 22:28 UTC, starlightknight
P177SM-A / 4.15.5 / dmesg Crash (5.72 KB, text/plain)
2018-02-25 22:29 UTC, starlightknight
P177SM-A TLP Stat (11.97 KB, patch)
2018-02-25 22:29 UTC, starlightknight
P177SM-A LSPCI -VVV (37.62 KB, text/plain)
2018-02-25 22:29 UTC, starlightknight
Output of `dmesg > dmesg_output` (119.17 KB, text/plain)
2018-03-01 13:20 UTC, lukeinnocenti
Output of `sudo lspci -vvv` (32.17 KB, text/plain)
2018-03-01 13:27 UTC, lukeinnocenti
PCI rescan for 7260 (545 bytes, application/x-shellscript)
2018-04-06 10:52 UTC, Luca Coelho

Description Antonín Dach 2018-02-02 12:08:52 UTC
Created attachment 273971 [details]
journal of the incident.

Hello,

The wireless connection is unstable: after a random amount of time it dies and cannot be used until the system is rebooted. rfkill doesn't work when this happens, and I can't remove the iwlwifi module because it is in use.


Network:   Card-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           driver: r8168 v: 8.045.08-NAPI port: d000 bus-ID: 02:00.0
Kernel: 4.15.0-1-MANJARO x86_64 bits: 64 gcc: 7.2.1

See the journal of the incident in attachment.


I will provide more info once I know more.
Further instructions for debugging this would be appreciated.
Comment 1 Antonín Dach 2018-02-02 12:20:11 UTC
Created attachment 273973 [details]
System information
Comment 2 Antonín Dach 2018-02-02 17:47:43 UTC
I missed this when pasting the network card info:
Card-2: Intel Wireless 7260 driver: iwlwifi bus-ID: 04:00.0


Also, I don't see any problem when I use kernel 4.14.15, so this issue is relatively recent.
Comment 3 Antonín Dach 2018-02-02 21:26:34 UTC
(In reply to Antonín Dach from comment #2)
> I missed this when pasting the network card info:
> Card-2: Intel Wireless 7260 driver: iwlwifi bus-ID: 04:00.0
> 
> 
> Also, I don't see any problem when I use kernel 4.14.15, so this issue is
> relatively recent.

I take that back: I can reproduce this on the 4.14 series, usually under heavy load.
Comment 4 Dennis M. Pöpperl 2018-02-07 14:34:12 UTC
Created attachment 274051 [details]
Kernel message log on kernel 4.14

I think I have hit the same problem.
Comment 5 Luca Coelho 2018-02-13 09:08:30 UTC
This line (present in both Antonin's and Dennis' logs) usually points to a platform problem:

kernel: Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)


The 0xffffffff there means that the PCI device is not accessible at all to the driver.
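
To check a saved log for this condition, something like the following could be used. It is only a minimal sketch: the message format is taken from the line quoted above, and the function names are hypothetical helpers, not part of iwlwifi or any existing tool. A PCI register read that comes back as all ones is treated as "device not responding".

```python
import re

# Matches the iwlwifi message quoted above and captures the register value.
# The exact wording is assumed from the log line in this comment.
CSR_TIMEOUT = re.compile(
    r"Timeout waiting for hardware access \(CSR_GP_CNTRL (0x[0-9a-fA-F]+)\)"
)


def device_unreachable(log_line: str) -> bool:
    """Return True if the line reports CSR_GP_CNTRL reading back as all ones."""
    match = CSR_TIMEOUT.search(log_line)
    if match is None:
        return False
    return int(match.group(1), 16) == 0xFFFFFFFF


def count_unreachable(log_lines) -> int:
    """Count log lines in which the register read returned 0xffffffff."""
    return sum(1 for line in log_lines if device_unreachable(line))
```

For example, `count_unreachable(open("dmesg_output"))` would report how many times the condition appears in a saved dmesg dump (the file name here is just an example).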

Can you tell me what is the exact PC you are using?

@Dennis do you also have the 7260 NIC?
Comment 6 Antonín Dach 2018-02-13 09:22:10 UTC
(In reply to Luca Coelho from comment #5)
> This line (present in both Antonin's and Dennis' logs) usually points to a
> platform problem:
> 
> kernel: Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
> 
> 
> The 0xffffffff there means that the PCI device is not accessible at all to
> the driver.
> 
> Can you tell me what is the exact PC you are using?
> 


I attached the inxi output earlier; I am repasting it here. What exact information are you looking for? I was using the wireless card for a long time on 4.14 without a problem, so it must have been a recent driver patch that causes this.

In case anyone wonders, I am not using the IOMMU, and turning xHCI and EHCI on or off has no effect on it.

Just a note, possibly unrelated: I have blacklisted sp5100_tco, which tries to use registers that my graphics card uses and prevented my computer from booting. I have no clue what this watchdog is for, so I left it blacklisted.


################

System:    Host: DS9 Kernel: 4.15.0-1-MANJARO x86_64 bits: 64 gcc: 7.2.1 Desktop: i3 4.14.1
           Distro: Manjaro Linux
Machine:   Device: desktop System: Gigabyte product: N/A serial: N/A
           Mobo: Gigabyte model: F2A88XN-WIFI v: x.x serial: N/A UEFI: American Megatrends v: F6 date: 12/24/2015
CPU:       Quad core AMD Athlon X4 880K (-MCP-) arch: Steamroller rev.1 cache: 8192 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm) bmips: 31959
           clock speeds: max: 4000 MHz 1: 2343 MHz 2: 1998 MHz 3: 3334 MHz 4: 3056 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380] bus-ID: 01:00.0
           Display Server: x11 (X.Org 1.19.6 ) driver: amdgpu Resolution: 1920x1080@60.00hz
           OpenGL: renderer: AMD Radeon R9 380 Series (TONGA / DRM 3.23.0 / 4.15.0-1-MANJARO, LLVM 5.0.1)
           version: 4.5 Mesa 17.3.3 Direct Render: Yes
Audio:     Card-1 Advanced Micro Devices [AMD] FCH Azalia Controller driver: snd_hda_intel bus-ID: 00:14.2
           Card-2 Advanced Micro Devices [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380]
           driver: snd_hda_intel bus-ID: 01:00.1
           Sound: Advanced Linux Sound Architecture v: k4.15.0-1-MANJARO
Network:   Card-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           driver: r8169 v: 2.3LK-NAPI port: d000 bus-ID: 02:00.0
           IF: enp2s0 state: up speed: 100 Mbps duplex: full mac: 40:8d:5c:5c:36:cf
           Card-2: Intel Wireless 7260 driver: iwlwifi bus-ID: 04:00.0
           IF: wlp4s0 state: down mac: e6:3f:fe:7e:38:17
Drives:    HDD Total Size: 1240.3GB (50.5% used)
           ID-1: /dev/sdb model: WDC_WD5003AZEX size: 500.1GB
           ID-2: /dev/sda model: INTEL_SSDSC2BW24 size: 240.1GB
           ID-3: USB /dev/sdc model: 2115 size: 500.1GB
Partition: ID-1: / size: 220G used: 121G (58%) fs: ext4 dev: /dev/dm-0
Sensors:   System Temperatures: cpu: 8.4C mobo: N/A gpu: 33.0
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 168 Uptime: 48 min Memory: 1805.0/15990.9MB Init: systemd Gcc sys: 7.2.1
           Client: Shell (bash 4.4.121) inxi: 2.3.56
Comment 7 Luca Coelho 2018-02-13 10:16:32 UTC
Sorry, the format of the attachment was a bit odd and I missed the machine configuration from there.

I was asking to make sure your machine was not one of the machines that are known to have problems in the PCI bus power.

Earlier you said that you could reproduce the problem also with 4.14, so if it is a patch, it's something that went into stable.  In that case, you'll have to give me the exact version that used to work and the version with which you reproduced the issue per comment #3.

As I said, this kind of error is usually either a power failure in the bus (e.g. too low a voltage provided to the card) or some other kind of physical problem.  Since this is a desktop, it's probably easy to open it up and try to disconnect and clean the contacts in the WiFi module to see if that helps... Could you try that?

Also, I don't see anything related to it in the logs you provided, but is this maybe happening after a suspend/resume cycle? You mentioned that the problem usually happens under heavy load, so that's probably not the case, but just to be sure. ;)
Comment 8 Antonín Dach 2018-02-13 10:27:04 UTC
(In reply to Luca Coelho from comment #7)
> Sorry, the format of the attachment was a bit odd and I missed the machine
> configuration from there.
> 

Yes, I redirected the output without realizing it would include the color codes.

> I was asking to make sure your machine was not one of the machines that are
> known to have problems in the PCI bus power.
> 
> Earlier you said that you could reproduce the problem also with 4.14, so if
> it is a patch, it's something that went into stable.  In that case, you'll
> have to give me the exact version that used to work and the version with
> which  you reproduced the issue per comment #3.
> 
I will have to brush up on my compilation/bisect skills. It could take a month, since this is not that pressing for me: I drilled a cable hole in the wall and wired my PC.

> As I said, this kind of error is usually either a power failure in the bus
> (e.g. too low a voltage provided to the card) or some other kind of physical
> problem.  Since this is a desktop, it's probably easy to open it up and try
> to disconnect and clean the contacts in the WiFi module to see if that
> helps... Could you try that?
> 

Yes, I'll do that. How do I check the voltage on mini PCIe? My 12 V and 5 V levels are within 0.1 V, so the power supply is not looking faulty to me.

> Also, I don't see anything related to it in the logs you provided, but is
> this maybe happening after a suspend/resume cycle? You mentioned that the
> problem usually happens under heavy load, so that's probably not the case,
> but just to be sure. ;)


It happens without a prior suspend/resume cycle. However, possibly unrelated: I have trouble with my Realtek LAN card after resume, and I have to manually rmmod the module and insert it back in to get it working. This happens with both the in-kernel r8169 driver and the official dkms r8169 driver, so it might have something to do with the entire network stack.
Comment 9 Dennis M. Pöpperl 2018-02-14 11:59:47 UTC
This is my system:

System:    Kernel: 4.14.14-300.fc27.x86_64
           Distro: Fedora release 27 (Twenty Seven)
Machine:   Device: laptop System: Acer product: TravelMate B115-M v: V1.26 
           Mobo: Acer model: Roxy v: Type2 - A01 Board Version 
CPU:       Quad core Intel Pentium N3540
Network:   Card-1: Intel Wireless 7260 driver: iwlwifi
Controller driver: r8169

This is the output from TLP. Note that wireless power management and runtime power management are deactivated for the card. With these disabled I have not experienced the bug anymore so far. These options used to work with the stable kernels until around 4.14, but I'm not sure when exactly the workaround became necessary.


+++ Wireless
wlp2s0(iwlwifi)               : wifi, connected, power management = off

+++ Runtime Power Management
Device blacklist = 02:00:0
Driver blacklist = iwlwifi

/sys/bus/pci/devices/0000:00:00.0/power/control = auto (0x060000, Host bridge, iosf_mbi_pci)
/sys/bus/pci/devices/0000:00:02.0/power/control = auto (0x030000, VGA compatible controller, i915)
/sys/bus/pci/devices/0000:00:13.0/power/control = auto (0x010601, SATA controller, ahci)
/sys/bus/pci/devices/0000:00:14.0/power/control = auto (0x0c0330, USB controller, xhci_hcd)
/sys/bus/pci/devices/0000:00:1a.0/power/control = auto (0x108000, Encryption controller, mei_txe)
/sys/bus/pci/devices/0000:00:1b.0/power/control = auto (0x040300, Audio device, snd_hda_intel)
/sys/bus/pci/devices/0000:00:1c.0/power/control = auto (0x060400, PCI bridge, pcieport)
/sys/bus/pci/devices/0000:00:1c.1/power/control = auto (0x060400, PCI bridge, pcieport)
/sys/bus/pci/devices/0000:00:1c.2/power/control = auto (0x060400, PCI bridge, pcieport)
/sys/bus/pci/devices/0000:00:1f.0/power/control = auto (0x060100, ISA bridge, lpc_ich)
/sys/bus/pci/devices/0000:00:1f.3/power/control = auto (0x0c0500, SMBus, i801_smbus)
/sys/bus/pci/devices/0000:02:00.0/power/control = on   (0x028000, Network controller, iwlwifi)
/sys/bus/pci/devices/0000:03:00.0/power/control = auto (0x020000, Ethernet controller, r8169)
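
For anyone who wants to try the same runtime power management setting without TLP, the `power/control` nodes listed above can be read and written directly. The following is a minimal sketch, not what TLP itself does internally: the helper names and the `sysfs_root` parameter are hypothetical (the latter exists only so the code can be tried against a scratch directory), and writing to the real node requires root.

```python
from pathlib import Path


def power_control_path(pci_address: str, sysfs_root: str = "/sys") -> Path:
    """Build the path of the runtime PM control node for a PCI device."""
    return (Path(sysfs_root) / "bus" / "pci" / "devices"
            / pci_address / "power" / "control")


def get_runtime_pm(pci_address: str, sysfs_root: str = "/sys") -> str:
    """Return the current setting: "on" (always powered) or "auto"."""
    return power_control_path(pci_address, sysfs_root).read_text().strip()


def set_runtime_pm(pci_address: str, mode: str, sysfs_root: str = "/sys") -> None:
    """Write "on" to disable runtime PM for the device, or "auto" to allow it."""
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    power_control_path(pci_address, sysfs_root).write_text(mode + "\n")
```

On the system above, `set_runtime_pm("0000:02:00.0", "on")` would correspond to the `on` state shown for the iwlwifi device. The setting does not persist across reboots.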
Comment 10 Antonín Dach 2018-02-15 17:00:06 UTC
(In reply to Dennis M. Pöpperl from comment #9)


Hey Dennis, I noticed we are using the same LAN card driver. I had an issue with it as well, and I got my LAN resume fixed by enabling the amdgpu DC layer (which you don't have any need for). I am now able to use r8169 after resume, I didn't have any WiFi drops today either, and I didn't even do a system upgrade.

So maybe you could try blacklisting r8169 (or uninstalling the official dkms driver r8169) and see if it fixes your WiFi driver? I am just guessing, but you have the same issue as I have, and this helped me a lot.

Again, I am puzzled how a graphics card layer can have any effect on my networking.

I still haven't done what Luca suggested, and maybe I am completely wrong, but I didn't have a WiFi drop today.
Comment 11 Antonín Dach 2018-02-16 08:55:04 UTC
(In reply to Antonín Dach from comment #10)

> didn't have wifi drop today.

Sorry, it did happen once the next day, so it's unrelated. Damn, I am clueless :(
Comment 12 Emmanuel Grumbach 2018-02-25 19:37:18 UTC
*** Bug 198935 has been marked as a duplicate of this bug. ***
Comment 13 starlightknight 2018-02-25 22:28:06 UTC
I'm also affected. 

Here is my system:

Laptop: Clevo P177SM-A / Intel Haswell i7-4810MQ / 32GB DDR3-1600 RAM
OS: Linux Mint 18.3
Kernel: 4.15.5 (Also reproduced on 4.14.6, 4.14.20)

Router: Turris Omnia / OpenWRT.
Chipset: Qualcomm Atheros QCA9880 802.11bgnac
Mode: 802.11ac
Band: 5 GHz
Channel: 36 (5.180 GHz)
Width: 80 MHz
Encryption: WPA2 PSK (TKIP, CCMP)

I'll attach dmesg samples as well as tlp-stat and lspci -vvv output for my system. Please let me know if I can provide anything else to help.

For me, it doesn't appear to be caused by load; it seems more likely to happen when fairly idle. For example, it happens much faster if I close apps like Slack & NextCloud, which would otherwise keep some idle network chatter going in the background.
Comment 14 starlightknight 2018-02-25 22:28:49 UTC
Created attachment 274457 [details]
P177SM-A / 4.14.20 / dmesg Crash
Comment 15 starlightknight 2018-02-25 22:29:13 UTC
Created attachment 274459 [details]
P177SM-A / 4.15.5 / dmesg Crash
Comment 16 starlightknight 2018-02-25 22:29:37 UTC
Created attachment 274461 [details]
P177SM-A TLP Stat
Comment 17 starlightknight 2018-02-25 22:29:59 UTC
Created attachment 274463 [details]
P177SM-A LSPCI -VVV
Comment 18 lukeinnocenti 2018-03-01 13:00:35 UTC
I've been having a very similar problem on my machine for some time. I've read through this and a number of other related bug reports, but I noticed some patterns that I've not seen reported in the other posts. I understand that this seems to be most likely a problem related to the card itself.

My machine is a Dell Precision M3800 laptop with an Intel 7260AC network card, running Ubuntu with the iwlwifi kernel driver.

First of all, I used to have Windows on the same machine, and had the same problem: the connection dropped more or less randomly. On Windows I could (temporarily) solve the problem and regain connectivity by going through a weird cycle of disconnecting and reconnecting, and disabling and re-enabling the network adapter, a variable number of times. Sometimes reconnecting worked only if done from one specific control panel window and not from another. Also, turning the laptop off after the connection dropped this way resulted in the laptop taking a very long time to shut down. Until a few months ago, trying to reboot the machine after the connection drop consistently resulted in a BSOD (this stopped being the case after what I'm assuming was some silent Windows update).

Also, most notably, this did not happen on most networks. I could (and can) connect more or less without problems to most networks, except the network of my university. There, I basically cannot use the connection, as it often drops after just a few minutes, and the only way I could find to bring it back up is to reboot the PC.

This last point is what puzzles me the most. The main difference between the university network and others, I'm guessing, is its scale: a very large number of machines connect to and disconnect from the network, so maybe the IP gets reassigned more often or things like that (I don't really know a lot about this stuff). Is there a way to diagnose exactly what characteristic of the network causes the problem?

Finally, on Windows I could usually manage to bring the connection back up after some disabling and re-enabling of the network adapter, but on Ubuntu I can't seem to do the same. I tried disabling the iwlwifi kernel module and re-enabling it, but still nothing. Isn't there a way to "simulate" the rebooting of the laptop from the point of view of the network card, without actually rebooting it?

I also posted about this problem on askubuntu, see this post: https://askubuntu.com/q/1010309/157358.

I will attach further details.
Comment 19 lukeinnocenti 2018-03-01 13:20:31 UTC
Created attachment 274483 [details]
Output of `dmesg > dmesg_output`

Attached output of dmesg. It shows the log resulting from
1. Booting up the system
2. The connection dropping soon enough
3. Me bringing the interface down with `ip link set wlp6s0 down` and then trying to bring it back up with `ip link set wlp6s0 up` (which doesn't work and gives the RTNETLINK error)
Comment 20 starlightknight 2018-03-01 13:23:28 UTC
I am wondering if this could possibly be a driver bug triggered in response to bug(s) in the router it's connected to.

The reason I say this is:

* Around the time this started happening, my router (full specs above with card model) received a firmware update to kernel 4.4.113 (this was also around the time I updated the laptop to 4.15 series). This was around the beginning of February.

* This now happens to me as often as hourly when at home, which is driving me crazy.

* When I'm at work (different router), it has never happened, even when the same machine, hardware, and kernel are connected the entire day.

* If it were a hardware failure problem, I would expect this to occur regardless of router.

* I can now reproduce this issue on older kernels that used to work for me, whether it be 4.4, 4.10, 4.13, 4.14, or 4.15. I can do this when booting my last installed kernel version of that series that used to work (rather than the latest of that series, where a regression may have been backported in the case of 4.4 or 4.14). So it seems unlikely to me that this is a recent problem in iwlwifi, but that doesn't mean it isn't a bug either.

* The lack of a larger number of reports could potentially be explained if the problem varies with router hardware/software.
Comment 21 lukeinnocenti 2018-03-01 13:27:05 UTC
Created attachment 274485 [details]
Output of `sudo lspci -vvv`

Output of `sudo lspci -vvv`
Comment 22 Antonín Dach 2018-03-01 16:04:27 UTC
(In reply to starlightknight from comment #20)


I didn't update the router firmware, but I am using a non-default one: DD-WRT v24-sp2 (03/25/13) std.

Something odd is triggering it.
Comment 23 Simon Ye 2018-03-20 05:15:46 UTC
This is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=191601, which also contains a script that might be able to reset your card without a reboot.

Please see my longer comments here: https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/1673344. In summary, I believe this to be a longstanding hardware-software bug in Linux due to a hardware degraded 7260 NIC. Replacing the 7260 card fixed it for me.
Comment 24 Luca Coelho 2018-04-06 10:52:28 UTC
Created attachment 275129 [details]
PCI rescan for 7260

Yes, there are many reasons that can cause this and we can't really do anything to recover from it in the driver.

We have recently been working on a workaround in the driver that may help recover by rescanning the PCI bus and causing the device to be reprobed.  The workaround works when the problem happens for a very short time (i.e. when there is a quick current drop in the power provided to the NIC) and comes back to normal.  This will go into 4.18.

One way to test if this workaround will help is by doing it from userspace.  You could try to run the attached script when the problem occurs and see if it helps.
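
The contents of the attached script are not reproduced in this thread, so the following is only a sketch of what such a userspace rescan could look like, using the standard sysfs `remove` and `rescan` nodes. The function name, the `sysfs_root` parameter, and the settle delay are assumptions made for illustration; the address `0000:04:00.0` in the usage note assumes PCI domain 0000 and the bus-ID reported in comment #2, and must be adapted to your system. It has to run as root.

```python
import time
from pathlib import Path


def remove_and_rescan(pci_address: str, sysfs_root: str = "/sys",
                      settle_seconds: float = 1.0) -> None:
    """Drop a PCI device from the bus and ask the kernel to rescan for it."""
    pci = Path(sysfs_root) / "bus" / "pci"
    remove_node = pci / "devices" / pci_address / "remove"
    # If the device has already vanished from sysfs, skip straight to the rescan.
    if remove_node.exists():
        remove_node.write_text("1\n")
    # Give the device a moment before rescanning (delay value is a guess).
    time.sleep(settle_seconds)
    (pci / "rescan").write_text("1\n")
```

Calling `remove_and_rescan("0000:04:00.0")` when the problem occurs should cause the device to be reprobed if it has become reachable again; if the NIC still does not respond on the bus, the rescan will not bring it back.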

Another thing to check is the connection of the NIC to the PCI socket.  Sometimes the connectors get dirty or there is some other physical contact problem, so removing the NIC, cleaning the contacts thoroughly and connecting it again may help.
Comment 25 Luca Coelho 2018-04-06 10:54:39 UTC
I'm going to close this as WILL_NOT_FIX, because there is nothing we can do from the driver side.  The workaround will come out soon and may solve the problem for some, but not for everyone.  So I will not consider this workaround a solution for this bug.
Comment 26 Severus 2020-01-09 03:40:23 UTC
I found that anyone using the Intel Corporation Dual Band Wireless-AC 7260 will have the module crash with 802.11ac at 80 MHz channel width. If the 80 MHz channel width is disabled on the router, it works fine with 20/40 MHz channel widths.

The Intel Corporation Dual Band Wireless-AC 7260 supports 802.11ac, so I think the bug belongs to the driver?
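
To see which channel width the card is actually using before and after changing the router setting (for 802.11ac the wide setting is 80 MHz), the output of `iw dev <interface> info` can be inspected. The sketch below assumes that output contains a line of the form `channel 36 (5180 MHz), width: 80 MHz, center1: 5210 MHz`; the `iw` output format is not a stable interface, so treat the pattern and the helper name as assumptions.

```python
import re

# Assumed format of the channel line printed by `iw dev <interface> info`.
WIDTH = re.compile(r"width:\s*(\d+)\s*MHz")


def channel_width_mhz(iw_info_output: str):
    """Return the channel width in MHz, or None if no width line is present."""
    match = WIDTH.search(iw_info_output)
    return int(match.group(1)) if match else None
```

Feeding it the text captured from `iw dev wlp4s0 info` (interface name taken from comment #6) should give 80 while the router offers the wide channel, and 20 or 40 once it is disabled.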
Comment 27 Anderson 2020-01-16 11:30:25 UTC
I get a similar issue on my Asus TP500LN, in which I changed the network card to an Intel 7260. The connection can be established after boot, but after some time, especially under heavy traffic, the network card fails and can no longer be detected until the laptop is rebooted.


I installed Windows 10 on the laptop, and the Intel 7260 works smoothly there. Therefore, I think the hardware of my laptop and network card is fine. Could this issue be re-opened and investigated again?
Comment 28 Martin Zecher 2020-07-10 18:57:24 UTC
(In reply to Severus from comment #26)
> I found that anyone using the Intel Corporation Dual Band Wireless-AC 7260
> will have the module crash with 802.11ac at 80 MHz channel width. If the
> 80 MHz channel width is disabled on the router, it works fine with 20/40 MHz
> channel widths.
> 
> The Intel Corporation Dual Band Wireless-AC 7260 supports 802.11ac, so I
> think the bug belongs to the driver?

Thanks a lot man. This is not a real fix but at least at home I don't experience the problem anymore.
