Bug 173101 - Iwlwifi: 8260: TFD queue hang
Summary: Iwlwifi: 8260: TFD queue hang
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
Depends on:
Reported: 2016-09-28 19:21 UTC by Tobias Schmetzer
Modified: 2017-01-09 14:34 UTC
CC List: 3 users

See Also:
Kernel Version: 4.8
Regression: No
Bisected commit-id:

dmesg, syslog, 3 (probably equal) iwldumps (1.45 MB, application/x-gzip)
2016-09-28 19:21 UTC, Tobias Schmetzer
iwlwifi-8000C-22.ucode firmware with debugging enabled (2.02 MB, application/octet-stream)
2016-10-11 09:52 UTC, Luca Coelho
attachment-15437-0.html (874 bytes, text/html)
2016-10-21 14:33 UTC, Emmanuel Grumbach

Description Tobias Schmetzer 2016-09-28 19:21:51 UTC
Created attachment 239981
dmesg, syslog, 3 (probably equal) iwldumps

Dear wifi-team,

I cannot establish a stable wifi connection. It usually doesn't work at all, but sometimes I get a couple of MB through.
I am running Ubuntu Linux on a new laptop. At first I installed the latest stable Ubuntu release, 16.04.1. The wifi didn't work, and I read a lot about firmware bugs being resolved in firmware version 22, which isn't supported by kernel 4.4. So I installed the Ubuntu 16.10 final beta with kernel 4.8. It came with firmware version 22, but unfortunately the wifi networking bug was still there.

dmesg mainly says:
[ 1306.587264] ieee80211 phy0: Hardware restart was requested
[ 1307.469264] iwlwifi 0000:01:00.0: L1 Enabled - LTR Disabled
[ 1322.966428] iwlwifi 0000:01:00.0: Queue 2 stuck for 10000 ms.
[ 1322.969498] iwlwifi 0000:01:00.0: Microcode SW error detected.  Restarting 0x2000000.
[ 1322.969508] iwlwifi 0000:01:00.0: CSR values:

>ethtool -i wlp1s0 | grep firmware
firmware-version: 22.361476.0

>uname -a 
Linux camilo 4.8.0-17-generic #19-Ubuntu SMP Sun Sep 25 05:29:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

>lspci -v
01:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
	Subsystem: Intel Corporation Wireless 8260
	Flags: bus master, fast devsel, latency 0, IRQ 129
	Memory at a5000000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi

Attached you'll find the full dmesg, the syslog, and 3 (probably equal) iwldumps.
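
For anyone comparing their own logs against the dmesg excerpt in this description, here is a minimal Python sketch that pulls the "Queue N stuck" events out of a log. The regular expression is derived only from the single line quoted in this report; the function name `find_stuck_queues` is hypothetical, and other driver versions may word the message differently.

```python
import re

# Pattern derived from the one line quoted in this report, e.g.:
# [ 1322.966428] iwlwifi 0000:01:00.0: Queue 2 stuck for 10000 ms.
# Assumption: other kernels use the same wording.
STUCK_QUEUE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+iwlwifi\s+(?P<dev>\S+):\s+"
    r"Queue\s+(?P<queue>\d+)\s+stuck\s+for\s+(?P<ms>\d+)\s+ms"
)


def find_stuck_queues(log_text):
    """Return one dict per 'Queue N stuck' line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = STUCK_QUEUE.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),
                "device": match.group("dev"),
                "queue": int(match.group("queue")),
                "stuck_ms": int(match.group("ms")),
            })
    return events


# The dmesg lines quoted in the description.
SAMPLE = """\
[ 1306.587264] ieee80211 phy0: Hardware restart was requested
[ 1307.469264] iwlwifi 0000:01:00.0: L1 Enabled - LTR Disabled
[ 1322.966428] iwlwifi 0000:01:00.0: Queue 2 stuck for 10000 ms.
[ 1322.969498] iwlwifi 0000:01:00.0: Microcode SW error detected.  Restarting 0x2000000.
"""

for event in find_stuck_queues(SAMPLE):
    print(event)
```

Feeding it the full output of `dmesg` shows how often, and on which queue, the hang recurs.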
Comment 1 Emmanuel Grumbach 2016-09-28 19:54:56 UTC

can you try the workarounds described here:

Comment 2 Tobias Schmetzer 2016-10-04 20:20:17 UTC
- USB3 doesn't seem to be the problem. I couldn't turn it off, but using it or not didn't change anything.
- I haven't tried changing the wifi's power save mode as the problem occurs right after start.
- The workaround cfg80211_disable_40mhz_24ghz=1 on its own (without any other workaround) made the wifi work, and the error message no longer appeared.
- The workaround 11n_disable=1 on its own (without any other workaround) also made the wifi work.

=> maybe a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=172431
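
For readers who want to try the two workarounds named in this comment, the following Python sketch renders them as modprobe `options` lines. It rests on assumptions not stated in the thread: that `cfg80211_disable_40mhz_24ghz` is a parameter of the `cfg80211` module and `11n_disable` a parameter of the `iwlwifi` module, and that the lines go into a file under `/etc/modprobe.d/`. The helper name `render_modprobe_options` is hypothetical.

```python
def render_modprobe_options(options):
    """Render {module: {param: value}} as modprobe.d 'options' lines."""
    lines = []
    for module, params in options.items():
        args = " ".join(f"{name}={value}" for name, value in params.items())
        lines.append(f"options {module} {args}")
    return "\n".join(lines) + "\n"


# Assumption: module ownership of each parameter as described in the lead-in.
# In this comment each workaround was tested on its own, not combined.
WORKAROUNDS = {
    "cfg80211": {"cfg80211_disable_40mhz_24ghz": 1},
    "iwlwifi": {"11n_disable": 1},
}

print(render_modprobe_options(WORKAROUNDS), end="")
```

To test one workaround in isolation, as was done here, keep only the matching entry in `WORKAROUNDS`. The modules have to be reloaded (or the machine rebooted) before a changed option takes effect.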
Comment 3 Tobias Schmetzer 2016-10-04 20:56:18 UTC
I could not switch to the 5 GHz band as my router only offers 2.4 GHz. But running "iwlist wlp1s0 scanning" I found that 3 routers (including mine) are using the same channel, 6. They seem to be assigned to 3 different cells, though.
Comment 4 Luca Coelho 2016-10-11 07:41:33 UTC
I'll soon provide you with a debug version of the firmware for further investigation.
Comment 5 Luca Coelho 2016-10-11 09:52:57 UTC
Created attachment 241461
iwlwifi-8000C-22.ucode firmware with debugging enabled

Can you take some logs with the attached firmware, following the instructions here?


This will allow us to further debug the issues you are experiencing.

And please make sure you read and understand the privacy aspects of sending such logs to us:

Comment 6 Tobias Schmetzer 2016-10-12 21:05:26 UTC
I'm unable to reproduce the error at the moment. I'll try to find out why during the next couple of days. I've already turned the iwlwifi modprobe options back to the original but it still works. There's also been an update to my Ubuntu 16.10 though.
Comment 7 Mo 2016-10-16 12:31:57 UTC
Seeing the same issues, on the same card, also on Ubuntu 16.10. I will email the firmware debug file and dmesg output.

Additional information:
>ethtool -i wlp4s0 | grep firmware
firmware-version: 22.391740.0

>uname -a 
Linux x1ii 4.8.0-22-generic #24-Ubuntu SMP Sat Oct 8 09:15:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

>lspci -v
04:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
	Subsystem: Intel Corporation Wireless 8260
	Flags: bus master, fast devsel, latency 0, IRQ 130
	Memory at e1100000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi

Laptop model: Lenovo Thinkpad Carbon X1 2016 (4th gen). I've been seeing the same problem on previous kernel versions.
Comment 8 Tobias Schmetzer 2016-10-18 18:18:26 UTC
Finally the error occurred again. It looks like the error doesn't appear all the time, as I was unable to reproduce it during the last couple of days (I didn't have time to check every day). It feels like there might be one (kind of) device that interferes but isn't always turned on. I will append three dumps I made.
Comment 9 Tobias Schmetzer 2016-10-18 18:51:19 UTC
I sent the dumps by mail. 

When using the workaround I no longer get the queue error or any other error from iwlwifi. But apart from slower performance, I also have a bad connection with high latency and packet loss, as a ping to my router shows:

64 bytes from icmp_seq=584 ttl=64 time=16.9 ms
64 bytes from icmp_seq=585 ttl=64 time=81.5 ms
64 bytes from icmp_seq=587 ttl=64 time=98.3 ms
64 bytes from icmp_seq=588 ttl=64 time=876 ms
64 bytes from icmp_seq=590 ttl=64 time=9188 ms
64 bytes from icmp_seq=604 ttl=64 time=8486 ms
64 bytes from icmp_seq=606 ttl=64 time=8694 ms
64 bytes from icmp_seq=608 ttl=64 time=8112 ms

Perhaps that helps you find the problem. If this has nothing to do with the firmware crash, I will open a new ticket.
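
To put numbers on the ping excerpt in this comment, here is a rough Python sketch that estimates packet loss from gaps in `icmp_seq` and reports the latency range. It assumes the quoted lines are an unedited, contiguous run of replies; if lines were trimmed when pasting, the loss figure is overstated. The function name `summarize_ping` is hypothetical.

```python
import re

# Matches the reply lines quoted in this comment, e.g.:
# 64 bytes from icmp_seq=584 ttl=64 time=16.9 ms
PING_REPLY = re.compile(
    r"icmp_seq=(?P<seq>\d+)\s+ttl=\d+\s+time=(?P<ms>[\d.]+)\s*ms"
)


def summarize_ping(output):
    """Estimate loss from icmp_seq gaps and report the latency range."""
    replies = [
        (int(m.group("seq")), float(m.group("ms")))
        for m in PING_REPLY.finditer(output)
    ]
    if not replies:
        return None
    seqs = [seq for seq, _ in replies]
    times = [ms for _, ms in replies]
    expected = max(seqs) - min(seqs) + 1   # probes sent in the observed window
    lost = expected - len(set(seqs))       # sequence numbers with no reply
    return {
        "received": len(set(seqs)),
        "expected": expected,
        "loss_pct": 100.0 * lost / expected,
        "min_ms": min(times),
        "max_ms": max(times),
    }


# The reply lines quoted in this comment.
SAMPLE = """\
64 bytes from icmp_seq=584 ttl=64 time=16.9 ms
64 bytes from icmp_seq=585 ttl=64 time=81.5 ms
64 bytes from icmp_seq=587 ttl=64 time=98.3 ms
64 bytes from icmp_seq=588 ttl=64 time=876 ms
64 bytes from icmp_seq=590 ttl=64 time=9188 ms
64 bytes from icmp_seq=604 ttl=64 time=8486 ms
64 bytes from icmp_seq=606 ttl=64 time=8694 ms
64 bytes from icmp_seq=608 ttl=64 time=8112 ms
"""

print(summarize_ping(SAMPLE))
```

Under the contiguity assumption, only 8 of the 25 sequence numbers between 584 and 608 got a reply.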
Comment 10 Emmanuel Grumbach 2016-10-18 20:38:43 UTC
Ok - the data you attached is instructive. You are hitting the "CCA on extension channel" problem. In that sense, it is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=172431

The high latencies you see when you have the workaround in place are interesting.
Can you create a manual dump when those occur using the fw_dbg_collect debugfs hook?
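
A minimal Python sketch of what triggering such a manual dump could look like. The thread only names the `fw_dbg_collect` debugfs hook; the path layout `iwlwifi/<device>/iwlmvm/fw_dbg_collect` under the debugfs mount, and writing `1` to it, are assumptions here and should be checked against the iwlwifi debugging instructions. The function name `trigger_fw_dump` is hypothetical, and writing to debugfs normally requires root.

```python
import glob
import os


def trigger_fw_dump(debugfs_root="/sys/kernel/debug"):
    """Write '1' to every fw_dbg_collect hook found; return the paths written.

    Assumption: the hook lives at
    <debugfs_root>/iwlwifi/<device>/iwlmvm/fw_dbg_collect.
    """
    pattern = os.path.join(
        debugfs_root, "iwlwifi", "*", "iwlmvm", "fw_dbg_collect"
    )
    hooks = sorted(glob.glob(pattern))
    for hook in hooks:
        with open(hook, "w") as handle:
            handle.write("1")
    return hooks


# Usage (as root, while the high latency is occurring):
#     trigger_fw_dump()
```

An empty return value means no hook was found at the assumed location, for example because debugfs is not mounted or the driver was built without debugfs support.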

Comment 11 Emmanuel Grumbach 2016-10-18 20:44:48 UTC
My previous comment was for Tobias.
And yes, it is indeed related to interference.
What you can try is to see if an earlier version of the firmware works better for you.
Comment 12 Tobias Schmetzer 2016-10-19 12:07:50 UTC
Glad you were able to analyze the problem!
I will try to do the manual coredump to discover the reason for the latency.
The high latency also doesn't always show up, but it definitely did for a certain time (this might again be the interference from a device that was switched on at that time).
Which range of earlier versions do you propose? And under what circumstances? Do you mean to see whether the workaround works better with a firmware version earlier than 22?
Comment 13 Emmanuel Grumbach 2016-10-19 12:59:56 UTC
Can you try -21.ucode for example?
Comment 14 Emmanuel Grumbach 2016-10-20 20:09:15 UTC
Hi Mo,

I got the data you sent privately. You are suffering from the same problem as Tobias which is that we constantly have CCA on the extension channel.

I'll let Luca collapse all the bugs as duplicate for now.
FYI - from my experience, these bugs are hard to address and fix and the firmware team is really busy so the iterations are long...
Comment 15 Mo 2016-10-21 13:49:49 UTC
Thanks for looking at it Emmanuel.

FYI: I tried the -21.ucode firmware, same thing. Also for me, cfg80211_disable_40mhz_24ghz=1 doesn't fix the issue, but workaround 11n_disable=1 does.
Comment 16 Emmanuel Grumbach 2016-10-21 14:33:45 UTC
Created attachment 242121

Makes sense since your problem is on extended 5GHz, so that explains why cfg80211_disable_40mhz_24ghz doesn't help and it also explains why 11n_disable helps.

Comment 17 Tobias Schmetzer 2016-10-25 19:55:06 UTC
I had sent you the manual dump and you said you didn't see anything special and needed the firmware team for further analysis. 

As the problem occurred again I did some more testing and found out:
- Although I have 40 MHz turned off, there is high latency and noticeable packet loss, with almost no data coming through right after opening a connection, for example to an arbitrary website. The firmware doesn't crash though, and no problem is reported in syslog.
- After disabling 11n in iwlwifi.conf the problem is gone, but the speed is of course pretty low. A speed test of the internet connection on Linux (40 MHz and 11n disabled) now averages 1,5M/0,8M/25ms (down/up/ping), whereas with the Windows driver I reach an average of 11M/0,8M/25ms (down/up/ping).
Comment 18 Tobias Schmetzer 2016-10-25 20:10:54 UTC
Sorry, I need to correct myself: even with 11n disabled the high latencies show up, and only occasionally do I get a couple of bytes through (even normal websites take ages to load completely).
Comment 19 Emmanuel Grumbach 2016-10-25 20:14:17 UTC
@Tobias: thanks.

Thing is that at this point, we need real firmware eng to look at this :)
I am a driver guy who triages the problems, but at this stage, we really want the firmware guys to check what happens. What I can do is to prepare a debug version of the firmware that will be less tolerant to latencies, and then, the FW crash will happen faster. We can hope that the FW crash will trigger even w/o 40MHz (and maybe even w/o 11n) to get valuable data.
What is the ping latency you see when things go wrong?
Comment 20 Emmanuel Grumbach 2016-10-25 20:48:08 UTC
@Tobias, I think I'd like you to open a new bug for the issues w/o 11n. It can't be the same problem.

BTW: if the data is encrypted, you can safely store it in bugzilla. Nobody can read it anyway, and it makes it a bit easier for us to track which dump matches what scenario.
Comment 21 Emmanuel Grumbach 2017-01-09 14:34:01 UTC
Since it is a CCA problem, we can't do much unfortunately.
Closing as will not fix.
