Bug 203577

Summary: iwlwifi: 8260: traffic dies - WIFI-25674
Product: Drivers Reporter: Rich (rhintze)
Component: network-wireless Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX    
Severity: normal CC: enban, jerry, ronan
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.1.1 Subsystem:
Regression: No Bisected commit-id:

Description Rich 2019-05-11 22:25:48 UTC
Under normal conditions it works.  When I run a speed test over the wlan0 interface, it silently fails.  Nothing in dmesg or syslog.

- test site; http://www.dslreports.com/speedtest 

- Host Slackware64-current (5-9-2019) (tests fine in kernel 5.0.13) (issue appears in kernel 5.1.0)
- Download test passes flawlessly, upload dies after 5 seconds and then the interface becomes non-responsive.

Hardware:
Network controller: Intel Corporation Wireless 8260 (rev 3a)

(The wired interface works fine in the same test scenario;
only the wireless interface silently breaks.)
Comment 1 Rich 2019-05-12 02:18:41 UTC
Looks like the issue persists in kernel 5.1.1

- After running the speed test the interface stops working, and I need to bring it down and back up before I can use it again.

Comment 2 Rich 2019-05-12 02:39:44 UTC
Not sure if this is related:
May 11 22:22:42 lemur kernel: [    6.185195] r8169 0000:03:00.1: can't disable ASPM; OS doesn't have ASPM control


Here is what the speed test site says:
During upload the measured speed went to zero and stayed there error:1
The test noticed that uploading stopped. Here are some possible causes:

1. Connection drop during upload.

2. Very large upload buffering. Check "Staged Uploads", in the https://www.dslreports.com/speedtest/preferences. This will use increasingly large single uploads to determine the speed. Or change the upload method to 'web socket'.

3. Your connection is very poor. So poor that packet loss is causing many halts. Please review the ping radar plot by location. Is the ping time reasonable, both "best" and "worst"?

4. An Anti-virus product or browser extension is stalling or buffering the upload. Disable the most likely product, restart the browser, try again. Sophos AV, among others, are known to buffer all uploads (see explanation 2).
Comment 3 Rich 2019-05-12 02:51:58 UTC
root@lemur ~ $ lspci -nnkv | sed -n '/Network/,/^$/p'
02:00.0 Network controller [0280]: Intel Corporation Wireless 8260 [8086:24f3] (rev 3a)
	Subsystem: Intel Corporation Dual Band Wireless-AC 8260 [8086:1010]
	Flags: bus master, fast devsel, latency 0, IRQ 131
	Memory at df200000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number e4-a7-a0-xx-xx-xx-xx-xx
	Capabilities: [14c] Latency Tolerance Reporting
	Capabilities: [154] L1 PM Substates
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi


* Device Serial Number obfuscated
Comment 4 Emmanuel Grumbach 2019-05-12 08:04:05 UTC
Please share the dmesg output. Thank you.
Comment 5 Rich 2019-05-12 13:31:06 UTC
Not much to see in DMESG as far as I could tell.

before interface failure:
http://lehcar.no-ip.org:8080/~rich/dmesg-pre-fail.txt

after interface failure and bouncing the interface (using Slackware  /etc/rc.d/rc.inet1 restart )
http://lehcar.no-ip.org:8080/~rich/dmesg-post-fail.txt

Here are some screen shots:
1. Failure during testing using kernel 5.1.1:
http://lehcar.no-ip.org:8080/~rich/UL_FAIL_5-1-1_pic.png
  - config: http://lehcar.no-ip.org:8080/~rich/config-huge-5.1.1

2. Success with kernel 5.0.15:
http://lehcar.no-ip.org:8080/~rich/UL_PASS_5-0-15_pic.png
  - config: http://lehcar.no-ip.org:8080/~rich/config-huge-5.0.15
Comment 6 Rich 2019-05-12 19:39:46 UTC
Able to reproduce with iperf3 over LAN

Host :  PC with wireless card: 
     iperf3 -s

Client : ethernet connected client
     iperf3 -c 192.168.1.240 -t 60
           ##PASSED
     iperf3 -c 192.168.1.240 -R -t 60
           ##FAILED

LogFiles:
SERVER:
    http://lehcar.no-ip.org:8080/~rich/iperf3-server.txt
CLIENT:
    http://lehcar.no-ip.org:8080/~rich/iperf3-client.txt
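
Since only the reverse (-R) run fails, the moment of the stall can be pinned down from the per-interval lines in the iperf3 logs. A minimal sketch, assuming iperf3's default human-readable output; the helper name `first_stall` is hypothetical and not part of iperf3:

```shell
#!/bin/sh
# first_stall: read iperf3 text output on stdin and print the first
# interval (e.g. "9.00-10.00") whose reported bitrate is 0.00 bits/sec.
# Assumed line format (default iperf3 output):
#   [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
first_stall() {
    awk '/ sec / && / 0\.00 [KMG]?bits\/sec/ {
             # The interval is the field right before the "sec" column.
             for (i = 1; i <= NF; i++)
                 if ($i == "sec") { print $(i - 1); exit }
         }'
}
```

For example, `first_stall < iperf3-server.txt` would print the interval at which the sender first reported zero throughput, or nothing if the run never stalled.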
Comment 7 Emmanuel Grumbach 2019-05-12 19:43:26 UTC
Please try to add:

options iwlmvm power_scheme=1

to /etc/modprobe.d/iwlwifi.conf
and reboot.
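
After the reboot it is worth confirming that the option actually reached the module. A minimal sketch, assuming iwlmvm exposes the parameter read-only under /sys/module/iwlmvm/parameters/ (the helper name `check_power_scheme` is hypothetical). Per the module's parameter description, 1 selects the active scheme, 2 balanced (the default) and 3 low power:

```shell
#!/bin/sh
# check_power_scheme WANT [PARAM_FILE]
# Compare the power_scheme value reported by the loaded iwlmvm module with WANT.
# PARAM_FILE defaults to the sysfs path assumed above; override it for testing.
check_power_scheme() {
    want=$1
    param=${2:-/sys/module/iwlmvm/parameters/power_scheme}
    if [ ! -r "$param" ]; then
        echo "cannot read $param (is iwlmvm loaded?)" >&2
        return 2
    fi
    have=$(cat "$param")
    if [ "$have" = "$want" ]; then
        echo "power_scheme=$have"
    else
        echo "power_scheme=$have, expected $want"
        return 1
    fi
}
```

Calling `check_power_scheme 1` after boot should report `power_scheme=1` if the modprobe.d entry was picked up.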
Comment 8 Rich 2019-05-12 20:21:24 UTC
Added /etc/modprobe.d/iwlwifi.conf [options iwlmvm power_scheme=1]

made it to 37 sec on the Uplink side before it stopped (improvement but still stopping)

LogFiles:
SERVER:
    http://lehcar.no-ip.org:8080/~rich/iperf3_w_power_scheme.txt
Comment 9 Emmanuel Grumbach 2019-05-12 20:25:50 UTC
Please just run pings to your router and see if they stop. It is much easier to debug small traffic than storms of data.
Comment 10 Rich 2019-05-12 22:43:50 UTC
ping seems just fine,  I was able to ping the router for 30 min.  
   
   - The interface seems problem free until I try to stress the UL. 
   - I tried to stress the UL by serving a large (3G) file to a remote client using scp, but I think disk I/O was the bottleneck and I could not duplicate the issue.  So it seems that only running near the maximum that the interface can handle causes the issue.
     
          - I really don't want to muddy the waters, but after stressing the interface until it stops responding and then resetting it, within 3 or 4 minutes I end up rebooting due to display instability.  (The screen starts flashing at a low frequency: 1 second of blackout, then 2 seconds of normal display.)
Note:  Even when I stop X with (CTRL-ALT-BACKSPACE) the display issue continues.  I'm guessing this means it interferes with the framebuffer?  Not sure here.  I'll try to capture it.  This was initially why I suspected a buffer overflow.
Comment 11 Rich 2019-05-12 23:20:04 UTC
NOTE:  Could not duplicate the display issue, even though I caused wlan0 to lock up several times.
Not sure if this is random.
Comment 12 Emmanuel Grumbach 2019-05-14 07:41:30 UTC
Can you please record tracing of this?

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing
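
A minimal sketch of the capture step described on that wiki page, assuming trace-cmd is installed and the kernel was built with CONFIG_IWLWIFI_DEVICE_TRACING=y. The wrapper name `record_iwlwifi_trace` and the `DRY_RUN` switch are hypothetical conveniences, not part of trace-cmd:

```shell
#!/bin/sh
# record_iwlwifi_trace [OUTPUT_FILE]
# Record the iwlwifi, mac80211 and cfg80211 trace events into OUTPUT_FILE
# (default: trace.dat). With DRY_RUN set to a non-empty value the command
# is only printed, not executed.
record_iwlwifi_trace() {
    out=${1:-trace.dat}
    ${DRY_RUN:+echo} trace-cmd record -o "$out" \
        -e iwlwifi -e mac80211 -e cfg80211 -e iwlwifi_msg
}
```

Run it as root, reproduce the stall while it is recording, then stop it with Ctrl-C; the resulting trace.dat is the file to share.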
Comment 13 Rich 2019-05-14 23:17:58 UTC
Recompiled kernel to include "CONFIG_IWLWIFI_DEVICE_TRACING=y"

here is an attached log trace.dat:
(this is the UL IPERF where the interface stops)
http://lehcar.no-ip.org:8080/~rich/trace.dat
Comment 14 Rich 2019-05-17 04:43:54 UTC
Interesting information:
  I compiled 5.1.3 and saw that the wlan0 interface failure is still present.
  In the interest of digging deeper I compiled again and added this option:
       CONFIG_IWLWIFI_DEBUG=y

  Results as follows:  [all in Mbits/sec]
        kernel 5.0.17 (working test)    DL: ~317  UL: ~277
        kernel 5.1.3  (no-debug option) DL: ~316  UL: Fail (after 9 seconds)
        kernel 5.1.3  (debug)           DL: ~316  UL: 90

--The interface is not failing as before with the debug option set in the kernel.
However, the rate is slower when compared to kernel 5.0.17.
  RESULTS 5.0.17:
     http://lehcar.no-ip.org:8080/~rich/iperf3-5.0.17.txt

  RESULTS 5.1.3 (CONFIG_IWLWIFI_DEBUG is not set)
     http://lehcar.no-ip.org:8080/~rich/iperf3-5.1.3_no-debug.txt

  RESULTS 5.1.3 (CONFIG_IWLWIFI_DEBUG=y)
     http://lehcar.no-ip.org:8080/~rich/iperf3-5.1.3_debug.txt

Tests using iperf3.  Host 1- gigE connected to router.
                     Host 2- Laptop using wireless interface.
Comment 15 Jerry Xiao 2019-05-20 14:59:06 UTC
I can confirm the bug also affects 8265ac, with kernel version 5.1.2 and 5.1.3.  

My card:   
```
04:00.0 Network controller [0280]: Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)
	Subsystem: Intel Corporation Dual Band Wireless-AC 8265 [8086:1010]
	Flags: bus master, fast devsel, latency 0, IRQ 134
	Memory at 94000000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-28-f8-xx-xx-xx-xx-xx
	Capabilities: [14c] Latency Tolerance Reporting
	Capabilities: [154] L1 PM Substates
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi
```



The connection dies after stressing the card for 2-3 secs. Even if the card is under normal load, it dies approximately every 15-30 minutes.
Comment 16 Emmanuel Grumbach 2019-05-20 16:30:51 UTC
Dmesg please.
Comment 17 Jerry Xiao 2019-05-21 03:15:24 UTC
The symptoms are pretty much the same.  
No useful output in kmesg.  
```
[    2.632712] iwlwifi 0000:04:00.0: loaded firmware version 36.9f0a2d68.0 op_mode iwlmvm
[    2.870418] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band Wireless AC 8265, REV=0x230
[    2.931677] iwlwifi 0000:04:00.0: base HW address: 00:xx:xx:xx:xx:xx
[    3.087943] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0

```
I'm using the Arch Linux kernel, with the following kernel config:


```
$ zcat /proc/config.gz |grep -i iwlwifi
CONFIG_IWLWIFI=m
CONFIG_IWLWIFI_LEDS=y
CONFIG_IWLWIFI_OPMODE_MODULAR=y
# CONFIG_IWLWIFI_BCAST_FILTERING is not set
# CONFIG_IWLWIFI_PCIE_RTPM is not set
CONFIG_IWLWIFI_DEBUG=y
CONFIG_IWLWIFI_DEBUGFS=y
CONFIG_IWLWIFI_DEVICE_TRACING=y
```
Comment 18 Jerry Xiao 2019-05-21 03:56:01 UTC
Adding options iwlmvm power_scheme=1 does not show any improvements.  
trace.dat captured:  
https://fars.ee/e-t1.dat
Comment 19 Rich 2019-05-21 21:29:13 UTC
Interestingly, when I put Wireshark on the interface and stressed it, the interface stopped responding and was unusable, but I continued to see ARP broadcast messages.
Comment 20 Rich 2019-05-21 23:53:31 UTC
http://lehcar.no-ip.org:8080/~rich/kernel-5.1.3_fail.pcapng.xz

192.168.1.240 - pc with wireless wlan0
192.168.1.245 - pc banana pi with gige wired
192.168.1.1 - router
Comment 21 Rich 2019-05-22 00:14:04 UTC
updated to:
http://lehcar.no-ip.org:8080/~rich/kernel-5.1.3_fail.pcapng.tar.gz

(apache not setup for .xz)
Comment 22 Jerry Xiao 2019-06-02 15:07:23 UTC
Some new iwlwifi errors were captured (not seen in previous versions). Linux version 5.1.5, Linux firmware 20190514.711d329.

https://fars.ee/Euvp.html  

I simply stressed the interface with data upload; the interface dies and sometimes prints "Microcode SW error detected" in kmesg. I then restart NetworkManager, or reload the iwlmvm kmod if restarting NetworkManager doesn't resolve the problem. I managed to reproduce the error several times in the log.
Comment 23 Emmanuel Grumbach 2019-06-02 15:22:46 UTC
This is not related.

Look at https://bugzilla.kernel.org/show_bug.cgi?id=203315
Comment 24 Ronan Pigott 2019-06-22 00:54:24 UTC
My setup is basically identical to Jerry's and I am experiencing silent interface failure as well under high traffic on 5.1, including 5.1.12, especially if there are many simultaneous connections. Very easy to recreate if I try seeding the latest arch iso. Restarting iwd fixes it temporarily for me.

Has the cause of this issue been identified? I've never tried before, but I may be able to bisect if it would help. Which source tree would be suitable for a bisect?
Comment 25 Ronan Pigott 2019-06-29 00:48:39 UTC
So I noticed recently that the default qdisc for my wireless interface is noqueue rather than the fq_codel defined by my kernel config and sysctls. I first thought it was unrelated and asked around in the arch community if anyone else had the issue, but I didn't find anyone with the issue or the same hardware.

This document, http://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt , seems to suggest that noqueue is not a valid qdisc _at all_ for physical interfaces. I can confirm that new wired interfaces are correctly assigned fq_codel as set by sysctl net.core.default_qdisc, and that I am disallowed from removing it (from the wired interface) with "Error: Cannot delete qdisc with handle of zero." So I believe this is a bug. Is it related?
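
The qdisc actually attached to each interface can be compared against the configured default. A minimal sketch, assuming the usual `tc qdisc show` line layout (`qdisc KIND HANDLE: ... root ...`); the helper name `root_qdisc` is hypothetical:

```shell
#!/bin/sh
# root_qdisc: read `tc qdisc show dev IFACE` output on stdin and print
# the kind of the root qdisc (e.g. noqueue, fq_codel, mq).
root_qdisc() {
    awk '$1 == "qdisc" {
             for (i = 3; i <= NF; i++)
                 if ($i == "root") { print $2; exit }
         }'
}

# Typical use on a live system (interface names are examples):
#   tc qdisc show dev wlan0 | root_qdisc
#   sysctl -n net.core.default_qdisc
```

Comparing the two outputs for the wireless and the wired interface shows whether the default from the sysctl was applied to each of them.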
Comment 26 Emmanuel Grumbach 2019-07-05 03:18:29 UTC
Please test this:
 https://patchwork.kernel.org/patch/11029027/
Comment 27 Ronan Pigott 2019-07-06 01:39:42 UTC
I have applied that patch to the iwlmvm module on my system (5.1.15-arch1-1) and can no longer reproduce the issue.

It's a shame about AMSDU, but I don't need it. Thanks for the fix.
Comment 28 Rich 2019-07-06 16:11:28 UTC
I applied the patch to Slackware64-current (5.1.16) and also can no longer reproduce the issue.
Comment 29 Emmanuel Grumbach 2019-07-06 18:18:41 UTC
Thanks for the feedback
Comment 30 Enrico Bandiello 2019-07-13 09:04:29 UTC
Just wanted to say "Thank you!" for fixing this issue. I was having problems when stressing my 3165AC in upload. Now, with 5.2, everything works as intended.