Bug 203577
Summary: | iwlwifi: 8260: traffic dies - WIFI-25674 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Rich (rhintze) |
Component: | network-wireless | Assignee: | DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | enban, jerry, ronan |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.1.1 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Rich
2019-05-11 22:25:48 UTC
Looks like the issue persists in kernel 5.1.1
- After running the speed test the interface stops working and I need to shut down then turn up the interface to use it.
- Not sure if this is related:
  May 11 22:22:42 lemur kernel: [ 6.185195] r8169 0000:03:00.1: can't disable ASPM; OS doesn't have ASPM control

Here is what the speed test site says:

During upload the measured speed went to zero and stayed there
error:1
The test noticed that uploading stopped. Here are some possible causes:
1. Connection drop during upload.
2. Very large upload buffering. Check "Staged Uploads", in the https://www.dslreports.com/speedtest/preferences. This will use increasingly large single uploads to determine the speed. Or change the upload method to 'web socket'.
3. Your connection is very poor. So poor that packet loss is causing many halts. Please review the ping radar plot by location. Is the ping time reasonable, both "best" and "worst"?
4. An Anti-virus product or browser extension is stalling or buffering the upload. Disable the most likely product, restart the browser, try again. Sophos AV, among others, are known to buffer all uploads (see explanation 2).

root@lemur ~ $ lspci -nnkv | sed -n '/Network/,/^$/p'
02:00.0 Network controller [0280]: Intel Corporation Wireless 8260 [8086:24f3] (rev 3a)
    Subsystem: Intel Corporation Dual Band Wireless-AC 8260 [8086:1010]
    Flags: bus master, fast devsel, latency 0, IRQ 131
    Memory at df200000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [c8] Power Management version 3
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [40] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number e4-a7-a0-xx-xx-xx-xx-xx
    Capabilities: [14c] Latency Tolerance Reporting
    Capabilities: [154] L1 PM Substates
    Kernel driver in use: iwlwifi
    Kernel modules: iwlwifi
* Device Serial Number obfuscated

Please share the dmesg output. Thank you.
Not much to see in dmesg as far as I could tell.
Before interface failure: http://lehcar.no-ip.org:8080/~rich/dmesg-pre-fail.txt
After interface failure and bouncing the interface (using Slackware /etc/rc.d/rc.inet1 restart): http://lehcar.no-ip.org:8080/~rich/dmesg-post-fail.txt

Here are some screenshots:
1. Failure during testing using kernel 5.1.1: http://lehcar.no-ip.org:8080/~rich/UL_FAIL_5-1-1_pic.png
   - config: http://lehcar.no-ip.org:8080/~rich/config-huge-5.1.1
2. Success with kernel 5.0.15: http://lehcar.no-ip.org:8080/~rich/UL_PASS_5-0-15_pic.png
   - config: http://lehcar.no-ip.org:8080/~rich/config-huge-5.0.15

Able to reproduce with iperf3 over LAN.
Host: PC with wireless card:
  iperf3 -s
Client: ethernet-connected client:
  iperf3 -c 192.168.1.240 -t 60      ## PASSED
  iperf3 -c 192.168.1.240 -R -t 60   ## FAILED
Log files:
SERVER: http://lehcar.no-ip.org:8080/~rich/iperf3-server.txt
CLIENT: http://lehcar.no-ip.org:8080/~rich/iperf3-client.txt

Please try to add:
options iwlmvm power_scheme=1
to /etc/modprobe.d/iwlwifi.conf and reboot.

Added [options iwlmvm power_scheme=1] to /etc/modprobe.d/iwlwifi.conf. The test made it to 37 sec on the uplink side before it stopped (an improvement, but it still stops).
Log files:
SERVER: http://lehcar.no-ip.org:8080/~rich/iperf3_w_power_scheme.txt

Please just run pings to your router and see if they stop. It is much easier to debug small traffic than storms of data.

Ping seems just fine; I was able to ping the router for 30 min.
- The interface seems problem free until I try to stress the UL.
- I tried to stress the UL by serving a large (3G) file to a remote client using scp. I think disk I/O was the bottleneck, and I could not duplicate the issue. So it seems that just running near the maximum that the interface can handle causes the issue.
- I really don't want to muddy the waters, but after stressing the interface until it stops responding, then resetting the interface and letting some time pass (3 or 4 min), I end up rebooting due to display instability. (The screen starts flashing at a low frequency: 1 second of blackout, 2 seconds of normal display.)
Note: Even when I stop X with (CTRL-ALT-BACKSPACE) the display issue continues. I am guessing this means it interferes with the frame buffer? Not sure here. I'll try to capture it. This was initially why I suspected a buffer overflow.

NOTE: Could not duplicate the display issue, though I caused wlan0 to lock up several times. Not sure if this is random.

Can you please record tracing of this?
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing

Recompiled the kernel to include "CONFIG_IWLWIFI_TRACING=y". Here is the trace log, trace.dat (this is the UL iperf run where the interface stops):
http://lehcar.no-ip.org:8080/~rich/trace.dat

Interesting information: I compiled 5.1.3 and saw that the wlan0 interface failure is still present. In the interest of digging deeper, I compiled again and added this option: CONFIG_IWLWIFI_DEBUG=y
Results as follows [all in Mbits/sec]:
kernel 5.0.17 (working test)      DL: ~317   UL: ~277
kernel 5.1.3 (no-debug option)    DL: ~316   UL: Fail (after 9 seconds)
kernel 5.1.3 (debug)              DL: ~316   UL: 90
-- The interface is not failing as before with the debug option set in the kernel. However, the rate is slower when compared to kernel 5.0.17.
RESULTS 5.0.17: http://lehcar.no-ip.org:8080/~rich/iperf3-5.0.17.txt
RESULTS 5.1.3 (CONFIG_IWLWIFI_DEBUG is not set): http://lehcar.no-ip.org:8080/~rich/iperf3-5.1.3_no-debug.txt
RESULTS 5.1.3 (CONFIG_IWLWIFI_DEBUG=y): http://lehcar.no-ip.org:8080/~rich/iperf3-5.1.3_debug.txt
Tests using iperf3.
Host 1 - gigE connected to router.
Host 2 - laptop using wireless interface.

I can confirm the bug also affects 8265ac, with kernel version 5.1.2 and 5.1.3.
My card:
```
04:00.0 Network controller [0280]: Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)
    Subsystem: Intel Corporation Dual Band Wireless-AC 8265 [8086:1010]
    Flags: bus master, fast devsel, latency 0, IRQ 134
    Memory at 94000000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [c8] Power Management version 3
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [40] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number 00-28-f8-xx-xx-xx-xx-xx
    Capabilities: [14c] Latency Tolerance Reporting
    Capabilities: [154] L1 PM Substates
    Kernel driver in use: iwlwifi
    Kernel modules: iwlwifi
```
The connection dies after stressing the card for 2-3 secs. Even if the card is under normal load, it dies approximately every 15-30 minutes.

Dmesg please.

The symptoms are pretty much the same. No useful output in kmesg.
```
[ 2.632712] iwlwifi 0000:04:00.0: loaded firmware version 36.9f0a2d68.0 op_mode iwlmvm
[ 2.870418] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band Wireless AC 8265, REV=0x230
[ 2.931677] iwlwifi 0000:04:00.0: base HW address: 00:xx:xx:xx:xx:xx
[ 3.087943] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
```
I'm using the Arch Linux kernel, with the following kernel config:
```
$ zcat /proc/config.gz |grep -i iwlwifi
CONFIG_IWLWIFI=m
CONFIG_IWLWIFI_LEDS=y
CONFIG_IWLWIFI_OPMODE_MODULAR=y
# CONFIG_IWLWIFI_BCAST_FILTERING is not set
# CONFIG_IWLWIFI_PCIE_RTPM is not set
CONFIG_IWLWIFI_DEBUG=y
CONFIG_IWLWIFI_DEBUGFS=y
CONFIG_IWLWIFI_DEVICE_TRACING=y
```
Adding options iwlmvm power_scheme=1 does not show any improvement.
trace.dat captured: https://fars.ee/e-t1.dat

Interestingly, when I put Wireshark on the interface and stressed it, the interface stopped responding and was unusable, but I did continue to see ARP broadcast messages.
http://lehcar.no-ip.org:8080/~rich/kernel-5.1.3_fail.pcapng.xz
192.168.1.240 - PC with wireless wlan0
192.168.1.245 - Banana Pi PC, wired gigE
192.168.1.1 - router

Updated to: http://lehcar.no-ip.org:8080/~rich/kernel-5.1.3_fail.pcapng.tar.gz (Apache is not set up for .xz)

Some new iwlwifi errors were captured (not seen in previous versions). Linux version 5.1.5, Linux firmware 20190514.711d329.
https://fars.ee/Euvp.html
I simply stressed the interface with data upload; the interface dies and sometimes prints "Microcode SW error detected" in kmesg. Then I restart NetworkManager, or reload the iwlmvm kmod if restarting NetworkManager doesn't resolve the problem. I managed to reproduce the error several times in the log.

This is not related. Look at https://bugzilla.kernel.org/show_bug.cgi?id=203315

My setup is basically identical to Jerry's and I am experiencing silent interface failure as well under high traffic on 5.1, including 5.1.12, especially if there are many simultaneous connections. Very easy to recreate if I try seeding the latest Arch ISO. Restarting iwd fixes it temporarily for me.

Has the cause of this issue been identified? I've never tried before, but I may be able to bisect if it would help. Which source tree would be suitable for a bisect?

So I noticed recently that the default qdisc for my wireless interface is noqueue rather than the fq_codel defined by my kernel config and sysctls. I first thought it was unrelated and asked around in the Arch community whether anyone else had the issue, but I didn't find anyone with the issue or the same hardware. This, http://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt, seems to suggest that noqueue is not a valid qdisc _at all_ for physical interfaces, and I can confirm that new wired interfaces are correctly assigned fq_codel as set by sysctl net.core.default_qdisc, and I am disallowed from removing it (from the wired interface) with "Error: Cannot delete qdisc with handle of zero." So I believe this is a bug. Is it related?

Please test this: https://patchwork.kernel.org/patch/11029027/

I have applied that patch to the iwlmvm module on my system (5.1.15-arch1-1) and can no longer reproduce the issue. It's a shame about AMSDU, but I don't need it. Thanks for the fix.

I applied the patch to Slackware64-current (5.1.16) and also can no longer reproduce the issue.

Thanks for the feedback.

Just wanted to say "Thank you!" for fixing this issue. I was having problems when stressing my 3165AC in upload. Now, with 5.2, everything works as intended.
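
For anyone re-checking a kernel against this bug, the symptom reported in this thread (upload throughput dropping to zero and staying there during the `iperf3 -R` run) can be detected automatically rather than by eye. The following is only a minimal sketch, assuming the test is run with iperf3's JSON output (`-J`) and that each interval carries `sum.start` and `sum.bits_per_second`. The names `first_stall`, `synthetic_report`, and `min_zero_intervals` are hypothetical helpers, not part of iperf3, and the sample numbers are synthetic (loosely modeled on the "Fail (after 9 seconds)" result above), not taken from the attached logs.

```python
import json


def first_stall(report, min_zero_intervals=3):
    """Return the start time (in seconds) of a run of zero-throughput
    intervals that lasts until the end of the test, or None.

    `report` is a dict shaped like iperf3's JSON output (assumed layout:
    report["intervals"][i]["sum"] with "start" and "bits_per_second").
    """
    sums = [interval["sum"] for interval in report["intervals"]]
    # Walk backwards over the trailing intervals that carried no traffic.
    i = len(sums)
    while i > 0 and sums[i - 1]["bits_per_second"] == 0:
        i -= 1
    trailing_zeros = len(sums) - i
    if trailing_zeros >= min_zero_intervals:
        return sums[i]["start"]
    return None


def synthetic_report(rates_mbit):
    """Build a fake iperf3-like report from per-second rates in Mbit/s.
    Hypothetical helper, used only to exercise first_stall()."""
    return {
        "intervals": [
            {"sum": {"start": float(n), "end": float(n + 1),
                     "bits_per_second": rate * 1e6}}
            for n, rate in enumerate(rates_mbit)
        ]
    }


def load_report(path):
    """Load a report saved with, e.g., `iperf3 -c HOST -R -t 60 -J > run.json`."""
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    stalled = synthetic_report([277] * 9 + [0] * 6)   # synthetic: dies after 9 s
    healthy = synthetic_report([277] * 15)            # synthetic: never stalls
    print(first_stall(stalled))  # 9.0
    print(first_stall(healthy))  # None
```

Only a trailing run of zero intervals counts as a stall here, so a brief dip that recovers is ignored; whether iperf3 keeps emitting zero-rate intervals after the interface dies on a given setup is something to confirm against a real failing run.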