Bug 194717 - iwlwifi: 8260: traffic loss upon initiator delba - WIFILNX-676
Summary: iwlwifi: 8260: traffic loss upon initiator delba - WIFILNX-676
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-27 00:03 UTC by Matt Berry
Modified: 2017-12-04 19:44 UTC (History)
4 users (show)

See Also:
Kernel Version: 4.11.6-3-ARCH
Subsystem:
Regression: No
Bisected commit-id:


Attachments
wpa.log with debugging on. Exhibits a 15+ period of no access after which one webpage loaded and all subsequent access was lost (4.42 KB, text/plain)
2017-02-27 00:03 UTC, Matt Berry
Details
dmesg from a session that included several periods of no network access (56.02 KB, text/plain)
2017-02-27 00:04 UTC, Matt Berry
Details
Core24 with uSniffer (2.12 MB, application/octet-stream)
2017-06-29 20:12 UTC, Emmanuel Grumbach
Details
Core28 with uSniffer (2.20 MB, application/octet-stream)
2017-06-30 06:41 UTC, Emmanuel Grumbach
Details
fix bar (1.69 KB, patch)
2017-07-24 19:55 UTC, Emmanuel Grumbach
Details | Diff
fix bar (1.70 KB, patch)
2017-07-24 20:02 UTC, Emmanuel Grumbach
Details | Diff
fix bar (1.69 KB, patch)
2017-07-24 20:11 UTC, Emmanuel Grumbach
Details | Diff
do not send delBA (9.84 KB, application/mbox)
2017-08-16 10:19 UTC, Emmanuel Grumbach
Details

Description Matt Berry 2017-02-27 00:03:23 UTC
Created attachment 254927 [details]
wpa.log with debugging on. Exhibits a 15+ period of no access after which one webpage loaded and all subsequent access was lost

I've been struggling to get an Intel 8260 Wireless card to provide 
a reliable network connection on a new laptop. It will work fine 
for 5-10 minutes and then I will suddenly be unable to send/receive 
any network traffic. The card still believes it is associated with 
the AP, but no traffic is flowing. Sometimes the only way to get 
back online is to bring the card down with '/usr/lib/network/
network  stop ProfileName' and then bringing it back up again. The 
timing is  random and can be as low as a minute or as long as 15-20 
minutes.  This happens whether I am plugged in or on battery; using 
my  distribution's iwlwifi or using the Core24 backports driver; 
using kernel 4.9.6 or 4.9.11. 

Symptoms seem to have some correlation with the Access Point being 
connected to. The problem is most frequent when connecting to a 
Linksys EA-8500 5 GHz N network with WPA2-PSK. The problem is 
completely absent when connecting to a Cisco E3000 5 GHz N Network 
also with WPA2-PSK. Other wireless devices (including but not 
limited to Apple and Android phones, Windows laptops, and other 
Linux laptops) have not experienced this issue but two laptops with 
the 8260 rev. 3a do.

I've taken a couple of steps so far, including turning up the debug 
logging in wpa_supplicant, disabling power management, and 
disabling Bluetooth co-existance. None of those improved 
reliability. The wpa_supplicant logs did provide some additional 
information. Just before disconnects occur, the log is quiet and 
then one of two sets of messages appar. One set of messages includes 
a number of log lines appear that mention "wpa_bss_flush_by_age" 
[1].  The other appear to be a WPA key-rotation [2]. dmesg is 
normally quiet during disconnects. When I take the network down, 
when using the backports driver, I get a few lines in dmesg [3]. 
Symptom [1] sometimes appears without losing network access. Symptom 
[2] always causes a loss of network access. Sometimes, but not 
always, they appear in succession.

I should note that I run a 2.4 GHz and a 5 GHz network from
the AP, with similar but different names (WI of the FIger vs. 
WI of the 5ger) but the same passphrase. I'm connecting to the 5GHz 
band. The only reason I mention it is that this card purportedly 
supports  dual-band, but I have not looked into whether it is usable 
/ automatic on Linux yet.


It is also worth noting (but probably should be a separate
bug report) that iw_get_basic_config returns non-zero for the the
wlp4s0 interface after installing the backports driver. This breaks
tools such as iwconfig and Xmobar's Wireless plugin.

[1] Symptom "wpa_bss_flush_by_age"

(preceded by at least 20 seconds of silence)

1486886349.661134: wlp4s0: BSS: Remove id 1 BSSID 14:91:82:72:7c:1e
SSID 'WI of the FIger' due to wpa_bss_flush_by_age
1486886349.661174: wlp4s0: BSS: Remove id 2 BSSID 0c:51:01:e7:3c:94
SSID 'BenderNet' due to wpa_bss_flush_by_age
1486886349.661189: wlp4s0: BSS: Remove id 3 BSSID 18:a6:f7:82:1f:67
SSID 'Trash Fest 2015' due to wpa_bss_flush_by_age
1486886349.661201: wlp4s0: BSS: Remove id 4 BSSID 18:a6:f7:82:1f:66
SSID 'Trash Fest 2016' due to wpa_bss_flush_by_age
1486886349.661213: wlp4s0: BSS: Remove id 5 BSSID 0c:51:01:e7:3c:95
SSID 'BenderNet' due to wpa_bss_flush_by_age
1486886349.661225: wlp4s0: BSS: Remove id 6 BSSID 74:d0:2b:89:e7:b0
SSID 'Fantasia' due to wpa_bss_flush_by_age
1486886349.661237: wlp4s0: BSS: Remove id 7 BSSID f4:f2:6d:70:09:98
SSID 'TP-LINK_0998_5G' due to wpa_bss_flush_by_age
1486886349.661248: wlp4s0: BSS: Remove id 8 BSSID 14:22:db:08:51:c4
SSID '928d' due to wpa_bss_flush_by_age
1486886349.661260: wlp4s0: BSS: Remove id 9 BSSID 74:44:01:3e:32:c3
SSID 'NETGEAR70-5G' due to wpa_bss_flush_by_age
1486886349.661273: wlp4s0: BSS: Remove id 10 BSSID 58:8b:f3:c9:7d:95
SSID 'CenturyLink1775' due to wpa_bss_flush_by_age
1486886349.661285: wlp4s0: BSS: Remove id 11 BSSID 02:1d:d6:32:12:80
SSID '' due to wpa_bss_flush_by_age
1486886349.661309: wlp4s0: BSS: Remove id 12 BSSID 14:22:db:08:51:c6
SSID 'pimpmetro' due to wpa_bss_flush_by_age
1486886349.661328: wlp4s0: BSS: Remove id 13 BSSID ce:35:40:ca:4c:c8
SSID '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' due to
wpa_bss_flush_by_age
1486886349.661340: wlp4s0: BSS: Remove id 14 BSSID f4:f2:6d:2c:d8:59
SSID 'Milo_2.4ghz' due to wpa_bss_flush_by_age
1486886349.661351: wlp4s0: BSS: Remove id 15 BSSID 14:22:db:08:51:cc
SSID 'pimpmetro Guest' due to wpa_bss_flush_by_age
1486886349.661363: wlp4s0: BSS: Remove id 16 BSSID 74:44:01:3e:32:c4
SSID 'NETGEAR70' due to wpa_bss_flush_by_age
1486886349.661374: wlp4s0: BSS: Remove id 17 BSSID 0c:54:a5:74:d3:79
SSID '' due to wpa_bss_flush_by_age
1486886349.661386: wlp4s0: BSS: Remove id 18 BSSID 84:00:2d:75:a0:88
SSID 'KobeonRetirement' due to wpa_bss_flush_by_age
1486886349.661399: wlp4s0: BSS: Remove id 19 BSSID 04:bf:6d:30:c3:04
SSID 'CenturyLink3662' due to wpa_bss_flush_by_age
1486886349.661411: wlp4s0: BSS: Remove id 20 BSSID 00:25:9c:13:1a:90
SSID 'pimpcity-2' due to wpa_bss_flush_by_age
1486886349.661422: wlp4s0: BSS: Remove id 21 BSSID cc:03:fa:64:68:54
SSID 'HOME-6854' due to wpa_bss_flush_by_age
1486886349.661434: wlp4s0: BSS: Remove id 22 BSSID 2c:30:33:b2:30:03
SSID 'secret sauce' due to wpa_bss_flush_by_age
1486886349.661445: wlp4s0: BSS: Remove id 23 BSSID 24:a0:74:75:e9:bc
SSID 'Booster II' due to wpa_bss_flush_by_age
1486886349.661457: wlp4s0: BSS: Remove id 24 BSSID 00:40:10:10:00:03
SSID 'TheMan' due to wpa_bss_flush_by_age
1486886349.661468: wlp4s0: BSS: Remove id 25 BSSID fa:8f:ca:73:31:d0
SSID '' due to wpa_bss_flush_by_age
1486886349.661479: wlp4s0: BSS: Remove id 26 BSSID 14:22:db:08:51:c8
SSID '' due to wpa_bss_flush_by_age
1486886349.661490: wlp4s0: BSS: Remove id 27 BSSID 0c:54:a5:74:d3:7a
SSID 'xfinitywifi' due to wpa_bss_flush_by_age
1486886349.661502: wlp4s0: BSS: Remove id 28 BSSID 06:1d:d6:32:12:80
SSID 'xfinitywifi' due to wpa_bss_flush_by_age
1486886349.661514: wlp4s0: BSS: Remove id 29 BSSID 02:25:9c:13:1a:90
SSID 'pimpcity-2-guest' due to wpa_bss_flush_by_age
1486886349.661525: wlp4s0: BSS: Remove id 30 BSSID e6:3e:fc:8d:7c:f0
SSID 'xfinitywifi' due to wpa_bss_flush_by_age
1486886349.661536: wlp4s0: BSS: Remove id 31 BSSID ce:03:fa:64:68:56
SSID 'xfinitywifi' due to wpa_bss_flush_by_age
1486886349.661548: wlp4s0: BSS: Remove id 32 BSSID 00:00:00:00:00:00
SSID '' due to wpa_bss_flush_by_age




[2] Symptom "State: GROUP_HANDSHAKE -> COMPLETED" 

1488145955.124692: l2_packet_receive: src=14:91:82:72:7c:1f len=131
1488145955.124756: wlp4s0: RX EAPOL from 14:91:82:72:7c:1f
1488145955.124778: RX EAPOL - hexdump(len=131): 02 03 00 7f 02 13 82 
00 10 00 00 00 00 00 00 00 03 59 da 17 82 e0 98 02 37 d6 b3 41 83 81 
4b 03 82 85 2a 18 af 77 fa 15 a4 f9 74 b4 b7 f6 bd 6a b4 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 96 2e 81 37 53 b7 3f 59 be 80 62 65 8e f6 9c e1 00 20 
72 47 56 1f 30 a2 1a 5d 62 b9 bd cb 75 8d 08 a9 52 6c 3d 18 8f f3 6c 
af 0f 8e 69 6a 9e 12 be b8
1488145955.124865: wlp4s0: IEEE 802.1X RX: version=2 type=3 length=
127
1488145955.124874: WPA: RX EAPOL-Key - hexdump(len=131): 02 03 00 7f 
02 13 82 00 10 00 00 00 00 00 00 00 03 59 da 17 82 e0 98 02 37 d6 b3 
41 83 81 4b 03 82 85 2a 18 af 77 fa 15 a4 f9 74 b4 b7 f6 bd 6a b4 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 96 2e 81 37 53 b7 3f 59 be 80 62 65 8e f6 9c 
e1 00 20 72 47 56 1f 30 a2 1a 5d 62 b9 bd cb 75 8d 08 a9 52 6c 3d 18 
8f f3 6c af 0f 8e 69 6a 9e 12 be b8
1488145955.124956: wlp4s0:   EAPOL-Key type=2
1488145955.124979: wlp4s0:   key_info 0x1382 (ver=2 keyidx=0 rsvd=0 
Group Ack MIC Secure Encr)
1488145955.124992: wlp4s0:   key_length=16 key_data_length=32
1488145955.125000:   replay_counter - hexdump(len=8): 00 00 00 00 00 
00 00 03
1488145955.125011:   key_nonce - hexdump(len=32): 59 da 17 82 e0 98 
02 37 d6 b3 41 83 81 4b 03 82 85 2a 18 af 77 fa 15 a4 f9 74 b4 b7 f6 
bd 6a b4
1488145955.125033:   key_iv - hexdump(len=16): 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00
1488145955.125048:   key_rsc - hexdump(len=8): 00 00 00 00 00 00 00 
00
1488145955.125058:   key_id (reserved) - hexdump(len=8): 00 00 00 00 
00 00 00 00
1488145955.125068:   key_mic - hexdump(len=16): 96 2e 81 37 53 b7 3f 
59 be 80 62 65 8e f6 9c e1
1488145955.125121: RSN: encrypted key data - hexdump(len=32): 72 47 
56 1f 30 a2 1a 5d 62 b9 bd cb 75 8d 08 a9 52 6c 3d 18 8f f3 6c af 0f 
8e 69 6a 9e 12 be b8
1488145955.125198: WPA: decrypted EAPOL-Key key data - hexdump(len
=24): [REMOVED]
1488145955.125218: wlp4s0: WPA: RX message 1 of Group Key Handshake 
from 14:91:82:72:7c:1f (ver=2)
1488145955.125227: RSN: msg 1/2 key data - hexdump(len=24): dd 16 00 
0f ac 01 01 00 6e 3a 08 4e d2 9d c3 2d 39 85 58 2b 3e 95 8a 51
1488145955.125247: WPA: GTK in EAPOL-Key - hexdump(len=24): [REMOVED]
1488145955.125255: RSN: received GTK in group key handshake - 
hexdump(len=18): [REMOVED]
1488145955.125266: wlp4s0: State: COMPLETED -> GROUP_HANDSHAKE
1488145955.125279: WPA: Group Key - hexdump(len=16): [REMOVED]
1488145955.125290: wlp4s0: WPA: Installing GTK to the driver 
(keyidx=1 tx=0 len=16)
1488145955.125299: WPA: RSC - hexdump(len=6): 00 00 00 00 00 00
1488145955.125387: wpa_driver_nl80211_set_key: ifindex=2 (wlp4s0) 
alg=3 addr=0x5369b9 key_idx=1 set_tx=0 seq_len=6 key_len=16
1488145955.125421: nl80211: KEY_DATA - hexdump(len=16): [REMOVED]
1488145955.125429: nl80211: KEY_SEQ - hexdump(len=6): 00 00 00 00 00 
00
1488145955.125438:    broadcast key
1488145955.126098: wlp4s0: WPA: Sending EAPOL-Key 2/2
1488145955.126159: WPA: KCK - hexdump(len=16): [REMOVED]
1488145955.126170: WPA: Derived Key MIC - hexdump(len=16): 66 f0 e5 
e6 fd 78 51 14 d6 c8 b1 67 9c 80 a1 62
1488145955.126185: WPA: TX EAPOL-Key - hexdump(len=99): 01 03 00 5f 
02 03 02 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 66 f0 e5 e6 fd 78 51 14 d6 c8 b1 67 9c 80 a1 
62 00 00
1488145955.126421: wlp4s0: WPA: Group rekeying completed with 
14:91:82:72:7c:1f [GTK=CCMP]
1488145955.126437: CTRL-DEBUG: ctrl_sock-sendmsg: sock=13 
sndbuf=212992 
outq=0 send_len=63
1488145955.126490: CTRL_IFACE monitor sent successfully to 
/tmp/wpa_ctrl_381-1\x00
1488145955.126502: wlp4s0: Cancelling authentication timeout
1488145955.126516: wlp4s0: State: GROUP_HANDSHAKE -> COMPLETED
1488145955.126530: nl80211: Set rekey offload


[3] Symptom "failed to remove key from hardware" when stopping wifi

(the number of failed to remove key from hardware messages is 
 variable and ranges from 1 - 2)

[ 4217.679104] wlp4s0: Limiting TX power to 27 (30 - 3) dBm as 
advertised by 14:91:82:72:7c:1f
[ 5257.962551] wlp4s0: deauthenticating from 14:91:82:72:7c:1f by local
 choice (Reason: 3=DEAUTH_LEAVING)
[ 5257.983312] iwlwifi 0000:04:00.0: Failed to find station
[ 5257.983322] wlp4s0: failed to remove key (1, ff:ff:ff:ff:ff:ff)
 from hardware (-22)
[ 5257.983409] iwlwifi 0000:04:00.0: Failed to find station
[ 5257.983414] wlp4s0: failed to remove key (2, ff:ff:ff:ff:ff:ff) from 
hardware (-22)




System Information:

- Kernel

Linux slate 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 2017 
x86_64 GNU/Linux
Linux slate 4.9.11-1-ARCH #1 SMP PREEMPT Sun Feb 19 13:45:52 UTC 2017 
x86_64 GNU/Linux

- wpa_supplicant invocation (via netctl):

wpa_supplicant -dd -f/tmp/wpa.log -t -B -P "$pidfile" -i "$interface" 
  -D "$driver" configuration="$STATE_DIR/wpa_supplicant_$interface.conf

- Hardware

(on 4.9.6 kernel)
  *-network
       description: Wireless interface
       product: Wireless 8260
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       logical name: wlp4s0
       version: 3a
       serial: 34:f3:9a:60:99:24
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list ethernet 
physical wireless
       configuration: broadcast=yes driver=iwlwifi 
driverversion=4.9.6-1-ARCH firmware=22.391740.0 ip=10.0.0.188 
latency=0 link=yes multicast=yes wireless=IEEE 802.11
       resources: irq:127 memory:f1000000-f1001fff

(on 4.9.11 kernel w/o backports)
  *-network                 
       description: Wireless interface
       product: Wireless 8260
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       logical name: wlp4s0
       version: 3a
       serial: 34:f3:9a:60:99:24
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list ethernet 
physical wireless
       configuration: broadcast=yes driver=iwlwifi 
driverversion=4.9.11-1-ARCH firmware=22.391740.0 ip=10.0.0.188 
latency=0 link=yes multicast=yes wireless=IEEE 802.11
       resources: irq:127 memory:f1000000-f1001fff

(on 4.9.11 kernel w/ backports)
  *-network                 
       description: Ethernet interface
       product: Wireless 8260
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       logical name: wlp4s0
       version: 3a
       serial: 34:f3:9a:60:99:24
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list 
ethernet physical
       configuration: broadcast=yes driver=iwlwifi 
driverversion=4.9.11-1-ARCH firmware=27.455470.0 ip=10.0.0.188 
latency=0 link=yes multicast=yes
       resources: irq:127 memory:f1000000-f1001fff

- /etc/modprobe.d/iwlwifi.conf

options iwlwifi bt_coex_active=0

- /sys/module/iwlwifi/parameters/bt_coex_active 
N

- /sys/module/iwlwifi/parameters/power_save 
N

- /sys/module/iwlwifi/parameters/power_level 
0

- /sys/module/iwlwifi/parameters/swcrypto 
0

- iwconfig (pre-backports)

wlp4s0    IEEE 802.11  ESSID:"WI of the 5ger"
          Mode:Managed  Frequency:5.18 GHz  Access Point: 14:91:82:72:7C:1F
          Bit Rate=400 Mb/s   Tx-Power=22 dBm
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Power Management:off
          Link Quality=70/70  Signal level=-25 dBm
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:27   Missed beacon:0

- iwconfig (post-backports)

wlp4s0    no wireless extensions.

- iw dev wlp4s0 link (post-backports)

Connected to 14:91:82:72:7c:1f (on wlp4s0)
	SSID: WI of the 5ger
	freq: 5180
	RX: 1864340 bytes (2820 packets)
	TX: 200166 bytes (1419 packets)
	signal: -28 dBm
	tx bitrate: 400.0 MBit/s VHT-MCS 9 40MHz short GI VHT-NSS 2

	bss flags:	short-slot-time
	dtim period:	1
	beacon int:	100

- Firmware (post-backports)

/usr/lib/firmware/iwlwifi-8000C-13.ucode
/usr/lib/firmware/iwlwifi-8000C-14.ucode
/usr/lib/firmware/iwlwifi-8000C-15.ucode
/usr/lib/firmware/iwlwifi-8000C-16.ucode
/usr/lib/firmware/iwlwifi-8000C-17.ucode
/usr/lib/firmware/iwlwifi-8000C-18.ucode
/usr/lib/firmware/iwlwifi-8000C-19.ucode
/usr/lib/firmware/iwlwifi-8000C-20.ucode
/usr/lib/firmware/iwlwifi-8000C-21.ucode
/usr/lib/firmware/iwlwifi-8000C-22.ucode
/usr/lib/firmware/iwlwifi-8000C-27.ucode

- dmesg boot up (pre-backports)

[    8.413762] i801_smbus 0000:00:1f.4: SPD Write Disable is set
[    8.413799] i801_smbus 0000:00:1f.4: SMBus using PCI interrupt
[    8.416027] e1000e 0000:00:1f.6: Interrupt Throttling Rate
(ints/sec) set to dynamic conservative mode
[    8.432892] iTCO_vendor_support: vendor-support=0
[    8.434157] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    8.434290] iTCO_wdt: Found a Intel PCH TCO device (Version=4,
TCOBASE=0x0400)
[    8.434437] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[    8.439859] Intel(R) Wireless WiFi driver for Linux
[    8.439861] Copyright(c) 2003- 2015 Intel Corporation
[    8.442327] iwlwifi 0000:04:00.0: Direct firmware load for
iwlwifi-8000C-26.ucode failed with error -2
[    8.442337] iwlwifi 0000:04:00.0: Direct firmware load for
iwlwifi-8000C-25.ucode failed with error -2
[    8.442347] iwlwifi 0000:04:00.0: Direct firmware load for
iwlwifi-8000C-24.ucode failed with error -2
[    8.442356] iwlwifi 0000:04:00.0: Direct firmware load for
iwlwifi-8000C-23.ucode failed with error -2
[    8.448422] iwlwifi 0000:04:00.0: loaded firmware version
22.391740.0 op_mode iwlmvm
[    8.470688] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band
Wireless AC 8260, REV=0x208
[    8.473081] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    8.474178] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    8.609794] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[    8.611275] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0



- dmesg boot up (post-backports)

[    7.823904] i801_smbus 0000:00:1f.4: SPD Write Disable is set
[    7.823936] i801_smbus 0000:00:1f.4: SMBus using PCI interrupt
[    7.826091] e1000e 0000:00:1f.6: Interrupt Throttling Rate 
(ints/sec) set to dynamic conservative mode
[    7.827770] compat: loading out-of-tree module taints kernel.
[    7.827924] Loading modules backported from iwlwifi
[    7.827924] iwlwifi-stack-public:release/LinuxCore24:5784:f94e9dbf
[    7.839236] iTCO_vendor_support: vendor-support=0
[    7.839718] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    7.839890] iTCO_wdt: Found a Intel PCH TCO device (Version=4, 
TCOBASE=0x0400)
[    7.840004] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[    7.841075] Intel(R) Wireless WiFi driver for Linux
[    7.841075] Copyright(c) 2003- 2015 Intel Corporation
[    7.844367] iwlwifi 0000:04:00.0: Direct firmware load for 
iwl-dbg-cfg.ini failed with error -2
[    7.849457] iwlwifi 0000:04:00.0: capa flags index 3 larger than 
supported by driver
[    7.849842] iwlwifi 0000:04:00.0: loaded firmware version 
27.455470.0 op_mode iwlmvm
[    7.865153] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band 
Wireless AC 8260, REV=0x208
[    7.866736] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    7.867688] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    8.007505] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[    8.008292] thermal thermal_zone3: failed to read out thermal zone (-5)
[    8.008400] Unable to register with Frequency Manager: -22
[    8.010746] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
Comment 1 Matt Berry 2017-02-27 00:04:28 UTC
Created attachment 254929 [details]
dmesg from a session that included several periods of no network access
Comment 2 Emmanuel Grumbach 2017-05-18 12:47:00 UTC
Hi,

sorry for the long delay. One of our developers has been assigned to work on this bug.

Can you please record tracing of the problem?

Please refer to [1] to check how to do this.
This will allow us to see what is the interaction between the wpa_s and the kernel.


[1] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing
Comment 3 Matt Berry 2017-06-04 21:26:43 UTC
Apologies, I was traveling. I'm back in my proper timezone so I'll work on getting you a trace today.
Comment 4 Matt Berry 2017-06-05 00:27:22 UTC
I've capture two traces, totaling about 280 MB compressed. What's the procedure for submitting them? I also captured the WPA Supplicant log and dmesg.
Comment 5 Emmanuel Grumbach 2017-06-05 04:06:59 UTC
Dropbox, Google Drive, everything goes :)
Comment 7 Matt Berry 2017-06-05 05:16:44 UTC
Trace #1 captures the beginning of a problematic time period. Just like in the original problem description the card reported being associated with the AP but there was no network connectivity. That trace also captures me applying the only known workaround, which is to "sudo netctl switch-to <NetworkProfile>".

Trace #2 was started just after a problematic time period (I ran out of room in /tmp and had to restart). Unlike the last trace, network connectivity restored itself this time after a few minutes.

Trace #3 starts with the card un-associated. I then try to associate to an AP and the process fails. The WPA supplicant log makes it look like it succeeded, indicating that association and authentication worked. But I never obtained network connectivity. The netctl command hangs, I see the network name appear in the status bar, and then it disappears and I end up un-associated again. The WPA log indicates it was told to terminate, but that was not a conscious action on my part.

The WPA log and dmesg links cover boot until traces 1 and 2. Trace #3 has it's own WPA log and dmesg files in the tarball. Those logs start after the end of the WPA log / dmesg from # 1 & 2.
Comment 9 Emmanuel Grumbach 2017-06-05 05:53:41 UTC
Thanks - I attached those to our internal tracker and the assignee will analyse them.
Comment 10 Matt Berry 2017-06-21 02:01:11 UTC
I wanted to let you know that as of the last OS update I performed, I'm now getting a new iwlwifi related error message: "capa flags index 3 larger than supported by driver".

slate% dmesg | grep iwlwifi 
[    7.331132] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
[    7.333791] iwlwifi 0000:04:00.0: Direct firmware load for iwlwifi-8000C-28.ucode failed with error -2
[    7.339846] iwlwifi 0000:04:00.0: capa flags index 3 larger than supported by driver
[    7.340242] iwlwifi 0000:04:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm
[    7.355382] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band Wireless AC 8260, REV=0x208
[    7.357592] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    7.358763] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[    7.496753] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[   12.094184] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[   12.094706] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[   12.223120] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[   12.223666] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
slate% uname -a
Linux slate 4.11.5-1-ARCH #1 SMP PREEMPT Wed Jun 14 16:19:27 CEST 2017 x86_64 GNU/Linux
slate% pacman -Qi linux-firmware 
Name            : linux-firmware
Version         : 20170422.ade8332-1
Comment 11 Luca Coelho 2017-06-21 06:36:46 UTC
Hi Matt,

Can you please report this as a separate bug? I'll take a look into it ASAP.
Comment 12 roee.zamir 2017-06-22 10:09:28 UTC
Hi Matt,

seems that trace.dat file in trace-cmd-3 corrupted and can't be parsed.
can you try to parse[1] it?

in addition air capture[2] will really help


[1] $ trace-cmd report -i trace.dat

[2] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#air_sniffing

Thanks,
Roee
Comment 13 Matt Berry 2017-06-27 05:16:31 UTC
Roee,

Parsing trace-cmd-3 works fine on my end. I've captured the output of `trace-cmd report -i trace-cmd-3/trace.dat` as `trace.dat.decoded.gpg` and you can download it from here: https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace.dat.decoded.gpg


For the air capture, would it be more useful if you had a capture from the perspective of the problematic device or another wireless card in proximity to it (or both for the same period of time)?
Comment 14 Luca Coelho 2017-06-27 05:34:19 UTC
Thanks, Matt!

I'll sync internally with Roee to help him decode the trace-cmd logs you captured.
Comment 15 Luca Coelho 2017-06-27 06:18:19 UTC
Matt,

It seems that trace-3 works fine, we actually cannot decode trace-1.  Can you double-check that one? And we actually need the binary .dat, because we have some extra (non-public) scripts to parse it (you can still encrypt is as usual).
Comment 16 Matt Berry 2017-06-28 02:34:55 UTC
Figures, I ran out of room in /tmp while recording trace-1. I'll get you a new trace.

For the air capture, do you want me to perform it on the problematic device itself, or would you like to see the AP and the problematic device from the perspective of another wireless client?
Comment 17 Matt Berry 2017-06-28 04:12:09 UTC
Here's trace-cmd-4. This trace starts out unassociated, I then associate, time passes, the problem begins, I let the problem persist for 5 minutes while I fruitlessly ssh/dig/ping, and then finally I bring the network down and back up and access is restored.

The timeline file records when I noticed the problem begin (not an exact time) and the time at which I cycled the network to restore access.

Also included is dmesg and wpa_supplicant logs for the affected timeframe.

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-4.tar.gz.gpg
Comment 18 Luca Coelho 2017-06-28 08:55:49 UTC
Thanks! We'll look into the new logs.

Regarding the air capture, it's better to do it from a third device, running in monitor mode, because the problematic may be missing something and that could be the cause of the problem.  But I think in this case we may not need it, so we'll come back to you if we really need it.
Comment 19 Emmanuel Grumbach 2017-06-29 20:04:23 UTC
Hi Matt,

We've looked at the new logs and our conclusions are that for some reason that we can't determine right now, we can't hear any *unicast* packet from the AP starting from a certain point in time. This point in time doesn't exhibit any particularity that could explain the lack of unicast Rx packet afterwards.
We can't know *why* we can't hear any unicast packet, because the log we have right now is at the driver level, so all we can say is that the driver doesn't get any new unicast packet from the AP. We know that the driver isn't even getting those (as opposed to getting them and dropping them), because the log you provided includes all the packets at firmware / driver interface which is way before the layer in the driver that drops packet (duplicates, decryption issues and all that kind stuff).

Now we need to understand why the firmware doesn't give us new unicast packets. It can be:
* the AP stops sending them because it gets stuck on RTS / CTS (may be our bug: us not responding to RTS).
* the AP stops sending them because of another reason
* the AP does send the packets but the firmware doesn't hear them (unlikely since we do hear the multicast packets, and we'd expect the AP to retry until it reaches a low rate at which we should really hear the packet).
* the AP does sent the packet and the firmware hears it, but the firmware drops it for an unknown reason.

An air sniffer capture will be useful here, but maybe there is something easier that can be done and can teach us what is happening here.

I'll attach a firmware with debug enabled, and I'd like you to follow the procedure described here:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#firmware_debugging

This will allow to collect a dump of the firmware. We will know what the firmware is doing. I'd like you to create the dump (echo 1 > /sys/.../fw-dbg-collect) when the pings stop working. Please create a few dumps by repeating the echo command a few times (and collecting the data either manually or by the udev rule) while waiting ~10 seconds between the dumps. 3 dumps should be more than enough.

This should be easier to do than having and air sniffer and it can teach us a lot of things.

Please take the time to read the privacy notice here:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#privacy_aspects

Thank you.
Comment 20 Emmanuel Grumbach 2017-06-29 20:05:55 UTC
Oh, I also see that you use -27.ucode.
We published -31.ucode, it may be worthwhile to try it (check our backport based dirver here:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release

You should Core29.
Comment 21 Emmanuel Grumbach 2017-06-29 20:12:16 UTC
Created attachment 257233 [details]
Core24 with uSniffer

This is the firmware with debug enabled.

Thank you.
Comment 22 Matt Berry 2017-06-30 03:58:30 UTC
Did I get the correct attachment? I don't seem to have the requisite sysfs entries after rebooting.

```
slate% sudo openssl dgst -sha256 /usr/lib/firmware/iwlwifi-8000C-27* 
SHA256(/usr/lib/firmware/iwlwifi-8000C-27.ucode)= 0947146a61ce9ecffc73fb110186c3fe7cc04097aa260627c6f22ab4dd23cb46
SHA256(/usr/lib/firmware/iwlwifi-8000C-27.ucode.debug)= 0947146a61ce9ecffc73fb110186c3fe7cc04097aa260627c6f22ab4dd23cb46
SHA256(/usr/lib/firmware/iwlwifi-8000C-27.ucode.original)= a4432158038960fd9dccace083f77c050393c3a8117cccad5535bf7e24d99ea7
```

```
slate% sudo ls /sys/kernel/debug/iwlwifi                          
ls: cannot access '/sys/kernel/debug/iwlwifi': No such file or directory
```

```
slate% dmesg | grep iwlwifi | grep loaded
[  142.844963] iwlwifi 0000:04:00.0: loaded firmware version 27.546686.0 op_mode iwlmvm
```

Here's what I do have

```
slate% sudo find /sys -name '*iwlwifi*'
/sys/kernel/debug/tracing/events/iwlwifi_io
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_ioread32
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_iowrite8
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_iowrite32
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_iowrite64
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_iowrite_prph32
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_iowrite_prph64
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_ioread_prph32
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_irq
/sys/kernel/debug/tracing/events/iwlwifi_io/iwlwifi_dev_ict_read
/sys/kernel/debug/tracing/events/iwlwifi_ucode
/sys/kernel/debug/tracing/events/iwlwifi_ucode/iwlwifi_dev_ucode_cont_event
/sys/kernel/debug/tracing/events/iwlwifi_ucode/iwlwifi_dev_ucode_wrap_event
/sys/kernel/debug/tracing/events/iwlwifi_msg
/sys/kernel/debug/tracing/events/iwlwifi_msg/iwlwifi_err
/sys/kernel/debug/tracing/events/iwlwifi_msg/iwlwifi_warn
/sys/kernel/debug/tracing/events/iwlwifi_msg/iwlwifi_info
/sys/kernel/debug/tracing/events/iwlwifi_msg/iwlwifi_crit
/sys/kernel/debug/tracing/events/iwlwifi_msg/iwlwifi_dbg
/sys/kernel/debug/tracing/events/iwlwifi_data
/sys/kernel/debug/tracing/events/iwlwifi_data/iwlwifi_dev_tx_data
/sys/kernel/debug/tracing/events/iwlwifi_data/iwlwifi_dev_tx_tso_chunk
/sys/kernel/debug/tracing/events/iwlwifi_data/iwlwifi_dev_rx_data
/sys/kernel/debug/tracing/events/iwlwifi
/sys/kernel/debug/tracing/events/iwlwifi/iwlwifi_dev_hcmd
/sys/kernel/debug/tracing/events/iwlwifi/iwlwifi_dev_rx
/sys/kernel/debug/tracing/events/iwlwifi/iwlwifi_dev_tx
/sys/kernel/debug/tracing/events/iwlwifi/iwlwifi_dev_ucode_error
/sys/kernel/debug/tracing/events/iwlwifi/iwlwifi_dev_ucode_event
/sys/bus/pci/drivers/iwlwifi
/sys/module/iwlwifi
/sys/module/iwlwifi/drivers/pci:iwlwifi
/sys/module/cfg80211/holders/iwlwifi
```
Comment 23 Emmanuel Grumbach 2017-06-30 04:38:42 UTC
That's because you don't have debugfs enabled.

Can you compile a driver with CONFIG_IWLWIFI_DEBUGFS enabled?
You can try out our backport version if you want.
That will allow you not to compile a full kernel...
Comment 24 Matt Berry 2017-06-30 04:46:55 UTC
Not sure if you're up early or late :P

I just recompiled the backports tree while working on https://bugzilla.kernel.org/show_bug.cgi?id=196195 . I'll report back in a bit with some dumps for you.

Would it be worth while to get dumps from the Core30 firmware now that I'm on the backports branch? If you attach it I will test it.
Comment 25 Emmanuel Grumbach 2017-06-30 05:16:04 UTC
I can't share Core30 firmware yet, but I can give you a core24 firmware. That will have to wait a few hours though.
Comment 26 Emmanuel Grumbach 2017-06-30 05:16:38 UTC
Sorry, I meant a core 28 firmware.
Comment 27 Matt Berry 2017-06-30 05:30:47 UTC
No rush, I'm on PDT time so I'm about to head to sleep.
Comment 28 Emmanuel Grumbach 2017-06-30 06:41:54 UTC
Created attachment 257245 [details]
Core28 with uSniffer

Hi,

here is Core28 with uSniffer for the firmware debug data collection.

Thanks.
Comment 29 Matt Berry 2017-06-30 07:26:06 UTC
Here is a link to trace-cmd-5, which covers a period where several FW dumps were taken. This is on the backports driver using the provided -27 microcode. dmesg, iw events, and wpa_supplicant logs are also included.

The most interesting bit about this round of data is that the problem recovered after some time, which has happened before but is not typical of this problem. Normally I need to take the network down to restore connectivity.

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-5.tar.gz.gpg

I will try to collect more data tomorrow with the -31 microcode, as it's well past midnight at this point.
Comment 30 Emmanuel Grumbach 2017-06-30 07:35:32 UTC
No rush. I won't be able to look at this until Sunday anyway. Please make sure to produce the dumps while the traffic stalls.

Thanks
Comment 31 Emmanuel Grumbach 2017-07-02 07:13:14 UTC
I am starting to look at this.
Can you please check in your router if you can disable A-MSDU?
It can be called aggregations or something like that.

Thanks.
Comment 32 Matt Berry 2017-07-02 07:31:44 UTC
Sadly I don't think so. This router's stock firmware has only the most minimal set of configuration options. Basically just channel width and mixed-mode/802.11ac-only-mode. I don't have any of the options my old DD-WRT or Tomato APs had.
Comment 33 Emmanuel Grumbach 2017-07-02 08:25:00 UTC
Nope. Not needed.


Can you please load iwlmvm with power_scheme=1 instead?
Since iwlmvm is automatically loaded, you'll need to add an option in iwlwifi.conf file.

Thanks.

I am checking with the firmware team.
Comment 34 Emmanuel Grumbach 2017-07-02 11:51:35 UTC
So I have spent a bunch of time on the traces and the firmware dumps.

I can't see anything obvious besides the fact that the client seems to have lots its IP address.

I can't be sure since I don't have all the data.
At this point, I have to ask for more sensitive data. Can you send a tcpdump capture of the wlan interface?

Please do the following *when you detect the traffic loss*:

sudo tcpdump -w traffic.pcap -i wlp4s0

(this window will be stuck)

Open a new window and in that window, run a few pings to your router

then, stop the tcpdump and send the traffic.pcap file.

Please don't let traffic.pcap run all the time since it will just add misleading info.


I have to emphasis the fact that tcpdump will contain all the data in plaintext. So that will basically give us access to the actual data you are sending. SSL connections remain encrypted at the app layer of course.

What I am looking for is the explanation of a weird pattern that I see:
I see that we send a packet, and the router replies with a BCAST DATA packet. I bet that this packet is an ARP request, but I can't know with the current data you provided.

Thanks!
Comment 35 Matt Berry 2017-07-10 23:16:38 UTC
Sorry for the long delay. Took a long vacation around the 4th ('Murica, and so on). I apologize for the large size of this capture, I had a hard time getting the problem to reproduce. I still have not identified a sure-fire way to trigger the issue. Because it took so long to get a capture, I have uploaded two tarballs. One tarball (trace-cmd-6) contains the trace-cmd along with the pcaps, wpa_supplicant, and dmesg. The other tarball (pcap-only-6) contains the pcaps, wpa_supplicant, and dmesg but doesn't contain the trace-cmd due to the size.

The timeline for this capture involves a good 30 minutes or so of nothing interesting (refreshing this bug report and netcat'ing /dev/urandom from another computer on the network). Then there were two events worth looking at. For each event I stopped all other (attempted) network traffic and just pinged the local gateway (10.0.0.1). Each event has it's own pcap file and there are approximate start/end timestamps in the tarballs.

422 MB: trace-cmd-6: https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-6.tar.gz.gpg

41 KB: pcap-only-6: https://s3.amazonaws.com/maberry-temporary/2017-06-04/pcap-only-6.tar.gz.gpg
Comment 36 Emmanuel Grumbach 2017-07-11 14:27:25 UTC
I'll analyse the data.
I have a strong feeling that the issue is related to 11n, can you try to check what happens if you load iwlwifi with 11n_disable=1 ?

Thanks.
Comment 37 Matt Berry 2017-07-12 04:30:54 UTC
It's hard to prove a negative, but I've set `11n_disable=1`. If I don't see the problem by tomorrow night I'd say that's a significant sample size.

I should note that the AP was configured for 802.11ac-only and 80 MHz bandwidth, so I had to change it to Mixed mode operation and 40 MHz bandwidth for this test.
Comment 38 Emmanuel Grumbach 2017-07-12 05:24:16 UTC
Ok.

The next step will be to disable ampdu only.

I hope I'll find the time to look at your logs today or tomorrow.
Comment 39 Matt Berry 2017-07-13 06:49:06 UTC
Since disabling 802.11n yesterday I have not experienced the issue. Of course the speed is noticeably slower, but it consistently works.
Comment 40 Emmanuel Grumbach 2017-07-13 06:51:39 UTC
awesome.
I haven't found the time to look at your logs yet. It will happen soon I hope.

Now we can start to disable less and less features to see what is going wrong.
Can you please run with 11n_disable=2 ?
This will disable Tx A-MPDU, but the Rx speed will get back to what it used to be.

thanks.
Comment 41 Matt Berry 2017-07-13 07:03:55 UTC
I've adjusted 11n_disable=2 and for the sake of changing just one thing at a time left the AP at Mixed Mode and 40 MHz bandwidth. I'll report back tomorrow if I don't see any recurrences or sooner if I do.
Comment 42 Matt Berry 2017-07-16 05:32:19 UTC
I've not noticed the problem re-occur while 11n_disable=2
Comment 43 Emmanuel Grumbach 2017-07-16 12:05:34 UTC
All right, can you please put your AP in the same mode as before?
I looked at your pcap files and indeed, we are stuck with ARP requests.
Digging in the trace now...
Comment 44 Emmanuel Grumbach 2017-07-16 18:08:29 UTC
I spent quality time on your data and I am starting to fear a nasty bug on the AP side.
Can you please test this patch?

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
index cb922c59eb59..23751a7e02a2 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
@@ -628,7 +628,7 @@ static int rs_tl_turn_on_agg_for_tid(struct iwl_mvm *mvm,
 
        IWL_DEBUG_HT(mvm, "Starting Tx agg: STA: %pM tid: %d\n",
                     sta->addr, tid);
-       ret = ieee80211_start_tx_ba_session(sta, tid, 5000);
+       ret = ieee80211_start_tx_ba_session(sta, tid, 0);
        if (ret == -EAGAIN) {
                /*
                 * driver and mac80211 is out of sync


Thanks.
Comment 45 Emmanuel Grumbach 2017-07-16 18:25:26 UTC
Of course, when you test the patch above, I'll need you to remove the 11n_disable module parameter (or set it to 0 which is the default).

thanks.
Comment 46 Emmanuel Grumbach 2017-07-16 19:58:19 UTC
BTW - Linksys's website seem to talk about firmware updates that improve wireless stability. Can you check what firmware you have on the AP?

http://cache-www.belkin.com/support/dl/EA8500_1.1.4.169978_Customer_Release_Notes.txt
Comment 47 Matt Berry 2017-07-16 20:10:04 UTC
Looks like my firmware is ahead of everything on that page.

Firmware Version: 1.1.4.173728
Firmware Builddate: 2016-06-21 13:55
Product.type: production
Linux: Linux version 3.4.103 (root@build-vm) (gcc version 4.6.3 20120201 (prerelease) (Linaro GCC 4.6-2012.02) ) #1 SMP PREEMPT Tue Jun 21 05:36:27 PDT 2016
Board: wraith

I'm going to start with the AP settings and let that go for a bit. After I'm convinced the problem is not recurring I'll try your patch.
Comment 48 Emmanuel Grumbach 2017-07-20 17:00:45 UTC
Any chance to have (hopefully good) news before the week end? :)
Comment 49 Matt Berry 2017-07-22 23:42:23 UTC
Apologies for the long delay. Had a rather busy week. An added complication is that one of the recent commits to release/LinuxCore30 horribly breaks my system. I ended up rolling back to the commit used for #196195, which is 8f1a7f43d199ef47f466040c3d0160d1e610979f. Also needed to account for an interface change in the kernel (https://github.com/torvalds/linux/commit/cf124db566e6b036b8bcbe8decbed740bdfac8c6).

AP is setup back in the baseline configuration (802.11ac, 80MHz bw).
11n_disable back in the default configuration (0)

I'll try this out for 24 hours and see if anything happens (hopefully not).

Here's my current diff over 8f1a7f43

```
diff --git a/drivers/net/wireless/intel/iwlwifi/fw/file.h b/drivers/net/wireless/intel/iwlwifi/fw/file.h
index 01817510..cbdc127f 100644
--- a/drivers/net/wireless/intel/iwlwifi/fw/file.h
+++ b/drivers/net/wireless/intel/iwlwifi/fw/file.h
@@ -341,6 +341,7 @@ typedef unsigned int __bitwise iwl_ucode_tlv_capa_t;
  * @IWL_UCODE_TLV_CAPA_TX_POWER_ACK: reduced TX power API has larger
  *     command size (command version 4) that supports toggling ACK TX
  *     power reduction.
+ * @IWL_UCODE_TLV_CAPA_MLME_OFFLOAD: supports MLME offload
  *
  * @NUM_IWL_UCODE_TLV_CAPA: number of bits used
  */
@@ -391,6 +392,7 @@ enum iwl_ucode_tlv_capa {
        IWL_UCODE_TLV_CAPA_EXTEND_SHARED_MEM_CFG        = (__force iwl_ucode_tlv_capa_t)80,
        IWL_UCODE_TLV_CAPA_LQM_SUPPORT                  = (__force iwl_ucode_tlv_capa_t)81,
        IWL_UCODE_TLV_CAPA_TX_POWER_ACK                 = (__force iwl_ucode_tlv_capa_t)84,
+       IWL_UCODE_TLV_CAPA_MLME_OFFLOAD                 = (__force iwl_ucode_tlv_capa_t)96,
 
        NUM_IWL_UCODE_TLV_CAPA
 #ifdef __CHECKER__
diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
index d5f2c83f..ba46556a 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c
@@ -622,7 +622,7 @@ static int rs_tl_turn_on_agg_for_tid(struct iwl_mvm *mvm,
 
        IWL_DEBUG_HT(mvm, "Starting Tx agg: STA: %pM tid: %d\n",
                     sta->addr, tid);
-       ret = ieee80211_start_tx_ba_session(sta, tid, 5000);
+       ret = ieee80211_start_tx_ba_session(sta, tid, 0);
        if (ret == -EAGAIN) {
                /*
                 * driver and mac80211 is out of sync
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index ec282eaf..277073ac 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -1272,7 +1272,8 @@ static void ieee80211_if_setup(struct net_device *dev)
        ether_setup(dev);
        dev->priv_flags &= ~IFF_TX_SKB_SHARING;
        dev->netdev_ops = &ieee80211_dataif_ops;
-       dev->destructor = ieee80211_if_free;
+       dev->needs_free_netdev = true;
+       dev->priv_destructor = ieee80211_if_free;
 }
 
 static void ieee80211_if_setup_no_queue(struct net_device *dev)
```
Comment 50 Matt Berry 2017-07-23 00:23:03 UTC
Well that didn't take long. The problem occurred so quickly I didn't have a chance to start a trace, so trace-cmd-7 starts in the middle of an event and spans the time until recovery. As I went to go upload trace-cmd-7 another event started so again trace-cmd-8 missed the start of the event but recorded recovery. 

I tried to get a trace that included the beginning of an event, but after the first two occurrences it has not recurred again. I'll keep trying unless you indicate you don't need it.

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-7.tar.gz.gpg

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-8.tar.gz.gpg
Comment 51 Emmanuel Grumbach 2017-07-23 09:45:25 UTC
Regarding the issues you had with compilation, we are testing the merge back from upstream. When that will be fine, we'll have the destructor change supported in our tree.

The fact that the traffic stall is still happening seems to mean that my guess was wrong. At this point, I can't really do much without an air sniffer.
Please check https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#air_sniffing to see how to use a separate system as an air sniffer.
Please make sure that you properly enter the channel definition.

Thanks.
Comment 52 Matt Berry 2017-07-23 22:39:44 UTC
I'm actively recording an air capture. The recording device is an Intel 7265. Although the Intel site claims that chip is 802.11ac, it is refusing to associate on an 80 MHz channel width (AssocResp status is 104, which isn't even in the spec). So I have the AP setup for Mixed mode, 40 MHz bandwidth. Hopefully what you need will appear in the capture as I don't have any other 802.11ac devices around.
Comment 53 Matt Berry 2017-07-24 00:08:28 UTC
After an hour of browsing the problem has not occurred, which is definitely on the higher end of the frequency distribution for occurrences. We have seen the issue occur on 40 Mhz channel width, so it's possible I've just been very unlucky today. You mentioned the issue was observed during ARP requests? I tried to instigate the issue by periodically flushing the ARP cache on the client side to no avail. I don't think I can coerce the AP to flush its cache.
Comment 54 Matt Berry 2017-07-24 00:59:54 UTC
Additionally, the pcap size is only 24 bytes, so I don't think it actually picked anything up. I verified the recording device reports the same channel, frequency, channel width, and center frequency on both devices.


```
# Recording device:
$ iw dev wlo1 info
Interface wlo1
    ifindex 3
    wdev 0x1
    addr 64:80:99:XX:XX:XX
    type monitor
    wiphy 0
    channel 40 (5200 MHz), width: 40 MHz, center1: 5190 MHz
    txpower 22.00 dBm

# Problem device:
$ iw dev wlp4s0 info | grep channel
    channel 40 (5200 MHz), width: 40 MHz, center1: 5190 MHz
```

The command used to record was:

```
sudo tcpdump -i wlo1 -w capture.pcap -C 2 2>&1 | tee capture.log
```
Comment 55 Emmanuel Grumbach 2017-07-24 05:22:01 UTC
Why did you add -C2 to tcpdump?

The man pages day it limits the size of the capture.
Comment 56 Matt Berry 2017-07-24 05:30:19 UTC
I interpreted it as creating a series of size-limited output files. I tend to do this when I don't know in advance the size of the capture.

> -C file_size
              Before  writing a raw packet to a savefile, check whether the file is currently larger than file_size and, if so, close the current savefile and open a new one.
Comment 57 Emmanuel Grumbach 2017-07-24 19:55:42 UTC
Created attachment 257685 [details]
fix bar

Hi,

people here identified a very rare case that can cause issues.
I doubt it is what you are suffering from since I didn't see any BAR in your captures. Can you please nevertheless give it a try?

PS: what happens when you remove the -C2 from tcpdump?

thanks.
Comment 58 Emmanuel Grumbach 2017-07-24 20:02:07 UTC
Created attachment 257687 [details]
fix bar

fix the wrap around in the patch

Note that this patch is absolutely not tested...
Comment 59 Emmanuel Grumbach 2017-07-24 20:11:40 UTC
Created attachment 257689 [details]
fix bar

yet another (last?) version.

Kinda late here.
Comment 60 Emmanuel Grumbach 2017-07-25 08:33:20 UTC
Nope - this patch is absolute irrelevant.
Sorry.

Any news w/o the -C2 parameter to tcpdump?

2bytes  is a bit short...
Comment 61 Matt Berry 2017-07-26 05:25:40 UTC
I'm going to give it another shot right now.
Comment 62 Matt Berry 2017-07-26 06:51:26 UTC
The -C2 was irrelevant. I don't know the root cause of why that capture didn't log anything in that one instance, but I suspect I misconfigured the frequency or channel width or something. After I had a known working capture setup I ran a simple experiment of refreshing the browser once while capturing. Both with and without -C2 captured ~4k packets each.

So good news, I managed to capture an event in a reasonable amount of time. trace-cmd-9 is a mere 19 MB. Much better than that 400MB behemoth from a few weeks ago. It covers the entire span of an event. I began pinging the local gateway and taking firmware dumps approx. every 10s until 20s or so after the event self-resolved.

As a reminder, I'm still on the patched backport branch which includes the patch from Comment 44. The firmware is the uSniffer-27 firmware. The AP settings are Mixed Mode, 40 MHz channel width.

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-9.tar.gz.gpg

An additional point of interest: the computer recording the traffic suffered a microcode software error during the event. The dmesg of the recording device is included as capture.dmesg in the tarball for reference. Anecdotally, back in February before you picked up the bug report I attempted the air sniff procedure on my own and noticed that when the event occurred on the problem device, the entire OS on the recording device froze. It was the same recording device then as now, but it was not the same software stack.
Comment 63 Emmanuel Grumbach 2017-07-26 07:04:11 UTC
The capture device crashed because I forgot to mention this to you:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#about_the_monitorsniffer_mode

looking at the data now.
Comment 64 Matt Berry 2017-07-26 07:07:15 UTC
I hope that the capture has what you need then even without the amsdu_size argument. If not I can try again tomorrow.

On a quasi-related note, the Intel site claims that the 7265 is 802.11ac but looking at the output of `iw phy` on the recording device I used there is no reference to `VHT` or 80 MHz. Is a firmware update needed to enable these features?
Comment 65 Emmanuel Grumbach 2017-07-26 07:12:31 UTC
I will check what I see in the capture for now. It'll take me a while.
Your 7265 SKU doesn't support 11ac, but only 11n as mentioned in the dmesg.
Comment 66 Emmanuel Grumbach 2017-07-30 19:20:15 UTC
I have spent again a few hours on this trying to sync the tracing output and the air sniffer capture but, the first ping I could find in the trace is this one:
ping-3887  [003]  6203.076880: iwlwifi_dev_tx:       TX_CMD (0x1c) / seq 0xac1 (188 bytes)

From tracing, I can see its sequence number, but this number doesn't appear in the air sniffer capture.
One of the last packet I can see in the air sniffer is a deauth which means you disconnected. It feels like you first recorded an air sniffer capture, stopped it, and then ran the ping.
I scrutinized the air sniffer capture, but couldn't find anything that could hint at which point you are having your traffic stall.
In the firmware dumps, I can see periods of ~10 seconds without traffic, but when I correlated this with the tracing data, it seems like the driver isn't sending any data to the firmware. All that happened before the pings started running which make it hard for me to know if I am look at the actual problem, or we just don't have data to send.

Can you please always leave a ping running in the background? Please use the -Q0x4f
parameter. This will give a different priority to the ping packet which will make it easier for me to identify in the encrypted sniffer capture.

I have the feeling that having the ping running in the background all the time will make the problem disappear but we'll see.

I am sorry to make you work so hard, but I have spent quite a bit of time on this, but since I reproduce the problem on my end, I don't really have much choice.


One thing I'd like to try also is the following (examples taken from my system):
 cat /sys/kernel/debug/ieee80211/phy0/netdev\:wlp2s0/stations/00\:90\:8f\:81\:3f\:3c/agg_status 
next dialog_token: 0x1
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		0	0x00	0x000		0	0x00	000
02		0	0x00	0x000		0	0x00	000
...

This means that Rx AMPDU is open on TID 0 (which is used for pretty much everything unless you use -Q parameter to ping e.g.).

# echo tx start 0 >  /sys/kernel/debug/ieee80211/phy0/netdev\:wlp2s0/stations/00\:90\:8f\:81\:3f\:3c/agg_status 
# cat /sys/kernel/debug/ieee80211/phy0/netdev\:wlp2s0/stations/00\:90\:8f\:81\:3f\:3c/agg_status 
next dialog_token: 0x3
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		1	0x02	000

Now I started A-MPDU on TX on TID 0.
Now you can ping and see that everything goes fine. As long as you have pings flowing, we shouldn't stop the A-MPDU automatically. But you can do so manually:

# echo tx stop 0 >  /sys/kernel/debug/ieee80211/phy0/netdev\:wlp2s0/stations/00\:90\:8f\:81\:3f\:3c/agg_status 
# cat /sys/kernel/debug/ieee80211/phy0/netdev\:wlp2s0/stations/00\:90\:8f\:81\:3f\:3c/agg_status 
next dialog_token: 0x4
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000


I suspect that your AP doesn't like us to stop A-MPDU while we have traffic flowing.

Can you please check this sequence with normal pings (no -Q parameter) on your system?
If we have a easy and short way to reproduce it will help greatly.

Thanks!
Comment 67 Matt Berry 2017-07-30 21:54:04 UTC
The sequence in which I ran the data collection steps was this:
1. Start air sniffer
2. Start trace-cmd
3. Wait for problem to begin
4. Begin pings to confirm it is not just slow DNS query
5. Dump firmware approximately every 10 seconds
6. Recovery happens
7. Take 1-2 more firmware dumps for good measure
6. Stop trace-cmd
7. Stop air-sniffer

I had previously observed that having constant pings in the background greatly lowers the frequency of the issue occurring, but I couldn't confirm it with certainty so I didn't mention it in case it was a red-herring.

I agree that a deterministic set of steps to reproduce would be amazingly helpful, as this normally takes a good chunk of my night to successfully record an event.

Prior to starting this experiment my network traffic was pretty much just idle. I was not recording for this experiment, but it seems to have worked so I will repeat it with trace-cmd and an air sniffer soon.

```
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status  
next dialog_token: 0x3
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		0	0x00	0x000		0	0x00	000
02		0	0x00	0x000		0	0x00	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```

I began pinging with the command: `ping 10.0.0.1 -Q0x4f`

```
 sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status
next dialog_token: 0x3
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		0	0x00	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```

I began aggregation on tx:

```
$ echo tx start 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status 
tx start 0
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                  
next dialog_token: 0x4
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		0	0x00	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```

No change, so I tried it again:

```
$ echo tx start 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status
tx start 0
tee: '/sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status': Resource temporarily unavailable
```

Now it seems to have worked:

```
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                  
next dialog_token: 0x6
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		1	0x04	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		1	0x05	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```

So I stopped it:

```
$ echo tx stop 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status
tx stop 0
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                 
next dialog_token: 0x7
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		1	0x06	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		1	0x05	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```

Second time worked. (In retrospect, I think it takes a few seconds to take effect and I was being impatient).

```
$ echo tx stop 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status
tx stop 0
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                 
next dialog_token: 0x7
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		1	0x05	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```


I noticed this 'blip' in my pings:

```
64 bytes from 10.0.0.1: icmp_seq=264 ttl=64 time=5.13 ms
64 bytes from 10.0.0.1: icmp_seq=265 ttl=64 time=3.57 ms
64 bytes from 10.0.0.1: icmp_seq=266 ttl=64 time=4.89 ms
64 bytes from 10.0.0.1: icmp_seq=267 ttl=64 time=4.61 ms <<<<
64 bytes from 10.0.0.1: icmp_seq=271 ttl=64 time=1018 ms <<<<
64 bytes from 10.0.0.1: icmp_seq=272 ttl=64 time=5.62 ms
64 bytes from 10.0.0.1: icmp_seq=273 ttl=64 time=3.27 ms
```


And the aggregation turned back on automatically after that:

```
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                 
next dialog_token: 0x8
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		1	0x07	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		1	0x05	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```


So I stopped it again:

```
$ echo tx stop 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status
tx stop 0
$ sudo cat /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status                 
next dialog_token: 0x8
TID		RX	DTKN	SSN		TX	DTKN	pending
00		1	0x01	0x000		0	0x00	000
01		1	0x01	0x007		0	0x00	000
02		0	0x00	0x000		1	0x05	000
03		0	0x00	0x000		0	0x00	000
04		0	0x00	0x000		0	0x00	000
05		0	0x00	0x000		0	0x00	000
06		1	0x01	0x002		0	0x00	000
07		0	0x00	0x000		0	0x00	000
08		0	0x00	0x000		0	0x00	000
09		0	0x00	0x000		0	0x00	000
10		0	0x00	0x000		0	0x00	000
11		0	0x00	0x000		0	0x00	000
12		0	0x00	0x000		0	0x00	000
13		0	0x00	0x000		0	0x00	000
14		0	0x00	0x000		0	0x00	000
15		0	0x00	0x000		0	0x00	000
```


And a few seconds later there was another aberration in the pings:

```
64 bytes from 10.0.0.1: icmp_seq=387 ttl=64 time=5.08 ms
64 bytes from 10.0.0.1: icmp_seq=388 ttl=64 time=1032 ms
64 bytes from 10.0.0.1: icmp_seq=389 ttl=64 time=27.6 ms
64 bytes from 10.0.0.1: icmp_seq=390 ttl=64 time=3.77 ms
64 bytes from 10.0.0.1: icmp_seq=391 ttl=64 time=3.81 ms
From 10.0.0.12 icmp_seq=398 Destination Host Unreachable
From 10.0.0.12 icmp_seq=399 Destination Host Unreachable
From 10.0.0.12 icmp_seq=400 Destination Host Unreachable
64 bytes from 10.0.0.1: icmp_seq=401 ttl=64 time=2031 ms
64 bytes from 10.0.0.1: icmp_seq=402 ttl=64 time=1018 ms
64 bytes from 10.0.0.1: icmp_seq=403 ttl=64 time=5.05 ms
64 bytes from 10.0.0.1: icmp_seq=404 ttl=64 time=4.80 ms
```
Comment 68 Matt Berry 2017-08-02 05:02:35 UTC
The last two nights have not played out as planned. I will get a recording for you ASAP, but I don't know exactly when that will be.
Comment 69 Emmanuel Grumbach 2017-08-02 05:09:27 UTC
What do you mean?

What happened?

I am sorry I haven't got back to you yet.
Was planning to do so today
Comment 70 Matt Berry 2017-08-02 05:21:27 UTC
I had promised you in my last update that I'd get a recording of the behavior seen when enabling / disabling the transmit aggregation. My plan was to get that recording last night but this is actually the first I've gotten any free moments since then.
Comment 71 Emmanuel Grumbach 2017-08-02 05:27:10 UTC
That's fine. No worries.

I'll send exact instructions on how to do so. My plan is to get a minimal are sniffer capture of that for 2 reasons:

1) to share with our system guys so that we can try to rely less on stopping ampdu.
2) to check that the air sniffer capture is really a smoking gun before sending this to the AP manufacturer 

But I can't write the instructions on my phone on my way to work;)
Comment 72 Emmanuel Grumbach 2017-08-02 12:54:44 UTC
Getting back to you on this.

As I said in my previous comment, I'd like to get a nice sniffer capture, now that we know how to reproduce the problem at will.

First, I think that capture this sniffer capture without security will make it much easier to understand, but we are talking about ping traffic only. This will allow to see the ping request / reply and it is much easier to see (and proof) what packet should get a reply.

Second, note that when you -Q parameter to ping, it changes the TID (traffic ID which corresponds to TOS) used for the ping. When you do "echo tx start 0", you open A-MPDU on TID 0. So we need to stop using -Q parameter on ping for this test. I asked for it because it makes it easier to identify the ping packet when I read an encrypted sniffer capture. But if the sniffer capture is clear (no security), then I can see the payload of all the packets, and I don't need the -Q parameter.

Please close all the other networking related apps. Here are the steps.

Check that A-MPDU is not on for TID 0.
If it is, you can simply do:
echo tx stop 0 | sudo tee /sys/kernel/debug/ieee80211/phy0/netdev:wlp4s0/stations/14:91:82:72:7c:1f/agg_status

Then,
1) start the air sniffer.
2) ping <IP of the AP>
3) wait 5 seconds and leave the ping command running
3) echo tx start 0 | sudo tee /sys/...
4) wait 5 seconds
5) echo tx stop 0 | sudo tee /sys/...
6) wait 5 seconds <== here we expect a hiccup in the pings
7) stop the sniffer

I'll need the air sniffer capture .pcap file and the output of the ping commend.

If you want to delete certain packets from the sniffer capture because it was captured without encryption, please feel free to do so, but let me know that you did so.
I'll need the ping req / reply and all the wifi management packets.

Thanks!

I'll come up with a patch that completely disables usage of stopping A-MPDU in our driver, but I won't be able to merge this in our code base. I need to take this issue with our system guys.
We also need to involve your AP manufacturer, but let's look at the sniffer capture first to make sure that the AP is in fault here.
Comment 73 Matt Berry 2017-08-12 20:29:28 UTC
I have three captures for you. The first two, trace-cmd-10 and trace-cmd-11, did not exhibit the problem. I did my best to keep those traces clean of all traffic except pings. For trace-cmd-10 I followed your procedure verbatim. For trace-cmd-11 I repeated the cycle of turning aggregation on and off trying to coerce the issue to occur. For trace-cmd-12 I added in some minimal web traffic. I refreshed this bug report and a few pages on wireless.wiki.kernel.org. That seemed to have dome something important. About 30 seconds after the last round of refreshing I disabled aggregation and the problem occurred. 

For trace-cmd-11 and trace-cmd-12, since I was not following the procedure verbatim, there is a file called 'timeline' that recorded each agg_status call I made along with the timestamp.

AP Configuration:

I re-configured the AP's 5GHz network to have the SSID "WIFILNX-676" and no security. It remains at 40MHz bandwidth and in mixed mode.

Recording device configuration:

I setup the recording device with the amsdu_size=3 argument.

Problematic device configuration:

Same state as in Comment 62.

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-10.tar.gz.gpg

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-11.tar.gz.gpg

https://s3.amazonaws.com/maberry-temporary/2017-06-04/trace-cmd-12.tar.gz.gpg
Comment 74 Emmanuel Grumbach 2017-08-12 20:37:58 UTC
Hey,

I was about to update this bug as well :)
We had a meeting with the system guys and we agreed to move away from closing the A-MPDU session.
We will proceed to implementation this week. So I hope that I'll get back to you soon with a patch to test.

I'll look at your captures tomorrow or Monday.

Thanks!
Comment 75 Emmanuel Grumbach 2017-08-12 21:04:18 UTC
I have new findings based on the sniffer capture.

Clearly the AP is not answering our ARP requests anymore. At some point, we give up.

It is very late here, but I'll write a detailed description later.
Comment 76 Emmanuel Grumbach 2017-08-13 14:58:46 UTC
Again quality time with Wireshark on your data :)

This is smoking gun...

You are welcome to send the description below to the AP manufacturer:

* The BA agreement is opened on packet #12589 / #12617
* The BA agreement is torn down on packet #16113


The last DATA packet that was sent before the delBA in #16113 was packet #16090 and its sequence number was 2151. It was properly ACK'ed by the AP.

Then the Intel device is sending packet packet with sequence 2152 at #16131, which is ACK'ed, the AP doesn't reply with the ping response.
This behavior continues until we open A-MPDU again in #17083 / #17087. Then, the Intel device sends an ARP request (#17107) which is properly propagated to the BSS since it was transmitted to the BCAST address (#17109). Finally, the AP starts to react to Intel device's packet again by replying to the ARP at packet #17114.

From this point on, the pings are working again.

This is clearly a bad behavior from the AP side.

As I said, Intel will change its behavior to avoid sending delBAs and that should stop confusing the AP, but the AP manufacturer should fix the bug.

I am leaving this bug open until we send the patch to change our behavior.

Phew... this was a long story...
Comment 77 Emmanuel Grumbach 2017-08-13 17:35:26 UTC
I have updated our internal ticket to remove the initiator delba frames that we send.
At this point, we have no more use of all the awesome data you have provided. As far as Intel is concerned, you can remove them from the website.

I strongly encourage you to send this data to the AP manufacturer along with my analysis. Feel free to let us know what they reply.
Comment 78 Matt Berry 2017-08-13 20:49:16 UTC
This is wonderful news, both that you were able to positively root-cause the issue and that there is a path to work around the issue while Linksys/Belkin resolves it on their end.

I've begun the process of informing them of our findings. The case number on their side is 08501094. I've also fully upgraded the router's firmware to revision 1.1.5.178908 and can confirm the issue is still present and reproducible via the same steps listed above.
Comment 79 Emmanuel Grumbach 2017-08-13 21:08:12 UTC
Yeah.

I am sorry it took 6 months from the date you opened the ticket, but this was definitely a tough one especially because these flows are really not the ones we would suspect for such an issue.
This kind of bug is always hard to narrow down: ping loss, etc... A WARNING, or a firmware assert are of course easier to fix...
Without your patience and the time you spent of gathering high quality data, we could make it...
Comment 80 Emmanuel Grumbach 2017-08-16 10:19:26 UTC
Created attachment 257951 [details]
do not send delBA

Hi,


as promised, here is the patch that changes our behavior to not send delBAs which is what confuses your AP.
Please let me know how it goes.

Thanks.
Comment 81 Emmanuel Grumbach 2017-08-19 21:10:39 UTC
Hi,

the patch has been merged in the master branch of our internal tree:
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

This is a backport based driver.
The commit of the patch is afd2863606420734857c48a13253e238495e7dda

Please let us know if the bug is now fixed for you.

Thank you.
Comment 82 Emmanuel Grumbach 2017-08-26 18:53:06 UTC
Any news?

The patch is on his way to 4.14.
Comment 83 Matt Berry 2017-08-26 20:38:01 UTC
Last weekend was eclipse weekend. Went on a roadtrip and so I couldn't test it. I may test this weekend, depends when Linksys gets back to me and if they want more tests.
Comment 84 Emmanuel Grumbach 2017-08-26 20:39:20 UTC
Great.

Let me know when we can close this issue.
Comment 85 Emmanuel Grumbach 2017-09-04 15:12:02 UTC
I am now considering this as fixed.
Comment 86 Emmanuel Grumbach 2017-09-04 21:24:32 UTC
I am closing the bug. I'd love to hear from you if you want to send any update. I'll still be notified about any comment you may add.

Thank you!
Comment 87 Matt Berry 2017-10-15 20:35:49 UTC
Just wanted to drop by to let you know I've reached level 3 Linksys support and am in contact with an engineer who is looking at the last capture.
Comment 88 Emmanuel Grumbach 2017-10-15 20:41:15 UTC
Wow... this is fascinating really :)

Keep me posted :)

He may have questions about the capture, feel free to route his questions to me.

Our W/A has been merged and sent upstream though.
Comment 89 Emmanuel Grumbach 2017-11-25 21:40:53 UTC
So, what did they say?

I am asking before I'd love to know if they fix the bug or if they claim that the bug in on Intel side.

Thanks.
Comment 90 Matt Berry 2017-12-04 19:28:42 UTC
Well the good news is that they did not attempt to blame the bug on Intel. The bad news is that their solution to the bug probably won't scale very well. About 2 weeks ago they sent me a replacement router and the replacement (which is not the same model) doesn't have the same behavior. 

So for anyone else reading this who has the Linksys EA8500 and Emmanuel's patch is not helping, my recommendation is to contact Linksys support and link them to this bug report. It took me 2.5 months to get in touch with an engineer, but they initially seemed very interested in solving the problem.
Comment 91 Emmanuel Grumbach 2017-12-04 19:44:48 UTC
Ok. Thanks for letting us know.

The most important here is that we could understand what went wrong and that you can now have a good connectivity with Intel's device.
It was fun to debug this... and for the posterity, Intel Wifi on Linux stopped sending the legit frame that it used to send and that seemed to be confusing that AP.

Note You need to log in before you can comment on or make changes to this bug.