Bug 92451 - iwlwifi: 7260: WiFi connected but no data transferred after rekeying - MWG100235318
Summary: iwlwifi: 7260: WiFi connected but no data transferred after rekeying - MWG10...
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL: http://www.spinics.net/lists/linux-wi...
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-01 16:02 UTC by alexander
Modified: 2015-05-31 06:11 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.18.5-gentoo
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Wifi capture from monitor station (2.82 MB, application/x-gzip)
2015-02-01 16:02 UTC, alexander
Details
iwlwifi client log (3.44 KB, text/plain)
2015-02-01 16:03 UTC, alexander
Details
router log (4.50 KB, text/plain)
2015-02-01 16:04 UTC, alexander
Details
iwlwifi trace during the problem (1.53 MB, application/x-gzip)
2015-02-01 16:05 UTC, alexander
Details
capture showing packets are recived and decrypted but then lost (427.78 KB, application/octet-stream)
2015-05-07 21:32 UTC, alexander
Details
workaround (461 bytes, patch)
2015-05-14 19:56 UTC, alexander
Details | Diff
Capture showing old IV for new key (34.58 KB, application/x-gzip)
2015-05-14 21:38 UTC, alexander
Details

Description alexander 2015-02-01 16:02:39 UTC
Created attachment 165521 [details]
Wifi capture from monitor station

I've a problem with the iwlwifi driver (or probably firmware) on my Thinkpad W530 with an Ultimate-N 6300 Wifi card. 

During normal operation the network freezes sometimes but is claiming to be still connected.
It seems to be possible to reproduce the issue by transfering huge anounts of data.

Using Bug #42978 as a guidance I've gathered all the evidence I could get my hands on.

Problem description:

During opertation the wireless network connection stops working, and a bit later the arp entry for the wlan router times out on the wlan client running iwlwifi.
There is nothing suspicious in dmesg/syslog around the time and the only indication from the syslog is, that the ntp server is no longer replying. (Syslog also attached)
That can happen during light load (browsing the internet) or while downloading data. 

The event outlined in detail here was during downloading data from an internal webdav server (owncloud) via davfs2 with ~2.5MiB/s.

Tcpdump shows, that the wlan client stops reciving any packets from the wlan router on the tcp/ip level. On the other end the wlan router still sees all packets from the client.
(The wlan router is running openwrt and when setting a static arp entry for the gateway arp, icmp echo request and tcp-Syn packets can be seen on the router. 
The tcpdump on the router is of course also showing the answer packets, but those are not visible for the tcpdump on the client any longer.)

--> Wlan upstream communication is fine and only the downlink fail, at least when the wlan was primarly beeing used to downlad data to trigger the bug.

The wifi connection can be "fixed" by disconnecting and reconnecting to the wlan, I'm doing that via the KDE connection manager.

I had the same problem with older kernels and first suspected the wlan router running openwrt and only recently started to investigate the iwlwifi driver, since the other WLAN clients are not showing the same problem. (But there is only one android device there which has compareable usage.)

As plattform I'm currently running gentoo unstable with kernel 3.18.5-gentoo and iwl6000-ucode version 9.221.4.1 (md5: feea228ed059c3a998c12031313242b8). 

Test setup:
0) wifi connection to a openwrt router with EAP-TLS (WPA-PSK not tested)
1) mounting a webdav share from owncloud via davfs2
2) opening ssh connection to the Openwrt router and running date in a loop, once each second
3) taring a directory with >100G of data from the webdav share
4) capturing 802.11 traffic on a seperate system with tcpdump with the card in monitor mode and set to the correct channel (restarting tcpdump from time to time when the capture is getting too big)
5) running "trace-cmd record -e iwlwifi -e iwlwifi_msg -e mac80211 -e cfg80211" (not attached, but I have trace file with the failure in it, ~590MB compressed.)
6) Waiting for the date output on the openwrt router to freeze (may take ~1h)
7) Stopping the long running iwlwifi trace and running a new trace with during the error (1436_trace.dat.gz) 

The attached files are for the wifi drop around 14:36:01 CET (13:36:01 CET)

Speculation:
For the last two attemps (the first seriously debugged ones) there is always a notification about a WPA key handshake in the Router log.
Also in this event the connection recovered after some time (but taking >>10min) without manually reconnecting while still working on this ticket. 

Since I never waited more than 10min previously I now assume that the problem may be caused via a WPA rekey and and fixed by a later one?.

I noticed that the connection was back around 15:49 CET and suspect the WPA EAP handshake at 15:30:19 CET in the openwrt log to have fixed the issue (which started for sure at 14:36 CET)

Attached files:
1436_trace.dat.gz	iwlwifi trace during the error
1436_openwrt.log	log extract from openwrt router
1436_client.log		log extract from iwlwifi client
1436_wifi-capture.pcap.gz wifi capture from another system, 150Byte snaplen only (full packet lenght version availible if needded)

Some more details about the client system:

lspci -v -s 03:00.0
03:00.0 Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 3e)
        Subsystem: Intel Corporation Centrino Ultimate-N 6300 3x3 AGN
        Flags: bus master, fast devsel, latency 0, IRQ 32
        Memory at f3000000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c8] Power Management version 3
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 24-77-03-ff-ff-d2-5e-a8
        Kernel driver in use: iwlwifi
        Kernel modules: iwlwifi

iwlwifi kernel settings:

CONFIG_IWLWIFI=m
CONFIG_IWLWIFI_LEDS=y
CONFIG_IWLWIFI_OPMODE_MODULAR=y
# CONFIG_IWLWIFI_BCAST_FILTERING is not set
# CONFIG_IWLWIFI_UAPSD is not set
CONFIG_IWLWIFI_DEBUG=y
CONFIG_IWLWIFI_DEBUG_EXPERIMENTAL_UCODE=y
CONFIG_IWLWIFI_DEVICE_TRACING=y

iwlwifi boot messages:
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: irq 32 for MSI/MSI-X
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-6000-exp.ucode failed with error -2
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-6000-6.ucode failed with error -2
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-6000-5.ucode failed with error -2
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: loaded firmware version 9.221.4.1 build 25532 op_mode iwldvm
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEBUG enabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEBUGFS disabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEVICE_TRACING enabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Detected Intel(R) Centrino(R) Ultimate-N 6300 AGN, REV=0x74
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: L1 Enabled - LTR Disabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: L1 Enabled - LTR Disabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: L1 Enabled - LTR Disabled
Feb  1 13:35:14 perry kernel: iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1
Comment 1 alexander 2015-02-01 16:03:39 UTC
Created attachment 165531 [details]
iwlwifi client log
Comment 2 alexander 2015-02-01 16:04:15 UTC
Created attachment 165541 [details]
router log
Comment 3 alexander 2015-02-01 16:05:05 UTC
Created attachment 165551 [details]
iwlwifi trace during the problem
Comment 4 Emmanuel Grumbach 2015-02-07 20:15:52 UTC
what security are you using?
Are you using TKIP by chance?

You seem to have security problems.
Can you try without security?
Comment 5 alexander 2015-02-08 22:27:58 UTC
First, the Wlan router is supporting CCMP only and the wpa_supplicant client log messages are confirming that. But that's probably not very interesing any more:

I did run quite some tests around security with your tip and it looks like this is indeed the problem and unrelated to iwlwifi:

- I did not disable security, but using WPA2-PSK instead of EAP-TLS  fixes the problem

- The problem always happens nearly exactly 1h (deriviation is often around 5s, worst was 25s) after connecting to the AP. But only when some download job was running. (Tested a nearly idle connection for 70min, no connection problem.)

- I tested iwlwifi options swcrypto=1 and for no good reason also (seperatly) wd_disable=0. Both options change nothing, the problem is still reproduceable with these.

- I can reproduce the EAP connection problem using a kubuntu 14.10 boot CD on the same hardware AND on other notebook using the iwl3945 driver (Intel Pro Wireless 3945) instead of iwlwifi

So it looks like using EAP-TLS is causing the issue. 
I now think the problem is in wpa_supplicant, probably in the Openwrt 14.07 wpad binary.

I'll run some further tests, but from a kernel point of view this bug can probably be marked as invalid and be closed.

It seems a bit unlikely to me that the driver/kernel can do anything special for EAP-TLS (compared to PSK) which would cause problems for iwlwifi and iwl3945...
Comment 6 Emmanuel Grumbach 2015-02-09 03:54:30 UTC
Ok. I just saw that something seemed wrong from a security perspective.
I close the bug. Please reopen it if you have data that points to the driver.


Thanks
Comment 7 alexander 2015-05-07 21:29:18 UTC
I'm now very sure this is a kernel issue and reopen this ticket.
I can reproduce the issue with the following cards assume that it is a generic problem in the mac80211 or related code.

These are the network cards I tested and know they are affected:
Intel Corporation Centrino Ultimate-N 6300 (rev 3e)
Intel Corporation Wireless 7260 (rev 83)
Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)

All of the above are using the mac80211 Softmac from the kernel.

I could not reproduce the issue with an USB Wlan adapter:
0bda:8172 Realtek Semiconductor Corp. RTL8191SU 802.11n WLAN Adapter

This usb card is using the the driver r8712u from staging with an own softmac implementation.
It also has connection problems after a re-key, but recovers within a few seconds.
Here I should also mention that Windows7 is having the issue with my test setup.

Win7 tests ware run not on the main station but on a system with the Intel Wireless 7260 card. But Using Ubuntu on the same system reproduced the known problems to 100%.
(I'm primary debugging this on the system with the Ultimate-N 6300 cards, the other ones were just hit with some quick test.)

Here the summary of what I have found out so far:

The issue is unrelated to EAP-TLS, the real problem are the eapol re-keys.

When the 4-way-handshake is occurring while the wlan is "loaded" the rx path from a "ethernet" point of view stops while tx continues to work. 
A "ping -f 172.16.1.1 -s 1472" has a good chance to trigger the issue, but I'm using "ping -f 172.16.1.1 -s 10000" or tcp transfers in most tests. 
The communication is reestablished after the next rekey. When the ping is still running the load is back and the next rekey will fail again. And so on, without any exceptions.
(With default rekey values hardly a user will wait. Disconnecting and reconnecting also fixes the issue in this case.)

The interface RX counters of the wlan interface is claiming, that no packets are received any longer.
But the interesting point here is, that a monitor interface running on the station sees the missing packets. They are even correctly decrypted! 
I've attached you a filtered capture, see below for details how this was created.

A remote monitor station can also decrypt the packets in both direction and confirms, that the communication over the air is fine.
(If you want to test that you need a patch for wireshark to support rekeys. It's currently for review in the wireshark gerrit.)


Here a manual how you should be able to reproduce the issue:

Most important of course: You need an AP with WPA-PSK encryption doing frequent rekeys. CCMP or TKIP can both be used, I prefer CCMP.
If you have a AP running linux you can force the rekeys by setting "wpa_ptk_rekey=120" for wpa_supplicant to rekey all 2min.
For my openwrt router I do it with this command: "uci set wireless.@wifi-iface[2].wpa_pair_rekey=120; wifi" 

Enable wpa_supplicant debugging on the client, so you can see if and when the rekey occures. 
Again here the command I'm using:
dbus-send --system --print-reply --dest=fi.w1.wpa_supplicant1 /fi/w1/wpa_supplicant1 org.freedesktop.DBus.Properties.Set string:fi.w1.wpa_supplicant1 string:DebugLevel variant:string:"msgdump"

If you also want to see the real keys - makes probably only sense if you need the keys for wireshark - you can also run:
dbus-send --system --print-reply --dest=fi.w1.wpa_supplicant1 /fi/w1/wpa_supplicant1 org.freedesktop.DBus.Properties.Set string:fi.w1.wpa_supplicant1 string:DebugShowKeys variant:boolean:true

Now start a flood ping to a local system with fast connectivity ( Data rates ~1MB/s should trigger the issue and have seen it also with half of the load.)
As soon as wpa_supplicant logs a rekey your connection should freeze and a bit later the arp times out, Till the next rekey, when it will recover.

(You can e.g. look for the text: "State: COMPLETED -> 4WAY_HANDSHAKE" in your log to find the rekeys.)



I've captured how this looks on my system. Additional to the above test setup I've also added a sub-interface on the station to capture packets in monitor mode.
This is working a bit strange on the card, it looks like I have connect to a wlan, try the commands, disconnect again and try again. (Still fighting each time again, needs more testing)

iw phy phy0 interface add mon0 type monitor flags none
iw dev mon0 set monitor none

# Verify, if not in monitor mode keep playing around.
iw mon0 info

Once we are in monitor mode download some data or use the flood ping again. Since in tcp we can see the error much better I prefer tcp connections here.
I've attached you a sample pcap, captured this way. Once you set "IEEE 302.11 -> Ignore the Protection bit" to "Yes - with IV" wireshark will decode the traffic and we can look at the tcp connection.
(Without decryption in wireshark, the packets are already decrypted.) And here we see the "missing" packets....

The wireshark filter is then nicely showing when the connection is lost: 
tcp.seq == 3591761481 or tcp.nxtseq==3591761481 or tcp.ack>=3591761481 or eapol

Exactly after the last packet of the 4-way handshake. (We also seem to lose some packets around the key change, but that's probably normal)
Comment 8 alexander 2015-05-07 21:32:50 UTC
Created attachment 176151 [details]
capture showing packets are recived and decrypted but then lost
Comment 9 alexander 2015-05-07 21:36:08 UTC
One critical word is missing above:

...
Here I should also mention that Windows7 is NOT having the issue with my test setup.
...
Comment 10 alexander 2015-05-09 13:34:43 UTC
I can reproduce the issue up to 4.0.1-gentoo, which is the newest version I checked.

I also had a test with the oldest kernel provided in gentoo, 3.4.105-gentoo.
The problem is present on all.
Comment 11 alexander 2015-05-14 19:54:39 UTC
With some printk's I was able to track the issue down and even finally prevent the freezes. 

I've attached the patch "fixing" the issue for me, but this is probably not the correct fix.

The packets are dropped in the function "ieee80211_crypto_ccmp_decrypt", at the lines I commented out in the patch.

It looks indeed like a fundamental problem in mac80211.
Here my current understanding of the issue:

It looks like this is happening:

 * we rekey and are changing the used encryption key with "ieee80211_key_replace"
 * this correctly replaces the key and we start with a fresh counter "rx_pn" for the new key
 * But the rx handler still gets packets encrypted with the old key after the key change is completed (we just send out the 4th packet of the eapol handshake and the router still has to switch over to the new key, I assume)
 * the commented out code accepts that, since the old counter is >>0
 * but some lines below we set rx_pn" for the new key according to the value packet (WRONG!!)
 * some dozen packets later we finally get packets encrypted with the new key, and with "rx_pn" values belonging to the new key
 * now the commented out code drops the packets, since we set the "rx_pn" for the new key according to the values of old key packets  

With the check disabled my system now rekeys quite well. But this makes handling of packet numbers probably completely useless.
Comment 12 alexander 2015-05-14 19:56:48 UTC
Created attachment 176911 [details]
workaround
Comment 13 Emmanuel Grumbach 2015-05-14 20:02:44 UTC
Yes :)

What you are doing is simple to disable the replay attack detection. Not good :)

But your thorough analysis will help us continue from here.
Apparently, your AP is really problematic. It keeps sending lots of packets with the old key even after it has rekeyed.

I'll add your observation to the internal bug I opened.
Comment 14 alexander 2015-05-14 21:38:35 UTC
Created attachment 176941 [details]
Capture showing old IV for new key

I'm was sure that my AP is changing the key immediately after the 4th packet of the handshake, so your remark let me look at an older capture (EAP-TLS) of the problem again with the new knowledge. (Captured with a third station in monitor mode.)

There is something quite curious there:

With my wireshark patch (it's in wireshark master now) I can not only follow the rekeys, but I also see the used decryption keys, both TK and PMK/PSK per packet.

The encryption key indeed changes immediately after the last packet of the handshake, but the Initialization Vector is still counting up against the old value.

So this is a bug in openwrt after all, at least if the packet counter must reset to zero again with the first packet encrypted with the new key.

I've attached export of that, using the filter "eapol.type==3 || (ip.src == 10.16.2.1 && (frame.number >= 66340) &&frame.number <=66465 )"

When you configure the following PSK's in a very recent git version of wireshark you can look at the issue:
87c6958afa7085c2d6e7ce4fbd2d22cca6996ce08ec56ac89b8810f8740b1618
5a9c725b361733feb9e1490bf530aa3c297841140ddbc82971134eefd20f7330
64040b460406ebc484b84c6f272ce0720ef04cd309a6440fa5ff163b1ab248f8

The Frame 19 is the first one using the new key. Even filtering out all retransmits with "!(wlan.fc.retry == 1)" we have at least one packet with the new key and Initialization Vector based on the old one.

Unfortunately this capture has quite some packet loss, on the "air" and also during capture. (Channel was set prior to the capture and the monitor station was maybe 50cm from the station.)

I may have to sort that out and do it again, but already the capture is clearly showing that wrong Initialization Vector's are set.
Comment 15 Emmanuel Grumbach 2015-05-15 09:08:32 UTC
yes - it seems that your AP is using the same key index *before* and *after* the re-keying. This means that mac80211 can't differentiate between a replay attack and a jump in the PN because the AP started using the new key index.

Do you have a way to contact openWRT folks?
Comment 16 alexander 2015-05-16 17:00:48 UTC
I can (re)open the ticket I have there. I'll for sure start looking into the rekey code on openwrt and poke around there a bit more. 
(The NIC driver on openwrt is also using mac80211, though.)

But you are indicating above, that openwrt (hostapd) could/should use a different key index for the pairwise key when rekeying? 

Now I've only started looking at WPA security to track down my connection issue and I'm as from an expert here. 
That said, my current understanding is, that the pairwise key must always have key index set to zero. 

IEEE802.11-2012 has this statement in "11.6.11.3 Authenticator state machine variables":
> EAPOL-Key frames with a key type of Pairwise and a nonzero key index should
> be ignored.

So I assume the key index "0" is the only way how it can be and the problem is, that we get frames encrypted with the new key but old frame counter values for a while, bumping "rx_pn" to the value used by the old key and then triggering the reply prevention code when the AP is switching over the packet counter for the new key. 

If I'm right, that is bringing up an interesting point to keep in mind: Why does it work on Windows? 
The reply detection should be triggered there also. 
Now I did notice a short hang and with a (normal rate, one packet each second) ping during the rekey, one ping was always lost. I suspect that thix is also caused by the AP issue and that a "proper" rekey would be seamless.

Now somehow Windows recovers while it should not - at least for my understanding. So potentially Windows is only recovering due to an bug/feature in their implementation or I'm got something major wrong here.
Comment 17 Emmanuel Grumbach 2015-05-17 20:13:23 UTC
I have started a thread to plead your cause with the experts :)
I just added you to it.
Comment 18 Emmanuel Grumbach 2015-05-18 19:29:51 UTC
How can you know what key has been used to encrypt the data?
Do you decrypt the data (in Wireshark) and check what key needs to be used to avoid corruption?
Comment 19 alexander 2015-05-18 20:01:03 UTC
exactly. 

But as mentioned you need a very recent version of wireshark for that.
Any stable version will stop decrypting at the rekey and also not include the key information to identify which key was used for the decryption.

You should use a version of the master branch of wireshark from March 10 or newer.

You can find working binaries at the moment only here:
https://www.wireshark.org/download/automated/

or have to build wireshark from source.
Comment 20 Emmanuel Grumbach 2015-05-31 06:10:58 UTC
Ok - I think it is fair to close this bug on the kernel.
The scenario you are hitting was diagnosed to be a problem in openWRT running an atheros device.

Anyone hitting this issue, please see the thread in the URLs above.

Note You need to log in before you can comment on or make changes to this bug.