Bug 78101

Summary: iwlwifi AC 7260: No association and the time event is over - MWG100216251
Product: Drivers Reporter: Johannes Stezenbach (js)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED WILL_NOT_FIX    
Severity: normal CC: alessandro.zucca01, aroesler.privat, cachobot, h.judt, ilw, jackc, js, kernel, leho, maggu2810, marco.caminati, patrakov, stuart.stent
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.15 Subsystem:
Regression: No Bisected commit-id:
Attachments: lspci -vvv
trace-cmd record -e iwlwifi -e cfg80211 -e mac80211 unpatched kernel 3.15, running wpa_supplicant manually
beacons captured on client
print info
dmesg
dmesg after AP reboot
bad beacon
good beacon
FW that doesn't drop the beacon

Description Johannes Stezenbach 2014-06-16 12:53:33 UTC
Created attachment 139941 [details]
lspci -vvv

A new Thinkpad Yoga 20CD00AMGE with Intel(R) Dual Band Wireless AC 7260
fails to connect to the AP in the office, while it works
on my home AP.  WPA2 is used in both cases.  I tried with
kernel 3.14.7 and 3.15.  There is a high density of APs and
clients and lots of traffic in the office environment.

The failure is caused by:
[   32.200142] iwlwifi 0000:04:00.0: No association and the time event is over already...

Some more detail from dmesg:

[    1.161462] iwlwifi 0000:04:00.0: irq 61 for MSI/MSI-X
[    1.164277] iwlwifi 0000:04:00.0: loaded firmware version 22.24.8.0 op_mode iwlmvm
[    1.180348] iwlwifi 0000:04:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[    1.180651] iwlwifi 0000:04:00.0: L1 Enabled; Disabling L0S
[    1.180885] iwlwifi 0000:04:00.0: L1 Enabled; Disabling L0S

lspci -vn:

04:00.0 0280: 8086:08b2 (rev 83)
        Subsystem: 8086:4270

Full lspci -vvv output attached.


wpa_supplicant retries in an endless loop:

[   28.181020] cfg80211: Calling CRDA to update world regulatory domain
[   31.890168] wlp4s0: authenticate with f8:d1:11:39:1a:8c
[   31.892972] wlp4s0: send auth to f8:d1:11:39:1a:8c (try 1/3)
[   31.903458] wlp4s0: authenticated
[   31.927239] wlp4s0: associate with f8:d1:11:39:1a:8c (try 1/3)
[   31.963374] wlp4s0: RX AssocResp from f8:d1:11:39:1a:8c (capab=0x431 status=0 aid=5)
[   31.968857] wlp4s0: associated
[   32.200142] iwlwifi 0000:04:00.0: No association and the time event is over already...
[   32.200154] wlp4s0: Connection to AP f8:d1:11:39:1a:8c lost
[   32.270241] cfg80211: Calling CRDA to update world regulatory domain
(repeats)
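
For anyone who wants to check their own dmesg for the same signature, here is a minimal Python sketch that measures how long after "associated" the driver gives up. The function name `find_time_event_failures` and the `SAMPLE` text are made up for illustration (the sample lines are copied from the log above); this is not part of any existing tool.

```python
import re

# Matches a dmesg line such as "[   31.968857] wlp4s0: associated".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

FAILURE_TEXT = "No association and the time event is over"


def find_time_event_failures(dmesg_text):
    """Return the delay in seconds between each "associated" message
    and the next "No association..." message from iwlwifi."""
    delays = []
    associated_at = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if message.endswith(": associated"):
            associated_at = timestamp
        elif FAILURE_TEXT in message and associated_at is not None:
            delays.append(round(timestamp - associated_at, 6))
            associated_at = None
    return delays


# Sample lines taken from the dmesg excerpt in this report.
SAMPLE = """\
[   31.903458] wlp4s0: authenticated
[   31.927239] wlp4s0: associate with f8:d1:11:39:1a:8c (try 1/3)
[   31.968857] wlp4s0: associated
[   32.200142] iwlwifi 0000:04:00.0: No association and the time event is over already...
[   32.200154] wlp4s0: Connection to AP f8:d1:11:39:1a:8c lost
"""

if __name__ == "__main__":
    print(find_time_event_failures(SAMPLE))
```

On the excerpt above this reports a gap of roughly 0.23 seconds between association and the forced disconnect.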

For testing I used a bare Arch Linux installation and ran
wpa_supplicant manually:
  echo 'network={
        ssid="foo"
        psk="bar"
}' >w
  wpa_supplicant -Dnl80211 -iwlp4s0 -cw -d

...
wlp4s0: State: ASSOCIATED -> 4WAY_HANDSHAKE
...
wlp4s0: WPA: Key negotiation completed with f8:d1:11:39:1a:8c [PTK=CCMP GTK=CCMP]
...
wlp4s0: CTRL-EVENT-CONNECTED - Connection to f8:d1:11:39:1a:8c completed [id=0 id_str=]
...
nl80211: Drv Event 20 (NL80211_CMD_DEL_STATION) received for wlp4s0
nl80211: Delete station f8:d1:11:39:1a:8c
nl80211: Drv Event 39 (NL80211_CMD_DEAUTHENTICATE) received for wlp4s0
nl80211: Deauthenticate event
wlp4s0: Event DEAUTH (12) received
wlp4s0: Deauthentication notification
wlp4s0:  * reason 4 (locally generated)
wlp4s0:  * address f8:d1:11:39:1a:8c
Deauthentication frame IE(s) - hexdump(len=0): [NULL]
wlp4s0: CTRL-EVENT-DISCONNECTED bssid=f8:d1:11:39:1a:8c reason=4 locally_generated=1
wlp4s0: Auto connect enabled: try to reconnect (wps=0 wpa_state=9)

I also tried "iw dev wlp4s0 set power_save off", it did not help.


Not really knowing what I'm doing I simply tried to comment out
the check causing the disconnect:

drivers/net/wireless/iwlwifi/mvm/time-event.c:iwl_mvm_te_handle_notif()

                iwl_mvm_te_check_disconnect(mvm, te_data->vif,
                        "No association and the time event is over already...");

And lo and behold, it can connect.
Comment 1 Emmanuel Grumbach 2014-06-16 13:15:20 UTC
I need some data on the AP. What AP do you have?
More importantly, what is its beacon interval?

a trace-cmd output will let us know the answer to the second question.

trace-cmd record -e iwlwifi -e cfg80211 -e mac80211
Comment 2 Johannes Stezenbach 2014-06-16 13:25:26 UTC
The office AP is a TP-Link TL-WR1043ND v1.8 with vendor firmware
(Atheros AR9132). Its beacon interval, read after connecting with the
hack patch mentioned above:

$ iw dev wlp4s0 link
Connected to f8:d1:11:39:1a:8c (on wlp4s0)
        SSID: foo
        freq: 2462
        RX: 13135 bytes (103 packets)
        TX: 2033 bytes (19 packets)
        signal: -43 dBm
        tx bitrate: 1.0 MBit/s

        bss flags:      CTS-protection short-preamble short-slot-time
        dtim period:    0
        beacon int:     100

Will try to get the trace-cmd output asap (with unpatched kernel).
Comment 3 Emmanuel Grumbach 2014-06-16 13:27:04 UTC
dtim period 0??

hmm... sounds like a bug in iw... - or our driver...
Comment 4 Johannes Stezenbach 2014-06-16 13:53:41 UTC
Created attachment 139951 [details]
trace-cmd record -e iwlwifi -e cfg80211 -e mac80211
unpatched kernel 3.15, running wpa_supplicant manually
Comment 5 Johannes Stezenbach 2014-06-16 14:19:01 UTC
Created attachment 139961 [details]
beacons captured on client

iw phy phy0 interface add mon0 type monitor
ip link set up dev mon0
iw dev mon0 set freq 2462
tcpdump -i mon0 -s10000 -w cap.pcap

then used wireshark to extract the beacons only,
hope it is useful
Comment 6 Johannes Stezenbach 2014-06-19 12:27:09 UTC
Did you have some time to look at the trace + beacons?
It seems the dtim period 0 is not from the beacon (it has dtim per 1),
but it is in the trace:

drv_bss_info_changed: phy0 vif:wlp4s0(2)
  assoc:1 aid:9 cts:1 shortpre:1 shortslot:1 dtimper:0
  bcnint:100 assoc_cap:0x431 basic_rates:0xf enable_beacon:0
  ht_operation_mode:0

Maybe you need some additional trace to find out where
the dtimper:0 comes from?

(Even if the AP were buggy and would send DTIM period 0, there
is a check in ieee80211_set_associated() that should catch it.)
Comment 7 Emmanuel Grumbach 2014-06-19 12:29:19 UTC
no - no time.
I am travelling and very busy right now. This is why I asked you to open a bug. So that I can have a real tracking and not just a mail in my inbox.
Comment 8 Emmanuel Grumbach 2014-06-19 15:33:49 UTC
Hi,

I haven't looked at the traces yet - but I doubt they'll help me now that I looked a bit more at the code.
Can you try this:

diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
index e37b97d..661dc8b 100644
--- a/net/mac80211/mlme.c
+++ b/net/mac80211/mlme.c
@@ -4438,6 +4438,8 @@ int ieee80211_mgd_assoc(struct ieee80211_sub_if_data *sdata,
                        sdata->vif.bss_conf.sync_device_ts =
                                bss->device_ts_beacon;
                        sdata->vif.bss_conf.sync_dtim_count = dtim_count;
+                       sdata->vif.bss_conf.dtim_period =
+                               ifmgd->dtim_period ? : 1;
                }
        } else {
                assoc_data->timeout = jiffies;
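
To illustrate what the `? : 1` fallback in the diff above guards against, here is a rough Python sketch that pulls the DTIM period out of a beacon's information elements and falls back to 1 when the value is zero or the TIM element is absent. The TIM element ID (5) and its layout (DTIM count, then DTIM period) are from the 802.11 element format; the function name and the sample byte strings are invented for this example and this is not how mac80211 is structured internally.

```python
TIM_ELEMENT_ID = 5  # Traffic Indication Map element ID in IEEE 802.11


def dtim_period_from_ies(ies, default=1):
    """Walk ID/length/body elements and return the DTIM period.

    Falls back to `default` when no TIM element is found or when the
    advertised period is 0, mirroring the `x ? : 1` idiom in the patch.
    """
    offset = 0
    while offset + 2 <= len(ies):
        element_id = ies[offset]
        length = ies[offset + 1]
        body = ies[offset + 2:offset + 2 + length]
        if len(body) < length:
            break  # truncated element, stop parsing
        if element_id == TIM_ELEMENT_ID and length >= 2:
            # body[0] is the DTIM count, body[1] is the DTIM period
            return body[1] or default
        offset += 2 + length
    return default


# Hypothetical sample data: an SSID element followed by a TIM element.
SSID_IE = bytes([0, 3]) + b"foo"
TIM_PERIOD_1 = bytes([5, 4, 0, 1, 0, 0])
TIM_PERIOD_0 = bytes([5, 4, 0, 0, 0, 0])

if __name__ == "__main__":
    print(dtim_period_from_ies(SSID_IE + TIM_PERIOD_1))
    print(dtim_period_from_ies(SSID_IE + TIM_PERIOD_0))
    print(dtim_period_from_ies(SSID_IE))
```

All three calls yield 1: the first because the beacon says so, the other two because of the fallback.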
Comment 9 Johannes Stezenbach 2014-06-20 10:26:46 UTC
Hi,

I removed my workaround hack and added your change
(linux-3.15), and also added a printk for
ifmgd->dtim_period.

It works, I can connect, and the printk shows
ifmgd->dtim_period is 1, the value expected from the beacon.
(i.e. the "? : 1" is not needed in my case)
Comment 10 Emmanuel Grumbach 2014-06-20 13:34:14 UTC
:)

Ok - thanks for the testing.
Comment 11 Emmanuel Grumbach 2014-06-20 16:52:37 UTC
patch published
Comment 12 Emmanuel Grumbach 2014-06-26 06:09:18 UTC
Created attachment 140971 [details]
print info

Hi again,

so I sent my patch and the maintainer asks a few questions that I couldn't answer because I don't fully understand how my patch solves your problem.

I attached a patch - please apply it and reproduce the issue without the fix.
This will shed more light on what is going on.

My patch is very likely to be incomplete.

Thank you for your help.
Comment 13 Johannes Stezenbach 2014-06-26 14:35:36 UTC
Created attachment 141021 [details]
dmesg

Today I found your previous patch didn't fix the issue.  Maybe on the
day I tested the environmental conditions changed, or I made a mistake.
(However, I used the machine with the patch applied so I'm sure
wifi worked, and I triple checked I backed out my hack patch in
iwl_mvm_te_handle_notif().)  Sigh...

I'm attaching the requested dmesg, but I guess it is not so interesting now.
(just ran wpa_supplicant manually for short time).
It looks like no beacon was received during the connection
attempt, but I captured packets on a second machine, and it shows
the beacons (although some are missing).
Comment 14 Emmanuel Grumbach 2014-06-26 16:18:19 UTC
what you sent *is* useful - I just want to understand what you had.
You had my patch from comment 12 obviously, but did you also have the patch from comment 8?

I'll look at the logs later tonight (hopefully).
Comment 15 Emmanuel Grumbach 2014-06-26 16:42:59 UTC
from what I see here - you really seem not to hear any beacon from your AP... which is really weird...
Comment 16 Johannes Stezenbach 2014-06-26 16:44:10 UTC
The log is without the patch from comment 8, I thought
"reproduce the issue without the fix" means to back it out.
I also tested with the fix, it didn't make a difference today.
Didn't keep the log, though, but if I've seen it correctly
all the debug prints only printed zeros.

Thanks,
Comment 17 Johannes Stezenbach 2014-06-26 17:14:35 UTC
PS: I had also run tcpdump on a monitor interface
(like in comment 5) during the connection attempt, no beacons were captured.
Since the packet capture taken on the other machine
also had missing beacons (roughly 50% missing),
maybe some neighbouring device is interfering.
(the distance to the AP is ~4 meters).
So the root cause could be related to reception quality?
However, other management frames get through so I'm not
sure how this is possible except if the firmware is acting funny.
Comment 18 Emmanuel Grumbach 2014-06-26 17:50:08 UTC
well - some APs are sometimes problematic. We've seen APs stop sending beacons for a few seconds... But in your case, it seems really severe... A bit too severe to be possible. Have you tried to reboot the AP?
Comment 19 Johannes Stezenbach 2014-06-27 09:36:59 UTC
Created attachment 141091 [details]
dmesg after AP reboot

After AP reboot, it can connect (without any fix patch).
Attached is the dmesg with the debug patch applied.

So it seems the AP is part of the problem, however all other
devices can connect.  Maybe the issue is simply the timeout
for the "No association and the time event is over already..."
check is too short, the driver should wait longer for a beacon?
Comment 20 Johannes Stezenbach 2014-06-27 15:00:56 UTC
I noticed the AP was set to automatic channel selection
and selects a different channel on each boot, so I set
it back to channel 11 and also tried a few other channels:
It could still connect, so the issue seems not related to
channel setting.

The download speed (from local http server) varies wildly
between 5MB/s and 100K/s, while a tiny rt2800usb dongle seems
to yield more consistent and on average faster download speed.
However, my Thinkpad X230 (Ultimate-N 6300 AGN) is
consistently slow with this AP (~100K/s) and sometimes
even stalls completely for a few seconds.

Maybe of interest, "iw phy" reports "Available Antennas: TX 0 RX 0"
in both cases, while the Yoga has two antennas (2x2) and
the X230 has three (3x3).  Any idea about it?

BTW, the AP is a TP-Link TL-WR1043ND v1.8 running vendor firmware.
(Atheros AR9132)
Comment 21 Emmanuel Grumbach 2014-06-28 19:05:42 UTC
please don't mix issues in the same bug.

From my point of view - it seems that this issue was an issue with your AP.
Comment 22 Johannes Stezenbach 2014-06-28 19:56:01 UTC
Sorry, I was just dumping some information gathered during testing
before it gets lost.

Agreed, the AP seems to be at least part of the problem, however
other devices can connect without problem.  I hope it is of
interest to you to find out how you could improve the driver
to work better with shitty APs.  I can only offer to do
testing and tracing.

Anyway, since I have a workaround (see bottom of issue description),
you can close the bug if no one else sees the same issue.
Comment 23 caminati 2014-07-13 17:23:42 UTC
I have the same problem as Johannes: I never manage to associate to my AP longer than one second or so. 
dmesg says:

cfg80211: Calling CRDA to update world regulatory domain
wlan0: authenticate with xxx
wlan0: send auth to xxx (try 1/3)
wlan0: authenticated
iwlwifi 0000:01:00.0 wlan0: disabling HT as WMM/QoS is not supported by the AP
iwlwifi 0000:01:00.0 wlan0: disabling VHT as WMM/QoS is not supported by the AP
wlan0: associate with xxx (try 1/3)
wlan0: RX AssocResp from xxx (capab=0x1 status=0 aid=4)
wlan0: associated
iwlwifi 0000:01:00.0: No association and the time event is over already...
wlan0: Connection to AP xxx lost
cfg80211: Calling CRDA to update world regulatory domain

Note that:

1) I rebooted my AP, didn't fix.
2) All other devices I tried in the many years I've been using this AP worked.
3) It is an old AP, you can find details attached.

Given that, I ask that this bug is reopened.

Model: 		HomePortal 1000SW 				
Serial Number: 		442111004305
Hardware Version: 		2700-000303-004
Software Version: 		3.5.15
Comment 24 Emmanuel Grumbach 2014-07-13 17:58:49 UTC
What kernel version do you have?
Comment 25 caminati 2014-07-13 23:13:42 UTC
(In reply to Emmanuel Grumbach from comment #24)
> What kernel version do you have?

I built backports-3.16-rc1-1 on a 3.8.13 kernel version.
Comment 26 Emmanuel Grumbach 2014-07-21 08:54:02 UTC
@caminati

Can you please record tracing?

sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg

You can send the data privately to me if you prefer.
Comment 27 Johannes Stezenbach 2014-07-21 09:58:58 UTC
After being away for some time the issue reappeared when I came back,
currently the AP uptime is 23 days.  I know you think it is an AP issue
and I could just reboot it again, but in case you want me to test/debug
why the beacons are not received, let me know.  As I mentioned before
the AP works for anyone else, office mates using a large zoo of
Android devices for development.
Comment 28 caminati 2014-07-21 10:02:00 UTC
(In reply to Emmanuel Grumbach from comment #26)
> @caminati
> 
> Can you please record tracing?
> 
> sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg
> 
> You can send the data privately to me if you prefer.

Thanks for your attention.

Sadly, I get
"debugfs is not configured on this kernel".

At the moment, I have no time nor gear to rebuild the kernel.
I hope I will in the near future, in which case I will keep this bugthread updated.
Comment 29 Emmanuel Grumbach 2014-07-21 11:40:09 UTC
I am trying to find a way to be more robust to this case. Patch will follow...
Comment 30 Emmanuel Grumbach 2014-07-21 13:48:29 UTC
Can you try this?

diff --git a/drivers/net/wireless/iwlwifi/mvm/mac80211.c b/drivers/net/wireless/iwlwifi/mvm/mac80211.c
index c49b08c..0d17b44 100644
--- a/drivers/net/wireless/iwlwifi/mvm/mac80211.c
+++ b/drivers/net/wireless/iwlwifi/mvm/mac80211.c
@@ -2229,9 +2229,9 @@ static void iwl_mvm_mac_mgd_prepare_tx(struct ieee80211_hw *hw,
 {
        struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw);
        u32 duration = min(IWL_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS,
-                          200 + vif->bss_conf.beacon_int);
+                          300 + vif->bss_conf.beacon_int);
        u32 min_duration = min(IWL_MVM_TE_SESSION_PROTECTION_MIN_TIME_MS,
-                              100 + vif->bss_conf.beacon_int);
+                              250 + vif->bss_conf.beacon_int);

        if (WARN_ON_ONCE(vif->bss_conf.assoc))
                return;

This is just a try to easily make it more robust.
I am working on a more generic approach, but I'd like to know whether this helps, to see if we are heading in the right direction.

Thanks.
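
To make the effect of the diff above concrete, here is a small Python sketch of the arithmetic for a beacon interval of 100. The cap values (500 and 400 ms for IWL_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS and IWL_MVM_TE_SESSION_PROTECTION_MIN_TIME_MS) are assumptions and are not stated anywhere in this thread, so check time-event.h in your tree. The function name is made up for illustration.

```python
def session_protection(beacon_int, base, min_base, max_cap=500, min_cap=400):
    """Mirror the two min() expressions in iwl_mvm_mac_mgd_prepare_tx().

    max_cap and min_cap stand in for the driver's MAX/MIN session
    protection constants; their default values here are assumptions.
    """
    duration = min(max_cap, base + beacon_int)
    min_duration = min(min_cap, min_base + beacon_int)
    return duration, min_duration


if __name__ == "__main__":
    beacon_int = 100  # as reported by "iw dev wlp4s0 link" earlier
    print("before patch:", session_protection(beacon_int, 200, 100))
    print("after patch: ", session_protection(beacon_int, 300, 250))
```

Under those assumed caps the window grows from (300, 200) to (400, 350), so the patch only buys about one extra beacon interval. It cannot help if the beacon never reaches the driver at all.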
Comment 31 Harald Judt 2014-07-21 14:41:59 UTC
It seems I have this problem too (linux-3.15.2 and older). Sometimes my wireless is lucky and some connection works (there are several APs with the same ESSID in range), but most times I find in dmesg:

iwlwifi 0000:03:00.0: No association and the time event is over already...
wlan0: Connection to AP xx:xx:xx:xx:xx:xx lost
wlan0: direct probe to xx:xx:xx:xx:xx:xx (try 2/3)
wlan0: direct probe to xx:xx:xx:xx:xx:xx (try 3/3)
wlan0: authentication with xx:xx:xx:xx:xx:xx timed out
wlan0: authenticate with yy:yy:yy:yy:yy:yy
wlan0: direct probe to yy:yy:yy:yy:yy:yy (try 1/3)
wlan0: direct probe to yy:yy:yy:yy:yy:yy (try 2/3)
wlan0: direct probe to yy:yy:yy:yy:yy:yy (try 3/3)

I'm going to try your patches and hope they help me too.
Comment 32 Johannes Stezenbach 2014-07-22 07:54:04 UTC
The fw session protection change in iwl_mvm_mac_mgd_prepare_tx()
made no difference for me.
Comment 33 Johannes Stezenbach 2014-07-22 09:01:01 UTC
FWIW, I wanted to check fw_rx_stats in iwlmvm debugfs directory,
but all values are zero. Comment in fw-api.h for struct iwl_notif_statistics
indicates the stats might only be sent while associated, so I temporarily
added the workaround described at the end of this bug's description.
But the values are still zero.  I notice the REPLY_STATISTICS_CMD 0x9c
mentioned in the comment is removed from the command enum.
Something wrong here?
Comment 34 Johannes Stezenbach 2014-07-23 09:44:51 UTC
After some experimenting, I found the device stops receiving beacons
when it tries to connect, but receives beacons on monitor interface
when the managed interface is down. I.e.:

iw phy phy0 interface add mon0 type monitor
ip link set up dev mon0
iw dev mon0 set channel 7
tcpdump -i mon0
-> tcpdump receives beacons (actually using wireshark)

iw dev wlan0 set channel 7
ip link set up dev wlan0
(also tried: iw dev wlan0 connect foo 2442)
-> tcpdump stops receiving beacons

iw dev wlan0 scan
-> tcpdump receives some beacons from other APs, but not from our AP;
   it also receives probe responses, also from our AP

ip link set down dev wlan0
-> tcpdump resumes receiving beacons

Any idea about it?  And why would this behaviour change
when I reboot the AP???  (I should reboot the AP to confirm,
but since it might take days or weeks until the issue re-appears
I'm not doing it yet.)

(Doing the same experiment using Ralink usb dongle works as expected,
receiving beacons all the time.)
Comment 35 Emmanuel Grumbach 2014-07-27 18:17:51 UTC
We had a problem with beacon filtering - but that is disabled in 3.15:
http://lxr.free-electrons.com/source/drivers/net/wireless/iwlwifi/mvm/mac80211.c#L829

When you have a STATION vif, we don't let anything through unless you scan / or try to associate.
Try to scan on wlan0 and you'll see packets on mon0. After all, you don't want to get interrupts for any packet in the air when you are not scanning / associating.

Note that the monitor interface is a real promiscuous mode when it is the only interface active. If you have a STATION interface, it'll just copy the packets coming from the STATION.
I am not sure about the last sentence, but this is what I remember.

I have no reasons to think it'll help, but can you please try this firmware?
https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?h=Core6&id=59a2c0aa8b9e26533d01d153a9be2c5f61cc0d62

Thanks.
Comment 36 Johannes Stezenbach 2014-07-28 09:05:54 UTC
The experimental firmware v25.223.9.0 does not change the
behaviour.  FWIW I'm currently using kernel 3.16.0-rc6.

At this point I think the best way forward would be to reboot
the AP to see if the previous result can be reproduced and
the connection can be established.  And capture beacons
and probe responses before and after AP reboot for comparison.

Any other idea what to test?

(I think monitor mode shows that beacons can be received so
there is no issue with signal quality or antenna setup.  More
likely it is an issue with beacon filtering?  I added the same
&& false as in 3.15 to 3.16.0-rc6 but without effect.)
Comment 37 Emmanuel Grumbach 2014-07-28 09:34:06 UTC
Thanks for testing.

let me know what happens after you reboot the AP.
I have a new debug mechanism in 3.17 that can be very useful, but for that I'd need the FW team and they are typically very busy (not that I am not).
Comment 38 Andreas 2014-07-28 09:45:05 UTC
Hi,

you will not be surprised: I'm here as I have the same issue. The kernel I'm using is 3.13.0-32-generic from Ubuntu 14.04 x86_64. If you ask me to debug stuff please give me a very easy howto, as I'm not as deep into it ;-).

What I want to add is an observation I made: I'm working at 3 locations with 3 access-points. All 3 of them have the issue, but:
* 2 Fritz!Boxes (7390 and 7490) are more stable than a Sophos AP30.
* When I'm sitting directly next to one of my Fritz!Boxes I can work much longer (several hours) without connection-losses and reconnection mostly works. At a greater distance I ran into the issue quite fast (within a few minutes) and got the connection back only by unloading the modules and reloading them.

Maybe this helps. If not: Just ignore it ;-).



Andreas
Comment 39 Johannes Stezenbach 2014-07-28 12:30:43 UTC
After AP reboot the connection works.  I've taken network
captures with wireshark before and after reboot, both using
mon0 on the iwlwifi and on a Ralink USB dongle (just in case).
I also captured the WPA2 connection, it shows one beacon is
received right after the key exchange, then the next beacon
~5 seconds later (apparently beacon filter at work).

After inspecting the captures, I found the beacon sent by the
AP appears to be corrupt.  I guess the firmware drops it, but
the corrupt beacon still has enough valid information to make
the connection work with other devices.
Comment 40 Johannes Stezenbach 2014-07-28 12:34:17 UTC
Created attachment 144411 [details]
bad beacon

corrupted beacon before AP reboot, one 00 byte is missing
at the end of the WMM/WME element, causing the following
elements (including HT capabilities) to be ignored
Comment 41 Johannes Stezenbach 2014-07-28 12:34:48 UTC
Created attachment 144421 [details]
good beacon

good beacon after AP reboot
Comment 42 Emmanuel Grumbach 2014-07-28 13:55:51 UTC
Ok - that was fruitful

The bad beacon is really messed up.
You should check if you can upgrade the AP's firmware. Since the beacon is broken, the Intel firmware will throw them away (unless we are in sniffer mode).

This answers all the questions :)
Changing our firmware to let the beacons go is not trivial (just asked the FW team). I am not saying we won't make that change, I am just saying it will be a long shot.
Comment 43 Johannes Stezenbach 2014-07-31 10:28:21 UTC
FWIW, I found there was indeed a firmware update for the AP available.
I installed it and the issue reappeared after just two
days of uptime :-(
It means if you have a 7260 AC firmware update I could still test it.
(The particular hardware revision of the AP has issues with OpenWRT,
otherwise I would install it right away.)
Comment 44 Emmanuel Grumbach 2014-07-31 10:38:00 UTC
can you please sniff for the beacon again - to make sure that we are having the same issue?
Comment 45 Johannes Stezenbach 2014-07-31 10:54:25 UTC
I did it already, the error is exactly the same: The last
(zero) byte of the WMM/WME IE is missing, i.e. the next IEs
start one byte too early and thus can't be decoded.
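
For readers who want to see why one missing byte is so destructive, here is a minimal Python sketch of a strict ID/length/body walk over a beacon's IEs. Element IDs 221 (vendor specific, used by WMM/WME) and 45 (HT capabilities) are the standard ones, but the element bodies below are zero-filled placeholders and not the real contents of the attached captures. How the Intel firmware actually validates beacons is not documented in this thread, so treat this only as an illustration of the framing problem.

```python
def parse_ies(buf):
    """Walk 802.11 information elements (1-byte ID, 1-byte length, body).

    Returns (elements, ok); ok is False when the buffer ends in the
    middle of an element.
    """
    elements = []
    offset = 0
    while offset < len(buf):
        if offset + 2 > len(buf):
            return elements, False
        element_id = buf[offset]
        length = buf[offset + 1]
        end = offset + 2 + length
        if end > len(buf):
            return elements, False
        elements.append((element_id, buf[offset + 2:end]))
        offset = end
    return elements, True


def make_ie(element_id, body):
    """Build one well-formed element."""
    return bytes([element_id, len(body)]) + body


SSID = make_ie(0, b"foo")
# Placeholder 24-byte WMM/WME body: Microsoft OUI, type, subtype, version, then zeros.
WMM_BODY = bytes.fromhex("0050f2020101") + bytes(18)
HT_CAPS = make_ie(45, bytes(26))  # placeholder HT capabilities body

GOOD = SSID + make_ie(221, WMM_BODY) + HT_CAPS
# Simulated AP bug: the length field still says 24, but the last 00 byte is gone.
BAD = SSID + bytes([221, len(WMM_BODY)]) + WMM_BODY[:-1] + HT_CAPS

if __name__ == "__main__":
    for name, frame in (("good", GOOD), ("bad", BAD)):
        elements, ok = parse_ies(frame)
        print(name, [eid for eid, _ in elements], "well-formed:", ok)
```

In the bad frame the WMM element swallows the first byte of the HT capabilities element, so ID 45 is never seen and the walk ends in the middle of an element. A lenient parser can still use the elements before the damage, which may be why other devices associate anyway.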
Comment 46 Emmanuel Grumbach 2014-07-31 11:00:43 UTC
FWIW: thanks to this input - I talked to people here and we will try to see what we can do in these cases.
It will take time though...
Comment 47 Emmanuel Grumbach 2014-08-25 14:12:38 UTC
I will close this bug as 3rd party. After all, it has been clearly proven that the bug is on the AP side.
Comment 48 Emmanuel Grumbach 2014-09-02 10:42:50 UTC
FW team has started to look for a way not to drop the beacon - not sure it will find one though...
Comment 49 Johannes Stezenbach 2014-09-02 10:53:16 UTC
Thank you!

One idea: is the dropped beacon related to the beacon filter, i.e.
could the issue be fixed by enabling the beacon filter only after
the first beacon has been received?  I'm not sure how the firmware
works, is there a possibility to disable the beacon filter
completely for testing?  (I'm assuming this could be tested
on driver level without firmware change.)
Comment 50 Emmanuel Grumbach 2014-09-02 11:04:00 UTC
the problem is not the power save feature called beacon filtering.
The problem is that the beacon is so broken that the firmware drops the packet because it can't find the IEs in the right place. So basically, we need to change the firmware to be more permissive, but this is very risky, because then there is code that will run on a broken beacon. Assumptions that were made no longer hold. This is why it is not simple at all.
Comment 51 Emmanuel Grumbach 2014-09-02 11:25:29 UTC
Created attachment 149071 [details]
FW that doesn't drop the beacon

Please try this firmware.
I can't promise that the "fix" will be delivered to the code base and that we will be able to formally release this "fix".
Comment 52 Johannes Stezenbach 2014-09-02 13:49:04 UTC
The firmware can connect, but the connection speed is very slow
(like several seconds just for a DNS lookup) and the connection
is unstable.

[    1.319821] iwlwifi 0000:04:00.0: loaded firmware version 23.214.9.0 op_mode iwlmvm
...
[    9.811460] wlan1: authenticate with f8:d1:11:39:1a:8c
[    9.814692] wlan1: send auth to f8:d1:11:39:1a:8c (try 1/3)
[    9.826796] wlan1: authenticated
[    9.826913] wlan1: associating with AP with corrupt beacon
[    9.829009] wlan1: associate with f8:d1:11:39:1a:8c (try 1/3)
[    9.835252] wlan1: RX AssocResp from f8:d1:11:39:1a:8c (capab=0x31 status=0 aid=1)
[    9.836292] wlan1: associated

BTW, I found the "corrupt beacon" message is from mac80211/mlme.c,
it sets the IEEE80211_BSS_CORRUPT_BEACON flag.  Seems this
problem is not so uncommon...
Comment 53 Emmanuel Grumbach 2014-09-02 13:52:11 UTC
try to disable power save:

sudo iw wlan1 set power_save off
Comment 54 Johannes Stezenbach 2014-09-02 14:32:47 UTC
That works much better, download speed fluctuates between
100KB/s and 1MB/s.  Usable.

Thanks!
Comment 55 Emmanuel Grumbach 2014-09-02 19:15:34 UTC
yeah... but ... This is not something that we can live with...
This AP is just ... bad. And we have no way to detect how bad it is and disable power save when we face such bad AP...

Even if the FW will integrate the change they made for you (and this is far from being obvious), we still have a big challenge with your AP...

Note that other vendors seem not to implement power save (not that I have any real data - but how could they get 2.4Mps?).
Comment 56 Johannes Stezenbach 2014-09-03 11:58:21 UTC
Some data points I remember from past tests:
- when I rebooted the AP, the driver worked with acceptable
  speed even without power save disabled
- when I applied the workaround given at the end of the issue
  description, speed was also acceptable (but I wonder how power
  save could work without beacon reception, maybe it was disabled
  implicitly?)
- Android devices seem to have no issues with the AP, I think (hope)
  these generally have power save enabled

Thus I think the AP is buggy but not completely broken.

Maybe the experimental firmware forwards the broken beacon
but disables the internal processing related to power save,
e.g. never sends queued frames?

IIRC the rt2800usb driver does not support power save (or rather
the USB hardware interface doesn't, I think the rt2800pci devices
support power save).
Comment 57 Leho Kraav 2014-09-04 08:51:45 UTC
Here's another one struggling with the AC7260 on kernel 3.16.1. At a relatively large conference (200 people in the room), the adapter initially connected, but the connection was very intermittent. After I did rmmod iwlmvm and re-modprobe'd, the adapter wouldn't connect at all. It's not possible to reboot APs at conferences etc.

Adapter has been working fine at home and office and mostly everywhere else though, I'd say 95+% uptime. Kinda sucks to run into the issue unexpectedly like this though.

Currently I'm at a nearby shop buying a couple of backup USB wifi adapters. They have rt2800usb and rtl8192uc-based stuff for sale here so I'll be able to test the same environment with those and can report back.
Comment 58 Emmanuel Grumbach 2014-09-04 08:59:00 UTC
@Leho - if you don't see the exact same print (No association and the time event is over) please don't report in this bug.
Find another one, or open a new one.
Comment 59 Leho Kraav 2014-09-04 10:36:18 UTC
Emmanuel, it is the exact same thing. The very same message "No association and the time event is over" is there, association fails after WPA password entry etc. Apologies for not being more detailed about it, it's difficult to provide logs in the middle of the day right now.

I do have the two additional USB adapters available now. This provided an additional discovery immediately. When this non-connection state happened with iwlwifi, I suspended the machine, went to the shop, and started testing the RT5370 and RTL8192UC based USB adapters. I set up a Galaxy S5 based hotspot and NONE of the 3 wifi adapters were able to finish the connection, getting stuck at exactly the same place (well, based on what shows in dmesg).

This seems to indicate that the whole wifi stack gets into a confused state of some sort? Wired network connection worked fine in the shop.

Cold rebooted and wifi connectivity was restored. After coming back from a reboot, all adapters connected to the S5 hotspot without issues. Suspended machine, walked back to conference.

Now I'm sitting at the conference, trying out the rt2800usb based "148f:5370 Ralink Technology, Corp. RT5370 Wireless Adapter". It connected to the AP but traffic throughput was very intermittent, just like with iwlwifi.

I have just cold rebooted with /etc/modprobe.d/wifi conf including "blacklist iwlwifi iwlmvm". Connection has been now running with great speed on top of rt2800usb without any apparent issues.
Comment 60 Emmanuel Grumbach 2014-09-04 10:58:12 UTC
First thing I'd try is to disable power save.
I think that quite a few vendors (can't really say that too loud :)) don't implement power save. And this can avoid lots of issue with broken APs...
So please do just like in Comment#53
Comment 61 Leho Kraav 2014-09-04 11:03:35 UTC
Aha. I missed that it was a post-connection iw command. But this is a bit confusing overall, because modinfo iwlwifi says:

...
vermagic:       3.16.1 SMP preempt mod_unload 
...
  parm:           power_save:enable WiFi power management (default: disable) (bool)
...

And hence I have been living with the assumption that power save has always been disabled by default and therefore any "disable power_save" advice doesn't apply. What's the truth here?
Comment 62 Emmanuel Grumbach 2014-09-04 11:06:32 UTC
yeah iwlwifi also has a powersave parameter - this is an old legacy :)

By default iwlmvm will have powersave enabled, iwldvm will not (well not aggressive powersave). I know, confusing... :)
Comment 63 Emmanuel Grumbach 2014-10-27 07:09:57 UTC
Can someone update on this bug with the latest firmware we released:

https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?id=1f9f9df353b11c9ea0130dfe68053aaaee376df3

I don't think that it should be fixed.
But OTOH, it is worth checking.

Note that since this bug is caused by an AP bug - it is very low priority.
Comment 64 Johannes Stezenbach 2014-10-27 12:21:56 UTC
The new firmware can't connect (while the one from comment 51 can).

[  483.981326] iwlwifi 0000:04:00.0: loaded firmware version 25.228.9.0 op_mode iwlmvm

[  495.217051] wlan1: authenticate with f8:d1:11:39:1a:8c
[  495.220118] wlan1: send auth to f8:d1:11:39:1a:8c (try 1/3)
[  495.221872] wlan1: authenticated
[  495.222514] wlan1: associate with f8:d1:11:39:1a:8c (try 1/3)
[  495.226764] wlan1: RX AssocResp from f8:d1:11:39:1a:8c (capab=0x31 status=0 aid=2)
[  495.229703] wlan1: associated
[  495.527188] iwlwifi 0000:04:00.0: No association and the time event is over already...
[  495.527225] wlan1: Connection to AP f8:d1:11:39:1a:8c lost

# sha1sum /lib/firmware/iwlwifi-7260-9.ucode
98fb865e5f0c7b2bf52dc5a1ee77a0752eea75ad  /lib/firmware/iwlwifi-7260-9.ucode
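
When comparing firmware builds like this, the two relevant facts can be pulled out of the kernel log with a couple of small filters. This is only a sketch: the helper names are hypothetical, and the patterns are taken from the log lines quoted in this report, so they may need adjusting for other kernel versions.

```shell
# Print "<version> <op_mode>" from iwlwifi's "loaded firmware" log line.
# Reads the kernel log on stdin.
fw_version() {
    sed -n 's/.*loaded firmware version \([^ ]*\) op_mode \([^ ]*\).*/\1 \2/p'
}

# Count how many times the association time event expired.
# Reads the kernel log on stdin; prints 0 when there are no matches.
count_time_event_failures() {
    grep -c 'No association and the time event is over already' || true
}
```

Usage: dmesg | fw_version, and dmesg | count_time_event_failures.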
Comment 65 Emmanuel Grumbach 2014-11-04 07:45:03 UTC
After a long discussion with the firmware team, they can't integrate the change that they made into the code base for the moment.

Closing the bug as Will not Fix.
Comment 66 JC 2015-02-12 18:35:33 UTC
I have a Sony Xperia L with the latest stock firmware, and the only way to connect to the wifi hotspot on this device from my Latitude E7440 with a 7260 is to use the firmware from comment #51.

Other devices (for example another notebook with another wifi card) connect to the Xperia hotspot w/o any problem.

Is it possible to implement it as a driver option? I know changes in firmware are still necessary.
Comment 67 Emmanuel Grumbach 2015-02-12 19:16:50 UTC
Please try the latest firmware from: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git/

You'll need -10 or -12.
We added a few relaxations (for problems seen with AirPort Extreme). But it still won't help for the broken beacon attached in this bug.
Please give it a try.
Comment 68 JC 2015-02-18 14:39:54 UTC
Debian unstable, linux-image-3.19.0-trunk-686-pae, latest -12 firmware: CAN'T CONNECT
Comment 69 Emmanuel Grumbach 2015-02-18 15:02:08 UTC
Sorry, nothing can be done from the driver.
Comment 70 JC 2015-02-18 15:40:10 UTC
And from firmware?
Comment 71 Emmanuel Grumbach 2015-02-18 17:25:22 UTC
I am afraid that won't happen...
Comment 72 JC 2015-02-18 22:31:29 UTC
It's sad. IMHO it's against the "be conservative in what you send and liberal in what you receive" rule. OK, I can throw my poor Sony phone in the trash and get a newer one (again) or do the same with the Intel wifi card. But I would prefer to make the software compatible.

I understand that fixing the real cause would be better, but it's not possible now. Sony will never release a fix, and unofficial ROMs with newer Android have other bugs and are still in beta stage.
Comment 73 Leho Kraav 2015-03-28 23:14:28 UTC
Guys, I have noticed that if I get into the "No association and the time event is over already..." cycle, restarting wpa_supplicant (2.3, but probably earlier too) helps. This is what I do:

modprobe -rv iwlmvm # keeps NetworkManager from auto-relaunching
systemctl stop NetworkManager
pkill -e wpa
modprobe -v iwlwifi
systemctl start NetworkManager
Comment 74 Leho Kraav 2015-03-28 23:14:57 UTC
Just upgraded to wpa_supplicant-2.4, we'll see how this does.
Comment 75 Emmanuel Grumbach 2015-03-29 05:52:43 UTC
@Leho - you are restarting the whole driver.

And if that helps, it means that you are not suffering from the same bug as Johannes.
Comment 76 Leho Kraav 2015-03-29 09:55:34 UTC
Yes, in the example I'm restarting everything. NetworkManager didn't auto-restart in versions <1.0.0, but it seems to now. Either way, previous isolation attempts have proven that on this system, it has been wpa_supplicant alone responsible for... something. Would be interesting to hear from others, not sure where else if not on this bug though.
Comment 77 Cacho Nero 2015-08-07 00:10:40 UTC
Emmanuel, is there any chance that you have a newer version of the firmware with the patch from Comment#51? I'm currently using Linux 4.1 and it won't accept anything under ucode-10.

Thank you.
Comment 78 Emmanuel Grumbach 2015-08-07 05:01:49 UTC
No.
Comment 79 Alexander E. Patrakov 2017-03-15 14:45:11 UTC
For some (but not all) cases of this AP brokenness, this helps:

modprobe iwlwifi 11n_disable=1

(tested in Domicilio Lorenzo hotel in Davao City, PH - one of their access points reliably triggers the issue)
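
To keep the parameter across reboots, it can be written as an "options" line into a file under /etc/modprobe.d/. A minimal sketch follows; the helper name and the file name iwlwifi.conf in the usage line are assumptions for illustration, not something the driver requires.

```shell
# Write an "options iwlwifi ..." line to the given modprobe.d file.
# First argument: target file; remaining arguments: module parameters.
# Note: this overwrites the target file.
write_iwlwifi_options() {
    conf_file=$1
    shift
    printf 'options iwlwifi %s\n' "$*" > "$conf_file"
}
```

Usage (as root): write_iwlwifi_options /etc/modprobe.d/iwlwifi.conf 11n_disable=1, then reload the module. Once iwlwifi is loaded, the active value should be readable from /sys/module/iwlwifi/parameters/11n_disable.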
Comment 80 JC 2017-03-16 08:44:10 UTC
I had some TP-Link APs running OpenWrt. They were unstable with all firmware versions, module parameters (11n_disable for example) etc. After months I changed the AP to a MikroTik router (Atheros AR9300). Unstable again. Then I changed "disconnect timeout" to 15s on the MikroTik and voila, it works w/o probs even with default iwlwifi settings. I think the Intel card/firmware/module is buggy or too strict for some APs. It is barely usable in some configurations.