Bug 97181

Summary: iwlwifi: mvm: can't retry to associate if the AP sent deauth between auth and assoc - MWG100237204
Product: Drivers Reporter: ailin.nemui+kernel
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: ailin.nemui+kernel, ilw, mail
Priority: P1    
Hardware: All   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=119241
Kernel Version: 4.0.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: journalctl output when trying to connect with -12 firmware
journalctl output when successfully connecting with -9 firmware
backport based version
trace file with newest driver/firmware
another trace with wpa_supplicant -dd
supplicant log with -dd
Core10 firmware with trigger on auth timeout
Firmware dump do debug this issue
dmesg before saving the firmware dump
Core10 firmware with trigger on auth timeout
new dump
Core10 firmware with trigger on auth timeout
new dump
iwlwifi with BT AG reverted
fix
dmesg log from latest driver
disable vendor commands
dmesg log with lots of microcode errors

Description ailin.nemui+kernel 2015-04-23 22:16:26 UTC
Created attachment 174941 [details]
journalctl output when trying to connect with -12 firmware

Since I installed the newest firmware I am unable to connect to the wireless network. For now I can only show you some output from journalctl, I will try to help if you tell me what to do but I'm more a regular user than kernel developer. Using my distribution provided kernel compiles (openSUSE 4.0.0-2.g53b125b-desktop)

Going back to -9 makes the network connect again. I also attached that journalctl output in the noproblem log file.
Comment 1 ailin.nemui+kernel 2015-04-23 22:17:00 UTC
Created attachment 174951 [details]
journalctl output when successfully connecting with -9 firmware
Comment 2 Emmanuel Grumbach 2015-04-24 06:05:42 UTC
please try with this firmware:
https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-7260-12.ucode
Comment 3 ailin.nemui+kernel 2015-04-24 19:28:09 UTC
Hi Emmanuel,

Thanks for your reply. Before: iwlwifi 0000:02:00.0: loaded firmware version
 25.15.12.0 op_mode iwlmvm

Your link: iwlwifi 0000:02:00.0: loaded firmware version
 25.17.12.0 op_mode iwlmvm

No change in behaviour -- I cannot connect with either one.
Comment 4 Emmanuel Grumbach 2015-04-25 18:28:09 UTC
please provide the dmesg output without any debug flag.
Please record tracing:

sudo trace-cmd record -e iwlwifi -e iwlwifi_msg -e mac80211 -e cfg80211
Comment 5 ailin.nemui+kernel 2015-04-27 16:13:48 UTC
I get this error:

/sys/kernel/debug/tracing/events/cfg80211/filter
/sys/kernel/debug/tracing/events/*/cfg80211/filter
/sys/kernel/debug/tracing/events/mac80211/filter
/sys/kernel/debug/tracing/events/*/mac80211/filter
/sys/kernel/debug/tracing/events/iwlwifi_msg/filter
/sys/kernel/debug/tracing/events/*/iwlwifi_msg/filter
trace-cmd: No such file or directory
  No events enabled with iwlwifi_msg


I guess the distro people who made my kernel didn't enable trace support m(? will see about compiling my own kernel in the next days...
Comment 6 Emmanuel Grumbach 2015-04-27 17:42:34 UTC
Created attachment 175141 [details]
backport based version

You can try the backport based driver I am attaching.
Comment 7 ailin.nemui+kernel 2015-04-28 18:28:26 UTC
Created attachment 175211 [details]
trace file with newest driver/firmware

hi, thank you for the backport. I installed it and created the attached trace with trace-cmd record -e iwlwifi -e iwlwifi_msg -e mac80211 -e cfg80211

I hope it can help you to fix this annoying bug of mine.
Comment 8 Emmanuel Grumbach 2015-04-28 18:44:20 UTC
Ok - I can't see why all this is happening.
Please record a new trace and attach the dmesg output (without any debug flags)
I'd also need the logs of the supplicant with -dd to make it verbose.

Thanks.
Comment 9 ailin.nemui+kernel 2015-04-28 19:16:46 UTC
for now here is the dmesg log that matches the above trace:

[21161.906569] iwlwifi 0000:02:00.0: L1 Enabled - LTR Enabled
[21161.907175] iwlwifi 0000:02:00.0: L1 Enabled - LTR Enabled
[21162.029875] iwlwifi 0000:02:00.0: L1 Enabled - LTR Enabled
[21162.030479] iwlwifi 0000:02:00.0: L1 Enabled - LTR Enabled
[21166.648611] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21166.753330] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21166.857427] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21166.961554] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21180.164180] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21180.268300] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21180.372348] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21180.476434] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21194.769966] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21194.875093] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21194.979193] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21195.081272] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21217.772526] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21217.879633] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21217.983725] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21218.088825] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21240.773098] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21240.876210] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21240.978324] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21241.082407] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21260.207799] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21260.311938] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21260.418022] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21260.522106] wlp2s0: association with 54:e6:fc:e5:53:5c timed out
[21270.637224] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
[21270.741345] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 2/3)
[21270.845485] wlp2s0: associate with 54:e6:fc:e5:53:5c (try 3/3)
[21270.947542] wlp2s0: association with 54:e6:fc:e5:53:5c timed out


Here is the matching output in wpa_supplicant.log

wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: Authentication with 54:e6:fc:e5:53:5c timed out.
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: Authentication with 54:e6:fc:e5:53:5c timed out.
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="TP-LINK_E5535C" auth_failures=1 duration=10 reason=CONN_FAILED
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: Authentication with 54:e6:fc:e5:53:5c timed out.
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="TP-LINK_E5535C" auth_failures=1 duration=10 reason=CONN_FAILED
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: Authentication with 54:e6:fc:e5:53:5c timed out.
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="TP-LINK_E5535C" auth_failures=1 duration=10 reason=CONN_FAILED
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: CTRL-EVENT-SSID-REENABLED id=0 ssid="TP-LINK_E5535C"
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: Association request to the driver failed
wlp2s0: Authentication with 54:e6:fc:e5:53:5c timed out.
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="TP-LINK_E5535C" auth_failures=1 duration=10 reason=CONN_FAILED
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: CTRL-EVENT-SSID-REENABLED id=0 ssid="TP-LINK_E5535C"
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1
wlp2s0: CTRL-EVENT-SCAN-STARTED 
wlp2s0: Trying to associate with 54:e6:fc:e5:53:5c (SSID='TP-LINK_E5535C' freq=2462 MHz)
wlp2s0: CTRL-EVENT-DISCONNECTED bssid=54:e6:fc:e5:53:5c reason=3 locally_generated=1

I will try to make a new/more traces if needed, but can you give me a little hand here what and how I need to mess with supplicant to add this -dd flag? do I need to change the unit file and restart it?
Comment 10 Emmanuel Grumbach 2015-04-29 09:47:30 UTC
your supplicant is doing something weird I think.
What is its version?

can you try this:

diff --git a/drivers/net/wireless/iwlwifi/mvm/mac80211.c b/drivers/net/wireless/iwlwifi/mvm/mac80211.c
index dda887f..72e1b7a 100644
--- a/drivers/net/wireless/iwlwifi/mvm/mac80211.c
+++ b/drivers/net/wireless/iwlwifi/mvm/mac80211.c
@@ -455,9 +455,11 @@ int iwl_mvm_mac_setup_register(struct iwl_mvm *mvm)
                hw->flags |= IEEE80211_HW_MFP_CAPABLE;

        hw->flags |= IEEE80211_SINGLE_HW_SCAN_ON_ALL_BANDS;
+#if 0
        hw->wiphy->features |=
                NL80211_FEATURE_SCHED_SCAN_RANDOM_MAC_ADDR |
                NL80211_FEATURE_SCAN_RANDOM_MAC_ADDR;
+#endif

        hw->sta_data_size = sizeof(struct iwl_mvm_sta);
        hw->vif_data_size = sizeof(struct iwl_mvm_vif);
@@ -544,12 +546,14 @@ int iwl_mvm_mac_setup_register(struct iwl_mvm *mvm)
        else
                hw->wiphy->flags &= ~WIPHY_FLAG_PS_ON_BY_DEFAULT;

+#if 0
        hw->wiphy->flags |= WIPHY_FLAG_SUPPORTS_SCHED_SCAN;
        hw->wiphy->max_sched_scan_ssids = PROBE_OPTION_MAX;
        hw->wiphy->max_match_sets = IWL_SCAN_MAX_PROFILES;
        /* we create the 802.11 header and zero length SSID IE. */
        hw->wiphy->max_sched_scan_ie_len =
                SCAN_OFFLOAD_PROBE_REQ_SIZE - 24 - 2;
+#endif

        hw->wiphy->features |= NL80211_FEATURE_P2P_GO_CTWIN |
                               NL80211_FEATURE_LOW_PRIORITY_SCAN |
Comment 11 ailin.nemui+kernel 2015-05-03 10:19:31 UTC
your patch didn't change anything (still cannot connect) 

I attached a new trace (before your patch) and the wpa_supplicant log (it's wpa_supplicant 2.2)
Comment 12 ailin.nemui+kernel 2015-05-03 10:20:17 UTC
Created attachment 175661 [details]
another trace with wpa_supplicant -dd
Comment 13 ailin.nemui+kernel 2015-05-03 10:21:05 UTC
Created attachment 175671 [details]
supplicant log with -dd
Comment 14 Emmanuel Grumbach 2015-05-06 12:41:45 UTC
Created attachment 176001 [details]
Core10 firmware with trigger on auth timeout

please take this firmware and work with the backport driver I sent earlier.

When you'll have an authentication timeout, a firmware dump will be created.

Please collect the data by:

cat /sys/devices/virtual/devcoredump/devcd1/data > iwl-1.dump
echo 1 > /sys/devices/virtual/devcoredump/devcd1/data

You will probably have 5 dumps until it will stop dumping the data.
Please attach the dumps to the bug and I will transfer them to the firmware team.

Please take the time to check our privacy note here:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#privacy_aspects

thanks
Comment 15 ailin.nemui+kernel 2015-05-11 20:45:37 UTC
Hi Emmanuel,

I attach the dump, there was only one, I fear there might be an issue. See the dmesg log, it says Sorry - debug buffer is only 4096K while you requested 65536K.
Comment 16 ailin.nemui+kernel 2015-05-11 20:46:12 UTC
Created attachment 176451 [details]
Firmware dump do debug this issue
Comment 17 ailin.nemui+kernel 2015-05-11 20:46:40 UTC
Created attachment 176461 [details]
dmesg before saving the firmware dump
Comment 18 Emmanuel Grumbach 2015-05-11 20:55:19 UTC
(In reply to ailin.nemui+kernel from comment #15)
> Hi Emmanuel,
> 
> I attach the dump, there was only one, I fear there might be an issue. See
> the dmesg log, it says Sorry - debug buffer is only 4096K while you
> requested 65536K.

This is fine. But the fact that you have in error interrupt after the collection is strange.

Thanks - I'll send the data to the firmware team.
Comment 19 Emmanuel Grumbach 2015-05-13 19:42:22 UTC
Created attachment 176731 [details]
Core10 firmware with trigger on auth timeout

Hi,

the firmware team asked if it is possible to reproduce and get a new dump with the firmware attached.
It will dump more debug info.

thanks you.
Comment 20 ailin.nemui+kernel 2015-05-14 08:01:08 UTC
Created attachment 176771 [details]
new dump

find the new dump attached. weird to me that it is even smaller than before ;-)
Comment 21 Emmanuel Grumbach 2015-05-14 08:03:54 UTC
thanks.
Comment 22 Emmanuel Grumbach 2015-05-14 09:10:51 UTC
Created attachment 176791 [details]
Core10 firmware with trigger on auth timeout

I am sorry, I made a stupid mistake.
I'll have to ask you to repro with the firmware attached.

Sorry about that.
Comment 23 ailin.nemui+kernel 2015-05-14 16:21:08 UTC
Created attachment 176881 [details]
new dump

good luck
Comment 24 Emmanuel Grumbach 2015-05-14 18:05:19 UTC
this one worked thanks.

All this debug infra is still quite new and it has some glitches from time to time ;)
Comment 25 mail@telenczuk.pl 2015-05-29 22:48:36 UTC
I experience the same problem in kernel 3.19.0.
Comment 26 Emmanuel Grumbach 2015-05-30 17:44:24 UTC
Are you using Bluetooth?
Comment 27 Emmanuel Grumbach 2015-05-30 17:45:16 UTC
Can you try the backport based version attached to this bug?
Comment 28 mail@telenczuk.pl 2015-05-30 21:00:06 UTC
that worked! Thanks!

I'm not using bluetooth, although I found that it was switched on.
Comment 29 Emmanuel Grumbach 2015-05-31 07:19:44 UTC
(In reply to mail@telenczuk.pl from comment #28)
> that worked! Thanks!
> 

Ok - this is really strange. We are investigating.

I could reproduce but only if I am far away from the AP. Can you please try to go back to the previous driver (just delete the updates dir in /lib/modules/`uname -r`/) and let me know if you can reproduce the problem when you are really close to the AP?

thanks.
Comment 30 Emmanuel Grumbach 2015-05-31 18:33:45 UTC
Created attachment 178431 [details]
iwlwifi with BT AG reverted

Ok - I think I found the problem.

Can you please try the backport tree attached?
You should be able to reproduce the problem on that tree.
Comment 31 Emmanuel Grumbach 2015-05-31 18:34:51 UTC
Created attachment 178441 [details]
fix

Then, please apply the patch attached to the backport tree. And let me know what happens.

I hope that this will fix the problem.

Thank you.
Comment 32 Emmanuel Grumbach 2015-05-31 19:16:02 UTC
Moving this bug to fixed.

I will wait a few days to get confirmations.
Comment 33 ailin.nemui+kernel 2015-05-31 20:25:27 UTC
Is this supposed to fix my issue as well?
Comment 34 Emmanuel Grumbach 2015-05-31 22:50:55 UTC
Yes
Comment 35 Emmanuel Grumbach 2015-06-01 06:24:47 UTC
(In reply to ailin.nemui+kernel from comment #33)
> Is this supposed to fix my issue as well?

I just noticed that your AP is sending deauth with a bad sequence number.
we should be able to cope with that even if your AP is really at fault here.
I need to investigate this a little more.
Comment 36 Emmanuel Grumbach 2015-06-01 11:25:44 UTC
Ok we had 2 bugs here.
@ailin.nemui+kernel@gmail.com

can you try:

https://p.sipsolutions.net/d21ea08039a75985.txt


thanks.
Comment 37 Emmanuel Grumbach 2015-06-01 12:54:19 UTC
All the fixes have been merged to the master branch of:

https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git/log/

please test this.
Comment 38 ailin.nemui+kernel 2015-06-01 21:06:49 UTC
So sad, it still doesn't work :-(

I tried with the backport-iwlwifi from git together with the firmwares from git, but it just doesn't connect. I also turned off bluetooth but that didn't change anything. *Cries*
Comment 39 ailin.nemui+kernel 2015-06-01 21:07:16 UTC
Created attachment 178521 [details]
dmesg log from latest driver
Comment 40 mail@telenczuk.pl 2015-06-01 21:51:11 UTC
It worked for me.
Comment 41 ailin.nemui+kernel 2015-06-01 22:08:40 UTC
one curious thing I noticed is that on the working driver, the dmesg says

wlp2s0: authenticate with 54:e6:fc:e5:53:5c
wlp2s0: send auth to 54:e6:fc:e5:53:5c (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
wlp2s0: RX AssocResp from 54:e6:fc:e5:53:5c (capab=0x431 status=0 aid=2)
wlp2s0: associated



i.e. it first says authenticate, whereas on the broken ones it never says authenticate, only

wlp2s0: associate with 54:e6:fc:e5:53:5c (try 1/3)
wlp2s0: deauthenticated from 54:e6:fc:e5:53:5c while associating (Reason: 9=STA_REQ_ASSOC_WITHOUT_AUTH)
Comment 42 Emmanuel Grumbach 2015-06-01 23:35:32 UTC
please record tracing with the latest backport tree.
I also would like to see the supplicant output (with -dd).

thanks.
Comment 43 Emmanuel Grumbach 2015-06-02 07:19:30 UTC
I looked again at your older supplicant logs and it looks strange.

can you please run your supplicant with -T while you record kernel tracing?

thanks.
Comment 44 Emmanuel Grumbach 2015-06-02 07:35:45 UTC
can you please make sure you have only one instance of the supplicant running?
It seems from your logs from comment 13 that you had two supplicants running which is surely a problem.

When you want to add debug level to the supplicant, you need to make sure that the daemon spawned by the OS isn't running. Note that the network manager will re-start the supplicant when you kill it. You'd have to stop the network manager as well.
Comment 45 Emmanuel Grumbach 2015-06-02 08:11:08 UTC
Created attachment 178561 [details]
disable vendor commands

Johannes found a bug in supplicant 2.2:

can you try the patch attached and apply it to the backprot tree to work around the bug in the supplicant?

Thanks.
Comment 46 Johannes Berg 2015-06-02 09:04:15 UTC
For posterity, here's a description:

we have a vendor command 11, and it's causing the following:
 * subcmd 11 interpreted as qca radar detect
 * get_capa tries to call it, but obviously it can't
 * get_capa returns an error, so wpa_s->drv_flags don't get set, however drv->capa.flags *is* set
 * wpa_s calls associate() to do a connect(), but driver_nl80211 knows SME is supported so thinks this is pure associcate()

wpa_s's commit 5f9c92f8f7063b8e35118b6b4d39d10a46fea20f fixed the first step here
Comment 47 ailin.nemui+kernel 2015-06-02 09:10:24 UTC
Hi,

on a whim, I updated wpa_supplicant to 2.4 prior to receiving all of your comments, and indeed this solves the issue! I can now connect again. Your explanation sounds highly likely as the underlying issue.

However, now my dmesg log is getting spammed like in the attachment, does this tell you anything? Maybe it should be moved to another issue though.
Comment 48 ailin.nemui+kernel 2015-06-02 09:11:11 UTC
Created attachment 178571 [details]
dmesg log with lots of microcode errors
Comment 49 Johannes Berg 2015-06-02 09:12:49 UTC
btw, the bug was only fixed in supplicant master branch, not released yet, so versions 2.2 through 2.4 are affected.
Comment 50 Johannes Berg 2015-06-02 09:14:14 UTC
Interesting, I don't think 2.4 fixes this bug? but perhaps it has some slightly different behaviour somewhere along the line so it doesn't affect it as much ...
Comment 51 ailin.nemui+kernel 2015-06-02 09:15:17 UTC
oh, weird. 2.4 makes it connect again. but maybe there is a more severe issue combined with 2.2
Comment 52 Emmanuel Grumbach 2015-06-02 09:32:27 UTC
The Microcode errors are unrelated.
You can downgrade to -13.ucode to check it is better.
Comment 53 ailin.nemui+kernel 2015-06-02 11:04:04 UTC
fyi it's exactly the same with -13.ucode
Comment 54 Emmanuel Grumbach 2015-06-02 11:06:16 UTC
Please open a new bug.

thank you.
Comment 55 Emmanuel Grumbach 2015-06-11 10:18:11 UTC
FYI - the Microcode errors have been fixed in our latest driver.
Please pull from backport.