Bug 195299

Summary: iwlwifi: 8265: NULL pointer dereference when recovering from 14FD sysassert - WIFILNX-786
Product: Drivers
Reporter: djagoo (dev)
Component: network-wireless
Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX
Severity: high
CC: auroux, bugzilla.kernel.2017, candrews, elitebadger, james, lsiudut, luca, mail2benny, mhjungk, mmfmarin, oposum, piotrsbk, roberto.catini, techtebatoye, willismonroe, wompy, woutermont
Priority: P1
Hardware: Intel
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=193641
Kernel Version: 4.10.8-1-ARCH
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg output
dmesg after using -22.ucode
oops log
patch with potential fix
dmesg with kernel patch
dmesg with proposed patch
dmesg kernel 4.11.2-1-ARCH with 5Ghz wifi connecting again
dmesg on Arch - linux 4.11.2-1- linux firmware 20170422.ade8332-1 - ucode 20170511-1
dmesg with kernel 4.11.3, Intel 7260 AC, firmware 17.459231.0
Syslog from kernel oops
oops, kernel 4.9, core28 drivers, firmware v31
patch to fix restart problem
patch to fix restart problem (for upstream, 4.9)
kernel log generated during connection with the patch
iw-scan
iwlwifi-fix
cfg80211 - fix
iwlwifi-fix
iwlwifi-fix
cfg80211 - fix

Description djagoo 2017-04-09 08:41:59 UTC
Created attachment 255793 [details]
dmesg output

On the new 5th Gen Lenovo X1 Carbon with an Intel 8265, on a fresh Arch install, iwlwifi keeps crashing the moment the network comes up.

➜  ~ ethtool -i wlp4s0 | grep firmware
firmware-version: 27.455470.0
Comment 2 djagoo 2017-04-10 04:51:36 UTC
Didn't help, removed the -21 -22 and -27 ucode files from /lib/firmware, placed the -22 firmware from this link there. 

After reboot:

➜  ~ ethtool -i wlp4s0 | grep firmware
firmware-version: 22.361476.0

dmesg is in dmesg-22.txt
Comment 3 djagoo 2017-04-10 04:52:43 UTC
Created attachment 255799 [details]
dmesg after using -22.ucode
Comment 4 Luca Coelho 2017-04-10 11:52:18 UTC
This is related to bug 193641, but more serious due to the NULL pointer dereference.  Let's handle the dereference here and the actual SYSASSERT that triggers it in the other bug report.

Created an internal ticket for tracking.
Comment 5 djagoo 2017-04-10 19:11:28 UTC
Okay, let me know if I can be of any help. I'll use a wired connection for the time being.
Comment 6 djagoo 2017-04-11 10:18:00 UTC
Needed some time but now I understand what bug 193641 was saying. My DD-WRT was choosing channel 44 (5220MHz) on "auto" with 40MHz width and this seems not to be allowed. So I tried using some channels at 20MHz without any crash. After I changed it back to 40MHz it crashes again. So I'm staying on 20MHz width.

Thanks for pointing me in the right direction.
Comment 7 Willis Monroe 2017-04-14 21:54:22 UTC
Created attachment 255891 [details]
oops log
Comment 8 Willis Monroe 2017-04-14 21:55:50 UTC
Finding the same thing on Ubuntu after upgrading to 17.04 with an ASUS laptop.  Seems to happen regularly on trying to connect to 5ghz networks.  After the oops the command line is very unresponsive.  'modprobe -r iwlwifi' never returns and can't be killed, essentially hanging the terminal or tty.
Comment 9 Piotr Dąbrowski 2017-04-16 00:32:39 UTC
I have the same problem on Asus UX305FA with Intel® Dual Band Wireless-N 7265 after upgrade to Ubuntu 17.04 (also remains when clean install was done). The issue does not occur on Ubuntu 16.10, 16.04 or Windows 10.

Connecting to 2.4GHz networks works well, but the problem appears when trying to connect to 5GHz networks.

Related: https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/1682744
Comment 10 Wouter Termont 2017-04-21 21:22:29 UTC
Same here on another Asus UX305FA, running archlinux, since somewhere between the late 4.8 and early 4.10 kernel updates.
Comment 11 Wouter Termont 2017-04-23 08:03:00 UTC
Is there any progress on this? If not, how can we help? Because it is really annoying not being able to connect to half of the wireless networks, including eduroam and a lot of public hotspots.
Comment 12 Luca Coelho 2017-04-23 09:19:06 UTC
We are working on it.  I'll put it in high priority tomorrow (Monday) so we can advance this and provide you with the solution.
Comment 13 wompy 2017-04-25 18:39:56 UTC
I have an ASUS UX303L and it is also affected. The crashes started somewhere between 4.8 and 4.10.
Comment 14 wompy 2017-04-30 16:49:14 UTC
Is there any way we can help with testing? This bug is really annoying, since I often even get into Gnome before my laptop locks up. I would even appreciate a work around for now.

Thanks a lot for putting time into this topic!
Comment 15 Piotr Dąbrowski 2017-04-30 22:46:38 UTC
There is a workaround that worked for me - I just disabled the option "automatically connect to this network when available" for every 5 GHz network in network manager, so it doesn't crash trying to connect to them. It won't work if your problem isn't connected to a specific network and it just crashes randomly.
Comment 16 Willis Monroe 2017-05-01 18:10:51 UTC
I was able to roll-back the linux-firmware package in my distribution (Ubuntu) and everything works fine for now.  Looking forward to a fix.
Comment 17 Luca Coelho 2017-05-02 12:05:52 UTC
There are two issues here, obviously.  One is that we're getting the firmware SYSASSERT.  The other, more serious one because it causes a kernel oops, is that we are failing to recover properly.

The NULL pointer dereference seems to be happening here in the iwl_mvm_realloc_queues_after_restart() function, in the loop that reallocates the queues.

I'm going to try to force this to be reproduced on my machine (by forcing the SYSASSERT in the right place).
Comment 18 Luca Coelho 2017-05-02 15:20:35 UTC
Created attachment 256175 [details]
patch with potential fix

Found the problem.  The oops was happening because mac80211 was calling iwlwifi to add a station (the one we're authenticating with), but that was not valid anymore because of the reconfig.

To prevent this from happening, I changed mac80211 so that it will bail out in ieee80211_prep_connection() if there is an ongoing reconfiguration.

Can you please try this patch and report back? This should at least fix the oops.  Once we pass that hurdle, we can continue to investigate the actual FW asserts.
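
As a rough illustration of the guard described above, here is a minimal userspace sketch. It is not the attached patch: the struct, the function name and the -EBUSY return value are all assumptions made for illustration. Only the idea of bailing out while a reconfiguration is in progress comes from the description.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the per-device state. */
struct sketch_local {
    /* true while the stack is rebuilding state after a firmware restart */
    bool in_reconfig;
};

/*
 * Sketch of the guard: refuse to prepare a new connection while a
 * reconfiguration is still running, so that no station is added on top
 * of state that is about to be torn down. The error code is an
 * assumption; the real patch may return something else.
 */
int sketch_prep_connection(const struct sketch_local *local)
{
    if (local->in_reconfig)
        return -EBUSY;

    /* ... normal connection setup (adding the AP station, etc.) ... */
    return 0;
}
```

With such a guard, an authentication attempt that races with the restart fails cleanly and can be retried later instead of dereferencing stale state.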
Comment 19 geoffrey 2017-05-05 05:59:47 UTC
I applied the patch yesterday on my UX305 but it seems the flag in_reconfig is not set in my case, so the oops is still here.

Here is my dmesg: https://pastebin.com/aSzih42W

c0:25:06:e7:cf:e7 is my 2.4GHz network, set for auto-connection.
c0:25:06:e7:cf:e8 is my 5GHz network, which I connected to manually after boot.

I'm quite new to this kind of debugging but will be glad to help. Please ask what I can do during the week-end.

Not sure if it helps but I added some printk, they are as well in my dmesg output. The modified source: https://pastebin.com/EKn52hjr

Thanks for taking time to track this issue.
Comment 20 Luca Coelho 2017-05-05 19:30:49 UTC
Geoffrey,

Can you please paste the logs as attachments here? It's much easier to keep track when they are all available in the same way.
Comment 21 wompy 2017-05-06 01:30:17 UTC
Created attachment 256233 [details]
dmesg with kernel patch

Luca,

Thanks for working on this. I patched, compiled and installed the kernel (my first time) and I think I succeeded. After a reboot I still see the same behavior. I attached the dmesg output.
Comment 22 Wouter Termont 2017-05-16 14:05:49 UTC
Any progress on this?
Comment 23 Piotr Dąbrowski 2017-05-18 16:54:10 UTC
I'm not sure if that can help, but I can confirm the first affected version of linux-firmware in Ubuntu is 1.157.9 or 1.157.10, as I just updated from 1.157.8 and dmesg showed the same error as above.

Downgrading to 1.157.8 makes connecting to 5 GHz networks using Dual Band Wireless AC 7265 (REV=0x210) stable again. I use Ubuntu 16.04 with 4.11 kernel.

There is a changelog for these versions:
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux-firmware/linux-firmware_1.157.10/changelog

Also, on the working configuration (1.157.8) dmesg shows loaded firmware version 22.361476.0, and on the non-working one (1.157.10) it shows 22.391740.0 (op_mode iwlmvm). I can post the whole dmesg log if needed.
Comment 24 Wouter Termont 2017-05-22 15:45:41 UTC
The status is still NEEDINFO, yet the info is that the patch does not work. What more can be done to help? I don't want to sound ungrateful (I think it's amazing how much time developers put into community!), but I've already been waiting months for this and I'm not sure what else I can do but wait.
Comment 25 geoffrey 2017-05-23 05:46:44 UTC
Created attachment 256677 [details]
dmesg with proposed patch

Hi, I would also like to help, please provide some advice on what we can do. Blacklisting all 5GHz networks is not fun. As requested, I am adding my previously linked dmesg as an attachment for tracking purposes.
Comment 26 Luca Coelho 2017-05-24 11:06:55 UTC
There are two issues here.  One is the oops, which I'm investigating.  The other is the SYSASSERT.  I'll take this with the firmware team.

The information that it happens with 22.361476 but not with 22.391740 is very good.  I'll try to track down the change in the firmware between those two that is causing the issue.

This is what we call Core19 (the first FW version number - 3).  Can someone try our newer Core24 release[1]? Or kernel v4.11 or higher? We support the new -27.ucode firmware in those.


[1] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release
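
The naming rule mentioned above (Core number = first firmware version number - 3) can be sketched as a small helper. The function name is hypothetical and the parsing is deliberately minimal: it only reads the leading integer of a version string such as the ones printed by ethtool or dmesg in this report.

```c
#include <stdio.h>

/*
 * Returns the Core release number for a firmware version string such as
 * "27.455470.0", or -1 if the string does not start with a number.
 * Rule taken from the comment above: Core = first version number - 3.
 */
int core_release_from_fw_version(const char *fw_version)
{
    int major;

    if (sscanf(fw_version, "%d", &major) != 1)
        return -1;
    return major - 3;
}
```

For example, "22.361476.0" maps to Core19 and "27.455470.0" maps to Core24, which matches the releases named in this comment.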
Comment 27 geoffrey 2017-05-25 08:25:38 UTC
Created attachment 256711 [details]
dmesg kernel 4.11.2-1-ARCH with 5Ghz wifi connecting again

Hi,

I just tried with kernel 4.11.2-1-ARCH. Dmesg is not error free but I can connect to my 5Ghz network without crash.

Not sure which Core release I use and how to check. From dmesg 

iwlwifi 0000:02:00.0: Direct firmware load for iwlwifi-7265D-28.ucode failed with error -2.

loaded firmware version 27.455470.0 op_mode iwlmvm

Dmesg is attached.
Comment 28 Denis Auroux 2017-05-28 02:09:25 UTC
I keep having these crashes on occasion with iwlwifi-7260-17.ucode (firmware version 17.459231.0) on Fedora 25 as well, and my only solution so far has been to downgrade to a 4.8.16 kernel. The firmware still crashes, but the system doesn't, and I can usually recover by turning wi-fi off and back on; whereas with 4.10.x kernels processes become unkillable and it always ends with a hard shutdown.  

Should I expect anything from the 4.11.x kernels once they arrive in Fedora 25, given that the 7260 firmware isn't evolving anymore?  (i.e., will the 4.11.x kernels be able to recover from this SYSASSERT, or am I still stuck?)

Thanks,
Denis
Comment 29 Roberto Catini 2017-05-28 13:23:23 UTC
Created attachment 256747 [details]
dmesg on Arch - linux 4.11.2-1- linux firmware 20170422.ade8332-1 - ucode 20170511-1

Seems solved on my ASUS UX305F, as in post #27
Comment 30 Wouter Termont 2017-05-29 16:46:29 UTC
Seems indeed to be solved for my Arch setup on an Asus UX305FA with the 4.11 kernel. Thanks!
Comment 31 Denis Auroux 2017-06-02 23:20:11 UTC
Definitely not fixed with kernel 4.11.3 and Intel 7260 on Fedora 25. (unsurprisingly, since there is no firmware update for the 7260 and the kernel bug preventing a graceful recovery from the firmware crash itself hasn't been addressed.)

Kernel: 4.11.3-200.fc25.x86_64
Firmware version: 17.459231.0

Syslog excerpt attached. Last known good kernel for me is 4.8.16.  None of the 4.10.x work properly, and I think 4.9.x also have the problem but I didn't keep any on my system to check.

Denis
Comment 32 Denis Auroux 2017-06-02 23:21:15 UTC
Created attachment 256847 [details]
dmesg with kernel 4.11.3, Intel 7260 AC, firmware 17.459231.0
Comment 33 Nathan Baker 2017-06-26 23:15:20 UTC
I may or may not be seeing a similar issue. This occurs when first connecting instead of recovery, and it happens every time.

Here's a summary of the log...I'll upload the whole thing:

> Jun 27 09:25:12 nathanb-nuc wpa_supplicant[559]: wlp3s0: SME: Trying to
> authenticate with 6c:3b:6b:3f:76:2e (SSID='Vistagate5G-3F762F' freq=5200 MHz)
> Jun 27 09:25:12 nathanb-nuc kernel: wlp3s0: authenticate with
> 6c:3b:6b:3f:76:2e
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: Microcode SW error
> detected.  Restarting 0x2000000.
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: Start IWL Error Log
> Dump:
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: Status: 0x00000000,
> count: 6
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: Loaded firmware
> version: 27.532463.0
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: 0x000014FD |
> ADVANCED_SYSASSERT        
> Jun 27 09:25:12 nathanb-nuc kernel: ieee80211 phy0: Hardware restart was
> requested
> Jun 27 09:25:12 nathanb-nuc kernel: iwlwifi 0000:03:00.0: FW error in SYNC
> CMD PHY_CONTEXT_CMD
> Jun 27 09:25:12 nathanb-nuc kernel: CPU: 6 PID: 559 Comm: wpa_supplicant
> Tainted: G           O    4.11.6-3-ARCH #1
> Jun 27 09:25:12 nathanb-nuc kernel: BUG: unable to handle kernel NULL pointer
> dereference at 000000000000011c
> Jun 27 09:25:12 nathanb-nuc kernel: IP: iwl_mvm_add_sta+0x4f1/0x780 [iwlmvm]
> Jun 27 09:25:12 nathanb-nuc kernel: PGD 4a22e0067 
> Jun 27 09:25:12 nathanb-nuc kernel: PUD 4a8c59067 
> Jun 27 09:25:12 nathanb-nuc kernel: PMD 0 
> Jun 27 09:25:12 nathanb-nuc kernel: 
> Jun 27 09:25:12 nathanb-nuc kernel: Oops: 0000 [#1] PREEMPT SMP

Hope this helps.
Comment 34 Nathan Baker 2017-06-26 23:15:59 UTC
Created attachment 257187 [details]
Syslog from kernel oops
Comment 35 Luca Coelho 2017-06-27 05:25:45 UTC
Thanks, Nathan.

Yes, you're experiencing the same issue here.  Note the 0x14FD SYSASSERT and the 0x11c signature in the NULL pointer dereference.

Can you try to update your firmware to the latest version from the linux-firmware.git repo[1]? You seem to be using an unofficial version "27.532463.0", which we probably sent you and which probably doesn't have the fix for the 0x14FD sysassert.

At least Geoffrey reported the official one as fixing the 0x14FD bug.

Or, even better, you could try the latest firmware we released (Core28, ucode -31)[2], which hasn't been sent to the official linux-firmware.git tree, but will be sent soon.


[1] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/iwlwifi-8000C-27.ucode

[2] https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-8000C-31.ucode
Comment 36 Luca Coelho 2017-06-27 05:31:30 UTC
Denis,

Unfortunately the 7260 is already at a much lower pace of updates on the firmware side.  But I'll check with the firmware team whether we can get the fix for the 0x14FD sysassert backported (or have already done so).

In general, we definitely need to fix the Oops.  Since so many people are seeing it, I'll increase our internal priority on fixing this issue.
Comment 37 Nathan Baker 2017-06-27 07:28:23 UTC
My version of iwlwifi doesn't want to load the -31 firmware. It tries up through -28 and then stops.

I'm a little reluctant to replace the -27 I'm using at the moment because anecdotally (I haven't measured it) it seems to not drop as much with assert 1007. But I'd rather use the new -27 than install the iwlwifi from the git repo again, since that was a proper pain, so if those are my only options then I'll go with the former!
Comment 38 Luca Coelho 2017-06-27 09:39:53 UTC
Ah, sorry, I thought you still had our backport release installed.

The official -27 should work (at least according to Geoffrey).  So it's your call, I can recommend 3 options (to use 5GHz):

1. Install the latest official -27;
2. Install our Core28 release and use -31;
3. Wait for kernel v4.13 to reach your distro.
Comment 39 Luca Coelho 2017-06-27 11:00:39 UTC
Denis,

Actually, I was mistaken in thinking that there was a fix in the firmware for the 0x14FD problem.  The problem, as described in bug 193641 comment 21 (and later comments), is that channel 104 can't be a control channel if the bandwidth is set to 80MHz.  Can you check whether that is the case for you too?

Nathan, can you also check this?
Comment 40 Denis Auroux 2017-06-27 12:14:09 UTC
Luca, I'm confused about what I'm supposed to check exactly. My 7260 produces the sysassert with some frequency while scanning and trying to associate to a large campus-wide network with many access points, I am not setting manually any invalid frequency parameters, I'm letting NetworkManager and wpa_supplicant find the access point for me... 

If someone can tell me exactly how to configure wpa_supplicant or NetworkManager (or the kernel module options? can't figure where this is done) to disable 80 MHz bandwidth altogether, I'm happy to do that and see if it fixes the problem.

Or am I supposed to try the latest firmware git even though it sounds like nothing was fixed there? (or did something actually change in iwlwifi-7260-17.ucode that might fix the issue?)

In any case I can live with the SYSASSERT, it's the subsequent kernel oops that is making my life miserable. 

Thanks in advance,
Denis
Comment 41 Emmanuel Grumbach 2017-06-28 21:11:00 UTC
*** Bug 196211 has been marked as a duplicate of this bug. ***
Comment 42 Łukasz Siudut 2017-06-28 22:01:58 UTC
Created attachment 257211 [details]
oops, kernel 4.9, core28 drivers, firmware v31
Comment 43 Łukasz Siudut 2017-06-28 22:02:17 UTC
Hey everyone. I got hit by the same issue on my ThinkPad Yoga X1 (more details in 196211, marked as duplicate in #41).

Just tried core28 + firmware 31 on two kernels - and unfortunately nothing has changed. kspp 4.11 freezes completely, vanilla 4.9 is giving me enough time to dump dmesg on the disk (see attached file).
Comment 44 Johannes Berg 2017-06-30 09:05:27 UTC
Created attachment 257247 [details]
patch to fix restart problem

Can you try this patch? It fixed the issue for me.
Comment 45 Johannes Berg 2017-06-30 15:34:41 UTC
Created attachment 257255 [details]
patch to fix restart problem (for upstream, 4.9)

The other patch version probably won't apply cleanly, this one is diffed against v4.9 but will likely also work on other kernel versions.
Comment 46 Łukasz Siudut 2017-06-30 20:29:30 UTC
Created attachment 257267 [details]
kernel log generated during connection with the patch

With the patch the system didn't hang; it seems that the firmware got restarted without a deadlock. I could feel the computer lagging for a second or so every time it tried to connect to the network. And the connection didn't succeed in the end, obviously. Kernel 4.11.
Comment 47 Johannes Berg 2017-06-30 21:23:19 UTC
Great. So let's say we've fixed the crash bug, and there's obviously still the problem that we get this assert in the first place...

(what driver are you using that you got the LED scheduling while sleeping warning? I thought I introduced it just yesterday and fixed it today ...)
Comment 48 Łukasz Siudut 2017-06-30 21:27:22 UTC
Oh, this is fresh checkout from the git. I took the easy way, just clicked the link from the #26 and copied it from the url bar. So I guess it makes sense ;-).
Comment 49 Johannes Berg 2017-06-30 21:28:18 UTC
Regarding the actual bug that leads to the assert, can you provide the "iw wlan0 scan dump" output for the AP in question (last log shows it as being 18:d6:c7:90:df:d9)?
Comment 50 Łukasz Siudut 2017-06-30 21:35:55 UTC
Created attachment 257269 [details]
iw-scan

Absolutely. I didn't reflash my unit back to openwrt as I wanted to be able to provide you feedback (only dd-wrt is triggering the problem for me). Let me know if you need anything else.
Comment 51 Johannes Berg 2017-06-30 21:44:28 UTC
Hmm. This is a strange (and I think invalid?) configuration. You're essentially using the channels like this:

[36][40][44][48]
        |C |
    | HT40 |
|    VHT 80    |

according to this information. Even channel 44 HT40- isn't really valid per spec, it should be 44 HT40+.

Did you configure this manually in LEDE? It should've just picked HT40+ ("secondary channel offset: above") instead of HT40- ("secondary channel offset: below") and I think it'd all work.

(I need to get some sleep now, and then I'll be on an extended vacation. If I find time, I may respond, but I think somebody else from the team will pick this up if not.)
Comment 52 Łukasz Siudut 2017-06-30 21:49:57 UTC
This is DD-WRT, I just configured ssid to LEDE (don't ask...). And I don't really have flexibility in upper/lower channel settings... Anyway, I'm also tired, I'll play with it tomorrow. Thanks!
Comment 53 Łukasz Siudut 2017-07-01 08:27:46 UTC
So I played a bit with channels and it seems that you are correct.

I think that DD-WRT is a bit too liberal in terms of how you can choose the main and upper/lower channels. Basically you can choose the channel that you want plus one of four options for the extension: LL (-6), LU (-2), UL (+2), UU (+6). Once I set UU and either 36 or 52, I was able to connect without issues. Attempts to set any of the channels in between resulted in failure.

It would also explain why many reports are dd-wrt related...
Comment 54 Johannes Berg 2017-07-01 17:36:21 UTC
Having four options is fine, actually, but it needs to be internally consistent.

So you can have

Control |-| | | |
HT      |---| | |
VHT     |-------|

Control | |-| | |
HT      |---| | |
VHT     |-------|

Control | | |-| |
HT      |   |---|
VHT     |-------|

Control | | | |-|
HT      |   |---|
VHT     |-------|

but letting the user pick HT40+/- and VHT configuration separately is wrong, since options like you had aren't valid:


Control | | |-| |
HT      | |---| |
VHT     |-------|

Control | |-| | |
HT      | |---| |
VHT     |-------|


I think perhaps it's simply not picking the right *HT* channel here, when you selected VHT? LL, LU, UL, UU are all perfectly fine options, but the way they map to HT40+/- needs to be done right and doesn't seem to be so here.
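
The consistency rule in the diagrams above can be written down as a small check. This is only a sketch with hypothetical function names, and it is not code from mac80211 or the firmware. It assumes the usual 5 GHz mapping of channel number to center frequency (5000 + 5 * channel, so channel 44 is 5220 MHz) and treats an HT40 channel as valid inside a VHT80 channel only if it is exactly the lower or the upper half.

```c
#include <stdbool.h>

/* Center frequency in MHz of a 5 GHz channel number (e.g. 44 -> 5220). */
int chan_to_freq_5ghz(int chan)
{
    return 5000 + 5 * chan;
}

/*
 * ht_offset is +1 for HT40+ (secondary channel above the primary) or -1
 * for HT40- (secondary channel below). vht80_center is the center
 * frequency of the 80 MHz channel in MHz, e.g. 5210 for channels 36-48.
 */
bool ht40_fits_vht80(int primary_chan, int ht_offset, int vht80_center)
{
    /* The 40 MHz center sits 10 MHz above or below the primary center. */
    int ht40_center = chan_to_freq_5ghz(primary_chan) + 10 * ht_offset;

    /* The 40 MHz channel must be exactly the lower or the upper half. */
    return ht40_center == vht80_center - 20 ||
           ht40_center == vht80_center + 20;
}
```

With vht80_center = 5210, the four valid layouts are 36/HT40+, 40/HT40-, 44/HT40+ and 48/HT40-. The two invalid ones drawn above, 44/HT40- and 40/HT40+, both put the 40 MHz channel in the middle of the 80 MHz channel and are rejected.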
Comment 55 Łukasz Siudut 2017-07-01 17:42:39 UTC
Yes, I kinda intuitively figured that out after your first comment. Is this somehow actionable from your side? I'll try to contact dd-wrt developers and let them know about the issue.
Comment 56 Johannes Berg 2017-07-01 19:45:21 UTC
I should add to this, if only for my colleagues :)
(and some of this may be only comprehensible for them)

These configurations I mentioned, with the HT40 being in the center of the VHT80 channel, are actually technically not feasible. That's just not how the thing works, since VHT80 is considered a main HT40 channel plus an extension, just like HT40 itself is a main 20MHz channel plus a 20MHz extension. I'm not even sure if PHYs would typically be able to cope with such a thing - they would have no issue decoding an 80 MHz signal on this, but might not be set up to search for a narrower 40 MHz signal in the center, rather than one of the higher/lower 40 MHz part (and knowing which one to search in ahead of time!)


Now, mind you, that's actually not related to the sysassert - mac80211 figures out that the channel setting is invalid and says

[76850.177566] wlp4s0: AP VHT information doesn't match HT, disable VHT

(from your log)

so mac80211 ends up using HT40, which it selects according to the information from the scan:

        HT operation:
                 * primary channel: 44
                 * secondary channel offset: below


With this message, VHT80 is completely out of the way because it's bogus.

The problem now is that the remaining configuration, HT40- on channel 44, while a technically feasible configuration, is invalid according to the Annex E in 802.11-2016 (http://standards.ieee.org/getieee802/download/802.11-2016.pdf). It's a bit complicated to read, but basically we'd be looking for an operating class that allows channel 44 with Channel Spacing 40 (MHz) and "PrimaryChannelUpperBehavior". This doesn't exist.

Normally, this isn't actually very interesting though. In the iwlwifi case, however, somebody decided that the firmware should not only validate the pure regulatory requirements (which shouldn't be an issue here since the same subband is used whether this is 40/HT40+ or 44/HT40-, the former is OK the latter is not), but at the same time also the channelization given in the 802.11 spec.

Of course there are still good reasons for this channelization given in the spec, mostly that it makes overlapping BSSes (networks) behave better together.


As far as a solution goes, I think that the driver should already have set up the right regulatory flags, but perhaps we're not checking them correctly in mac80211? If you run "iw phy0 channels" you should see information such as

	* 5220 MHz [44] 
	  Maximum TX power: 15.0 dBm
	  No IR
	  Channel widths: 20MHz HT40+

(copy/pasted from my system)

which actually tells us that the device isn't willing to accept HT40- on channel 44. This should, in theory, be checked by cfg80211_chandef_usable(), which should lead to the downgrade loop in ieee80211_determine_chantype() being entered and actually selecting 20 MHz for your AP. Somehow this seems to not be working, but it's not immediately clear to me why. Perhaps in your case the regulatory data is actually something else, erroneously? You could check with "iw phy0 channels".

Now, selecting a 20 MHz channel for an AP that was supposed to support 80 MHz is obviously not an ideal outcome. I'm not sure what we can do about that though, other than print more hints to the user about the channelization. Perhaps we should add some code at least into "iw" that flags these common problems (VHT/HT mismatch, and operating class mismatch) so that at least we can point users to run that when they encounter an issue. Even better, perhaps tools such as NetworkManager could flag such things. At the very least, we could print a message when the downgrade happens for "regulatory reasons" that really look more like operating class mismatches. We obviously already print a message for the HT/VHT mismatch, but that message is obviously impossible to understand for anyone not intimately familiar with the subject matter.

hope that helps!
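
To make the expected downgrade behaviour concrete, here is a simplified model of it. It is a sketch only: the enum, the flag names and both functions are hypothetical stand-ins, and the real cfg80211_chandef_usable() and ieee80211_determine_chantype() check much more than a per-channel HT40 flag.

```c
#include <stdbool.h>

enum sketch_width { WIDTH_20, WIDTH_HT40_MINUS, WIDTH_HT40_PLUS };

/* Hypothetical per-channel restriction flags, modelled on the iw output. */
#define SKETCH_NO_HT40_MINUS 0x1u
#define SKETCH_NO_HT40_PLUS  0x2u

/* Stand-in for the usability check: is this width allowed on the channel? */
bool sketch_width_usable(unsigned int chan_flags, enum sketch_width w)
{
    if (w == WIDTH_HT40_MINUS)
        return !(chan_flags & SKETCH_NO_HT40_MINUS);
    if (w == WIDTH_HT40_PLUS)
        return !(chan_flags & SKETCH_NO_HT40_PLUS);
    return true; /* 20 MHz is assumed to be always usable in this sketch */
}

/*
 * Stand-in for the downgrade loop: start from what the AP advertises and
 * fall back to 20 MHz when the channel flags forbid that width.
 */
enum sketch_width sketch_determine_width(unsigned int chan_flags,
                                         enum sketch_width ap_width)
{
    enum sketch_width w = ap_width;

    while (!sketch_width_usable(chan_flags, w))
        w = WIDTH_20;
    return w;
}
```

In this model, a channel 44 that advertises only "20MHz HT40+" carries SKETCH_NO_HT40_MINUS, so an AP announcing HT40- ends up at 20 MHz. If the flag is missing, the HT40- request goes through unchanged.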
Comment 57 Johannes Berg 2017-07-01 19:49:38 UTC
Oh, and in case it wasn't clear - DD-WRT really ought to not allow people (at least without warnings/expert config/whatever) to select something that's not in the operating classes plan in 802.11, but I can't really do anything about that.
Comment 58 Łukasz Siudut 2017-07-01 20:00:27 UTC
Pretty cool explanation, thanks a lot Johannes!

You're correct again, it seems my regulatory data is wrong (or isn't present at all). `iw reg get` yields "country EU: DFS-UNSET", `iw phy0 channels`:

        * 5220 MHz [44] 
          Maximum TX power: 22.0 dBm
          No IR
          Channel widths: 20MHz HT40- HT40+ VHT80

Also I just realized that I didn't have regdb and crda packages installed, just fixed this. For some reason I still can't set reg to PL, but I'll find out why later.

Once again thanks, lot of cool stuff!
Comment 59 Johannes Berg 2017-07-01 20:01:46 UTC
Aha, ok, that's probably the problem here. I think though that the *driver* actually should be providing the regulatory data in this case, or at least the HT40 flags. Since I'm on vacation now, I'll let somebody else take over from here :)
Comment 60 Emmanuel Grumbach 2017-07-01 20:07:43 UTC
My system (with 9260) says:

	* 5220 MHz [44] 
	  Maximum TX power: 22.0 dBm
	  No IR
	  Channel widths: 20MHz HT40- HT40+ VHT80 VHT160

I'll take it from here. Thanks Johannes.

The internal (iwlwifi) data we have doesn't say much more, since it just says whether a channel can be part of a 20 / 40 / 80 / 160 channel:

[ 1083.905255] iwlwifi 0000:02:00.0: U iwl_parse_nvm_mcc_info Ch. 5220 [5.2GHz] VALID WIDE 40MHZ 80MHZ 160MHZ INDOOR_ONLY GO_CONCURRENT (0xf61): Ad-Hoc not supported


I'll dig deeper tomorrow. Kinda late here.
Comment 61 Johannes Berg 2017-07-01 20:15:45 UTC
Right, but if the firmware checks it we should probably get the data out. I actually tend to think the firmware shouldn't check this anyway since it's not strictly regulatory, but that's how the regulatory data gets encoded iirc.

Note that in cfg80211_chandef_usable() we have a comment about 80MHz:

        /*
         * TODO: What if there are only certain 80/160/80+80 MHz channels
         *       allowed by the driver, or only certain combinations?
         *       For 40 MHz the driver can set the NO_HT40 flags, but for
         *       80/160 MHz and in particular 80+80 MHz this isn't really
         *       feasible and we only have NO_80MHZ/NO_160MHZ so far but
         *       no way to cover 80+80 MHz or more complex restrictions.
         *       Note that such restrictions also need to be advertised to
         *       userspace, for example for P2P channel selection.
         */

so we may need to address that as well to not have the same sysassert again with some kind of misconfigured VHT80.
Comment 62 Łukasz Siudut 2017-07-02 09:49:43 UTC
Just a side note. I was playing with regulatory settings and found out why I was unable to change it - because of LAR. Once I passed lar_disable=1 to iwlwifi I'm able to change reg.

Nevertheless, it doesn't help. Even with the regulatory domain set to IE, the channel widths remain unchanged, so the problem may reoccur:

iw reg get:

  global
  country IE: DFS-ETSI
          (2402 - 2482 @ 40), (N/A, 20), (N/A)
          (5170 - 5250 @ 80), (N/A, 20), (N/A), AUTO-BW
          (5250 - 5330 @ 80), (N/A, 20), (0 ms), DFS, AUTO-BW
          (5490 - 5710 @ 160), (N/A, 27), (0 ms), DFS
          (57000 - 66000 @ 2160), (N/A, 40), (N/A)


iw phy0 channels:

        * 5220 MHz [44] 
          Maximum TX power: 20.0 dBm
          No IR
          Channel widths: 20MHz HT40- HT40+ VHT80
Comment 63 Emmanuel Grumbach 2017-07-02 13:56:38 UTC
I have the same flags as Łukasz on my system with 8260.

HT40- is allowed there.
What is interesting is that based on the mvm code, I can see that we enable HT40- and HT40+ on all the HT40 capable channels upon driver load. This is because iwl_init_channel_map from iwl-nvm-parse.c doesn't clear the IEEE80211_CHAN_NO_HT40{PLUS,MINUS} flags like its counterpart in iwl-eeprom-parse.c does (we use iwl-nvm-parse.c for 8260).

What should happen though is that when we initialize the MCC stuff, we do have iwl_nvm_get_regdom_bw_flags which does:
	} else if (nvm_chan[ch_idx] <= last_5ghz_ht &&
		   (nvm_flags & NVM_CHANNEL_40MHZ)) {
		if ((ch_idx - NUM_2GHZ_CHANNELS) % 2 == 0)
			flags &= ~NL80211_RRF_NO_HT40PLUS;
		else
			flags &= ~NL80211_RRF_NO_HT40MINUS;
	}


but that impacts NL80211_RRF_NO_HT40PLUS, which doesn't have any effect on iw's output nor on the caps while associating to the AP.
_RRF_ flags seem to be related to regulatory stuff.
I'll ask people who are more familiar with the regulatory code.
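
The parity rule in the snippet above can be tried out in isolation. The following is a sketch under assumptions: the channel table is a hypothetical excerpt (14 2.4 GHz channels followed by the first 5 GHz channels), and the helper names are invented for illustration and are not driver symbols. Only the even/odd test on (ch_idx - NUM_2GHZ_CHANNELS) comes from the quoted code.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_2GHZ_CHANNELS 14

/* Hypothetical excerpt of a channel table: 2.4 GHz first, then 5 GHz. */
const int sketch_nvm_chan[] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
    36, 40, 44, 48, 52, 56, 60, 64,
};

/* Index of a channel number in the table, or -1 if it is not listed. */
int sketch_index_of(int chan)
{
    size_t n = sizeof(sketch_nvm_chan) / sizeof(sketch_nvm_chan[0]);

    for (size_t i = 0; i < n; i++)
        if (sketch_nvm_chan[i] == chan)
            return (int)i;
    return -1;
}

/*
 * Same parity test as in the quoted code, for a 5 GHz table index:
 * even position -> HT40+ is the allowed direction, odd -> HT40-.
 */
bool sketch_allows_ht40_plus(int ch_idx)
{
    return (ch_idx - SKETCH_NUM_2GHZ_CHANNELS) % 2 == 0;
}
```

In this table channel 44 sits at index 16, an even 5 GHz position, so only HT40+ would be cleared for it, while channel 48 at index 17 would get HT40-. That agrees with the "20MHz HT40+" output quoted earlier for channel 44.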
Comment 64 Denis Auroux 2017-07-02 14:12:39 UTC
All the discussion of valid/invalid channel settings is great, but I sure am glad that the issue with the kernel oops also got addressed, because in my setting (client connecting to a large campus network with many APs, one of which might have been somehow bad) it made for a spectacular DoS on linux/iwlwifi laptops.

I can't diagnose things anymore because I just relocated and am now 2500 miles away from the campus network where I was having these issues. It was an enterprise-grade network with mostly Cisco AP's, and perhaps there was a faulty AP near my location; the syslog from before the move contained lines like

[1364624.761916] wlp4s0: authenticate with cc:16:7e:5b:79:7f
[1364624.761921] wlp4s0: AP VHT information doesn't match HT, disable VHT
[1364624.761923] wlp4s0: capabilities/regulatory prevented using AP HT/VHT configuration, downgraded

This one didn't lead to a crash, I routinely associated to that AP without adverse consequences. On the other hand:

[1366233.401624] wlp4s0: authenticate with cc:16:7e:5b:7b:2f
[1366233.401632] wlp4s0: AP VHT information doesn't match HT, disable VHT
[1366233.402864] iwlwifi 0000:04:00.0: Microcode SW error detected.  Restarting 0x2000000.

This one didn't go well at all. (-> sysassert etc)

[1366259.400015] wlp4s0: authenticate with 00:3a:7d:da:a0:bf
[1366259.400027] wlp4s0: AP VHT information doesn't match HT, disable VHT
[1366259.401426] iwlwifi 0000:04:00.0: Microcode SW error detected.  Restarting 0x2000000.

(-> sysassert as well)

So: it seems that there were multiple AP's causing the problems. (Unless these were all within the same physical AP, handling the 3 different variants of the campus network -- open, secured, eduroam). My guess is that the APs with bad VHT information caused the firmware crash when the regulatory settings didn't kick in to cause HT/VHT to get entirely disabled on attempts to connect to that AP.  Surprised that there would be several different bad APs though ???

Anyway, I can't really blame the firmware for crashing when there's a bad AP, but the kernel should be more resilient... haven't tested the patch and probably won't unless I re-encounter the issue on a different campus, but not having the system crash entirely is certainly a very useful thing.

Denis
Comment 65 Emmanuel Grumbach 2017-07-03 12:47:40 UTC
Łukasz, can you please check if you still see the SYSASSERT 14FD when you disable LAR?

I have been diving deep in the regulatory code the whole day, I hope I'll get a patch soon.
I already have a version that worked for me, but I am not quite happy with it. Trying to see how to do things in a better way.
Comment 66 Emmanuel Grumbach 2017-07-03 13:42:23 UTC
Created attachment 257305 [details]
iwlwifi-fix
Comment 67 Emmanuel Grumbach 2017-07-03 13:43:25 UTC
Created attachment 257307 [details]
cfg80211 - fix

Hi,

here are 2 patches.
Can you please check that they fix the ASSERT 14FD for you?
I could reproduce the problem and they fixed it for me.

Thanks!
Comment 68 Emmanuel Grumbach 2017-07-03 13:56:16 UTC
Created attachment 257311 [details]
iwlwifi-fix

Updated the iwlwifi patch.

The previous version had a sparse error (that didn't have any impact).
Comment 69 Emmanuel Grumbach 2017-07-03 13:57:04 UTC
These patches allow you to connect to the AP that is configured to Channel 40 HT40+ even with LAR enabled (which is the default).

Let me know.

Thanks!
Comment 70 Emmanuel Grumbach 2017-07-05 20:41:23 UTC
@Denis - please send me the output of iw phy0 channels

Note that you'll need a very new version of iw for that.

@Nathan - did you have a chance to test the patches?
Comment 71 Emmanuel Grumbach 2017-07-06 11:18:58 UTC
Created attachment 257387 [details]
iwlwifi-fix

Final version of the patch for iwlwifi.
Comment 72 Emmanuel Grumbach 2017-07-06 11:19:38 UTC
Created attachment 257389 [details]
cfg80211 - fix

Final version of the patch for cfg80211.
Comment 73 Emmanuel Grumbach 2017-07-06 13:45:22 UTC
Patches are now merged in our backport tree in master branch.

I'll close this bug in a few days if I don't get more feedback on this.
All, please test :)
Comment 74 Łukasz Siudut 2017-07-06 14:05:55 UTC
Perfect, thank you! I'll give it a try this evening or tomorrow evening :).

Also sorry for not testing it earlier, my laptop was in repair.
Comment 75 Emmanuel Grumbach 2017-07-11 14:29:17 UTC
I am closing the bug. You can keep commenting here and I'll keep being notified. Please reopen in case the bug is not solved for you.

Thanks.
Comment 76 Nathan Baker 2017-07-11 18:12:04 UTC
Do you know when the patches will get picked up by a release? Maybe a dumb question; not really familiar with how your branches work, sorry.
Comment 77 Emmanuel Grumbach 2017-07-11 18:58:46 UTC
If you run our backport tree, it is already there in the master branch.
If you are talking about kernel releases, I have published the cfg80211 patch, but I can't really tell when it will be applied. I guess I can try to have that applied on 4.13.
The iwlwifi patch will be routed to 4.13, but Luca is on vacation.

Note that Johannes, who maintains it, is on an extended vacation as well (as he said himself earlier in this bug) and hence the cfg80211 / mac80211 tree isn't really maintained as usual.
I know that Kalle will apply urgent stuff, but I can't really say more.
Comment 78 Craig Andrews 2017-09-19 00:14:47 UTC
It appears, at least as of 4.13, that the cfg80211 patch https://bugzilla.kernel.org/attachment.cgi?id=257389 has not been applied.

Is there a plan/timeline as to when it may get accepted?
Comment 79 Emmanuel Grumbach 2017-09-19 04:05:24 UTC
It is in 4.14-rc1:
commit 4e0854a74f08e6a9d847f2c2cfae7b07c931d125
Author: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Date:   Wed Sep 6 13:45:40 2017 +0300

    cfg80211: honor NL80211_RRF_NO_HT40{MINUS,PLUS}
    
    Honor the NL80211_RRF_NO_HT40{MINUS,PLUS} flags in
    reg_process_ht_flags_channel. Not doing so leads can lead
    to a firmware assert in iwlwifi for example.
    
    Fixes: b0d7aa59592b ("cfg80211: allow wiphy specific regdomain management")
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
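
As a rough illustration of what "honoring" the rule flags means, here is a simplified userspace model. It is not the merged patch: struct chan_model, apply_ht40_rule_flags and all bit values below are hypothetical stand-ins for the cfg80211 structures and for the NL80211_RRF_* / IEEE80211_CHAN_* definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define RRF_NO_HT40MINUS  (1u << 0)	/* regulatory rule flags */
#define RRF_NO_HT40PLUS   (1u << 1)
#define CHAN_NO_HT40MINUS (1u << 0)	/* per-channel flags */
#define CHAN_NO_HT40PLUS  (1u << 1)

/* Hypothetical stand-in for the channel structure. */
struct chan_model {
	uint32_t flags;
};

/*
 * A direction is enabled only if the neighbouring 20 MHz channel is usable
 * AND the regulatory rule for this channel does not forbid that direction.
 * Ignoring the second condition corresponds to the behaviour described in
 * the commit message.
 */
static void apply_ht40_rule_flags(struct chan_model *chan,
				  bool below_usable, bool above_usable,
				  uint32_t rule_flags)
{
	assert(chan);

	if (below_usable && !(rule_flags & RRF_NO_HT40MINUS))
		chan->flags &= ~CHAN_NO_HT40MINUS;
	else
		chan->flags |= CHAN_NO_HT40MINUS;

	if (above_usable && !(rule_flags & RRF_NO_HT40PLUS))
		chan->flags &= ~CHAN_NO_HT40PLUS;
	else
		chan->flags |= CHAN_NO_HT40PLUS;
}
```

In this model, a channel whose neighbours are both usable but whose rule carries the NO_HT40PLUS bit ends up with only the HT40+ direction disabled.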