Bug 215937

Summary: IWLMEI sets wlan device in DOWN state after suspend
Product: Drivers Reporter: Toke Høiland-Jørgensen (toke)
Component: network-wireless-intelAssignee: Default virtual assignee for network-wireless-intel (drivers_network-wireless-intel)
Status: NEEDINFO ---    
Severity: normal CC: andyrtr, avraham.stern, dereklh24, emmanuel.grumbach, kernelbugs, kvalo, mail, pavol, pbrobinson, plumerlis
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.17 Subsystem:
Regression: No Bisected commit-id:
Attachments: DMESG output from iwlwifi w/o IWLMEI enabled
trace output w/o IWLMEI enabled
DMESG output from iwlwifi w/IWLMEI enabled
trace output w/IWLMEI enabled
DMESG when hanging in iwlmei during suspend
wait_for_device_down_before_suspend
device_down.patch
compilation fix for 6.4-rc1

Description Toke Høiland-Jørgensen 2022-05-03 11:38:19 UTC
After a suspend-resume cycle, my laptop started being disconnected from the WiFi network after coming back from suspend. This seems to be related to IWLMEI, since disabling CONFIG_IWLMEI makes the problem go away.

Original report on linux-wireless is here:
https://lore.kernel.org/all/87czhe39p6.fsf@toke.dk/

Many thanks to Emmanuel for helping me pinpoint IWLMEI as the culprit!
Comment 1 Emmanuel Grumbach 2022-05-03 13:22:08 UTC
Can you please send the output with debug=0xf as a module parameter to iwlwifi.
I hope you have CONFIG_IWLWIFI_DEBUG enabled

I'd like also to have a trace-cmd with:

-e iwlwifi -e iwlwifi_msg -e iwlmei_sap_cmd

Thanks.
Comment 2 Toke Høiland-Jørgensen 2022-05-04 14:38:30 UTC
Created attachment 300873 [details]
DMESG output from iwlwifi w/o IWLMEI enabled
Comment 3 Toke Høiland-Jørgensen 2022-05-04 14:39:07 UTC
Created attachment 300874 [details]
trace output w/o IWLMEI enabled
Comment 4 Toke Høiland-Jørgensen 2022-05-04 14:43:55 UTC
Created attachment 300875 [details]
DMESG output from iwlwifi w/IWLMEI enabled
Comment 5 Toke Høiland-Jørgensen 2022-05-04 14:44:20 UTC
Created attachment 300876 [details]
trace output w/IWLMEI enabled
Comment 6 Toke Høiland-Jørgensen 2022-05-04 14:44:59 UTC
Alright, attached both with and without IWLMEI enabled in kconfig...
Comment 7 Emmanuel Grumbach 2022-05-04 14:50:48 UTC
Thanks!

I can't open you tracing data :(
Not sure why.

I'll take a look at the debug data next week (independence day here), but from what I see, we the driver does get the device, so I can't understand why iwd can't get the device back.
This works perfectly with the NetworkManager we've been testing with.
Comment 8 Emmanuel Grumbach 2022-05-04 14:51:20 UTC
I'll try to see how we can reproduce, but since I don't have a setup and someone else is handling this now, I can't commit on the timeline.
Comment 9 Toke Høiland-Jørgensen 2022-05-04 15:08:19 UTC
I gzipped the trace data this time; did you decompress them? I can read them just fine if I re-download the files and decompress...

And no worries, I can work around the issue in the meantime. Maybe networkmanager does an 'ifup' after coming back from suspend or something?
Comment 10 Toke Høiland-Jørgensen 2022-05-21 22:16:19 UTC
Created attachment 301011 [details]
DMESG when hanging in iwlmei during suspend

So I also managed to get an iwlmei-related kernel hang during suspend. It's not totally reproducible, but it did happen a couple of times; dmesg attached...

Any updates on your end, otherwise? :)
Comment 11 Emmanuel Grumbach 2022-05-23 05:50:36 UTC
Unfortunately, I haven't got the chance to look at this :(

About the hang, I reviewed all the usages of the iwlmei's internal mutex and I can't see any place where I could have forgotten to release it.
Another thing that is strange is the warning pops up 4 minutes after association and complains that the task has been stuck for more than 120 seconds.

I think I'll need to add trace events for each call from iwlwifi to iwlmei.
Comment 12 Joachim Breitner 2022-08-15 12:14:12 UTC
*** Bug 216362 has been marked as a duplicate of this bug. ***
Comment 13 Joachim Breitner 2022-08-17 18:00:07 UTC
JFTR, I can reproduce the issue and confim that `CONFIG_IWLMEI n` makes it go away.

Out of curiosity: Is there a way to disable iwlmei dynamically (command line parameter or so), so that people affected by this can work around without having to recompile their kernel?
Comment 14 Emmanuel Grumbach 2022-08-26 14:34:05 UTC
*** Bug 216397 has been marked as a duplicate of this bug. ***
Comment 15 Kalle Valo 2022-08-29 10:23:46 UTC
So there's no fix available? As this is a regression I propose forcing to disable CONFIG_IWLMEI in the kernel until it's fixed.
Comment 16 Blanche Schaefer 2022-09-08 19:08:42 UTC
Could someone re-open https://bugzilla.kernel.org/show_bug.cgi?id=216397 ?

It was wrongly closed as duplicate of this one and there is no evidence of that.
Comment 17 Avi Stern 2022-09-22 15:00:00 UTC
Created attachment 301847 [details]
wait_for_device_down_before_suspend

Hi Toke,

Can you please see the attached patch - should fix the issue of the interface staying down after suspend.
The interfaces were removed when iwlmei was removed, so they weren't brought up on resume.
Comment 18 Emmanuel Grumbach 2022-10-11 14:54:09 UTC
Hi Toke, did you get a chance to test the fix candidate suggested by Avi?

Thanks.
Comment 19 Toke Høiland-Jørgensen 2022-10-11 19:28:09 UTC
No, I didn't; sorry for the delay!

Tried applying the patch now (to my 5.19.13 distro kernel), but I'm afraid it doesn't compile :(

drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_remove’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:1909:35: error: ‘struct iwl_mei’ has no member named ‘device_down’
 1909 |                         down = mei->device_down;
      |                                   ^~
Comment 20 Emmanuel Grumbach 2022-10-11 19:49:51 UTC
I see...
Gregory still has a long backlog of patches to send upstream...
Avi sent a patch based on our internal tree.
I am attaching the patch you're missing.
I am on vacation right now and can't really check if this will be enough, so let's hope it'll work.
Comment 21 Emmanuel Grumbach 2022-10-11 19:50:53 UTC
Created attachment 302980 [details]
device_down.patch
Comment 22 Joachim Breitner 2022-10-12 07:11:28 UTC
Which kernel should this apply against?

I am getting

Hunk #1 succeeded at 1370 (offset -230 lines).
can't find file to patch at input line 188
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h b/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h
|index a28141a41c..ee14baa013 100644
|--- a/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h
|+++ b/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h
--------------------------


against linux-6.0 (on nixos)

Is there a git repo somewhere with a commit that has the patches applied?
Comment 23 Joachim Breitner 2022-10-12 07:15:48 UTC
Oh, and trying to go back to my old work-around configuration, with IWLMEI=n,  now get

error: unused option: IWLMEI

This worked in 5.19. Did the configuration parameter name change?
Comment 24 Joachim Breitner 2022-10-12 07:46:10 UTC
Ah, based on https://github.com/torvalds/linux/commit/8997f5c8a62760db69fd5c56116705796322c8ed the option now now off anyways, and I don’t have to recompile my kernel in 6.0. Let’s try…
Comment 25 Toke Høiland-Jørgensen 2022-10-12 12:47:24 UTC
Right, so after applying both the wait_for_device_down_before_suspend and the device_down patches to my 5.19.13 kernel, and also removing the BROKEN config on iwlmei, it seems things are more or less working as they should.

That is, the iwlmei module is loaded at boot, and after a suspend-resume the device stays up without having to restart iwd.

I do, however, still get a warning in dmesg during suspend:

[  100.149818] wlan0: deauthenticating from 78:8a:20:4a:87:b7 by local choice (Reason: 3=DEAUTH_LEAVING)
[  100.372680] iwlmei 0000:00:16.0-13280904-7792-4fcb-a1aa-5e70cbb1e865: Couldn't get ACK from CSME on HOST_GOES_DOWN message
[  100.372701] iwlmei 0000:00:16.0-13280904-7792-4fcb-a1aa-5e70cbb1e865: failed to send the SAP_ME_MSG_CHECK_SHARED_AREA message -19

But I'm not sure if that is important?
Comment 26 Emmanuel Grumbach 2022-10-12 14:12:24 UTC
I think you're missing another patch that I made...
oh well...
The maintenance was transfered from Luca to Gregory and because of that, we had a long window during which no patches were sent upstream and now the backlog is very big.

You can try our internal backport tree:
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

I just pushed our latest master there.
Comment 27 Emmanuel Grumbach 2023-01-15 10:22:19 UTC
All the patches should be merged in 6.2-rc3
Can you try?
Thanks.
Comment 28 Joachim Breitner 2023-01-16 08:50:00 UTC
I can once nixpkgs has that version.

FWIW, The last months I have been running iwd-1.30 with the patch “[PATCH] netdev: try catching buggy driver behavior to avoid hangs” on an unpatched 6.1.1 kernel, and things were mostly fine, it seems. But due to the weather I did not much go to the places where the reception was so that it would tend to switch stations and get in the bad state.
Comment 29 Emmanuel Grumbach 2023-01-16 08:53:19 UTC
Oh - I don't think the scenario disussed here is related.
We're talking about suspend resume with iwlmei loaded and iwd.
Comment 30 Joachim Breitner 2023-01-16 08:55:15 UTC
Ah, of course, sorry.

Anyways, I will test once 6.2 is out, whether I want to or not :-)
Comment 31 Emmanuel Grumbach 2023-01-16 09:17:57 UTC
Currently iwlmei has been marked as BROKEN because of this bug, so I am afraid it won't be compiled.
Comment 32 Toke Høiland-Jørgensen 2023-01-16 14:44:04 UTC
I'll see if I can't get an -rc kernel built and tested one of these days...
Comment 33 Emmanuel Grumbach 2023-01-16 14:54:31 UTC
I think patches were merged in already in 6.1.
Is there a point to check?
You still need to enable iwlmei and build to test since it is disabled by default.
Comment 34 Toke Høiland-Jørgensen 2023-01-16 15:16:50 UTC
Well, as you say I'll need to build a custom kernel in any case, so can test with either 6.1 or 6.2-rc; which do you prefer?
Comment 35 Emmanuel Grumbach 2023-01-16 15:17:36 UTC
6.2-rc
Comment 36 Toke Høiland-Jørgensen 2023-01-20 22:19:02 UTC
Okay, so I tried building a 6.2-rc4 kernel, removing the BROKEN tag from iwlmei and enabling it... and compilation fails with:

drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_handle_amt_state’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:791:17: error: too many arguments to function ‘iwl_mei_cache.ops->rfkill’
  791 |                 iwl_mei_cache.ops->rfkill(iwl_mei_cache.priv, false, false);
      |                 ^~~~~~~~~~~~~
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_handle_csme_taking_ownership’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:832:17: error: too many arguments to function ‘iwl_mei_cache.ops->rfkill’
  832 |                 iwl_mei_cache.ops->rfkill(iwl_mei_cache.priv, true, true);
      |                 ^~~~~~~~~~~~~
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_register’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:1777:25: error: too many arguments to function ‘ops->rfkill’
 1777 |                         ops->rfkill(priv, mei->link_prot_state, false);
      |                         ^~~
Comment 37 Emmanuel Grumbach 2023-01-22 19:03:35 UTC
Oh no...

Sorry for that.
I'll take this with Gregory, a patch was upstreamed apparently...
Clearly, we need to check how this happened...
Comment 38 Emmanuel Grumbach 2023-05-16 07:06:22 UTC
I am not working on that part anymore, but the one who does work on this told me all the patches were merged in 6.4.

Thanks
Comment 39 Avi Stern 2023-05-16 07:16:28 UTC
Created attachment 304276 [details]
compilation fix for 6.4-rc1

It seems there is still a compilation error on 6.4-rc1.
can you please try the attached patch before testing?

Thanks.