After a suspend-resume cycle, my laptop started being disconnected from the WiFi network after coming back from suspend. This seems to be related to IWLMEI, since disabling CONFIG_IWLMEI makes the problem go away.
Original report on linux-wireless is here:
Many thanks to Emmanuel for helping me pinpoint IWLMEI as the culprit!
Can you please send the output with debug=0xf as a module parameter to iwlwifi.
I hope you have CONFIG_IWLWIFI_DEBUG enabled
I'd like also to have a trace-cmd with:
-e iwlwifi -e iwlwifi_msg -e iwlmei_sap_cmd
Created attachment 300873 [details]
DMESG output from iwlwifi w/o IWLMEI enabled
Created attachment 300874 [details]
trace output w/o IWLMEI enabled
Created attachment 300875 [details]
DMESG output from iwlwifi w/IWLMEI enabled
Created attachment 300876 [details]
trace output w/IWLMEI enabled
Alright, attached both with and without IWLMEI enabled in kconfig...
I can't open you tracing data :(
Not sure why.
I'll take a look at the debug data next week (independence day here), but from what I see, we the driver does get the device, so I can't understand why iwd can't get the device back.
This works perfectly with the NetworkManager we've been testing with.
I'll try to see how we can reproduce, but since I don't have a setup and someone else is handling this now, I can't commit on the timeline.
I gzipped the trace data this time; did you decompress them? I can read them just fine if I re-download the files and decompress...
And no worries, I can work around the issue in the meantime. Maybe networkmanager does an 'ifup' after coming back from suspend or something?
Created attachment 301011 [details]
DMESG when hanging in iwlmei during suspend
So I also managed to get an iwlmei-related kernel hang during suspend. It's not totally reproducible, but it did happen a couple of times; dmesg attached...
Any updates on your end, otherwise? :)
Unfortunately, I haven't got the chance to look at this :(
About the hang, I reviewed all the usages of the iwlmei's internal mutex and I can't see any place where I could have forgotten to release it.
Another thing that is strange is the warning pops up 4 minutes after association and complains that the task has been stuck for more than 120 seconds.
I think I'll need to add trace events for each call from iwlwifi to iwlmei.
*** Bug 216362 has been marked as a duplicate of this bug. ***
JFTR, I can reproduce the issue and confim that `CONFIG_IWLMEI n` makes it go away.
Out of curiosity: Is there a way to disable iwlmei dynamically (command line parameter or so), so that people affected by this can work around without having to recompile their kernel?
*** Bug 216397 has been marked as a duplicate of this bug. ***
So there's no fix available? As this is a regression I propose forcing to disable CONFIG_IWLMEI in the kernel until it's fixed.
Could someone re-open https://bugzilla.kernel.org/show_bug.cgi?id=216397 ?
It was wrongly closed as duplicate of this one and there is no evidence of that.
Created attachment 301847 [details]
Can you please see the attached patch - should fix the issue of the interface staying down after suspend.
The interfaces were removed when iwlmei was removed, so they weren't brought up on resume.
Hi Toke, did you get a chance to test the fix candidate suggested by Avi?
No, I didn't; sorry for the delay!
Tried applying the patch now (to my 5.19.13 distro kernel), but I'm afraid it doesn't compile :(
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_remove’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:1909:35: error: ‘struct iwl_mei’ has no member named ‘device_down’
1909 | down = mei->device_down;
Gregory still has a long backlog of patches to send upstream...
Avi sent a patch based on our internal tree.
I am attaching the patch you're missing.
I am on vacation right now and can't really check if this will be enough, so let's hope it'll work.
Created attachment 302980 [details]
Which kernel should this apply against?
I am getting
Hunk #1 succeeded at 1370 (offset -230 lines).
can't find file to patch at input line 188
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
|diff --git a/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h b/intc-scripts/chromeOS/hdrs/mei/iwl-mei.h
|index a28141a41c..ee14baa013 100644
against linux-6.0 (on nixos)
Is there a git repo somewhere with a commit that has the patches applied?
Oh, and trying to go back to my old work-around configuration, with IWLMEI=n, now get
error: unused option: IWLMEI
This worked in 5.19. Did the configuration parameter name change?
Ah, based on https://github.com/torvalds/linux/commit/8997f5c8a62760db69fd5c56116705796322c8ed the option now now off anyways, and I don’t have to recompile my kernel in 6.0. Let’s try…
Right, so after applying both the wait_for_device_down_before_suspend and the device_down patches to my 5.19.13 kernel, and also removing the BROKEN config on iwlmei, it seems things are more or less working as they should.
That is, the iwlmei module is loaded at boot, and after a suspend-resume the device stays up without having to restart iwd.
I do, however, still get a warning in dmesg during suspend:
[ 100.149818] wlan0: deauthenticating from 78:8a:20:4a:87:b7 by local choice (Reason: 3=DEAUTH_LEAVING)
[ 100.372680] iwlmei 0000:00:16.0-13280904-7792-4fcb-a1aa-5e70cbb1e865: Couldn't get ACK from CSME on HOST_GOES_DOWN message
[ 100.372701] iwlmei 0000:00:16.0-13280904-7792-4fcb-a1aa-5e70cbb1e865: failed to send the SAP_ME_MSG_CHECK_SHARED_AREA message -19
But I'm not sure if that is important?
I think you're missing another patch that I made...
The maintenance was transfered from Luca to Gregory and because of that, we had a long window during which no patches were sent upstream and now the backlog is very big.
You can try our internal backport tree:
I just pushed our latest master there.
All the patches should be merged in 6.2-rc3
Can you try?
I can once nixpkgs has that version.
FWIW, The last months I have been running iwd-1.30 with the patch “[PATCH] netdev: try catching buggy driver behavior to avoid hangs” on an unpatched 6.1.1 kernel, and things were mostly fine, it seems. But due to the weather I did not much go to the places where the reception was so that it would tend to switch stations and get in the bad state.
Oh - I don't think the scenario disussed here is related.
We're talking about suspend resume with iwlmei loaded and iwd.
Ah, of course, sorry.
Anyways, I will test once 6.2 is out, whether I want to or not :-)
Currently iwlmei has been marked as BROKEN because of this bug, so I am afraid it won't be compiled.
I'll see if I can't get an -rc kernel built and tested one of these days...
I think patches were merged in already in 6.1.
Is there a point to check?
You still need to enable iwlmei and build to test since it is disabled by default.
Well, as you say I'll need to build a custom kernel in any case, so can test with either 6.1 or 6.2-rc; which do you prefer?
Okay, so I tried building a 6.2-rc4 kernel, removing the BROKEN tag from iwlmei and enabling it... and compilation fails with:
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_handle_amt_state’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:791:17: error: too many arguments to function ‘iwl_mei_cache.ops->rfkill’
791 | iwl_mei_cache.ops->rfkill(iwl_mei_cache.priv, false, false);
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_handle_csme_taking_ownership’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:832:17: error: too many arguments to function ‘iwl_mei_cache.ops->rfkill’
832 | iwl_mei_cache.ops->rfkill(iwl_mei_cache.priv, true, true);
drivers/net/wireless/intel/iwlwifi/mei/main.c: In function ‘iwl_mei_register’:
drivers/net/wireless/intel/iwlwifi/mei/main.c:1777:25: error: too many arguments to function ‘ops->rfkill’
1777 | ops->rfkill(priv, mei->link_prot_state, false);
Sorry for that.
I'll take this with Gregory, a patch was upstreamed apparently...
Clearly, we need to check how this happened...
I am not working on that part anymore, but the one who does work on this told me all the patches were merged in 6.4.
Created attachment 304276 [details]
compilation fix for 6.4-rc1
It seems there is still a compilation error on 6.4-rc1.
can you please try the attached patch before testing?