Bug 219353 - rtw88/btrtl: reading RTL8821CU firmware during PMSG_THAW hibernation phase causes hibernation hangs
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P3 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-05 13:12 UTC by Maciej S. Szmigiero
Modified: 2024-11-01 13:47 UTC
CC List: 3 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Description Maciej S. Szmigiero 2024-10-05 13:12:06 UTC
Devices that read firmware during PMSG_THAW hibernation phase seem to occasionally cause hibernation to hang after the image has been written and the system is powering down.

The issue was reproduced at least on upstream kernel versions 6.11.1 and 6.10.6 with the root file system being btrfs.

The specific device that attempts a firmware read during that phase is a USB RTL8821CU. It gets reset at that stage due to commit 04b8c8143d46 ("btusb: fix Realtek suspend/resume"), so after that thaw/reset it calls request_firmware() against the root filesystem while the hibernation image is being written.

It usually succeeds, but fairly often it deadlocks somewhere in the btrfs code, and the system then fails to power off after writing the hibernation image:
power_off() calls dpm_suspend_start(), which calls dpm_prepare(), which waits for device probe to finish.
And device probe is stuck forever trying to load that USB stick firmware from the filesystem.

So in the end the system never powers off during (after) hibernation.

I asked [1] on linux-pm whether filesystem read access is supposed to be working normally at that point, but sadly got no answer.
That email also includes a "Show Blocked State" trace showing the call stacks during the failed power-off on 6.10.6.

[1]: https://lore.kernel.org/linux-pm/3c95fb54-9cac-4b4f-8e1b-84ca041b57cb@maciej.szmigiero.name/
Comment 1 Maciej S. Szmigiero 2024-10-19 14:43:49 UTC
Pavel has responded [2] that reading firmware is not supported during the PMSG_THAW hibernation phase.

This means that this is not a PM/hibernation bug but an rtw88/btrtl one - reassigning accordingly.

I am currently using a simple workaround at [3] but having a proper fix would be nice.


[2]: https://lore.kernel.org/linux-pm/ZwF6JEHIQda92sIL@duo.ucw.cz/
[3]: https://github.com/torvalds/linux/commit/f6188a940324b4bc8f51dcb1a9ae1a489e57bd1d
Comment 2 Ping-Ke Shih 2024-10-21 00:52:32 UTC
I'm not familiar with hibernation or the firmware loader. According to the kernel documentation [4], the firmware cache mechanism can handle the suspend/resume and hibernation cases, so drivers should not need any special handling.

Have you enabled the option below on your system?
```
config FW_CACHE
        bool "Enable firmware caching during suspend"
        depends on PM_SLEEP
        default y if PM_SLEEP
        help
          Because firmware caching generates uevent messages that are sent
          over a netlink socket, it can prevent suspend on many platforms.
          It is also not always useful, so on such platforms we have the
          option.

          If unsure, say Y.
```

If it is already enabled, digging into why it doesn't work may be one way forward.

If you think rtw88 needs fixes, could you please share an example of how I could do that? I suspect other USB devices have similar problems, so there must be some proper handling that rtw88 is missing.

[4] https://www.kernel.org/doc/html/v6.11/driver-api/firmware/firmware_cache.html
Comment 3 Maciej S. Szmigiero 2024-10-21 10:47:40 UTC
I have CONFIG_FW_CACHE enabled, however in this case I think it does not work because commit 04b8c8143d46 ("btusb: fix Realtek suspend/resume") causes the USB core to re-bind the driver on thaw (due to reset_resume being set).

Cached firmware is AFAIK attached using the devres mechanism which means the cache is purged when the driver gets detached by this reset_resume mechanism.
So when it is bound again there's no firmware in cache anymore.
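
The effect described above can be illustrated with a toy userspace model. It assumes, as the comment does, that the firmware cache bookkeeping hangs off the device as a managed resource that is released on driver unbind. This is not kernel code: `toy_device`, `toy_remember_fw()`, `toy_fw_remembered()`, `toy_unbind()` and the file name `example_fw.bin` are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy userspace model, NOT kernel code: one device-managed resource node. */
struct toy_devres {
    struct toy_devres *next;
    char fw_name[64];
};

struct toy_device {
    struct toy_devres *resources; /* released as a whole on driver unbind */
};

/* Record a firmware name on the device as a managed resource. */
int toy_remember_fw(struct toy_device *dev, const char *name)
{
    struct toy_devres *res;

    assert(dev != NULL && name != NULL);
    res = calloc(1, sizeof(*res));
    if (!res)
        return -1;
    /* calloc zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(res->fw_name, name, sizeof(res->fw_name) - 1);
    res->next = dev->resources;
    dev->resources = res;
    return 0;
}

/* Would a later cache lookup still know about this firmware name? */
bool toy_fw_remembered(const struct toy_device *dev, const char *name)
{
    assert(dev != NULL && name != NULL);
    for (const struct toy_devres *r = dev->resources; r; r = r->next)
        if (strcmp(r->fw_name, name) == 0)
            return true;
    return false;
}

/* Driver unbind: every managed resource of the device is freed. */
void toy_unbind(struct toy_device *dev)
{
    assert(dev != NULL);
    while (dev->resources) {
        struct toy_devres *next = dev->resources->next;
        free(dev->resources);
        dev->resources = next;
    }
}
```

In this model, a name recorded before `toy_unbind()` is no longer found afterwards, so a re-bound driver would have to go back to the filesystem for its firmware.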

I see multiple different possibilities of fixing this:
Possibility 1) Just don't probe the device during PMSG_THAW hibernation phase, as WiFi/BT NIC is hardly useful for hibernation snapshot writing anyway.
This is more or less like my current workaround from [3], just would need some cleanups like exporting "in_suspend" via some getter function or replacing this variable access with system-wide PM callbacks registered by rtw88 and btrtl drivers.

Possibility 2) Add some custom way to cache firmware in rtw88 and btrtl drivers which does survive driver unbind and re-bind.

Possibility 3) Avoid resetting the device via reset_resume altogether. I'm not sure however whether it is possible to achieve what commit 04b8c8143d46 did in some other way.
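
Possibility 1 can be sketched as a toy userspace model. It is only an illustration of the control flow, not the workaround from [3] and not kernel code: `toy_system`, `toy_hibernation_in_progress()` (standing in for a getter around "in_suspend"), `toy_request_firmware()` and `toy_probe()` are hypothetical names, and the `-EBUSY` return value is an arbitrary choice made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. */
struct toy_system {
    bool in_suspend; /* stands in for the "in_suspend" variable */
    int fw_reads;    /* counts simulated firmware reads from the filesystem */
};

/* Hypothetical getter that a driver could call instead of touching the flag. */
bool toy_hibernation_in_progress(const struct toy_system *sys)
{
    assert(sys != NULL);
    return sys->in_suspend;
}

/* Simulated firmware load: in the real scenario this hits the root fs. */
int toy_request_firmware(struct toy_system *sys)
{
    assert(sys != NULL);
    sys->fw_reads++;
    return 0;
}

/* Probe that refuses to run while the hibernation image is being written. */
int toy_probe(struct toy_system *sys)
{
    assert(sys != NULL);
    if (toy_hibernation_in_progress(sys))
        return -EBUSY; /* assumed error code; skips the firmware read */
    return toy_request_firmware(sys);
}
```

The point of the model is that the probe path returns before any filesystem access while the flag is set, so nothing is left for the power-off path to wait on.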
Comment 4 Ping-Ke Shih 2024-11-01 02:06:01 UTC
The discussion at [1] is somewhat similar to this problem. Built-in firmware [2] or an initramfs might also be a solution.


[1] https://lore.kernel.org/linux-wireless/tencent_9E86EA49928C177B25C1691D4454AAB21106@qq.com/T/#m1d3eec1b151788309fce5f38734d2a32ea062812
[2] https://www.kernel.org/doc/html/v6.11/driver-api/firmware/built-in-fw.html
Comment 5 Maciej S. Szmigiero 2024-11-01 13:47:09 UTC
Here, the problem happens at hibernation snapshot writing time, not at boot time as in [1].
An initramfs is only used at boot time, and by the time the machine gets hibernated it is long gone from RAM.

On the other hand, using built-in firmware doesn't scale well - to be compatible with all rtw88/btrtl devices one would need to include 1 MiB+ of firmware data in the kernel (and that's XZ-compressed size, not raw):
$ du --max-depth=0 -h /lib/firmware/{rtl_bt,rtw88}
832K    /lib/firmware/rtl_bt
332K    /lib/firmware/rtw88
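
For reference, the two figures above add up to 832 KiB + 332 KiB = 1164 KiB, or roughly 1.14 MiB, assuming the `K` suffix printed by `du -h` means KiB. A minimal sketch of that sum follows; `total_kib()` is a hypothetical helper written for this note.

```c
#include <assert.h>
#include <stddef.h>

/* Sum directory sizes given in KiB (du -h reports powers of 1024). */
unsigned long total_kib(const unsigned long *sizes_kib, size_t n)
{
    unsigned long total = 0;

    assert(sizes_kib != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += sizes_kib[i];
    return total;
}
```

Calling it with the sizes 832 and 332 gives 1164, which is above the 1024 KiB that make up 1 MiB.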
