Bug 217530

Summary: Waking up from resume locks up on sr device
Product: Power Management Reporter: Joe Breuer (linux-kernel)
Component: Hibernation/Suspend Assignee: Rafael J. Wysocki (rjw)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bagasdotme, bvanassche
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:

Description Joe Breuer 2023-06-07 16:51:01 UTC
I'm running LibreELEC.tv on an x86_64 machine that, following a (kernel) update, now locks up hard while trying to device_resume() => device_lock() on sr 2:0:0:0 (the only sr device in the system).

Through some digging of my own, I can pretty much isolate the fault to this device_lock() call:
https://elixir.bootlin.com/linux/v6.3.4/source/drivers/base/power/main.c#L919

I put an additional debug line exactly before the device_lock(dev) call, like this:
dev_info(dev, "device_lock() in device_resume()");

This is the last diagnostic I see; that device_lock() call never returns, i.e. line 920 in main.c is never reached (confirmed via TRACE_RESUME).
The device, in my case, is printed as sr 2:0:0:0.
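
For reference, the instrumentation sits in device_resume() roughly like this (a sketch, not the verbatim v6.3 source; the rest of the function body is elided):

static int device_resume(struct device *dev, pm_message_t state, bool async)
{
	/* ... */

	/* added diagnostic: this is the last message I see before the hang */
	dev_info(dev, "device_lock() in device_resume()");
	device_lock(dev);	/* main.c:919 - never returns for sr 2:0:0:0 */

	/* ... */
}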

Knowing this, as a workaround, booting with libata.force=3:disable (libata port 3 corresponds to the SATA channel that sr 2:0:0:0 is attached to) allows suspend/resume to work correctly (but the optical drive is not accessible, obviously).

When resume hangs, the kernel is not _completely_ locked up; interestingly, the machine still responds to pings, and I see the e1000e 'link up' message a couple of seconds after the hanging sr 2:0:0:0 device_lock().
Magic SysRq, however, does NOT work in that state, possibly because not enough of USB is resumed yet. Resuming devices seems to broadly follow a kind of breadth-first order; I see USB ports getting resumed shortly before the lockup, but no USB (target) devices.

This is a regression, earlier kernels would work correctly on the exact same hardware. Since it's an 'embedded' type (LibreELEC.tv) install that overwrites its system parts completely on each update, I don't have a clear historical record of kernel versions. From the timeline and my memory, moving from 5.x to 6.x would make sense. Due to the nature of the system, it's somewhat inconvenient for me to try numerous kernel versions blindly for a bisection; I will try to test against some current 5.x soon, however.

I do have the hope that this information already might give someone with more background a strong idea about the issue.

Next, I will try to put debug_show_all_locks() before device_lock(), since I can't Alt+SysRq+d.
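
(Sketch of that next step, same spot as above; as far as I know debug_show_all_locks() only prints anything with lockdep (CONFIG_LOCKDEP) enabled, and it is essentially what Alt+SysRq+d would have dumped:)

	/* may need #include <linux/debug_locks.h> */
	debug_show_all_locks();		/* dump all currently held locks */
	dev_info(dev, "device_lock() in device_resume()");
	device_lock(dev);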
Comment 1 Joe Breuer 2023-06-07 17:33:57 UTC
... getting that debug_show_all_locks() added went much quicker than I'd estimated.

Unfortunately, it changes behaviour: the lockup now shows up on sd 0:0:0:0, and therefore can no longer be worked around with libata.force=3:disable.

I have some output from the end of the resume attempt including held locks:

https://postimg.cc/zV4xfK8D

The *two* kworkers holding ata_scsi_dev_rescan et al do look like a... possibly not so great idea to me, but I don't know nearly enough about the driver model to say whether that's working as intended.
Comment 2 Bagas Sanjaya 2023-06-08 02:35:07 UTC
(In reply to Joe Breuer from comment #0)
> I'm running LibreELEC.tv on an x86_64 machine that, following a (kernel)
> update, now locks up hard while trying to device_resume() => device_lock()
> on sr 2:0:0:0 (the only sr device in the system).
> 

What kernel version before and after update?

> This is a regression, earlier kernels would work correctly on the exact same
> hardware. Since it's an 'embedded' type (LibreELEC.tv) install that
> overwrites its system parts completely on each update, I don't have a clear
> historical record of kernel versions. From the timeline and my memory,
> moving from 5.x to 6.x would make sense. Due to the nature of the system,
> it's somewhat inconvenient for me to try numerous kernel versions blindly
> for a bisection; I will try to test against some current 5.x soon, however.
> 

Can you mention the last known good kernel version, and what version you have
this regression with?

Regarding bisection, you have to do one if no developers can figure out what's
going on by code inspection alone (unfortunately). Since you have this
regression on a production machine, can you set up a testing environment (with
the exact same setup) and do the bisection there?
Comment 3 Joe Breuer 2023-06-09 18:45:01 UTC
Thank you for the feedback and suggestions!

Going through kernels that are officially released with LibreELEC, I can at least now say that 5.10.161 is known good; 6.1.7 is bad.

Trying a small number of other kernel versions, 5.17.15, 5.18.19 and 5.19.17 all came up bad as well.

Pondering things while those compiles were running, I found a way to test bisection kernels that fits into "maintenance windows". The lucky break is that there's enough space on LibreELEC's boot partition for an additional kernel, which can then be started selectively from the EXTLINUX prompt. That way, I could avoid hefty "TV no work" service level violations ;-)
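
(For anyone wanting to replicate that: the extra kernel is just an additional label in the EXTLINUX/syslinux config on the boot partition, chosen by typing its label at the boot prompt. A purely hypothetical sketch - the file name, label and APPEND arguments depend on the individual install:

LABEL bisect
  KERNEL /KERNEL.bisect
  APPEND <same arguments as the default LibreELEC entry>

The default entry stays untouched, so an unattended reboot still comes up with the known-good kernel.)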

The bisection straightforwardly leads to this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85

scsi: core: pm: Rely on the device driver core for async power management

That... certainly makes sense.

It also suggests trying scsi_mod.scan=sync as a workaround, but unfortunately that changes nothing - still two kworkers in scsi_rescan_device holding locks, which I can only presume are what prevents the device_lock() in main.c:919 mentioned above from proceeding.

I'm open to suggestions on how to analyze this further.


I don't have what I need to duplicate that setup - the only component I can think of that's really relevant would be the mainboard, an Intel DH77EB. I only have the one that's built into an HTPC case, and I don't have anything on hand that I could swap in there, either.

But, having found that nice "additional kernel" workaround, I can continue to try out things with what I have.
Comment 4 Joe Breuer 2023-07-01 12:38:38 UTC
... for anyone coming across this: there is a patch/fix, and the relevant discussion is over on the mailing lists:

https://lore.kernel.org/all/20230615083326.161875-1-dlemoal@kernel.org/

The way I understand it, the fix is on track to be picked up for the 6.1 and 6.3 stable series:

https://lore.kernel.org/all/20230626180800.790347981@linuxfoundation.org/
https://lore.kernel.org/all/20230626180805.887749702@linuxfoundation.org/
Comment 5 Joe Breuer 2023-07-15 09:03:37 UTC
Current LibreELEC.tv git master now comes with kernel 6.4.1, which I can confirm contains a working fix for my system.

I also feel like I should add the regression tracker for this:

https://linux-regtracking.leemhuis.info/regzbot/regression/lore/2d1fdf6d-682c-a18d-2260-5c5ee7097f7d@gmail.com/

Awesome to see how such obscure issues are getting fixed!