Bug 217530
| Summary: | Waking up from resume locks up on sr device | | |
|---|---|---|---|
| Product: | Power Management | Reporter: | Joe Breuer (linux-kernel) |
| Component: | Hibernation/Suspend | Assignee: | Rafael J. Wysocki (rjw) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | bagasdotme, bvanassche |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Joe Breuer
2023-06-07 16:51:01 UTC
... getting that debug_show_all_locks() added went much quicker than I'd estimated. Unfortunately, it changes behaviour - the lockup is now shown on sd 0:0:0:0, and therefore cannot be worked around with libata.force=3:disable any more.

I have some output from the end of the resume attempt including held locks:
https://postimg.cc/zV4xfK8D

The *two* kworkers holding ata_scsi_dev_rescan et al do look like a... possibly not so great idea to me, but I don't know nearly enough about the driver model to say whether that's working as intended.

(In reply to Joe Breuer from comment #0)
> I'm running LibreELEC.tv on an x86_64 machine that, following a (kernel)
> update, now locks up hard while trying to device_resume() => device_lock()
> on sr 2:0:0:0 (the only sr device in the system).
>

What kernel version before and after update?

> This is a regression, earlier kernels would work correctly on the exact same
> hardware. Since it's an 'embedded' type (LibreELEC.tv) install that
> overwrites its system parts completely on each update, I don't have a clear
> historical record of kernel versions. From the timeline and my memory,
> moving from 5.x to 6.x would make sense. Due to the nature of the system,
> it's somewhat inconvenient for me to try numerous kernel versions blindly
> for a bisection; I will try to test against some current 5.x soon, however.
>

Can you mention last known good kernel version? And what version you have this regression with?

Regarding bisection, you have to do one if no developers can figure out what's going on by code inspection alone (unfortunately). Since you have this regression on production machine, can you set up a testing environment (with exact same setup) and do bisection there?

Thank you for the feedback and suggestions!

Going through kernels that are officially released with LibreELEC, I can at least now say that 5.10.161 is known good; 6.1.7 is bad.
Trying a small number of other kernel versions, 5.17.15, 5.18.19, 5.19.17 those all came up bad, as well.

Pondering things while those (compiles) were running, I found a way to test bisection kernels that fits in "maintenance windows". The lucky break is that there's enough space on LibreELEC's boot partition for an additional kernel, which can then be started only selectively from the EXTLINUX prompt. That way, I could avoid hefty "TV no work" service level violations ;-)

The bisection straightforwardly leads to this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85
scsi: core: pm: Rely on the device driver core for async power management

That... certainly makes sense. It also suggests trying scsi_mod.scan=sync as a workaround, but unfortunately that changes nothing - still two kworkers in scsi_rescan_device holding locks (which I can only presume are what's preventing that device_lock() in main.c:919 as mentioned above to proceed).

I'm open to suggestions how to analyze this further.

I don't have what I need to duplicate that setup - the only component I can think is really relevant would be the mainboard, an Intel DH77EB. I only have the one that's built into an HTPC case, and I don't have anything on hand that I could swap into there, either. But, having found that nice "additional kernel" workaround, I can continue to try out things with what I have.

... for anyone coming across this, there is a patch and fix and the relevant discussion is over on the mailing lists:
https://lore.kernel.org/all/20230615083326.161875-1-dlemoal@kernel.org/

The way I understand it, the fix is on track to be mainlined for 6.1 and 6.3:
https://lore.kernel.org/all/20230626180800.790347981@linuxfoundation.org/
https://lore.kernel.org/all/20230626180805.887749702@linuxfoundation.org/

Current git master LibreELEC.tv now comes with kernel 6.4.1 which I can confirm contains a working fix for my system.
I also feel like I should add the regression tracker for this:
https://linux-regtracking.leemhuis.info/regzbot/regression/lore/2d1fdf6d-682c-a18d-2260-5c5ee7097f7d@gmail.com/

Awesome to see how such obscure issues are getting fixed!
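
For readers unfamiliar with the bisection mentioned in this thread, here is a minimal Python sketch of the underlying idea: a binary search between a known-good and a known-bad point in history. It is only an illustration, not the tooling used in this report. The commit names (`c0` to `c15`), the `culprit` value, and the `hangs_on_resume` check are all hypothetical; in practice each "check" meant building a kernel, booting it from the EXTLINUX prompt, and attempting a suspend/resume cycle.

```python
from typing import Callable, List, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: index known good, hi: index known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression was introduced after mid
    return commits[hi]


# Hypothetical history of 16 commits with a made-up culprit.
history: List[str] = [f"c{i}" for i in range(16)]
culprit = "c11"
tested: List[str] = []


def hangs_on_resume(commit: str) -> bool:
    """Stand-in for 'build, boot, suspend, resume' on real hardware."""
    tested.append(commit)
    return history.index(commit) >= history.index(culprit)


result = first_bad(history, hangs_on_resume)
print(f"first bad commit: {result}, kernels tested: {len(tested)}")
```

The number of kernels to build and boot grows only logarithmically with the size of the commit range, which is what makes a bisection between distant versions feasible within short maintenance windows.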