Bug 199693
Summary: | [bisect 08810a4119aaebf6318f209ec5dd9828e969cba4] (pci runtime) System freeze after resuming from suspend (amdgpu) | ||
---|---|---|---|
Product: | Power Management | Reporter: | Thomas Martitz (kugel) |
Component: | Hibernation/Suspend | Assignee: | Rafael J. Wysocki (rjw) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | john-s-84, kugel, lenb, rjw, rui.zhang, ukrkyi, yu.c.chen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | v4.17-rc5 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Patch making resume work
PM / core: Make direct_complete depend on pm_runtime_enabled() initially PM / core: Debug devices with direct_complete set and runtime PM disabled dmesg PM / core: Debug devices with direct_complete set and runtime PM disabled dmesg (no_pm_callbacks) PM / core: Set direct_complete for devices without PM callbacks |
Description
Thomas Martitz
2018-05-11 15:23:38 UTC
*** This bug has been marked as a duplicate of bug 199609 *** This is not the same as bug 199609 I'm afraid. This bug describes a behavior where the very first resume. However, if I make a fix (I'll post it later this day hopefully) for this bug, then I start seeing bug 199609 too. In fact, amdgpu developers asked me to file a bug report against the kernel while bug 199609 was moved to amdgpu. It looks like some pci devices(graphic) are not safe to be kept in runtime suspended when doing suspend/resume. Add Rafael too. Not necessarily, I'd say. Using the attached patch (I posted this to linux-pm, too). Makes it work. Also, I think it's another dev that affects amdgpu somehow, restricting the above patch to amdgpu-bound devices does not fix the issue. Created attachment 275999 [details]
Patch making resume work
This is a good finding, thanks! The problem is, though, that the change in comment #5 should not affect any functionality in theory. What is does is to prevent the initial value of power.direct_complete from depending on the pm_runtime_suspended(dev) return value at the time of its evaluation, which is in device_prepare(). However, later on, __device_suspend() will evaluate pm_runtime_status_suspended(dev) and it will clear power.direct_complete if that returns "false" without invoking any callbacks. Moreover, it will clear power.direct_complete for the device's parent and suppliers in that case. Now, the execution of device_prepare() takes place in the parents-first order, whereas the execution of __device_suspend() occurs in the children-first order, so no PM callbacks should be executed for the device itself and its parent, and any of its suppliers before setting power.direct_complete in device_prepare() and clearing it in __device_suspend() if the device is not runtime-suspended at that point. Overall, the patch from comment #5 simply removes an optimization and while it can be applied in principle, it would be good to understand why it helps in the first place. Created attachment 276039 [details]
PM / core: Make direct_complete depend on pm_runtime_enabled() initially
Please check if this patch is still sufficient to make the resume work on your system.
Huge thanks for your efforts. Your explaination sheds some light to this (for me anyway) Unfortunately, your proposed patch in comment #7 does not make resume work on my system. (In reply to Thomas Martitz from comment #8) > Huge thanks for your efforts. Your explaination sheds some light to this > (for me anyway) Unfortunately, your proposed patch in comment #7 does not > make resume work on my system. But that confirms my suspicion. :-) What appears to be happening is that there is a device in your system with runtime PM disabled and with runtime PM status set to "suspended". If you remove the pm_runtime_enabled() component from the initial value of power.direct_complete for that device (which also effectively happens in the patch from comment #5), power.direct_complete will be set from it and that will effectively prevent any suspend/resume callbacks from being invoked for it and the device will stay functional throughout the suspend. It will not resume properly, but that may not be a problem if it hasn't suspended in the first place. Now, amdgpu probably depends on that device in some way, so keeping that device alive throughout the cycle makes amdgpu happy. This likely means that amdgpu has an ordering issue exposed by the core change you've found. Of course, setting power.direct_complete for devices with runtime PM disabled doesn't make any sense whatever, so the underlying bug needs to be fixed, so we'll need to find the offending device and deal with it. I'll cut a patch for that in the next couple of days, but you may try to do it yourself now that you know what to look for. :-) Created attachment 276051 [details] PM / core: Debug devices with direct_complete set and runtime PM disabled Something like this. It makes the change made by the patch from comment #5, but in addition to that it prints messages for devices having power.direct_complete set and runtime PM disabled (that should never happen) and aborts suspend in the next phase if at least one device like that is found. Please apply and try to suspend. The suspend should be aborted (if I'm not mistaken), so attach a dmesg log generated after that. Created attachment 276075 [details]
dmesg
Here's the output. Please note for how many devices your debug statement is printed, even though it shouldn't happen.
Created attachment 276081 [details]
PM / core: Debug devices with direct_complete set and runtime PM disabled
OK, thanks!
The majority of these devices don't have any PM callbacks, however, so it actually is safe to set direct_complete for them even without enabled runtime PM.
Please test this patch the same as the previous one.
Created attachment 276109 [details]
dmesg (no_pm_callbacks)
Hm, with that patch no debug message is printed. I understand that this is not your expectation?
In contrast do your previous patch, suspend was *not* aborted with this one and resume works (i guess it shouldn't be since my patch is included and no dev seems to match the criteria)
(In reply to Thomas Martitz from comment #13) > Created attachment 276109 [details] > dmesg (no_pm_callbacks) > > Hm, with that patch no debug message is printed. I understand that this is > not your expectation? It is one of the outcomes I took into consideration. :-) > In contrast do your previous patch, suspend was *not* aborted with this one > and resume works (i guess it shouldn't be since my patch is included and no > dev seems to match the criteria) OK, so it looks like something depends on a device without PM callbacks. Interesting ... Created attachment 276117 [details]
PM / core: Set direct_complete for devices without PM callbacks
Please test this patch.
It should fix the resume issue for you and also it restores the direct_complete optimization for parents of devices without PM callbacks.
Indeed, your patch lets resume work. Thanks! \o/ Final patch merged as commit c62ec4610c40bcc44f2d3d5ed1c312737279e2f3. |