Lenovo laptop ideapad 320s (BIOS 2.41), Arch Linux. With kernel 4.16.7 resume from hibernate results in kernel panic (caps lock blinking). Resume from suspend works fine. Kernel 4.16.6 and lower worked fine too.
I've bisected the issue to this commit:
b517f3893df80201772523fd7d58b191c70d9ee4 is the first bad commit
Author: Mika Westerberg <firstname.lastname@example.org>
Date: Fri Apr 20 15:22:02 2018 +0300
PCI / PM: Do not clear state_saved in pci_pm_freeze() when smart suspend is set
commit ae860a19f37c686e7c5816e96640168b7174a096 upstream.
If a driver uses DPM_FLAG_SMART_SUSPEND and the device is already
runtime suspended when hibernate is started PCI core skips runtime
resuming the device but still clears pci_dev->state_saved. After the
hibernation image is written pci_pm_thaw_noirq() makes sure subsequent
thaw phases for the device are also skipped leaving it runtime suspended
with pci_dev->state_saved == false.
When the device is eventually runtime resumed pci_pm_runtime_resume()
restores config space by calling pci_restore_standard_config(), however
because pci_dev->state_saved == false pci_restore_state() never actually
restores the config space leaving the device in a state that is not what
the driver might expect.
For example here is what happens for intel-lpss I2C devices once the
hibernation snapshot is taken:
intel-lpss 0000:00:15.0: power state changed by ACPI to D0
intel-lpss 0000:00:1e.0: power state changed by ACPI to D3cold
video LNXVIDEO:00: Restoring backlight state
PM: hibernation exit
i2c_designware i2c_designware.1: Unknown Synopsys component type: 0xffffffff
i2c_designware i2c_designware.0: Unknown Synopsys component type: 0xffffffff
i2c_designware i2c_designware.1: timeout in disabling adapter
i2c_designware i2c_designware.0: timeout in disabling adapter
Since PCI config space is not restored the device is still in D3hot
making MMIO register reads return 0xffffffff.
Fix this by clearing pci_dev->state_saved only if we actually end up
runtime resuming the device.
Fixes: c4b65157aeef (PCI / PM: Take SMART_SUSPEND driver flag into account)
Signed-off-by: Mika Westerberg <email@example.com>
Cc: 4.15+ <firstname.lastname@example.org> # 4.15+
Signed-off-by: Rafael J. Wysocki <email@example.com>
Signed-off-by: Greg Kroah-Hartman <firstname.lastname@example.org>
I have also experience the same problem on 4.17-rc5. But after reverting this commit I can resume from hibernate just fine.
Created attachment 276013 [details]
PCI / PM: Clear state_saved in more cases in pci_pm_freeze()
Can you check if this patch helps, please?
Sorry to report, but the patch did not help. (I've compiled twice just to be sure).
I wish I could tell more about the kernel panic at resume, but all I get is a blank screen and the caps lock blinking once the hibernation image is read completly...
I do remember though that I had similar trouble with resume from hibernate on this laptop prior to 4.16.
Thank you for looking into this!
Created attachment 276017 [details]
Clear state in poweroff only if runtime resumed
Here is another patch to try. I'm not entirely sure if this has anything to do with the problem at hand but I think we may need to do the same in poweroff phase as well.
->poweroff doesn't matter here, because the resume code doesn't see what it does and that's when the crash happens.
Created attachment 276019 [details]
PCI / PM: set state_saved when skipping noirq device suspend
OK, so let's try this one.
Sorry, neither 276017 nor 276019 worked :(
Can you attach output of 'sudo lspci -vv' and full dmesg after the system has been booted up (e.g from a normal boot)?
(In reply to Marek Lotke from comment #6)
> Sorry, neither 276017 nor 276019 worked :(
So the patch from comment #5 is almost a revert of the problematic commit.
Please try to comment out the
pci_dev->state_saved = true;
line added by it in pci_pm_freeze_noirq() and see if that works.
Created attachment 276021 [details]
Slightly modified revert of the problematic commit
IOW, try to apply something like this and see if it works (it should).
Created attachment 276023 [details]
Slightly modified revert of the problematic commit
Better yet, please test this one instead and (if it works) please attach a dmesg output after 1 hibernate-resume cycle.
Created attachment 276027 [details]
dmesg output after resume from hibernation with att 276023
Created attachment 276029 [details]
lspci -vv output
With the patch from attachment 276023 [details] I can resume from hibernation! Thank you!
Well, that just means that it works for you with the Mika's patch reverted which we've known already.
I have one more thing for you to test, though.
Created attachment 276035 [details]
PCI / PM: Do not set state_saved too early on system suspend
If suspend-to-RAM (or suspend-to-idle) works for you on this system, please test it with the attached patch.
Mika, after your commit, pci_pm_default_resume_early() in pci_pm_restore_noirq() will restore the config space of intel-lpss and that seems to be sufficient to trigger the crash. Would it be possible to try to reproduce this locally in the lab?
With 276035 I can still do suspend-to-RAM - it was always working. But resume from hibernate is still broken. This is 4.16.7 + 276035 patch. Should I apply 276035 together with 276023?
(In reply to Marek Lotke from comment #17)
> With 276035 I can still do suspend-to-RAM - it was always working.
> But resume from hibernate is still broken. This is 4.16.7 + 276035 patch.
> Should I apply 276035 together with 276023?
I have a machine with the intel-lpss device here, but it resumes from hibernation correctly after the Mika's commit, so the issue appears to be specific to your system.
I will attach some additional debug patches to collect more information from it, stay tuned.
Created attachment 276037 [details]
PCI / PM: Always resume runtime-suspended devices during hibernation
Please test this patch on top of the failing kernel (without any previous patches applied) and report back.
No luck. 276037 on top of 4.16.7 does not resume from hibernate.
Sorry about that.
The patch from comment #19 basically restores the state from before commit c4b65157aeef (PCI / PM: Take SMART_SUSPEND driver flag into account) that introduced the bug fixed by the Mika's commit. So, it looks like the bug introduced by commit c4b65157aeef incidentally made resume from hibernation kind of work on your system and now the Mika's fix (which is correct) broke it again.
Did resume from hibernation work for you before 4.15?
I've just tried Arch's lts - 4.14.40-1 - resume from hibernation does not work.
I'm guessing you know this already - as a workaround I can remove intel_lpss_pci, intel_lpss modules before hibernation and then resume from hibernation works. I just have to modprobe them back to get the touchpad working.
Another workaround is adding intel_lpss_pci to initrd - found it here: https://bbs.archlinux.org/viewtopic.php?id=231881
(In reply to Rafael J. Wysocki from comment #16)
> Mika, after your commit, pci_pm_default_resume_early() in
> pci_pm_restore_noirq() will restore the config space of intel-lpss and that
> seems to be sufficient to trigger the crash. Would it be possible to try to
> reproduce this locally in the lab?
I already tried yesterday on KBL system with LPSS but I was not able to reproduce this. Unfortunately that system does not have working touchpanel which may be needed to trigger crash when there is actual I2C traffic. I'll try today with another KBL system.
(In reply to Marek Lotke from comment #23)
> Another workaround is adding intel_lpss_pci to initrd - found it here:
Yes, please do that.
The device needs to be initialized properly by the restore kernel before control is passed back to the image one and I'm not even sure if that can be avoided.
(In reply to Mika Westerberg from comment #24)
> (In reply to Rafael J. Wysocki from comment #16)
> > Mika, after your commit, pci_pm_default_resume_early() in
> > pci_pm_restore_noirq() will restore the config space of intel-lpss and that
> > seems to be sufficient to trigger the crash. Would it be possible to try
> > reproduce this locally in the lab?
> I already tried yesterday on KBL system with LPSS but I was not able to
> reproduce this. Unfortunately that system does not have working touchpanel
> which may be needed to trigger crash when there is actual I2C traffic. I'll
> try today with another KBL system.
It looks like you need to leave the device uninitialized on restore before jumping back to the image kernel to reproduce this. Since the same callbacks are used for hibernation restore and system-wide resume, that is rather expected (or at least not surprising).
@Marek: For now I'll mark this as "documented" and we'll try to reproduce the failure locally and improve the hibernation handling in LPSS, but it may just require proper initialization after S4 wakeup.
> It looks like you need to leave the device uninitialized on restore before
> jumping back to the image kernel to reproduce this. Since the same
> callbacks are used for hibernation restore and system-wide resume, that is
> rather expected (or at least not surprising).
I figured if I build intel-lpss* as modules and put "resume=/path/to/disk" to the kernel command line the non-hibernation kernel should leave the LPSS devices uninitialized (because image restoration happens before udev gets to run and thus it cannot load additional modules) and once the hibernation image is restored, it should trigger the crash but unfortunately I still can't reproduce it here.
OK, it must be strictly system-specific, then.
But yes, it may be necessary to initialize devices after S4 wakeup and before loading the hibernate image in general.