Bug 199747

Summary: resume from hibernate results in kernel panic (bisected)
Product: Power Management Reporter: Marek Lotke (lollul)
Component: Hibernation/SuspendAssignee: Rafael J. Wysocki (rjw)
Status: RESOLVED DOCUMENTED    
Severity: normal CC: bjorn, lenb, mika.westerberg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.16.7 Subsystem:
Regression: No Bisected commit-id:
Attachments: PCI / PM: Clear state_saved in more cases in pci_pm_freeze()
Clear state in poweroff only if runtime resumed
PCI / PM: set state_saved when skipping noirq device suspend
Slightly modified revert of the problematic commit
Slightly modified revert of the problematic commit
dmesg output after resume from hibernation with att 276023
lspci -vv output
PCI / PM: Do not set state_saved too early on system suspend
PCI / PM: Always resume runtime-suspended devices during hibernation

Description Marek Lotke 2018-05-17 09:12:05 UTC
Lenovo laptop ideapad 320s (BIOS 2.41), Arch Linux. With kernel 4.16.7 resume from hibernate results in kernel panic (caps lock blinking). Resume from suspend works fine. Kernel 4.16.6 and lower worked fine too.
I've bisected the issue to this commit:

b517f3893df80201772523fd7d58b191c70d9ee4 is the first bad commit
commit b517f3893df80201772523fd7d58b191c70d9ee4
Author: Mika Westerberg <mika.westerberg@linux.intel.com>
Date:   Fri Apr 20 15:22:02 2018 +0300

    PCI / PM: Do not clear state_saved in pci_pm_freeze() when smart suspend is set
    
    commit ae860a19f37c686e7c5816e96640168b7174a096 upstream.
    
    If a driver uses DPM_FLAG_SMART_SUSPEND and the device is already
    runtime suspended when hibernate is started PCI core skips runtime
    resuming the device but still clears pci_dev->state_saved. After the
    hibernation image is written pci_pm_thaw_noirq() makes sure subsequent
    thaw phases for the device are also skipped leaving it runtime suspended
    with pci_dev->state_saved == false.
    
    When the device is eventually runtime resumed pci_pm_runtime_resume()
    restores config space by calling pci_restore_standard_config(), however
    because pci_dev->state_saved == false pci_restore_state() never actually
    restores the config space leaving the device in a state that is not what
    the driver might expect.
    
    For example here is what happens for intel-lpss I2C devices once the
    hibernation snapshot is taken:
    
      intel-lpss 0000:00:15.0: power state changed by ACPI to D0
      intel-lpss 0000:00:1e.0: power state changed by ACPI to D3cold
      video LNXVIDEO:00: Restoring backlight state
      PM: hibernation exit
      i2c_designware i2c_designware.1: Unknown Synopsys component type: 0xffffffff
      i2c_designware i2c_designware.0: Unknown Synopsys component type: 0xffffffff
      i2c_designware i2c_designware.1: timeout in disabling adapter
      i2c_designware i2c_designware.0: timeout in disabling adapter
    
    Since PCI config space is not restored the device is still in D3hot
    making MMIO register reads return 0xffffffff.
    
    Fix this by clearing pci_dev->state_saved only if we actually end up
    runtime resuming the device.
    
    Fixes: c4b65157aeef (PCI / PM: Take SMART_SUSPEND driver flag into account)
    Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    Cc: 4.15+ <stable@vger.kernel.org> # 4.15+
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

I have also experience the same problem on 4.17-rc5. But after reverting this commit I can resume from hibernate just fine.
Comment 1 Rafael J. Wysocki 2018-05-17 09:31:33 UTC
Created attachment 276013 [details]
PCI / PM: Clear state_saved in more cases in pci_pm_freeze()

Can you check if this patch helps, please?
Comment 2 Marek Lotke 2018-05-17 12:07:02 UTC
Sorry to report, but the patch did not help. (I've compiled twice just to be sure).
I wish I could tell more about the kernel panic at resume, but all I get is a blank screen and the caps lock blinking once the hibernation image is read completly...
I do remember though that I had similar trouble with resume from hibernate on this laptop prior to 4.16.
Thank you for looking into this!
Comment 3 Mika Westerberg 2018-05-17 12:25:06 UTC
Created attachment 276017 [details]
Clear state in poweroff only if runtime resumed

Here is another patch to try. I'm not entirely sure if this has anything to do with the problem at hand but I think we may need to do the same in poweroff phase as well.
Comment 4 Rafael J. Wysocki 2018-05-17 12:32:13 UTC
->poweroff doesn't matter here, because the resume code doesn't see what it does and that's when the crash happens.
Comment 5 Rafael J. Wysocki 2018-05-17 12:50:44 UTC
Created attachment 276019 [details]
PCI / PM: set state_saved when skipping noirq device suspend

OK, so let's try this one.
Comment 6 Marek Lotke 2018-05-17 13:24:10 UTC
Sorry, neither 276017 nor 276019 worked :(
Comment 7 Mika Westerberg 2018-05-17 14:30:52 UTC
Can you attach output of 'sudo lspci -vv' and full dmesg after the system has been booted up (e.g from a normal boot)?
Comment 8 Rafael J. Wysocki 2018-05-17 15:36:42 UTC
(In reply to Marek Lotke from comment #6)
> Sorry, neither 276017 nor 276019 worked :(

So the patch from comment #5 is almost a revert of the problematic commit.

Please try to comment out the

pci_dev->state_saved = true;

line added by it in pci_pm_freeze_noirq() and see if that works.
Comment 9 Rafael J. Wysocki 2018-05-17 15:42:30 UTC
Created attachment 276021 [details]
Slightly modified revert of the problematic commit

IOW, try to apply something like this and see if it works (it should).
Comment 10 Rafael J. Wysocki 2018-05-17 16:05:36 UTC
Created attachment 276023 [details]
Slightly modified revert of the problematic commit

Better yet, please test this one instead and (if it works) please attach a dmesg output after 1 hibernate-resume cycle.
Comment 11 Marek Lotke 2018-05-17 18:09:03 UTC
Created attachment 276027 [details]
dmesg output after resume from hibernation with att 276023
Comment 12 Marek Lotke 2018-05-17 18:09:40 UTC
Created attachment 276029 [details]
lspci -vv output
Comment 13 Marek Lotke 2018-05-17 18:10:30 UTC
With the patch from attachment 276023 [details] I can resume from hibernation! Thank you!
Comment 14 Rafael J. Wysocki 2018-05-17 20:15:14 UTC
Well, that just means that it works for you with the Mika's patch reverted which we've known already.

I have one more thing for you to test, though.
Comment 15 Rafael J. Wysocki 2018-05-17 20:17:26 UTC
Created attachment 276035 [details]
PCI / PM: Do not set state_saved too early on system suspend

If suspend-to-RAM (or suspend-to-idle) works for you on this system, please test it with the attached patch.
Comment 16 Rafael J. Wysocki 2018-05-17 20:31:10 UTC
Mika, after your commit, pci_pm_default_resume_early() in pci_pm_restore_noirq() will restore the config space of intel-lpss and that seems to be sufficient to trigger the crash.  Would it be possible to try to reproduce this locally in the lab?
Comment 17 Marek Lotke 2018-05-17 20:51:18 UTC
With 276035 I can still do suspend-to-RAM - it was always working. But resume from hibernate is still broken. This is 4.16.7 + 276035 patch. Should I apply 276035 together with 276023?
Comment 18 Rafael J. Wysocki 2018-05-17 21:03:13 UTC
(In reply to Marek Lotke from comment #17)
> With 276035 I can still do suspend-to-RAM - it was always working.

Good, thanks!

> But resume from hibernate is still broken. This is 4.16.7 + 276035 patch.
> Should I apply 276035 together with 276023?

No, thanks.

I have a machine with the intel-lpss device here, but it resumes from hibernation correctly after the Mika's commit, so the issue appears to be specific to your system.

I will attach some additional debug patches to collect more information from it, stay tuned.
Comment 19 Rafael J. Wysocki 2018-05-17 21:16:50 UTC
Created attachment 276037 [details]
PCI / PM: Always resume runtime-suspended devices during hibernation

Please test this patch on top of the failing kernel (without any previous patches applied) and report back.
Comment 20 Marek Lotke 2018-05-17 21:46:57 UTC
No luck. 276037 on top of 4.16.7 does not resume from hibernate.
Comment 21 Rafael J. Wysocki 2018-05-17 21:58:54 UTC
Sorry about that.

The patch from comment #19 basically restores the state from before commit c4b65157aeef (PCI / PM: Take SMART_SUSPEND driver flag into account) that introduced the bug fixed by the Mika's commit.  So, it looks like the bug introduced by commit c4b65157aeef incidentally made resume from hibernation kind of work on your system and now the Mika's fix (which is correct) broke it again.

Did resume from hibernation work for you before 4.15?
Comment 22 Marek Lotke 2018-05-17 22:13:57 UTC
I've just tried Arch's lts - 4.14.40-1 - resume from hibernation does not work.

I'm guessing you know this already - as a workaround I can remove intel_lpss_pci, intel_lpss modules before hibernation and then resume from hibernation works. I just have to modprobe them back to get the touchpad working.
Comment 23 Marek Lotke 2018-05-17 22:24:08 UTC
Another workaround is adding intel_lpss_pci to initrd  - found it here: https://bbs.archlinux.org/viewtopic.php?id=231881
Comment 24 Mika Westerberg 2018-05-18 07:57:47 UTC
(In reply to Rafael J. Wysocki from comment #16)
> Mika, after your commit, pci_pm_default_resume_early() in
> pci_pm_restore_noirq() will restore the config space of intel-lpss and that
> seems to be sufficient to trigger the crash.  Would it be possible to try to
> reproduce this locally in the lab?

I already tried yesterday on KBL system with LPSS but I was not able to reproduce this. Unfortunately that system does not have working touchpanel which may be needed to trigger crash when there is actual I2C traffic. I'll try today with another KBL system.
Comment 25 Rafael J. Wysocki 2018-05-18 08:07:11 UTC
(In reply to Marek Lotke from comment #23)
> Another workaround is adding intel_lpss_pci to initrd  - found it here:
> https://bbs.archlinux.org/viewtopic.php?id=231881

Yes, please do that.

The device needs to be initialized properly by the restore kernel before control is passed back to the image one and I'm not even sure if that can be avoided.
Comment 26 Rafael J. Wysocki 2018-05-18 08:11:44 UTC
(In reply to Mika Westerberg from comment #24)
> (In reply to Rafael J. Wysocki from comment #16)
> > Mika, after your commit, pci_pm_default_resume_early() in
> > pci_pm_restore_noirq() will restore the config space of intel-lpss and that
> > seems to be sufficient to trigger the crash.  Would it be possible to try
> to
> > reproduce this locally in the lab?
> 
> I already tried yesterday on KBL system with LPSS but I was not able to
> reproduce this. Unfortunately that system does not have working touchpanel
> which may be needed to trigger crash when there is actual I2C traffic. I'll
> try today with another KBL system.

It looks like you need to leave the device uninitialized on restore before jumping back to the image kernel to reproduce this.  Since the same callbacks are used for hibernation restore and system-wide resume, that is rather expected (or at least not surprising).

@Marek: For now I'll mark this as "documented" and we'll try to reproduce the failure locally and improve the hibernation handling in LPSS, but it may just require proper initialization after S4 wakeup.
Comment 27 Mika Westerberg 2018-05-18 09:17:03 UTC
> It looks like you need to leave the device uninitialized on restore before
> jumping back to the image kernel to reproduce this.  Since the same
> callbacks are used for hibernation restore and system-wide resume, that is
> rather expected (or at least not surprising).

I figured if I build intel-lpss* as modules and put "resume=/path/to/disk" to the kernel command line the non-hibernation kernel should leave the LPSS devices uninitialized (because image restoration happens before udev gets to run and thus it cannot load additional modules) and once the hibernation image is restored, it should trigger the crash but unfortunately I still can't reproduce it here.
Comment 28 Rafael J. Wysocki 2018-05-18 09:42:42 UTC
OK, it must be strictly system-specific, then.

But yes, it may be necessary to initialize devices after S4 wakeup and before loading the hibernate image in general.