Bug 194631

Summary: System hangs up after resuming from suspend
Product: Power Management Reporter: klev.paul
Component: Hibernation/SuspendAssignee: Zhang Rui (rui.zhang)
Status: CLOSED INSUFFICIENT_DATA    
Severity: normal CC: carstenblaauw, itumaykin+kernel, polarathene-signup, rui.zhang, svenne, yu.c.chen
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.9.9 Subsystem:
Regression: No Bisected commit-id:
Attachments: journalctl log after problem
journalctl after first (normal) wake up
dmesg after normal sleep
dmesg after normal resume with kernel 4.8.17
dmesg after hang on resume with kernel 4.10.6 on intel gpu

Description klev.paul 2017-02-19 08:18:00 UTC
Created attachment 254831 [details]
journalctl log after problem

System hangs up after resuming from second suspend.
The first wake up is usually normal, but after the second one
the screen is black, no respond to key press, no network access.
Possible only to reboot the computer.

Downgrading kernel to 4.8.x branch resolves the issue
Comment 1 klev.paul 2017-02-19 08:20:29 UTC
Created attachment 254833 [details]
journalctl after first (normal) wake up
Comment 2 Coacher 2017-02-22 13:54:57 UTC
I experience similar problem even with the first resume from suspend.
I'm not sure if its exactly the same as my system unable to produce any useful logs during such a hang.
Comment 3 Zhang Rui 2017-02-27 07:00:33 UTC
(In reply to klev.paul from comment #1)
> Created attachment 254833 [details]
> journalctl after first (normal) wake up

can you attach the dmesg output after first resume?
the output you attached seems just contain boot log.
Comment 4 klev.paul 2017-02-27 11:59:28 UTC
Created attachment 254961 [details]
dmesg after normal sleep
Comment 5 klev.paul 2017-02-27 12:25:57 UTC
Created attachment 254965 [details]
dmesg after normal resume with kernel 4.8.17

dmesg after normal sleep with kernel 4.8.17 (never met the problem)
Comment 6 klev.paul 2017-02-27 12:26:50 UTC
Comment on attachment 254965 [details]
dmesg after normal resume with kernel 4.8.17

>[ 1398.852405] PM: Syncing filesystems ... done.
>[ 1398.876863] PM: Preparing system for sleep (mem)
>[ 1398.877225] Freezing user space processes ... (elapsed 0.002 seconds) done.
>[ 1398.879530] Double checking all user space processes after OOM killer
>disable... (elapsed 0.000 seconds) 
>[ 1398.879759] Freezing remaining freezable tasks ... (elapsed 0.001 seconds)
>done.
>[ 1398.881394] PM: Suspending system (mem)
>[ 1398.881462] Suspending console(s) (use no_console_suspend to debug)
>[ 1398.886051] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
>[ 1398.886158] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>[ 1398.886200] sd 2:0:0:0: [sdb] Stopping disk
>[ 1398.887035] sd 0:0:0:0: [sda] Stopping disk
>[ 1399.629411] PM: suspend of devices complete after 747.807 msecs
>[ 1399.644883] PM: late suspend of devices complete after 15.462 msecs
>[ 1399.646550] ehci-pci 0000:00:1d.0: System wakeup enabled by ACPI
>[ 1399.646608] xhci_hcd 0000:00:14.0: System wakeup enabled by ACPI
>[ 1399.661651] PM: noirq suspend of devices complete after 16.764 msecs
>[ 1399.662029] ACPI: Preparing to enter system sleep state S3
>[ 1399.662472] ACPI : EC: EC stopped
>[ 1399.662473] PM: Saving platform NVS memory
>[ 1399.662488] Disabling non-boot CPUs ...
>[ 1399.689849] smpboot: CPU 1 is now offline
>[ 1399.712695] smpboot: CPU 2 is now offline
>[ 1399.729325] smpboot: CPU 3 is now offline
>[ 1399.732305] ACPI: Low-level resume complete
>[ 1399.732354] ACPI : EC: EC started
>[ 1399.732355] PM: Restoring platform NVS memory
>[ 1399.733385] Enabling non-boot CPUs ...
>[ 1399.733421] x86: Booting SMP configuration:
>[ 1399.733422] smpboot: Booting Node 0 Processor 1 APIC 0x1
>[ 1399.736062]  cache: parent cpu1 should not be sleeping
>[ 1399.736362] CPU1 is up
>[ 1399.736420] smpboot: Booting Node 0 Processor 2 APIC 0x2
>[ 1399.739251]  cache: parent cpu2 should not be sleeping
>[ 1399.739620] CPU2 is up
>[ 1399.739639] smpboot: Booting Node 0 Processor 3 APIC 0x3
>[ 1399.742202]  cache: parent cpu3 should not be sleeping
>[ 1399.742528] CPU3 is up
>[ 1399.745587] ACPI: Waking up from system sleep state S3
>[ 1399.764052] ehci-pci 0000:00:1d.0: System wakeup disabled by ACPI
>[ 1399.764180] xhci_hcd 0000:00:14.0: System wakeup disabled by ACPI
>[ 1399.764258] PM: noirq resume of devices complete after 17.808 msecs
>[ 1399.764491] PM: early resume of devices complete after 0.217 msecs
>[ 1399.766337] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
>[ 1399.766833] sd 0:0:0:0: [sda] Starting disk
>[ 1399.766834] sd 2:0:0:0: [sdb] Starting disk
>[ 1399.843138] rtc_cmos 00:01: System wakeup disabled by ACPI
>[ 1400.003014] usb 1-1.3: reset high-speed USB device number 3 using ehci-pci
>[ 1400.090198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>[ 1400.090468] ata1.00: supports DRM functions and may not be fully accessible
>[ 1400.091569] ata1.00: disabling queued TRIM support
>[ 1400.092255] ata1.00: supports DRM functions and may not be fully accessible
>[ 1400.092749] ata1.00: disabling queued TRIM support
>[ 1400.093084] ata1.00: configured for UDMA/133
>[ 1400.172923] usb 1-1.5: reset high-speed USB device number 4 using ehci-pci
>[ 1400.343094] PM: resume of devices complete after 578.593 msecs
>[ 1400.343410] PM: Finishing wakeup.
>[ 1400.343412] Restarting tasks ... done.
>[ 1400.551512] IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
>[ 1400.662953] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1400.662958] usb usb4-port3: cannot disable (err = -32)
>[ 1400.745995] r8169 0000:01:00.0 enp1s0: link down
>[ 1400.746048] IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
>[ 1400.747020] IPv6: ADDRCONF(NETDEV_UP): wlp2s0: link is not ready
>[ 1400.983050] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1400.983061] usb usb4-port3: cannot disable (err = -32)
>[ 1400.983089] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1400.983103] usb usb4-port3: cannot disable (err = -32)
>[ 1401.039653] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1401.039661] usb usb4-port4: cannot disable (err = -32)
>[ 1401.096334] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1401.096345] usb usb4-port4: cannot disable (err = -32)
>[ 1401.096381] xhci_hcd 0000:00:14.0: Cannot set link state.
>[ 1401.096391] usb usb4-port4: cannot disable (err = -32)
>[ 1402.699650] r8169 0000:01:00.0 enp1s0: link up
>[ 1402.699661] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
>[ 1402.872974] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>[ 1403.186135] ata3.00: configured for UDMA/133
Comment 7 Brennan Kinney 2017-03-04 23:26:43 UTC
I can confirm the same experience with 4.9 and 4.10 kernels. I'm using Manjaro KDE. On the 2nd suspend/resume cycle, I may see a blackscreen with my mouse cursor, sometimes I get to see the lockscreen with what appears to be 1 second of input working(the lockscreen password field has a blinking cursor, mouse can move, not sure if I could type), then everything hangs. I'm unable to switch to another TTY, SSH does not work, I'm stuck unless I reset the machine manually via button.

One time on 4.10 I did not lose input, I was able to unlock the screen and see my desktop as it was before suspending. Nothing was responsive to keyboard/mouse however and I still could not change TTY. From recent experiences dealing with a boot problem with Plymouth on live media, I'm suspecting /dev/fb0 to have disappeared, this is what happened with Plymouth preventing TTY switching and hanging the active display.

This is happening on a Skylake system for me where I'm using the iGPU with generic modesetting driver instead of xf86-video-intel. I've yet to confirm if it's hardware/driver related. 4.8 kernel series works perfectly fine. I'll try supply some logs shortly.
Comment 8 Svenne Krap 2017-03-06 08:56:11 UTC
I have the exact same problem on a skylake i7-6600U (Lenovo X1 yoga). 

I have also not seen the problem in 4.8...
Comment 9 Chen Yu 2017-03-13 02:55:29 UTC
(In reply to klev.paul from comment #4)
> Created attachment 254961 [details]
> dmesg after normal sleep

Please add 'no_console_suspend' in the grub commandline, and also test with
different /sys/power/pm_test options, the detail to debugging suspend issues
can be found here:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Comment 10 Zhang Rui 2017-03-20 04:16:12 UTC
ping...
Comment 11 klev.paul 2017-03-20 04:25:40 UTC
In my case, system happens to resume normally
after disabling asynchronous suspend
(echo 0 > /sys/power/pm_async)
Comment 12 Chen Yu 2017-03-20 09:28:31 UTC
(In reply to klev.paul from comment #11)
> In my case, system happens to resume normally
> after disabling asynchronous suspend
> (echo 0 > /sys/power/pm_async)
It seems there is some driver dependency during resume, which breaks the system.
Would you please bisect between 4.8.x and 4.9.9(please test with vanilla kernel, not the stable tree), I know what might be time-consuming, but that's one of the most straightforward way for debugging.
Another debugging method is to boot the system with minimal drivers via 'root=/dev/ram rw', then do a 'binary' testing with half modules loaded and so on.
Comment 13 Chen Yu 2017-03-27 03:35:35 UTC
ping...
Comment 14 Brennan Kinney 2017-03-27 08:40:28 UTC
I've never done a bisect or done much with kernels before besides update them via my package manager. I don't think I'd have the time to do the process regardless, but if it's helpful to provide other information I can still do so, such as running commands suggested earlier? 

I'm not sure if it's related but I have a systemd task that freezes the system up in a similar fashion at midnight.

Since it seems to be due to a power management change in 4.9, could the bisecting be made easier with the known commits? They seem to be listed here: https://kernelnewbies.org/Linux_4.9#head-798eee1861a17b373b1bff875a595ffe6722aa7c

I still haven't tried using the nvidia card as primary gpu, not sure if any of the DRM changes for Intel in 4.9 could also be a part of the cause? List of things that got changed is here: https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.9-More-i915-DRM
Comment 15 Brennan Kinney 2017-03-28 10:39:11 UTC
Can confirm disabling pm_async works. Trying nvidia as my primary GPU showed a dmesg on resume instead of black screen with cursor, the text was trimmed and appears to be from boot about a bad ACPI table. Relevant snippets follow:

journalctl -b 0 -p err 
-- Logs begin at Wed 2017-01-04 01:37:18 NZDT, end at Tue 2017-03-28 16:01:01 NZDT. --
Mar 28 15:26:05 pola-pc kernel: ACPI Error: [\_SB_.PCI0.XHC_.RHUB.HS11] Namespace lookup failure, AE_NOT_FOUND (20160930/dswload-210)
Mar 28 15:26:05 pola-pc kernel: ACPI Exception: AE_NOT_FOUND, During name lookup/catalog (20160930/psobject-227)
Mar 28 15:26:05 pola-pc kernel: ACPI Exception: AE_NOT_FOUND, (SSDT:xh_rvp08) while loading table (20160930/tbxfload-228)
Mar 28 15:26:05 pola-pc kernel: ACPI Error: 1 table load failures, 5 successful (20160930/tbxfload-246)


lspci -nk
00:14.0 0c03: 8086:a12f (rev 31)
        Subsystem: 1849:a12f
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci


cat /proc/acpi/wakeup
Device  S-state   Status   Sysfs node
XHC       S4    *enabled   pci:0000:00:14.0


dmesg|egrep -i "DMI:"
[    0.000000] DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./H170 Performance/Hyper, BIOS P7.20A 01/23/2017
                                                                                                                                                                                            

                                                                                                                                                                                            dmesg|egrep -i "DSDT"
[    0.000000] ACPI: DSDT 0x0000000087F311E8 0210CE (v02 ALASKA A M I    01072009 INTL 20120913)


dmesg|egrep -i "acpi"
https://pastebin.com/mrxmmWU0


journalctl -b (first suspend/resume snippet)
https://pastebin.com/JbU61sNk
Comment 16 Brennan Kinney 2017-03-28 11:04:50 UTC
TL;DR: Using CFQ instead of BFQ for IO scheduler fixes the issue for me. `echo cfq > /sys/block/sda/queue/scheduler`.

---

This thread suggests that my BIOS is at fault for not providing proper ACPI info and it's up to getting an update from the vendor: https://forum.manjaro.org/t/acpi-error-dssp-namespace-lookup-failure-ae-not-found-on-4-9-0-1/15474

Alternatively attempting to fix that myself by following: https://wiki.archlinux.org/index.php/DSDT

Related kernel issue states the error is not important and should be hidden under a log level in future kernel updates: https://bugzilla.kernel.org/show_bug.cgi?id=193531

This issue links to the above one and fixes those errors by using kernel param `acpi_osi=` with a value: https://bugzilla.redhat.com/show_bug.cgi?id=1413342

Kernel dev conversation involving Linus that further confirms these errors are probably unrelated: https://lkml.org/lkml/2016/10/5/341

---

I've looked into the similar issue I was having with a systemd.timer/service which turned out to be logrotate. When running this on a different tty from my DE I'd get dmesg output just before the system lockup. BFQ was the cause. Changing the IO scheduler from BFQ to another like CFQ not only fixed the lockup from logrotate.service but also from the 2nd suspend/resume issue described here. Disabling pm_async didn't help with the logrotate lockup, so is it still a power management issue or can it be caused by I/O scheduler bug?

I have details with output snippets here: https://forum.manjaro.org/t/systemd-timer-crashing-my-system-at-midnight/19609
Comment 17 Zhang Rui 2017-03-29 01:57:19 UTC
so there are two ways to workaround the issue
1. disable async resume by "echo 0 > /sys/power/pm_async"
2. use CFQ instead of BFQ by "echo cfq > /sys/block/sda/queue/scheduler"
right?
Comment 18 Brennan Kinney 2017-03-29 09:47:51 UTC
@Zhang You'd need other users to confirm that they were using BFQ and the switch works for them too. It also seems that my filesystem choice XFS was part of the BFQ bug as well :P The command is also per disk(in my case a single SSD drive `sda`).

Here is the end part of the dmesg output from the hard lockup showing the stacktrace: https://imgur.com/a/HWkOR

I've not run into any issues so far with the IO scheduler change, will report back if symptoms occur.
Comment 19 Carsten Blaauw 2017-04-01 11:56:38 UTC
Hi,

since 4.9 I got freezes on resume but never at the first resume. Now with 4.10.6 I get a freeze at resume all the time, even at the firt one. I also use XFS. Both changing /sys/power/pm_async and /sys/block/sda/queque/scheduler do not remedy this i.e. I still get those freezes. But maybe that's a good thing, because it is easier to debug than freezes that do not happen all the time. If you need debug information, please instruct me what I need to do to provide it.

Thanks,

Carsten
Comment 20 Brennan Kinney 2017-04-01 12:15:09 UTC
@Carsten, have you tried to switch TTY (ctrl+alt+f2 for example), log in and suspend/resume from there? `systemctl suspend` should work. When resuming if lucky you'll get some dmesg output like I did that helped me identify BFQ and XFS as an issue.
Comment 21 Carsten Blaauw 2017-04-01 12:24:55 UTC
@Brennan, keyboard does not work at all I'm afraid. And now I noticed that it will not break at each resume (that was myabe a coincidence), now it failed at the third one. But it never takes a lot of tries, anyway...
Comment 22 Carsten Blaauw 2017-04-01 12:31:40 UTC
So, keyboard does not work to switch TTY, but I can issue the magic sysrq. And if I watch the disks activity led the kernel seems to operate normally. Looks like a freeze of the graphics only. Maybe I could SSH into the machine when in this state. Does this give us the possibility of getting more debugging info?
Comment 23 Brennan Kinney 2017-04-01 12:43:07 UTC
@Carsten If you can SSH in during the freeze it might be possible to get some info to debug. In my case upon resuming once the lockup happens moments after I'm unable to SSH in, everything locks up. If you're able to it's more than likely a different issue than this one where multiple users can reproduce the lockup always on the 2nd suspend/resume since boot.

You may want to look into tools like powertop, if you have TLP installed or similar power management packages disabling them or changing settings. Might be some other kernel params for graphics you could try, or open-source vs proprietary, with intel there is generic modesetting vs xf86-video-intel as well.

If SSH works, check dmesg and journalctl -b. If you find anything useful that points towards graphics find an existing or create a relevant issue for it.
Comment 24 Carsten Blaauw 2017-04-01 12:45:43 UTC
Hello,

it was possible to SSH into the machine and I pulled a dmesg output (attached to this ticket as "dmesg-output.txt"). There's a crash mentioned that looks like it happend inside the intel graphics driver. Please have a look.
Comment 25 Carsten Blaauw 2017-04-01 12:46:34 UTC
Created attachment 255683 [details]
dmesg after hang on resume with kernel 4.10.6 on intel gpu
Comment 26 Zhang Rui 2017-04-04 07:06:26 UTC
From Carsten' dmesg:
[    0.032000] smpboot: CPU0: Intel(R) Core(TM)2 Duo CPU     E8200  @ 2.66GHz (family: 0x6, model: 0x17, stepping: 0x6)

Fron Klev' dmesg:
[    0.082201] smpboot: CPU0: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz (family: 0x6, model: 0x3a, stepping: 0x9)

These are two different problems, on different laptops, to me.

And for the others, it is also possible that you're experiencing different problems, just with similar symptom.

Carsten,
please file a bug report at freedesktop.org to get help from the graphics experts.

To avoid being mislead by some many different bug reporters, I will focus on the issue reproduced on Klev' platforms in this bug report.
And for others, please file a new bug report for your own problems, together with acpidump, dmesg output and model name of your laptops. Unless you can confirm your laptop model is exactly the same as Klev' laptop.
Comment 27 Zhang Rui 2017-06-17 07:47:43 UTC
Bug closed as we don't get valid test result from the bug reporter.
Please feel free to reopen it if you can provide the information required in comment #9 and #12.
For the others, please file new bug reporters instead, unless you're using exactly the same model of laptop.