Bug 204529 - Clevo P150HM freezes on suspend
Summary: Clevo P150HM freezes on suspend
Status: CLOSED MOVED
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-08 04:31 UTC by Julian Sikorski
Modified: 2019-10-20 13:32 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.2.6-200.fc30.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
screenshot of console during the error (140.84 KB, image/png)
2019-10-20 10:13 UTC, Julian Sikorski
Details
better screenshot of console during error (137.65 KB, image/png)
2019-10-20 11:19 UTC, Julian Sikorski
Details
dmesg output (74.96 KB, text/plain)
2019-10-20 11:33 UTC, Julian Sikorski
Details

Description Julian Sikorski 2019-08-08 04:31:43 UTC
Hello,

for several weeks now my laptop started having issues during suspend. It usually happens for a few times after a new kernel is installed, after which it goes away on its own - only to return with a kernel update.
I have managed to capture the pm-trace:

 dmesg | grep -i -A2 magic
[    4.634329] PM:   Magic number: 14:984:176
[    4.634412] acpi device:39: hash matches
[    4.634425] acpi device:0c: hash matches

I bought this machine way back in 2011 and it was suspending and resuming successfully for years, even after the suspend/resume stopped working under Windows after I upgraded the GPU.
What additional info I could provide in order to make this bug possible to solve? Thank you!
Comment 1 Zhang Rui 2019-09-03 12:52:32 UTC
(In reply to Julian Sikorski from comment #0)
> Hello,
> 
> for several weeks now my laptop started having issues during suspend.

what did you do that triggers this issue?
do you have a know kernel version that suspend works all the time?
Comment 2 Julian Sikorski 2019-09-03 19:06:26 UTC
The issue seems to be triggered by kernel update: after updating the kernel the freezes happen a couple of times, after which the situation stabilizes and suspend works.
I do not really have a version working all the time, it also would be hard to find as the issue does not happen every time but only a few times after rebooting into a new kernel.
In the meantime I was able to test all modes listed in [1] and I was never able to trigger the lockup using any of the modes.
In case it helps, the issue looks as follows: screen goes black, leds blink, then the wlan led turns on and off the final timeup and the laptop never suspends.
Is there a way to somehow delay the suspend of screen, ethernet or some other way to produce a log of what is happening?
what are acpi devices 39 and 0c? Does this hash match even make sense?

[1] https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Comment 3 Julian Sikorski 2019-09-03 19:12:53 UTC
I am now thinking that kernel updates have nothing to do with this issue, it is rather triggered by reboot/cold boot. I just experienced the lockup by rebooting from linux into windows and back again.
Comment 4 Julian Sikorski 2019-10-13 08:22:32 UTC
In the meantime I was able to confirm that the issue is somewhat reliably triggerred by cold boot. I have now upgraded to Fedora 31 prerelease, its kernel-5.3.4-300.fc31.x86_64 is affected too. My gut feeling is that the problem was not present during Fedora 30 timeframe, which would put a regression after its release kernel-5.0.9-301.fc30.x86_64. I will try to test this theory and report back.
Comment 5 Julian Sikorski 2019-10-13 08:37:50 UTC
I have managed to trigger the issue a few times with kernel-5.1.18-300.fc30.x86_64. Then the resume started working with the same kernel.
Out of curiosity: this laptop is now over 8 years old. Could hardware issues like weak RTC battery or dried out caps be responsible for such behaviour?
Comment 6 Julian Sikorski 2019-10-13 09:28:01 UTC
I have now tried every known reproducer on kernel-5.0.9-301.fc30.x86_64 and I was not able to trigger the freeze. I am afraid this is as good of a proof that the regression happened during 5.0 -> 5.1 as I will be able to provide, as due to intermittent nature of this problem I can only definitely confirm which kernels are affected but not which are not.

I have also noticed that there is a difference in LED behaviour between freezing and not:
If the laptop suspends successfully, the wifi led turns off, drive led turns on, drive led turns off and the laptop suspends:
https://youtu.be/1CHzkR-wsTA

During the freeze, after the drive led turns off, the wifi led goes back on and off again:
https://youtu.be/Cz-ZS8ccSig

Not sure how useful the LED info is, but I thought any information might be useful at this point.
Comment 7 Julian Sikorski 2019-10-19 16:04:11 UTC
I have upgraded to Fedora 31 (5.3.6-300.fc31.x86_64), it is also affected.
In the meantime I have observed that running
# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
# echo none > /sys/power/pm_test
does reliably prevent the lockups from happening.
Comment 8 Julian Sikorski 2019-10-20 09:57:50 UTC
I tried netconsole, but it appears that the connection is killed before the issue occurs:
Oct 20 11:55:07 192.168.0.4  [  114.780323] PM: suspend entry (deep)
Oct 20 11:55:07 192.168.0.4  [  114.786783] Filesystems sync: 0.006 seconds
Oct 20 11:55:07 192.168.0.4  [  115.048760] Freezing user space processes ... (elapsed 0.002 seconds) done.
Oct 20 11:55:07 192.168.0.4  [  115.050887] OOM killer disabled.
Oct 20 11:55:07 192.168.0.4  [  115.050894] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
Oct 20 11:55:07 192.168.0.4  [  115.052165] wlp4s0: deauthenticating from ac:22:05:e2:26:3a by local choice (Reason: 3=DEAUTH_LEAVING)
Oct 20 11:55:07 192.168.0.4  [  115.066048] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Oct 20 11:55:07 192.168.0.4  [  115.068668] sd 0:0:0:0: [sda] Stopping disk
Comment 9 Julian Sikorski 2019-10-20 10:13:15 UTC
Created attachment 285573 [details]
screenshot of console during the error

I have managed to get something displayed on the local console using no_console_suspend and filming it in slow motion. The quality is bad, but one can read that the first error is something like:
xhci_hcd WARN xmc cmd_run timeout
This would indicate the error I am having might be related to this:
https://askubuntu.com/questions/1003015/dell-precision-7510-fails-to-suspend-xhci-hcd-timeout
Comment 10 Julian Sikorski 2019-10-20 10:43:24 UTC
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1844021

According to git history the possible source of this problem
https://lkml.org/lkml/2019/2/7/1028
Was introduced to 5.1.x branch in 5.1.8. This would match with why I was seeing freezes in 5.1.18 but not in 5.0.9.

According to the ubuntu bug report, the fix is
https://lkml.org/lkml/2019/9/26/581

I will build a kernel including this patch and report back.
Comment 11 Julian Sikorski 2019-10-20 11:19:38 UTC
Created attachment 285577 [details]
better screenshot of console during error

I have now tried 5.3.7-300.fc31 which included the timeout extension fix. Unfortunately, the problem is still there.
My laptop contains a different USB controller though:
02:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
Comment 12 Julian Sikorski 2019-10-20 11:33:34 UTC
Created attachment 285579 [details]
dmesg output

As trying to suspend from tty did not freeze the system with this new kernel, I was able to obtain text logs:

[  381.200817] PM: suspend entry (deep)
[  381.208037] Filesystems sync: 0.007 seconds
[  381.212775] Freezing user space processes ... (elapsed 0.002 seconds) done.
[  381.215164] OOM killer disabled.
[  381.215172] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  381.216871] wlp4s0: deauthenticating from ac:22:05:e2:26:3a by local choice (Reason: 3=DEAUTH_LEAVING)
[  381.227244] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  381.228008] sd 0:0:0:0: [sda] Stopping disk
[  381.806585] xhci_hcd 0000:02:00.0: WARN: xHC CMD_RUN timeout
[  381.806600] PM: suspend_common(): xhci_pci_suspend+0x0/0xd0 returns -110
[  381.806604] PM: pci_pm_suspend(): hcd_pci_suspend+0x0/0x30 returns -110
[  381.806609] PM: dpm_run_callback(): pci_pm_suspend+0x0/0x130 returns -110
[  381.806612] PM: Device 0000:02:00.0 failed to suspend async: error -110
[  383.877358] PM: Some devices failed to suspend, or early wake event detected
[  383.880034] sd 0:0:0:0: [sda] Starting disk
[  383.900618] jme 0000:03:00.0 enp3s0f0: Link is down
[  384.185026] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  384.185066] ata4: SATA link down (SStatus 0 SControl 300)
[  384.200108] ata3.00: configured for UDMA/133
[  384.450962] firewire_core 0000:05:00.0: rediscovered device fw0
[  384.521436] PM: resume devices took 0.644 seconds
[  384.521479] OOM killer enabled.
[  384.521483] Restarting tasks ... done.
[  384.978674] psmouse serio2: synaptics: queried max coordinates: x [..5692], y [..4680]
[  385.416945] video LNXVIDEO:00: Restoring backlight state
[  385.424806] PM: suspend exit
Comment 13 Julian Sikorski 2019-10-20 13:32:35 UTC
I am going to move this to linux-usb

Note You need to log in before you can comment on or make changes to this bug.