Bug 81191

Summary: Sometimes the laptop reboots just after resume from suspend to disk
Product: Power Management
Reporter: Vitaliy Filippov (vitalif)
Component: Hibernation/Suspend
Assignee: Zhang Rui (rui.zhang)
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
CC: aaron.lu, rui.zhang, tianyu.lan
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.14, 3.16 debian, most 3.x versions
Subsystem:
Regression: No
Bisected commit-id:

Description Vitaliy Filippov 2014-07-27 07:41:09 UTC
Hi!

I have a problem that has existed for a long time. As far as I can remember, it exists in all or nearly all 3.x versions and maybe even in 2.6.38 or so...

The problem is, SOMETIMES the system reboots instead of resuming from suspend to disk (both swsusp and uswsusp). It reads the image normally up to 100%, but then just reboots and boots up as if there were no hibernation image.

MOST times resume is working without problem...

Can you suggest how to collect more information on this problem?
Comment 1 Lan Tianyu 2014-08-04 02:28:05 UTC
Could you try the latest v3.16-rc6 kernel?
Comment 2 Zhang Rui 2014-09-09 14:52:50 UTC
Please check /etc/fstab to find your swap partition, say /dev/sdaX, and then reboot with the boot option "resume=/dev/sdaX". Can you still reproduce this problem then?
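The check described above can be sketched roughly as follows. This is only an illustration: the function names swap_devices and resume_param are made up here, and the fstab entry may name the swap area by UUID= or LABEL= instead of a /dev path, in which case that string is printed as-is.

```shell
# List swap entries from an fstab-style file (default: /etc/fstab).
# Skips comment lines and prints field 1 of entries whose type (field 3) is "swap".
swap_devices() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "${1:-/etc/fstab}"
}

# Print the value of resume= from a kernel command line string
# (default: the running kernel's /proc/cmdline). Prints nothing if it is absent.
resume_param() {
    cmdline="${1-$(cat /proc/cmdline)}"
    for word in $cmdline; do
        case "$word" in
            resume=*) echo "${word#resume=}" ;;
        esac
    done
}
```

Running swap_devices and resume_param with no arguments and comparing the two outputs shows whether the running kernel was booted with a resume= option that points at the swap area.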
Comment 3 Vitaliy Filippov 2014-10-17 10:11:09 UTC
Oops. Sorry. I've probably missed the email notifications about the previous comments...

I'm now on the 3.16 kernel on both laptops, and both of them (one newer Samsung 880Z5E with IVB Core i7-3635QM, one older Dell Studio XPS 16 with Core i7 Q 720) have this problem.

The resume=/dev/sdaX option is of course already on the command line. The problem isn't related to finding the resume image; the resume image is found and read successfully, but after that something goes wrong and the laptop reboots.

I don't know what conditions lead to it - there's nothing special when the problem reproduces.
Comment 4 Vitaliy Filippov 2014-10-17 10:12:40 UTC
(it reproduces randomly, maybe every 10th time or so...)
Comment 5 Aaron Lu 2014-10-17 14:38:56 UTC
Please refer to https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt. Basically, you can do:
# cd /sys/power
# echo devices > pm_test
# echo disk > state
multiple times and see if this triggers any error.
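Repeating the commands above by hand gets tedious, so a small loop may help. This is a minimal sketch, assuming it is run as root; the function name pm_test_loop and the SYS_POWER variable are made up for illustration (SYS_POWER only exists so the loop can be dry-run against a scratch directory).

```shell
# Directory holding the pm_test and state files; defaults to the real sysfs path.
SYS_POWER="${SYS_POWER:-/sys/power}"

# Run the "devices" test N times and stop at the first failing write.
pm_test_loop() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        # Select the test level: freeze processes and suspend devices only.
        echo devices > "$SYS_POWER/pm_test" || return 1
        # With pm_test set, this runs the test cycle instead of a real hibernation.
        if ! echo disk > "$SYS_POWER/state"; then
            echo "iteration $i: write to state failed" >&2
            return 1
        fi
        i=$((i + 1))
    done
    # Switch test mode off again so the next hibernation is a real one.
    echo none > "$SYS_POWER/pm_test"
}
```

For example, pm_test_loop 20 runs twenty cycles; checking dmesg afterwards shows whether any driver complained along the way.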
Comment 6 Zhang Rui 2014-10-24 06:28:53 UTC
ping...
Comment 7 Vitaliy Filippov 2014-10-24 08:00:40 UTC
Tried the suggestion from comment 5 20 times on the Samsung; nothing happened, and there were no errors in dmesg except:

[ 3058.621722] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[ 3058.621726] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000000002e last fence id 0x000000000000002c on ring 5)
[ 3058.621729] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 3058.621733] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

but the primary card is Intel HD4000, so this didn't affect the system; also this happened every time, so it couldn't cause an error which only reproduces sometimes.

Meanwhile, it seems I was wrong about the older Dell XPS 16: it had ~40 days of uptime, so the error didn't reproduce for those 40 days :) It could have had even more, but after the last suspend/resume the radeon GPU stalled and the screen went white, so I turned it off by holding the power button for 5 seconds.

The error still does reproduce sometimes on the Samsung though.
Comment 8 Aaron Lu 2014-10-24 08:04:36 UTC
Then trying pm_trace may give us some hint: https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-power

From your description, it seems that the image is restored and then something goes wrong and the system isn't resumed correctly, which should be caught by pm_trace.
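A rough sketch of a pm_trace session, assuming a kernel built with PM trace support and a root shell. The helper names arm_pm_trace and pm_trace_matches and the SYS_POWER variable are made up for illustration. Note that, per the sysfs-power document linked above, pm_trace stores its data in the RTC, so the hardware clock will hold an invalid time afterwards and needs to be set again.

```shell
# Directory holding the pm_trace file; defaults to the real sysfs path.
SYS_POWER="${SYS_POWER:-/sys/power}"

# Enable PM tracing before the hibernation attempt.
arm_pm_trace() {
    echo 1 > "$SYS_POWER/pm_trace"
}

# Filter kernel log text on stdin for the lines in which the kernel reports
# what matched the trace hash saved in the RTC.
pm_trace_matches() {
    grep 'hash matches'
}
```

Usage would be arm_pm_trace followed by echo disk > /sys/power/state. If the resume ends in a reboot, run dmesg | pm_trace_matches right after that boot to see which trace point or device was the last one reached.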
Comment 9 Vitaliy Filippov 2014-10-24 08:35:28 UTC
Hm, I tried suspend/resume again after the pm_test runs (with pm_test set back to none), and on the second attempt got something new: a kernel panic with the error message visible on the screen... took a photo on my phone :)

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa006b61f>] usb_device_match+0x2f/0x80 [usbcore]
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: ........
CPU: 2 PID: 4565 Comm: systemd-sleep Not tainted 3.16-1-amd64 #1 Debian 3.16.2-3
Hardware name: SAMSUNG ELECTRONICS CO., LTD. 870Z5E/880Z5E/680Z5E/NP880Z5E-X01UB, BIOS P02ADH.008.130604.SK 06/04/2013
...registers...
Call Trace:
__device_attach+0x22/0x40
bus_for_each_drv+0x53/0x90
device_attach+0x98/0xc0
rebind_marked_interfaces.isra.12+0x75/0xb0 [usbcore]
usb_resume_complete+0x18/0x20 [usbcore]
dpm_complete+0x11a/0x370
hibernation_snapshot+0x18e/0x370
hibernate+0x152/0x200
state_store+0xcc/0xe0
kernfs_fop_write+0xda/0x150
vfs_write+0xb2/0x1f0
SyS_write+0x42/0xa0
page_fault+0x28/0x30
system_call_fast_compare_end+0x10/0x15

This was on the screen for several seconds, then some NMI dumps showed up. The system didn't resume, but didn't reboot either.

Don't know if it's the same error or not, I've not seen this before.
Comment 10 Zhang Rui 2015-02-15 08:14:10 UTC
Can you always reproduce this NULL pointer dereference bug?
If yes, I think it would be great to fix that first and see if it makes any difference.
BTW, it would be good to try to reproduce the problem in the latest upstream kernel.
Comment 11 Aaron Lu 2015-02-27 07:31:23 UTC
Ping
Comment 12 Zhang Rui 2015-03-29 13:30:03 UTC
Bug closed as there is no response from the bug reporter for more than a month.
Please feel free to reopen it if you can reproduce the problem in the latest upstream kernel and provide the information requested in comment #10.
Comment 13 Vitaliy Filippov 2015-06-05 21:28:44 UTC
It seems it's in fact fixed in newer kernels (3.18? 3.19? 4.0?), because I've not seen it for several months. So everything is ok now, thanks :)