Bug 95231 - Fails to suspend from hibernation - Asus H97/Pro-Intel
Summary: Fails to suspend from hibernation - Asus H97/Pro-Intel
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: Intel Linux
Importance: P1 high
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-22 21:01 UTC by htd
Modified: 2015-12-28 05:50 UTC

See Also:
Kernel Version: 3.19.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Screenshot kernel dump (2.83 MB, image/jpeg)
2015-03-22 21:01 UTC, htd
Screenshot kernel dump 4.0.0-rc5 (2.26 MB, image/jpeg)
2015-03-23 06:40 UTC, htd
Screencopy 2 (kernel 4.0.0) (973.36 KB, image/jpeg)
2015-04-17 06:53 UTC, htd

Description htd 2015-03-22 21:01:18 UTC
Created attachment 171671 [details]
Screenshot kernel dump

This is an ASUS H97 Pro mainboard with the latest BIOS 2705 (March 2015).
S2R works flawlessly. S2D usually works too, but the system fails to resume properly about 3-4 times out of 10.

Went through the whole kernel suspend debug guide, with no result.
The only hint I have is a screenshot of a kernel dump when the machine hangs on resume (hard reset required).

Thanks, Heinz.
Comment 1 htd 2015-03-23 06:40:27 UTC
Created attachment 171721 [details]
Screenshot kernel dump 4.0.0-rc5

This is the screendump using bog standard vanilla 4.0.0-rc5.
Comment 2 Aaron Lu 2015-03-23 06:56:00 UTC
Was the image you captured triggered by a suspend debug test, like this?
# echo devices > /sys/power/pm_test
# echo disk > /sys/power/state
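
A minimal sketch of running the same test at each pm_test level in turn, assuming the levels documented for /sys/power/pm_test (freezer, devices, platform, processors, core). The function name and the optional directory argument are made up for illustration; the argument only exists so the loop can be dry-run against a scratch directory instead of the real /sys/power.

```shell
# run_pm_test_levels [POWER_DIR]
# Steps through the hibernation test levels from shallowest to deepest.
# POWER_DIR defaults to /sys/power; the override is a hypothetical hook
# for dry runs and is not part of any kernel interface.
run_pm_test_levels() {
    power_dir="${1:-/sys/power}"
    for level in freezer devices platform processors core; do
        echo "testing level: $level"
        # Select the test level, then start a (test-mode) hibernation cycle.
        echo "$level" > "$power_dir/pm_test" || return 1
        echo disk > "$power_dir/state" || return 1
    done
    # Switch test mode off again so later hibernation is real.
    echo none > "$power_dir/pm_test"
}
```

Run as root with no argument to use the real sysfs files. In test mode each level should come back to the shell after a short delay; a level that hangs narrows down which phase fails.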
Comment 3 htd 2015-03-23 07:16:44 UTC
No, it's triggered in real life, on a production machine.
Debugging the way you describe in your comment doesn't trigger it.
Comment 4 htd 2015-04-14 06:45:38 UTC
FWIW: the problem persists in 4.0.0.
Comment 5 Aaron Lu 2015-04-14 06:51:02 UTC
Do you mean the same dump as shown in the attached image happened with v4.0 kernel?
Comment 6 htd 2015-04-14 06:58:35 UTC
Yes. I hibernated by mistake and encountered the same error. Normally I no longer use S2D, because this is a production machine and every failure to wake up causes data corruption.

Btw: Win7-64 installed for testing purpose hibernates/wakes up fine.
Comment 7 Aaron Lu 2015-04-14 07:04:48 UTC
Does this occur while resuming from hibernation, or some time after the resume has completed?
Comment 8 htd 2015-04-14 07:09:05 UTC
The machine runs flawlessly. It only happens when resuming from S2D. I can see it because I have set no_console_suspend=1.
Comment 9 htd 2015-04-14 10:32:03 UTC
Btw: booting with maxcpus=1 solves the problem (but this is of course not an option on a multicore Xeon).
Comment 10 Aaron Lu 2015-04-15 06:26:53 UTC
It seems to have something to do with CPU hotplug. When the kernel resumes from hibernation, after the hibernation image is restored, it offlines all non-boot CPUs and then calls syscore_suspend, where timekeeping_suspended is set. If any code then tries to access the timekeeping code, e.g. ktime_get, which is called from cpu_idle_loop, the warning call trace you attached is printed.

Normally, after all non-boot CPUs are offlined, only CPU0 is alive, and it should be executing syscore_suspend rather than sitting in the idle loop. But your case seems to suggest that after all non-boot CPUs are offlined, CPU0 is still running the idle loop. I have no idea why this can occur...
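
A minimal sketch for exercising only the CPU hotplug step in isolation, assuming the standard sysfs hotplug files under /sys/devices/system/cpu/cpuN/online. This does not reproduce the hibernation path (syscore_suspend is not called); it only offlines and re-onlines the non-boot CPUs. The function name and the optional directory argument are made up for illustration.

```shell
# set_nonboot_cpus STATE [CPU_ROOT]
# STATE is 0 (offline) or 1 (online). CPU_ROOT defaults to the real sysfs
# directory; the override is a hypothetical hook for dry runs.
set_nonboot_cpus() {
    state="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for ctl in "$cpu_root"/cpu[0-9]*/online; do
        # Skip the literal pattern if the glob matched nothing.
        [ -e "$ctl" ] || continue
        # Leave the boot CPU alone.
        [ "$ctl" = "$cpu_root/cpu0/online" ] && continue
        echo "$state" > "$ctl" || return 1
    done
    return 0
}
```

Run as root: `set_nonboot_cpus 0` followed by `set_nonboot_cpus 1`. If this alone hangs the machine, the problem is probably in hotplug itself rather than in the hibernation code.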
Comment 11 htd 2015-04-15 06:48:17 UTC
Ok, thanks! Can I help in any way to debug this? And what should I do in this case?
Comment 12 Aaron Lu 2015-04-15 07:35:25 UTC
Is it possible to set up a serial console? Otherwise we will not be able to capture any logs.
Comment 13 htd 2015-04-15 09:30:08 UTC
The affected machine has no RS232 port, and the same applies to all the other machines I have access to. Is there any possibility to emulate a serial port? How could I do that?
Comment 14 htd 2015-04-15 16:44:46 UTC
Ok, I have just ordered two adapter cables providing the RS-232 serial interface and the connections necessary to connect a laptop running minicom to the affected machine. I will report back here when the order has arrived.

Is there something special I should take care of in order to be able to catch the most relevant information when configuring the serial console?

Or should I even post the screendump to the lkml in the meantime, in hope that it rings a bell for somebody?
Comment 15 Aaron Lu 2015-04-16 01:46:55 UTC
(In reply to htd from comment #14)
> Ok, I have just ordered two adapter cables providing the RS-232 serial
> interface and the connections necessary to connect a laptop running minicom
> to the affected machine. I will report back here when the order has
> arrived.
> 
> Is there something special I should take care of in order to be able to
> catch the most relevant information when configuring the serial console?
> 
> Or should I even post the screendump to the lkml in the meantime, in hope
> that it rings a bell for somebody?

I think that is a good idea. The PM mailing list should be the appropriate place: linux-pm@vger.kernel.org. Describe the problem precisely and attach the dump image; let's see if they have a better idea. Thanks.
Comment 16 htd 2015-04-17 06:50:04 UTC
Posted a description of the problem to the linux-pm mailing list and got redirected back here. On request, attached is what I see when the system hangs when trying to resume from S2D with "idle=poll" set.
Comment 17 htd 2015-04-17 06:53:04 UTC
Created attachment 174251 [details]
Screencopy 2 (kernel 4.0.0)

Here it hangs (hard reset required) when booting with "idle=poll" and trying to resume from S2D.
Comment 18 htd 2015-04-29 18:32:33 UTC
The serial cables have finally arrived, and the serial console is set up and working. What can I do to debug this problem further? Are there any special parameters, configurations or the like I should focus on?
Comment 19 htd 2015-05-10 14:07:49 UTC
The problem still persists with vanilla 4.0.2. The serial console shows that the machine hibernates fine. When resuming, the whole image gets loaded, and then it hangs at the same place as shown in the attachments in this thread. Unfortunately, the serial console also hangs and there is no output. The last hang left 11 GB of data in lost+found, and restoring from a backup was necessary to be able to use the machine again.
Comment 20 Aaron Lu 2015-05-19 03:00:56 UTC
(In reply to htd from comment #1)
> Created attachment 171721 [details]
> Screenshot kernel dump 4.0.0-rc5
> 
> This is the screendump using bog standard vanilla 4.0.0-rc5.

Does this happen before or after non-boot CPUs are offlined?
Comment 21 htd 2015-05-19 05:39:33 UTC
As far as I can see, it happens directly after.
Comment 22 Aaron Lu 2015-05-19 07:14:34 UTC
Do you have initcall_debug in your kernel cmdline? If not, please add that and attach the serial log when the problem occurs next time.
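
A minimal sketch for checking that the debug parameters are actually active on the running kernel, assuming the command line is read from /proc/cmdline. The function name and the optional file argument are made up for illustration; the serial console setting shown in the usage note (console=ttyS0,115200n8) is an assumed example and depends on the adapter in use.

```shell
# has_param NAME [CMDLINE_FILE]
# Succeeds if NAME appears on the kernel command line, either bare
# (initcall_debug) or with a value (console=...).
# CMDLINE_FILE defaults to /proc/cmdline; the override is a hypothetical
# hook for testing against a saved command line.
has_param() {
    want="$1"
    file="${2:-/proc/cmdline}"
    # Put one parameter per line, then look for NAME or NAME=value.
    tr -s ' ' '\n' < "$file" |
        awk -v want="$want" '
            $0 == want || index($0, want "=") == 1 { found = 1 }
            END { exit !found }'
}
```

For example, `has_param initcall_debug || echo "initcall_debug is not set"`. The same check can be repeated for no_console_suspend and console before trying to capture a serial log.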
Comment 23 Chen Yu 2015-09-08 01:46:23 UTC
I remember there was once a CPU idle problem related to resuming, but it should be fixed by now. Can you please help confirm whether it still exists on the latest kernel?
Thanks
Yu
Comment 24 Chen Yu 2015-11-03 16:20:09 UTC
Hi Heinz,
do you still see this bug on 4.3?
Thanks
Comment 25 htd 2015-11-08 12:07:37 UTC
(In reply to Chen Yu from comment #24)
> Hi Heinz,
> do you still see this bug on 4.3?
> Thanks

Yes, unfortunately. When the machine tries to come back, I can see how the image is getting uncompressed, and then the screen gets black and all is dead (and the filesystems corrupted).
Comment 26 Aaron Lu 2015-11-09 02:31:44 UTC
(In reply to htd from comment #25)
> (In reply to Chen Yu from comment #24)
> > Hi Heinz,
> > do you still see this bug on 4.3?
> > Thanks
> 
> Yes, unfortunately. When the machine tries to come back, I can see how the
> image is getting uncompressed, and then the screen gets black and all is
> dead (and the filesystems corrupted).

Is the serial log available? If so, do you see anything problematic?
Comment 27 biscuit_2014 2015-11-16 12:18:27 UTC
I can confirm this bug.

Hardware: ASUS K401LB5200
Linux OS: Ubuntu 14.04.3 amd64
Linux kernel: ubuntu 3.19.0-33-generic , mainline 4.1.13

Resume fails about one time in three. The system hangs after the following console output (with no_console_suspend):

Disabling non-boot CPUs...
intel_pstate CPU 1 exiting
smpboot: CPU 1 is now offline
intel_pstate CPU 2 exiting
smpboot: CPU 2 is now offline
intel_pstate CPU 3 exiting
smpboot: CPU 3 is now offline

I just downgraded to 3.16; it seems to work well, but needs further testing.

More info:
http://askubuntu.com/questions/678265/xubuntu-14-04-fails-to-resume-from-hibernation
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
Comment 28 biscuit_2014 2015-11-18 07:38:28 UTC
I found that the resume is more likely to fail when the image is large; maybe this is useful information.

I filed an Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1516606

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1516606/+attachment/4520611/+files/bug1516606_hibernate_debug_info.tar.gz
Comment 29 Chen Yu 2015-11-23 04:08:14 UTC
(In reply to htd from comment #25)
> (In reply to Chen Yu from comment #24)
> > Hi Heinz,
> > do you still see this bug on 4.3?
> > Thanks
> 
> Yes, unfortunately. When the machine tries to come back, I can see how the
> image is getting uncompressed, and then the screen gets black and all is
> dead (and the filesystems corrupted).

Does 'dead' mean the system hangs as in comment #1 (ktime_get warning) or as in comment #16 (no response after 'smpboot: CPU 7 is now offline', with idle=poll set)?
Can you please test as follows:
1. Append 'init=/bin/bash text resume=/dev/sdaxx' (sdaxx being your swap partition) to the kernel command line.
2. Boot into the simple shell and run swapon /dev/sdaxx.
3. echo disk > /sys/power/state
4. Resume the system by booting again with 'init=/bin/bash text resume=/dev/sdaxx'.
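
Steps 2 and 3 above can be sketched as one small function to type or paste into the init=/bin/bash shell. The function name and the optional directory argument are made up for illustration, and the swap device must be supplied by the caller. The sysfs check is an added assumption: depending on the initramfs, /sys may not be mounted in such a minimal environment.

```shell
# hibernate_from_minimal_shell SWAP_DEV [POWER_DIR]
# POWER_DIR defaults to /sys/power; the override is a hypothetical hook
# for dry runs.
hibernate_from_minimal_shell() {
    swap_dev="$1"
    power_dir="${2:-/sys/power}"
    if [ -z "$swap_dev" ]; then
        echo "usage: hibernate_from_minimal_shell SWAP_DEV" >&2
        return 2
    fi
    if [ ! -d "$power_dir" ]; then
        echo "$power_dir not found; is sysfs mounted?" >&2
        return 1
    fi
    swapon "$swap_dev" || return 1      # step 2: enable the swap partition
    echo disk > "$power_dir/state"      # step 3: hibernate
}
```

Because almost nothing else is running in this environment, a hang here would point at the kernel and firmware rather than at userspace or at a large image.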
Comment 30 Zhang Rui 2015-12-28 05:50:44 UTC
Bug closed as there has been no response from the original reporter for more than a month. Please feel free to reopen it if you can provide the information requested.

biscuit_2014@outlook.com
Please file a new bug report with the detailed information and laptop model name.
