Created attachment 171671 [details] Screenshot kernel dump This is an ASUS H97 Pro mainboard with the latest BIOS 2705 (March 2015). S2R works flawlessly. S2D mostly works too, but the system can't suspend properly about 3-4 times out of 10. I went through the whole kernel suspend debug guide, with no result. The only hint I have is a screenshot of a kernel dump taken when the machine hangs on resume (hard reset required). Thanks, Heinz.
Created attachment 171721 [details] Screenshot kernel dump 4.0.0-rc5 This is the screendump using bog standard vanilla 4.0.0-rc5.
Was the image you captured triggered by a suspend debug test, like this?
# echo devices > /sys/power/pm_test
# echo disk > /sys/power/state
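For reference, a minimal sketch of how the pm_test check above could be extended to every test level, shallowest first. The level names (freezer, devices, platform, processors, core) are the ones listed in the kernel's PM debugging documentation; the `pm_test_cycle` function name and the `SYSFS_POWER` override are assumptions made for this illustration, not something used in this report.

```shell
# Sketch: step through the pm_test levels for hibernation, shallowest first.
# SYSFS_POWER is a hypothetical override so the function can be tried
# against a scratch directory; by default it writes to /sys/power.
pm_test_cycle() {
    power_dir="${SYSFS_POWER:-/sys/power}"
    for level in freezer devices platform processors core; do
        echo "pm_test level: $level"
        # Select the test level, then start a hibernation cycle that the
        # kernel cuts short at that level and rolls back.
        echo "$level" > "$power_dir/pm_test" || return 1
        echo disk > "$power_dir/state" || return 1
    done
    # Switch test mode off again so the next hibernation is a real one.
    echo none > "$power_dir/pm_test"
}
```

On a real system this would be run as root by calling `pm_test_cycle`. The first level that hangs narrows down which phase is involved, although the next comment shows that in this report the test mode does not trigger the hang.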
No, it's triggered in real life, on a production machine. Debugging the way you describe in your comment doesn't trigger it.
FWIW: the problem persists in 4.0.0.
Do you mean the same dump as shown in the attached image happened with v4.0 kernel?
Yes. I hibernated by mistake and encountered the same error. Usually I no longer use S2D, because this is a production machine and every failure to wake up causes data corruption. Btw: Win7-64, installed for testing purposes, hibernates and wakes up fine.
Does this occur while resuming from hibernation, or some time after the resume has completed?
The machine runs flawlessly. It only happens when resuming from S2D. I can see it because I have set no_console_suspend=1.
Btw: booting with maxcpus=1 solves the problem (but this is of course not an option on a multicore Xeon).
It seems to have something to do with CPU hotplug. When the kernel resumes from hibernation, after the hibernation image has been read into memory, it offlines all non-boot CPUs and then calls syscore_suspend, where timekeeping_suspended is set. If any code accesses the timekeeping code after that, e.g. ktime_get, which is called from cpu_idle_loop, the warning call trace you attached is printed. Normally, after all non-boot CPUs are offlined, only CPU0 is alive, and it should be executing syscore_suspend rather than sitting in the idle loop. But your case seems to suggest that after all non-boot CPUs are offlined, CPU0 is still running the idle loop. I have no idea why this can occur...
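Since the analysis above points at CPU hotplug, here is a minimal sketch for exercising the offline/online path outside of hibernation, through the standard sysfs files `/sys/devices/system/cpu/cpuN/online`. Whether this reproduces or narrows down the hang is an open question and not something confirmed in this report. The function name `cycle_nonboot_cpus` and the `CPU_SYSFS` override are assumptions made for this illustration.

```shell
# Sketch: take every non-boot CPU offline and bring it back, using the
# standard sysfs hotplug files. CPU_SYSFS is a hypothetical override for
# trying the function against a scratch directory.
cycle_nonboot_cpus() {
    cpu_dir="${CPU_SYSFS:-/sys/devices/system/cpu}"
    for online in "$cpu_dir"/cpu[0-9]*/online; do
        # The glob stays unexpanded when nothing matches; skip that case.
        [ -e "$online" ] || continue
        cpu="${online%/online}"
        cpu="${cpu##*/}"
        # The boot CPU stays up during hibernation, so leave cpu0 alone.
        [ "$cpu" = "cpu0" ] && continue
        echo 0 > "$online" || return 1
        echo "$cpu offline"
        echo 1 > "$online" || return 1
        echo "$cpu online"
    done
}
```

Run as root, `cycle_nonboot_cpus` prints one line per transition, so a hang shows up as a missing "online" line for the last CPU touched.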
Ok, thanks! Can I help in any way to debug this? And what should I do in this case?
Is it possible to set up a serial console? Otherwise we will not be able to capture any logs.
The affected machine has no RS232 port, and the same applies to all the other machines I have access to. Is there any possibility to emulate a serial port? How could I do that?
Ok, I have just ordered two adapter cables providing the RS-232 serial interface and the connections necessary to connect a laptop running minicom to the affected machine. I will report back here when the order has arrived. Is there something special I should take care of when configuring the serial console, in order to catch the most relevant information? Or should I post the screendump to the lkml in the meantime, in the hope that it rings a bell for somebody?
(In reply to htd from comment #14)
> Ok, I have just ordered two adapter cables providing the RS-232 serial
> interface and the connections necessary to connect a laptop running minicom
> to the affected machine. I will report back here when I the order has
> arrived.
>
> Is there something special I should take care of in order to be able to
> catch the most relevant information when configuring the serial console?
>
> Or should I even post the screendump to the lkml in the meantime, in hope
> that it rings a bell for somebody?
I think that is a good idea. Posting to the PM mailing list should be appropriate: linux-pm@vger.kernel.org. Describe the problem precisely and attach the dump image, let's see if they have some better idea. Thanks.
Posted a description of the problem to the linux-pm mailing list and got redirected back here. On request, attached is what I see when the system hangs when trying to resume from S2D with "idle=poll" set.
Created attachment 174251 [details] Screencopy 2 (kernel 4.0.0) Here it hangs (hard reset required) when booting with "idle=poll" and trying to resume from S2D.
The serial cables have finally arrived, and the serial console is set up and working. What can I do to debug this problem further? Are there any special parameters, configurations or the like I should focus on?
The problem still persists with vanilla 4.0.2. The serial console shows that the machine hibernates fine. When resuming, the whole image gets loaded, and then it hangs at the same place as shown in the attachments in this thread. Unfortunately, the serial console also hangs and there is no output. The last hang left 11 GB of data in lost+found, and I had to restore from a backup to be able to use the machine again.
(In reply to htd from comment #1)
> Created attachment 171721 [details]
> Screenshot kernel dump 4.0.0-rc5
>
> This is the screendump using bog standard vanilla 4.0.0-rc5.
Does this happen before or after non-boot CPUs are offlined?
As far as I can see, it happens directly after.
Do you have initcall_debug in your kernel cmdline? If not, please add that and attach the serial log when the problem occurs next time.
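A minimal sketch for checking that the debugging parameters mentioned in this thread (initcall_debug, and no_console_suspend from the earlier comment) are actually present on the running kernel's command line. The function name `missing_debug_params` is an assumption made for this illustration; it reads `/proc/cmdline` unless a command line string is passed as an argument.

```shell
# Sketch: report which debugging parameters are missing from a kernel
# command line. Reads /proc/cmdline unless a string is passed as $1.
missing_debug_params() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for param in initcall_debug no_console_suspend; do
        case " $cmdline " in
            # Present, either bare or followed by =value.
            *" $param "*|*" $param="*) ;;
            *) echo "$param" ;;
        esac
    done
}
```

Calling `missing_debug_params` with no argument prints nothing when both parameters are set, and otherwise prints one missing parameter per line.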
I remember there was once a CPU idle problem related to resume, but it should have been fixed. Can you please confirm whether the problem still exists on the latest kernel? Thanks, Yu
Hi Heinz, do you still see this bug on 4.3? Thanks.
(In reply to Chen Yu from comment #24)
> Hi, Heinz
> did you still see this bug on 4.3 ?
> thanks
Yes, unfortunately. When the machine tries to come back, I can see how the image is getting uncompressed, and then the screen gets black and all is dead (and the filesystems corrupted).
(In reply to htd from comment #25)
> (In reply to Chen Yu from comment #24)
> > Hi, Heinz
> > did you still see this bug on 4.3 ?
> > thanks
>
> Yes, unfortunately. When the machine tries to come back, I can see how the
> image is getting uncompressed, and then the screen gets black and all is
> dead (and the filesystems corrupted).
Is the serial log available? If so, do you see anything problematic?
I can confirm this bug.
Hardware: ASUS K401LB5200
Linux OS: Ubuntu 14.04.3 amd64
Linux kernel: Ubuntu 3.19.0-33-generic, mainline 4.1.13
Resume fails about one time in three. The system hangs after the following console output (with no_console_suspend):
  Disabling non-boot CPUs...
  intel_pstate CPU 1 exiting
  smpboot: CPU 1 is now offline
  intel_pstate CPU 2 exiting
  smpboot: CPU 2 is now offline
  intel_pstate CPU 3 exiting
  smpboot: CPU 3 is now offline
I just downgraded to 3.16, which seems to work well, but it may need further testing.
More info:
http://askubuntu.com/questions/678265/xubuntu-14-04-fails-to-resume-from-hibernation
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
I found that the resume is more likely to fail when the image is large; maybe this is useful information. I also filed an Ubuntu bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1516606
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1516606/+attachment/4520611/+files/bug1516606_hibernate_debug_info.tar.gz
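When a console or serial capture like the one above is saved to a file, a small helper can show how far CPU offlining got before the hang. This is a minimal sketch that assumes the log contains the "smpboot: CPU N is now offline" lines quoted above; the function name `last_offlined_cpu` is hypothetical.

```shell
# Sketch: print the number of the last CPU that the log file given as $1
# reports as offline. Prints nothing if no such line is found.
last_offlined_cpu() {
    sed -n 's/.*smpboot: CPU \([0-9][0-9]*\) is now offline.*/\1/p' "$1" | tail -n 1
}
```

If the printed number is the highest non-boot CPU, the hang came after offlining completed, which is the case asked about earlier in this thread.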
(In reply to htd from comment #25)
> (In reply to Chen Yu from comment #24)
> > Hi, Heinz
> > did you still see this bug on 4.3 ?
> > thanks
>
> Yes, unfortunately. When the machine tries to come back, I can see how the
> image is getting uncompressed, and then the screen gets black and all is
> dead (and the filesystems corrupted).
Does 'dead' mean the system hangs as in comment 1 (ktime_get warning) or as in comment 16 (no response after 'smpboot: CPU 7 is now offline', with idle=poll set)? Can you please test as follows:
1. Append 'init=/bin/bash text resume=/dev/sdaxx' (your swap partition) to the kernel command line.
2. Boot into the simple shell and run: swapon /dev/sdaxx
3. echo disk > /sys/power/state
4. Resume the system by booting with 'init=/bin/bash text resume=/dev/sdaxx' again.
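Steps 2 and 3 of the test above can be sketched as a small shell function to run from the init=/bin/bash shell. The function name `hibernate_minimal` and the `DRY_RUN` switch are assumptions made for this illustration, and /dev/sdaxx remains the placeholder for the swap partition. The sysfs mount check is there because a shell started with init=/bin/bash may not have /sys mounted yet.

```shell
# Sketch of steps 2 and 3, to be run from the init=/bin/bash shell.
# $1 is the swap partition (the /dev/sdaxx placeholder from the steps).
# DRY_RUN=1 is a hypothetical switch that only prints the commands.
hibernate_minimal() {
    swapdev="$1"
    if [ -z "$swapdev" ]; then
        echo "usage: hibernate_minimal <swap partition>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "swapon $swapdev"
        echo "echo disk > /sys/power/state"
        return 0
    fi
    # With init=/bin/bash, sysfs may not be mounted yet.
    [ -e /sys/power/state ] || mount -t sysfs sysfs /sys || return 1
    swapon "$swapdev" || return 1
    echo disk > /sys/power/state
}
```

Running `DRY_RUN=1 hibernate_minimal /dev/sdaxx` first shows what would be executed without touching the system.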
Bug closed, as there has been no response from the original reporter for more than a month. Please feel free to reopen it if you can provide the information requested. biscuit_2014@outlook.com: please file a new bug report with the detailed information and the laptop model name.