Bug 77571

Summary: unable to handle kernel paging request on resume
Product: Power Management
Reporter: kb
Component: Hibernation/Suspend
Assignee: Rafael J. Wysocki (rjw)
Status: CLOSED DOCUMENTED
Severity: normal
CC: aaron.lu, bjoernv, bojan, jlee, rui.zhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.14.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: error + trace

Description kb 2014-06-10 06:07:00 UTC
Created attachment 138781
error + trace

Every once in a while (not always, roughly once a month) my computer fails to resume from hibernation. This has been going on for a long time, across several kernel versions. I've captured the error in a photo; see the attachment.
Comment 1 kb 2014-06-19 05:07:22 UTC
Happened a couple of times since I reported. Most recently on kernel version 3.14.6-1-ARCH
Comment 2 kb 2014-06-29 18:26:28 UTC
Still happening in 3.15.2
Comment 3 Aaron Lu 2014-08-04 02:47:10 UTC
Does adding hibernate=nocompress help?
Comment 4 kb 2014-08-11 18:26:28 UTC
The bug did not happen for a long time; but I got a crash today with kernel 3.15.5

I will try with that option. I suspect if nothing else it will take longer to suspend :)

Do you have any tips on how to increase the likelihood/reproducibility of the bug?
Comment 5 kb 2014-08-29 18:44:45 UTC
I've been running with nocompress for a while. I see this as a good quick fix/workaround, but are there any plans to fix this issue?
Comment 6 Bojan Smojver 2014-08-29 22:49:53 UTC
(In reply to kb from comment #5)
> I've been running with nocompress for a while. I see this as a good quick
> fix/workaround, but are there any plans to fix this issue?

The compression/decompression code stores a checksum of the image, so when it decompresses, we can be reasonably certain that what was read in is the same as what was written to the disk.

One other difference is that compression/decompression uses multiple threads. Are you sure that you don't have some sort of device that has a driver problem causing this failure?
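
The checksum guarantee mentioned in this comment can be illustrated with a minimal sketch. This is not the kernel's implementation: zlib stands in for LZO (which is not in the Python standard library), and the names write_image and read_image are hypothetical. It only shows the idea of keeping a running CRC32 of the uncompressed data at write time and comparing it after decompression.

```python
import zlib

def write_image(pages):
    """Compress each page and keep a running CRC32 of the uncompressed data."""
    crc = 0
    blocks = []
    for page in pages:
        crc = zlib.crc32(page, crc)        # checksum covers the original bytes
        blocks.append(zlib.compress(page)) # zlib is a stand-in for LZO here
    return blocks, crc

def read_image(blocks, expected_crc):
    """Decompress each block and verify the running CRC32 at the end."""
    crc = 0
    pages = []
    for block in blocks:
        page = zlib.decompress(block)
        crc = zlib.crc32(page, crc)
        pages.append(page)
    if crc != expected_crc:
        raise ValueError("image checksum mismatch")
    return pages

pages = [bytes([i]) * 4096 for i in range(4)]
blocks, crc = write_image(pages)
restored = read_image(blocks, crc)
print(restored == pages)
```

A check like this only confirms that the data read back matches the data written; it says nothing about where in memory the data ends up afterwards.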
Comment 7 kb 2014-09-01 06:09:38 UTC
(In reply to Bojan Smojver from comment #6)
> Compression/decompression code stores a checksum of the image, so when it
> decompresses, we can be reasonably certain that what was read in is the same
> as what was written to the disk.

Is this bug related to incorrect checksum?

> One other difference is that compression/decompression uses multiple
> threads. Are you sure that you don't have some sort of device that has a
> driver problem causing this failure?

I don't have any *problems*, but I have some error messages related to
a) gpu (amd r9 290) driver - to be solved in 3.17 hopefully
b) occasional late reset of NIC - but it works fine
Comment 8 Bojan Smojver 2014-09-01 06:41:14 UTC
(In reply to kb from comment #7)
 
> Is this bug related to incorrect checksum?

I was just pointing out that the restored image is checked for validity, so the act of compression/decompression should not (in theory) affect what we get back.

> I don't have any *problems*, but I have some error messages related to
> a) gpu (amd r9 290) driver - to be solved in 3.17 hopefully
> b) occasional late reset of NIC - but it works fine

Very hard to say, but it is possible that the different timings of the multithreaded compression/decompression code are interacting differently with one of the drivers and causing the problem intermittently.
Comment 9 kb 2014-09-01 13:28:43 UTC
(In reply to Bojan Smojver from comment #8)
> Very hard to say, but it is possible that different timings of multithreaded
> compression/decompression code are interacting differently with one of the
> drivers and causing the problem intermittently.

I've only had this HW since January; this bug was also happening on my old HW. That was a completely different platform (AMD vs Intel) and it did not have the 2 driver issues I outlined.

Is there something I can do to help us get to the bottom of this?
Comment 10 Bojan Smojver 2014-09-01 21:38:23 UTC
(In reply to kb from comment #9)

> Is there something I can do to help us get to the bottom of this?

Good question...

When this happens, the decompression already finished? Or didn't start yet?

This is an x86_64 kernel, correct?
Comment 11 kb 2014-09-02 16:21:10 UTC
(In reply to Bojan Smojver from comment #10)
> When this happens, the decompression already finished? Or didn't start yet?

It does not crash immediately. So judging by the timing, it's not before. Either during or after.

> This is an x86_64 kernel, correct?

Yes - stock arch linux
Comment 12 Bojan Smojver 2014-09-02 21:43:58 UTC
(In reply to kb from comment #11)

> It does not crash immediately. So judging by the timing, it's not before.
> Either during or after.

So, the decompression has finished, the thawed kernel was restored and switched to and is up and running, then this happens. Correct?

Or something else?
Comment 13 kb 2014-09-03 14:25:57 UTC
(In reply to Bojan Smojver from comment #12)
> So, the decompression has finished, the thawed kernel was restored and
> switched to and is up and running, then this happens. Correct?
> 
> Or something else?

Wouldn't dare to guess. In the "screen shot" (image attached to this bug report) it takes 1.5 s from the last log line to the crash. That might be too fast for a complete decompress (that machine had 8 GB of RAM)? Can you see something from the stack trace? I can enable compression and do some more tests if needed.
Comment 14 Bojan Smojver 2014-09-05 06:58:57 UTC
(In reply to kb from comment #13)
> That might be too fast for complete decompress (that machine had 8GB of ram)?

Yeah, probably.

> Can you see something from the stack trace? I can enable compression and do
> some more tests if needed.

The only thing I can think of is the load_image_lzo() function trying to access something at the wrong address.

Can you load your kernel image and debugging symbols into gdb and find out what load_image_lzo+0x8b4 is?
Comment 15 kb 2014-09-15 18:13:15 UTC
OK, but it'll take some time since I'll need to recompile the kernel with debug symbols and wait for the issue to happen.
Comment 16 Bojan Smojver 2014-09-15 21:02:00 UTC
Doesn't your distro provide debug packages? I thought this was a distro kernel we were talking about...
Comment 17 kb 2014-09-16 15:43:53 UTC
(In reply to Bojan Smojver from comment #16)
> Doesn't your distro provide debug packages? I thought this was a distro
> kernel we were talking about...

Sadly no - AFAIK Arch doesn't have that, despite many discussions.
Comment 18 Aaron Lu 2014-10-17 08:04:14 UTC
Any update kb?
Comment 19 kb 2014-10-17 18:24:16 UTC
Hi,
I'm still trying to get the exception. I've been running the debug kernel for over a month, but it has not happened so far.
Comment 20 kb 2014-10-30 17:09:40 UTC
I have a little trouble resuming with my debug kernel. Any ideas on the below?

[   14.720661] random: nonblocking pool is initialized
[   23.970455] PM: Using 3 thread(s) for decompression.
PM: Loading and decompressing image data (2825755 pages)...
[   24.189372] PM: Image loading progress:   0%
[   24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]
[   24.425282] PM: Read 11303020 kbytes in 0.45 seconds (25117.82 MB/s)
[   24.426079] PM: Error -14 resuming
[   24.426092] PM: Failed to load hibernation image, recovering.
[   24.501604] PM: Basic memory bitmaps freed
[   24.501605] Restarting tasks ... done.
[   24.503087] PM: Hibernation image not present or could not be loaded.
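
The figures in this log can be cross-checked with a little arithmetic. The sketch below assumes 4 KiB pages (the usual x86_64 page size); the helper names are hypothetical. The page count times 4 matches the "kbytes" figure exactly, which suggests the reported rate is the full image size divided by the time until the load aborted, rather than a real disk throughput. Error -14 corresponds to EFAULT ("Bad address").

```python
import errno

def image_kbytes(pages, page_kib=4):
    """Image size in kbytes, assuming page_kib KiB per page."""
    return pages * page_kib

def read_rate_mb_s(kbytes, seconds):
    """Rate in MB/s, computed as kbytes / seconds / 1000."""
    return kbytes / seconds / 1000

kbytes = image_kbytes(2825755)       # page count from the log
print(kbytes)                        # 11303020
print(round(read_rate_mb_s(kbytes, 0.45), 2))  # 25117.82
print(errno.errorcode[14])           # EFAULT
```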
Comment 21 kb 2014-11-01 09:36:24 UTC
I see this patch in recent kernels
http://lkml.org/lkml/2014/8/4/375

Which explains both the original oops and the new error -14
It says something about BIOS being at fault, but frankly I've got no clue what to do to fix it. Any ideas?
Comment 22 Bojan Smojver 2014-11-01 10:25:28 UTC
(In reply to kb from comment #21)
> I see this patch in recent kernels
> http://lkml.org/lkml/2014/8/4/375
> 
> Which explains both the original oops and the new error -14
> It says something about BIOS being at fault, but frankly I've got no clue
> what to do to fix it. Any ideas?

Please take this with a massive amount of salt, because I really have absolutely no idea what I am talking about here, but does your computer support EFI boot?

PS. I never tried EFI, don't have any machines that support it, have no idea how it works, whether Linux hibernation will work with it etc. Just a stab in the dark...
Comment 23 kb 2014-11-01 17:46:25 UTC
Yes it supports EFI boot but I never got it working despite several attempts, so I'm booting in the bios mode.
Comment 24 Lee, Chun-Yi 2014-11-04 04:13:29 UTC
(In reply to kb from comment #21)
> I see this patch in recent kernels
> http://lkml.org/lkml/2014/8/4/375

This patch adds a clearer message indicating that a page in the hibernation image is in the CURRENT e820 region. Without this patch, you would only see the kernel oops shown in the picture attached to the bug description.

The kernel oops in your attached picture is caused by the hibernation code trying to write image content to a page that was not allocated, because that page is covered by the current e820 region. It happens randomly because the BIOS changed the e820 table.

> 
> Which explains both the original oops and the new error -14
> It says something about BIOS being at fault, but frankly I've got no clue
> what to do to fix it. Any ideas?

[   24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]

The above message indicates that address 0xab9bc000 is in an e820 nosave region when resuming. That means the BIOS did NOT keep the e820 table constant between hibernating and restoring.
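
As a rough illustration of how such a log line can be read, the following sketch extracts the address and the region bounds and checks that the address falls inside the region. The regular expression is written against the single line quoted above and is an assumption about its format; parse_nosave is a hypothetical helper.

```python
import re

LINE = "[   24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]"

# Pattern modelled on the one log line quoted in this report.
PATTERN = re.compile(
    r"PM: (0x[0-9a-f]+) in e820 nosave region: "
    r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]"
)

def parse_nosave(line):
    """Return (addr, start, end) as integers, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    addr, start, end = (int(g, 16) for g in m.groups())
    return addr, start, end

addr, start, end = parse_nosave(LINE)
print(start <= addr <= end)     # True
print(hex(end - start + 1))     # 0x7000
```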

Please check whether your machine is using the [platform] mode of hibernation:

# cat /sys/power/disk
[platform] shutdown reboot suspend

If the machine is using [shutdown] mode and [platform] mode is not listed, then the machine does not support hibernation well, because the BIOS does not provide _S4_. You can also check whether dmesg contains a message similar to the following:

[    0.571263] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S4_] (20130...

If your machine supports _S4_, the kernel uses [platform] mode. In that case, if the BIOS fails to keep the e820 table unchanged after _S4_ is evaluated by the OS kernel, it is a BIOS bug.
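
For scripting the /sys/power/disk check described above, a minimal sketch follows. It parses the text format shown in the example output, where the active mode is the one in square brackets; parse_disk_modes is a hypothetical helper name.

```python
def parse_disk_modes(text):
    """Return (selected_mode, all_modes) from the content of /sys/power/disk."""
    selected = None
    modes = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets marking the active mode
            selected = token
        modes.append(token)
    return selected, modes

selected, modes = parse_disk_modes("[platform] shutdown reboot suspend")
print(selected)                # platform
print("platform" in modes)     # True
```

On a real system the input would come from reading /sys/power/disk, for example with open("/sys/power/disk").read().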

One way that MAY relieve the symptoms is to use the memmap kernel parameter to always mark the address as unusable in the kernel, as the e820 table does:

    memmap=0xab9bc000\$0xab9c2fff         <=== you can change the area to minimize by yourself

However, because the e820 table is handled by the BIOS, you need to add an address to the memmap parameter every time you hit the same problem. I mean that the BIOS may change more than one address in the e820 table.
Comment 25 kb 2014-11-04 16:29:51 UTC
Thanks for the comment, this really helps. I guess by e820 you mean this http://en.wikipedia.org/wiki/E820

I do indeed use the shutdown method to power off my system (I do it manually in my hibernate script). I had no idea that there would be any problem with this. I'll do some more experiments and report back.
Comment 26 kb 2014-11-04 16:33:14 UTC
Actually 2 more things: my dmesg reports this
[    0.311556] ACPI: (supports S0 S3 S4 S5)
And as I remember the problem also happens when I use the reboot method (I will double check this).
Comment 27 Lee, Chun-Yi 2014-11-05 00:20:48 UTC
(In reply to kb from comment #26)
> Actually 2 more things: my dmesg reports this
> [    0.311556] ACPI: (supports S0 S3 S4 S5)
> And as I remember the problem also happens when I use the reboot method (I
> will double check this).

It is better to use [platform] mode when your machine supports S4.
Comment 28 kb 2014-11-05 18:19:08 UTC
Unfortunately, with the latest 3.17.2 kernel my system reboots right after the 'suspending consoles' message when resuming from hibernate. Any idea how to investigate?
Comment 29 Lee, Chun-Yi 2014-11-06 02:12:55 UTC
(In reply to kb from comment #28)
> Unfortunately, with the latest 3.17.2 kernel my system reboots right after
> the 'suspending consoles' message when resuming from hibernate. Any idea how
> to investigate?

Sorry, I have no idea about this situation. It's possible that a triple fault happened during resume from hibernation, which is hard to debug.

Is it 100% reproducible? You can try the debugging approach described in Documentation/power/basic-pm-debugging.txt.

To capture a debug log, please remember to remove the distro's splash, and you can try using no_console_suspend to grab more messages:

console=tty0 earlyprintk=tty0 debug no_console_suspend=1 loglevel=9 nomodeset

The best option is to enable a serial console, if the machine supports it.

Also, if this issue was not seen before 3.17 and it is very easy to reproduce, then git bisect may be able to find which kernel patch is related to the problem.
Comment 30 Zhang Rui 2014-12-02 12:44:29 UTC
ping...
Comment 31 kb 2014-12-03 17:22:28 UTC
I'm still here. The reboot issue was caused by bad initrd. I've gone through some resume cycles and didn't get any problems in some time. I'm keeping my hopes up.
Comment 32 kb 2014-12-27 15:00:56 UTC
I think this case can be closed. The original problem (kernel crash) was caught independently by Lee's patch. The cause is likely the e820 thing.
Comment 33 Lee, Chun-Yi 2015-01-19 05:01:49 UTC
(In reply to Lee, Chun-Yi from comment #24)
> (In reply to kb from comment #21)
...
> One way that MAY relieve the symptoms is to use the memmap kernel parameter
> to always mark the address as unusable in the kernel, as the e820 table does:
> 
>     memmap=0xab9bc000\$0xab9c2fff         <=== you can change the area to
> minimize by yourself
> 

I found that this memmap example should be changed to: memmap=0x00000001\$0xab9bc000

This is based on Documentation/kernel-parameters.txt:

        memmap=nn[KMG]$ss[KMG]
                        [KNL,ACPI] Mark specific memory as reserved.
                        Region of memory to be reserved is from ss to ss+nn.
                        Example: Exclude memory from 0x18690000-0x1869ffff
                                 memmap=64K$0x18690000
                                 or
                                 memmap=0x10000$0x18690000
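
Following the nn$ss format quoted above (size first, then start address), the parameter for an inclusive address range can be derived as in this sketch. memmap_reserve is a hypothetical helper; the second call uses the region from the log in comment 20 and shows what the value would be if the whole region, rather than a single byte, were to be reserved.

```python
def memmap_reserve(start, end):
    """Build a memmap=nn$ss string reserving the inclusive range [start, end]."""
    if end < start:
        raise ValueError("end must not be below start")
    size = end - start + 1      # region runs from ss to ss+nn
    return "memmap={:#x}${:#x}".format(size, start)

# Example from Documentation/kernel-parameters.txt
print(memmap_reserve(0x18690000, 0x1869ffff))   # memmap=0x10000$0x18690000
# Region reported in the resume log
print(memmap_reserve(0xab9bc000, 0xab9c2fff))   # memmap=0x7000$0xab9bc000
```

Depending on the boot loader, the $ may need escaping in its configuration file, as the \$ used earlier in this thread suggests.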