Bug 77571
| Summary: | unable to handle kernel paging request on resume | | |
|---|---|---|---|
| Product: | Power Management | Reporter: | kb |
| Component: | Hibernation/Suspend | Assignee: | Rafael J. Wysocki (rjw) |
| Status: | CLOSED DOCUMENTED | | |
| Severity: | normal | CC: | aaron.lu, bjoernv, bojan, jlee, rui.zhang |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.14.2 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | error + trace | | |
Happened a couple of times since I reported. Most recently on kernel version 3.14.6-1-ARCH.

Still happening in 3.15.2.

Does adding hibernate=nocompress help?

The bug did not happen for a long time, but I got a crash today with kernel 3.15.5. I will try with that option. I suspect if nothing else it will take longer to suspend :) Do you have any tips on how to increase the likelihood/reproducibility of the bug?

I've been running with nocompress for a while. I see this as a good quick fix/workaround, but are there any plans to fix this issue?

(In reply to kb from comment #5)
> I've been running with nocompress for a while. I see this as a good quick
> fix/workaround, but are there any plans to fix this issue?

Compression/decompression code stores a checksum of the image, so when it decompresses, we can be reasonably certain that what was read in is the same as what was written to the disk.

One other difference is that compression/decompression uses multiple threads. Are you sure that you don't have some sort of device that has a driver problem causing this failure?

(In reply to Bojan Smojver from comment #6)
> Compression/decompression code stores a checksum of the image, so when it
> decompresses, we can be reasonably certain that what was read in is the same
> as what was written to the disk.

Is this bug related to an incorrect checksum?

> One other difference is that compression/decompression uses multiple
> threads. Are you sure that you don't have some sort of device that has a
> driver problem causing this failure?

I don't have any *problems*, but I have some error messages related to
a) gpu (amd r9 290) driver - to be solved in 3.17 hopefully
b) occasional late reset of NIC - but it works fine

(In reply to kb from comment #7)
> Is this bug related to incorrect checksum?

I was just pointing out that the restored image is checked for validity, so the act of compression/decompression should not (in theory) affect what we get back.
> I don't have any *problems*, but I have some error messages related to
> a) gpu (amd r9 290) driver - to be solved in 3.17 hopefully
> b) occasional late reset of NIC - but it works fine

Very hard to say, but it is possible that different timings of the multithreaded compression/decompression code are interacting differently with one of the drivers and causing the problem intermittently.

(In reply to Bojan Smojver from comment #8)
> Very hard to say, but it is possible that different timings of multithreaded
> compression/decompression code are interacting differently with one of the
> drivers and causing the problem intermittently.

I've got this HW only since January; this bug was happening to me on the old HW also. It was a completely different platform (AMD vs Intel) and it did not have the 2 driver issues I outlined.

Is there something I can do to help us get to the bottom of this?

(In reply to kb from comment #9)
> Is there something I can do to help us get to the bottom of this?

Good question...

When this happens, the decompression already finished? Or didn't start yet?

This is an x86_64 kernel, correct?

(In reply to Bojan Smojver from comment #10)
> When this happens, the decompression already finished? Or didn't start yet?

It does not crash immediately. So judging by the timing, it's not before. Either during or after.

> This is an x86_64 kernel, correct?

Yes - stock arch linux

(In reply to kb from comment #11)
> It does not crash immediately. So judging by the timing, it's not before.
> Either during or after.

So, the decompression has finished, the thawed kernel was restored and switched to and is up and running, then this happens. Correct?

Or something else?

(In reply to Bojan Smojver from comment #12)
> So, the decompression has finished, the thawed kernel was restored and
> switched to and is up and running, then this happens. Correct?
>
> Or something else?

Wouldn't dare to guess.
In the "screen shot" (image attached to this bug report) it takes 1.5s from the last log line to the crash. That might be too fast for a complete decompress (that machine had 8GB of RAM)?

Can you see something from the stack trace? I can enable compression and do some more tests if needed.

(In reply to kb from comment #13)
> That might be too fast for complete decompress (that machine had 8GB of ram)?

Yeah, probably.

> Can you see something from the stack trace? I can enable compression and do
> some more tests if needed.

The only thing I can think of is that the load_image_lzo() function is trying to access something at the wrong address. Can you load your kernel image and debugging symbols into gdb and find out what load_image_lzo+0x8b4 is?

OK, but it'll take some time since I'll need to recompile the kernel with debug and wait for the issue to happen.

Doesn't your distro provide debug packages? I thought this was a distro kernel we were talking about...

(In reply to Bojan Smojver from comment #16)
> Doesn't your distro provide debug packages? I thought this was a distro
> kernel we were talking about...

Sadly no - afaik arch doesn't have that despite many discussions.

Any update kb?

Hi, I'm still trying to get the exception. I've been running with the debug kernel for 1+ month, but it has not happened so far.

I have a little trouble resuming with my debug kernel. Any ideas on the below?

[ 14.720661] random: nonblocking pool is initialized
[ 23.970455] PM: Using 3 thread(s) for decompression.
PM: Loading and decompressing image data (2825755 pages)...
[ 24.189372] PM: Image loading progress: 0%
[ 24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]
[ 24.425282] PM: Read 11303020 kbytes in 0.45 seconds (25117.82 MB/s)
[ 24.426079] PM: Error -14 resuming
[ 24.426092] PM: Failed to load hibernation image, recovering.
[ 24.501604] PM: Basic memory bitmaps freed
[ 24.501605] Restarting tasks ... done.
[ 24.503087] PM: Hibernation image not present or could not be loaded.
I see this patch in recent kernels
http://lkml.org/lkml/2014/8/4/375

Which explains both the original oops and the new error -14.
It says something about the BIOS being at fault, but frankly I've got no clue what to do to fix it. Any ideas?

(In reply to kb from comment #21)
> I see this patch in recent kernels
> http://lkml.org/lkml/2014/8/4/375
>
> Which explains both the original oops and the new error -14
> It says something about BIOS being at fault, but frankly I've got no clue
> what to do to fix it. Any ideas?

Please take this with a massive amount of salt, because I really have absolutely no idea what I am talking about here, but does your computer support EFI boot?

PS. I never tried EFI, don't have any machines that support it, have no idea how it works, whether Linux hibernation will work with it etc. Just a stab in the dark...

Yes, it supports EFI boot but I never got it working despite several attempts, so I'm booting in BIOS mode.

(In reply to kb from comment #21)
> I see this patch in recent kernels
> http://lkml.org/lkml/2014/8/4/375

This patch adds a clearer message to indicate that a page in the hibernate image lies in the CURRENT e820 region. Without this patch, you only see the kernel oops shown in the picture attached to the bug description.

The kernel oops in your attached picture is caused by the hibernate code trying to write image content to a page that was not allocated, because this page is covered by the current e820 region. It happens randomly because the BIOS changed the e820 table.

> Which explains both the original oops and the new error -14
> It says something about BIOS being at fault, but frankly I've got no clue
> what to do to fix it. Any ideas?

[ 24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]

The above message indicates that address 0xab9bc000 is in the e820 region when resuming. That means the BIOS did NOT keep the e820 table constant between hibernating and restoring.

Please check whether your machine is running the [platform] mode of hibernation:
# cat /sys/power/disk
[platform] shutdown reboot suspend

If the machine is using [shutdown] mode and [platform] mode is not offered, then this machine doesn't support hibernation well, because the BIOS doesn't provide _S4_. dmesg may also contain a message similar to the following:

[ 0.571263] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S4_] (20130...

If your machine supports _S4_, the kernel uses [platform] mode. If the BIOS then fails to keep the e820 table unchanged after _S4_ is evaluated by the OS kernel, it's a BIOS bug.

One way that MAY relieve the symptoms is to use the memmap kernel parameter to always mark the address as non-usable in the kernel, like the e820 table does:

memmap=0xab9bc000\$0xab9c2fff <=== you can change the area to minimize by yourself

But because the e820 table is handled by the BIOS, you need to add an address to the memmap parameter every time you hit the same problem. I mean that the BIOS may change more than one address in the e820 table.

Thanks for the comment, this really helps. I guess by e820 you mean this http://en.wikipedia.org/wiki/E820

I indeed use the shutdown method to power off my system (I do it manually in my hibernate script). I had no idea that there would be any problem with this. I'll do some more experiments and report back.

Actually 2 more things: my dmesg reports this
[ 0.311556] ACPI: (supports S0 S3 S4 S5)
And as I remember the problem also happens when I use the reboot method (I will double check this).

(In reply to kb from comment #26)
> Actually 2 more things: my dmesg reports this
> [ 0.311556] ACPI: (supports S0 S3 S4 S5)
> And as I remember the problem also happens when I use the reboot method (I
> will double check this).

It's better to use [platform] mode when your machine supports S4.

Unfortunately, with the latest 3.17.2 kernel my system reboots right after the 'suspending consoles' message when resuming from hibernate. Any idea how to investigate?
(In reply to kb from comment #28)
> Unfortunately, with the latest 3.17.2 kernel my system reboots right after
> the 'suspending consoles' message when resuming from hibernate. Any idea how
> to investigate?

Sorry, I have no idea about this situation. It's possible that a triple fault happened during hibernate resume, which is hard to debug. Is it 100% reproducible?

You can try the debugging approach in Documentation/power/basic-pm-debugging.txt. To capture a debug log, please remember to remove the distro's splash, and you can try using no_console_suspend to grab more messages:

console=tty0 earlyprintk=tty0 debug no_console_suspend=1 loglevel=9 nomodeset

The best option is to enable a serial console if the machine supports it.

And if this issue was not seen before 3.17 and is very easy to reproduce, then maybe git bisect can find out which kernel patch is related to the problem.

ping...

I'm still here. The reboot issue was caused by a bad initrd. I've gone through some resume cycles and haven't had any problems in some time. I'm keeping my hopes up.

I think this case can be closed. The original problem (kernel crash) was caught independently by Lee's patch. The cause is likely the e820 thing.

(In reply to Lee, Chun-Yi from comment #24)
> (In reply to kb from comment #21)
...
> A way MAY relieving symptoms is using memmap kernel parameter to always mark
> the address as non-usable like e820 table in kernel:
>
> memmap=0xab9bc000\$0xab9c2fff <=== you can change the area to
> minimize by yourself
>

I found that this memmap example should be changed to:

memmap=0x00000001\$0xab9bc000

Based on Documentation/kernel-parameters.txt:

memmap=nn[KMG]$ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
Region of memory to be reserved is from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
or
memmap=0x10000$0x18690000
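For anyone who wants to reserve the whole range reported in the "e820 nosave region" message rather than only its start address, the nn (size) and ss (start) values of the documented memmap=nn$ss form can be worked out from the log line. The following is only a rough sketch: the helper name memmap_param is made up for illustration, and it assumes the log prints an inclusive [mem start-end] range as in the message quoted in this report.

```python
import re

# Matches the inclusive range printed in messages such as
# "PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]".
# Assumption: the range is always printed as [mem 0xSTART-0xEND].
MEM_RANGE_RE = re.compile(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]")


def memmap_param(log_line):
    """Build a memmap=nn$ss string (hypothetical helper) from a log line."""
    match = MEM_RANGE_RE.search(log_line)
    if match is None:
        raise ValueError("no [mem start-end] range found in line")
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    if end < start:
        raise ValueError("range end is below range start")
    # The printed end address is inclusive, so the size is end - start + 1.
    size = end - start + 1
    return f"memmap={size:#x}${start:#x}"


line = "[ 24.349777] PM: 0xab9bc000 in e820 nosave region: [mem 0xab9bc000-0xab9c2fff]"
print(memmap_param(line))  # prints memmap=0x7000$0xab9bc000
```

The same arithmetic reproduces the example from kernel-parameters.txt: the range 0x18690000-0x1869ffff gives memmap=0x10000$0x18690000. The examples in this report write the separator as \$, so the $ may need escaping depending on where the parameter is configured.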
Created attachment 138781: error + trace

Every once in a while (not always, about once a month) my computer fails to resume from hibernation. This has been going on across several kernel versions. I've captured the error in a photo; see attached.