I have an Acer V5-573P-6896 Laptop (Intel i5-4200U) running Debian Jessie which has had a persistent issue resuming from hibernation. I've had this problem at since I purchased the laptop back with Wheezy and about 3.18. I've been trying to debug it on and off for some time. Most times the laptop will simply hang during resume but often times it will just work. It's fairly unpredictable when it will work or when it won't. During testing and debugging I've had long streaks where it works just fine but with everyday use it rarely works more than once in a row. I can usually reliably reproduce it by doing a fresh reboot to my desktop, starting Chrome and Ice Dove and within one or two hibernation cycles it is sure to freeze. When the system freezes, the fan speeds up likely indicating the CPU is busy. I can toggle the keyboard LED backlight as well as the LCD backlight using the Fn keys. (I assume this means the firmware is still operational). There is no capslock LED or similar. The magic sysrq key also does not work. I've done testing with pm_test and found that every mode except 'none' works every single time. It also makes no difference if I use pm-hibernate or echo disk to /sys/power/state. Setting pm_async to zero also does not help. I've found that (with some persistence) I can reproduce the problem if I boot into a minimal environment (init=/bin/bash), with minimal modules loaded. To do this I run a number of instances of a very simple test program that just allocates and writes to a large buffer of memory, then repeatedly hibernate. Occassionally this will hit the bug. Thus, I think we can eliminate most driver issues as few (if any) drivers are loaded before hibernation when the problem is triggered. I've tried two different swap devices: an mSATA SSD and a SATA HDD. Both exhibit the problem and have drastically different image loading speeds so it doesn't seem like a timing issue. I also tried an extra large partition with 20G available and it didn't make a difference. I played around with /sys/power/image_size and didn't see any concrete improvement (though very vaguely it seemed like a lower number froze slightly less often). I did some testing to see if a specific image size fails but could find no clear correlation. I've also done some debugging with pm_trace. I instrumented the kernel with a series of extra TRACE_RESUME and printk calls. I used this to verify that the kernel gets as far as the resume_image call in swsusp_arch_resume. I also added a few trace calls and printks to where I believe the image kernel is supposed to resume (in create_image where a comment says control flow magically appears). I don't know if TRACE_RESUME is supposed to work at this point, but I have no other way of knowing what's going on after resume_image if it doesn't work. The laptop does not have a serial port so I can't do any debugging that involves one. At this point I feel like I'm out of things to try. Any help would be appreciated. Thanks, Logan
Created attachment 213721 [details] dmesg output Attached a dmesg output just after a failed resume with pm_trace and instrumented kernel.
I found this bug today, looks very similar but no new information: https://bugzilla.kernel.org/show_bug.cgi?id=112761
(In reply to Logan Gunthorpe from comment #2) > I found this bug today, looks very similar but no new information: > > https://bugzilla.kernel.org/show_bug.cgi?id=112761 I'm not sure if this is the same problem, because 3.18 kernel work well for bug #112761. So please double check if 3.18 is broken.
I just re-tested v3.18.0 and have been able to reproduce the issue.
I saw some warning for resume of pm_trace [ 1.355199] calling late_resume_init+0x0/0x1a0 @ 1 [ 1.356024] Magic number: 0:212:178 [ 1.356841] hash matches arch/x86/power/hibernate_64.c:101 [ 1.357724] acpi device:0e: hash matches [ 1.358559] platform: hash matches would you please provide grep . /sys/bus/acpi/devices/device\:0e/*
could you provide your arch/x86/power/hibernate_64.c
(In reply to Chen Yu from comment #6) > could you provide your arch/x86/power/hibernate_64.c I mean, the version for #Comment 1
Created attachment 214391 [details] acpi grep As requested, the output of the grep command.
Created attachment 214401 [details] Hacked up hibernate_64.c
Created attachment 214411 [details] Patch of my current complete debug hacks to the kernel So you can see other potential places for the pm_trace to have occurred. I've attached my complete diff from 4.5.1
I'll also note that 3.18.0 seems to be more reliable. It's harder to reproduce the problem with that kernel version. But I did see it a couple times, almost in a row.
Ok, so I've found that a 3.18.0 kernel with a localmodconfig largely hibernates reliably. 3.19.0 with the same config largely does not. I managed to bisect this to commit f5b2831d654167d7. (I'll attach a bisect log). Building a v3.19 kernel with the patch reverted also produced a kernel that largely works. I noticed a subtle logic error in the offending change such that the attached patch also fixes the problem. However, the _endlessly_ frustrating thing is it looks like a similar change has already been made by 4.5 and thus I can't apply a similar fix to a modern kernel. I'll keep digging as I have time.
Created attachment 215361 [details] Bisect log between v3.18 and v3.19
Created attachment 215371 [details] Patch to fix the issue in v3.19
Looks like the patch I attached is very similar to commit 55696b1f66
Way too many kernel builds later: 3.19 through 4.2 with the attached patch work. 4.3 with the patch did not work. I bisected again (I'll attach a second log) to find the bug in commit: [ab76f7b4ab2397ffdd2f1eb07c55697d19991d10] x86/mm: Set NX on gap between __ex_table and rodata Reverting that commit in 4.3 (with the attached patch) and 4.5 result in working kernels. I don't understand why that commit causes my issue. Based on my understanding, it looks like it should be pretty benign. So, in summary, looks like hibernation on my hardware has been broken by two issues: 1. Between 3.19 and 4.3: f5b2831d6 breaks hibernation and is fixed by 55696b1f66 in 4.4. 2. From 4.3 on: ab76f7b4a breaks hibernation I don't know what a reasonable fix would be for the second issue besides reverting the commit.
Created attachment 215471 [details] Bisect log between v4.1 and v4.3 (with patch)
(In reply to Logan Gunthorpe from comment #16) > Way too many kernel builds later: > > 3.19 through 4.2 with the attached patch work. 4.3 with the patch did not > work. I bisected again (I'll attach a second log) to find the bug in commit: > > [ab76f7b4ab2397ffdd2f1eb07c55697d19991d10] x86/mm: Set NX on gap between > __ex_table and rodata This is a good point. > > Reverting that commit in 4.3 (with the attached patch) and 4.5 result in > working kernels. > 2. From 4.3 on: ab76f7b4a breaks hibernation > > I don't know what a reasonable fix would be for the second issue besides > reverting the commit. Me neither, I was thinking if hibernation resume process is using the function between end of __ex_table and the start of rodata, since this commit has disabled its 'x' attribute.. if there any suspect func in 'sudo cat /proc/kallsyms' , from my laptop, there is one: ffffffff81829590 R __start___ex_table ffffffff8182b658 R __stop___ex_table ffffffff81a00000 r __func__.50233 //humm? ffffffff81a00000 R __start_rodata
(In reply to Logan Gunthorpe from comment #17) > Created attachment 215471 [details] > Bisect log between v4.1 and v4.3 (with patch) Could you also attach your kernel config?
Created attachment 215541 [details] Kernel config for 4.5.2 As requested, the kernel config I used. For all other versions it's the same but with a 'make olddeefconfig'.
Yeah, the function you quoted from kallsyms is actually in the rodata section: notice that it has the same address as __start_rodata. In theory, I don't think there can be anything between the two unless somewhere the hibernation code is occasionally specifically using it. But I can't find anything to suggest it does.
*** Bug 112761 has been marked as a duplicate of this bug. ***
What does cat /sys/kernel/debug/kernel_page_tables show?
Created attachment 216641 [details] Kernel Page Tables 4.6 - bad Attached: Kernel page tables on a stock 4.6 kernel without patch reverted. (So hibernation does not work)
Created attachment 216651 [details] Kernel Page Tables 4.6 - good Attached: Kernel page tables on a 4.6 kernel with patch reverted. (So hibernation does work)
Note: I've reproduced the exact same thing on 4.6 kernel. On a stock kernel, hibernation does not work. With the same commit reverted it does work.
Can you report this via email to x86 at kernel.org, linux-kernel at vger.kernel.org, keescook at chromium.org, and me? Include the kernel config used to test 4.6 as well. See REPORTING-BUGS in the kernel tree. I tried adding Ingo and Kees directly to this bug but bugzilla.kernel.org doesn't seem to recognize their addresses, so maybe they don't use bugzilla.
Also, along with kernel config, can you attach your /proc/kallsyms for good and bad?
Should probably also cc the hibernation maintainers, linux-pm at vger.kernel.org
Created attachment 216831 [details] kallsyms for a working 4.6 kernel
Created attachment 216841 [details] kallsyms for a stock 4.6 kernel
Created attachment 216851 [details] Kernel config for 4.6
No problem. I'll send an email shortly. Thanks for the list of who to copy.
Enable CONFIG_KALLSYMS_ALL=y so we can see all the symbols in kallsyms. Or you could use objdump -x to dump the symbols from your vmlinux file, either way.
Created attachment 216891 [details] kallsyms for a working 4.6 kernel With CONFIG_KALLSYMS_ALL=y
Created attachment 216901 [details] kallsyms for a stock 4.6 kernel With CONFIG_KALLSYMS_ALL=y
Created attachment 216911 [details] Kernel Page Tables 4.6 - good With CONFIG_KALLSYMS_ALL=y
Created attachment 216921 [details] Kernel Page Tables 4.6 - bad CONFIG_KALLSYMS_ALL=y
Ok, I was wondering why my kallsyms had some missing symbols. Anyway, I've recompiled both kernels and attached new files.
Fixed at: https://patchwork.kernel.org/patch/9172981/
commit 65c0554b73c920023cc8998802e508b798113b46 Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Date: Thu Jun 30 18:11:41 2016 +0200 x86/power/64: Fix kernel text mapping corruption during image restoration