Bug 116941 - Commit ab76f7b4ab2397f x86/mm: Set NX on gap between __ex_table and rodata break resume from hibernation - Acer V5-573P-6896 Laptop (Intel i5-4200U)
Summary: Commit ab76f7b4ab2397f x86/mm: Set NX on gap between __ex_table and rodata b...
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Stephen Smalley
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-22 19:04 UTC by Logan Gunthorpe
Modified: 2016-07-13 06:06 UTC (History)
4 users (show)

See Also:
Kernel Version: 4.6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output (116.23 KB, text/plain)
2016-04-22 19:05 UTC, Logan Gunthorpe
Details
acpi grep (150 bytes, text/plain)
2016-04-26 15:56 UTC, Logan Gunthorpe
Details
Hacked up hibernate_64.c (3.74 KB, text/x-csrc)
2016-04-26 15:58 UTC, Logan Gunthorpe
Details
Patch of my current complete debug hacks to the kernel (10.50 KB, patch)
2016-04-26 16:00 UTC, Logan Gunthorpe
Details | Diff
Bisect log between v3.18 and v3.19 (2.74 KB, text/x-log)
2016-05-05 22:15 UTC, Logan Gunthorpe
Details
Patch to fix the issue in v3.19 (1.07 KB, patch)
2016-05-05 22:15 UTC, Logan Gunthorpe
Details | Diff
Bisect log between v4.1 and v4.3 (with patch) (3.10 KB, text/x-log)
2016-05-07 17:38 UTC, Logan Gunthorpe
Details
Kernel config for 4.5.2 (124.47 KB, application/octet-stream)
2016-05-08 03:00 UTC, Logan Gunthorpe
Details
Kernel Page Tables 4.6 - bad (174.58 KB, text/plain)
2016-05-19 00:53 UTC, Logan Gunthorpe
Details
Kernel Page Tables 4.6 - good (193.17 KB, text/plain)
2016-05-19 00:54 UTC, Logan Gunthorpe
Details
kallsyms for a working 4.6 kernel (2.02 MB, text/plain)
2016-05-19 16:48 UTC, Logan Gunthorpe
Details
kallsyms for a stock 4.6 kernel (2.02 MB, text/plain)
2016-05-19 16:49 UTC, Logan Gunthorpe
Details
Kernel config for 4.6 (126.12 KB, application/octet-stream)
2016-05-19 16:49 UTC, Logan Gunthorpe
Details
kallsyms for a working 4.6 kernel (975.99 KB, application/gzip)
2016-05-19 22:21 UTC, Logan Gunthorpe
Details
kallsyms for a stock 4.6 kernel (976.26 KB, application/gzip)
2016-05-19 22:22 UTC, Logan Gunthorpe
Details
Kernel Page Tables 4.6 - good (177.82 KB, text/plain)
2016-05-19 22:23 UTC, Logan Gunthorpe
Details
Kernel Page Tables 4.6 - bad (163.96 KB, text/plain)
2016-05-19 22:24 UTC, Logan Gunthorpe
Details

Description Logan Gunthorpe 2016-04-22 19:04:04 UTC
I have an Acer V5-573P-6896 Laptop (Intel i5-4200U) running Debian Jessie which has had a persistent issue resuming from hibernation. I've had this problem at since I purchased the laptop back with Wheezy and about 3.18. I've been trying to debug it on and off for some time.

Most times the laptop will simply hang during resume but often times it will just work. It's fairly unpredictable when it will work or when it won't. During testing and debugging I've had long streaks where it works just fine but with everyday use it rarely works more than once in a row. I can usually reliably reproduce it by doing a fresh reboot to my desktop, starting Chrome and Ice Dove and within one or two hibernation cycles it is sure to freeze.

When the system freezes, the fan speeds up likely indicating the CPU is busy. I can toggle the keyboard LED backlight as well as the LCD backlight using the Fn keys. (I assume this means the firmware is still operational). There is no capslock LED or similar. The magic sysrq key also does not work.

I've done testing with pm_test and found that every mode except 'none' works every single time. It also makes no difference if I use pm-hibernate or echo disk to /sys/power/state. Setting pm_async to zero also does not help.

I've found that (with some persistence) I can reproduce the problem if I boot into a minimal environment (init=/bin/bash), with minimal modules loaded. To do this I run a number of instances of a very simple test program that just allocates and writes to a large buffer of memory, then repeatedly hibernate. Occassionally this will hit the bug. Thus, I think we can eliminate most driver issues as few (if any) drivers are loaded before hibernation when the problem is triggered.

I've tried two different swap devices: an mSATA SSD and a SATA HDD. Both exhibit the problem and have drastically different image loading speeds so it doesn't seem like a timing issue. I also tried an extra large partition with 20G available and it didn't make a difference. I played around with /sys/power/image_size and didn't see any concrete improvement (though very vaguely it seemed like a lower number froze slightly less often). I did some testing to see if a specific image size fails but could find no clear correlation.

I've also done some debugging with pm_trace. I instrumented the kernel with a series of extra TRACE_RESUME and printk calls. I used this to verify that the kernel gets as far as the resume_image call in swsusp_arch_resume. I also added a few trace calls and printks to where I believe the image kernel is supposed to resume (in create_image where a comment says control flow magically appears). I don't know if TRACE_RESUME is supposed to work at this point, but I have no other way of knowing what's going on after resume_image if it doesn't work. The laptop does not have a serial port so I can't do any debugging that involves one.

At this point I feel like I'm out of things to try. Any help would be appreciated.

Thanks,

Logan
Comment 1 Logan Gunthorpe 2016-04-22 19:05:33 UTC
Created attachment 213721 [details]
dmesg output

Attached a dmesg output just after a failed resume with pm_trace and instrumented kernel.
Comment 2 Logan Gunthorpe 2016-04-25 17:28:25 UTC
I found this bug today, looks very similar but no new information:

https://bugzilla.kernel.org/show_bug.cgi?id=112761
Comment 3 Zhang Rui 2016-04-26 03:25:31 UTC
(In reply to Logan Gunthorpe from comment #2)
> I found this bug today, looks very similar but no new information:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=112761

I'm not sure if this is the same problem, because 3.18 kernel work well for bug #112761. So please double check if 3.18 is broken.
Comment 4 Logan Gunthorpe 2016-04-26 05:54:29 UTC
I just re-tested v3.18.0 and have been able to reproduce the issue.
Comment 5 Chen Yu 2016-04-26 07:45:37 UTC
I saw some warning for resume of pm_trace
[    1.355199] calling  late_resume_init+0x0/0x1a0 @ 1
[    1.356024]   Magic number: 0:212:178
[    1.356841]   hash matches arch/x86/power/hibernate_64.c:101
[    1.357724] acpi device:0e: hash matches
[    1.358559]  platform: hash matches

would you please provide
grep . /sys/bus/acpi/devices/device\:0e/*
Comment 6 Chen Yu 2016-04-26 07:48:06 UTC
could you provide your arch/x86/power/hibernate_64.c
Comment 7 Chen Yu 2016-04-26 07:49:34 UTC
(In reply to Chen Yu from comment #6)
> could you provide your arch/x86/power/hibernate_64.c

I mean, the version for #Comment 1
Comment 8 Logan Gunthorpe 2016-04-26 15:56:37 UTC
Created attachment 214391 [details]
acpi grep

As requested, the output of the grep command.
Comment 9 Logan Gunthorpe 2016-04-26 15:58:34 UTC
Created attachment 214401 [details]
Hacked up hibernate_64.c
Comment 10 Logan Gunthorpe 2016-04-26 16:00:51 UTC
Created attachment 214411 [details]
Patch of my current complete debug hacks to the kernel

So you can see other potential places for the pm_trace to have occurred. I've attached my complete diff from 4.5.1
Comment 11 Logan Gunthorpe 2016-04-26 17:21:23 UTC
I'll also note that 3.18.0 seems to be more reliable. It's harder to reproduce the problem with that kernel version. But I did see it a couple times, almost in a row.
Comment 12 Logan Gunthorpe 2016-05-05 22:14:36 UTC
Ok, so I've found that a 3.18.0 kernel with a localmodconfig largely hibernates reliably. 3.19.0 with the same config largely does not.

I managed to bisect this to commit f5b2831d654167d7. (I'll attach a bisect log).

Building a v3.19 kernel with the patch reverted also produced a kernel that largely works.

I noticed a subtle logic error in the offending change such that the attached patch also fixes the problem.

However, the _endlessly_ frustrating thing is it looks like a similar change has already been made by 4.5 and thus I can't apply a similar fix to a modern kernel.

I'll keep digging as I have time.
Comment 13 Logan Gunthorpe 2016-05-05 22:15:12 UTC
Created attachment 215361 [details]
Bisect log between v3.18 and v3.19
Comment 14 Logan Gunthorpe 2016-05-05 22:15:52 UTC
Created attachment 215371 [details]
Patch to fix the issue in v3.19
Comment 15 Logan Gunthorpe 2016-05-05 22:33:21 UTC
Looks like the patch I attached is very similar to commit 55696b1f66
Comment 16 Logan Gunthorpe 2016-05-07 17:35:43 UTC
Way too many kernel builds later:

3.19 through 4.2 with the attached patch work. 4.3 with the patch did not work. I bisected again (I'll attach a second log) to find the bug in commit:

[ab76f7b4ab2397ffdd2f1eb07c55697d19991d10] x86/mm: Set NX on gap between __ex_table and rodata

Reverting that commit in 4.3 (with the attached patch) and 4.5 result in working kernels. 

I don't understand why that commit causes my issue. Based on my understanding, it looks like it should be pretty benign.

So, in summary, looks like hibernation on my hardware has been broken by two issues:

1. Between 3.19 and 4.3: f5b2831d6 breaks hibernation and is fixed by 55696b1f66 in 4.4.

2. From 4.3 on: ab76f7b4a breaks hibernation

I don't know what a reasonable fix would be for the second issue besides reverting the commit.
Comment 17 Logan Gunthorpe 2016-05-07 17:38:00 UTC
Created attachment 215471 [details]
Bisect log between v4.1 and v4.3 (with patch)
Comment 18 Chen Yu 2016-05-08 01:19:11 UTC
(In reply to Logan Gunthorpe from comment #16)
> Way too many kernel builds later:
> 
> 3.19 through 4.2 with the attached patch work. 4.3 with the patch did not
> work. I bisected again (I'll attach a second log) to find the bug in commit:
> 
> [ab76f7b4ab2397ffdd2f1eb07c55697d19991d10] x86/mm: Set NX on gap between
> __ex_table and rodata
This is a good point.
> 
> Reverting that commit in 4.3 (with the attached patch) and 4.5 result in
> working kernels. 

> 2. From 4.3 on: ab76f7b4a breaks hibernation
> 
> I don't know what a reasonable fix would be for the second issue besides
> reverting the commit.

Me neither, I was thinking if hibernation resume process is using the function
between end of __ex_table and the start of rodata, since this commit has disabled its 'x' attribute..
if there any suspect func in 'sudo cat /proc/kallsyms' , from my laptop, there is one:
ffffffff81829590 R __start___ex_table
ffffffff8182b658 R __stop___ex_table
ffffffff81a00000 r __func__.50233  //humm?
ffffffff81a00000 R __start_rodata
Comment 19 Chen Yu 2016-05-08 01:20:34 UTC
(In reply to Logan Gunthorpe from comment #17)
> Created attachment 215471 [details]
> Bisect log between v4.1 and v4.3 (with patch)

Could you also attach your kernel config?
Comment 20 Logan Gunthorpe 2016-05-08 03:00:14 UTC
Created attachment 215541 [details]
Kernel config for 4.5.2

As requested, the kernel config I used. For all other versions it's the same but with a 'make olddeefconfig'.
Comment 21 Logan Gunthorpe 2016-05-08 03:03:38 UTC
Yeah, the function you quoted from kallsyms is actually in the rodata section: notice that it has the same address as __start_rodata.

In theory, I don't think there can be anything between the two unless somewhere the hibernation code is occasionally specifically using it. But I can't find anything to suggest it does.
Comment 22 Zhang Rui 2016-05-10 07:09:28 UTC
*** Bug 112761 has been marked as a duplicate of this bug. ***
Comment 23 Stephen Smalley 2016-05-18 12:49:37 UTC
What does cat /sys/kernel/debug/kernel_page_tables show?
Comment 24 Logan Gunthorpe 2016-05-19 00:53:08 UTC
Created attachment 216641 [details]
Kernel Page Tables 4.6 - bad

Attached: Kernel page tables on a stock 4.6 kernel without patch reverted. (So hibernation does not work)
Comment 25 Logan Gunthorpe 2016-05-19 00:54:21 UTC
Created attachment 216651 [details]
Kernel Page Tables 4.6 - good

Attached: Kernel page tables on a 4.6 kernel with patch reverted. (So hibernation does work)
Comment 26 Logan Gunthorpe 2016-05-19 00:55:14 UTC
Note: I've reproduced the exact same thing on 4.6 kernel. On a stock kernel, hibernation does not work. With the same commit reverted it does work.
Comment 27 Stephen Smalley 2016-05-19 13:14:51 UTC
Can you report this via email to x86 at kernel.org, linux-kernel at vger.kernel.org, keescook at chromium.org, and me?
Include the kernel config used to test 4.6 as well.
See REPORTING-BUGS in the kernel tree.
I tried adding Ingo and Kees directly to this bug but bugzilla.kernel.org doesn't seem to recognize their addresses, so maybe they don't use bugzilla.
Comment 28 Stephen Smalley 2016-05-19 13:41:50 UTC
Also, along with kernel config, can you attach your /proc/kallsyms for good and bad?
Comment 29 Stephen Smalley 2016-05-19 13:57:34 UTC
Should probably also cc the hibernation maintainers, linux-pm at vger.kernel.org
Comment 30 Logan Gunthorpe 2016-05-19 16:48:09 UTC
Created attachment 216831 [details]
kallsyms for a working 4.6 kernel
Comment 31 Logan Gunthorpe 2016-05-19 16:49:04 UTC
Created attachment 216841 [details]
kallsyms for a stock 4.6 kernel
Comment 32 Logan Gunthorpe 2016-05-19 16:49:53 UTC
Created attachment 216851 [details]
Kernel config for 4.6
Comment 33 Logan Gunthorpe 2016-05-19 16:50:31 UTC
No problem. I'll send an email shortly. Thanks for the list of who to copy.
Comment 34 Stephen Smalley 2016-05-19 20:36:37 UTC
Enable CONFIG_KALLSYMS_ALL=y so we can see all the symbols in kallsyms.
Or you could use objdump -x to dump the symbols from your vmlinux file, either way.
Comment 35 Logan Gunthorpe 2016-05-19 22:21:21 UTC
Created attachment 216891 [details]
kallsyms for a working 4.6 kernel

With CONFIG_KALLSYMS_ALL=y
Comment 36 Logan Gunthorpe 2016-05-19 22:22:18 UTC
Created attachment 216901 [details]
kallsyms for a stock 4.6 kernel

With CONFIG_KALLSYMS_ALL=y
Comment 37 Logan Gunthorpe 2016-05-19 22:23:04 UTC
Created attachment 216911 [details]
Kernel Page Tables 4.6 - good

With CONFIG_KALLSYMS_ALL=y
Comment 38 Logan Gunthorpe 2016-05-19 22:24:55 UTC
Created attachment 216921 [details]
Kernel Page Tables 4.6 - bad

CONFIG_KALLSYMS_ALL=y
Comment 39 Logan Gunthorpe 2016-05-19 22:30:34 UTC
Ok, I was wondering why my kallsyms had some missing symbols.

Anyway, I've recompiled both kernels and attached new files.
Comment 40 Chen Yu 2016-06-15 06:35:16 UTC
Fixed at:
https://patchwork.kernel.org/patch/9172981/
Comment 41 Chen Yu 2016-07-13 06:06:51 UTC
commit 65c0554b73c920023cc8998802e508b798113b46
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Jun 30 18:11:41 2016 +0200

    x86/power/64: Fix kernel text mapping corruption during image restoration

Note You need to log in before you can comment on or make changes to this bug.