Bug 116791 - System hangs after multiple suspend/resume cycles with CPU soft lockup - Clevo W840SU Core i7-4500U Haswell
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: Intel Linux
Importance: P1 normal
Assignee: Chen Yu
Reported: 2016-04-20 19:53 UTC by Robert
Modified: 2016-07-18 02:21 UTC
Kernel Version: 4.5.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg with 1 working resume and 1 failed (93.21 KB, application/octet-stream)
2016-04-20 19:53 UTC, Robert
dmidecode (11.43 KB, application/octet-stream)
2016-04-20 19:53 UTC, Robert
iomem (3.12 KB, application/octet-stream)
2016-04-20 19:54 UTC, Robert
ioports (1.36 KB, application/octet-stream)
2016-04-20 19:55 UTC, Robert
lspci -vvv (26.33 KB, application/octet-stream)
2016-04-20 19:55 UTC, Robert
soft lockup screenshot kernel 4.6.0-rc7 (2.90 MB, image/jpeg)
2016-05-10 20:38 UTC, Robert
log after failed resume kernel 4.6.0-rc7 (21.35 KB, application/octet-stream)
2016-05-10 20:39 UTC, Robert
successful resume kernel 4.6.0-rc7 (67.08 KB, application/octet-stream)
2016-05-10 20:39 UTC, Robert
System hangs on resume without oops. (2.16 MB, image/jpeg)
2016-06-12 20:49 UTC, Robert
Bug: Scheduling while atomic image 1. (1.82 MB, image/jpeg)
2016-06-12 20:51 UTC, Robert
Bug: Scheduling while atomic image 2. (1.96 MB, image/jpeg)
2016-06-12 20:51 UTC, Robert
Bug: Scheduling while atomic image 3. (2.15 MB, image/jpeg)
2016-06-12 20:52 UTC, Robert

Description Robert 2016-04-20 19:53:13 UTC
Created attachment 213411 [details]
dmesg with 1 working resume and 1 failed

I am running Arch Linux on a Clevo W840SU (Core i7-4500U, Haswell). The current setup is kernel 4.5.1-1-ARCH #1 SMP PREEMPT x86_64 with systemd. The newest Intel microcode update is applied at boot. There is no BIOS update available.

After several suspend/resume cycles the system ends up with a black screen: no ping, no console switching, no Caps Lock response. The number of cycles before the failure varies from about 3 with a fully running system (Xorg, KDE, Firefox) to more than 10 when booted with nomodeset and hibernated from the command line.

The symptoms are basically the same as in Bug #104771, and the problem has persisted since the 4.x kernels.

Different hibernation methods (echo disk > /sys/power/state, systemctl hibernate, uswsusp, TuxOnIce) lead to the same result. I performed the steps from basic-pm-debugging.txt: the platform and shutdown hibernation modes do not change anything, and every pm_test level works 15 times in a row. Running a minimal configuration is not yet possible on this machine because the root partition is encrypted. pm_trace does not point to any module, but shows the following after a failed resume:
[    0.658342]   Magic number: 1:163:177
[    0.659322] acpi device:0d: hash matches
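
For anyone repeating the pm_test runs mentioned above, here is a minimal Python sketch. It assumes the standard /sys/power/pm_test and /sys/power/state interface described in basic-pm-debugging.txt. The function name run_pm_tests and its parameters are made up for illustration, and power_dir exists only so the loop can be tried against a scratch directory instead of real sysfs. It is not the script used in this report.

```python
from pathlib import Path

# Test levels listed in basic-pm-debugging.txt, shallowest first.
PM_TEST_LEVELS = ["freezer", "devices", "platform", "processors", "core"]


def run_pm_tests(power_dir="/sys/power", levels=PM_TEST_LEVELS, cycles=15):
    """Run each pm_test level `cycles` times in hibernation test mode.

    With pm_test set, writing "disk" to state exercises only the selected
    part of the hibernation path and then returns, instead of performing a
    real hibernation. On a real system this has to run as root.
    """
    power = Path(power_dir)
    completed = []
    try:
        for level in levels:
            (power / "pm_test").write_text(level)
            for cycle in range(1, cycles + 1):
                # On real sysfs this write blocks until the test cycle is over.
                (power / "state").write_text("disk")
                completed.append((level, cycle))
    finally:
        # Restore normal behaviour so that a later hibernate is a real one.
        (power / "pm_test").write_text("none")
    return completed
```

The try/finally is a deliberate choice: if a test level fails with an I/O error, pm_test is still reset to none before the exception propagates.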

After some suspend/resume cycles, in a few rare cases, dmesg shows the following on a failed resume (the full messages are in the attachment):

[   64.390036] Restarting tasks ... 
[   64.391004] pci_bus 0000:01: Allocating resources
[   64.391021] pci_bus 0000:02: Allocating resources
[   64.391062] pci_bus 0000:03: Allocating resources
[   64.391080] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[   64.391694] pci_bus 0000:01: Allocating resources
[   64.391710] pci_bus 0000:02: Allocating resources
[   64.391751] pci_bus 0000:03: Allocating resources
[   64.391767] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[   64.392450] done.
[   64.397339] general protection fault: 0000 [#1] PREEMPT SMP 
and a few seconds later:
[   91.278397] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [systemd:1]
[  119.277892] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [systemd:1]
[  151.277315] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [systemd:1]

The system does not recover from this and crashes later. 
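
To compare failed resumes across several saved logs, the watchdog reports can be pulled out mechanically. The sketch below is only an illustration: the regular expression is written against the three "soft lockup" lines quoted above, the name find_soft_lockups is hypothetical, and other kernel versions may format the message differently.

```python
import re

# Matches lines such as:
# [   91.278397] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [systemd:1]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def find_soft_lockups(dmesg_text):
    """Return one dict per soft-lockup report found in the log text."""
    reports = []
    for match in LOCKUP_RE.finditer(dmesg_text):
        # The bracketed field has the form "comm:pid", e.g. "systemd:1".
        task, _, pid = match.group("task").rpartition(":")
        reports.append({
            "time": float(match.group("time")),
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "task": task,
            "pid": int(pid),
        })
    return reports
```

Applied to the excerpt above, this yields three reports, all for CPU 1 and the task systemd with PID 1.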

Does the trace contain enough information to narrow down the problem?

Thanks in advance and sorry in case this is a duplicate.
Comment 1 Robert 2016-04-20 19:53:58 UTC
Created attachment 213421 [details]
dmidecode
Comment 2 Robert 2016-04-20 19:54:37 UTC
Created attachment 213431 [details]
iomem
Comment 3 Robert 2016-04-20 19:55:04 UTC
Created attachment 213441 [details]
ioports
Comment 4 Robert 2016-04-20 19:55:36 UTC
Created attachment 213451 [details]
lspci -vvv
Comment 5 Chen Yu 2016-05-06 07:29:07 UTC
Thanks for your report. Since you are using 4.5.1-1-ARCH, does this problem still exist on pure mainline 4.5.1 and on the latest kernel?

[   64.397339] general protection fault: 0000 [#1] PREEMPT SMP 
[   64.415167] CPU: 0 PID: 3837 Comm: systemd-rfkill Tainted: G     U    IO    4.5.1-1-ARCH #1
[   64.442575] Call Trace:
[   64.444222]  [<ffffffff811f9ed7>] lookup_fast+0x57/0x340
[   64.445864]  [<ffffffff811f8a4b>] ? path_init+0x1fb/0x370
[   64.447483]  [<ffffffff811fb8ae>] path_openat+0x2ce/0x1080
[   64.449069]  [<ffffffff811db0f5>] ? mem_cgroup_end_page_stat+0x25/0x50
[   64.450641]  [<ffffffff811fd9f1>] do_filp_open+0x91/0x100
[   64.452181]  [<ffffffff8120a8d7>] ? __alloc_fd+0xc7/0x190
[   64.453732]  [<ffffffff811ec95f>] do_sys_open+0x13f/0x210
[   64.455287]  [<ffffffff811eca4e>] SyS_open+0x1e/0x20
[   64.456854]  [<ffffffff815ad6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
[   64.458407] Code: 0f 84 9f 00 00 00 4c 89 f0 45 89 f2 49 89 d7 48 c1 e8 20 48 89 75 d0 49 89 fc 49 89 c0 eb 08 48 8b 1b 48 85 db 74 7e 4c 8d 6b f8 <8b> 53 fc 4c 3b 63 10 75 eb 48 83 7b 08 00 74 e4 83 e2 fe 41 f6 
[   64.460207] RIP  [<ffffffff81205eb7>] __d_lookup_rcu+0x77/0x150
[   64.461864]  RSP <ffff8800cec53c88>
[   64.463992] ---[ end trace babf5e3c47f011f1 ]---
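
When several oopses like the one above need to be compared, it can help to reduce each to its faulting function and call chain. The following Python sketch does that for the frame format shown in this trace. The name parse_oops and the returned field names are hypothetical, and the pattern is an assumption based on this 4.5-era output, so newer kernels that print frames differently would need a different expression.

```python
import re

# Matches frames such as "[<ffffffff811f9ed7>] lookup_fast+0x57/0x340",
# optionally with the "? " marker the kernel prints for unreliable entries.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_oops(oops_text):
    """Return (rip, frames): the faulting location and the call trace entries."""
    rip, frames = None, []
    for line in oops_text.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue
        entry = {
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            "reliable": match.group("unreliable") is None,
        }
        if "RIP" in line:
            rip = entry  # the instruction pointer at the time of the fault
        else:
            frames.append(entry)
    return rip, frames
```

For the trace above, this reports __d_lookup_rcu as the faulting function, reached from lookup_fast via path_openat and do_sys_open.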
Comment 6 Robert 2016-05-10 20:37:19 UTC
Hi,

Thanks for the response. I compiled kernel 4.6.0-rc7 (never mind the -ARCH in the screenshots/logs; I took the config from /proc/config.gz), removed the i915 module from the configuration, booted with nomodeset, and got the following two errors. Unfortunately, I could not get the system to boot without systemd.

The image was taken after the first 5 suspend/resume cycles (the system was unresponsive), and the log was captured after a hard reset following 15 suspends (systemd was unresponsive). After the next reboot I suspended and resumed successfully about 50 times in a row, so this problem seems tricky.

Any further hints?
Comment 7 Robert 2016-05-10 20:38:08 UTC
Created attachment 215811 [details]
soft lockup screenshot kernel 4.6.0-rc7
Comment 8 Robert 2016-05-10 20:39:00 UTC
Created attachment 215821 [details]
log after failed resume kernel 4.6.0-rc7
Comment 9 Robert 2016-05-10 20:39:25 UTC
Created attachment 215831 [details]
successful resume kernel 4.6.0-rc7
Comment 10 Chen Yu 2016-05-14 15:24:29 UTC
What happens if the following patch is applied?
https://patchwork.kernel.org/patch/7454481/
Comment 11 Robert 2016-06-12 19:03:02 UTC
Hi, thanks for the link to the patch. I have built the kernel and tried various suspend/resume/hang cycles, but unfortunately I did not return to the console; instead, the system hangs with a black screen. As far as I can see, the patch will only show a problem with the hardware, but I will keep trying.
Comment 12 Robert 2016-06-12 20:49:49 UTC
Created attachment 219611 [details]
System hangs on resume without oops.
Comment 13 Robert 2016-06-12 20:51:05 UTC
Created attachment 219621 [details]
Bug: Scheduling while atomic image 1.
Comment 14 Robert 2016-06-12 20:51:38 UTC
Created attachment 219631 [details]
Bug: Scheduling while atomic image 2.
Comment 15 Robert 2016-06-12 20:52:15 UTC
Created attachment 219641 [details]
Bug: Scheduling while atomic image 3.
Comment 16 Robert 2016-06-12 20:56:53 UTC
Hi again. The reason I did not see the errors was that I had forgotten to remove the i915 module from the kernel. For the new attachments I booted with nomodeset, created a 1 GB file from /dev/urandom in RAM (this makes the bugs appear sooner), and got the errors shown in the attachments several times, although about 95% of the resumes still succeeded. I had not seen "Scheduling while atomic" before; it produces more output than I am able to capture.
I did not see any error related to the e820 map, so I hope the attachments are relevant.
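
The random-data file mentioned in this comment can be recreated in several ways. One possible Python sketch follows. It assumes /dev/shm is a RAM-backed tmpfs mount, which is common on Linux but not stated in this report. The file name resume-stress.bin and the function fill_ram_file are hypothetical. Why the file makes failures more frequent is not established here; a larger amount of incompressible memory to save and restore is only a guess.

```python
import os


def fill_ram_file(path="/dev/shm/resume-stress.bin",
                  size_bytes=1 << 30, chunk_bytes=1 << 20):
    """Write `size_bytes` of random data to `path` in fixed-size chunks.

    The default path assumes /dev/shm is tmpfs, so the data stays in RAM.
    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as handle:
        while written < size_bytes:
            # The final chunk may be shorter than chunk_bytes.
            count = min(chunk_bytes, size_bytes - written)
            handle.write(os.urandom(count))
            written += count
    return written
```

Writing in chunks keeps the script's own memory use small, so the extra RAM consumed is essentially just the file itself. Delete the file after testing to free the memory.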
Comment 17 Chen Yu 2016-06-17 11:17:43 UTC
(In reply to Robert from comment #16)
> Hi again, the reason I did not see the errors was that I forgot to remove
> the i915 module from the kernel. For the new attachments I booted with
> nomodeset, created a 1Gb file from /dev/urandom in RAM (this speeds up the
> bugs)  and got (besides 95% successful resumes) the errors from the
> attachments several times. I did not see the "Scheduling while atomic"
> before and it produces more output, which I am unable to capture. 
> I did not see any error related to the e820 map, so I hope the attachments
> are relevant.

It looks like a USB problem to me. Could you disable USB in the config, recompile the kernel, and use a PS/2 keyboard to confirm whether the problem still exists?
Comment 18 Zhang Rui 2016-06-27 05:50:09 UTC
ping...
Comment 19 Chen Yu 2016-07-08 02:50:01 UTC
Hi Robert,
Recently we have solved a couple of hibernation problems during resume that were caused by broken page tables. Would you please try the following two patches on top of the latest kernel to see if there is any difference? (Yes, please also remove i915 from the kernel config; removing USB would be even better.)

https://patchwork.kernel.org/patch/9217459/
https://patchwork.kernel.org/patch/9208541/
The 2nd patch might be in 4.7-rc7.
Comment 20 Robert 2016-07-10 16:29:29 UTC
Hi Chen,
Sorry for the long delay, I was on holiday. Thank you very much for the feedback. I will recompile my kernel with the proposed patches and with i915 and USB disabled during the next few days.
Comment 21 Robert 2016-07-12 22:10:43 UTC
Hi,
I am now running kernel 4.7-rc7 with the first proposed patch (the second is already included) and after many suspend/resume cycles I was not able to reproduce any of the above errors, even with i915 enabled.

Thank you all for your patience and help, I think this report may be closed.
