Bug 37132

Summary: NMI on hibernate - ASUS F6A
Product: ACPI
Reporter: Dmitry Nezhevenko (dion)
Component: Power-Sleep-Wake
Assignee: acpi_power-sleep-wake
Status: CLOSED INSUFFICIENT_DATA    
Severity: normal
CC: aaron.lu, alan, florian, lenb, maciej.rutecki, rjw, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.39
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments:
- dmesg after successful suspend/resume cycle
- dmesg: Uhhuh. NMI received for unknown reason 3d on CPU 0.
- trying to hibernate again after NMI
- dmesg: trying without runtime PM
- full dmesg after booting with acpi=off (working hibernate).

Description Dmitry Nezhevenko 2011-06-10 11:00:30 UTC
Created attachment 61452 [details]
dmesg after successful suspend/resume cycle

I've found that the laptop crashes randomly on the 2.6.39 kernel on resume from hibernate.

It may be either a kernel regression or some change in userspace. Sometimes the kernel just warns about an NMI being received, sometimes it reports reiserfs corruption (followed by a lot of application segfaults). Sometimes the kernel just crashes.

The first hibernate/resume cycle usually keeps the kernel alive (but with some warnings).

It's an ASUS F6A laptop:
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
02:00.0 Network controller: Intel Corporation WiFi Link 5100
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Comment 1 Dmitry Nezhevenko 2011-06-10 11:01:33 UTC
Created attachment 61462 [details]
dmesg: Uhhuh. NMI received for unknown reason 3d on CPU 0.
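A minimal sketch of how the NMI lines could be pulled out of a saved dmesg log, assuming the message wording matches the one quoted in this report (it can differ between kernel versions). The function name `find_unknown_nmis` and the timestamp in the sample line are hypothetical.

```python
import re

# Pattern for the message quoted in this report. The exact wording is an
# assumption and may differ on other kernel versions.
NMI_RE = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)"
)


def find_unknown_nmis(dmesg_text):
    """Return one (reason, cpu) tuple per matching line of a dmesg dump."""
    hits = []
    for line in dmesg_text.splitlines():
        match = NMI_RE.search(line)
        if match:
            reason = int(match.group(1), 16)  # reason is printed in hex
            cpu = int(match.group(2))
            hits.append((reason, cpu))
    return hits


# Sample line; the timestamp is made up for illustration.
sample = "[  123.456789] Uhhuh. NMI received for unknown reason 3d on CPU 0."
for reason, cpu in find_unknown_nmis(sample):
    print(f"unknown NMI, reason 0x{reason:02x}, CPU {cpu}")
```

Counting the hits per boot would show whether the NMI appears only on the first resume.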
Comment 2 Dmitry Nezhevenko 2011-06-10 11:02:17 UTC
Created attachment 61472 [details]
trying to hibernate again after NMI
Comment 3 Dmitry Nezhevenko 2011-06-10 11:03:11 UTC
Created attachment 61482 [details]
dmesg: trying without runtime PM
Comment 4 Dmitry Nezhevenko 2011-06-10 11:04:50 UTC
Just a note: ReiserFS is not actually corrupted. I tried a hard reset when the kernel reported corruption, and fsck just replayed the journal without any errors.
Comment 5 Dmitry Nezhevenko 2011-06-10 11:13:35 UTC
This was a Debian kernel, but exactly the same happens on a self-compiled one.
Comment 6 Len Brown 2011-06-14 01:26:58 UTC
Do you see NMI errors only when you are using hibernate,
or do you also see them when you have not attempted to hibernate?

Do you see any errors when using system suspend to memory?

This is marked as a regression in 2.6.39,
so it worked fine in 2.6.38?
Comment 7 Dmitry Nezhevenko 2011-06-14 13:38:28 UTC
Well, I'm not sure about the NMI stuff in the past. But hibernate was probably working well some time ago.

I've tried compiling 2.6.37 and 2.6.38, and both produce NMI errors and crash on the next attempt.

A lot of things have changed besides the kernel:
- Debian userspace upgrade
- 2GB -> 4GB of RAM
- 32bit -> 64bit kernel and userspace

So I don't know the cause. For some strange reason I can't compile 2.6.36 and earlier kernels because of a "gcc: error: elf_x86_64: No such file or directory" message. Maybe they are not compatible with the current GCC.

Basically, it currently doesn't work in 2.6.38 and 2.6.37 (amd64).

Other answers:
- there are no NMI messages during normal usage (so it only happens on wakeup from hibernate)
- TuxOnIce crashes on resume every time: http://lists.tuxonice.net/pipermail/tuxonice-users/2010-December/000723.html
- s2ram works with 2.6.39 without NMI. I can suspend/resume a lot of times without issue.
Comment 8 Dmitry Nezhevenko 2011-06-16 20:28:53 UTC
I've also tested a 2.6.35 kernel compiled with GCC 4.5, and it still crashes with the same random issues.

I've also tried changing the shutdown method to "shutdown", "test", and "reboot" (/sys/power/disk). The "test" one works (but it doesn't save memory, it probably just suspends devices). All the others (shutdown, reboot, platform) fail.

At the same time, adding "acpi=off" to the kernel command line fixes the hibernate issue. I'm able to hibernate multiple times without NMIs, crashes, or corruption. But acpi=off is unacceptable on a laptop.

Any ideas? What other info should I provide?
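For reference, a small sketch of how the hibernation modes mentioned above could be inspected from a script. It relies on /sys/power/disk listing the available modes with the selected one in square brackets. The helper names `parse_power_disk` and `read_power_disk` and the example string are hypothetical.

```python
def parse_power_disk(text):
    """Split the contents of /sys/power/disk into (selected, available)."""
    selected = None
    available = []
    for token in text.split():
        # The currently selected mode is wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return selected, available


def read_power_disk(path="/sys/power/disk"):
    """Read and parse the sysfs file; only meaningful on a Linux system."""
    with open(path) as handle:
        return parse_power_disk(handle.read())


# Example content, made up for illustration; real systems may list
# a different set of modes.
example = "[platform] shutdown reboot test"
selected, available = parse_power_disk(example)
print("selected:", selected)
print("available:", ", ".join(available))
```

Switching modes means writing one of the listed names back to the same file as root, which the sketch deliberately leaves out.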
Comment 9 Dmitry Nezhevenko 2011-06-16 20:30:01 UTC
Created attachment 62362 [details]
full dmesg after booting with acpi=off (working hibernate).
Comment 10 Dmitry Nezhevenko 2011-06-17 11:43:09 UTC
It looks like I've finally found a way to hibernate/resume without "acpi=off".

The issue is that the i915 module was loaded before resume (from the initrd). So I removed it from the initrd and was able to hibernate/resume multiple times.

I'm still getting a strange NMI, but only on the first resume. Every subsequent hibernate/resume works without NMI messages (until reboot).

Is it safe to ignore these NMI messages?
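A rough sketch of how one might confirm that i915 is no longer in the initrd, assuming a plain file listing of the image is available as text (on Debian such a listing could probably be produced with a tool like lsinitramfs; that step is an assumption and not part of this report). The function name and the sample paths are hypothetical.

```python
import os


def initrd_has_module(listing, module="i915"):
    """Return True if the listing contains <module>.ko, optionally compressed."""
    for path in listing.splitlines():
        name = os.path.basename(path.strip())
        # Match "i915.ko" as well as compressed variants such as "i915.ko.gz".
        if name == module + ".ko" or name.startswith(module + ".ko."):
            return True
    return False


# Sample listing with made-up paths, for illustration only.
listing = """\
lib/modules/2.6.39/kernel/drivers/ata/ahci.ko
lib/modules/2.6.39/kernel/drivers/gpu/drm/i915/i915.ko
"""
print(initrd_has_module(listing))
```

The check is purely name-based, so it says nothing about whether the module is actually loaded before resume.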
Comment 11 Rafael J. Wysocki 2011-06-26 22:31:33 UTC
Comment #8 suggests that this issue shouldn't be listed as a regression from
2.6.39, so I'm dropping it from that list.
Comment 12 Len Brown 2011-07-26 01:26:20 UTC
So hibernate is now working once the initrd is repaired?

So the only issue here is that you get NMI errors,
and you get those only after hibernating?

No, I wouldn't use a machine that gives NMI errors.
Does it pass memtest86+?
Comment 13 Dmitry Nezhevenko 2011-07-26 08:43:20 UTC
Yes. Once I removed the i915 module from the initrd (it's still present in /lib/modules and is loaded later, after the real rootfs is mounted), the machine hibernates and resumes without any issues.

I'm still getting an NMI error on the "first" resume. But if I hibernate a second time without rebooting, everything is clean. The machine is pretty stable, with ~2 weeks of uptime.

Also, the NMI error disappears if "nmi_watchdog=0" is added to the kernel parameters.

memtest passes without errors.
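When comparing boots with and without these options, it can help to check which parameters the running kernel was actually started with. The sketch below parses a command-line string such as the content of /proc/cmdline. It is simplified and does not handle quoted values containing spaces; the function name and the sample string are hypothetical.

```python
def parse_cmdline(cmdline):
    """Parse a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_cmdline(handle.read())


# Sample string, made up for illustration.
sample = "root=/dev/sda1 ro quiet nmi_watchdog=0"
params = parse_cmdline(sample)
print("nmi_watchdog:", params.get("nmi_watchdog"))
print("acpi:", params.get("acpi"))
```

A missing key (here `acpi`) means the option was not passed at all, which is different from it being passed with an empty value.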
Comment 14 Zhang Rui 2012-01-18 05:20:05 UTC
It's great that the kernel bugzilla is back.

Can you please verify if the problem still exists in the latest upstream
kernel?
Comment 15 Dmitry Nezhevenko 2012-02-20 10:21:57 UTC
Yes. It still happens when using a command line without any special options. But as I said, "nmi_watchdog=0" fixes it for me. It's pretty stable with this configuration.
Comment 16 Alan 2012-08-08 09:43:12 UTC
Please test a current kernel. This is probably fixed by

151b61284776be2d6f02d48c23c3625678960b97
Comment 17 Zhang Rui 2012-11-28 05:17:15 UTC
ping...
Comment 18 Zhang Rui 2013-04-25 02:16:32 UTC
ping...