Created attachment 61452 [details]
dmesg after successful suspend/resume cycle
I've found that my laptop crashes randomly on the 2.6.39 kernel when resuming from hibernation.
It may be either a kernel regression or the result of some userspace changes. Sometimes the kernel just warns that an NMI was received, sometimes it reports reiserfs corruption (followed by a lot of application segfaults), and sometimes it just crashes.
The first hibernate/resume cycle usually leaves the kernel alive (but with some warnings).
It's an ASUS F6A laptop:
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
02:00.0 Network controller: Intel Corporation WiFi Link 5100
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Created attachment 61462 [details]
dmesg: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Created attachment 61472 [details]
trying to hibernate again after NMI
Created attachment 61482 [details]
dmesg: trying without runtime PM
Just a note: ReiserFS is not actually corrupted. I've tried a hard reset when the kernel reports corruption, and fsck just replays the journal without any errors.
This was the Debian kernel, but exactly the same happens on a self-compiled one.
Do you see NMI errors only when you are using hibernate,
or do you also see them when you have not attempted to hibernate?
Do you see any errors when using system suspend to memory?
This is marked as a regression in 2.6.39,
so it worked fine in 2.6.38?
Well, I'm not sure about the NMI stuff in the past. But hibernate was probably working well some time ago.
I've tried compiling 2.6.37 and 2.6.38, and both produce NMI errors and crash on the next attempt.
A lot of things have changed besides the kernel:
- Debian userspace upgrade
- 2GB -> 4GB of RAM
- 32bit -> 64bit kernel and userspace
So I don't know the cause. For some strange reason I can't compile 2.6.36 and earlier kernels due to "gcc: error: elf_x86_64: No such file or directory" message. Maybe it's not compatible with current GCC.
Basically now it doesn't work in 2.6.38 and 2.6.37 (amd64).
- there are no NMI messages during normal usage (so it only happens on wakeup from hibernate)
- TuxOnIce crashes on resume every time: http://lists.tuxonice.net/pipermail/tuxonice-users/2010-December/000723.html
- s2ram works with 2.6.39 without NMI. I can suspend/resume a lot of times without issue.
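To compare a normal boot against a post-resume log, the NMI lines can be pulled out of a saved dmesg with a small script. This is only a minimal sketch: it assumes the message wording matches the one in the attached dmesg ("NMI received for unknown reason 3d on CPU 0"), and the names `NMI_RE` and `find_unknown_nmis` are made up for illustration.

```python
import re

# Pattern based on the message text quoted in this report; other kernel
# versions may word it differently (assumption).
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")


def find_unknown_nmis(log_text):
    """Return a list of (reason, cpu) tuples for every unknown-NMI line.

    The reason is printed by the kernel in hex, so it is parsed as base 16.
    """
    return [(int(m.group(1), 16), int(m.group(2)))
            for m in NMI_RE.finditer(log_text)]


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python this_script.py
    for reason, cpu in find_unknown_nmis(sys.stdin.read()):
        print("unknown NMI: reason=0x%02x cpu=%d" % (reason, cpu))
```

An empty result on a freshly booted system and a non-empty one after resume would match what is described above.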
I've also tested a 2.6.35 kernel compiled with GCC 4.5, and it still crashes with the same random issues.
I've also tried changing the shutdown method (/sys/power/disk) to "shutdown", "test", and "reboot". "test" works (but it doesn't save memory, it probably just suspends the devices). All the others (shutdown, reboot, platform) fail.
At the same time, adding "acpi=off" to the kernel command line fixes the hibernate issue: I'm able to hibernate multiple times without NMIs, crashes, or corruption. But acpi=off is unacceptable on a laptop.
Any ideas? What other info should I provide?
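For reference, the hibernation modes a given kernel offers can be read back from /sys/power/disk, where the currently selected mode is shown in square brackets. The following is a rough sketch of parsing that file; the helper name `parse_power_disk` is hypothetical, and the sample string in the comment is an assumed example, not output from this machine.

```python
def parse_power_disk(text):
    """Split the contents of /sys/power/disk into (available, selected).

    The file lists the supported modes separated by spaces, with the
    active one wrapped in brackets, e.g. "[platform] shutdown reboot"
    (example string is an assumption, not taken from this laptop).
    """
    available = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return available, selected


if __name__ == "__main__":
    try:
        with open("/sys/power/disk") as f:
            modes, current = parse_power_disk(f.read())
        print("available:", modes, "selected:", current)
    except OSError as exc:
        # The file is missing when hibernation support is not built in.
        print("cannot read /sys/power/disk:", exc)
```

Switching modes is done by writing one of the listed names back to the same file as root, which is how the methods above were tried.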
Created attachment 62362 [details]
full dmesg after booting with acpi=off (working hibernate).
It looks like I've finally found a way to hibernate/resume without "acpi=off".
The issue is that the i915 module was loaded before resume (from the initrd). I've removed it from the initrd and was able to hibernate/resume multiple times.
I'm still getting the strange NMI, but only on the first resume. Every subsequent hibernate/resume works without NMI messages (until reboot).
Is it safe to ignore these NMI messages?
Comment #8 suggests that this issue shouldn't be listed as a regression from
2.6.39, so I'm dropping it from that list.
So hibernate is now working once the initrd is repaired?
So the only issue here is that you get NMI errors,
and you get those only after hibernating?
No, I wouldn't use a machine that gives NMI errors.
Does it pass memtest86+?
Yes. Once I removed the i915 module from the initrd (it's still present in /lib/modules and is loaded later, after the real rootfs is mounted), the machine hibernates and resumes without any issues.
I'm still getting an NMI error on the first resume. But if I hibernate a second time without rebooting, everything is clean. The machine is pretty stable with ~2 weeks of uptime.
Also, the NMI error disappears if "nmi_watchdog=0" is added to the kernel parameters.
memtest is fully OK.
It's great that the kernel bugzilla is back.
Can you please verify if the problem still exists in the latest upstream
Yes. It still happens when using a command line without any special options. But as I said, "nmi_watchdog=0" fixes it for me. It's pretty stable with this configuration.
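When comparing test runs, it helps to confirm which options the running kernel was actually booted with. Below is a small sketch that checks a command line (as found in /proc/cmdline) for "nmi_watchdog=0". The function names are hypothetical, and the parser is deliberately simple: it assumes no quoted values containing spaces.

```python
def cmdline_params(cmdline):
    """Parse a kernel command line into a dict.

    "key=value" tokens map key -> value; bare flags such as "ro" map to
    None. Quoted values with embedded spaces are not handled (assumption).
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def nmi_watchdog_disabled(cmdline):
    """True if the command line explicitly sets nmi_watchdog=0."""
    return cmdline_params(cmdline).get("nmi_watchdog") == "0"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print("nmi_watchdog disabled:", nmi_watchdog_disabled(f.read()))
    except OSError as exc:
        print("cannot read /proc/cmdline:", exc)
```

The same `cmdline_params` helper can be used to check for "acpi=off" when recording which configuration a given dmesg came from.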
Please test a current kernel. This is probably fixed by