Bug 30492

Summary: kernel sometimes hangs on hibernation
Product: Power Management
Reporter: Martin Steigerwald (Martin)
Component: Hibernation/Suspend
Assignee: power-management_other
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
CC: rjw, rui.zhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38-rc7
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216
Attachments: PM/Hibernate: Try to avoid crashing the kernel unnecessarily
PM / Hibernate: Alternative method of avoiding crashes
PM / Hibernate: Alternative method of avoiding crashes (v2)
PM / Hibernate: Alternative method of avoiding crashes (v3)

Description Martin Steigerwald 2011-03-05 14:48:59 UTC
While testing for bug #30482, and also earlier with 2.6.37 and 2.6.38, I found that the kernel occasionally hangs on hibernation.

My ThinkPad T42 indicates that the hibernation process is in progress by blinking the moon LED. I also switched to a tty. But then it just sits there and blinks for minutes without any apparent progress in the hibernation process.

Rafael, I think I mentioned this to you somewhere already and you hinted that it might be an issue with freeing enough memory for hibernation. Please advise on how to proceed with that. It happens quite rarely, but chances are that it happens more often when I have more applications open prior to suspending. Bisection is out of the question for me, because I seem to recall having this issue since switching to Radeon KMS.
Comment 1 Martin Steigerwald 2011-03-05 14:51:51 UTC
All of this is with in-kernel suspend via the hibernate script 1.99. I am using the following wrapper script for some of my own stuff:

shambhala:/etc> cat acpi/hibernate-extra.sh 
#!/bin/sh

# To be safe, write out all pending changes right at the start
sync

# Try to free as many LowMem pages as possible
# Ext4 apparently also puts dir entries into LowMem
# And with too few LowMem pages, hibernation does not work with
# Radeon DRM KMS.
#echo 3 > /proc/sys/vm/drop_caches

# Alternatively build a smaller image, see LKML:
# Re: does hibernate to disk try hard enough to free memory? (23.2.2011)
echo 710000000 > /sys/power/image_size

# Put Network Manager to sleep
# see /usr/lib/pm-utils/sleep.d/55NetworkManager
dbus-send --print-reply --system                        \
        --dest=org.freedesktop.NetworkManager \
        /org/freedesktop/NetworkManager       \
        org.freedesktop.NetworkManager.sleep

# Stop ifplugd
#/etc/init.d/ifplugd stop
#ifdown eth0

# Save the system time to the hardware clock
/etc/init.d/hwclock.sh stop

# Stop uptimed so that it writes its records
/etc/init.d/uptimed stop

# To be safe, write out all pending changes once more here
sync

# Good night
# /etc/acpi/hibernate.sh
#echo 1 > /sys/power/tuxonice/do_hibernate
#pm-suspend-hybrid
#pm-hibernate
hibernate-disk

# Start uptimed again. It writes the records again when doing so
/etc/init.d/uptimed start

# Write the records right away
sync

# Set the hard disk parameters again
/etc/init.d/hdparm start

# Set the system time from the hardware clock again
/etc/init.d/hwclock.sh start

# Wake up Network Manager
dbus-send --print-reply --system                        \
        --dest=org.freedesktop.NetworkManager \
        /org/freedesktop/NetworkManager       \
        org.freedesktop.NetworkManager.wake

# Start ifplugd
#/etc/init.d/ifplugd start
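
The wrapper script above writes a byte count to /sys/power/image_size. As a minimal sketch, the value the kernel currently holds could be read back and shown in MiB before hibernating. The helper names bytes_to_mib and show_image_size are hypothetical and not part of the hibernate script; the optional file argument exists only so the helper can be tried against an ordinary file.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the hibernate script.

# Convert a byte count to whole MiB (integer division, rounds down).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print the image size target stored in the given file.
# Defaults to the sysfs file that the wrapper script writes to.
show_image_size() {
    file=${1:-/sys/power/image_size}
    if [ -r "$file" ]; then
        bytes=$(cat "$file")
        echo "image_size: $bytes bytes ($(bytes_to_mib "$bytes") MiB)"
    else
        echo "image_size: $file not readable" >&2
        return 1
    fi
}
```

Calling show_image_size without an argument reads /sys/power/image_size. For the value set in the script, bytes_to_mib 710000000 prints 677.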
Comment 2 Rafael J. Wysocki 2011-03-05 21:25:51 UTC
Is the problem reproducible at any reasonable rate?
Comment 3 Martin Steigerwald 2011-03-06 10:32:58 UTC
Well, I thought not, because it could take 10 to 12 days for it to happen. But now it just happened again, with 2.6.38-rc7-g212e349. It seems to trigger more easily when I keep more applications open prior to hibernation. I worked around it by closing all, or all but one, of the applications prior to initiating hibernation. I could stop using this workaround in order to reproduce this bug more easily, or even start some more applications...

But I think I will first test your patch from bug #30482, unless you direct me otherwise.
Comment 4 Rafael J. Wysocki 2011-03-06 11:50:44 UTC
Please do.  The issues are independent of each other anyway.
Comment 5 Rafael J. Wysocki 2011-03-06 11:54:46 UTC
Created attachment 50202
PM/Hibernate: Try to avoid crashing the kernel unnecessarily

Please check if this patch makes a difference.

If it helps, please see if you're able to trigger the WARN_ON().
Comment 6 Martin Steigerwald 2011-03-06 19:15:33 UTC
Trying with the patch now. Thanks.
Comment 7 Martin Steigerwald 2011-03-14 23:09:18 UTC
I have not had any hang so far with this patch and the one from bug #30482 applied. Since I had an image size of 700000000 set manually before and still had this hang occasionally, I think the second patch, the one from this bug report, helps.

I did not see any warning:

shambhala:/> zgrep "WARN" /var/log/syslog* | grep kernel | grep -v thinkpad_acpi
/var/log/syslog.1:Mar  8 23:07:39 shambhala kernel: WARNING: at drivers/ata/libata-core.c:6130 ata_host_detach+0xf6/0x100()

Do I have to grep differently for that?

I will try harder - with more applications open, more memory pressure - when I am back in Germany, but until now all seems fine.
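
On the grep question: a WARN_ON() that fires is logged as a "WARNING: at <file>:<line> ..." line, the same shape as the libata entry above, so a grep for "WARN" already matches it. A narrower filter could look like the sketch below. It assumes the WARN_ON() added by the patch sits in a source file under kernel/power/, and the function name pm_warnings is hypothetical.

```sh
#!/bin/sh
# Sketch only: assumes the warning originates from a file under kernel/power/.

# Print kernel WARNING lines whose source file is under kernel/power/.
# Reads the given log files, or standard input when none are given.
# Like grep, it returns non-zero when nothing matches.
pm_warnings() {
    grep -h 'kernel: WARNING: at kernel/power/' "$@"
}
```

For example, pm_warnings /var/log/syslog scans the current log; rotated, compressed logs can be piped in with zcat.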
Comment 8 Martin Steigerwald 2011-03-16 23:13:58 UTC
Kernel hung again even with both patches applied. This time 2.6.38 plus those two patches:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.38/linux-2.6.38.y> patch -p1 < ../0002-try-to-avoid-crashing-the-kernel-unnecessarily-bug-30492.patch
patching file kernel/power/snapshot.c
Reversed (or previously applied) patch detected!  Assume -R? [n] ^C
-crashing-the-kernel-unnecessarily-bug-30492.patch
martin@shambhala:~/Computer/Shambhala/Kernel/2.6.38/linux-2.6.38.y> patch -p1 < ../0001-refine-autoestimation-of-image-size-bug-30482.patch
patching file kernel/power/snapshot.c
Reversed (or previously applied) patch detected!  Assume -R? [n] c^C
martin@shambhala:~/Computer/Shambhala/Kernel/2.6.38/linux-2.6.38.y> cat /sys/power/image_size 
703320064

I had quite a lot open, including Iceweasel, KMail, Kontact.

Seems that your patch does not yet (completely) fix this issue.
Comment 9 Martin Steigerwald 2011-03-16 23:15:47 UTC
The last entries in syslog are these:

Mar 17 00:06:00 shambhala kernel: e1000: eth0 NIC Link is Down
Mar 17 00:06:01 shambhala NetworkManager[2077]: <info> (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
Mar 17 00:06:01 shambhala kernel: usb 3-2: USB disconnect, address 2
Mar 17 00:06:05 shambhala NetworkManager[2077]: <info> (eth0): device state change: 8 -> 2 (reason 40)
Mar 17 00:06:05 shambhala NetworkManager[2077]: <info> (eth0): deactivating device (reason: 40).
Mar 17 00:06:06 shambhala NetworkManager[2077]: <info> (eth0): canceled DHCP transaction, DHCP client pid 4669
Mar 17 00:06:17 shambhala kernel: PM: Marking nosave pages: 000000000009f000 - 0000000000100000
Mar 17 00:06:17 shambhala kernel: PM: Basic memory bitmaps created
Mar 17 00:06:18 shambhala kernel: PM: Syncing filesystems ... done.
Comment 10 Martin Steigerwald 2011-03-29 09:17:23 UTC
I have been testing with that patch for quite a while. It hung on hibernation again. Thus the patch does not seem to fix the issue, at least not completely. Or there are several causes for hangs on hibernation and it fixed one, but not the other.
Comment 11 Rafael J. Wysocki 2011-05-04 22:58:56 UTC
Created attachment 56672
PM / Hibernate: Alternative method of avoiding crashes

Hi,

Please test this patch instead.  It should fail hibernation if there's
a problem that would crash the kernel in free_unnecessary_pages().
If that happens, please attach the output of dmesg.
Comment 12 Rafael J. Wysocki 2011-05-04 23:01:08 UTC
Created attachment 56682
PM / Hibernate: Alternative method of avoiding crashes (v2)

Sorry, please test this one instead.
Comment 13 Rafael J. Wysocki 2011-05-05 18:57:23 UTC
Created attachment 56772
PM / Hibernate: Alternative method of avoiding crashes (v3)

The previous patch was broken, sorry about that.  Please try this one.
Comment 14 Martin Steigerwald 2011-05-05 19:51:45 UTC
Rafael, I have not had any hibernate-related crashes recently, possibly because I closed applications in order for the suspend to succeed. Shall I test this one together with the one from bug 34102?
Comment 15 Rafael J. Wysocki 2011-05-05 20:07:42 UTC
Yes, please.  Anyway, you're the only person known to me that could reproduce
this problem, so please keep this patch on top of your kernel just in case
the problem triggers.
Comment 16 Martin Steigerwald 2011-05-09 12:13:04 UTC
I am now compiling a kernel with that patch.

I had a hang at preallocation an hour or so ago, when I accidentally hibernated the machine - I wanted to change the brightness, but hit the wrong key. Hopefully it triggers another time.
Comment 17 Martin Steigerwald 2011-05-09 12:16:54 UTC
This was while hibernating with two KDE 4 sessions running, instead of one, which has always worked okay so far.
Comment 18 Martin Steigerwald 2011-05-09 15:05:24 UTC
No, I think it wasn't a hang, because the disk drive was actually still spinning. I think it would have come back after a longer time, because I have now tested with higher reserved_size values and it always came back (from a failed hibernation, but I will detail this in bug 34102).
Comment 19 Rafael J. Wysocki 2011-05-09 19:00:02 UTC
With the patch from comment #13 I think you'll rather see a failing hibernation
than a hang.  Anyway, please report if you see any of them. :-)
Comment 20 Zhang Rui 2012-01-18 03:20:09 UTC
It's great that the kernel bugzilla is back.

Can you please verify if the problem still exists in the latest upstream kernel, with & without the patch in comment #13?
Comment 21 Martin Steigerwald 2012-01-18 08:32:45 UTC
Thanks for this reminder as well.

I didn't have it again on the ThinkPad T42 I talked about. But since I got that shiny new T520, I have not used the T42 for memory-heavy workloads anymore. Thus I am not sure whether this bug has been resolved or whether I am just not triggering it anymore.

Since I do not intend to run memory-heavy workloads on the T42 for now, I will just close this one as well. Should I ever get this hang again, I can always reopen it.