Bug 11794
Description
Michael Lashkevich
2008-10-20 04:09:01 UTC
OK

Any chance to try a kernel.org kernel earlier than 2.6.27? What happens if you do 'echo disk > /sys/power/state' (as root) under 2.6.27-2 (it should hibernate, so are you able to resume)?
Also, in future please don't paste logs, they are completely unreadable in this form. Please add them as attachments instead.

May I try any binaries of earlier versions from the SuSE repositories?
The result of 'echo disk >/sys/power/state' is the same. The screen blackens, the computer powers off after some delay, then when I power it on it boots normally (with some replaying of the filesystem journal).

Created attachment 18375 [details]
/var/log/pm-suspend.log after failed resume
Created attachment 18376 [details]
boot.msg after booting instead of resuming
I tried the binaries "kernel-default-2.6.25.4-60.1.i586.rpm" from the repository "http://download.opensuse.org/repositories/home:/hmacht/openSUSE_10.3/". The same problem.

What's your swap configuration, and did you have large-memory processes running prior to hibernation (e.g. a few hundred browser tabs)? If the sum total of your swap is not sufficient to save the total of your current processes, then hibernation does not work, and that's what you see. I.e. in the "reboot", the last kernel is supposed to read the saved memory dump from swap, and if swap was not configured properly or is too small, then the saved memory dump is not complete and the kernel proceeds to boot normally.

First of all, hibernation works properly with 2.6.22 with much larger memory usage than what I tested 2.6.27 and 2.6.25 with. Namely, I tested them with one kde session, one xterm (with mc) and one thunderbird. Nothing more. For example, just now I have one kde session, three xterms, one thunderbird and one firefox. It takes 256788k + 40528k buffers + 230648k cache + 16k swap according to the output of "free". My RAM is 1026217k total, my swap space (/dev/sda1) is 2008084k total, and my usual memory image size calculated by the s2disk utility and saved to /var/lib/s2disk.conf is about 500M.

What's the result of 'cat /sys/power/disk' on your box?

lashkevi@toshi-lashk:~> cat /sys/power/disk
[platform] test testproc shutdown reboot

You have an fsck on boot, which means your system did not go into hibernation properly.
You need to find the bits of the log just *before* hibernation, rather than after waking up (e.g. a driver might have refused to go into sleep, etc).

Michael, please check if the system resumes correctly after
# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state
If not, please send the output of 'cat /proc/cmdline' and 'cat /proc/swaps'. It is possible that the kernel can't find the image during the resume for some reason.
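The two checks discussed above (which hibernation mode is selected, and whether the image can fit into swap) can be sketched in a few lines of shell. This is only an illustration: the function names `active_mode` and `image_fits` are made up for this example, and treating "used + buffers + cache" as the image size is an assumption that gives a rough upper bound, since the kernel may drop caches before writing the image. The figures are the ones quoted from "free" in the comment above.

```shell
#!/bin/sh
# Print the mode shown in brackets in a /sys/power/disk style line,
# e.g. "[platform] test testproc shutdown reboot" prints "platform".
# (active_mode is a hypothetical helper name, not an existing tool.)
active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Succeed if an image of $1 kB fits into $2 kB of free swap.
# (image_fits is a hypothetical helper name as well.)
image_fits() {
    image_kb=$1
    swap_free_kb=$2
    [ "$image_kb" -le "$swap_free_kb" ]
}

# Figures quoted in the thread, all in kB.
# Assumption: used + buffers + cache is taken as a rough upper bound for the image.
used_kb=$((256788 + 40528 + 230648))
# Swap total minus the 16k reported as in use.
swap_free_kb=$((2008084 - 16))

# On a live system one could call, for example:
#   active_mode "$(cat /sys/power/disk)"
#   image_fits "$used_kb" "$swap_free_kb" && echo "image should fit"
```

With these numbers the estimate is 527964 kB against roughly 2 GB of free swap, which agrees with the observation that swap size is unlikely to be the problem here.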
(In reply to comment #11)
> If not, please send the output of 'cat /proc/cmdline' and 'cat /proc/swaps'.
toshi-lashk:~ # cat /proc/cmdline
root=/dev/sda3 vga=791 resume=/dev/sda1 splash=verbose acpi_osi=!Linux showopts
toshi-lashk:~ # cat /proc/swaps
Filename Type Size Used Priority
/dev/sda1 partition 2008084 0 42
You see, the 'resume=' parameter points to the correct swap partition.
A comment: The parameter 'acpi_osi=!Linux' is added since the FN/Fx keys do not work at all without it.

(In reply to comment #11)
> Michael, please check if the system resumes correctly after
>
> # echo shutdown > /sys/power/disk
> # echo disk > /sys/power/state
Sorry, I forgot to say: the system does NOT resume correctly with these commands.

Can you please attach dmesg output from right after a failing resume?

Created attachment 18408 [details]
dmesg output just after failed resume
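The comparison made by eye above (does 'resume=' on the kernel command line name a device listed in /proc/swaps?) could be scripted roughly as follows. This is a sketch under assumptions: `resume_device` and `is_active_swap` are hypothetical helper names, and the sample strings are the outputs pasted in the comment above.

```shell
#!/bin/sh
# Print the value of the first resume= parameter found in a kernel
# command line string ($1). Prints nothing if there is none.
resume_device() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^resume=//p' | head -n 1
}

# Succeed if device $1 appears in the first column of /proc/swaps style
# text ($2). The first line of that text is the column header and is skipped.
is_active_swap() {
    printf '%s\n' "$2" |
        awk -v dev="$1" 'NR > 1 && $1 == dev { found = 1 } END { exit !found }'
}

# Sample data copied from the outputs quoted in the thread.
cmdline='root=/dev/sda3 vga=791 resume=/dev/sda1 splash=verbose acpi_osi=!Linux showopts'
swaps='Filename Type Size Used Priority
/dev/sda1 partition 2008084 0 42'

# On a live system one could call, for example:
#   dev=$(resume_device "$(cat /proc/cmdline)")
#   is_active_swap "$dev" "$(cat /proc/swaps)" || echo "resume device is not an active swap"
```

A match only shows that the running system agrees with the command line; it says nothing about whether the device is visible to the kernel at the point during boot when it looks for the image.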
Created attachment 18409 [details]
the "/var/log/pm-suspend.log" file after a suspend INTERRUPTED by pressing the backspace key
(In reply to comment #10)
> You have an fsck on boot, which means your system did not go into hibernation
> properly.
>
> You need to find the bits of the log just *before* hibernation, rather than
> after waking up. (e.g. a driver might have refused to go into sleep, etc).
To satisfy your request I attach the "/var/log/pm-suspend.log" file after an attempted suspend that was interrupted by backspace, so that it resumed without completing the image. But I am afraid it says nothing new...

(In reply to comment #15)
> Created an attachment (id=18408) [details]
> dmesg output just after failed resume
The image seems to be there, but it also seems to have been damaged.
Is the sequence of events such that after hibernating the box is powered off and then you power it on and it reboots while trying to resume, or does it boot "normally" instead of resuming?
Can you also attach dmesg from after a clean boot, please?

Created attachment 18410 [details]
Output of "dmesg" after a clean boot
(In reply to comment #17)
> Is the sequence of events such that after hibernating the box is powered off
> and then you power it on and it reboots while trying to resume, or does it
> boot "normally" instead of resuming?
It just boots normally. I mean, there is no reboot after starting grub. It starts like it usually does while resuming (without the start menu) but proceeds with a normal boot.

(In reply to comment #16)
> To satisfy your request I attach the "/var/log/pm-suspend.log" file after an
> attempted to suspend interrupted by backspace to resume without complete
> creating the image. But I am afraid it sais nothing new...
Wrong log, and wrong way to do it. I wrote the log just before a failed resume, and I meant it - not an "interrupted suspend", and wrong file. I meant *the log*, i.e. /var/log/messages, the hundred lines just before the new
<date> <hostname> kernel: Linux version 2.6.xxx... gcc..
<date> <hostname> kernel: Command line: ro root=...

(In reply to comment #20)
> and I meant it - not an "interrupted suspend", and wrong file. I meant *the
> log*, i.e. /var/log/messages, the hundred lines just before the new
> <date> <hostname> kernel: Linux version 2.6.xxx... gcc..
> <date> <hostname> kernel: Command line: ro root=...
Ok, but how do I save the "/var/log/messages" file after hibernation but before powering off? I tried to hibernate in the "reboot" mode and then save the file from the rescue disc, mounting the filesystem in read-only mode, but it did not help. The last line was earlier than the actual hibernation process. Probably, the system checked and repaired the filesystem silently while mounting it.

Hmm, your problem is clearly not about resume, but that your system did not hibernate properly (so there is no good state to resume from). Basically your system crashed while trying to hibernate, I think, so you just get a normal boot/repair after crash.
I think a serial console and a 2nd machine (and the know-how to use a serial console) is needed to debug this sort of problem.

Michael, I also think that the system crashes while saving the hibernation image and this kind of failure is unfortunately very difficult to debug.
Frankly, the only practical method I can suggest to you at the moment is to carry out git bisection and see which commit broke things for you.

Rafael, please explain what "git bisection" is and give me practical instructions on how to do it.

Well, that unfortunately requires you to compile the kernel from sources and install it (multiple times). On openSUSE it's not very difficult, though. Have you ever done it?

Once, long ago. Recently, I only compiled modules. I think I'll manage it.

Created attachment 18454 [details]
Kernel compilation and git bisection instructions
Please follow the instructions in this attachment. If you have any problems, please ask.
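For readers wondering what bisection does: the attached instructions are not reproduced in this thread, but the idea is a binary search between a known-good and a known-bad kernel. The sketch below shows that search on an ordered list. It is an illustration only: `first_bad` and `pretend_good` are hypothetical names, and the good/bad boundary in `pretend_good` is invented for the example (the thread never established one).

```shell
#!/bin/sh
# With git, the usual sequence in a kernel tree is:
#   git bisect start
#   git bisect good v2.6.22     # last version known to work
#   git bisect bad v2.6.25      # first version known to fail
#   ... build, install and test the suggested commit, then report
#   git bisect good             # or: git bisect bad
#   ... repeat until git names the first bad commit, then
#   git bisect reset

# The same search on a plain ordered list. $1 is the name of a command that
# succeeds for good entries; the remaining arguments are the candidates,
# ordered oldest to newest, with the last one assumed to be bad.
first_bad() {
    is_good=$1
    shift
    lo=1      # lowest index that may still be the first bad entry
    hi=$#     # highest index, assumed bad
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$is_good" "$candidate"; then
            lo=$((mid + 1))   # breakage was introduced later
        else
            hi=$mid           # breakage is here or earlier
        fi
    done
    eval "printf '%s\n' \"\${$lo}\""
}

# Stand-in for "build this kernel and try to hibernate".
# The boundary below is invented purely for illustration.
pretend_good() {
    case $1 in
        2.6.22|2.6.23) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call:
#   first_bad pretend_good 2.6.22 2.6.23 2.6.24 2.6.25
```

Each round halves the remaining range, which is why bisecting thousands of commits needs only a dozen or so kernel builds.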
I tried to do the git bisection but encountered a problem even at its first step. To be sure, I began by trying 2.6.28-rc2 generated initially from the git tree, but the resume problem remains for this version. Then I tried a bisection between versions 2.6.22 (definitely good) and 2.6.25 (already bad), which proposed 2.6.24-rc2-default, but that kernel has a problem with modprobe as early as the boot process. Suspend to disk fails and resumes at some early stage without creating the image (ERR=255). Even when I halt the system normally it fails to unmount my root filesystem correctly. Probably, I used a wrong .config file (I took it from my old 2.6.22. I also tried the one from 2.6.27, but with that one the kernel ide modules were not generated at all, so it could not make the initrd correctly.)

That isn't good at all. Do you normally use IDE drivers or libata/PATA drivers?

Created attachment 18476 [details]
lsmod output for 2.6.22
Yes, the kernel uses them (ide_core, piix, ata_piix, libata, ahci). See this attachment for 2.6.22. For 2.6.27 see the pm-suspend.log attached earlier.
The IDE driver seems to be not used, though.
Anyway, please run 'git bisect reset' in your repository directory, run 'git pull' from there to update the kernel source to the latest version (that will be 2.6.28-rc2-something), run 'make xconfig' and set CONFIG_PM_DEBUG. Compile and install the kernel and run:
# echo core > /sys/power/pm_test
# echo disk > /sys/power/state
(that will test the hibernation code without creating the image) and see if the system remains stable after that.

Created attachment 18478 [details]
Dmesg output after 'echo core >/sys/power/pm_test; echo disk >/sys/power/state'
This is the dmesg output after
# modprobe -r rtl8187
# echo core > /sys/power/pm_test
# echo disk > /sys/power/state
(The first line is necessary since the computer freezes while suspending with this module.)
I tried to hibernate the computer with 'echo 1 >/sys/power/pm_trace'. After booting, I got the following dmesg lines:
Magic number: 0:77:376
hash matches drivers/base/power/main.c:390

I wonder why you need to modprobe -r rtl8187 - the driver emits a warning on suspend/hibernate: "rtl8187 1-6:1.0: no suspend for driver rtl8187?", but the kernel unloads the whole of the usb stack on suspend/hibernate so the warning is just a warning.

(In reply to comment #34)
Because with this driver loaded the computer freezes forever at some early stage of hibernation (freezing tasks, I believe). This is an experimental fact.

First, I'm changing the severity of this bug to "normal", as it seems to be related to your hardware in a non-trivial way.
Second, your dmesg attached to comment #32 (btw, please set the 'plain text' type for your text attachments, that makes browsing them _much_ easier) looks pretty normal. Did the system work correctly after that?
Finally, please file a separate bug for the rtl8187 issue.

The system worked correctly after 'echo core...'.
When the kernel fails to resume there is a message in the boot.msg:
<3>Unable to find swap-space signature
Is it relevant to the problem?

Yes, it is. This means that the hibernation code managed to change the swap signature of your resume device.
BTW, are there any swap devices right after a failing resume in /proc/swaps?

Yes, swapon succeeds after that.

If you do file a bug against rtl8187 hibernation, please add me to the CC: (I was one of the people who made rtl8187B support possible - i.e. the bulk of the wireless driver code you are using).

(In reply to comment #39)
> Yes, swapon succeeds after that.
Can you please try to run 'mkswap' on your swap space and retest hibernation with 'echo disk > /sys/power/state'?

(In reply to comment #40)
> If you do file a bug against rtl8187 hibernation, please add me to the CC: (I
> was one of the people who made rtl8187B support possible - i.e. the bulk of the
> wireless driver code you are using).
Ok. I'll do it.

(In reply to comment #41)
> Can you please try to run 'mkswap' on your swap space and retest hibernation
> with 'echo disk > /sys/power/state'?
It does not help. Moreover, today I failed to resume with 2.6.22. I thought that something was wrong with the installation of the kernel after so many fsck's, and updated from 2.6.22.18 to 2.6.22.19 with the openSUSE update utility (and rebooted after that), but it did not help. Maybe something is wrong with the disk? Maybe it is better to run a full fsck from the rescue disk?

If you have a spare disk, you can try to set up a swap partition in there and try to hibernate.
Alternatively, you can try to use a swap file for hibernation.
If this is a disk problem, fsck won't help fix it (swap partitions don't contain filesystems anyway, so fsck doesn't work on them).

Hmm, I normally do suspend, so I just put my toshiba laptop into hibernate and it resumes alright as it should. It is a satellite A210, not the same model - I know for a fact that the rtl8187 driver code path on this and yours is identical. This is with fedora 9 2.6.26.7-86.fc9.x86_64.
How are you configuring your networking? NetworkManager should automatically disconnect on hibernate and ifconfig down the device, then the kernel unloads the whole USB core stack plus drivers, so there shouldn't need to be a manual "modprobe -r".

(In reply to comment #45)
> Hmm, I normally do suspend, so I just put my toshiba laptop into hibernate and
> it resumes alright as it should. It is a satellite A210, not the same model - I
> know for a fact that the rtl8187 driver code path on this and yours is
> identical. This is with fedora 9 2.6.26.7-86.fc9.x86_64 .
>
> How are you configuring your networking? NetworkManager should automatically
> disconnect on hibernate and ifconfig down the device, then the kernel unloads
> the whole USB core stack plus drivers, there shouldn't need to be a manual
> "modprobe -r".
Yes, I use NetworkManager. Maybe that is because of the non-standard identifier of this card (0bda:8198) on Toshiba Satellite L300? For former releases and drivers I needed to add this identifier manually into the source code.

(In reply to comment #44)
> If you have a spare disk, you can try to set up a swap partition in there and
> try to hibernate.
>
> Alternatively, you can try to use a swap file for hibernation.
>
> If this is a disk problem, fsck won't help fix it (swap partitions don't
> contain filesystems anyway, so fsck doesn't work on them).
I remade the swap partition with the flag '-c', which forces a check of the space for bad blocks. No bad blocks found. The use of a swap file does not help either.

Can you compile the current mainline kernel from sources with CONFIG_PM_DEBUG enabled and install it?

(In reply to comment #48)
> Can you compile the current mainline kernel from sources with CONFIG_PM_DEBUG
> enabled and install it?
Done.

Please attach dmesg from this kernel taken right after a failing resume (as a text attachment).

Created attachment 18502 [details]
dmesg output just after failed resume (2.6.28-rc2-17-default, CONFIG_PM_DEBUG=y)
Here you are.
Unfortunately, the log doesn't reveal _anything_. I'm getting a bit frustrated with that.
Is there a possibility to set up a serial console on this box?

Yes, since the problem I believe is about a crash during hibernation (not resume), the log after resume does not help; unless you can somehow override the disk-recovery/fsck right after the crash, *before* the resume. Actually I wonder if mounting the disk read-only from a 2nd machine will work (read-write certainly won't, since the first thing is "recover-journal" or "fsck").
Oh, there is one other option - configure syslog to send the log over the network to a 2nd machine (in addition to logging locally). But one only gets logs as long as the networking facility is up, so one probably does get very close to the moment of crash.
Honestly, I think the most informative/useful route is debugging with a serial console - this would get at the logs as close to the moment of crash as possible.

What does a serial console mean?
I could try again to boot my computer from a rescue disc just after suspend and mount the filesystems read-only. To do it, maybe it is better first to redirect syslog to the ext2 boot partition instead of /var/log on the reiserfs root partition?

Serial console is a second machine with 'minicom' or equivalent program attached to your box with a serial null-modem cable and capable of reading the kernel messages on the fly.
Another possibility to debug it a bit further would be to use the latest version of user space hibernation utilities (the CVS version from suspend.sf.net) and enable the image verification option. That might give us access to kernel logs generated during failing hibernation.

(In reply to comment #55)
> Serial console is a second machine with 'minicom' or equivalent program
> attached to your box with a serial null-modem cable and capable of reading the
> kernel messages on the fly.
My notebook has no serial port, only USB ones.
> Another possibility to debug it a bit further would be to use the latest
> version of user space hibernation utilities (the CVS version from
> suspend.sf.net) and enable the image verification option. That might give us
> access to kernel logs generated during failing hibernation.
I installed it and suspended with 'debug verify image = y'. It successfully verified the image, but again failed to resume. Then I found no new log messages.

Created attachment 18564 [details]
A comment of Alexander Prokofiev, the sysadmin of Landau Institute
I consulted the system administrator at my work, Alexander Prokofiev, on this problem, and he wonders why dmesg after a failed resume does not contain any lines about the attempt to resume and the reason for the failure. I attach here his comment translated into English.
Actually, the advice to replace pr_debug with printk in this fragment of code is a good one.
You can also compile the kernel (2.6.28-rc2 or newer) with CONFIG_DYNAMIC_PRINTK_DEBUG set and add 'dynamic_printk' to the kernel command line. There is a change in 2.6.28-rc that makes pr_debug messages only be printed if there's 'dynamic_printk' in the kernel command line (however, it makes _all_ of the pr_debug messages appear in the log, which may not be what you want).

Created attachment 18588 [details]
Output of "dmesg |grep 'PM:'" after failed resume
(In reply to comment #58)
The parameter CONFIG_DYNAMIC_PRINTK_DEBUG does not seem to work even together with the kernel command line parameter 'dynamic_printk'. That is why I replaced all the pr_debug(...) calls with printk(...) in all the files in the directory "kernel/power". This gave several additional log messages (see the attachment), which are the same for both a clean boot and a boot after a failed resume. Is it possible to include some other printk lines to debug it further?

Well, I'll have to see why the debug messages are busted.
The information you obtained is really useful, because:
PM: Resume from partition /dev/sda1
PM: Checking hibernation image.
PM: Error -6 checking image file
means "No such device or address", so it looks like your resume kernel can't handle the resume device. It can be a missing driver or something wrong with the initrd image.

Created attachment 18598 [details]
Output of mkinitrd
Here is the output of the mkinitrd command. What could be wrong with it? It seems to contain all necessary drivers.
By the way, it seems from dmesg that the ahci and libata modules are loaded _after_ the attempt of resume. Is it a correct behaviour?
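One way to double-check the "contains all necessary drivers" impression is to compare the module list that mkinitrd prints against the drivers one expects. The sketch below does that for a "Kernel Modules:" line. It is an illustration under assumptions: `missing_modules` is a hypothetical helper name, the sample line is the one from the mkinitrd run quoted later in this thread, and which modules are really required for the resume device is something the reader has to decide.

```shell
#!/bin/sh
# Print, one per line, those module names ($2, $3, ...) that do not appear in
# a "Kernel Modules: ..." line ($1) as printed by mkinitrd.
# Module names are compared with '-' and '_' treated as the same character.
missing_modules() {
    line=$1
    shift
    # Keep only the part after the label, pad with spaces for whole-word matching.
    listed=" ${line#*Kernel Modules:} "
    listed=$(printf '%s' "$listed" | sed 'y/-/_/')
    for mod in "$@"; do
        name=$(printf '%s' "$mod" | sed 'y/-/_/')
        case $listed in
            *" $name "*) ;;                 # present in the initrd list
            *) printf '%s\n' "$mod" ;;      # not listed
        esac
    done
}

# Module line copied from the mkinitrd output quoted in this thread.
modules_line='Kernel Modules: ata_piix hwmon thermal_sys processor thermal fan reiserfs edd ide-core piix crc-t10dif sd_mod usbcore ohci-hcd uhci-hcd ehci-hcd hid usbhid'

# Example call, checking for the drivers mentioned in the lsmod comment:
#   missing_modules "$modules_line" ata_piix sd_mod ahci libata
```

A name reported as missing only means it is absent from the printed list; it does not by itself prove that the initrd is the reason a resume fails.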
(In reply to comment #61)
> Created an attachment (id=18598) [details]
> Output of mkinitrd
>
> Here is the output of the mkinitrd command. What could be wrong with it? It
> seems to contain all necessary drivers.
Yes, it appears so.
> By the way, it seems from dmesg that the ahci and libata modules are loaded
> _after_ the attempt of resume. Is it a correct behaviour?
No, it isn't if you try to resume after hibernating with 'echo disk > /sys/power/state'.
Please try to compile these drivers directly into the kernel and retest.

The same error -6 when I try to suspend and resume with the kernel compiled with
CONFIG_PM_STD_PARTITION="/dev/sda1"
CONFIG_SCSI=y
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
CONFIG_ATA_GENERIC=y

Well, this probably means that /dev/sda1 is not present in your initrd image, so the kernel cannot find the device. I don't know why this happens, though.

# mkinitrd -k vmlinuz-2.6.28-rc2-17-default -i initrd-2.6.28-rc2-17-default
Kernel image: /boot/vmlinuz-2.6.28-rc2-17-default
Initrd image: /boot/initrd-2.6.28-rc2-17-default
Root device: /dev/sda3 (mounted on / as reiserfs)
Resume device: /dev/sda1
Kernel Modules: ata_piix hwmon thermal_sys processor thermal fan reiserfs edd ide-core piix crc-t10dif sd_mod usbcore ohci-hcd uhci-hcd ehci-hcd hid usbhid
Features: block usb
17685 blocks

Rafael, the problem is indeed in the initrd file! When I upgraded to 2.6.27 from the SuSE repositories, I needed to upgrade mkinitrd from 2.1 to 2.4 due to failed dependencies. Now, since I use a compiled kernel, I fell back to mkinitrd-2.1 and rebuilt the initrd files. And now HIBERNATION AND RESUME WORKS! The problem with 2.6.22 that occurred while debugging was because I had happened to rebuild its initrd with mkinitrd 2.4. Hence, I must address the problem to the SuSE team rather than to the kernel team. Sorry for taking so much of your time and thank you for your help.
Best regards,
Michael

I'm glad you found the source of the trouble. :-)