Summary: 2.6.30 - 2.6.32 prevents my system from waking up from suspend
Severity: high
CC: hpa, lenb, mingo, rjw, yinghai
Bug Blocks: 7216, 13070, 56331
Attachments:
lspci -vvv output
dmesg output before suspend
dmesg output after first suspend
dmesg output after second suspend
/proc/modules output after suspend
/proc/interrupts after suspend
dmesg output after running echoing pm_test and state
dmesg output without any proc changes after resume
Image of the kernel freeze
dmesg after freezer pm-test
dmesg of removed e820 patch - before suspend
dmesg of removed e820 patch - after suspend
reduced .config from my bisecting kernel (should be version 2.6.28)
dmesg after suspending with pm_trace - 2.6.31-rc5
Description unggnu 2009-05-20 14:09:12 UTC
Created attachment 21449: lspci -vvv output

When I resume from suspend, the fan turns on but the monitor still gets no signal. Then the fan turns off, and after some time the desktop is shown again, but I can't do anything except move the mouse, because the hard disk partitions are mounted read-only and many hard disk controller messages appear on the console. Sometimes it doesn't even give a monitor signal at all.

Interestingly, if I use echo 1 > /sys/power/pm_trace, it resumes fine, without the fan-off problem and without the waiting. After that I can suspend fine, at least in the few tests I have done.

2.6.29 and all older kernels I have tested so far don't have this problem; they resume fine. It happened with every 2.6.30 kernel I have tested so far. I am using the kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30-rc6/ .

Board: Gigabyte EP45-DS3L
Linux BlackBox 2.6.30-020630rc6-generic #020630rc6 SMP Mon May 18 14:46:29 UTC 2009 x86_64 GNU/Linux

This is the dmesg output I got on the first resume after enabling pm_trace:

[ 111.271156] PM: resume devices took 5.672 seconds
[ 111.271160] ------------[ cut here ]------------
[ 111.271162] WARNING: at /home/kernel-ppa/mainline/build/kernel/power/main.c:176 suspend_test_finish+0x7f/0x90()
[ 111.271163] Hardware name: EP45-DS3L
[ 111.271164] Component: resume devices
[ 111.271165] Modules linked in: binfmt_misc ppdev radeon drm bridge stp bnep vboxnetflt vboxdrv video output lp parport snd_hda_codec_atihdmi arc4 ecb snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss ath5k snd_seq_midi snd_rawmidi snd_seq_midi_event mac80211 snd_seq led_class snd_timer snd_seq_device intel_agp serio_raw iTCO_wdt iTCO_vendor_support joydev snd soundcore snd_page_alloc cfg80211 aes_x86_64 aes_generic xts gf128mul hid_microsoft usb_storage r8169 mii usbhid dm_crypt [last unloaded: pcspkr]
[ 111.271190] Pid: 4407, comm: pm-suspend Not tainted 2.6.30-020630rc6-generic #020630rc6
[ 111.271191] Call Trace:
[ 111.271196] [<ffffffff802504bd>] warn_slowpath_fmt+0xdd/0x110
[ 111.271198] [<ffffffff80250c59>] ? try_acquire_console_sem+0x49/0x50
[ 111.271201] [<ffffffff80250cab>] ? acquire_console_semaphore_for_printk+0x4b/0x90
[ 111.271204] [<ffffffff806a7a3f>] ? _spin_lock_irqsave+0x2f/0x50
[ 111.271207] [<ffffffff8053dd7c>] ? usb_suspend_both+0x12c/0x210
[ 111.271209] [<ffffffff80251557>] ? printk+0x67/0x70
[ 111.271213] [<ffffffff8041b197>] ? kobject_put+0x27/0x60
[ 111.271215] [<ffffffff804c11d2>] ? put_device+0x12/0x20
[ 111.271217] [<ffffffff804c9616>] ? dpm_complete+0x116/0x130
[ 111.271220] [<ffffffff8027fbaf>] suspend_test_finish+0x7f/0x90
[ 111.271222] [<ffffffff8027fc5a>] suspend_devices_and_enter+0x9a/0xd0
[ 111.271224] [<ffffffff8027ff0b>] enter_state+0xdb/0x100
[ 111.271226] [<ffffffff8027ffdf>] state_store+0xaf/0xf0
[ 111.271229] [<ffffffff8041af87>] kobj_attr_store+0x17/0x20
[ 111.271231] [<ffffffff8034e56d>] flush_write_buffer+0x5d/0x90
[ 111.271233] [<ffffffff8034e6a1>] sysfs_write_file+0x61/0xa0
[ 111.271236] [<ffffffff802efd57>] vfs_write+0xc7/0x180
[ 111.271238] [<ffffffff802eff00>] sys_write+0x50/0x90
[ 111.271241] [<ffffffff80210f82>] system_call_fastpath+0x16/0x1b
[ 111.271242] ---[ end trace 47224b87c730128d ]---
[ 111.271325] PM: Finishing wakeup.
[ 111.271326] Restarting tasks ... done.
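The WARNING in this log is raised from suspend_test_finish() immediately after "PM: resume devices took 5.672 seconds", which suggests it flags a slow resume phase rather than a crash. A minimal Python sketch for scanning a saved dmesg for such slow phases; `slow_phases` is a hypothetical helper, and its 5-second default is an assumption inferred from this single log, not a confirmed kernel constant.

```python
import re
from typing import List, Tuple

# Matches lines such as "PM: resume devices took 5.672 seconds".
PHASE_RE = re.compile(r"PM: (.+?) took (\d+\.\d+) seconds")


def slow_phases(dmesg: str, limit_s: float = 5.0) -> List[Tuple[str, float]]:
    """Return (phase, seconds) for every timed PM phase longer than limit_s.

    ASSUMPTION: limit_s=5.0 is inferred from this report (5.672 s triggered
    the WARNING); the real limit may differ between kernel versions.
    """
    hits: List[Tuple[str, float]] = []
    for phase, secs in PHASE_RE.findall(dmesg):
        if float(secs) > limit_s:
            hits.append((phase, float(secs)))
    return hits
```

Passing the log above returns one hit, for "resume devices". Lowering `limit_s` is a cheap way to see which phase dominates when no WARNING is printed at all.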
Comment 1 unggnu 2009-05-20 14:09:48 UTC
Created attachment 21450: dmesg output before suspend
Comment 2 unggnu 2009-05-20 14:10:12 UTC
Created attachment 21451: dmesg output after first suspend
Comment 3 unggnu 2009-05-20 14:10:33 UTC
Created attachment 21452: dmesg output after second suspend
Comment 4 unggnu 2009-05-20 14:11:04 UTC
Created attachment 21453: /proc/modules output after suspend
Comment 5 unggnu 2009-05-20 14:11:29 UTC
Created attachment 21454: /proc/interrupts after suspend
Comment 6 unggnu 2009-05-20 14:13:10 UTC
Created attachment 21455: /proc/cpuinfo
If anything else is needed, let me know.
Comment 7 Zhang Rui 2009-05-21 08:02:52 UTC
Was the dmesg you attached taken without using "echo 1 > /sys/power/pm_trace"?
Please make sure your kernel is built with CONFIG_PM_DEBUG=y, and then run
echo devices > /sys/power/pm_test
echo mem > /sys/power/state
Wait a few seconds until the machine wakes up by itself. Does the problem still exist after resume?
Comment 8 Zhang Rui 2009-05-21 08:03:43 UTC
(In reply to comment #7)
> is the dmesg gotten when not using "echo 1 > /sys/power/pm_trace"?
> please attach the dmesg output without this command.
> please make sure your kernel is built with CONFIG_PM_DEBUG=y,
> and then run
> echo devices > /sys/power/pm_test
> echo mem > /sys/power/state
> wait for a few seconds until the laptop wakeup itself,
> does the problem still exist after resume?
Please attach the dmesg output after resume if the problem still exists.
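A minimal Python sketch of the pm_test procedure requested in comments #7 and #8, assuming a kernel built with CONFIG_PM_DEBUG=y (so /sys/power/pm_test exists) and root privileges. The helper names `read_pm_test_modes` and `run_pm_test` are hypothetical, and the `sysfs_root` parameter exists only so the code can be dry-run against a scratch directory instead of the real /sys/power.

```python
from pathlib import Path
from typing import List, Tuple

SYSFS_POWER = Path("/sys/power")


def read_pm_test_modes(sysfs_root: Path = SYSFS_POWER) -> Tuple[List[str], str]:
    """Return (available modes, currently selected mode).

    pm_test lists every mode on one line and wraps the active one in
    brackets, e.g. "[none] core processors platform devices freezer".
    """
    modes: List[str] = []
    selected = ""
    for token in (sysfs_root / "pm_test").read_text().split():
        name = token.strip("[]")
        modes.append(name)
        if token.startswith("["):
            selected = name
    return modes, selected


def run_pm_test(mode: str, sysfs_root: Path = SYSFS_POWER) -> None:
    """Same as: echo <mode> > pm_test; echo mem > state (needs root on a real system)."""
    modes, _ = read_pm_test_modes(sysfs_root)
    if mode not in modes:
        raise ValueError(f"unsupported pm_test mode {mode!r}; kernel offers {modes}")
    (sysfs_root / "pm_test").write_text(mode)
    # With any mode other than "none" the kernel stops at that stage, waits
    # about 5 seconds and resumes instead of actually going to sleep.
    (sysfs_root / "state").write_text("mem")
```

For instance, `run_pm_test("devices")` mirrors the two echo commands from comment #7, and `run_pm_test("core")` the deeper test suggested later in the thread. The selected mode stays in effect, so write `none` back to pm_test before attempting a real suspend.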
Comment 9 unggnu 2009-05-21 10:38:21 UTC
The first dmesg output (before suspend) was taken without the command.
> please attach the dmesg output without this command
This could be hard, because even when something appears on the screen after a while, I can't save anything, since all partitions are mounted read-only. I can't even shut down normally. I will try to provide the other information soon.
Comment 10 unggnu 2009-05-21 10:43:16 UTC
Created attachment 21465: dmesg output after running echoing pm_test and state
> please make sure your kernel is built with CONFIG_PM_DEBUG=y,
> and then run
> echo devices > /sys/power/pm_test
> echo mem > /sys/power/state
> wait for a few seconds until the laptop wakeup itself,
> does the problem still exist after resume?
I haven't checked the CONFIG status, but I guess it isn't needed, because with the /sys/power/state command the screen only went dark for a few seconds and the PC didn't go to sleep. Anyway, here is the dmesg output.
Comment 11 unggnu 2009-05-21 11:23:22 UTC
Created attachment 21466: dmesg output without any proc changes after resume
Comment 12 unggnu 2009-05-21 11:27:50 UTC
Created attachment 21467: Image of the kernel freeze
I switched to a console because I couldn't unlock the screen saver after a resume that showed the problem. I connected my USB stick and saved the dmesg output to it after resume. After I pressed Ctrl+Alt+Del, the kernel froze and I got this output. Anyway, it looks like I have to recompile the kernel with CONFIG_PM_DEBUG=y, or is this enough?
Comment 13 Zhang Rui 2009-05-22 02:07:51 UTC
If you run
echo freezer > /sys/power/pm_test
echo mem > /sys/power/state
there are no such error messages, right?
Comment 14 unggnu 2009-05-22 06:34:47 UTC
Created attachment 21482: dmesg after freezer pm-test
I don't get the error, but it doesn't go to sleep either. To recheck, I restarted the system and used pm-suspend instead of "echo mem > /sys/power/state", but it still doesn't go to sleep. The display blanks for a few seconds and then everything comes back.
Comment 15 unggnu 2009-05-23 06:50:22 UTC
Do I still need to recompile the kernel, or do you have a clue what the reason could be?
Comment 16 Zhang Rui 2009-05-25 06:08:18 UTC
No, that's the expected behavior: with a pm_test mode set, the system is not supposed to actually go to sleep. Hmm, what if you run "echo shutdown > /sys/power/disk" before entering S3? Note that I mean real S3, so you don't need to run "echo blabla > /sys/power/pm_test" this time.
Comment 17 unggnu 2009-05-25 07:09:51 UTC
If I do this, I get the freeze messages immediately after resume, with many aes_crypt and ext4 errors. By the way, I am using dm-crypt with LUKS, if that is relevant. I couldn't save anything because of this. Thanks for trying to nail this down.
Comment 18 unggnu 2009-05-25 07:19:43 UTC
I have tried rc7, and it seems to work fine after two suspends without any proc changes. Once I have done some more tests, I will mark this issue as resolved if no problems appear again.
Comment 19 unggnu 2009-05-26 11:19:21 UTC
I don't know why it worked, but I wasn't able to get it working again afterwards. Most of the time I get the kernel error messages immediately.
Comment 20 unggnu 2009-05-26 16:28:52 UTC
What else is needed to nail down the problem?
Comment 21 Len Brown 2009-05-27 01:51:49 UTC
As 2.6.29 worked and 2.6.30-rc fails, can you bisect to find which commit between those two causes the failure?
Comment 22 unggnu 2009-05-27 06:21:06 UTC
Wow, this will be time-consuming. I guess I need ccache and my own kernel build tailored to this PC to save time. I will report back if I find something.
Comment 23 unggnu 2009-05-27 20:17:22 UTC
According to bisect, it is either 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742 or 6d7942dc2a70a7e74c352107b150265602671588. The last version doesn't boot, so I couldn't narrow it down further. I bisected the commits between 2.6.29-rc8 and 2.6.30-rc1. I hope this helps.
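A two-candidate result like this is what bisection naturally produces when a commit in the range cannot be tested. Below is a toy Python model of bisecting a linear history with untestable commits, in the spirit of `git bisect skip`; the function `bisect` and the verdict strings are hypothetical, and real git walks a commit graph and chooses replacement commits differently.

```python
from typing import Callable, List, Sequence


def bisect(commits: Sequence[str], test: Callable[[str], str]) -> List[str]:
    """Return the candidates for the first bad commit.

    commits[0] must be known good and commits[-1] known bad. test() returns
    "good", "bad" or "skip" (e.g. the kernel built from that commit won't boot).
    """
    good, bad = 0, len(commits) - 1
    skipped = set()  # indices that could not be tested
    while True:
        untested = [i for i in range(good + 1, bad) if i not in skipped]
        if not untested:
            # Everything strictly between good and bad was skipped (or nothing
            # is left), so the first bad commit is one of these.
            return list(commits[good + 1:bad + 1])
        mid = untested[len(untested) // 2]  # a testable commit near the middle
        verdict = test(commits[mid])
        if verdict == "good":
            good = mid
        elif verdict == "bad":
            bad = mid
        elif verdict == "skip":
            skipped.add(mid)
        else:
            raise ValueError(f"unknown verdict {verdict!r}")
```

With commits ["c0", "c1", "c2", "c3", "c4"], where c1 is good, c2 does not boot and c3 is bad, the model returns ["c2", "c3"], the same "either X or Y" shape as the result above. With git itself, `git bisect skip` marks the current commit as untestable, and under `git bisect run` a test script that exits with code 125 has the same effect.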
Comment 24 Rafael J. Wysocki 2009-05-27 20:45:13 UTC
On Wednesday 27 May 2009, email@example.com wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13351
>
> --- Comment #23 from firstname.lastname@example.org 2009-05-27 20:17:22 ---
> According to bisect it is either 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742 or
> 6d7942dc2a70a7e74c352107b150265602671588
> The last version doesn't boot so I couldn't nail it down further.
>
> I have bisect the commits between 2.6.29-rc8 and 2.6.30-rc1 .
>
> I hope that this helps.

Thanks for bisecting.

We have a serious regression here which appears to have been caused by x86 patches. Please help!

Best,
Rafael
Comment 25 Yinghai Lu 2009-05-27 20:59:30 UTC
Those two patches only make the range 0x1000 - 0x6000 really reserved in e820, for the low memory corruption check. Please disable low memory checking to see what happens.

[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000cfee0000 (usable)
[ 0.000000] BIOS-e820: 00000000cfee0000 - 00000000cfee2000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000cfee2000 - 00000000cfef0000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000cfef0000 - 00000000cff00000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000e4000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
...
[ 0.000000] e820 update range: 00000000cff00000 - 0000000100000000 (usable) ==> (reserved)
[ 0.000000] last_pfn = 0xcfee0 max_arch_pfn = 0x100000000
[ 0.000000] e820 update range: 0000000000001000 - 0000000000006000 (usable) ==> (reserved)
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] modified physical RAM map:
[ 0.000000] modified: 0000000000000000 - 0000000000001000 (usable)
[ 0.000000] modified: 0000000000001000 - 0000000000006000 (reserved)
[ 0.000000] modified: 0000000000006000 - 000000000009e800 (usable)
[ 0.000000] modified: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] modified: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] modified: 0000000000100000 - 00000000cfee0000 (usable)
[ 0.000000] modified: 00000000cfee0000 - 00000000cfee2000 (ACPI NVS)
[ 0.000000] modified: 00000000cfee2000 - 00000000cfef0000 (ACPI data)
[ 0.000000] modified: 00000000cfef0000 - 00000000cff00000 (reserved)
[ 0.000000] modified: 00000000e0000000 - 00000000e4000000 (reserved)
[ 0.000000] modified: 00000000fec00000 - 0000000100000000 (reserved)
[ 0.000000] modified: 0000000100000000 - 0000000130000000 (usable)
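The map change described in this comment (0x1000 - 0x6000 going from usable to reserved) can be reproduced with a small model. A minimal Python sketch, assuming entries are (start, end, type) tuples with an exclusive end, as printed in the dmesg; `update_range` is a hypothetical helper, not the kernel's e820_update_range(), and it does not merge neighbouring entries of the same type.

```python
from typing import List, Tuple

Entry = Tuple[int, int, str]  # (start, end exclusive, type)


def update_range(e820: List[Entry], start: int, end: int,
                 old_type: str, new_type: str) -> List[Entry]:
    """Retype the part of every old_type entry that overlaps [start, end)."""
    result: List[Entry] = []
    for lo, hi, typ in e820:
        # Entries of another type, or not overlapping the range, are kept as-is.
        if typ != old_type or hi <= start or lo >= end:
            result.append((lo, hi, typ))
            continue
        if lo < start:  # piece before the range keeps its old type
            result.append((lo, start, typ))
        result.append((max(lo, start), min(hi, end), new_type))
        if hi > end:    # piece after the range keeps its old type
            result.append((end, hi, typ))
    return result


def show(e820: List[Entry]) -> None:
    for lo, hi, typ in e820:
        print(f"{lo:016x} - {hi:016x} ({typ})")


if __name__ == "__main__":
    # First three entries of the BIOS-provided map from this report.
    bios = [(0x0, 0x9E800, "usable"),
            (0x9F800, 0xA0000, "reserved"),
            (0xF0000, 0x100000, "reserved")]
    show(update_range(bios, 0x1000, 0x6000, "usable", "reserved"))
```

Run as a script, it prints the same five ranges as the first five "modified:" lines of the map: the single usable entry below 0x9e800 is split into usable, reserved and usable pieces, while the already reserved entries are untouched.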
Comment 26 unggnu 2009-05-27 22:13:52 UTC
I got several low memory error messages while suspending, but I only marked as bad those revisions that actually froze my system. It was still possible to work after the low memory error messages.
Comment 27 unggnu 2009-05-28 08:02:13 UTC
OK, it really seems to be commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742. I disabled low memory checking with 2.6.30-rc7 and it doesn't help; the system still freezes. But after I manually removed commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742, suspend worked three times in a row without errors. Since low memory checking was disabled, I don't know whether the low memory message would still appear, but at least the freezing was gone.
Comment 28 Rafael J. Wysocki 2009-05-28 16:02:08 UTC
Caused by:

commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742
Author: Yinghai Lu <email@example.com>
Date: Thu Mar 12 22:36:01 2009 -0700

    x86: make e820_update_range() handle small range update

    Signed-off-by: Yinghai Lu <firstname.lastname@example.org>
    Cc: email@example.com
    LKML-Reference: <49B9F0C1.firstname.lastname@example.org>
    Signed-off-by: Ingo Molnar <email@example.com>

First-Bad-Commit : 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742
Comment 29 Yinghai Lu 2009-05-28 19:19:26 UTC
Can you post the dmesg after reverting that patch? Then we can check whether the e820 map changes or not.
Comment 30 unggnu 2009-05-29 08:28:21 UTC
Created attachment 21618: dmesg of removed e820 patch - before suspend
Sure. But when I tried to get the dmesg after suspend, the system hung again, so it is either not this patch or a combination of patches. Do you have any idea how to narrow it down further? It seems to be more random than I thought, so bisecting won't really work, especially if the cause is more than one commit.
Comment 31 unggnu 2009-05-29 08:30:29 UTC
Created attachment 21619: dmesg of removed e820 patch - after suspend
Comment 33 unggnu 2009-06-05 11:26:09 UTC
Still an issue in vanilla 2.6.30-rc8. Isn't the debug information in dmesg helpful for finding the problem?
Comment 34 Yinghai Lu 2009-06-07 23:13:31 UTC
You may still try to bisect it down, using quilt to remove suspicious commits before compiling and testing.
Comment 35 unggnu 2009-06-08 19:38:51 UTC
Do you know of any documentation for this, and of how to disable or remove a commit from the bisecting procedure?
Comment 36 H. Peter Anvin 2009-06-08 20:19:29 UTC
"git bisect skip"
Comment 37 unggnu 2009-06-29 06:07:48 UTC
It is still an issue in 2.6.31-rc1. But there has been a new development: if I compile my own kernel with the reduced .config I created for bisecting, my PC also hangs with 2.6.29 and 2.6.28, with the same ext4 error messages in dmesg. At the same time, the Ubuntu Jaunty 2.6.28 kernel and, as far as I know, the 2.6.28 and 2.6.29 mainline kernels from the Ubuntu repository don't have this problem. It may be partly ext4 related, but it happens even when my root file system isn't ext4, and that wouldn't explain the fan stopping after resume either. Maybe a function which is enabled in a normal config but disabled in 2.6.30+ is the reason for this. I am attaching my reduced .config. Maybe you have another idea of what I can do to nail it down; bisecting won't help with this.
Comment 38 unggnu 2009-06-29 06:09:35 UTC
Created attachment 22135: reduced .config from my bisecting kernel (should be version 2.6.28)
Comment 39 unggnu 2009-08-01 14:36:59 UTC
Created attachment 22567: dmesg after suspending with pm_trace - 2.6.31-rc5
I gathered the magic number and hash match according to this howto: https://wiki.ubuntu.com/DebuggingKernelSuspend
[ 0.910691] Magic number: 0:926:740
[ 0.910693] hash matches /home/kernel-ppa/mainline/build/drivers/base/power/main.c:419
Linux ubuntu-desktop 2.6.31-020631rc5-generic #020631rc5 SMP Sat Aug 1 09:04:48 UTC 2009 x86_64 GNU/Linux
I used the mainline kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc5/
In most cases the display doesn't even get a signal anymore.
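With pm_trace enabled, the kernel saves a hash of the last trace point it reached in the RTC and reports it on the next boot, as in the "Magic number" and "hash matches" lines in this comment. A minimal Python sketch that extracts those two values from a saved dmesg; `pm_trace_result` is a hypothetical helper, and its regular expressions are based only on the line formats quoted in this report.

```python
import re
from typing import Optional, Tuple

# Line formats taken from this report; other kernel versions may differ.
MAGIC_RE = re.compile(r"Magic number: (\d+:\d+:\d+)")
HASH_RE = re.compile(r"hash matches (\S+):(\d+)")


def pm_trace_result(dmesg: str) -> Tuple[Optional[str], Optional[Tuple[str, int]]]:
    """Return (magic number, (source file, line)); None for a missing part."""
    magic = MAGIC_RE.search(dmesg)
    where = HASH_RE.search(dmesg)
    return (magic.group(1) if magic else None,
            (where.group(1), int(where.group(2))) if where else None)
```

Applied to the two lines above, it yields the magic number 0:926:740 and the location drivers/base/power/main.c, line 419. Having the location as data makes it easy to compare results across several failed resumes.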
Comment 40 unggnu 2009-08-01 14:38:49 UTC
What else can I do? Bisecting obviously doesn't help to nail down the problem.
Comment 41 unggnu 2009-08-19 14:03:41 UTC
With 2.6.31, including the latest rc6, the system doesn't wake up at all anymore; at least, the display gets no signal. How can I debug this? Is this a graphics driver issue, or could it also have something to do with the hard disk driver?
Comment 42 unggnu 2009-10-01 11:30:28 UTC
Still no response after resume with 2.6.32-rc1.
Comment 43 Rafael J. Wysocki 2009-10-01 20:26:35 UTC
Thanks for the update. Unfortunately, we have no ideas about the possible root cause of the problem.
Comment 44 unggnu 2009-10-02 07:35:33 UTC
Isn't there any other procedure to find where the kernel hangs? I have already posted the magic number. I have no further ideas, but I am going to check some older mainline kernels.
Comment 45 Rafael J. Wysocki 2009-10-02 16:59:12 UTC
(In reply to comment #34)
> you may still try to bisect it down. and together with quilt to remove suspious
> commits before compile and test.
Well, in view of comment #27, I'm not sure what information we can get from that.
Comment 46 Rafael J. Wysocki 2009-10-02 17:06:53 UTC
(In reply to comment #44)
> Isn't there any other procedure how to nail down the problem where the kernel
> hangs?
Not really.
> I have already posted the magic number.
That didn't reveal any potential culprits.
> I have no additional ideas but I am going to check some older mainline
> kernels.
What happens if you run the 'core' pm_test test, i.e.
# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
Does it return to the normal state after 5 seconds (as it should), or does it behave incorrectly?
Comment 47 unggnu 2009-10-02 19:23:09 UTC
Many thanks for your support. The behavior was too sporadic at the time to nail it down with bisecting, so I guess the commit doesn't necessarily have anything to do with the problem.

All pm_tests work fine, except that with "processors" I get these error messages:
[ 380.564014] ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
[ 380.564016] ata2: irq_stat 0x00400040, connection status changed
[ 380.588014] ata3: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
[ 380.588016] ata3: irq_stat 0x00400040, connection status changed
Interestingly, just before the error messages, something like this is logged:
[ 380.548669] ata2.00: configured for UDMA/133
This is interesting because I am using only SATA, even for my CD drive, and I have disabled the IDE controller in my BIOS. In case it matters: suspend also doesn't work with the default BIOS settings.
Comment 48 Rafael J. Wysocki 2009-10-02 21:12:55 UTC
Did you try booting with init=/bin/bash and suspending in that configuration?
Comment 49 unggnu 2009-10-03 08:43:27 UTC
Thanks for your reply. I used the recovery mode some time ago and had the same problem. The same happens with init=/bin/bash.
Comment 50 Rafael J. Wysocki 2009-10-03 19:30:34 UTC
If the 'core' pm_test test succeeds, the failure to resume correctly appears to be related to the fact that control goes through the BIOS during suspend and resume. Did you try hibernation and, if you did, did it resume correctly? Also, please apply the patch at http://patchwork.kernel.org/patch/45314/ (it may require some source code surgery; please let me know if there are any problems with that), boot with acpi_sleep=s3_set_sci_en, then try to suspend to RAM and resume, and see if that works.
Comment 51 unggnu 2009-10-04 12:00:20 UTC
Thanks for your reply. No, I haven't tried hibernation because I have no swap, but I will try to test it soon. I applied the patch to 2.6.32-rc1 from the Ubuntu mainline kernel repository. With my self-compiled kernel, resume seems to work after some time, but I get the same problems I described above again: hard disk errors and so on. It is no longer possible to write anything to disk, and often the whole file system is corrupted after a restart. With my kernel, the system also seems to wake up without the option, but of course that doesn't help because of the hard disk problems. I am going to enable S1 in the BIOS and check what happens. The changelog of my BIOS mentions a Vista standby fix, so maybe I will check an older BIOS version too.
Comment 52 Rafael J. Wysocki 2009-10-04 13:58:06 UTC
The observations described in comment #51 indicate that your SATA controller doesn't resume correctly, although Intel SATA controllers are generally known to be handled correctly during suspend and resume. That, as well as your previous observations, shows that the problem is somehow related to the BIOS.
Comment 53 unggnu 2009-10-04 14:34:31 UTC
It does indeed have something to do with the BIOS. Going back to version F8 fixes the problem; even the newer kernels seem to suspend fine with it. I just don't know what the Ubuntu 2.6.28 kernel does differently that lets it work even with F9/F10.
Comment 54 Rafael J. Wysocki 2009-10-04 14:40:36 UTC
Well, I have no idea. I guess we can close this bug now?
Comment 55 unggnu 2009-10-04 15:29:29 UTC
Yes, I am closing it. I guess I have to contact Gigabyte about this issue. Many thanks for your help.