Bug 13351
Description
unggnu
2009-05-20 14:09:12 UTC
Created attachment 21450 [details]
dmesg output before suspend
Created attachment 21451 [details]
dmesg output after first suspend
Created attachment 21452 [details]
dmesg output after second suspend
Created attachment 21453 [details]
/proc/modules output after suspend
Created attachment 21454 [details]
/proc/interrupts after suspend
Created attachment 21455 [details]
/proc/cpuinfo
If anything else is needed let me know.
Is the dmesg gotten when not using "echo 1 > /sys/power/pm_trace"?
Please make sure your kernel is built with CONFIG_PM_DEBUG=y, and then run
    echo devices > /sys/power/pm_test
    echo mem > /sys/power/state
Wait for a few seconds until the laptop wakes up by itself. Does the problem still exist after resume?

(In reply to comment #7)
> is the dmesg gotten when not using "echo 1 > /sys/power/pm_trace"?
> please attach the dmesg output without this command.
> please make sure your kernel is built with CONFIG_PM_DEBUG=y,
> and then run
> echo devices > /sys/power/pm_test
> echo mem > /sys/power/state
> wait for a few seconds until the laptop wakeup itself,
> does the problem still exist after resume?
Please attach the dmesg output after resume if the problem still exists.

The first dmesg output before suspend is without the command.
> please attach the dmesg output without this command
This could be hard: when something finally shows up on the screen after a while, I can't save anything because all partitions are mounted read-only. I couldn't even shut down normally.
I will try to provide the other information soon.

Created attachment 21465 [details]
dmesg output after running echoing pm_test and state
> please make sure your kernel is built with CONFIG_PM_DEBUG=y,
> and then run
> echo devices > /sys/power/pm_test
> echo mem > /sys/power/state
> wait for a few seconds until the laptop wakeup itself,
> does the problem still exist after resume?
I haven't checked the CONFIG status, but I guess it isn't needed, because with the power state command the screen only went dark for a few seconds and the PC didn't go to sleep. Anyway, here is the dmesg output.
Created attachment 21466 [details]
dmesg output without any proc changes after resume
Created attachment 21467 [details]
Image of the kernel freeze
I switched to a console because I couldn't unlock the screen saver after the faulty resume. I connected my USB stick and saved the dmesg output on it after resume. After hitting Ctrl+Alt+Del the kernel froze and I got this output.
Anyway, it looks like I have to recompile the kernel with CONFIG_PM_DEBUG=y, or is this enough?
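
A note on the CONFIG_PM_DEBUG question: one way to answer it without recompiling is to look the option up in the config file the running kernel was built from. The helper below is only a minimal sketch under stated assumptions: the function name has_kconfig is made up for illustration, and the location of the config file (/boot/config-$(uname -r) on Ubuntu-style installs, or /proc/config.gz when the kernel exposes it) differs between distributions. If /sys/power/pm_test exists at all, that is probably already a hint that the option is enabled, but checking the config removes the guesswork.

```shell
# has_kconfig OPTION FILE
# Succeeds if OPTION is set to "y" in a plain-text kernel config FILE.
# (Hypothetical helper name; not part of any kernel tooling.)
has_kconfig() {
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "$1=y" "$2"
}

# Example usage (paths are assumptions, adjust for your system):
#   has_kconfig CONFIG_PM_DEBUG "/boot/config-$(uname -r)" && echo "PM debugging is enabled"
#   zcat /proc/config.gz > /tmp/running.config && has_kconfig CONFIG_PM_DEBUG /tmp/running.config
```

A line such as "# CONFIG_PM_DEBUG is not set" does not match, so the helper reports the option as disabled in that case.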

If you run
    echo freezer > /sys/power/pm_test
    echo mem > /sys/power/state
there are no error messages of this kind, right?

Created attachment 21482 [details]
dmesg after freezer pm-test
I don't get the error, but it doesn't go to sleep either. To recheck, I restarted the system and used pm-suspend instead of "echo mem > /sys/power/state", but it still doesn't go to sleep. The display blanks for a few seconds and then everything returns.
Do I still need to recompile the kernel, or do you have a clue what the reason could be?

No, that's the right behavior.
Hmm, what if you run "echo shutdown > /sys/power/disk" before entering S3?
Note that I mean S3, so you don't need to run "echo blabla > /sys/power/pm_test" this time.

If I do this, I directly get the freeze messages after resume, containing many aes_crypt and ext4 issues. Btw. I am using dm-crypt with LUKS, if this is relevant. I couldn't save anything because of this.
Thanks for trying to nail this down.

I have tried rc7 and it seems to work fine after two suspends without any proc changes. As soon as I have done some more tests I will mark this issue as resolved, if no problems appear again.

I don't know why it worked, but I wasn't able to get it working again afterwards. Most of the time I directly get the kernel error messages.
What else is needed to nail down the problem?

As 2.6.29 worked and 2.6.30-rc fails, can you bisect what commit between those two causes the failure?

Wow - this will be time consuming. I guess I need ccache and have to build my own kernel for this PC to save time. I will report back if I find something.

According to bisect it is either 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742 or 6d7942dc2a70a7e74c352107b150265602671588.
The last version doesn't boot, so I couldn't nail it down further.
I have bisected the commits between 2.6.29-rc8 and 2.6.30-rc1.
I hope that this helps.

On Wednesday 27 May 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13351
>
> --- Comment #23 from unggnu@googlemail.com 2009-05-27 20:17:22 ---
> According to bisect it is either 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742 or
> 6d7942dc2a70a7e74c352107b150265602671588
> The last version doesn't boot so I couldn't nail it down further.
>
> I have bisect the commits between 2.6.29-rc8 and 2.6.30-rc1 .
>
> I hope that this helps.
Thanks for bisecting.
We have a serious regression here which appears to have been caused by x86 patches.
Please help!
Best,
Rafael

Those two patches only make the range 0x1000 - 0x6000 really reserved in e820, according to low memory corruption...
Please disable low memory checking to see what will happen.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000cfee0000 (usable)
[ 0.000000] BIOS-e820: 00000000cfee0000 - 00000000cfee2000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000cfee2000 - 00000000cfef0000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000cfef0000 - 00000000cff00000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000e4000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
...
[ 0.000000] e820 update range: 00000000cff00000 - 0000000100000000 (usable) ==> (reserved)
[ 0.000000] last_pfn = 0xcfee0 max_arch_pfn = 0x100000000
[ 0.000000] e820 update range: 0000000000001000 - 0000000000006000 (usable) ==> (reserved)
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] modified physical RAM map:
[ 0.000000] modified: 0000000000000000 - 0000000000001000 (usable)
[ 0.000000] modified: 0000000000001000 - 0000000000006000 (reserved)
[ 0.000000] modified: 0000000000006000 - 000000000009e800 (usable)
[ 0.000000] modified: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] modified: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] modified: 0000000000100000 - 00000000cfee0000 (usable)
[ 0.000000] modified: 00000000cfee0000 - 00000000cfee2000 (ACPI NVS)
[ 0.000000] modified: 00000000cfee2000 - 00000000cfef0000 (ACPI data)
[ 0.000000] modified: 00000000cfef0000 - 00000000cff00000 (reserved)
[ 0.000000] modified: 00000000e0000000 - 00000000e4000000 (reserved)
[ 0.000000] modified: 00000000fec00000 - 0000000100000000 (reserved)
[ 0.000000] modified: 0000000100000000 - 0000000130000000 (usable)

I got several low memory error messages while suspending, but I only marked those revisions as bad that actually froze my system. It was still possible to work after the low memory error message.

Ok, it really seems to be commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742.
I have disabled low memory checking with 2.6.30-rc7 and it doesn't help. The system still freezes, but after I removed commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742 manually, suspend worked three times in a row without errors. Since the low memory checking was disabled I don't know if this message would still appear, but at least the freezing was gone.

Caused by:

commit 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742
Author: Yinghai Lu <yinghai@kernel.org>
Date: Thu Mar 12 22:36:01 2009 -0700

    x86: make e820_update_range() handle small range update

    Signed-off-by: Yinghai Lu <yinghai@kernel.org>
    Cc: jbeulich@novell.com
    LKML-Reference: <49B9F0C1.10402@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

First-Bad-Commit : 78a8b35bc7abf8b8333d6f625e08c0f7cc1c3742

Can you post the dmesg after reverting that patch? We can check whether the e820 map got changed or not.

Created attachment 21618 [details]
dmesg of removed e820 patch - before suspend
Sure, but while trying to get the dmesg after suspend the system hung again, so it is either not this patch or a combination of several patches. Do you have any idea how to nail it down further? It seems to be more random than I thought, so bisecting won't really work, especially if the source is more than one commit.
Created attachment 21619 [details]
dmesg of removed e820 patch - after suspend
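
To check whether the e820 map changed between the kernel with the patch and the one without it, it may help to compare only the e820 lines of the two dmesg outputs. The following is a rough sketch, assuming both logs were saved to plain-text files; the function names e820_lines and e820_diff and the file names in the usage comment are hypothetical, chosen only for illustration.

```shell
# e820_lines FILE
# Print the e820-related lines of a saved dmesg, with the leading
# "[ timestamp]" stripped so that logs from two boots can be compared.
e820_lines() {
    grep -E '(BIOS-e820|modified): [0-9a-f]{16} - |e820 update range:' "$1" |
        sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

# e820_diff FILE_A FILE_B
# Show a unified diff of the e820 lines of two saved dmesg files.
# Returns 0 if the maps are identical, 1 if they differ.
e820_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    e820_lines "$1" > "$a"
    e820_lines "$2" > "$b"
    diff -u "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Example usage (file names are placeholders):
#   e820_diff dmesg-with-patch.txt dmesg-patch-removed.txt
```

An empty diff would suggest that removing the commit did not change the memory map on this machine, which in turn would make it less likely that the e820 change alone explains the freeze.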

Handled-By : Yinghai Lu <yinghai@kernel.org>

Still an issue in vanilla 2.6.30-rc8.
Isn't the debug information in dmesg helping to find the problem?

You may still try to bisect it down, together with quilt to remove suspicious commits before compiling and testing.

Do you know a doc for this, and how to disable/remove a commit from the bisecting procedure?

"git bisect skip"

It is still an issue in 2.6.31-rc1. But there has been a new development. If I compile my own kernel with the reduced .config which I created for bisecting, my PC also hangs with 2.6.29 and 2.6.28, with the ext4 dmesg error messages. But at the same time the Ubuntu Jaunty 2.6.28 kernel and, afaik, the mainline 2.6.28 and 2.6.29 kernels from the Ubuntu repository don't have this problem.
Maybe it is partly ext4 related, but it happened even when my root file system wasn't ext4. It also wouldn't explain the fan stopping after resume.
Maybe a function which is activated in a normal config but disabled in 2.6.30+ is the reason for this. I am attaching my reduced .config. Maybe you have another idea what I can do to nail it down. Bisecting wouldn't help with this.

Created attachment 22135 [details]
reduced .config from my bisecting kernel (should be version 2.6.28)

Created attachment 22567 [details]
dmesg after suspending with pm_trace - 2.6.31-rc5
I have gathered the magic number and hash match according to this howto: https://wiki.ubuntu.com/DebuggingKernelSuspend
[ 0.910691] Magic number: 0:926:740
[ 0.910693] hash matches /home/kernel-ppa/mainline/build/drivers/base/power/main.c:419
Linux ubuntu-desktop 2.6.31-020631rc5-generic #020631rc5 SMP Sat Aug 1 09:04:48 UTC 2009 x86_64 GNU/Linux
I have used the mainline kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc5/
In most cases even the display gets no signal anymore.
What else can I do? Bisecting obviously doesn't help to nail down the problem.

With 2.6.31, even the latest rc6, the system doesn't wake up at all anymore. At least the display gets no signal. How can I debug this? Is this a graphics driver issue, or can it also have something to do with the hard disk driver?

Still no response after resume with 2.6.32-rc1.

Thanks for the update.
Unfortunately, we have no ideas about the possible root cause of the problem.

Isn't there any other procedure to nail down where the kernel hangs? I have already posted the magic number.
I have no additional ideas, but I am going to check some older mainline kernels.

(In reply to comment #34)
> you may still try to bisect it down. and together with quilt to remove
> suspicious
> commits before compile and test.
Well, in the face of comment #27, I'm not sure about what information we can get from that.

(In reply to comment #44)
> Isn't there any other procedure how to nail down the problem where the kernel
> hangs?
Not really.
> I have already posted the magic number.
That didn't reveal any potential culprits.
> I have no additional ideas but I am going to check some older mainline
> kernels.
What happens if you run the 'core' pm_test test, i.e.
# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
Does it return to the normal state after 5 seconds (as it should) or does it behave incorrectly?
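
For reference, the pm_test levels can be walked through one at a time, from the shallowest (freezer) to the deepest (core), saving dmesg after each step. The sketch below deliberately only prints the command sequence so that it can be reviewed before anything is run. The function names pm_test_cmds and pm_test_plan and the dmesg-LEVEL.txt file names are made up for illustration; the level names are the ones listed in the kernel's basic-pm-debugging documentation. Actually running the printed commands needs root and a kernel with CONFIG_PM_DEBUG=y.

```shell
# pm_test_cmds LEVEL
# Print the two commands that run a suspend test at the given pm_test level.
pm_test_cmds() {
    case "$1" in
        freezer|devices|platform|processors|core) ;;
        *) echo "unknown pm_test level: $1" >&2; return 1 ;;
    esac
    printf 'echo %s > /sys/power/pm_test\n' "$1"
    printf 'echo mem > /sys/power/state\n'
}

# pm_test_plan
# Print the commands for all levels, shallowest first, saving dmesg after
# each one and switching test mode off again at the end.
pm_test_plan() {
    for level in freezer devices platform processors core; do
        pm_test_cmds "$level" || return 1
        printf 'dmesg > dmesg-%s.txt\n' "$level"
    done
    printf 'echo none > /sys/power/pm_test\n'
}

# Example usage:
#   pm_test_plan          # review the commands
#   pm_test_plan | sh     # run them (as root)
```

The first level at which the machine fails to come back after about 5 seconds would be the one worth reporting together with the saved dmesg.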

Many thanks for your support.
The behavior was too sporadic at the time to nail it down by bisecting, so I guess the commit doesn't necessarily have anything to do with the problem.
All pm_tests work fine, except that in the case of "processors" I get these error messages:
[ 380.564014] ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
[ 380.564016] ata2: irq_stat 0x00400040, connection status changed
[ 380.588014] ata3: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
[ 380.588016] ata3: irq_stat 0x00400040, connection status changed
An interesting thing is that before the error messages something like this is mentioned:
"[ 380.548669] ata2.00: configured for UDMA/133"
It is interesting because I am using only SATA, even for my CD drive, and I have disabled the IDE controller in my BIOS.
Just in case: suspend also doesn't work if I use the default BIOS settings.

Did you try booting with init=/bin/bash and suspending in that configuration?

Thanks for your reply. I used the recovery mode some time ago and had the same problem. The same happens with init=/bin/bash.

If the 'core' pm_test test succeeds, the failure to resume correctly appears to be related to the fact that control goes through the BIOS during suspend and resume.
Did you try hibernation and, if you did, did it resume correctly?
Also, please apply the patch at http://patchwork.kernel.org/patch/45314/ (it may require some source code surgery, please let me know if there are any problems with that), boot with acpi_sleep=s3_set_sci_en, try to suspend (to RAM)/resume and see if that works.

Thanks for your reply.
No, I haven't tried hibernation because I have no swap. But I will try to test it soon.
I have applied the patch to 2.6.32-rc1 from the Ubuntu mainline kernel repository. With my self-compiled kernel, resume seems to work after some time, but I again got the same problems I have described above, hard disk errors and so on.
It is not possible to write anything to disk anymore, and often the whole file system is corrupted after restart.
The system also seems to wake up with my kernel without the option. Of course it doesn't help because of the hard disk problems.

I am going to activate S1 in the BIOS and check what happens. In the changelog of my BIOS a Vista standby fix is mentioned, so maybe I will check an old BIOS version too.

The observations described in comment #51 indicate that your SATA controller doesn't resume correctly, although Intel SATA controllers are generally known to be handled correctly during suspend and resume.
That, as well as your previous observations, shows that the problem is somehow related to the BIOS.

It has indeed something to do with the BIOS. Changing back to version F8 fixes the problem. Even the newer kernels seem to suspend fine with it.
I just don't know what the Ubuntu kernel 2.6.28 does differently to work even with F9/10.

Well, I have no idea.
I guess we can close this bug now?

Yes, I am closing it. I guess I have to contact Gigabyte about this issue.
Many thanks for your help.