Bug 10923 (resumeturnsoff)
Summary: | System turns off during wake from S3, Wake Regression from 2.6.25 (SMP related) | ||
---|---|---|---|
Product: | ACPI | Reporter: | Dionisus Torimens (djtm) |
Component: | Power-Sleep-Wake | Assignee: | Rafael J. Wysocki (rjw) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | acpi-bugzilla, bunk, pavel, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26rc8 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 10492 | ||
Attachments: |
dmesg after the system resumed
dmesg before the system resumed (2.6.25.6 as well)
2.6.25.6 kernel config
git bisect log
git bisect log (second try)
.config which triggers the bug
.config which does not trigger the bug (smp)
.config diff between triggering bug and not(smp)
config 2.6.26-rc8 non-SMP (resume does not work)
config 2.6.26-rc8 SMP (does resume)
diff between both configs
sleep.sh
Debug patch #1
Debug patch #2
Fix patch |
Description
Dionisus Torimens
2008-06-15 16:24:02 UTC
Created attachment 16495 [details]
dmesg after the system resumed
The first lines are lost, I will attach the dmesg before the resume below.
Created attachment 16496 [details]
dmesg before the system resumed (2.6.25.6 as well)
Created attachment 16497 [details]
2.6.25.6 kernel config
Hi Dionisus, could you please use git-bisect to identify which commit caused the regression? Could you also kill acpid and see whether the problem still exists? (If acpid is not running, please kill whichever process is using /proc/acpi/event. The command "lsof /proc/acpi/event" shows its process id.) Thanks.

(In reply to comment #1)
> Created an attachment (id=16495) [details]
> dmesg after the system resumed
>
There is no message showing that the system is shutting down, and it seems that the system resumed successfully. Could you please remove as many modules as you can before suspend and see whether some device is turning off the computer?

Hi ykzhao, thanks for helping to fix this problem. I don't think I will have time for a bisect anytime soon. lsof /proc/acpi/event shows no process, as I tested in single user mode. I rechecked that. It's not a regular shutdown, but a sudden one; I can hear that from the sound of the hard disk shutting down suddenly.
*Notice*: The dmesg log I appended above is from 2.6.*25*. The log of 2.6.26 is in the bug I linked in my first post. Check here for my boot logs with 2.6.*26 rc6*:
--> http://bugzilla.kernel.org/show_bug.cgi?id=10914

Hi Zhang Rui, I've tried to remove all modules before suspend but it did not change anything. Please look here http://bugzilla.kernel.org/show_bug.cgi?id=10914 for my dmesg and other log files.

This entry is being used for tracking a regression from 2.6.25. Please don't close it until the problem is fixed in the mainline.

I've tried upgrading the BIOS, but now the situation is even worse. I can't resume from S3 at all anymore. The video driver is no longer initialized under 2.6.25.6. 2.6.26rc6 still produces very weird behavior on resume:
* weirdly blinking power LED (should go on and off, but only comes on dimly and flickers)
* after some time the system is either reset or shut down; sometimes it hangs completely
* if the system shuts off during resume, the hard disk goes off with a clearly audible "click".
Does anyone have any idea how this bug could have appeared? I'm happy to check out certain patches or things, I just don't have time for a bisect at the moment. And we do want this fixed before the release, right?

I tried the hash matches trick several times to find the place in 2.6.25.7:
217-[ 4.725935] PM: Adding info for No Bus:network_throughput
218-[ 4.726083] Magic number: 0:512:775
219:[ 4.726085] hash matches drivers/base/power/main.c:228
220:[ 4.726087] hash matches device psaux

185-[ 4.727604] Magic number: 0:512:478
186:[ 4.727656] hash matches drivers/base/power/main.c:228

218-[ 4.715548] Magic number: 0:512:775
219:[ 4.715550] hash matches drivers/base/power/main.c:228
220:[ 4.715552] hash matches device psaux

This resolves to

        if (!error && dev->class && dev->class->resume) {
                dev_dbg(dev,"class resume\n");
                error = dev->class->resume(dev);
        }
        up(&dev->sem);
-->     TRACE_RESUME(error);
        return error;
}

What does psaux mean? Isn't that the mouse/keyboard?

Fixed the issue with 2.6.25. The problem was caused by the hard disk encryption and showed up after booting with init=/bin/bash. The problem with 2.6.26 is still the same, though, even with hard disk encryption disabled. I've managed to get hash matches (pm_trace) for it, but that does not help me much:
hash matches device ttyt4
hash matches device device:2e
next try:
hash matches device ttyxa
hash matches device tty36
What's device:2e? Does this tell anyone anything? Does someone have an idea what else I can try? Latest working: 2.6.25.7. Latest non-working: 2.6.26rc7.

Please run "cat /sys/bus/acpi/devices/device:2e/path".

Please check if reverting commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a helps. If not, it would be great if you could run git bisect to narrow down the problem to a specific commit.
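
For readers who want to pull the pm_trace results out of a saved dmesg, as in the grep output quoted above, here is a minimal Python sketch. It assumes the log lines look like the ones pasted in this report (a bracketed timestamp followed by "hash matches ..."); the function name and the sample text are illustrative only and are not part of any kernel tool.

```python
import re

# Matches dmesg lines such as "[    4.726087] hash matches device psaux".
# The optional "device " group distinguishes device names from file:line hits.
HASH_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+hash matches\s+(device\s+)?(\S+)")


def hash_matches(dmesg_text):
    """Return (kind, name) pairs for every 'hash matches' line in the log."""
    results = []
    for line in dmesg_text.splitlines():
        m = HASH_LINE.search(line)
        if m:
            kind = "device" if m.group(2) else "source"
            results.append((kind, m.group(3)))
    return results


# Sample copied from the report; any grep line-number prefix is ignored.
sample = """\
218-[ 4.726083] Magic number: 0:512:775
219:[ 4.726085] hash matches drivers/base/power/main.c:228
220:[ 4.726087] hash matches device psaux
"""

if __name__ == "__main__":
    for kind, name in hash_matches(sample):
        print(kind, name)
```

A "source" hit names the file and line of the last trace point reached before the failure, and a "device" hit names a device whose hash matches; as the differing results in this thread suggest, device matches can be coincidental.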
I tried to revert these http://lkml.org/lkml/2008/6/11/423 from here ftp://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/patches/release/2.6.26/acpi-release-20080321-2.6.26-rc5.diff.gz which includes the one you named, but without success. Device 2e is SATA. I will post the complete content. Interestingly, it only exists under that name in 2.6.26; in .25 there is no device 2e. I have a new hash match, btw:
[ 6.725318] Magic number: 12:785:818
[ 6.725418] hash matches device ptypf
And I'm doing the tests with init=/bin/bash now.

The Device 2e path is "\_SB_.PCI0.SATA". But I think that was at the time I had the resume issue because the BIOS didn't properly unlock my hard drive after the suggested BIOS upgrade.

Created attachment 16614 [details]
git bisect log
Git couldn't tell me which patch is the bad one. I had to skip a few because the kernel would not compile (the first ones) or link (the last one).

There are only 'skip'ped commit left to test.
The first bad commit could be any of:
e44b7b7525ad9d43163ab5e60c784325419e0ea6
77ad386e596c6b0930cc2e09e3cce485e3ee7f72
We cannot bisect more!

While trying

Bisecting: 1 revisions left to test after this
[e44b7b7525ad9d43163ab5e60c784325419e0ea6] x86: move suspend wakeup code to C

I get this linking error:

  LD init/built-in.o
  LD .tmp_vmlinux1
arch/x86/kernel/built-in.o: In function `acpi_save_state_mem':
(.text+0x10059): undefined reference to `setup_trampoline'
make: *** [.tmp_vmlinux1] Error 1
Command exited with non-zero status 2

It is a very suspicious patch though, isn't it? Could someone help me verify it? I'll upload the second bisect log.

Created attachment 16617 [details]
git bisect log (second try)
I tried compiling with SMP on and I could resume without problems even with the patch applied. Maybe it's only triggered when compiled without SMP. I'll try to verify that...

Hi, it would be great if someone could help me fix this before 2.6.26. Thanks.

CONFIRMED expectation: The problem *disappears* in an SMP compile with otherwise 100% equal options. This means that, with 99.9% probability, it is caused by [e44b7b7525ad9d43163ab5e60c784325419e0ea6] x86: move suspend wakeup code to C, but only if SMP is *not* used. That was a tricky one... Now please remove the NEEDINFO and assign this bug to someone. Let's try to fix this before the release! Thanks!

Created attachment 16618 [details]
.config which triggers the bug
Created attachment 16619 [details]
.config which does not trigger the bug (smp)
Created attachment 16620 [details]
.config diff between triggering bug and not(smp)
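
A comparison like the attached config diff can be reproduced with a short script. The Python sketch below is only an illustration: the function names are made up, the two sample configs are tiny hypothetical fragments rather than the attached files, and an option that is absent from a file is treated the same as "is not set", which is a simplifying assumption.

```python
def parse_config(text):
    """Map option name -> value; '# CONFIG_X is not set' is stored as 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def config_diff(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    pa, pb = parse_config(a), parse_config(b)
    return {
        key: (pa.get(key, "n"), pb.get(key, "n"))
        for key in sorted(set(pa) | set(pb))
        if pa.get(key, "n") != pb.get(key, "n")
    }


# Hypothetical fragments standing in for the UP and SMP configs.
up_config = "# CONFIG_SMP is not set\nCONFIG_X86_PC=y\nCONFIG_X86_TRAMPOLINE=y\n"
smp_config = "CONFIG_SMP=y\nCONFIG_X86_PC=y\nCONFIG_X86_TRAMPOLINE=y\n"

if __name__ == "__main__":
    for option, (up_value, smp_value) in config_diff(up_config, smp_config).items():
        print(f"{option}: UP={up_value} SMP={smp_value}")
```

With real files, the two strings would be read from the failing and the working .config.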
32-bit or 64-bit system? I believe I did check UP/32-bit operation, but I'm not sure if I checked UP/64-bit config.

64-bit. I think I've supplied all information requested, and I'm watching the bug, so would someone please remove the NEEDINFO tag. If there's anything else I can do please let me know. Thanks.

> commit e44b7b7525ad9d43163ab5e60c784325419e0ea6
> Author: Pavel Machek <pavel@suse.cz>
> Date: Thu Apr 10 23:28:10 2008 +0200
>
> x86: move suspend wakeup code to C
caused a different regression. Please try out the patch in bug 10927.

I already tried out the patch from http://bugzilla.kernel.org/show_bug.cgi?id=10927#c91 . It didn't work for me.

(removing needinfo). I'll try to reproduce it here.

...I can reproduce it, 64-bit system with UP configuration.

So any 64-bit UP configuration is likely to be affected?

This almost certainly is related to the fact that we use the trampoline for the wake-up on 64-bit. I'll have a look at it later today.

(In reply to comment #27)
> Created an attachment (id=16618) [details]
> .config which triggers the bug
How did you generate this .config? There should be CONFIG_X86_TRAMPOLINE=y in it for things to work.

(In reply to comment #36)
> ...I can reproduce it, 64-bit system with UP configuration.
Can you attach the failing config, please?

On Fri 2008-06-27 09:08:28, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10923
>
> ------- Comment #40 from rjw@sisk.pl 2008-06-27 09:08 -------
> (In reply to comment #36)
> > ...I can reproduce it, 64-bit system with UP configuration.
>
> Can you attach the failing config, please?
Relevant part is:
# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
...
CONFIG_X86_TRAMPOLINE=y

Well, I'm not sure you're seeing the same problem at all. Without CONFIG_X86_TRAMPOLINE=y the 64-bit UP resume won't work. It should work with it, though. Does the SMP kernel work for you on the same box?

I generated the config with make menuconfig. Before that it was changed a bit during bisecting. And yes, as said above, the same kernel with the same config compiled on the same machine works if SMP is set. The two configs are attached, as is the diff between the configs. I will try manually setting the option CONFIG_X86_TRAMPOLINE=y.

No, the trampoline does not change anything. I will attach new .configs.

Created attachment 16645 [details]
config 2.6.26-rc8 non-SMP (resume does not work)
Created attachment 16646 [details]
config 2.6.26-rc8 SMP (does resume)
Created attachment 16647 [details]
diff between both configs
The other configs were taken at the end of bisecting, at the point where the above-mentioned patch was applied. I have now uploaded the configs from 2.6.26-rc8 and the diff between SMP and non-SMP.

I can't reproduce it with a UP kernel, so that must be something nontrivial. It would be helpful to determine the point at which it breaks. Dionisus, what happens if you pass acpi_sleep=s3_beep to the kernel in the command line, both in the SMP, ie. working, and the UP, ie. failing, cases?

Dear Rafael, it didn't change anything in either kernel. One still resumes, the other doesn't. But that's no wonder to me, as the beep is unfortunately known not to work in Linux on my laptop. I guess I should file a bug report about that. Anything else I can do? I'll try to find a way around the beep issue, but I have little hope at the moment.

What exactly do you do to suspend the system?

Created attachment 16657 [details]
sleep.sh
I run this script as init=/sleep.sh
I've tried Ubuntu's sleep script as well. But it gives me less information. This script allowed me to see and fix the errors the kernel was reporting during resume (hard drive not unlocked on resume) after I upgraded my bios.
Created attachment 16662 [details]
Debug patch #1
With this patch applied, the box should just hang on an attempt to resume (ie. right after pressing the power button).
Please check if that happens.
Affirmative. With this patch applied to the SMP kernel it does not resume anymore. (The behaviour is slightly different from the unpatched rc8 UP, though: it does not turn itself off or reboot but just hangs.)

Can you apply this patch to the _failing_ kernel (ie. the UP one) and see if that changes its behavior?

It does change the behavior of the UP kernel as well. It behaves just like the SMP kernel with the patch: it does not shut down or restart but just hangs during resume. Anything else I can do? I guess there's little chance to get this fixed before the release of 2.6.26?

Created attachment 16692 [details]
Debug patch #2
Sorry, I was distracted by some other issues.
Please apply this patch instead of the previous one and see if the behavior of the failing (ie. UP) kernel is the same as with debug patch #1.
(In reply to comment #58)
> I guess there's little chance to get this fixed before the release of 2.6.26?
>
Well, we don't really know what the root cause of the problem is yet ...

As far as I can tell the behavior is the same. It gets stuck as well. But I can't be 100% sure if it doesn't go a little further... Testing these is really easy, though. If you like I can test several at once. Just let me know if there's anything else I can do to help.

(In reply to comment #61)
> As far as I can tell the behavior is the same. It gets stuck as well. But I
> can't be 100% sure if it doesn't go a little further...
If it hangs instead of powering off (which is what happens without the patch, if I understood your report correctly), it gets to the piece of code modified by the patch. Which is good, BTW, because it means that the jump to the trampoline code actually works.
> Testing these is really easy, though. If you like I can test several at once.
> Just let me know if there's anything else I can do to help.
Okay, I'll try to prepare a series of patches for you to test.

> If it hangs instead of powering off (which is what happens without the patch,
> if I understood your report correctly),
Correct. Instead of powering off or rebooting it hangs and does nothing.
> Okay, I'll try to prepare a series of patches for you to test.
Great! If you tell me what command to insert where in order to find the problematic point, I can try to do it, maybe save you a couple of patches.

Okay, please add something like

1:      hlt
        jmp 1b

right before the comment "/* Enable PAE mode and PGE */" in arch/x86/kernel/head_64.S . If the box hangs instead of powering off, please move that right before the comment starting with "/* Finally jump to run C code and to be on real kernel address" and see what happens. If, again, the box hangs instead of powering off, put something like the above into arch/x86/kernel/acpi/wakeup_64.S between the lines

        jne bogus_64_magic
        movw $__KERNEL_DS, %ax

and see what happens.
> right before the comment "/* Enable PAE mode and PGE */" in
> arch/x86/kernel/head_64.S .
it hangs instead of booting at all.
(In reply to comment #64)
> right before the
> comment starting with "/* Finally jump to run C code and to be on real kernel
> address" and see what happens.
It also hangs before booting.

"
        jne bogus_64_magic
1:      hlt
        jmp 1b
        movw $__KERNEL_DS, %ax
"
does not change anything. It reboots/shuts down instead of resuming.

I don't know if this helps you, but if I comment out these two lines:

// #ifdef CONFIG_SMP
        addq %rbp, trampoline_level4_pgt + 0(%rip)
        addq %rbp, trampoline_level4_pgt + (511*8)(%rip)
// #endif

the system resumes pretty much normally.

(In reply to comment #68)
> I don't know if this helps you,
Yes, it does, thanks a lot!

Created attachment 16741 [details]
Fix patch
Please test the attached patch that should fix the problem.
Yes, that works! :) I was just skipping by the place and I thought: worth a try. :) Thanks! I'm glad we could fix this before the release :)

Regressions list annotation:
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch : http://marc.info/?l=linux-kernel&m=121520913609798&w=4

Acked-by: Pavel Machek <pavel@suse.cz>
(And thanks for handling that. I still wonder... in comment #49 you said it works for you on UP...?)

Yes, it did work for me. My kernel is not relocatable, though, while yours probably is.

fixed by commit 64e83b5a919a65eb35b63dd7e07c188379ff8ce6 |
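
The closing remark about relocatable kernels can be illustrated with a toy model of the two addq instructions quoted earlier in the thread, which add %rbp to entries 0 and 511 of trampoline_level4_pgt. The Python sketch below is not kernel code. It assumes that %rbp holds the difference between the address the kernel actually runs at and the address it was linked for (zero for a kernel that was not relocated), and all addresses, flag values and function names in it are hypothetical.

```python
ENTRIES = 512        # entries in a top-level x86-64 page table
FLAGS_MASK = 0xFFF   # low 12 bits of an entry carry flags, not address bits


def fixup_pgt(pgt, delta, indices=(0, 511)):
    """Return a copy of pgt with delta added to the selected entries,
    mimicking the two 'addq %rbp, trampoline_level4_pgt + ...' lines."""
    fixed = list(pgt)
    for index in indices:
        fixed[index] += delta
    return fixed


def target(entry):
    """Physical address an entry points at, with the flag bits stripped."""
    return entry & ~FLAGS_MASK


# Hypothetical numbers: next-level table linked at 0x202000 with flags 0x067,
# and a kernel loaded 16 MiB above the address it was linked for.
LINK_ADDR, FLAGS, DELTA = 0x202000, 0x067, 0x1000000

pgt = [0] * ENTRIES
pgt[0] = pgt[511] = LINK_ADDR | FLAGS

relocated = fixup_pgt(pgt, DELTA)

if __name__ == "__main__":
    print("without fixup:", hex(target(pgt[0])))
    print("with fixup:   ", hex(target(relocated[0])))
```

In this model a zero delta leaves the table unchanged, so skipping the fixup is harmless for a kernel running at its link address, while a relocated kernel is left with entries pointing at the old location. That is consistent with the same UP configuration resuming on one machine and not on the other, though the thread itself does not spell out the mechanism.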