Bug 10923 (resumeturnsoff)

Summary: System turns off during wake from S3, Wake Regression from 2.6.25 (SMP related)
Product: ACPI Reporter: Dionisus Torimens (djtm)
Component: Power-Sleep-Wake Assignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, bunk, pavel, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26rc8 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: dmesg after the system resumed
dmesg before the system resumed (2.6.25.6 as well)
2.6.25.6 kernel config
git bisect log
git bisect log (second try)
.config which triggers the bug
.config which does not trigger the bug (smp)
.config diff between triggering bug and not(smp)
config 2.6.26-rc8 non-SMP (resume does not work)
config 2.6.26-rc8 SMP (does resume)
diff between both configs
sleep.sh
Debug patch #1
Debug patch #2
Fix patch

Description Dionisus Torimens 2008-06-15 16:24:02 UTC
Latest working kernel version: 2.6.25.6
Earliest failing kernel version: 2.6.26rc6 (not tested in between)
Distribution: Ubuntu 8.04
Hardware Environment: Acer Extensa 5220
Software Environment: KDE/single user mode
Problem Description: When the system is put into standby mode and then resumed, it seems to try to resume for a while but then turns off completely before anything can be seen.
Steps to reproduce: Suspend System, Resume system.

Check here for my boot logs with 2.6.26rc6:
http://bugzilla.kernel.org/show_bug.cgi?id=10914

Dmesg with 2.6.25.6 will be attached for comparison.
Comment 1 Dionisus Torimens 2008-06-15 16:26:23 UTC
Created attachment 16495 [details]
dmesg after the system resumed

The first lines are lost, I will attach the dmesg before the resume below.
Comment 2 Dionisus Torimens 2008-06-15 16:26:54 UTC
Created attachment 16496 [details]
dmesg before the system resumed (2.6.25.6 as well)
Comment 3 Dionisus Torimens 2008-06-15 16:44:27 UTC
Created attachment 16497 [details]
2.6.25.6 kernel config
Comment 4 ykzhao 2008-06-15 17:55:43 UTC
Hi, Dionisus
   Will you please use git-bisect to identify which commit caused the regression?
   Will you also please kill acpid and see whether the problem still exists? (If acpid isn't running, please kill whichever process is using /proc/acpi/event. The command "lsof /proc/acpi/event" will show the process ID.)
   Thanks.
Comment 5 Zhang Rui 2008-06-15 23:59:48 UTC
(In reply to comment #1)
> Created an attachment (id=16495) [details]
> dmesg after the system resumed
> 
There is no message showing that the system is shutting down,
and it seems that the system resumed successfully.
Could you please remove as many modules as you can before suspending and see whether some device is turning off the computer?
Comment 6 Dionisus Torimens 2008-06-16 00:05:23 UTC
Hi ykzhao,

thanks for helping to fix this problem.

I don't think I will have time for a bisect anytime soon.

No process shows up in lsof /proc/acpi/event,
as I've tested in single user mode. I rechecked that.

It's not a regular shutdown, but a sudden one, I can hear that from the
sound of the hard disk shutting down suddenly.

*Notice*: The dmesg log I appended above is from 2.6.*25*
The log of 2.6.26 is in the bug I linked in my first post.

Check here for my boot logs with 2.6.*26 rc6*:
--> http://bugzilla.kernel.org/show_bug.cgi?id=10914
Comment 7 Dionisus Torimens 2008-06-16 00:06:36 UTC
2.6.26 rc6:
http://bugzilla.kernel.org/show_bug.cgi?id=10914#c1
Comment 8 Dionisus Torimens 2008-06-16 01:19:54 UTC
Hi Zhang Rui,

I've tried to remove all modules before suspend but it did not change anything. Please look here http://bugzilla.kernel.org/show_bug.cgi?id=10914 for my dmesg and other log files.
Comment 9 Rafael J. Wysocki 2008-06-16 05:54:52 UTC
This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 10 Dionisus Torimens 2008-06-18 06:55:51 UTC
I've tried upgrading the BIOS, but now the situation is even worse. I can't resume from S3 anymore at all. The video driver is not initialized anymore under 2.6.25.6. 

2.6.26rc6 still produces very weird behavior in resume:

 * weirdly blinking power LED (it should go on and off, but only glows dimly and flickers)
 * after some time the system is either reset or shut down. sometimes it hangs completely
 * if the system shuts off during resume, the hard disk goes off with a clearly audible "click".
Comment 11 Dionisus Torimens 2008-06-18 21:25:04 UTC
Does anyone have any idea how this bug could have appeared?

I'm happy to check out certain patches or things, just don't have time for a bisect at the moment.

And we do want this fixed before the release, right?
Comment 12 Dionisus Torimens 2008-06-19 01:32:44 UTC
I tried the hash matches trick several times to find the place in 2.6.25.7:

217-[    4.725935] PM: Adding info for No Bus:network_throughput
218-[    4.726083]   Magic number: 0:512:775
219:[    4.726085]   hash matches drivers/base/power/main.c:228
220:[    4.726087]   hash matches device psaux

185-[    4.727604]   Magic number: 0:512:478
186:[    4.727656]   hash matches drivers/base/power/main.c:228

218-[    4.715548]   Magic number: 0:512:775
219:[    4.715550]   hash matches drivers/base/power/main.c:228
220:[    4.715552]   hash matches device psaux
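
The hash matches above come from the kernel's pm_trace facility, which is typically armed by writing 1 to /sys/power/pm_trace before suspending and read back from dmesg after the next boot. A minimal sketch, in Python, of pulling the candidate names out of a saved dmesg log; the function name hash_matches is hypothetical, and the matches are only candidates because several names can share one hash:

```python
import re

# Matches both "hash matches <file:line>" and "hash matches device <name>".
HASH_MATCH = re.compile(r"hash matches (?:device )?(\S+)")


def hash_matches(dmesg_text):
    """Return the names on 'hash matches' lines, in order, without duplicates."""
    seen = []
    for line in dmesg_text.splitlines():
        m = HASH_MATCH.search(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen
```

Fed the log lines quoted in this comment, it would list drivers/base/power/main.c:228 and psaux.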
Comment 13 Dionisus Torimens 2008-06-19 01:35:18 UTC
this resolves to 
	if (!error && dev->class && dev->class->resume) {
		dev_dbg(dev,"class resume\n");
		error = dev->class->resume(dev);
	}

	up(&dev->sem);

-->	TRACE_RESUME(error);
	return error;
}

What does psaux mean? Isn't that the mouse/keyboard?
Comment 14 Dionisus Torimens 2008-06-19 03:20:29 UTC
Fixed the issue with 2.6.25. The problem was caused by the hard disk encryption and showed up after booting with init=/bin/bash.

The Problem with 2.6.26 is still the same, though. Even with hard disk encryption disabled. 

I've managed to get hash matches (pm_trace) for it, but that does not help me much:
hash matches device ttyt4
hash matches device device:2e
next try:
hash matches device ttyxa
hash matches device tty36

What's device:2e? Does this tell anyone anything? Does anyone have an idea what else I can try?
Comment 15 Dionisus Torimens 2008-06-20 21:59:24 UTC
Latest working 2.6.25.7.
Latest non-working 2.6.26rc7.
Comment 16 Zhang Rui 2008-06-22 19:57:57 UTC
please run "cat /sys/bus/acpi/devices/device:2e/path".
please check if reverting commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a helps.
If not, it would be great if you can run git bisect to narrow down the problem to a specific commit.
Comment 17 Dionisus Torimens 2008-06-23 21:42:26 UTC
I tried to revert these
http://lkml.org/lkml/2008/6/11/423
from here
ftp://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/patches/release/2.6.26/acpi-release-20080321-2.6.26-rc5.diff.gz
which includes the one you named, but without success.

Device 2e is SATA. I will post the complete content.

Interestingly it only exists under that name in 2.6.26, in .25 there is no device 2e.
Comment 18 Dionisus Torimens 2008-06-23 21:45:23 UTC
I have a new hash match btw:
[    6.725318]   Magic number: 12:785:818
[    6.725418]   hash matches device ptypf

And I'm doing the tests with init=/bin/bash now.
Comment 19 Dionisus Torimens 2008-06-24 01:59:57 UTC
The device 2e path is "\_SB_.PCI0.SATA". But I think that was at the time I had the resume issue because the BIOS didn't properly unlock my hard drive after the suggested BIOS upgrade.
Comment 20 Dionisus Torimens 2008-06-25 00:16:31 UTC
Created attachment 16614 [details]
git bisect log

Git couldn't tell me which patch is the bad one. I had to skip a few because the kernel would not compile (the first ones) or link (the last one).
Comment 21 Dionisus Torimens 2008-06-25 00:18:29 UTC
There are only 'skip'ped commit left to test.
The first bad commit could be any of:
e44b7b7525ad9d43163ab5e60c784325419e0ea6
77ad386e596c6b0930cc2e09e3cce485e3ee7f72
We cannot bisect more!
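
The "could be any of" result follows from how skipped commits limit a bisection. A simplified sketch in Python (not git's actual algorithm; the function name and commit names are hypothetical) of bisecting a linear history in which some commits cannot be built:

```python
def bisect_with_skips(commits, test):
    """Narrow down the first bad commit in an oldest-to-newest list.

    commits[0] is assumed good and commits[-1] is assumed bad.
    test(commit) returns "good", "bad" or "skip" (e.g. it fails to build).
    Returns every commit that could still be the first bad one.
    """
    lo, hi = 0, len(commits) - 1  # newest known-good, oldest known-bad
    skipped = set()
    while True:
        # Commits between the bounds that have not been tried yet.
        untested = [i for i in range(lo + 1, hi) if i not in skipped]
        if not untested:
            break
        mid = untested[len(untested) // 2]
        verdict = test(commits[mid])
        if verdict == "good":
            lo = mid
        elif verdict == "bad":
            hi = mid
        else:
            skipped.add(mid)
    # Skipped commits next to the oldest bad one cannot be ruled out.
    return commits[lo + 1 : hi + 1]
```

With a history a, b, c, d, e in which c cannot be built and d is the first commit that tests bad, the sketch returns both c and d, which mirrors the two-candidate result reported here.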
Comment 22 Dionisus Torimens 2008-06-25 01:49:57 UTC
While trying
Bisecting: 1 revisions left to test after this
[e44b7b7525ad9d43163ab5e60c784325419e0ea6] x86: move suspend wakeup code to C

I get this linking error:

  LD      init/built-in.o
  LD      .tmp_vmlinux1
arch/x86/kernel/built-in.o: In function `acpi_save_state_mem':
(.text+0x10059): undefined reference to `setup_trampoline'
make: *** [.tmp_vmlinux1] Error 1
Command exited with non-zero status 2

It is a very suspicious patch though, isn't it? Could someone help me verify it? I'll upload the second bisect log.
Comment 23 Dionisus Torimens 2008-06-25 01:58:17 UTC
Created attachment 16617 [details]
git bisect log (second try)
Comment 24 Dionisus Torimens 2008-06-25 02:55:58 UTC
I tried compiling with SMP on and I could resume without problems even with the patch applied. Maybe it's only triggered when compiled without SMP. I'll try to verify that...
Comment 25 Dionisus Torimens 2008-06-25 03:09:37 UTC
Hi,
it would be great if someone could help me fix this before 2.6.26,
Thanks.
Comment 26 Dionisus Torimens 2008-06-25 04:52:35 UTC
CONFIRMED expectation:
The problem *disappears* in an smp compile with otherwise 100% equal options.

This means it must, with 99.9% probability, be caused by
[e44b7b7525ad9d43163ab5e60c784325419e0ea6] x86: move suspend wakeup code to C.

But only if SMP is *not* used. That was a tricky one...

Now please remove the NEEDINFO and assign this bug to someone. 
Let's try to fix this before the release! 

Thanks!
Comment 27 Dionisus Torimens 2008-06-25 05:37:55 UTC
Created attachment 16618 [details]
.config which triggers the bug
Comment 28 Dionisus Torimens 2008-06-25 05:38:18 UTC
Created attachment 16619 [details]
.config which does not trigger the bug (smp)
Comment 29 Dionisus Torimens 2008-06-25 05:40:34 UTC
Created attachment 16620 [details]
.config diff between triggering bug and not(smp)
Comment 30 Pavel Machek 2008-06-25 06:45:35 UTC
32-bit or 64-bit system?

I believe I did check UP/32-bit operation, but I'm not sure if I checked UP/64-bit config.
Comment 31 Dionisus Torimens 2008-06-26 03:02:47 UTC
64-bit.
Comment 32 Dionisus Torimens 2008-06-26 18:08:11 UTC
I think I've supplied all the information requested, and I'm watching the bug, so would someone please remove the NEEDINFO tag?
If there's anything else I can do please let me know.
Thanks.
Comment 33 Len Brown 2008-06-26 18:46:39 UTC
> commit e44b7b7525ad9d43163ab5e60c784325419e0ea6
> Author: Pavel Machek <pavel@suse.cz>
> Date:   Thu Apr 10 23:28:10 2008 +0200
>
>    x86: move suspend wakeup code to C

caused a different regression.
please try out the patch in bug 10927
Comment 34 Dionisus Torimens 2008-06-26 19:25:36 UTC
I already tried out the patch from http://bugzilla.kernel.org/show_bug.cgi?id=10927#c91 . It didn't work for me.
Comment 35 Pavel Machek 2008-06-27 03:00:11 UTC
(removing needinfo).

I'll try to reproduce it here.
Comment 36 Pavel Machek 2008-06-27 03:55:18 UTC
...I can reproduce it, 64-bit system with UP configuration.
Comment 37 Dionisus Torimens 2008-06-27 05:23:39 UTC
So any 64-bit UP configuration is likely to be affected?
Comment 38 Rafael J. Wysocki 2008-06-27 06:19:11 UTC
This almost certainly is related to the fact that we use the trampoline for the wake-up on 64-bit.  I'll have a look at it later today.
Comment 39 Rafael J. Wysocki 2008-06-27 08:41:34 UTC
(In reply to comment #27)
> Created an attachment (id=16618) [details]
> .config which triggers the bug

How did you generate this .config?

There should be CONFIG_X86_TRAMPOLINE=y in it for things to work.
Comment 40 Rafael J. Wysocki 2008-06-27 09:08:28 UTC
(In reply to comment #36)
> ...I can reproduce it, 64-bit system with UP configuration.

Can you attach the failing config, please?
Comment 41 Pavel Machek 2008-06-27 13:00:39 UTC
On Fri 2008-06-27 09:08:28, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10923
> 
> ------- Comment #40 from rjw@sisk.pl  2008-06-27 09:08 -------
> (In reply to comment #36)
> > ...I can reproduce it, 64-bit system with UP configuration.
> 
> Can you attach the failing config, please?

Relevant part is:

# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
...
CONFIG_X86_TRAMPOLINE=y
Comment 42 Rafael J. Wysocki 2008-06-27 14:17:20 UTC
Well, I'm not sure you're seeing the same problem at all.

Without CONFIG_X86_TRAMPOLINE=y the 64-bit UP resume won't work.  It should work with it, though.

Does the SMP kernel work for you on the same box?
Comment 43 Dionisus Torimens 2008-06-27 17:54:09 UTC
I generated the config with make menuconfig. Before that, it was changed a bit during bisecting.

And yes, as said above, the same kernel with the same config compiled on the same machine works, if SMP is set. The two configs are attached, as is the diff between the configs.

I will try manually setting the option. CONFIG_X86_TRAMPOLINE=y
Comment 44 Dionisus Torimens 2008-06-27 17:59:12 UTC
No, the trampoline does not change anything. I will attach new .configs.
Comment 45 Dionisus Torimens 2008-06-27 18:08:15 UTC
Created attachment 16645 [details]
config 2.6.26-rc8 non-SMP (resume does not work)
Comment 46 Dionisus Torimens 2008-06-27 18:09:45 UTC
Created attachment 16646 [details]
config 2.6.26-rc8 SMP (does resume)
Comment 47 Dionisus Torimens 2008-06-27 18:10:22 UTC
Created attachment 16647 [details]
diff between both configs
Comment 48 Dionisus Torimens 2008-06-27 18:12:52 UTC
The other configs were taken at the end of bisecting at the point where the above mentioned patch was applied. I now uploaded the configs from 2.6.26-rc8 and the diff between SMP and not SMP.
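
For comparing two kernel configs such as the UP and SMP ones attached here, a minimal Python sketch that reports only the options whose values differ. The helper names parse_config and config_diff are hypothetical, and the sketch assumes the usual .config line forms "CONFIG_X=value" and "# CONFIG_X is not set":

```python
NOT_SET = " is not set"


def parse_config(text):
    """Map option name to value; '# CONFIG_X is not set' is recorded as 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET):
            opts[line[2:-len(NOT_SET)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def config_diff(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    An option missing from one file shows up as None on that side.
    """
    a, b = parse_config(a_text), parse_config(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Options that depend on CONFIG_SMP would be expected to appear in the result alongside CONFIG_SMP itself.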
Comment 49 Rafael J. Wysocki 2008-06-29 13:25:32 UTC
I can't reproduce it with a UP kernel, so that must be something nontrivial.
Comment 50 Rafael J. Wysocki 2008-06-29 13:28:06 UTC
It would be helpful to determine the point at which it breaks.

Dionisus, what happens if you pass acpi_sleep=s3_beep to the kernel on the command line, both in the SMP (i.e. working) and the UP (i.e. failing) cases?
Comment 51 Dionisus Torimens 2008-06-29 15:12:20 UTC
Dear Rafael,
It didn't change anything in either kernel. One still resumes, the other doesn't.

But that's no wonder to me, as the beep is unfortunately known not to work in Linux on my laptop. I guess I should file a bug report about that.

Anything else I can do? I'll try to find a way around the beep issue, but I have little hope at the moment.
Comment 52 Rafael J. Wysocki 2008-06-29 15:18:34 UTC
What exactly do you do to suspend the system?
Comment 53 Dionisus Torimens 2008-06-29 15:32:47 UTC
Created attachment 16657 [details]
sleep.sh

I run this script as init=/sleep.sh

I've tried Ubuntu's sleep script as well, but it gives me less information. This script allowed me to see and fix the errors the kernel was reporting during resume (hard drive not unlocked on resume) after I upgraded my BIOS.
Comment 54 Rafael J. Wysocki 2008-06-30 02:52:10 UTC
Created attachment 16662 [details]
Debug patch #1

With this patch applied, the box should just hang on an attempt to resume (ie. right after pressing the power button).

Please check if that happens.
Comment 55 Dionisus Torimens 2008-06-30 06:41:25 UTC
Affirmative. With this patch applied to the SMP kernel, it does not resume anymore.

(The behaviour is slightly different than the unpatched rc8 UP, though: It does not turn itself off or reboot but just hangs.)
Comment 56 Rafael J. Wysocki 2008-06-30 10:13:49 UTC
Can you apply this patch to the _failing_ kernel (ie. the UP one) and see if that changes its behavior?
Comment 57 Dionisus Torimens 2008-06-30 18:58:04 UTC
It does change the behavior of the UP kernel as well. It behaves just like the SMP kernel with the patch: it does not shut down or restart but just hangs during resume.
Comment 58 Dionisus Torimens 2008-07-02 04:18:05 UTC
Anything else I can do? I guess there's little chance to get this fixed before the release of 2.6.26?
Comment 59 Rafael J. Wysocki 2008-07-02 04:25:31 UTC
Created attachment 16692 [details]
Debug patch #2

Sorry, I was distracted by some other issues.

Please apply this patch instead of the previous one and see if the behavior of the failing (i.e. UP) kernel is the same as with debug patch #1.
Comment 60 Rafael J. Wysocki 2008-07-02 04:26:52 UTC
(In reply to comment #58)
> I guess there's little chance to get this fixed before the release of 2.6.26?
>

Well, we don't really know what the root cause of the problem is yet ...
Comment 61 Dionisus Torimens 2008-07-02 17:02:31 UTC
As far as I can tell the behavior is the same. It gets stuck as well. But I can't be 100% sure if it doesn't go a little further...

Testing these is really easy, though. If you like I can test several at once. Just let me know if there's anything else I can do to help.
Comment 62 Rafael J. Wysocki 2008-07-03 04:15:23 UTC
(In reply to comment #61)
> As far as I can tell the behavior is the same. It gets stuck as well. But I
> can't be 100% sure if it doesn't go a little further...

If it hangs instead of powering off (which is what happens without the patch, if I understood your report correctly), it gets to the piece of code modified by the patch.  Which is good, BTW, because it means that the jump to the trampoline code actually works.

> Testing these is really easy, though. If you like I can test several at once.
> Just let me know if there's anything else I can do to help.

Okay, I'll try to prepare a series of patches for you to test.
Comment 63 Dionisus Torimens 2008-07-03 05:14:11 UTC
> If it hangs instead of powering off (which is what happens without the patch,
> if I understood your report correctly),
Correct. Instead of powering off or rebooting, it hangs and does nothing.

> Okay, I'll try to prepare a series of patches for you to test.
Great! If you tell me what command to insert where in order to find the problematic point, I can try to do it, maybe save you a couple patches.
Comment 64 Rafael J. Wysocki 2008-07-03 15:56:42 UTC
Okay, please add something like

1:
    hlt
    jmp 1b

right before the comment "/* Enable PAE mode and PGE */" in arch/x86/kernel/head_64.S .

If the box hangs instead of powering off, please move that right before the comment starting with "/* Finally jump to run C code and to be on real kernel address" and see what happens.

If, again, the box hangs instead of powering off, put something like the above into arch/x86/kernel/acpi/wakeup_64.S between the lines

	jne	bogus_64_magic

	movw	$__KERNEL_DS, %ax

and see what happens.
Comment 65 Dionisus Torimens 2008-07-03 18:22:08 UTC
> right before the comment "/* Enable PAE mode and PGE */" in
> arch/x86/kernel/head_64.S .
it hangs instead of booting at all.
Comment 66 Dionisus Torimens 2008-07-03 18:37:55 UTC
(In reply to comment #64)
> right before the
> comment starting with "/* Finally jump to run C code and to be on real kernel
> address" and see what happens.
It also hangs before booting.
Comment 67 Dionisus Torimens 2008-07-03 18:47:53 UTC
"        jne     bogus_64_magic

1:
        hlt
        jmp 1b

        movw    $__KERNEL_DS, %ax"

does not change anything. It reboots/shuts down instead of resuming.
Comment 68 Dionisus Torimens 2008-07-03 19:01:03 UTC
I don't know if this helps you, but if I comment out these two lines:

// #ifdef CONFIG_SMP
        addq    %rbp, trampoline_level4_pgt + 0(%rip)
        addq    %rbp, trampoline_level4_pgt + (511*8)(%rip)
// #endif

the system resumes pretty much normally.
Comment 69 Rafael J. Wysocki 2008-07-04 14:28:07 UTC
(In reply to comment #68)
> I don't know if this helps you,

Yes, it does, thanks a lot!
Comment 70 Rafael J. Wysocki 2008-07-04 14:33:48 UTC
Created attachment 16741 [details]
Fix patch

Please test the attached patch that should fix the problem.
Comment 71 Dionisus Torimens 2008-07-04 17:33:33 UTC
Yes, that works! :)

I was just skipping by the place and I thought: worth a try. :)

Thanks! I'm glad we could fix this before the release :)
Comment 72 Rafael J. Wysocki 2008-07-05 12:57:56 UTC
Regressions list annotation:
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch : http://marc.info/?l=linux-kernel&m=121520913609798&w=4
Comment 73 Pavel Machek 2008-07-05 15:50:45 UTC
Acked-by: Pavel Machek <pavel@suse.cz>

(And thanks for handling that. I still wonder... in comment #49 you said it works for you on UP...?)
Comment 74 Rafael J. Wysocki 2008-07-05 15:55:19 UTC
Yes, it did work for me.  My kernel is not relocatable, though, while yours probably is.
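
One plausible reading of comments #68 and #74, assuming %rbp in the quoted addq lines holds the offset between the address the kernel was linked for and the address it was actually loaded at: the trampoline page-table entries need that offset added, and when the offset is zero the addition is a no-op, so skipping it goes unnoticed. A minimal Python sketch of that arithmetic; the entry value, the offsets and the name relocate_entry are hypothetical:

```python
PAGE_SIZE = 0x1000
FLAGS_MASK = PAGE_SIZE - 1  # low 12 bits of an entry carry flags, not address


def relocate_entry(entry, load_delta):
    """Add the load offset to the physical address held in a page-table entry.

    load_delta is assumed to be page-aligned, so the flag bits stay untouched.
    """
    if load_delta % PAGE_SIZE:
        raise ValueError("load offset must be page-aligned")
    return entry + load_delta


# Hypothetical link-time entry: next-level table at 0x203000, flags 0x067.
link_time_entry = 0x203000 | 0x067

# Kernel loaded where it was linked: the fixup changes nothing.
same_place = relocate_entry(link_time_entry, 0)

# Kernel loaded 16 MiB higher: without the fixup the entry would still
# point at the old, now wrong, physical address.
moved = relocate_entry(link_time_entry, 0x1000000)
```

Under these assumptions same_place equals link_time_entry, while moved points 16 MiB higher with the same flag bits.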
Comment 75 Adrian Bunk 2008-07-06 00:33:17 UTC
fixed by commit 64e83b5a919a65eb35b63dd7e07c188379ff8ce6